Paper deep dive
P^2O: Joint Policy and Prompt Optimization
Xinyu Lu, Kaiqi Zhang, Jinglin Yang, Boxi Cao, Yaojie Lu, Hongyu Lin, Min He, Xianpei Han, Le Sun
Intelligence
Summary
P^2O is a novel framework that synergizes Prompt Optimization with Policy Optimization to address the exploration bottleneck in Reinforcement Learning with Verifiable Rewards (RLVR). By identifying 'hard samples' and using the Genetic-Pareto (GEPA) algorithm to evolve prompt templates, P^2O guides models toward successful trajectories and uses context distillation to internalize these reasoning gains directly into model parameters, improving performance on complex reasoning tasks.
Entities (5)
Relation Signals (3)
P^2O → improves → RLVR
confidence 95% · P2O addresses the exploration bottleneck in RLVR
P^2O → utilizes → GEPA
confidence 95% · P2O identifies hard samples during training iterations and leverages the Genetic-Pareto (GEPA) prompt optimization algorithm
P^2O → implements → Context Distillation
confidence 90% · we employ Context Distillation (Snell et al., 2022) to transfer the reasoning capabilities
Cypher Suggestions (2)
Identify techniques used by P^2O to improve model performance. · confidence 95% · unvalidated
MATCH (p:Framework {name: 'P^2O'})-[:IMPLEMENTS|UTILIZES]->(t) RETURN t.name, labels(t)
Find all frameworks that utilize the GEPA algorithm. · confidence 90% · unvalidated
MATCH (f:Framework)-[:UTILIZES]->(a:Algorithm {name: 'GEPA'}) RETURN f.name
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). However, vanilla RLVR suffers from inefficient exploration, particularly when confronting "hard samples" that yield near-zero success rates. In such scenarios, the reliance on sparse outcome rewards typically results in zero-advantage estimates, effectively starving the model of supervision signals despite the high informational value of these instances. To address this, we propose P^2O, a novel framework that synergizes Prompt Optimization with Policy Optimization. P^2O identifies hard samples during training iterations and leverages the Genetic-Pareto (GEPA) prompt optimization algorithm to evolve prompt templates that guide the model toward discovering successful trajectories. Crucially, unlike traditional prompt engineering methods that rely on input augmentation, P^2O distills the reasoning gains induced by these optimized prompts directly into the model parameters. This mechanism provides denser positive supervision signals for hard samples and accelerates convergence. Extensive experiments demonstrate that P^2O not only achieves superior performance on in-distribution datasets but also exhibits strong generalization, yielding substantial improvements on out-of-distribution benchmarks (+4.7% avg.).
Tags
Links
- Source: https://arxiv.org/abs/2603.21877v1
- Canonical: https://arxiv.org/abs/2603.21877v1
Full Text
P2O: Joint Policy and Prompt Optimization

Xinyu Lu, Kaiqi Zhang, Jinglin Yang, Boxi Cao, Yaojie Lu, Hongyu Lin, Min He, Xianpei Han, Le Sun

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). However, vanilla RLVR suffers from inefficient exploration, particularly when confronting "hard samples" that yield near-zero success rates. In such scenarios, the reliance on sparse outcome rewards typically results in zero-advantage estimates, effectively starving the model of supervision signals despite the high informational value of these instances. To address this, we propose P2O, a novel framework that synergizes Prompt Optimization with Policy Optimization. P2O identifies hard samples during training iterations and leverages the Genetic-Pareto (GEPA) prompt optimization algorithm to evolve prompt templates that guide the model toward discovering successful trajectories. Crucially, unlike traditional prompt engineering methods that rely on input augmentation, P2O distills the reasoning gains induced by these optimized prompts directly into the model parameters. This mechanism provides denser positive supervision signals for hard samples and accelerates convergence. Extensive experiments demonstrate that P2O not only achieves superior performance on in-distribution datasets but also exhibits strong generalization, yielding substantial improvements on out-of-distribution benchmarks (+4.7% avg.).

Keywords: Reinforcement Learning, Prompt Optimization, RLVR, ICML

1 Introduction

Large Language Models (LLMs) have achieved remarkable proficiency across a spectrum of complex reasoning tasks (Jaech et al., 2024; Guo et al., 2025). A defining characteristic of these tasks is the existence of objective verification mechanisms capable of providing deterministic feedback.
Capitalizing on this, Reinforcement Learning with Verifiable Rewards (RLVR) (Lambert et al., 2024) has emerged as a dominant paradigm for reasoning alignment. By leveraging outcome-based supervision, RLVR enables models to surpass the limitations of imitation learning, facilitating autonomous exploration of the solution space to optimize reasoning trajectories aligned with rigorous verification signals (Liu et al., 2025a).

Despite its promise, the performance of RLVR is severely constrained when the training data intermixes "simple samples" with "hard samples": specifically, instances where the model achieves near-zero success rates across K rollouts. For hard samples, the model fails to capture reliable gradient feedback due to exploration difficulties and sparse rewards (Song et al., 2025). Consequently, the optimization process relies heavily on the accessible signals from simple samples. This causes the model to overfit to simple instances, trapping it in a suboptimal policy where it excels at trivial tasks but fails to resolve complex reasoning problems.

Figure 1: Conceptual illustration of the P2O framework. Standard policy optimization often gets trapped in local optima (π_init) due to sparse rewards on hard samples. P2O bridges this exploration gap using optimized prompts (the red arrow) to reach high-reward regions that are inaccessible via standard exploration. Subsequently, the model consolidates these gains (the white arrows) by updating its parameters to master the new region (π_opt), effectively internalizing the prompt-induced capabilities.

Common approaches to mitigate exploration challenges include curriculum learning strategies, which progressively introduce harder samples as training advances (Blakeman et al., 2025), and various reward shaping techniques that provide intermediate feedback.
However, curriculum learning requires heuristic-based and computationally expensive schedule generation (Bercovich et al., 2025), while reward shaping demands domain-specific expert heuristics.

The recent success of prompt optimization methods (Fernando et al., 2023; Opsahl-Ong et al., 2024; Yuksekgonul et al., 2025) offers a compelling way to break this stalemate. These methods demonstrate that even when a model fails to solve a hard problem under a standard policy, a carefully evolved prompt can often elicit the correct reasoning path. This implies that the solution lies within the model's latent search space but is inaccessible via standard gradient ascent (Zelikman et al., 2022). As illustrated in Figure 1, we visualize this mechanism as the red arrow: optimized prompts act as a bridge, enabling the model to "jump" out of the local optimum (π_init) and cross the reward-sparse valley. However, relying solely on inference-time prompts is insufficient; the ultimate goal is to internalize these capabilities. Therefore, the "jump" must be followed by consolidation, represented by the white arrows ascending the second peak, where the model parameters are updated to master the new region (π_opt).

Building on this intuition, we propose P2O (Joint Policy and Prompt Optimization), a novel framework that synergizes adaptive prompt evolution with reinforcement learning to overcome the hard-sample bottleneck. Specifically, P2O identifies challenging instances during training and employs the Genetic-Pareto (GEPA) (Agrawal et al., 2025) prompt optimization algorithm to evolve prompts that elicit successful reasoning chains. Rather than relying on inference-time prompting, we utilize these improved trajectories as supervision signals, enabling the policy to internalize reasoning patterns directly into its parameters.
This creates a virtuous cycle: as the model improves, previously hard samples become tractable, and GEPA focuses its efforts on the new frontier of difficulty. Finally, to verify the effectiveness of our method, we evaluate P2O on representative reasoning benchmarks, demonstrating its superior performance compared to baselines.

The contributions of this paper are summarized as follows:

- We propose P2O, a joint optimization framework that alternates between policy updates and prompt evolution. This approach addresses the exploration bottleneck in RLVR by leveraging optimized prompts to discover valid trajectories for hard samples.
- We introduce a context distillation strategy that allows the model to learn from prompt-guided trajectories without relying on them at inference. By computing gradients on the original input, the model internalizes the reasoning patterns elicited by the prompts.
- We demonstrate through experiments on six benchmarks that P2O consistently outperforms GRPO baselines, showing particularly strong improvements on hard reasoning tasks such as AIME (+12.3% avg.).

2 Preliminaries

2.1 Policy Optimization

Problem Definition. In the context of LLMs, policy optimization is formulated as a Reinforcement Learning (RL) problem where the LLM acts as a policy π_θ parameterized by weights θ. Given a dataset of queries D = {x}, the policy generates a response y consisting of a sequence of tokens. The quality of the generated response is evaluated by a reward function r(x, y), which may be derived from ground-truth verification (e.g., in reasoning tasks) or a preference model.

Optimization Objective. The goal of policy optimization is to maximize the expected reward while ensuring the updated policy does not deviate excessively from a reference policy π_ref.
The standard objective function is:

J_RL(θ) = E_{x∼D, y∼π_θ(·|x)} [ r(x, y) − β · KL(π_θ(·|x) ‖ π_ref(·|x)) ]

where β is a coefficient controlling the KL-divergence penalty to maintain training stability.

Group Relative Policy Optimization (GRPO). Traditional RL methods like PPO (Schulman et al., 2017) require a separate value function to estimate the baseline for advantage computation, which doubles the memory overhead. GRPO (Shao et al., 2024) eliminates the critic by employing a group-based baseline. For each query x, GRPO samples a group of K outputs {y_1, y_2, …, y_K} from the current policy. The advantage for the i-th output is computed by normalizing its reward against the group statistics:

A_i = (r(x, y_i) − μ_group) / σ_group

where μ_group and σ_group are the mean and standard deviation of the rewards within the group. GRPO updates the policy by maximizing a surrogate objective based on these relative advantages, significantly improving training efficiency for reasoning tasks.

2.2 Prompt Optimization

Problem Definition. Prompt optimization treats the LLM as a frozen black-box function M. The optimization variable is the discrete prompt z from the space P of all possible natural language strings. Given an input x, the model generates an output ŷ = M(z, x). The problem is to find an optimal prompt z* that maximizes a metric S(ŷ, y) (e.g., accuracy or F1 score) over the validation dataset D_val.

Optimization Objective. This is a discrete, derivative-free optimization problem. The objective is formally defined as:

z* = argmax_z E_{(x,y)∼D_val} [ S(M(z, x), y) ]    (1)

Unlike policy optimization, which adjusts continuous weights θ, prompt optimization searches the discrete semantic space of instructions to elicit better capabilities from the fixed model.

GEPA (Genetic-Pareto).
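Before turning to GEPA, the group-relative advantage of Section 2.1 is simple enough to sketch directly; the snippet below is an illustration, not the authors' implementation:

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize each rollout's reward by the
    mean and (population) std of its own group -- no learned critic."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A mixed group yields informative, zero-mean advantages:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
# But a "hard sample" group (all rollouts fail) gives all-zero
# advantages, so no gradient signal survives:
print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))  # → [0.0, 0.0, 0.0, 0.0]
```

The second, degenerate case is exactly the vanishing-advantage bottleneck that motivates P2O.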
To solve this optimization problem efficiently, we consider GEPA (Agrawal et al., 2025), a state-of-the-art evolutionary framework. GEPA conceptualizes prompt optimization as a genetic process driven by language-based reflection. Instead of random mutations, it employs a "Reflection LLM" to analyze error traces from the current prompt's performance and generate targeted semantic mutations that address specific failure modes. Furthermore, GEPA maintains a population of diverse prompts using a Pareto-based selection mechanism, optimizing for multiple objectives (e.g., accuracy and conciseness) simultaneously. This allows the framework to navigate the discrete prompt space effectively, escaping local optima that trap conventional prompt optimization methods.

3 P2O: Joint Policy and Prompt Optimization

3.1 Overview of P2O Framework

Figure 2: Overview of the P2O framework. The training process is formulated as an alternating maximization procedure between two phases: (1) Policy Optimization with Context Distillation, where the policy π_θ is updated to internalize reasoning patterns elicited by augmented inputs x̃; and (2) Evolutionary Prompt Optimization, where the prompt template set Z is evolved using GEPA to discover successful trajectories for the remaining hard samples (D_hard).

We propose P2O, a framework designed to overcome the exploration bottleneck in RLVR. Standard RL methods often fail on "hard samples" (instances where the current policy yields zero successful trajectories), leading to vanishing gradients. P2O addresses this by treating the prompt template as a dynamic latent variable that is jointly optimized with the policy parameters.
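As a bird's-eye view, the alternating procedure can be condensed into a toy loop. Every name and constant below is a hypothetical stand-in (the "policy" is just a per-question success probability, and GEPA is replaced by the trivial rule "every question that stayed hard gets an evolved prompt next epoch"), not the authors' code:

```python
import random

def rollout(p):
    """One toy rollout: success with probability p."""
    return 1.0 if random.random() < p else 0.0

def p2o_train(skill, epochs=3, K=6, tau=0.05, prompt_boost=1.0, lr=0.3):
    """Skeleton of P2O's alternating maximization. skill[x] plays the
    role of pi_theta's unprompted success rate on question x."""
    templates = set()  # hard questions currently holding an evolved prompt
    for _ in range(epochs):
        next_hard = set()
        # Phase 1: rollouts (prompt-augmented for hard samples) + update
        for x, p in list(skill.items()):
            # an evolved prompt boosts success; prompt_boost=1.0 means the
            # toy prompt fully unlocks the question
            p_eff = min(1.0, p + prompt_boost) if x in templates else p
            rewards = [rollout(p_eff) for _ in range(K)]
            if sum(rewards) / K < tau:       # Eq. (4)-style hardness test
                next_hard.add(x)
            # "context distillation" stand-in: prompted successes raise
            # the UNPROMPTED skill, i.e. gains are written into the policy
            skill[x] = min(1.0, p + lr * sum(rewards) / K)
        # Phase 2: "evolve" prompts for whatever remained hard
        templates = next_hard
    return skill

random.seed(0)
final = p2o_train({"easy": 0.9, "hard": 0.0})
```

In this caricature the "hard" question produces zero reward in epoch 1, receives a prompt in phase 2, and its unprompted skill rises in later epochs, mirroring the jump-then-consolidate picture of Figure 1.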
As illustrated in Figure 2, P2O operates as an alternating maximization procedure comprising two distinct phases: (1) Policy Optimization with Context Distillation, where the model internalizes the reasoning capabilities elicited by optimized prompts; and (2) Evolutionary Prompt Optimization, where a genetic algorithm discovers new prompts to crack the remaining hard samples.

3.2 Problem Formulation

We build upon the RLVR setting defined in Section 2.1. The standard objective is to maximize the expected reward J_RL(θ) = E_{x∼D, y∼π_θ}[r(x, y)]. However, in complex reasoning tasks, the solution space is vast and the reward signal is sparse.

The Exploration Bottleneck. We formally define a subset of the data, D_hard ⊂ D, as "hard samples" on which the current policy π_θ fails to discover almost any successful trajectory within a reasonable computational budget. Specifically, for x ∈ D_hard, the expected reward is nearly zero: E_{y∼π_θ(·|x)}[r(x, y)] ≈ 0. Under standard policy gradient methods, the gradient for these samples vanishes:

∇_θ J(x) = E_{y∼π_θ}[ (r(x, y) − b) · ∇_θ log π_θ(y|x) ] ≈ 0    (2)

where b is a baseline function (e.g., a value function V(x) or a moving average) used for variance reduction. Since both the realized reward r and the baseline b approach zero for hard samples, the model ceases to learn from D_hard, leading to sample inefficiency and sub-optimal convergence.

Prompt as a Latent Variable. To tackle this, we introduce a set of auxiliary prompt templates Z as discrete latent variables. We posit that for a hard sample x ∈ D_hard, while the direct mapping x → y* is inaccessible to the current policy, there exists a latent instruction z such that the augmented input x̃ = T(x, z) (where T is a template insertion function) significantly increases the density of successful trajectories. Therefore, we reformulate the optimization problem.
Instead of optimizing θ solely on the raw input x, we seek to maximize a joint objective over the policy parameters θ and the latent template space Z:

max_{θ, Z} J_joint = Σ_x max_{z ∈ Z ∪ {ε}} E_{y∼π_θ(·|T(x,z))}[ r(x, y) ]    (3)

where ε denotes the empty template (original input). This formulation implies a bilevel optimization challenge: we must identify the optimal latent prompt z* to unlock the gradient for x, and simultaneously update θ to maximize the likelihood of high-reward trajectories. Since directly optimizing the discrete Z and the continuous θ jointly is intractable, P2O decouples this into an iterative alternating maximization process.

3.3 Phase 1: Policy Optimization with Context Distillation

In this phase, we fix the template set Z and update the policy parameters θ. A naïve approach would be to train the policy to map the augmented input x̃ to the correct output y. However, this creates a dependency on inference-time prompting. Instead, we employ Context Distillation (Snell et al., 2022) to transfer the reasoning capabilities triggered by z directly into the parameters of π_θ for the original input x.

Hard Sample Mining. At epoch t, we dynamically identify the set of hard samples D_hard^(t+1). For each query x_i, we perform K rollouts. We define a hardness metric based on the empirical success rate:

x_i ∈ D_hard^(t+1)  ⟺  (1/K) Σ_{k=1}^{K} r(x_i, y_ik) < τ    (4)

where τ is a threshold (typically near zero). These hard samples are collected to serve as the seed data for the subsequent Prompt Optimization phase.

Trajectory Augmentation & Distillation. For samples identified as hard in the immediately preceding epoch, standard exploration is insufficient. We leverage the template set Z^(t) evolved in that epoch. For a hard sample x ∈ D_hard^(t), we sample a template z ∼ Z^(t) and generate trajectories using the augmented context x̃ = T(x, z).
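In toy form, the resulting update (rollouts from the augmented context, gradients on the original input) can be sketched as follows. The tabular "policy" and all names are illustrative assumptions, not the paper's implementation:

```python
import math

def context_distillation_step(pi, x, trajectories, advantages, lr=0.1):
    """One REINFORCE-style step on the ORIGINAL-input distribution pi[x],
    using trajectories that were sampled from the prompt-AUGMENTED input.
    pi maps a context string to a dict {response: probability}."""
    probs = pi[x]
    grads = {y: 0.0 for y in probs}
    for y, A in zip(trajectories, advantages):
        grads[y] += A  # advantage-weighted score-function gradient
    # take the step in logit space, then renormalize (softmax)
    logits = {y: math.log(p) + lr * grads[y] for y, p in probs.items()}
    zmax = max(logits.values())
    exps = {y: math.exp(l - zmax) for y, l in logits.items()}
    total = sum(exps.values())
    pi[x] = {y: e / total for y, e in exps.items()}
    return pi

# The bare question "Q" gives the right answer almost no mass, but the
# hint-augmented context finds it; the update is applied to pi["Q"]:
pi = {
    "Q":        {"wrong": 0.99, "right": 0.01},
    "hint + Q": {"wrong": 0.20, "right": 0.80},  # rollout distribution
}
# pretend these trajectories were sampled from pi["hint + Q"]
context_distillation_step(pi, "Q", ["right", "right", "wrong"], [1.0, 1.0, -0.5])
```

Note that the augmented distribution pi["hint + Q"] is never touched: only the unprompted policy moves, which is the decoupling the next paragraph makes precise.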
Crucially, while the generation is conditioned on x̃, the policy update is calculated with respect to the original input x. Let ỹ be an improved trajectory generated via prompt augmentation. The gradient update is computed as:

∇_θ J ≈ (1/N) Σ_{i=1}^{N} Σ_{ỹ ∈ Ỹ} A(x, ỹ) · ∇_θ log π_θ(ỹ | x)    (5)

where Ỹ denotes the set of improved trajectories generated from the augmented input x̃. By decoupling the rollout context (x̃) from the gradient context (x), P2O forces the model to learn the reasoning path ỹ as an intrinsic capability, independent of the auxiliary prompt z. This effectively distills the "teacher" distribution π(·|x̃) into the "student" distribution π(·|x).

3.4 Phase 2: Evolutionary Prompt Optimization

In the second phase, we fix the policy π_θ and optimize the template set Z^(t+1) to address the newly identified hard set D_hard^(t+1). We employ GEPA (Algorithm 3) for this discrete optimization task.

Reflective Evolution. Unlike traditional genetic algorithms that rely on random mutation, GEPA utilizes the reference model π_init as a mutation operator to "perform gradient descent in semantic space" (Yang et al., 2024). For each template z in the current Pareto front Z_front, we sample a mini-batch D_mini from D_hard^train and generate error feedback F containing failed predictions and ground truths. The mutation step then produces an improved candidate:

z′ ∼ π_init("Propose Improvement" | z, F)    (6)

where "Propose Improvement" denotes the meta-prompt for proposing new candidate templates. We adopt the same meta-prompt as in Agrawal et al. (2025). Candidates that demonstrate improvement on D_mini are evaluated on the development set D_hard^dev and added to the template pool Z.

Pareto Selection. Using only the best template is insufficient to cover the diversity of failure modes in reasoning tasks.
GEPA maintains a population of templates and employs Pareto optimization on D_hard^dev to identify non-dominated candidates (Algorithm 4). This approach ensures that Z remains diverse and generalizable across different error patterns. After evolution, we apply a greedy set-cover strategy (Algorithm 2) to select a minimal Pareto set Z_covered and assign sample-specific templates to each instance in D_hard for the next training epoch. This mechanism enables targeted intervention while maintaining template diversity.

The complete training procedure is summarized in Algorithm 1. By iteratively refining the policy to absorb prompt-induced capabilities and evolving prompts to target the policy's remaining weaknesses, P2O establishes a virtuous cycle of self-improvement.

Algorithm 1: P2O
Input: Training data D, initial template set Z^(0) = ∅
Input: Policy model π_θ, reference model π_init
Input: Epochs N_epoch, rollout num K, threshold τ
Output: Optimized policy π_θ*
1: for t = 0 to N_epoch − 1 do
2:   Initialize next hard set D_hard^(t+1) ← ∅
3:   // Phase 1: Policy Optimization
4:   for all batches (x_i, ŷ_i) ∈ D do
5:     for k = 1 to K do
6:       x_ik ← x_i
7:       if x_i ∈ D_hard^(t) and Z^(t) ≠ ∅ then
8:         Sample template z ∼ Z^(t)
9:         x_ik ← T(x_ik, z)   // T is a template insertion function
10:      end if
11:      Sample y_ik ∼ π_θ(·|x_ik)
12:      r_ik ← RewardFunc(y_ik, ŷ_i)
13:    end for
14:    r̄_i ← (1/K) Σ_{k=1}^{K} r_ik
15:    if r̄_i < τ then
16:      D_hard^(t+1) ← D_hard^(t+1) ∪ {x_i}
17:    end if
18:    Compute advantages A_ik
19:    Update π_θ using {x_i, y_ik, A_ik}
20:  end for
21:  // Phase 2: Prompt Optimization
22:  if D_hard^(t+1) ≠ ∅ then
23:    Z^(t+1) ← Gepa(D_hard^(t+1), π_θ, π_init)   // Alg. 3
24:  else
25:    Z^(t+1) ← ∅
26:  end if
27: end for

Algorithm 2: GreedyPromptAssignment
Input: Template set with scores Z = {(z_1, (r_{1,1}, …, r_{1,N})), …, (z_M, (r_{M,1}, …, r_{M,N}))}, hard samples D_hard, rollout num K
       // each (r_{i,1}, …, r_{i,N}) contains the scores of template z_i on N dev samples
Output: Sample-specific template assignments {(x_j, Z_j)}
// Step 1: Greedy set cover to find a minimal Pareto set
1: Initialize template Pareto set Z_covered ← ∅ and covered sample set S_covered ← ∅
2: while True do
3:   Find z* ← argmax_{z_i ∉ Z_covered} |C(z_i) \ S_covered|
     // C(z_i) = {x_j | r_{i,j} = 1} is the set of dev samples solved by z_i
4:   if |C(z*) \ S_covered| = 0 then break
5:   Z_covered ← Z_covered ∪ {z*}
6:   S_covered ← S_covered ∪ C(z*)
7: end while
// Step 2: Assign K templates to each hard sample
8: Compute weights r̄_i ← (1/N) Σ_{n=1}^{N} r_{i,n} for each z_i ∈ Z_covered
9: Initialize assignment A ← {}
10: for x_j in D_hard do
11:   if x_j ∈ S_covered then
12:     Z_j ← {z_i ∈ Z_covered | x_j ∈ C(z_i)}   // restrict to templates that solve this sample
13:   else
14:     Z_j ← Z_covered
15:   end if
16:   Sample Z_j = (z_{j,1}, …, z_{j,K}) from Z_j, K times, weighted by {r̄_i}
17:   A ← A ∪ {(x_j, Z_j)}
18: end for
19: return A

Table 1: Comparative evaluation on mathematical reasoning benchmarks. We report the accuracy (%) of P2O against baselines. P2O consistently outperforms the baselines, achieving the most significant gains on challenging benchmarks such as AIME24 and AIME25. Best results are highlighted in bold.
| Model | AIME24 | AIME25 | AMC | MATH500 | Minerva | Olympiad | AVG |
|---|---|---|---|---|---|---|---|
| Qwen3-0.6B | 2.1 | 1.9 | 30.2 | 53.8 | 12.5 | 19.4 | 19.6 |
| Qwen3-1.7B | 13.1 | 10.2 | 47.7 | 72.8 | 22.1 | 38.4 | 34.1 |
| Qwen3-4B-Base | 9.2 | 5.8 | 37.7 | 67.4 | 32.4 | 35.6 | 31.9 |
| Qwen3-4B | 23.8 | 19.4 | 68.0 | 82.8 | 31.6 | 49.9 | 45.9 |
| DeepMath-5K: GRPO | 33.8 | 29.0 | 79.5 | 87.6 | 41.5 | 57.5 | 54.8 |
| DeepMath-5K: Qwen3-4B-P2O (Self-Ref) | 52.1 | 39.8 | 85.3 | 90.2 | 41.9 | 60.9 | 61.7 |
| DeepMath-5K: Qwen3-4B-P2O (Teacher-Ref) | 44.6 | 34.4 | 81.9 | 90.2 | 36.0 | 60.0 | 57.9 |
| DeepScaler-5K: GRPO | 46.9 | 37.7 | 88.1 | 90.4 | 41.5 | 58.2 | 60.5 |
| DeepScaler-5K: Qwen3-4B-P2O (Self-Ref) | 51.5 | 42.1 | 88.1 | 91.0 | 40.1 | 61.8 | 62.4 |
| DeepScaler-5K: Qwen3-4B-P2O (Teacher-Ref) | 59.8 | 49.4 | 92.2 | 91.4 | 36.4 | 62.2 | 65.2 |

4 Experiments

4.1 Settings

Datasets. We conduct experiments on two distinct datasets, each comprising N = 5,000 samples. The first dataset is a subset randomly sampled from DeepScaler (Luo et al., 2025), while the second is derived from DeepMath (He et al., 2025) by selecting samples with a difficulty level of ≥ 7.

Method Configuration. We combine GRPO with a one-step off-policy strategy, excluding the KL-divergence penalty. The training hyperparameters include a maximum learning rate of 1×10^-6, a global batch size of 128, and a maximum generation length of 12k tokens. During the rollout phase, we use a temperature of T = 0.6 and sample K = 6 trajectories per prompt. All models are trained for 5 epochs. We use Qwen3-4B (Yang et al., 2025) as the training backbone. For GEPA, we use a validation set of 300 examples. Unlike the original GEPA setting (Agrawal et al., 2025), we use a beam size W to parallelize the Pareto search for improved computational efficiency. The candidate selection beam size is set to W = 16 (yielding ∼40 templates per iteration).
Regarding the reflection model in GEPA, we evaluate two variants: P2O (Self-Ref), which utilizes the reference model (Qwen3-4B) as the mutation operator, and P2O (Teacher-Ref), which employs Kimi-K2 (Team et al., 2025) as a stronger external teacher to provide high-quality feedback and prompt mutations.

Reward. We employ a strict binary reward (r ∈ {0, 1}). A reward of 1 is granted solely when the response follows the required format and matches the ground truth.

Evaluation Protocol. We employ the open-source evaluation suite (https://github.com/QwenLM/Qwen2.5-Math/tree/main/evaluation) provided by Qwen for all mathematical benchmarks. To ensure statistical robustness and encourage exploration on smaller datasets (fewer than 100 samples), we report the average performance across 16 rollouts per question, generated with a temperature of 0.6. For larger datasets, we adopt greedy sampling.

4.2 Main Results

Figure 3: Training dynamics of P2O. The plots illustrate the dynamics of training reward (left) and validation accuracy (right) throughout the optimization process using the Teacher-Ref variant on the DeepScaler-5K dataset. P2O maintains a higher reward compared to the baseline by dynamically cracking hard samples. Crucially, this reward advantage translates into robust gains in validation accuracy, confirming that the context distillation mechanism effectively transfers prompt-dependent success into intrinsic model capability.

Figure 4: Prompt optimization effectiveness. Left: optimized prompts (triangles) consistently yield gains over the standard policy (circles) on both Pass@1 and Pass@6 metrics, demonstrating that GEPA effectively assists the model in bridging the performance gap. Right: with this assistance, the model continuously conquers hard samples, resulting in a steady decline in the number of intractable instances throughout the training epochs. Data points are derived from the Teacher-Ref variant on the DeepScaler-5K dataset.
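As an aside on the reward design of Section 4.1, the strict binary reward can be sketched in a few lines. The \boxed{} answer convention below is an assumption of this illustration; the paper only states "follows the format and matches the ground truth":

```python
import re

def binary_reward(response: str, ground_truth: str) -> float:
    """Strict 0/1 reward: the format must be followed AND the final
    answer must match the ground truth; otherwise the reward is 0."""
    m = re.search(r"\\boxed\{([^{}]*)\}", response)
    if m is None:
        return 0.0  # format violation: no boxed answer found
    return 1.0 if m.group(1).strip() == ground_truth.strip() else 0.0

print(binary_reward(r"Thus the answer is \boxed{42}.", "42"))  # → 1.0
print(binary_reward("The answer is 42.", "42"))                # → 0.0
```

Such an all-or-nothing signal is precisely why hard samples yield zero-advantage groups, making the prompt-evolution rescue mechanism necessary.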
Table 1 summarizes the performance of models trained on DeepMath-5K and DeepScaler-5K across six challenging mathematical benchmarks, comparing P2O against Qwen3 base models and the GRPO baselines.

Comparison with Baselines. As shown in Table 1, P2O consistently outperforms both the backbone model and the GRPO baseline across most mathematical benchmarks. On the DeepScaler-5K dataset, our best configuration achieves an average accuracy of 65.2%, surpassing the GRPO baseline by 4.7%. Similar performance gains are observed on the DeepMath-5K dataset, where P2O yields a 6.9% improvement over GRPO, demonstrating robustness across different training distributions. The benefits of P2O are most pronounced in high-difficulty reasoning tasks that demand extensive exploration. Notably, on the AIME24 and AIME25 benchmarks under the DeepScaler-5K setting, P2O improves upon GRPO by 12.9% and 11.7% respectively. These results empirically validate our hypothesis: prompt evolution bridges the exploration gap for hard samples, recovering vital training signals from instances that would otherwise remain intractable.

Impact of Reflection Models on Prompt Evolution. Our comparison reveals that the optimal reflection source is task-dependent: on DeepScaler-5K, the Teacher-Ref variant dominates (65.2% vs. 62.4% avg.), whereas on DeepMath-5K, P2O (Self-Ref) (61.7% avg.) surprisingly outperforms the Teacher-Ref variant (57.9% avg.). This divergence suggests that for certain specialized domains, a policy model may benefit more from self-generated refinements that strictly align with its own reasoning capacity, whereas an external teacher might introduce complex priors that are harder for the policy to internalize. Crucially, all P2O variants consistently surpass the GRPO baseline across both datasets, confirming that the structural exploration enabled by GEPA is a robust driver of performance gains regardless of the reflection source.
4.3 Training Dynamics

Analysis of Learning Curves. As shown in Figure 3, the continuous integration of optimized prompts maintains a training reward consistently superior to the baseline. Crucially, this advantage translates into significant gains in validation accuracy, confirming that the model effectively internalizes the elicited reasoning patterns to achieve robust in-distribution generalization.

Analysis of Prompt Optimization Process. Figure 4 elucidates the mechanism driving P2O's performance. As the model's intrinsic capability grows during training, GEPA effectively assists in continuously conquering "hard samples", evidenced by the steady decline in the number of intractable instances (right). Furthermore, the performance breakdown (left) reveals that prompt optimization yields consistent gains across both Pass@1 and Pass@6 metrics. This demonstrates that the evolved templates do not merely facilitate a single lucky guess; rather, they robustly enhance the model's solution space, boosting both the deterministic success rate and the broader exploration coverage required for effective distillation.

4.4 Ablations

Table 2: Ablation study of P2O components. Comparison of performance on mathematical benchmarks when excluding context distillation or group prompt diversity. Data points are derived from the Teacher-Ref variant on the DeepScaler-5K dataset.

| Model | AIME24 | AIME25 | AMC | MATH500 | Minerva | Olympiad | AVG |
|---|---|---|---|---|---|---|---|
| Qwen3-4B | 23.8 | 19.4 | 68.0 | 82.8 | 31.6 | 49.9 | 45.9 |
| GRPO | 46.9 | 37.7 | 88.1 | 90.4 | 41.5 | 58.2 | 60.5 |
| Qwen3-4B-P2O (Teacher-Ref) | 59.8 | 49.4 | 92.2 | 91.4 | 36.4 | 62.2 | 65.2 |
| Qwen3-4B-P2O (w/o context distillation) | 36.5 | 31.5 | 81.1 | 89.6 | 40.1 | 54.8 | 55.6 |
| Qwen3-4B-P2O (same template in group) | 57.3 | 44.6 | 88.9 | 90.8 | 40.4 | 63.0 | 64.2 |

Importance of Context Distillation. We investigate the necessity of context distillation by ablating the supervision signal.
Instead of calculating the policy gradient on the original query x using trajectories generated from the augmented input x̃ = T(x, z), the ablated model (P2O w/o context distillation) is trained directly on x̃. As shown in Table 2, excluding this component results in severe performance degradation: the average accuracy drops from 65.2% (P2O Teacher-Ref) to 55.6%. Crucially, the ablated model not only fails to match the proposed method's performance but also falls behind the GRPO baseline (60.5%) by 4.9%. This result suggests that training directly on prompt-augmented data leads to a "dependency" effect, where the model learns to rely on the auxiliary templates to solve the task rather than internalizing the reasoning logic. Consequently, when these templates are absent during evaluation, the model fails to generalize. Thus, context distillation is not merely an enhancement but a prerequisite for effectively transferring capabilities from the teacher reference into the model's intrinsic parameters.

Impact of Group Prompt Diversity. We further investigate the impact of prompt diversity within each rollout group during training. In the standard P2O framework, we utilize a Pareto-based selection strategy and assign a diverse set of templates to each hard sample (i.e., different templates may be used across the K rollouts for the same query). To evaluate the necessity of this diversity, we compare it against a baseline variant, P2O (same template in group), where the model samples a single template z ∼ Z^(t) per query x_i and applies that fixed template to all K rollouts in the group. As shown in Table 2, the diversity-driven approach (P2O Teacher-Ref) achieves an average accuracy of 65.2%, outperforming the single-template baseline (64.2%). The gain is particularly pronounced on high-difficulty benchmarks such as AIME24 (+2.5%) and AIME25 (+4.8%).
This empirical evidence confirms that employing a diverse set of Pareto-optimal prompts within the same rollout group provides broader coverage of the reasoning space. By eliciting a wider variety of successful trajectories for a single hard sample, the model receives a denser and more robust supervision signal, which facilitates more effective distillation of complex reasoning capabilities into the model's parameters.

5 Related Works

Reinforcement Learning with Verifiable Rewards (RLVR) and Group Relative Policy Optimization (GRPO) are widely used for LLM reasoning alignment, stabilizing training via group baselines and verification, but they still struggle with exploration in sparse or misleading reward landscapes. To address these scalability and stability issues, recent works have focused on algorithmic advances. For instance, DAPO (Yu et al., 2025) prunes prompts with accuracy equal to 1 or 0 to improve training stability. Despite these improvements, such methods primarily focus on optimizing the policy on a fixed manifold, leaving the initial task prompts untouched. To further aid the model in traversing complex reasoning paths, various "hint-based" or guidance-augmented strategies have been proposed. Approaches such as strong-model guidance (Nath et al., 2025; Liu et al., 2025b), experience replay data (Zhan et al., 2025), and expert solutions (Zhang et al., 2025b; Yan et al., 2025; Li et al., 2025) attempt to scaffold the reasoning process. Similarly, Critique-GRPO (Zhang et al., 2025a) utilizes feedback signals to guide the policy. Crucially, however, these methods typically rely on heuristically obtained hints or expert trajectory fragments derived from the dataset. There is a notable absence of optimization of the guidance itself; the prompts or hints are treated as static constants rather than dynamic variables. This reliance on predefined or oracle-dependent guidance limits the model's ability to generalize beyond the scope of the provided hints.
In contrast, P2O treats the prompt as an optimizable parameter, jointly evolving the task topology with the policy to achieve bi-level self-improvement.

6 Conclusion

In this paper, we proposed P2O, a framework that synergizes genetic prompt evolution with reinforcement learning to address the exploration bottleneck in reasoning tasks. By leveraging optimized prompts to unlock successful trajectories for hard samples and subsequently internalizing these capabilities through context distillation, P2O effectively recovers valid training signals for previously intractable instances, thereby preventing the policy from collapsing into local optima. Extensive experiments on mathematical benchmarks demonstrate that P2O significantly outperforms standard GRPO baselines, validating the effectiveness of jointly optimizing the prompt space and the parameter space to achieve robust, autonomous self-improvement.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning, specifically reinforcement learning and reasoning post-training for Large Language Models. Our proposed method, P2O, aims to automate the discovery of effective reasoning strategies to assist policy-model training, thereby potentially lowering the barrier to developing capable reasoning models. There are many potential societal consequences of our work, primarily related to the general advancement of AI reasoning capabilities and their downstream applications. However, we feel that none of them must be specifically highlighted here beyond the standard ethical considerations of the field.

References

L. A. Agrawal, S. Tan, D. Soylu, N. Ziems, R. Khare, K. Opsahl-Ong, A. Singhvi, H. Shandilya, M. J. Ryan, M. Jiang, et al. (2025) GEPA: reflective prompt evolution can outperform reinforcement learning. arXiv preprint arXiv:2507.19457.

A. Bercovich, I. Levy, I. Golan, M. Dabbah, R. El-Yaniv, O. Puny, I. Galil, Z. Moshe, T.
Ronen, N. Nabwani, et al. (2025) Llama-Nemotron: efficient reasoning models. arXiv preprint arXiv:2505.00949.

A. Blakeman, A. Grattafiori, A. Basant, A. Gupta, A. Khattar, A. Renduchintala, A. Vavre, A. Shukla, A. Bercovich, A. Ficek, et al. (2025) Nemotron 3 Nano: open, efficient mixture-of-experts hybrid Mamba-Transformer model for agentic reasoning. arXiv preprint arXiv:2512.20848.

C. Fernando, D. Banarse, H. Michalewski, S. Osindero, and T. Rocktäschel (2023) Promptbreeder: self-referential self-improvement via prompt evolution. arXiv preprint arXiv:2309.16797.

D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. (2025) DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.

Z. He, T. Liang, J. Xu, Q. Liu, X. Chen, Y. Wang, L. Song, D. Yu, Z. Liang, W. Wang, et al. (2025) DeepMath-103K: a large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning. arXiv preprint arXiv:2504.11456.

A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, et al. (2024) OpenAI o1 system card. arXiv preprint arXiv:2412.16720.

N. Lambert, J. Morrison, V. Pyatkin, S. Huang, H. Ivison, F. Brahman, L. J. V. Miranda, A. Liu, N. Dziri, S. Lyu, et al. (2024) Tulu 3: pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124.

J. Li, H. Lin, H. Lu, K. Wen, Z. Yang, J. Gao, Y. Wu, and J. Zhang (2025) QuestA: expanding reasoning capacity in LLMs via question augmentation. arXiv preprint arXiv:2507.13266.

M. Liu, S. Diao, X. Lu, J. Hu, X. Dong, Y. Choi, J. Kautz, and Y. Dong (2025a) ProRL: prolonged reinforcement learning expands reasoning boundaries in large language models. arXiv preprint arXiv:2505.24864.

Z. Liu, C. Gong, X. Fu, Y. Liu, R.
Chen, S. Hu, S. Zhang, R. Liu, Q. Zhang, and D. Tu (2025b) GHPO: adaptive guidance for stable and efficient LLM reinforcement learning. arXiv preprint arXiv:2507.10628.

M. Luo, S. Tan, J. Wong, X. Shi, W. Y. Tang, M. Roongta, C. Cai, J. Luo, L. E. Li, R. A. Popa, and I. Stoica (2025) DeepScaleR: surpassing o1-preview with a 1.5B model by scaling RL. Notion blog.

V. Nath, E. Lau, A. Gunjal, M. Sharma, N. Baharte, and S. Hendryx (2025) Adaptive guidance accelerates reinforcement learning of reasoning models. arXiv preprint arXiv:2506.13923.

K. Opsahl-Ong, M. J. Ryan, J. Purtell, D. Broman, C. Potts, M. Zaharia, and O. Khattab (2024) Optimizing instructions and demonstrations for multi-stage language model programs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y. Al-Onaizan, M. Bansal, and Y. Chen (Eds.), Miami, Florida, USA, pp. 9340–9366.

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. (2024) DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

C. Snell, D. Klein, and R. Zhong (2022) Learning by distilling context. arXiv preprint arXiv:2209.15189.

Y. Song, J. Kempe, and R. Munos (2025) Outcome-based exploration for LLM reasoning. arXiv preprint arXiv:2509.06941.

K. Team, Y. Bai, Y. Bao, G. Chen, J. Chen, N. Chen, R. Chen, Y. Chen, Y. Chen, Y. Chen, et al. (2025) Kimi K2: open agentic intelligence. arXiv preprint arXiv:2507.20534.

J. Yan, Y. Li, Z. Hu, Z. Wang, G. Cui, X. Qu, Y. Cheng, and Y. Zhang (2025) Learning to reason under off-policy guidance. arXiv preprint arXiv:2504.14945.

A. Yang, A.
Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388.

C. Yang, X. Wang, Y. Lu, H. Liu, Q. V. Le, D. Zhou, and X. Chen (2024) Large language models as optimizers. In ICLR.

Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, et al. (2025) DAPO: an open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.

M. Yuksekgonul, F. Bianchi, J. Boen, S. Liu, P. Lu, Z. Huang, C. Guestrin, and J. Zou (2025) Optimizing generative AI by backpropagating language model feedback. Nature 639, pp. 609–616.

E. Zelikman, Y. Wu, J. Mu, and N. Goodman (2022) STaR: bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems 35, pp. 15476–15488.

R. Zhan, Y. Li, Z. Wang, X. Qu, D. Liu, J. Shao, D. F. Wong, and Y. Cheng (2025) ExGRPO: learning to reason from experience. arXiv preprint arXiv:2510.02245.

X. Zhang, H. Sun, Y. Zhang, K. Feng, C. Lu, C. Yang, and H. Meng (2025a) Critique-GRPO: advancing LLM reasoning with natural language and numerical feedback. arXiv preprint arXiv:2506.03106.

X. Zhang, Z. Huang, Y. Li, C. Ni, J. Chen, and S. Oymak (2025b) BREAD: branched rollouts from expert anchors bridge SFT & RL for reasoning. arXiv preprint arXiv:2506.17211.

Appendix A Details on GEPA

We provide the comprehensive pseudocode for the Genetic-Pareto (GEPA) prompt optimization process used in Phase 2 of our framework. Please refer to Algorithm 3 for the detailed execution of the main evolutionary loop, and Algorithm 4 for the specific implementation of the Pareto-based frontier selection strategy.
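As a concrete companion to the pseudocode, the dominance test at the heart of the Pareto-based selection can be sketched in Python. This is a minimal illustration under our reading of Algorithm 4 (names and data layout are assumptions), not the paper's implementation:

```python
def dominates(r_j: list[float], r_i: list[float]) -> bool:
    """Template j dominates template i if it scores at least as well
    on every dev sample and strictly better on at least one."""
    return all(a >= b for a, b in zip(r_j, r_i)) and any(
        a > b for a, b in zip(r_j, r_i)
    )

def select_pareto_front(scored: dict[str, list[float]]) -> list[str]:
    """Keep every template whose per-sample dev scores are not
    dominated by any other template's scores."""
    return [
        z for z, r in scored.items()
        if not any(dominates(r2, r) for z2, r2 in scored.items() if z2 != z)
    ]

scores = {
    "z1": [1.0, 0.0, 0.0],  # best on dev sample 1
    "z2": [0.0, 1.0, 1.0],  # best on dev samples 2 and 3
    "z3": [0.0, 1.0, 0.0],  # dominated by z2
}
print(select_pareto_front(scores))  # ['z1', 'z2']
```

Note that per-sample dominance keeps specialists like z1 on the front even though its mean score is lower than z2's; selecting by mean score alone would discard such templates.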
Algorithm 3 GEPA
Input: hard data D_hard, policy model π_θ, reference model π_init
Input: budget C_total, mini-batch size B, width W
Output: template set Z
 1: Split D_hard into D_hard^train and D_hard^dev
 2: Initialize Z ← {(ε, Eval(π_θ, ε, D_hard^dev))}
 3: // ε denotes the empty template.
 4: // Eval() returns the reward of every sample in the given dataset under the given template.
 5: C_left ← C_total
 6: while C_left > 0 do
 7:   Z_front ← SelectParetoFront(Z, W)
 8:   for all z ∈ Z_front do
 9:     Sample mini-batch D_mini ⊂ D_hard^train of size B
10:     r̄_old ← mean(Eval(π_θ, z, D_mini))
11:     Generate error feedback F from rollouts and rewards on D_mini
12:     z′ ← π_init("Propose Improvement", z, F)
13:     r̄_new ← mean(Eval(π_θ, z′, D_mini))
14:     Update cost: C_left ← C_left − 2B
15:     if r̄_new > r̄_old then
16:       Z ← Z ∪ {(z′, Eval(π_θ, z′, D_hard^dev))}
17:       C_left ← C_left − |D_hard^dev|
18:     end if
19:   end for
20: end while
21: Z ← GreedyPromptAssignment(Z, D_hard)
22: return Z

Algorithm 4 SelectParetoFront
Input: template set with scores Z = {(z_1, (r_1,1, …, r_1,N)), …, (z_M, (r_M,1, …, r_M,N))}, width W
// Each (r_i,1, …, r_i,N) contains the scores of template z_i on the N dev samples.
Output: selected templates Z_front
 1: // Step 1: identify Pareto-optimal templates
 2: Z_front ← ∅
 3: for i = 1 to M do
 4:   dominated ← False
 5:   for j = 1 to M do
 6:     if j ≠ i and z_j dominates z_i then
 7:       // z_j dominates z_i if ∃n: r_j,n > r_i,n and ∀n′: r_j,n′ ≥ r_i,n′
 8:       dominated ← True
 9:       break
10:     end if
11:   end for
12:   if not dominated then
13:     Z_front ← Z_front ∪ {(z_i, (r_i,1, …, r_i,N))}
14:   end if
15: end for
16: // Step 2: select W templates from the Pareto front
17: if |Z_front| ≤ W then
18:   Z_front ← {z | (z, (r_1, …, r_N)) ∈ Z_front}
19: else
20:   // Sample based on mean dev scores
21:   Compute weights r̄_i ← (1/N) Σ_{n=1}^{N} r_i,n for each (z_i, (r_i,1, …, r_i,N)) ∈ Z_front
22:   Sample W templates from Z_front weighted by {r̄_i}
23:   Z_front ← {sampled templates}
24: end if
25: return Z_front

Appendix B Case Study

Figure 5: Qualitative Analysis: Overcoming Local Optima in Geometric Reasoning.

To further demonstrate how P2O overcomes the exploration bottleneck, we analyze a representative "hard sample" from the training process in Figure 5. The problem asks for the radius of the smallest sphere enclosing four unit spheres.

Failure of the Base Policy. Without guidance, the model falls into a common cognitive trap: it defaults to a visually intuitive but mathematically suboptimal configuration. As shown in the generated trace, the model attempts to place "3 spheres on a plane ... with the 4th sphere resting on top." This corresponds to a localized packing that yields a parent-sphere radius of 1 + √2, failing to minimize the volume.

Mechanism of the Optimized Prompt. The template evolved by GEPA acts as a targeted intervention in the model's latent search space. It does not merely encourage the model to "think harder"; rather, it injects specific domain knowledge, namely the concept of the regular tetrahedron and the precise mathematical constant for its centroid-to-vertex distance (√(3/8) per unit edge). This prompt effectively prunes the invalid search space (planar configurations) and steers the reasoning trajectory toward the global optimum (tetrahedral packing), allowing the model to derive the correct radius of 1 + √6/2.
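For reference, the correct radius quoted in the case study follows from elementary geometry; the following is a sketch of the standard derivation, not reproduced from the paper:

```latex
% Four mutually tangent unit spheres: their centers form a regular
% tetrahedron with edge a = 2. The centroid-to-vertex distance of a
% regular tetrahedron with edge a is a\sqrt{3/8}, so the radius of the
% smallest enclosing sphere (one unit beyond each center) is
\[
R \;=\; 1 + a\sqrt{\tfrac{3}{8}}\,\Big|_{a=2}
  \;=\; 1 + \frac{\sqrt{6}}{2} \;\approx\; 2.2247 ,
\]
% which improves on the planar configuration's radius of
% $1 + \sqrt{2} \approx 2.4142$ reported in the failed trace.
```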