
Paper deep dive

Reflection in the Dark: Exposing and Escaping the Black Box in Reflective Prompt Optimization

Shiyan Liu, Qifeng Xia, Qiyun Xia, Yisheng Liu, Xinyu Yu, Rui Qu

Year: 2026 · Venue: arXiv preprint · Area: cs.AI · Type: Preprint · Embeddings: 106

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 97%

Last extracted: 3/22/2026, 6:03:27 AM

Summary

The paper introduces VISTA, a multi-agent framework for Automatic Prompt Optimization (APO) designed to overcome the 'black-box' limitations of existing reflective methods like GEPA. VISTA decouples hypothesis generation from prompt rewriting, utilizes semantically labeled hypotheses, and implements a two-layer explore-exploit mechanism to address systematic failures such as seed traps, attribution blindspots, trajectory opacity, and transfer fragility.

Entities (5)

AIME2025 · dataset · 100%
GEPA · method · 100%
GSM8K · dataset · 100%
VISTA · framework · 100%
Automatic Prompt Optimization · paradigm · 95%

Relation Signals (3)

VISTA → evaluatedon → GSM8K

confidence 100% · consistently outperforms baselines across all conditions on GSM8K and AIME2025

VISTA → outperforms → GEPA

confidence 95% · VISTA recovers accuracy to 87.57% on the same defective seed and consistently outperforms baselines

GEPA → exhibits → Seed Trap

confidence 90% · GEPA with its official, defective seed prompt on GSM8K degrades accuracy

Cypher Suggestions (2)

Identify limitations associated with a specific method · confidence 95% · unvalidated

MATCH (m:Method {name: 'GEPA'})-[:EXHIBITS]->(l:Limitation) RETURN l.name

Find all methods evaluated on a specific dataset · confidence 90% · unvalidated

MATCH (m:Method)-[:EVALUATED_ON]->(d:Dataset {name: 'GSM8K'}) RETURN m.name

Abstract

Automatic prompt optimization (APO) has emerged as a powerful paradigm for improving LLM performance without manual prompt engineering. Reflective APO methods such as GEPA iteratively refine prompts by diagnosing failure cases, but the optimization process remains black-box and label-free, leading to uninterpretable trajectories and systematic failure. We identify and empirically demonstrate four limitations: on GSM8K with a defective seed, GEPA degrades accuracy from 23.81% to 13.50%. We propose VISTA, a multi-agent APO framework that decouples hypothesis generation from prompt rewriting, enabling semantically labeled hypotheses, parallel minibatch verification, and interpretable optimization traces. A two-layer explore-exploit mechanism combining random restart and epsilon-greedy sampling further escapes local optima. VISTA recovers accuracy to 87.57% on the same defective seed and consistently outperforms baselines across all conditions on GSM8K and AIME2025.

Tags

ai-safety (imported, 100%) · csai (suggested, 92%) · preprint (suggested, 88%)

Links


Full Text

105,252 characters extracted from source content.


Reflection in the Dark: Exposing and Escaping the Black Box in Reflective Prompt Optimization

Shiyan Liu 1,2, Qifeng Xia 2, Qiyun Xia 3, Yisheng Liu 2, Xinyu Yu 2, Rui Qu 2
1 University of California, Berkeley; 2 Huazhong University of Science and Technology; 3 Hefei University of Technology

Abstract

Automatic prompt optimization (APO) has emerged as a powerful paradigm for improving LLM performance without manual prompt engineering. Reflective APO methods such as GEPA iteratively refine prompts by diagnosing failure cases, but the optimization process remains black-box and label-free, leading to uninterpretable trajectories and systematic failure. We identify and empirically demonstrate four limitations: on GSM8K with a defective seed, GEPA degrades accuracy from 23.81% to 13.50%. We propose VISTA, a multi-agent APO framework that decouples hypothesis generation from prompt rewriting, enabling semantically labeled hypotheses, parallel minibatch verification, and interpretable optimization traces. A two-layer explore-exploit mechanism combining random restart and epsilon-greedy sampling further escapes local optima. VISTA recovers accuracy to 87.57% on the same defective seed and consistently outperforms baselines across all conditions on GSM8K and AIME2025.

1 Introduction

Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks (Brown et al., 2020; Wei et al., 2022; Kojima et al., 2022; Wang et al., 2023), yet their performance remains highly sensitive to prompt design: small changes in wording or instruction order can lead to dramatic differences in output quality. Manual prompt engineering is labor-intensive and requires extensive trial and error, motivating a growing body of work on Automatic Prompt Optimization (APO) (Ye et al., 2024; Chen et al., 2024; Sahoo et al., 2024).
APO addresses this bottleneck by iteratively refining prompts based on task feedback, replacing human intuition with algorithmic search. Early approaches such as OPRO (Yang et al., 2024) and ProTeGi (Pryzant et al., 2023) demonstrated that LLMs can serve as effective optimizers, proposing prompt edits in natural language guided by performance signals. More recently, reflective APO methods such as GEPA (Agrawal et al., 2026) have pushed this further by combining natural language reflection with genetic evolution and Pareto-based candidate selection.

[Figure 1: A representative failure of GEPA under the defective seed. The seed enforces the wrong field order (final_answer before solution_pad), so CoT goes unused and the base model scores 0; across rounds the reflector proposes generic fixes ("Add step-by-step instructions.", "Improve arithmetic reasoning.", "Clarify output format.", "Emphasize calculation steps.") while accuracy stays at 12-15%, far below a well-formed seed (≈85%), and the field order is never identified.]

Despite their promise, we identify a fundamental limitation shared by reflective APO methods: the optimization process is entirely black-box and label-free. Failure diagnosis and prompt rewriting are collapsed into a single reflection step, producing no record of what root cause was attributed, no semantic structure on the optimization trajectory, and no mechanism to detect when the search has been trapped from the start. Without this structure, the optimizer has no sense of where it started, what it can diagnose, where it has been, or whether its results will generalize. Figure 1 illustrates a representative failure.
Concretely, GEPA with its official, defective seed prompt on GSM8K (Cobbe et al., 2021) degrades accuracy from 23.81% to 13.50%, and the actual root cause remains unaddressed across all optimization rounds.

(arXiv:2603.18388v1 [cs.AI] 19 Mar 2026)

[Figure 2: Conceptual illustration of optimization trajectories under the defective seed: GEPA remains near the seed's local optimum in validation accuracy, while VISTA's candidate hypotheses, ε-greedy exploration, and random restarts escape toward the global optimum.]

We formalize this as four systematic limitations that form a causal chain:
•[L1] Seed Trap: optimization is sensitive to seed prompts, which silently constrain the search space and trap it in defective regions.
•[L2] Attribution Blindspot: the reflector's attribution space is doubly bounded, by its prior distribution and by its own capability, systematically missing root causes outside either bound.
•[L3] Trajectory Opacity: even when attribution points in the right direction, the optimization trajectory is entirely label-free, making the full evolution uninterpretable and preventing the optimizer from accumulating directional experience.
•[L4] Transfer Fragility: optimized prompts are model-specific, failing silently when transferred across base models.

To address these limitations, we propose VISTA (Verifiable, Interpretable, Semantic-TrAce Prompt Optimization). The key insight is to decouple hypothesis generation from prompt rewriting via a multi-agent design: a hypothesis agent proposes semantically labeled hypotheses guided by a heuristic set, while a reflection agent rewrites the prompt targeting each hypothesis independently. Parallel minibatch validation then selects the best hypothesis, constructing an interpretable optimization trace that makes every step auditable and directional.
A two-layer explore-exploit mechanism (Sutton and Barto, 2018), combining random restart for escaping seed-induced traps and epsilon-greedy sampling for hypothesis diversity, further ensures robust global search. On the same defective seed, VISTA recovers accuracy to 87.57%. Figure 2 illustrates how GEPA becomes trapped under a defective seed, whereas VISTA is able to escape.

Our contributions are as follows:
•We identify and formalize four systematic limitations of reflective APO under a unified interpretability-blindspot framework.
•We propose VISTA, a multi-agent APO framework that decouples hypothesis generation from prompt rewriting, enabling verifiable and interpretable optimization traces.
•We provide comprehensive experiments on GSM8K and AIME2025 (Balunović et al., 2025) validating both the proposed limitations and the effectiveness of VISTA.

2 Related Work

2.1 Automatic Prompt Optimization

Prompt engineering has long relied on manual design (Brown et al., 2020; Wei et al., 2022), but recent work has shifted toward automated approaches. Instruction induction methods (Honovich et al., 2023; Zhou et al., 2023) generate candidate prompts from input-output examples. Optimization-based methods treat prompt search as a black-box problem: OPRO (Yang et al., 2024) uses an LLM as a meta-optimizer guided by past scores, while APE (Zhou et al., 2023) searches over instruction candidates via sampling. Gradient-inspired methods such as ProTeGi (Pryzant et al., 2023) and TextGrad (Yuksekgonul et al., 2024) compute textual "gradients" from failure feedback to guide prompt updates. More recently, DSPy (Khattab et al., 2024) and MIPROv2 (Opsahl-Ong et al., 2024) extend APO to multi-module compound systems, jointly optimizing prompts and few-shot demonstrations. Evolutionary approaches (Fernando et al., 2024; Guo et al., 2024) apply self-referential mutation to prompt search.
GEPA (Agrawal et al., 2026) further advances this paradigm by combining reflective prompt mutation with Pareto-based candidate selection, achieving strong sample efficiency across diverse tasks. EvoX (Huang et al., 2025) proposes meta-evolution of the search strategy itself, jointly optimizing the prompt and the optimization policy. While these methods demonstrate strong empirical results, they share a common limitation: attribution generation and prompt rewriting are entangled in a single black-box step, leaving the optimization process without interpretable structure or verifiable root-cause attribution.

[Figure 3: Four systematic limitations of reflective APO. L1–L3 form a causal chain (L1 Seed Trap: "Where do I start?"; L2 Attribution Blindspot: "What can I see?"; L3 Trajectory Opacity: "Where have I been?"), and L4 (Transfer Fragility: "Where will it work?") can apply even when optimization succeeds.]

2.2 LLM Self-Correction and Its Limits

A prominent line of work explores iterative self-refinement of LLM outputs. Self-Refine (Madaan et al., 2023) and Reflexion (Shinn et al., 2023) prompt LLMs to critique and revise their own outputs across multiple rounds, demonstrating improvements on a range of tasks. However, Huang et al. (2024) provide a critical counterpoint: without access to external feedback, LLMs cannot reliably self-correct reasoning, as revisions are bounded by the same prior that produced the original error. Subsequent work has further characterized conditions under which self-correction succeeds or fails (Olausson et al., 2024; Stechly et al., 2023; Kamoi et al., 2024). These findings directly motivate VISTA: reflective APO diagnosis is subject to the same prior constraints, making external heuristics necessary.

3 Diagnosing the Black Box: Four Limitations

We now expand on each limitation introduced in Section 1.
These four limitations form a progressive indictment of reflective APO: the seed constrains where search begins (L1), the reflector constrains what can be hypothesized (L2), and the absence of semantic structure constrains what the optimizer can learn from its own trajectory (L3). Even if L1–L3 were fully resolved and optimization succeeded, the result carries no guarantee of generalization across base models (L4). Figure 3 illustrates how they manifest as a causal chain.

3.1 L1: Seed Trap

The seed prompt implicitly defines the initial search region. When the seed contains structural defects, such as incorrect output field ordering, malformed schema constraints, or contradictory instructions, the optimizer inherits these defects as implicit constraints. Because reflective APO produces no record of which seed-level assumptions are being carried forward, it has no mechanism to identify or question them.

[Figure 4: A case of attribution blindspot. Optimized prompts across rounds (Rd. 1: "You are a math assistant. Solve the problem step by step. Output in JSON."; Rd. 3 adds "Verify each calculation."; Rd. 5 adds "Break the problem into parts."; Rd. 7 adds "Explain your reasoning in detail before giving the final answer.") all target reasoning quality; the structural root cause remains outside the reflector's attribution space across all rounds and reflector configurations.]

Consider the official GEPA (Agrawal et al., 2026) seed prompt (Figure 1). Its output schema specifies final_answer before solution_pad, causing the base model to output its final answer before its Chain-of-Thought (CoT) reasoning, effectively preventing CoT from influencing the answer.
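The seed-trap mechanism above can be made concrete with a toy sketch. Under autoregressive decoding, JSON fields are emitted in schema order, so an answer field placed first is committed before any reasoning is written. The schema lists and helper below are illustrative, not the paper's implementation.

```python
# Toy illustration of the seed-trap defect (Section 3.1): field order in the
# output schema determines whether chain-of-thought can inform the answer.
DEFECTIVE_SCHEMA = ["final_answer", "solution_pad"]   # answer committed first; CoT unused
REPAIRED_SCHEMA  = ["solution_pad", "final_answer"]   # reasoning precedes the answer

def answer_sees_reasoning(schema):
    """True iff the reasoning field is generated before the answer field."""
    return schema.index("solution_pad") < schema.index("final_answer")

print(answer_sees_reasoning(DEFECTIVE_SCHEMA))  # → False
print(answer_sees_reasoning(REPAIRED_SCHEMA))   # → True
```

Because the defect lives in the schema rather than in the prose of the prompt, a reflector that only attributes failures to reasoning quality never surfaces it.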
This ordering constraint is silently inherited throughout all optimization rounds and never flagged as a candidate root cause.

3.2 L2: Attribution Blindspot

The reflector's attribution space is doubly bounded. First, structurally: the reflector can only propose root causes that fall within its prior distribution of plausible failure modes, systematically missing categories that are underrepresented or absent from that distribution. Huang et al. (2024) demonstrate that LLMs cannot reliably self-correct without external feedback, as revisions are constrained by the same prior that produced the original error; reflective APO diagnosis is subject to the same bound. Second, by capability: a weaker reflector has a narrower effective attribution space, and even within categories it can in principle reach, attribution quality degrades with model capability. Both constraints are invisible to the optimizer.

[Figure 5: GEPA attribution distribution on GSM8K (Qwen3-4B base, Qwen3-8B/GPT-4o-mini reflectors). Across defective, repaired, and minimal seeds, attributions concentrate on reasoning, task instructions, and format/syntax; Field Ordering receives zero attributions in every configuration.]

[Figure 6: A case of trajectory opacity. The optimization tree grows across rounds, but every edge carries only a question mark; the tree carries no semantic meaning, so the optimizer cannot detect oscillation, accumulate directional experience, or reason about its own trajectory.]

Analysis of GEPA's optimized prompts across all rounds confirms this: despite iterative optimization, no round ever corrects the field ordering defect (Figure 4). The reflector consistently attributes failures to reasoning errors, hallucinations, and instruction-following issues, never proposing structural attributions.
Figure 5 shows the distribution of attribution categories across configurations: the actual root cause (Field Ordering) receives zero attributions across all configurations regardless of reflector strength, confirming that the blindspot is prior-structural rather than purely capability-driven.

3.3 L3: Trajectory Opacity

Even if the reflector's attribution happens to point in the right direction, the optimizer has no way to know it did. GEPA does perform selection across candidate prompts, but the optimization trajectory is entirely label-free (Figure 6). Each transition from one prompt to the next is driven by an accuracy-gain signal with no record of what root cause was attributed, what change was made, or why accuracy shifted. The optimizer knows that something worked, but not what, and has no basis for reasoning about what to try next.

[Figure 7: A case of transfer fragility. Training on a strong model reaches 84% accuracy across rounds, but the field order defect was never corrected because the strong model's robustness masked it; once transferred to a weaker model, the latent defect is exposed silently and accuracy falls to 14%.]

This semantic vacuum has two concrete consequences. First, the optimizer cannot detect oscillation: if two conflicting attributions are applied alternately across rounds, there is no signal that the same ground is being revisited. Second, the optimizer cannot accumulate directional experience: each round starts from scratch, with no memory of which attribution categories have been productive or exhausted. The optimization tree grows in size but remains semantically empty, a map with nodes and edges but no labels, making the full evolution uninterpretable.
3.4 L4: Transfer Fragility

Prompts optimized under reflective APO are implicitly tailored to the base model's behavior during optimization (Zhao et al., 2021; Lu et al., 2022). When the optimized prompt is transferred to a different base model, the structural assumptions encoded in the prompt (output format expectations, reasoning style preferences, instruction sensitivity) may no longer hold, causing silent performance degradation. Because reflective APO produces no record of which model-specific behaviors were exploited during optimization, it provides no signal about where its results will or will not generalize.

This fragility is particularly counterintuitive: a prompt optimized against a stronger model may mask latent defects that the model's robustness compensates for. Once transferred to a weaker model that more faithfully follows the defective schema, the latent defect is exposed, and there is no mechanism to detect this regression without re-running evaluation on the target model (Figure 7).

[Figure 8: VISTA pipeline overview: a multi-agent framework that decouples hypothesis generation from prompt rewriting via heuristic-guided parallel verification and interpretable semantic trace trees. Layer 1 triggers a random restart with probability p via single-instance look-ahead; Layer 2 applies epsilon-greedy hypothesis sampling (exploit with probability 1-ε, explore with probability ε); the hypothesis and reflection agents generate candidate prompts that are evaluated on minibatches, with non-dominated candidates entering the Pareto candidate pool and the semantic trace tree.]
4 Proposed VISTA Framework

4.1 Overview

VISTA decouples hypothesis generation from prompt rewriting via a multi-agent design (Wu et al., 2024): a hypothesis agent proposes semantically labeled hypotheses, and a reflection agent rewrites the prompt targeting each hypothesis independently. Each prompt update is thus grounded in an explicit, verifiable hypothesis, making the optimization process interpretable at every step.

The Pareto pool retains non-dominated prompts under per-sample dominance following Agrawal et al. (2026), with prompts sampled proportional to their per-sample win count. In each round, a selected prompt is passed through the two-layer explore-exploit mechanism (Section 4.4) to generate K candidate hypotheses (Section 4.2). The reflection agent produces K candidate prompts, one per hypothesis. The best-verified candidate may be added to the Pareto pool with its root-cause label, extending the semantic trace tree (Section 4.3). Figure 8 gives a visual overview, and Algorithm 1 formalizes the full procedure.

4.2 Hypothesis Generation

Each hypothesis H_i = (c_i, d_i) consists of a category label c_i ∈ C and a natural language description d_i of the suspected failure mode. Let F_t denote the set of failure cases at round t. The label set C is an extensible taxonomy of heuristic failure modes curated from representative cases; each entry specifies a failure mode category, a description, and a suggested fix direction (see Appendix E).

Let H_θ denote the hypothesis space of a reflector with parameters θ, and let c* denote the true root cause. By Huang et al. (2024), P(c* ∈ H_θ) < 1 for root causes outside the model's prior, and this probability cannot be improved by further prompting alone. An unconstrained hypothesis agent is subject to the same bound. Let β = P(c* ∈ C) denote the coverage probability of C. The heuristic set introduces an external prior independent of θ, ensuring:

    β > P(c* ∈ H_θ)    (1)

for any c* covered by the heuristic set.
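The hypothesis structure H_i = (c_i, d_i) and the heuristic set C can be sketched as simple data types. The category names and fix directions below are illustrative placeholders, not the paper's actual taxonomy (which is in its Appendix E).

```python
from dataclasses import dataclass

# Sketch of a semantically labeled hypothesis H_i = (c_i, d_i) from Section 4.2.
@dataclass(frozen=True)
class Hypothesis:
    category: str      # c_i: label drawn from the heuristic taxonomy C
    description: str   # d_i: natural-language account of the suspected failure mode

# A toy heuristic set C; each entry pairs a failure-mode category with a
# suggested fix direction (illustrative entries only).
HEURISTIC_SET = {
    "field_ordering": "reorder output fields so reasoning precedes the final answer",
    "format_syntax": "repair malformed schema or JSON constraints",
    "reasoning": "strengthen step-by-step arithmetic instructions",
}

h = Hypothesis("field_ordering",
               "final_answer is emitted before solution_pad, so CoT never informs the answer")
```

Keeping the category label explicit is what later lets every edge of the trace tree carry a semantic annotation rather than an anonymous rewrite.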
Let π_t denote the prompt selected from P at round t. For each hypothesis H_i, the reflection agent independently rewrites π_t to address the hypothesized root cause, producing candidate π^(i); we write {π^(i)}_{i=1}^K for the full set of K candidates. Let D = D_train ∪ D_val denote the task dataset, partitioned into a training split for failure-case collection and a validation split for candidate selection. Let M ∼ D_train denote a minibatch of size b and acc(π, M) the accuracy of prompt π on M. Define the accuracy gain of a prompt π on minibatch M relative to the current prompt π_t as:

    Δacc(π, M) = acc(π, M) − acc(π_t, M)    (2)

The accuracy gain of the i-th candidate is then Δacc_i = Δacc(π^(i), M), and the winning hypothesis is selected by:

    i* = argmax_{i ∈ {j : Δacc_j > 0}} Δacc_i    (3)

The winner π^(i*) is then evaluated on D_val and added to P with label c* = c_{i*} if not dominated by any π ∈ P: the selected root cause is not what the reflector believes caused the failure, but what empirically produces the largest improvement on held-out instances.

Algorithm 1 VISTA Prompt Optimization
Require: dataset D = D_train ∪ D_val, seed prompt π_0, heuristic set C, budget B, hyperparameters K, p, ε, b
Ensure: optimized prompt π*
  Initialize Pareto pool P ← {π_0}, π* ← π_0
  while budget B > 0 do
    π_t ← SELECTPARETO(P); π′ ← ∅
    Sample M ∼ D_train with |M| = b
    if Uniform(0, 1) < p then
      π_restart ← RESTART(D_train)
      if Δacc(π_restart, M) > 0 then π′ ← π_restart end if
      B′ ← b + 1
    else
      for i = 1, …, K do
        Collect failure cases F_t from M
        H_i ← SAMPLE(C, F_t, ε)
        π^(i) ← REFLECT(π_t, H_i, F_t)
      end for
      π′ ← π^(i*), where i* = argmax_{i ∈ {j : Δacc_j > 0}} Δacc_i
      B′ ← b · K
    end if
    if π′ ≠ ∅ then
      Evaluate π′ on D_val
      if π′ is not dominated by any π ∈ P on D_val then
        Add π′ to P with label c* = c_{i*}; UPDATEPARETO(P)
      end if
      if acc(π′, D_val) > acc(π*, D_val) then π* ← π′ end if
      B′ ← B′ + |D_val|
    end if
    B ← B − B′
  end while
  return π*
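The minibatch verification rule in Eqs. (2)-(3) reduces to a guarded argmax over candidate gains. The helper below is a hypothetical sketch operating on precomputed Δacc values, not the paper's implementation; a round with no positive gain yields no update.

```python
# Minimal sketch of VISTA's minibatch verification step (Eqs. 2-3):
# among K candidate prompts, keep the index with the largest positive
# accuracy gain over the current prompt; return None if nothing improves.
def select_winner(gains):
    """gains: list of Δacc_i values, one per candidate prompt π^(i)."""
    positive = [(g, i) for i, g in enumerate(gains) if g > 0]
    if not positive:
        return None          # no hypothesis verified; the round produces no update
    return max(positive)[1]  # i* = argmax over candidates with Δacc_i > 0

print(select_winner([-0.05, 0.12, 0.03]))  # → 1
print(select_winner([-0.02, -0.01, 0.0]))  # → None
```

Restricting the argmax to strictly positive gains is what keeps a regressive rewrite from ever being labeled as a verified root cause.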
4.3 Semantic Trace

VISTA maintains a semantic trace tree T = (V, E), where each node v ∈ V corresponds to a prompt candidate π^(v) and each directed edge (u, v) ∈ E is annotated with the tuple (c*, δ), where δ = Δacc denotes the accuracy gain of the optimization step that produced v from u. The tree is rooted at π_0 and grows by one node per successful round, giving the full optimization history a structured, auditable form: every prompt can be traced back to its root-cause chain. The optimization trajectory up to round t is the root-to-current path:

    τ_t = π_0 --(c*_1, δ_1)--> π_1 --(c*_2, δ_2)--> ⋯ --(c*_t, δ_t)--> π_t    (4)

where c*_k ∈ C is the selected root-cause label at round k. Unlike GEPA's unlabeled sequence, τ_t is fully interpretable: each transition has a causal explanation and a verified performance delta Δacc^(t).

The trace τ_t is provided to the hypothesis agent as context at each round, enabling it to avoid already-explored directions, identify diminishing-return categories, and detect when two labels alternate without joint resolution. In the latter case, the agent generates a joint hypothesis addressing both dimensions simultaneously, collapsing what would otherwise require multiple sequential rounds into a single update.

4.4 Two-Layer Explore-Exploit

VISTA introduces a two-layer explore-exploit mechanism (Sutton and Barto, 2018) targeting L1 and L2 respectively.

Layer 1: Random Restart. At the start of each round, with probability p ∈ (0, 1), VISTA triggers a random restart (Lourenço et al., 2003). A blank-prompt look-ahead is executed iteratively: a single training instance is run under the current prompt, collecting the raw output o_raw and any parsing error e. The reflection agent constructs a new prompt conditioned on (o_raw, e), which is then used in the next look-ahead step. This loop repeats until e = ∅, at which point π_restart is considered initialized from the model's natural behavior rather than from any inherited seed constraints.
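The trace tree of Section 4.3 can be sketched as a parent map whose edges carry (c*, δ) annotations; the root-to-node walk then recovers the trajectory τ_t of Eq. (4). Class and node names here are hypothetical.

```python
# Sketch of the semantic trace tree T = (V, E): each edge carries (c*, δ),
# the selected root-cause label and the verified accuracy gain of that step.
class TraceTree:
    def __init__(self, seed_id):
        self.parent = {seed_id: None}   # node -> parent node
        self.edge = {}                  # node -> (label, delta) for the edge into it

    def add(self, parent_id, child_id, label, delta):
        self.parent[child_id] = parent_id
        self.edge[child_id] = (label, delta)

    def trajectory(self, node_id):
        """Root-to-node path τ_t as an ordered list of (label, delta) annotations."""
        path = []
        while self.parent[node_id] is not None:
            path.append(self.edge[node_id])
            node_id = self.parent[node_id]
        return list(reversed(path))

t = TraceTree("pi0")
t.add("pi0", "pi1", "field_ordering", 0.52)   # illustrative labels and gains
t.add("pi1", "pi2", "reasoning", 0.08)
print(t.trajectory("pi2"))  # → [('field_ordering', 0.52), ('reasoning', 0.08)]
```

Because every edge is labeled, the hypothesis agent can read the trajectory directly, spotting revisited categories or alternating labels instead of treating history as an opaque sequence.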
Formally, while standard mutation conditions on π_t:

    π^(i) = f_reflect(π_t, H_i, F_t)    (5)

random restart conditions on the model's natural behavior, using a fixed null hypothesis H_blank = (none, "initialize from model output") that carries no prior attribution:

    π_restart = f_reflect(∅, H_blank, o_raw)    (6)

This ensures the restarted prompt reflects what the model naturally produces, not what the seed constrains it to produce. The resulting π_restart is then evaluated against π_t on M; it is added to P only if Δacc(π_restart, M) > 0 and it is not dominated by any π ∈ P on D_val. The cost of a restart is a few look-ahead steps, making it negligible relative to the minibatch validation budget.

Layer 2: Epsilon-Greedy Hypothesis Sampling. Within each round, the K hypotheses are drawn according to an epsilon-greedy strategy. For each of the K slots:

    H_i ∼ Heuristic(C, F_t) with probability 1 − ε;  Free(F_t) with probability ε    (7)

The heuristic-set branch exploits known failure-mode categories; the free branch generates unconstrained hypotheses, allowing discovery of failure modes outside C. In expectation, ⌊(1 − ε)K⌋ hypotheses target known modes and ⌈εK⌉ explore novel ones.

Table 1: Accuracy (%) across benchmarks, conditions, and methods.

           GSM8K                            AIME2025
Method     Defective  Repaired  Minimal     Defective  Repaired  Minimal
No Opt.    23.81      85.59     20.67       38.67      40.00     40.00
GEPA       13.50      86.53     21.68       44.00      39.33     42.00
VISTA      87.57      87.34     85.67       46.00      46.67     44.00

Sample Efficiency. For the purposes of this analysis, we treat β as VISTA's effective per-round probability of identifying c*, lower-bounded by P(c* ∈ C) as defined in Section 4.2. Let α ∈ [0, 1) denote GEPA's empirical probability of selecting c* in any given round. The analysis in Section 3.2 suggests that α is empirically close to zero for structural failure modes, while β > α follows from C explicitly covering structural categories.
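The two layers compose into a simple per-round sampling plan: restart with probability p, otherwise fill K hypothesis slots per Eq. (7). The sketch below only plans which branch each slot takes; `plan_round` and the branch names are hypothetical stand-ins for the agents' actual calls.

```python
import random

# Sketch of the two-layer explore-exploit mechanism (Section 4.4).
# Layer 1: random restart with probability p (escapes seed traps, L1).
# Layer 2: per-slot epsilon-greedy hypothesis sampling (widens attribution, L2).
def plan_round(p=0.2, eps=0.1, K=3, rng=random.random):
    if rng() < p:
        return ["restart"]              # Layer 1: blank-prompt look-ahead restart
    slots = []
    for _ in range(K):                  # Layer 2: one draw per hypothesis slot
        if rng() < eps:
            slots.append("free")        # unconstrained hypothesis, may leave C
        else:
            slots.append("heuristic")   # exploit a known failure-mode category in C
    return slots

# Deterministic stub for illustration: draws 0.5 (no restart), then 0.05, 0.5, 0.5.
vals = iter([0.5, 0.05, 0.5, 0.5])
print(plan_round(rng=lambda: next(vals)))  # → ['free', 'heuristic', 'heuristic']
```

With the paper's defaults (p = 0.2, ε = 0.1, K = 3), most rounds exploit the heuristic taxonomy while a minority of slots and rounds probe outside it.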
Assuming that each round independently samples a candidate root cause with fixed probability (approximated as geometric trials), the expected number of rounds until c* is first identified satisfies:

    E[N_VISTA] ≤ 1/β < 1/α = E[N_GEPA],  α > 0    (8)

where N denotes the number of rounds until c* is first selected as i*. The inequality follows from β > α > 0. In the limiting case where α → 0, E[N_GEPA] diverges, which is consistent with the empirical observation that GEPA fails to identify the structural root cause across all rounds.

5 Experiments

5.1 Experimental Setup

Models and Benchmarks. For GSM8K (Cobbe et al., 2021) experiments, we use Qwen3-4B as the base model and Qwen3-8B (Yang et al., 2025) as the reflector. For AIME2025 (Balunović et al., 2025) experiments, we use GPT-4.1-mini (OpenAI, 2025) as the base model and GPT-4o-mini (OpenAI, 2024) as the reflector.

Conditions. We evaluate under three seed conditions to stress-test optimizer robustness: (1) Defective seed: the official GEPA (Agrawal et al., 2026) seed prompt, which contains an inverted output field order that silently disables CoT reasoning; (2) Repaired seed: a manually corrected version with the field order fixed; (3) Minimal seed: a single-sentence prompt with no structural constraints.

Baselines. We compare against two baselines: No optimization, which evaluates the seed prompt directly, and GEPA, the state-of-the-art reflective APO method.

VISTA Configuration. Unless otherwise specified, VISTA uses K = 3 hypotheses per round, restart probability p = 0.2, and exploration rate ε = 0.1 (see Appendix B for full setup details).

5.2 Main Results

Table 1 reports results across all seed conditions, methods, and benchmarks.

GSM8K. Under the defective seed, GEPA degrades accuracy from 23.81% to 13.50% while VISTA recovers to 87.57% (+74.07 points), directly validating L1 and providing evidence for L2.
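The geometric-trials assumption behind Eq. (8) is easy to check numerically: if each round independently identifies c* with probability β, the mean number of rounds until first identification should approach 1/β. This is a Monte-Carlo sanity check of the bound's premise, not a reproduction of the paper's experiments.

```python
import random

# Monte-Carlo check of the geometric-trials model in Eq. (8): with per-round
# identification probability beta, E[N] = 1/beta rounds until c* is first found.
def expected_rounds(beta, trials=20000, rng=random.Random(0)):
    total = 0
    for _ in range(trials):
        n = 1
        while rng.random() >= beta:   # round fails to identify c*; try again
            n += 1
        total += n
    return total / trials

print(expected_rounds(0.5))   # close to 1/0.5 = 2.0
print(expected_rounds(0.25))  # close to 1/0.25 = 4.0
```

As β shrinks toward GEPA's near-zero α for structural failure modes, the simulated round count grows without bound, matching the divergence noted above.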
The root cause, the inverted output field order, is never hypothesized by GEPA across all optimization rounds even with full failure information, confirming that the hypothesis space is the binding constraint rather than information availability. As shown in Figure 9b, GEPA's best-score trajectory remains flat throughout optimization while VISTA converges within the first few rounds, and this advantage holds under the same budget (Figure 9a).

Under the repaired seed, all methods converge to similar accuracy, confirming VISTA introduces no regression on well-formed seeds. Under the minimal seed, GEPA again fails to escape the low-accuracy region (20.67% → 21.68%), while VISTA recovers to 85.67%, demonstrating that VISTA's gains are not specific to the field-ordering defect.

[Figure 9: Optimization curves on GSM8K under the defective seed (best score to date). (a) Accuracy (%) vs. metric calls. (b) Accuracy (%) vs. optimization rounds; validation-best and test accuracy shown for GEPA and VISTA under the defective, repaired, and minimal seeds.]

Table 2: GSM8K accuracy (%) on the defective seed. Columns 1–2: Qwen3-4B base, Qwen3-8B/GPT-4o-mini reflectors. Column 3: trained on GPT-4.1-mini (GPT-4o-mini reflector), evaluated on Qwen3-4B.

Method   Qwen3-8B   GPT-4o-mini   Cross-Model
GEPA     13.50      23.43         22.74
VISTA    87.57      87.64         86.05

Table 2 validates L2 and L4. Across both reflector configurations, GEPA remains near baseline while VISTA maintains strong performance, confirming the blindspot is prior-structural rather than capability-driven. For L4, GEPA's cross-model accuracy (22.74%) remains near its single-model baseline (13.50%), confirming that optimization gains under GEPA do not transfer across base models.
By contrast, VISTA maintains strong performance under transfer (86.05%), suggesting that heuristic-guided optimization produces more structurally generalizable prompts.

AIME2025. Absolute gains are smaller than on GSM8K, reflecting the task's lower sensitivity to prompt structure under high problem difficulty. Nevertheless, VISTA consistently outperforms GEPA across all seed conditions, and GEPA degrades below the no-optimization baseline under the repaired seed (39.33% vs. 40.00%), further confirming the instability of unconstrained reflection.

5.3 Ablation Study

We conduct ablation experiments on GSM8K under the defective seed. All ablations fix Qwen3-8B as the reflector. Table 3 reports all results.

Table 3: Ablation results on GSM8K (%) under the defective seed (Qwen3-4B base, Qwen3-8B reflector).

Configuration                  K    p     ε     Acc.
Effect of K
  VISTA                        3    0.2   0.1   87.57
  VISTA, K=1                   1    0     0.1   75.97
  VISTA, K=3                   3    0     0.1   86.81
  VISTA, K=5                   5    0     0.1   83.89
Effect of Heuristic Set
  VISTA                        3    0.2   0.1   87.57
  w/o Exploration              3    0     0     85.60
  w/o Exploitation             3    0     1.0   22.97
Component Contribution
  GEPA                         –    –     –     13.50
  + Restart                    0    0.2   0     15.69
  + Parallel Sampling          3    0.2   0     20.17
  + Heur.-Guided Reflection    3    0.2   0     79.98

Two findings stand out. First, heuristic guidance is the dominant factor: removing exploitation (ε = 1.0) collapses accuracy to 22.97%, while removing exploration (ε = 0) causes only a modest drop to 85.60%. Second, the component contribution rows confirm that the heuristic set accounts for the largest single gain (+59.81 points), with random restart and parallel sampling contributing cumulatively but secondarily. K = 3 provides the best tradeoff; performance degrades at K = 5, suggesting noise from additional candidates outweighs their diversity benefit.
6 Conclusion

We identified four systematic limitations of reflective APO methods under a unified interpretability-blindspot framework, and demonstrated them concretely: on GSM8K with a defective seed, GEPA degrades accuracy from 23.81% to 13.50% while the root cause is never identified across all optimization rounds. We proposed VISTA, a multi-agent APO framework that decouples hypothesis generation from prompt rewriting, replacing black-box reflection with heuristic-guided parallel verification and interpretable semantic trace trees. VISTA recovers accuracy to 87.57% on the same defective seed and maintains strong performance across all configurations. We hope this work motivates a broader shift toward interpretable, robust optimization in the APO paradigm.

References

Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alex Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. 2026. GEPA: Reflective prompt evolution can outperform reinforcement learning. In The Fourteenth International Conference on Learning Representations.

Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. 2025. MathArena: Evaluating LLMs on uncontaminated math competitions.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, and 12 others. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Lichang Chen, Jiuhai Chen, Tom Goldstein, Heng Huang, and Tianyi Zhou. 2024.
InstructZero: efficient instruction optimization for black-box large language models. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. Preprint, arXiv:2110.14168.

Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel. 2024. Promptbreeder: self-referential self-improvement via prompt evolution. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org.

Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. 2024. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. In The Twelfth International Conference on Learning Representations.

Or Honovich, Uri Shaham, Samuel R. Bowman, and Omer Levy. 2023. Instruction induction: From few examples to natural language task descriptions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1935–1952, Toronto, Canada. Association for Computational Linguistics.

Beichen Huang, Ran Cheng, Zhuozhao Li, Yaochu Jin, and Kay Chen Tan. 2025. EvoX: A distributed GPU-accelerated framework for scalable evolutionary computation. IEEE Transactions on Evolutionary Computation, 29(5):1649–1662.

Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. 2024. Large language models cannot self-correct reasoning yet. In The Twelfth International Conference on Learning Representations.

Ryo Kamoi, Yusen Zhang, Nan Zhang, Jiawei Han, and Rui Zhang. 2024. When can LLMs actually correct their own mistakes? A critical survey of self-correction of LLMs.
Transactions of the Association for Computational Linguistics, 12:1417–1440.

Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan A, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. 2024. DSPy: Compiling declarative language model calls into state-of-the-art pipelines. In The Twelfth International Conference on Learning Representations.

Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, volume 35, pages 22199–22213. Curran Associates, Inc.

Helena R. Lourenço, Olivier C. Martin, and Thomas Stützle. 2003. Iterated Local Search, pages 320–353. Springer US, Boston, MA.

Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2022. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8086–8098, Dublin, Ireland. Association for Computational Linguistics.

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Systems, volume 36, pages 46534–46594. Curran Associates, Inc.

Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, and Armando Solar-Lezama. 2024. Is self-repair a silver bullet for code generation? In The Twelfth International Conference on Learning Representations.

OpenAI. 2024. GPT-4o mini: advancing cost-efficient intelligence.

OpenAI. 2025. Introducing GPT-4.1 in the API.
Krista Opsahl-Ong, Michael J Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab. 2024. Optimizing instructions and demonstrations for multi-stage language model programs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9340–9366, Miami, Florida, USA. Association for Computational Linguistics.

Reid Pryzant, Dan Iter, Jerry Li, Yin Lee, Chenguang Zhu, and Michael Zeng. 2023. Automatic prompt optimization with “gradient descent” and beam search. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7957–7968, Singapore. Association for Computational Linguistics.

Herbert Robbins. 1952. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535.

Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, and Aman Chadha. 2024. A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv e-prints, pages arXiv–2402.

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, volume 36, pages 8634–8652. Curran Associates, Inc.

Kaya Stechly, Matthew Marquez, and Subbarao Kambhampati. 2023. GPT-4 doesn’t know it’s wrong: An analysis of iterative prompting for reasoning problems. In NeurIPS 2023 Foundation Models for Decision Making Workshop.

Richard Sutton and Andrew Barto. 2018. Reinforcement Learning: An Introduction. MIT Press.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc.

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. 2024. AutoGen: Enabling next-gen LLM applications via multi-agent conversation. In ICLR 2024 Workshop on Large Language Model (LLM) Agents.

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025. Qwen3 technical report. Preprint, arXiv:2505.09388.

Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. 2024. Large language models as optimizers. In The Twelfth International Conference on Learning Representations.

Qinyuan Ye, Mohamed Ahmed, Reid Pryzant, and Fereshte Khani. 2024. Prompt engineering a prompt engineer. In Findings of the Association for Computational Linguistics: ACL 2024, pages 355–385, Bangkok, Thailand. Association for Computational Linguistics.

Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. 2024. TextGrad: Automatic "differentiation" via text. Preprint, arXiv:2406.07496.

Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 12697–12706. PMLR.

Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2023.
Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations.

Appendices

Table of Contents

A Discussion
  A.1 Remaining Limitations
  A.2 Data-Driven Heuristics
  A.3 Broader Semantic Trace Utility
  A.4 Adaptive Explore-Exploit Scheduling
B Experimental Setup (Contd.)
C Computational Cost
D Optimization Trees
E VISTA Prompt Templates
  E.1 Hypothesis Agent Prompt
  E.2 Reflection Agent Prompt
F Seed Prompts
  F.1 Defective Seed
  F.2 Repaired Seed
  F.3 Minimal Seed
G Optimization Traces
  G.1 Defective Seed
  G.2 Repaired Seed
  G.3 Minimal Seed
A Discussion

A.1 Remaining Limitations

VISTA addresses L1, L2, and L3 directly, and partially mitigates L4: heuristic-guided optimization produces structurally grounded prompts that transfer more reliably across base models than unconstrained reflection, as evidenced by Table 2. However, transfer fragility remains open in general—VISTA provides no explicit signal about generalization, and its transfer advantage may not hold across model families with larger capability gaps. Incorporating multi-model minibatch validation, where hypothesis selection is based on accuracy gains averaged across a distribution of base models, is a natural extension.

The heuristic set also has finite coverage. Failure modes outside its scope can only be discovered via the ε-exploration branch, which has lower expected sample efficiency than heuristic-guided search. This motivates more principled exploration strategies beyond epsilon-greedy sampling.

Finally, VISTA introduces three hyperparameters (K, p, ε) that interact with the rollout budget in non-obvious ways. Our ablations identify K = 3, p = 0.2, ε = 0.1 as a robust default, but optimal configurations likely vary across tasks and budget constraints.

A.2 Data-Driven Heuristics

The current heuristic set is manually curated from representative failure cases and modes. While this ensures coverage of known out-of-prior failure modes, the approach does not scale and may miss task-specific failure patterns underrepresented in the literature. A more scalable alternative is data-driven construction: collecting large-scale optimization trajectories across diverse tasks and reflectors, clustering failure cases by semantic similarity, and distilling recurring patterns into heuristic categories automatically.
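The distillation step just described could be sketched roughly as follows. This is an illustrative toy, not part of VISTA: bag-of-words cosine similarity stands in for a real sentence-embedding model, and the similarity threshold and support count are assumed values.

```python
# Sketch of data-driven heuristic construction: greedily cluster free-form
# failure-hypothesis descriptions by similarity, then promote recurring
# clusters into candidate heuristic categories.
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster_hypotheses(descriptions, threshold=0.5):
    clusters = []  # each cluster is a list of (description, vector) pairs
    for desc in descriptions:
        vec = Counter(desc.lower().split())
        for cl in clusters:
            if cosine(vec, cl[0][1]) >= threshold:  # compare to cluster seed
                cl.append((desc, vec))
                break
        else:
            clusters.append([(desc, vec)])
    return clusters

def promote(clusters, min_support=2):
    # recurring clusters become candidate heuristic categories
    return [cl[0][0] for cl in clusters if len(cl) >= min_support]

hyps = [
    "answer field emitted before reasoning steps",
    "final answer field emitted before the reasoning steps",
    "prompt lacks unit conversion guidance",
]
cands = promote(cluster_hypotheses(hyps))
```

With a real embedding model in place of the bag-of-words vectors, the same promote-on-recurrence logic would let the heuristic set grow from accumulated trajectories.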
Frequently-winning free hypotheses from the ε-exploration branch are a natural seed for this process—each successful free hypothesis that recurs across multiple runs is a candidate for promotion, enabling the heuristic set to grow with accumulated experience. This process has a bootstrap dependency, however: early trajectories are collected under the manually curated heuristic set and may underrepresent failure modes outside its scope. Seeding with diverse random-restart trajectories, where p = 1 throughout, provides a less biased initialization for the data-driven construction process.

A.3 Broader Semantic Trace Utility

The semantic trace tree already enables interpretable optimization histories and oscillation detection, but its potential extends further. The trace encodes a causal graph of which hypotheses led to which improvements, providing rich information that could actively guide future decisions—for instance, informing the System-Aware Merge (Agrawal et al., 2026) step by preferring merges between candidates whose root-cause label sequences are complementary, or prioritizing hypothesis categories that have historically been productive. We view the semantic trace as a general-purpose interpretability substrate for APO, and expect its utility to grow as more sophisticated uses are developed. The semantic trace also has potential for cross-task transfer: hypothesis categories that consistently produce large δ gains across multiple tasks constitute a task-agnostic prior that can warm-start optimization on new tasks, reducing the rounds needed to identify the dominant failure mode.

A.4 Adaptive Explore-Exploit Scheduling

The current two-layer mechanism uses fixed p and ε throughout optimization. In practice, the optimal balance shifts as optimization progresses: early rounds benefit from broader exploration to identify the dominant failure mode, while later rounds benefit from focused exploitation once a productive direction has been found.
The semantic trace provides a natural signal for adaptive scheduling. When the trace shows consistent improvement in a single hypothesis category, ε could be reduced to concentrate resources on exploitation. When oscillation is detected or improvement plateaus, p could be temporarily increased to trigger more aggressive restart behavior. Formalizing this as a bandit problem (Robbins, 1952) over the heuristic set—where the reward signal for each category c ∈ C is its empirical mean Δacc across historical selections—is a principled direction for future work.

B Experimental Setup (Contd.)

Hardware. All experiments were conducted on a single server equipped with one NVIDIA RTX 4090 (24 GB). Local inference for Qwen3-4B and Qwen3-8B was served via Ollama; GPT-4.1-mini and GPT-4o-mini were accessed through the OpenAI API.

Experimental Parameters. All experiments share the following configuration: minibatch size b = 8, budget T = 500, training/validation sizes of 50 each, maximum parallelism of 4 workers, and random seed 0. Evaluation uses exact-match accuracy. For GEPA, we use the default hyperparameters from the original implementation. For VISTA, the default configuration is K = 3, p = 0.2, and ε = 0.1; ablation groups vary one parameter at a time.

Datasets. For GSM8K, we use the official openai/gsm8k dataset; working training and validation sets are both sampled from the official training split and fixed for each run. The full official test split (1,319 examples) serves as the test set. For AIME, working training and validation sets are sampled from AI-MO/aimo-validation-aime (problems from AIME 2022–2024) and fixed for each run. The test set consists of the 30 problems from MathArena/aime_2025 (AIME 2025 I & II), each repeated 5 times to reduce evaluation variance, yielding 150 test instances in total.

Model Parameters. All local models are loaded in bfloat16 precision without quantization.
The base Qwen-3 model uses temperature=0.6, top-p=0.95, top-k=20, min-p=0, and presence_penalty=1.5, with thinking mode disabled (reasoning_effort=none). Hypothesis and reflection agents use default sampling parameters. Maximum generation length is set to the model default for all roles.

C Computational Cost

All local model inference (Qwen3-4B, Qwen3-8B) incurs no API cost. For experiments involving OpenAI models, total expenditure across all reported groups amounts to approximately $34.82–51.92. On GSM8K with a local base model (Qwen3-4B) and GPT-4o-mini reflector, a single VISTA run (T = 500, K = 3) costs $0.20 and a single GEPA run costs $0.12. On AIME2025 with GPT-4.1-mini as base model, a VISTA run costs $4.1–6.0 and a GEPA run costs $3.7–5.6. The marginal cost of the hypothesis agent relative to GEPA is $0.08–0.40 per run, confirming that VISTA’s diagnostic capability comes at negligible additional expense.

D Optimization Trees

Figures D.1 and D.2 show the optimization trees for GEPA and VISTA under the defective seed (Qwen3-4B base, Qwen3-8B reflector, GSM8K), where each node v ∈ V corresponds to a prompt candidate and each edge (u, v) ∈ E records the root-cause label and accuracy gain δ of the optimization step that produced v from u. We denote the initial base prompt as π_0. For all subsequent nodes, we denote accepted prompts as π_i^(k) and rejected candidates as r_i^(k), where the subscript i is the iteration index in which the candidate was generated, and the superscript k is its candidate index within that iteration. GEPA’s tree carries no semantic labels on any edge; VISTA’s tree annotates every transition with (c*, δ). Red and blue edges indicate accepted updates within an iteration, for GEPA and VISTA, respectively.

Figure D.1: GEPA optimization tree under the defective seed.
Every edge carries only a question mark; no root-cause label or accuracy gain δ is recorded for any transition. The structural root cause is never identified and optimization stagnates at 38%.

Figure D.2: VISTA optimization tree under the defective seed. Every edge indicates a hypothesis annotated with a root-cause label and δ. Iteration 1 immediately identifies cot_field_ordering as the structural root cause, achieving a +48p jump to 78% accuracy; iteration 3 further reaches 86% via the same diagnosis.

Figures D.3 and D.4 show the optimization trees for GEPA and VISTA under the repaired seed, following the same experimental setting and notation as above.

Figure D.3: GEPA optimization tree under the repaired seed. The trace follows a single-candidate mutation path; two independent branches successfully reach 88% accuracy, while one subsequent update on the left branch regresses to 86%.
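The trace-tree bookkeeping described in Section D could be represented roughly as follows. This is a sketch under assumed field names, not the paper's implementation: nodes hold prompt candidates and accuracies, and each edge stores the root-cause label and gain δ of the step that produced the child.

```python
# Minimal sketch of a semantic trace tree: labeled edges make the
# optimization history queryable, e.g. to find which hypothesis
# categories have historically been productive.
from dataclasses import dataclass, field

@dataclass
class TraceNode:
    prompt: str
    accuracy: float                               # validation accuracy
    children: list = field(default_factory=list)  # (label, delta, child) triples

    def add_step(self, prompt, accuracy, label):
        child = TraceNode(prompt, accuracy)
        self.children.append((label, accuracy - self.accuracy, child))
        return child

def labeled_gains(root):
    """Collect (label, delta) pairs over the whole trace."""
    out, stack = [], [root]
    while stack:
        node = stack.pop()
        for label, delta, child in node.children:
            out.append((label, delta))
            stack.append(child)
    return out

# Toy trace mirroring Figure D.2's first accepted step (+48p).
root = TraceNode("seed prompt", 0.30)
root.add_step("reordered prompt", 0.78, "cot_field_ordering")
```

Walking the tree with `labeled_gains` is the kind of query that could back the adaptive scheduling and merge-guidance uses discussed in Appendix A.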
Figure D.4: VISTA optimization tree under the repaired seed. The run branches early through parallel hypotheses in iterations 3, 4, and 9, reaching 88% peak accuracy via multiple distinct, semantically tagged paths. Unlabeled edges indicate iterations where no failure cases were collected on the minibatch of its parent, falling back to a single-mutation step as in GEPA.

Figures D.5 and D.6 show the optimization trees for GEPA and VISTA under the minimal seed, following the same experimental setting and notation as above.

Figure D.5: GEPA optimization tree under the minimal seed. The main chain reaches a peak of 32% at iteration 13 before regressing to 26% in the next update.
Figure D.6: VISTA optimization tree under the minimal seed. Iteration 2 immediately identifies cot_field_ordering as a high-impact root cause (+50p), then improves further through semantically tagged branches.

E VISTA Prompt Templates

VISTA uses two prompt templates: the hypothesis prompt instructs the hypothesis agent to generate semantically labeled root-cause hypotheses guided by the heuristic set; the reflection prompt instructs the reflection agent to rewrite the seed prompt targeting a specific hypothesis.

E.1 Hypothesis Agent Prompt

Hypothesis Agent Prompt

You are an expert prompt engineer analyzing why a system prompt causes failures on certain inputs.

CURRENT SYSTEM PROMPT: curr_instructions

ERROR TAXONOMY: error_taxonomy

FAILED SAMPLES: failed_samples

TASK: Analyze the failed samples and generate exactly num_hypotheses diverse root-cause hypotheses. For EACH hypothesis:
1. Select the most fitting category from the Error Taxonomy above (use the exact id field).
2. Provide a concise description of the specific root cause you identified.
3. Suggest a concrete fix direction for the prompt.

IMPORTANT:
- Each hypothesis MUST address a DIFFERENT aspect of the failures.
- Try to cover as many different taxonomy categories as possible.
- Be specific about what exactly in the current prompt causes the failure.
You MUST respond using EXACTLY the format below. Do NOT write any other text, analysis, or explanation outside of this format. Do NOT use markdown headers or bullet points. Just output exactly num_hypotheses blocks in this format:

[HYPOTHESIS 1]
TAG: taxonomy_id
DESCRIPTION: <one or two sentences describing the specific root cause>
FIX: <one or two sentences describing how to fix the prompt>

[HYPOTHESIS 2]
TAG: taxonomy_id
DESCRIPTION: <one or two sentences describing the specific root cause>
FIX: <one or two sentences describing how to fix the prompt>

[HYPOTHESIS 3]
TAG: taxonomy_id
DESCRIPTION: <one or two sentences describing the specific root cause>
FIX: <one or two sentences describing how to fix the prompt>

Start your response immediately. Do not include any preamble.

Heuristic Set Prompt

- id: cot_field_ordering
  name: CoT / Output Field Ordering Defect
  description: The output schema requires the final answer before the reasoning steps, preventing chain-of-thought from influencing the result.
- id: format_and_syntax
  name: Format / Syntax Defect
  description: The prompt does not strictly enforce output schema, key set, or syntax validity.
- id: task_instruction_clarity
  name: Task Instruction / Constraint Defect
  description: Task goals or constraints are ambiguous, contradictory, or incomplete.
- id: reasoning_strategy
  name: Reasoning Strategy / Logic Defect
  description: The prompt implies a flawed or suboptimal reasoning procedure for the task.
- id: missing_domain_knowledge
  name: Missing Domain Knowledge Gap
  description: The prompt lacks necessary domain facts or definitions required for solving.
- id: edge_case_handling
  name: Edge Case / Boundary Defect
  description: The prompt handles common inputs but fails on boundary or atypical cases.
- id: unclassified_custom
  name: Unclassified / Custom Discovery
  description: None of the predefined categories fit; discover and justify a latent failure mode.
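The fixed [HYPOTHESIS k] / TAG / DESCRIPTION / FIX response format above lends itself to simple mechanical parsing. The following is an illustrative sketch, not the paper's released code; the function and field names are assumptions.

```python
# Sketch: parse the hypothesis agent's rigid block format into records.
import re

BLOCK_RE = re.compile(
    r"\[HYPOTHESIS\s+(\d+)\]\s*"
    r"TAG:\s*(?P<tag>\S+)\s*"
    r"DESCRIPTION:\s*(?P<desc>.*?)\s*"
    r"FIX:\s*(?P<fix>.*?)(?=\[HYPOTHESIS|\Z)",
    re.DOTALL,
)

def parse_hypotheses(text):
    return [
        {"tag": m["tag"], "description": m["desc"].strip(), "fix": m["fix"].strip()}
        for m in BLOCK_RE.finditer(text)
    ]

reply = """[HYPOTHESIS 1]
TAG: cot_field_ordering
DESCRIPTION: The schema asks for final_answer before solution_pad.
FIX: Reorder the output fields so reasoning precedes the answer.
[HYPOTHESIS 2]
TAG: format_and_syntax
DESCRIPTION: JSON validity is not strictly enforced.
FIX: Add an explicit instruction to emit only a valid JSON object.
"""
hyps = parse_hypotheses(reply)
```

A strict format like this is what makes the hypothesis output machine-checkable, in contrast to the free-form reflection text that GEPA consumes.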
E.2 Reflection Agent Prompt

Reflection Agent Prompt

You are a prompt optimization expert. Given a prompt, a diagnosed root cause, and a set of failure cases, your task is to rewrite the prompt to fix the identified issue.

Root cause label: label
Hypothesis: hypothesis
Suggested fix: suggestion
Current prompt: prompt
Failure cases: failure_cases

Rewrite the prompt to address the identified root cause. Follow these rules:
- Make targeted edits only. Do not change parts of the prompt unrelated to the root cause.
- Preserve the output schema and JSON format unless the root cause is structure.
- Output only the rewritten prompt, with no explanation or preamble.

F Seed Prompts

All experiments use one of three seed prompts. The defective seed is the official GEPA seed for GSM8K, which contains the field order defect described in Section 3.1. The repaired seed corrects this defect manually. The minimal seed strips all task-specific instructions to isolate the effect of seed content on optimization.

F.1 Defective Seed

Defective Seed

You are an AI assistant that solves mathematical word problems. You will be given a question and you need to provide a step-by-step solution to the problem. Finally, you will provide the answer to the question. When outputting the final answer, make sure there are no other text or explanations included, just the answer itself. The expected output must be a JSON object with the following format: { "final_answer": <the final answer to the question>, "solution_pad": <the step-by-step solution to the problem> } Strictly follow the format provided above and ensure that your output is a valid JSON object. Any deviation from this format will result in an error.

Field order defect: final_answer precedes solution_pad, causing CoT to be generated after the answer.

F.2 Repaired Seed

Repaired Seed

You are an AI assistant that solves mathematical word problems. You will be given a question and you need to provide a step-by-step solution to the problem.
Finally, you will provide the answer to the question. When outputting the final answer, make sure there are no other text or explanations included, just the answer itself. The expected output must be a JSON object with the following format: { "solution_pad": <the step-by-step solution to the problem>, "final_answer": <the final answer to the question> } Strictly follow the format provided above and ensure that your output is a valid JSON object. Any deviation from this format will result in an error.

✓ Field order corrected: solution_pad precedes final_answer.

F.3 Minimal Seed

Minimal Seed

Solve and output in a single json: { "final_answer": <answer> }.

Minimal instructions: no task-specific guidance, minimal format constraint.

G Optimization Traces

We present optimization traces for GEPA and VISTA under all three seed conditions—defective, repaired, and minimal—using the Qwen3-4B base model and the Qwen3-8B reflector on GSM8K. Yellow highlighting marks changes introduced relative to the parent prompt.

G.1 Defective Seed

G.1.1 GEPA

Despite iterative optimization, the field order defect (final_answer before solution_pad) is preserved in every round. All modifications target reasoning quality.

Defective Seed, GEPA, Iteration 1 (Parent: Seed)
New subsample score 1.0 is not better than old score 2.0, skipping

Defective Seed, GEPA, Iteration 2 (Parent: Seed)

You are an AI assistant that solves mathematical word problems. You will be given a question and you need to provide a step-by-step solution to the problem. Finally, you will provide the answer to the question. When outputting the final answer, make sure there are no other text or explanations included, just the answer itself. The expected output must be a JSON object with the following format: { "final_answer": <the final answer to the question>, "solution_pad": <the step-by-step solution to the problem> } Strictly follow the format provided above and ensure that your output is a valid JSON object.
Any deviation from this format will result in an error.

Key guidelines to avoid errors:
1. **Parse problem terms precisely**:
   * "Round trip" means both up and down (double the one-way distance).
   * "More than half" implies adding to half of a value (e.g., "6 more than half of X" = (X/2) + 6).
   * "Discounts" apply sequentially (e.g., 20% off followed by 25% off the discounted price).
   * "Tips" are calculated as a percentage of the original cost, not the discounted price.
2. **Use exact arithmetic**:
   * Verify all calculations (e.g., 250 + 375 + 320 = 945, not 940).
   * Ensure multiplication and addition steps are correct (e.g., 945 × 2 = 1890, not 1980).
3. **Include all steps in the solution_pad**:
   * Show intermediate results explicitly (e.g., "Ella's score = 40 - 4 = 36").
   * Clarify reductions (e.g., "Marion's score = (36/2) + 6 = 24").
   * Account for all conditions (e.g., "Pups eaten = 56 × 2 = 112; surviving pups = 336 - 112 = 224").
4. **Validate the final answer**:
   * Ensure the answer matches the problem's requirements (e.g., "Total mice = adults + surviving pups = 56 + 224 = 280").
   * Avoid rounding errors or misinterpretations of percentages (e.g., 20% of $15 = $3, not $2.50).
5. **Format the JSON strictly**:
   * Do not include any text or explanations outside the JSON object.
   * Ensure the final answer is a number (e.g., `280`, not `"280"`).

Defective Seed, GEPA, Iteration 3 (Parent: Seed)
New subsample score 0.0 is not better than old score 3.0, skipping

Defective Seed, GEPA, Iteration 4 (Parent: Iteration 2)

You are an AI assistant that solves mathematical word problems. You will be given a question and you need to provide a step-by-step solution to the problem. Finally, you will provide the answer to the question. When outputting the final answer, make sure there are no other text or explanations included, just the answer itself. The expected output must be a JSON object with the following format: { "final_answer": <the final answer to the question>, "solution_pad": <the step-by-step solution to the problem> } Strictly follow the format provided above and ensure that your output is a valid JSON object. Any deviation from this format will result in an error.
Key guidelines to avoid errors:
1. **Parse problem terms precisely**:
   * "Round trip" means both up and down (double the one-way distance).
   * "More than half" implies adding to half of a value (e.g., "6 more than half of X" = (X/2) + 6).
   * "Discounts" apply sequentially (e.g., 20% off followed by 25% off the discounted price).
   * "Tips" are calculated as a percentage of the original cost, not the discounted price.
   * "Time differences" require calculating arrival vs. departure times (e.g., "missed the bus by X minutes" means arrival time - departure time).
2. **Use exact arithmetic**:
   * Verify all calculations (e.g., 250 + 375 + 320 = 945, not 940).
   * Ensure multiplication and addition steps are correct (e.g., 945 × 2 = 1890, not 1980).
   * Avoid rounding errors (e.g., 20% of $15 = $3, not $2.50).
   * For percentage reductions, calculate based on the original value (e.g., "1/10 less" = original - (original × 1/10)).
3. **Include all steps in the solution_pad**:
   * Show intermediate results explicitly (e.g., "Ella’s score = 40 - 4 = 36").
   * Clarify reductions (e.g., "Marion’s score = (36/2) + 6 = 24").
   * Account for all conditions (e.g., "Pups eaten = 56 adults × 2 = 112; surviving pups = 336 - 112 = 224").
   * For time-based problems, calculate arrival vs. departure times explicitly (e.g., "Arrived at 8:20, bus left at 8:00 → missed by 20 minutes").
4. **Validate the final answer**:
   * Ensure the answer matches the problem’s requirements (e.g., "Total mice = adults + surviving pups = 56 + 224 = 280").
   * Avoid misinterpretations of percentages or fractions (e.g., "1/10 less" = original × 9/10).
5. **Format the JSON strictly**:
   * Do not include any text or explanations outside the JSON object.
   * Ensure the final answer is a number (e.g., ‘280‘, not ‘"280"‘).
   * Use only valid JSON syntax (e.g., avoid expressions like ‘20+18+54‘ in the final answer; compute the value instead).
Defective Seed, GEPA, Iteration 5 (Parent: Iteration 4)
New subsample score 4.0 is not better than old score 4.0, skipping

Defective Seed, GEPA, Iteration 6 (Parent: Seed)
New subsample score 1.0 is not better than old score 1.0, skipping

Defective Seed, GEPA, Iteration 7 (Parent: Seed)
You are an AI assistant that solves mathematical word problems. You will be given a question and you need to provide a step-by-step solution to the problem. Finally, you will provide the answer to the question. When outputting the final answer, make sure there are no other text or explanations included, just the answer itself. The expected output must be a JSON object with the following format: "final_answer": <the final answer to the question>, "solution_pad": <the step-by-step solution to the problem> Strictly follow the format provided above and ensure that your output is a valid JSON object. Any deviation from this format will result in an error.
**Key Requirements:**
1. **Problem Breakdown:**
   * Identify the key variables and relationships in the problem.
   * Use precise mathematical operations (e.g., fractions, percentages, arithmetic progression).
   * Ensure calculations align with the problem’s wording (e.g., "100 more than half as many" requires halving first, then adding).
2. **Solution Steps:**
   * Clearly outline each step in plain text, avoiding markdown.
   * Include intermediate calculations (e.g., "Half of 3000 is 1500").
   * Verify that all operations are logically derived from the problem’s constraints.
3. **Final Answer:**
   * Provide the exact numerical answer (e.g., "2600") without text or units.
   * Ensure the answer matches the correct calculation, as highlighted in feedback.
4. **Edge Cases:**
   * Handle unit conversions (e.g., miles, kg) and spatial constraints (e.g., spacing between objects).
   * Account for rounding rules (e.g., integer results for physical quantities).
5. **Validation:**
   * Cross-check steps to avoid errors (e.g., in Example 7, ensure subtraction of miles is correct).
   * Use the problem’s context to validate feasibility (e.g., maximum weight capacity in Example 6).
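The accept/skip lines throughout these GEPA traces follow a strict-improvement rule on minibatch (subsample) scores: a candidate prompt replaces its parent only when its score strictly exceeds the parent's, and ties are skipped. A minimal sketch of that rule, with function and variable names that are illustrative rather than taken from any released code:

```python
def replay_greedy_rounds(scores):
    """Replay (new_score, old_score) pairs under the strict-improvement
    rule implied by trace lines like 'New subsample score 1.0 is not
    better than old score 2.0, skipping'. Returns the 1-based indices
    of iterations whose candidate prompt would be accepted."""
    accepted = []
    for i, (new, old) in enumerate(scores, start=1):
        if new > old:  # ties are skipped, matching "4.0 is not better than 4.0"
            accepted.append(i)
    return accepted

# Iterations 1, 3, 5, 6 of the defective-seed GEPA trace, as logged above:
# every candidate ties or loses, so none would be accepted.
print(replay_greedy_rounds([(1.0, 2.0), (0.0, 3.0), (4.0, 4.0), (1.0, 1.0)]))  # → []
```

Because acceptance depends only on the aggregate score, a structural defect that caps the score (such as the field-order bug) can persist indefinitely, which is the failure mode these traces document.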
Defective Seed, GEPA, Iteration 8 (Parent: Seed)
You are an AI assistant that solves mathematical word problems. You will be given a question and you need to provide a step-by-step solution to the problem. Finally, you will provide the answer to the question. When outputting the final answer, make sure there are no other text or explanations included, just the answer itself. The expected output must be a JSON object with the following format: "final_answer": "<the final answer to the question>", "solution_pad": "<the step-by-step solution to the problem>" Strictly follow the format provided above and ensure that your output is a valid JSON object. Any deviation from this format will result in an error.
### Key Guidelines for Accuracy:
1. **Break Down the Problem**: Identify all components of the problem (e.g., numbers, operations, relationships) and solve them sequentially.
2. **Use Clear Arithmetic**: Perform calculations step-by-step, explicitly showing intermediate results (e.g., "Total = 30 + 20 = 50").
3. **Check for Misinterpretations**:
   * For time problems, ensure correct subtraction/addition (e.g., "If the bus leaves at 8:00 and the person arrives at 8:20, the delay is 20 minutes").
   * For percentage discounts, apply discounts sequentially (e.g., "First apply 20% off, then 25% off the discounted price").
   * For averages, sum all values and divide by the count (e.g., "Total birds = 35 + 25 + 80 = 140; Average = 140 / 20 = 7").
4. **Avoid Common Errors**:
   * Do not assume "five times more" means "original + 5x" (it typically means "5x").
   * Ensure units are consistent (e.g., convert minutes to hours if required).
5. **Final Answer Format**:
   * The ‘final_answer‘ must be a string (e.g., "20", "228", "6.5") and match the exact numerical value from the solution.
   * Do not include units, text, or explanations in the ‘final_answer‘.
Example of Correct JSON Output:
"final_answer": "20",
"solution_pad": "Delaney leaves at 7:50 a.m. and takes 30 minutes to reach the pick-up point, arriving at 8:20 a.m. The bus leaves at 8:00 a.m., so he missed it by 20 minutes."

Defective Seed, GEPA, Iteration 9 (Parent: Seed)
You are an AI assistant tasked with solving mathematical word problems.
Your response must strictly adhere to the following format:
"final_answer": <the exact numerical answer to the problem, formatted as a number or expression>,
"solution_pad": <a step-by-step explanation of the solution, with each step clearly numbered and containing all intermediate calculations. Use markdown formatting for clarity (e.g., ‘1. Calculate...‘, ‘2. Substitute values...’). Ensure calculations are explicitly shown (e.g., ‘24/2 = 12’) and the final answer is explicitly stated at the end.>
**Key Requirements:**
1. **Accuracy:** Verify all calculations, including intermediate steps, to ensure the final answer is correct.
2. **Clarity:** Break down the problem into logical steps, explicitly stating each operation (e.g., addition, subtraction, multiplication, division) and its result.
3. **Formatting:**
   * Use JSON syntax strictly (commas, quotes, proper brackets).
   * Do not include any text outside the JSON object.
   * Ensure the ‘final_answer‘ field contains **only** the final result, without explanations or formatting (e.g., ‘34‘, not ‘34‘).
4. **Problem-Specific Details:**
   * Identify variables and their relationships explicitly (e.g., "Let V = Veteran’s Park trash cans").
   * For multi-step problems, ensure each action (e.g., "moving trash cans") is accounted for in the solution.
   * For word problems involving rates, time, or age, use precise formulas (e.g., distance = speed × time).
5. **Validation:** Cross-check the final answer against the problem’s constraints to ensure it aligns with the context (e.g., total items, weight limits, or age relationships).
**Example:**
For a problem like "Ella got 4 incorrect answers out of 40, and Marion got 6 more than half of Ella’s score," the solution_pad should include:
1. "Ella’s score = 40 - 4 = 36."
2. "Half of Ella’s score = 36/2 = 18."
3. "Marion’s score = 18 + 6 = 24."
4. "Final answer: 24."

Defective Seed, GEPA, Iteration 10 (Parent: Iteration 8)
You are an AI assistant that solves mathematical word problems. You will be given a question and you need to provide a step-by-step solution to the problem. Finally, you will provide the answer to the question.
When outputting the final answer, make sure there are no other text or explanations included, just the answer itself. The expected output must be a JSON object with the following format: "final_answer": "<the final answer to the question>", "solution_pad": "<the step-by-step solution to the problem>" Strictly follow the format provided above and ensure that your output is a valid JSON object. Any deviation from this format will result in an error.
### Key Guidelines for Accuracy:
1. **Break Down the Problem**:
   * Identify all components (numbers, operations, relationships) and solve sequentially.
   * For example: "Barry has $10.00 worth of dimes → 10 / 0.10 = 100 dimes."
2. **Use Clear Arithmetic**:
   * Perform calculations step-by-step, explicitly showing intermediate results (e.g., "Total = 30 + 20 = 50").
   * Avoid assumptions like "five times more" meaning "original + 5x" (it typically means "5x").
3. **Check for Misinterpretations**:
   * **Time problems**: Ensure correct subtraction/addition (e.g., "Delay = 8:20 - 8:00 = 20 minutes").
   * **Percentage discounts**: Apply sequentially (e.g., "First apply 20% off, then 25% off the discounted price").
   * **Averages**: Sum all values and divide by the count (e.g., "Total = 35 + 25 + 80 = 140; Average = 140 / 20 = 7").
4. **Avoid Common Errors**:
   * Ensure units are consistent (e.g., convert minutes to hours if required).
   * For multi-step problems, verify each step aligns with the problem’s description (e.g., "Round trips = 2x distance per trip").
5. **Final Answer Format**:
   * The ‘final_answer‘ must be a string (e.g., "20", "228", "6.5") and match the exact numerical value from the solution.
   * Do not include units, text, or explanations in the ‘final_answer‘.
### Additional Domain-Specific Notes:
* **Currency Calculations**: Always use exact decimal precision (e.g., "Tax = $270 * 0.10 = $27.00").
* **Discounts/Markups**: Apply percentages to the correct base value (e.g., "Discount = Original Price × 0.20").
* **Word Problem Interpretation**:
   * "Half as many" means dividing by 2 (e.g., "Dan has half the dimes as Barry → 100 / 2 = 50").
   * "Round trips" imply double the one-way distance (e.g., "Round trip = 2 × 30,000 feet").
* **Verification**: Cross-check calculations with the problem’s context (e.g., "Total jars = 600 - 35 = 565").
Example of Correct JSON Output:
"final_answer": "20",
"solution_pad": "Delaney leaves at 7:50 a.m. and takes 30 minutes to reach the pick-up point, arriving at 8:20 a.m. The bus leaves at 8:00 a.m., so he missed it by 20 minutes."

The field order defect (final_answer before solution_pad) is preserved across all 10 iterations. All modifications target reasoning quality, problem-solving guidelines, and domain-specific heuristics; the structural root cause is never identified.

G.1.2 VISTA

VISTA generates K = 3 parallel hypotheses per round, verifies them via minibatch evaluation, and selects the best-verified hypothesis for prompt update.

Defective Seed, VISTA, Iteration 1 (Parent: Seed)
Strategy: heuristics

Hyp. | Label | ∆acc
H1 | cot_field_ordering | +6.0 ✓
H2 | task_instruction_clarity | 0.0
H3 | reasoning_strategy | 0.0

Selected: H1 [cot_field_ordering] — Candidate Accepted
"role": "You are an AI assistant that solves mathematical word problems.",
"task": "You will be given a question and you need to provide a step-by-step solution to the problem. Finally, you will provide the answer to the question.",
"output_format":
"strict": true,
"required_fields": ["solution_pad", "final_answer"],
"instructions": "Generate the solution_pad first, ensuring complete chain-of-thought reasoning before finalizing the final_answer. The JSON must contain only the two keys in this order: solution_pad followed by final_answer. Do not include any additional text or explanations outside the JSON structure.",
"example":
"input": "Marion’s bike cost $356. Stephanie’s bike is worth twice as much. What is the total price of their bikes?",
"output":
"solution_pad": "Marion’s bike costs $356. Stephanie’s bike is worth twice as much, so 356 * 2 = 712.
The total price of their bikes is 356 + 712 = 1068.",
"final_answer": "1068"

Defective Seed, VISTA, Iteration 2 (Parent: Iteration 1)
Strategy: random_restart
Random restart completed with no improvement. No proposal returned. Validation: old_sum=8.0000, new_sum=5.0000, improved=False.

Defective Seed, VISTA, Iteration 3 (Parent: Seed)
Strategy: heuristics

Hyp. | Label | ∆acc
H1 | cot_field_ordering | +2.0 ✓
H2 | format_and_syntax | 0.0
H3 | reasoning_strategy | +1.0

Selected: H1 [cot_field_ordering] — Candidate Accepted
You are an AI assistant that solves mathematical word problems. You will be given a question and you need to provide a step-by-step solution to the problem. Finally, you will provide the answer to the question. When outputting the final answer, make sure there are no other text or explanations included, just the answer itself. The expected output must be a JSON object with the following format: "solution_pad": <the step-by-step solution to the problem>, "final_answer": <the final answer to the question> Strictly follow the format provided above and ensure that your output is a valid JSON object. Any deviation from this format will result in an error.

Defective Seed, VISTA, Iteration 4 (Parent: Iteration 1)
Strategy: heuristics

Hyp. | Label | ∆acc
H1 | task_instruction_clarity | 0.0
H2 | reasoning_strategy | 0.0
H3 | missing_domain_knowledge | 0.0

No multi-hypothesis candidate improved over parent. No proposal returned.

Defective Seed, VISTA, Iteration 5 (Parent: Iteration 1)
Strategy: No failed samples found. Falling back to single mutation.
Result: Candidate subsample score 0.0 is not better than old score 8.0, skipping.

Defective Seed, VISTA, Iteration 6 (Parent: Iteration 3)
Strategy: random_restart
Random restart completed with no improvement. No proposal returned. Validation: old_sum=7.0000, new_sum=4.0000, improved=False.

Defective Seed, VISTA, Iteration 7 (Parent: Iteration 3)
Strategy: No failed samples found. Falling back to single mutation.
Result: Candidate subsample score 8.0 is not better than old score 8.0, skipping.

Defective Seed, VISTA, Iteration 8 (Parent: Iteration 1)
Strategy: No failed samples found. Falling back to single mutation.
Result: Candidate subsample score 8.0 is not better than old score 8.0, skipping.

Defective Seed, VISTA, Iteration 9 (Parent: Seed)
Strategy: random_restart
Random restart completed with no improvement. No proposal returned. Validation: old_sum=5.0000, new_sum=5.0000, improved=False.

Defective Seed, VISTA, Iteration 10 (Parent: Iteration 1)
Strategy: heuristics

Hyp. | Label | ∆acc
H1 | reasoning_strategy | +1.0 ✓
H2 | task_instruction_clarity | 0.0
H3 | edge_case_handling | 0.0

Selected: H1 [reasoning_strategy] — Candidate Accepted
"role": "You are an AI assistant that solves mathematical word problems.",
"task": "You will be given a question and you need to provide a step-by-step solution to the problem. Finally, you will provide the answer to the question.",
"output_format":
"strict": true,
"required_fields": ["solution_pad", "final_answer"],
"instructions": "Generate the solution_pad first, ensuring complete chain-of-thought reasoning before finalizing the final_answer. When calculating the number of objects that can fit in a space with required spacing between objects and edges, first subtract the final spacing from the total width before dividing by the space per object. The JSON must contain only the two keys in this order: solution_pad followed by final_answer. Do not include any additional text or explanations outside the JSON structure.",
"example":
"input": "Marion’s bike cost $356. Stephanie’s bike is worth twice as much. What is the total price of their bikes?",
"output":
"solution_pad": "Marion’s bike costs $356. Stephanie’s bike is worth twice as much, so 356 * 2 = 712. The total price of their bikes is 356 + 712 = 1068.",
"final_answer": "1068"

Defective Seed, VISTA, Iteration 11 (Parent: Iteration 3)
Strategy: No failed samples found. Falling back to single mutation.
Result: Candidate subsample score 8.0 is not better than old score 8.0, skipping.

G.2 Repaired Seed

G.2.1 GEPA

Repaired Seed, GEPA, Iteration 1 (Parent: Seed)
New subsample score 6.0 is not better than old score 8.0, skipping

Repaired Seed, GEPA, Iteration 2 (Parent: Seed)
New subsample score 3.0 is not better than old score 6.0, skipping

Repaired Seed, GEPA, Iteration 3 (Parent: Seed)
New subsample score 0.0 is not better than old score 5.0, skipping

Repaired Seed, GEPA, Iteration 4 (Parent: Seed)
New subsample score 7.0 is not better than old score 7.0, skipping

Repaired Seed, GEPA, Iteration 5 (Parent: Seed)
New subsample score 0.0 is not better than old score 8.0, skipping

Repaired Seed, GEPA, Iteration 6 (Parent: Seed)
New subsample score 5.0 is not better than old score 8.0, skipping

Repaired Seed, GEPA, Iteration 7 (Parent: Seed)
New subsample score 8.0 is not better than old score 8.0, skipping

Repaired Seed, GEPA, Iteration 8 (Parent: Seed)
You are an AI assistant tasked with solving mathematical word problems.
When given a problem, follow these steps:
1. **Parse the problem carefully** to identify all numbers, operations, and relationships.
2. **Break the problem into logical steps**, ensuring each step is explicitly stated and mathematically precise.
3. **Use domain-specific terminology correctly**, such as:
   - "More than" (e.g., "five times more" means 5x the original value, not 6x).
   - "Discounts" (e.g., "20% off" reduces the price by 20%, not to 20%).
   - "Tripling" or "halving" (explicitly multiply/divide by the stated factor).
4. **Perform calculations step-by-step**, showing intermediate results and avoiding errors in arithmetic or unit conversions.
5. **Verify the final answer** against the problem’s context to ensure it matches the question’s requirements (e.g., "how much more" requires a difference, not a total).
6. **Output strictly in JSON format** with:
   - ‘solution_pad‘: A string containing the step-by-step explanation, using newline-separated steps for clarity.
   - ‘final_answer‘: The numerical result as a string (e.g., "20", not 20).
7. **Avoid any text or explanations outside the JSON**. Ensure the final answer is unambiguous and matches the problem’s expected units (e.g., dollars, minutes).
Example: For "Alice doubles her $2000," calculate 2000 * 2 = 4000. For "Bob makes five times more," interpret this as 2000 + (2000 * 5) = 12000. Always clarify ambiguous phrasing and ensure calculations align with standard mathematical conventions.

Repaired Seed, GEPA, Iteration 9 (Parent: Seed)
New subsample score 7.0 is not better than old score 7.0, skipping

Repaired Seed, GEPA, Iteration 10 (Parent: Iteration 8)
New subsample score 7.0 is not better than old score 7.0, skipping

Repaired Seed, GEPA, Iteration 11 (Parent: Iteration 8)
New subsample score 8.0 is not better than old score 8.0, skipping

Repaired Seed, GEPA, Iteration 12 (Parent: Iteration 8)
New subsample score 7.0 is not better than old score 7.0, skipping

Repaired Seed, GEPA, Iteration 13 (Parent: Iteration 8)
New subsample score 8.0 is not better than old score 8.0, skipping

Repaired Seed, GEPA, Iteration 14 (Parent: Seed)
New subsample score 5.0 is not better than old score 6.0, skipping

Repaired Seed, GEPA, Iteration 15 (Parent: Iteration 8)
You are an AI assistant tasked with solving mathematical word problems. When given a problem, follow these steps:
1. **Parse the problem carefully**, identifying all numbers, operations, and relationships. Pay special attention to ambiguous phrasing like "more than" or "times more," which often indicate addition or multiplication (e.g., "five times more" means 5x the original value, not 6x).
2. **Break the problem into logical steps**, explicitly stating each step and ensuring mathematical precision. For example, interpret "8 more solar systems for every planet" as 8 additional systems per planet (totaling 9 systems per planet).
3. **Use domain-specific terminology correctly**, such as:
   - "Discounts" (e.g., "20% off" reduces the price by 20%, not to 20%).
   - "Tripling" or "halving" (explicitly multiply/divide by the stated factor).
   - "More than" (e.g., "five times more" means 5x the original value, not 6x).
4.
**Perform calculations step-by-step**, showing intermediate results and avoiding arithmetic errors. For example, if the problem involves multiple stages (e.g., loading crates with weight limits), calculate total weight, capacity, and excess separately.
5. **Verify the final answer** against the problem’s context to ensure it matches the question’s requirements. For instance, "how much more" requires a difference, not a total.
6. **Output strictly in JSON format** with:
   - ‘solution_pad‘: step-by-step explanation.
   - ‘final_answer‘: numerical result as a string (e.g., "20", not 20).
7. **Avoid any text or explanations outside the JSON**. Ensure the final answer is unambiguous and matches the problem’s expected units (e.g., dollars, minutes).
**Key clarifications to avoid common mistakes**:
- "Five times more" means 5x the original value (not 6x).
- "Twice as much" means 2x the original value.
- "More than" often indicates addition (e.g., "8 more for every planet" means 8 added to the base quantity).
- Always check for hidden constraints (e.g., weight limits, subdivisions of items).
- Use equations to model relationships (e.g., for discounts: original price - discount = final price).

Repaired Seed, GEPA, Iteration 16 (Parent: Iteration 8)
New subsample score 7.0 is not better than old score 7.0, skipping

Repaired Seed, GEPA, Iteration 17 (Parent: Seed)
New subsample score 8.0 is not better than old score 8.0, skipping

Repaired Seed, GEPA, Iteration 18 (Parent: Iteration 8)
New subsample score 8.0 is not better than old score 8.0, skipping

Repaired Seed, GEPA, Iteration 19 (Parent: Seed)
New subsample score 7.0 is not better than old score 7.0, skipping

Repaired Seed, GEPA, Iteration 20 (Parent: Seed)
You are an AI assistant tasked with solving mathematical word problems. When given a problem, you must generate a JSON object with two keys: "solution_pad" and "final_answer".
For"solution_pad",provideaclear,step-by-stepexplanationofthesolution.Breakdowneachcalculation explicitly,usingarithmeticoperations(e.g.,"0.4*60=24")andlogicalsteps(e.g.,"Subtracttofindthe remainingquantity").Ensureallintermediatestepsareshown,evenforsimpleoperations.Avoidmarkdownand useplaintext. For"final_answer",outputonlythenumericalresultoftheproblem,withoutanytext,units,orexplanations. Ensuretheansweriscorrectlyformattedasanumber(e.g.,36,77.00,565). Keyrequirements: 1.**StrictJSONformat**:EnsuretheoutputisavalidJSONobjectwithnoextratext,comments,orformatting. 2.**Correctness**:Verifyallcalculationsareaccurateandalignwiththeproblem’scontext(e.g.,percent ages,fractions,costtotals,averages). 3.**Domain-specifichandling**:Accountforproblem-specificdetails(e.g.,"eachcartonhas20jars,""10% lessthantheaverage"). 4.**Generalizablestrategy**:Applylogicalstepslikeidentifyinggivenvalues,determiningoperations,and solvingsequentially. Example:Iftheproblemis"Apieisslicedinto8pieces.1/2isgiventoJoe,1/4toCarl.Howmanyslices remain?",thesolution_padshoulddetaileachstep(e.g.,"1/2of8=4slices;1/4of8=2slices;8-4-2 =2")andthefinal_answeris"2". Alwaysvalidatethatthefinalanswermatchestheproblem’srequirementsexactly,includingunitsordecimal precisionifspecified. ✓The correct field order (solution_padbeforefinal_answer) is preserved across all 20 iterations. Optimizations progressively improve reasoning guidelines, domain-specific heuristics, and disambiguation strategies without disrupting the structural format. G.2.2 VISTA Repaired Seed, VISTA, Iteration 1Parent: Seed Strategy: No failed samples found. Falling back to single mutation. Result: Candidate subsample score 8.0 is not better than old score 8.0, skipping. Repaired Seed, VISTA, Iteration 2Parent: Seed Strategy: No failed samples found. Falling back to single mutation. Result: Candidate subsample score 6.0 is not better than old score 8.0, skipping. 
Repaired Seed, VISTA, Iteration 3 (Parent: Seed)
Strategy: heuristics

Hyp. | Label | ∆acc
H1 | task_instruction_clarity | +1.0 ✓
H2 | reasoning_strategy | -1.0
H3 | edge_case_handling | +1.0

Selected: H1 [task_instruction_clarity] — Candidate Accepted
"role": "You are an AI assistant that solves mathematical word problems.",
"task": "You will be given a question and you need to provide a step-by-step solution to the problem. Finally, you will provide the answer to the question. When outputting the final answer, make sure there are no other text or explanations included, just the answer itself.",
"clarification": "When a problem states that there are ‘X more’ of something for every Y, interpret this as X additional units per Y, and include both the Y and the X units in the total count. If the problem uses phrases like ‘X times more’ or ‘X times as many,’ clarify that these typically mean X times the original amount, not X times plus the original.",
"output_format": "The expected output must be a JSON object with the following format: \"solution_pad\": <the step-by-step solution to the problem>, \"final_answer\": <the final answer to the question>. Strictly follow the format provided above and ensure that your output is a valid JSON object. Any deviation from this format will result in an error.",
"schema": "The JSON output must include two fields: ‘solution_pad’ (a string containing the step-by-step reasoning) and ‘final_answer’ (a numerical value representing the final result). Ensure that the final answer is correctly formatted as a number, not a text string."

Repaired Seed, VISTA, Iteration 4 (Parent: Seed)
Strategy: heuristics

Hyp. | Label | ∆acc
H1 | task_instruction_clarity | 0.0
H2 | format_and_syntax | 0.0
H3 | missing_domain_knowledge | 0.0

No multi-hypothesis candidate improved over parent. No proposal returned.

Repaired Seed, VISTA, Iteration 5 (Parent: Seed)
Strategy: No failed samples found. Falling back to single mutation.
Result: Candidate subsample score 1.0 is not better than old score 8.0, skipping.

Repaired Seed, VISTA, Iteration 6 (Parent: Iteration 3)
Strategy: random_restart
Random restart completed with no improvement.
No proposal returned. Validation: old_sum=8.0000, new_sum=6.0000, improved=False.

Repaired Seed, VISTA, Iteration 7 (Parent: Iteration 3)
Strategy: No failed samples found. Falling back to single mutation.
Result: Candidate subsample score 8.0 is not better than old score 8.0, skipping.

Repaired Seed, VISTA, Iteration 8 (Parent: Seed)
Strategy: No failed samples found. Falling back to single mutation.
Result: Candidate subsample score 7.0 is not better than old score 8.0, skipping.

Repaired Seed, VISTA, Iteration 9 (Parent: Seed)
Strategy: heuristics

Hyp. | Label | ∆acc
H1 | task_instruction_clarity | +1.0
H2 | reasoning_strategy | +2.0 ✓
H3 | missing_domain_knowledge | +1.0

Selected: H2 [reasoning_strategy] — Candidate Accepted
You are an AI assistant that solves mathematical word problems. You will be given a question and you need to provide a step-by-step solution to the problem. Finally, you will provide the answer to the question. When outputting the final answer, make sure there are no other text or explanations included, just the answer itself. The expected output must be a JSON object with the following format: "solution_pad": <the step-by-step solution to the problem>, "final_answer": <the final answer to the question> Strictly follow the format provided above and ensure that your output is a valid JSON object. Any deviation from this format will result in an error. When solving problems involving time, work rates, or quantities with pauses, explicitly account for interruptions by first calculating the effective working time (total time minus break duration) before computing total output. Ensure all steps are clearly articulated in the solution_pad, including any adjustments for pauses or breaks. Maintain the general structure of the solution, but prioritize accuracy in time-based calculations by adhering to this principle.
Repaired Seed, VISTA, Iteration 10 (Parent: Iteration 3)
Strategy: heuristics

Hyp. | Label | ∆acc
H1 | task_instruction_clarity | +1.0 ✓
H2 | edge_case_handling | +1.0
H3 | reasoning_strategy | 0.0

Selected: H1 [task_instruction_clarity] — Candidate Accepted
You are an AI assistant that solves mathematical word problems. You will be given a question and you need to provide a step-by-step solution to the problem. Finally, you will provide the final answer to the question. When outputting the final answer, make sure there are no other text or explanations included, just the answer itself.
When solving problems involving relative speed (e.g., one entity catching up to another), calculate the time based on the relative speed (difference in speeds) and the initial distance covered by the head start. Ensure units are consistent (e.g., convert minutes to hours if necessary).
For problems involving spacing or positioning of entities in a confined space (e.g., boats, vehicles, objects), explicitly account for edge cases by subtracting the clearance required for the last entity from the total dimension before dividing by the per-entity space. This ensures accurate calculations for scenarios where the final entity’s clearance overlaps with boundaries.
For all other problems, follow standard arithmetic or algebraic reasoning.
The expected output must be a JSON object with the following format: "solution_pad": <the step-by-step solution to the problem>, "final_answer": <the final answer to the question> Strictly follow the format provided above and ensure that your output is a valid JSON object. Any deviation from this format will result in an error.
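The verify-and-select step visible in the VISTA hypothesis tables above (K labeled hypotheses, each assigned a minibatch ∆acc, with a candidate accepted only when it improves on the parent) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the tuple representation of hypotheses, and the epsilon parameter standing in for the paper's two-layer explore-exploit mechanism are all assumptions.

```python
import random

def select_hypothesis(hypotheses, epsilon=0.1, rng=None):
    """Pick one verified hypothesis, epsilon-greedy style.

    `hypotheses` is a list of (label, delta_acc) pairs from parallel
    minibatch verification. Returns the accepted (label, delta_acc),
    or None when no hypothesis improves on the parent (matching trace
    lines like 'No multi-hypothesis candidate improved over parent.').
    """
    rng = rng or random.Random(0)
    improving = [h for h in hypotheses if h[1] > 0]
    if not improving:
        return None
    if rng.random() < epsilon:
        return rng.choice(improving)            # explore among improving hypotheses
    return max(improving, key=lambda h: h[1])   # exploit the best verified gain

# Iteration 1 of the defective-seed VISTA trace: H1 wins with +6.0.
hyps = [("cot_field_ordering", 6.0),
        ("task_instruction_clarity", 0.0),
        ("reasoning_strategy", 0.0)]
print(select_hypothesis(hyps, epsilon=0.0))  # → ('cot_field_ordering', 6.0)
```

Because each hypothesis carries a semantic label, the traces record which defect class (e.g., cot_field_ordering) each accepted edit addressed, which is what makes these trajectories interpretable.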
G.3 Minimal Seed

G.3.1 GEPA

Minimal Seed, GEPA, Iteration 1 (Parent: Seed)
New subsample score 1.0 is not better than old score 1.0, skipping

Minimal Seed, GEPA, Iteration 2 (Parent: Seed)
New subsample score 0.0 is not better than old score 2.0, skipping

Minimal Seed, GEPA, Iteration 3 (Parent: Seed)
Solve the given math problem step-by-step, ensuring all calculations are correct. Output only a single JSON object with the key "final_answer" and the correct numerical value as the value. Do not include any explanations, equations, or text inside the JSON value. Ensure the JSON is valid and free of syntax errors. If the answer requires multiple steps, compute the result accurately and present it as a single number.

Minimal Seed, GEPA, Iteration 4 (Parent: Iteration 3)
Solve the given math problem step-by-step, ensuring all calculations are correct. Output only a single JSON object with the key "final_answer" and the correct numerical value as the value. Do not include any explanations, equations, or text inside the JSON value. Ensure the JSON is valid and free of syntax errors. If the answer requires multiple steps, compute the result accurately and present it as a single number. Verify that the final answer is a fully calculated numeric value (e.g., 42, not 2+2) and matches the exact expected result. Avoid any expressions or intermediate steps in the JSON value. If the problem involves averages, totals, or comparisons, ensure the final answer reflects the correct aggregation or difference as per the problem’s requirements.

Minimal Seed, GEPA, Iteration 5 (Parent: Seed)
New subsample score 0.0 is not better than old score 2.0, skipping

Minimal Seed, GEPA, Iteration 6 (Parent: Iteration 3)
New subsample score 1.0 is not better than old score 1.0, skipping

Minimal Seed, GEPA, Iteration 7 (Parent: Iteration 3)
New subsample score 2.0 is not better than old score 2.0, skipping

Minimal Seed, GEPA, Iteration 8 (Parent: Iteration 3)
Solve the given math problem step-by-step, ensuring all calculations are correct.
Output only a single JSON object with the key "final_answer" and the correct numerical value as the value. Do not include any explanations, equations, or text inside the JSON value. Ensure the JSON is valid and free of syntax errors. If the answer requires multiple steps, compute the result accurately and present it as a single number. For time-based problems, calculate durations using absolute time differences (e.g., 8:20 - 8:00 = 20 minutes). For percentage discounts, apply each discount sequentially to the updated price (e.g., 20% off then 25% off). For averages, sum all values and divide by the total count of items. Always verify intermediate steps to avoid errors in arithmetic or logical sequencing.

Minimal Seed, GEPA, Iteration 9 (Parent: Iteration 3)
New subsample score 2.0 is not better than old score 2.0, skipping

Minimal Seed, GEPA, Iteration 10 (Parent: Iteration 4)
New subsample score 3.0 is not better than old score 3.0, skipping

Minimal Seed, GEPA, Iteration 11 (Parent: Iteration 3)
New subsample score 1.0 is not better than old score 2.0, skipping

Minimal Seed, GEPA, Iteration 12 (Parent: Iteration 4)
New subsample score 4.0 is not better than old score 4.0, skipping

Minimal Seed, GEPA, Iteration 13 (Parent: Iteration 4)
Solve the given math problem step-by-step, ensuring all calculations are correct. Output only a single JSON object with the key "final_answer" and the correct numerical value as the value. Do not include any explanations, equations, or text inside the JSON value. Ensure the JSON is valid and free of syntax errors. If the answer requires multiple steps, compute the result accurately and present it as a single number. Verify that the final answer is a fully calculated numeric value (e.g., 42, not 2+2) and matches the exact expected result. Avoid any expressions or intermediate steps in the JSON value. If the problem involves averages, totals, or comparisons, ensure the final answer reflects the correct aggregation or difference as per the problem’s requirements.
**Key Details to Include in Your Reasoning:**
1. **Step-by-Step Breakdown:** Explicitly outline all required calculations, including intermediate steps (e.g., adding quantities, applying rates, calculating spacing).
2. **Domain-Specific Rules:** Account for niche constraints (e.g., spacing between objects, rate adjustments, cumulative weight increments).
3. **Verification:** Double-check arithmetic, ensure units are consistent, and confirm that all problem-specific conditions are satisfied.
4. **Edge Cases:** Address scenarios like partial spaces (e.g., last boat requiring reduced spacing), rounding rules, or hidden totals (e.g., summing multiple categories).
5. **Final Validation:** Ensure the answer is a single numeric value, not a formula or text, and matches the problem's exact requirements (e.g., "how many more miles," "total expenses").

**Examples of Correct/Incorrect Patterns:**
- **Correct:** "final_answer": 8 (e.g., river spacing problem where a 42 ft river allows 8 boats).
- **Incorrect:** "final_answer": 4 (same problem but missing adjustment for final boat spacing).
- **Correct:** "final_answer": 86 (e.g., test score problem where Marco's 81 + 5 = 86).
- **Incorrect:** "final_answer": 81 (same problem but omitting Margaret's 5-point addition).

Minimal Seed, GEPA, Iteration 14 (Parent: Iteration 13)
Solve the given math problem step-by-step, ensuring all calculations are correct. Output only a single JSON object with the key "final_answer" and the correct numerical value as the value. Do not include any explanations, equations, or text inside the JSON value. Ensure the JSON is valid and free of syntax errors. If the answer requires multiple steps, compute the result accurately and present it as a single number. Verify that the final answer is a fully calculated numeric value (e.g., 42, not 2+2) and matches the exact expected result. Avoid any expressions or intermediate steps in the JSON value. If the problem involves averages, totals, or comparisons, ensure the final answer reflects the correct aggregation or difference as per the problem's requirements.

**Key Details to Include in Your Reasoning:**
1.
**Step-by-Step Breakdown:** Explicitly outline all required calculations, including intermediate steps (e.g., adding quantities, applying rates, calculating spacing).
2. **Domain-Specific Rules:** Account for niche constraints (e.g., spacing between objects, rate adjustments, cumulative weight increments). For example:
   - In spacing problems, subtract the final object's space if partial spacing is required (e.g., last boat reduces spacing).
   - In percentage problems, ensure "less than" or "more than" is applied correctly (e.g., 1/10 less = 90% of original).
3. **Verification:** Double-check arithmetic, ensure units are consistent, and confirm that all problem-specific conditions are satisfied.
4. **Edge Cases:** Address scenarios like partial spaces, rounding rules, or hidden totals (e.g., summing multiple categories). For example:
   - If a problem involves multiple categories, ensure all are summed explicitly.
   - For tip calculations, add the tip to the base cost.
5. **Final Validation:** Ensure the answer is a single numeric value, not a formula or text, and matches the problem's exact requirements (e.g., "how many more miles," "total expenses").

**Examples of Correct/Incorrect Patterns:**
- **Correct:** "final_answer": 18 (e.g., taxi cost with 20% tip: $15 + $3 = $18).
- **Incorrect:** "final_answer": 26 (same problem but missing tip calculation).
- **Correct:** "final_answer": 296 (e.g., total marks: 70 + 63 + 83 + 80 = 296).
- **Incorrect:** "final_answer": 160 (same problem but omitting Social Studies or Arts scores).
- **Correct:** "final_answer": 1890 (e.g., total expenses: (250 + 375 + 320) * 2 = 1890).
- **Incorrect:** "final_answer": 2000 (same problem but incorrect flower count).

The missing solution_pad field is never introduced across all 14 iterations. The JSON output remains "final_answer": <number> throughout. All modifications target reasoning quality and domain-specific heuristics; the structural incompleteness is never addressed.
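The structural gap GEPA never closes is mechanical to detect. As a minimal sketch (the schema constant and helper name are illustrative, not from the paper's implementation), a check of a model's raw output against the required fields would flag every one of the 14 iterations above:

```python
import json

# Fields the task expects; the defective seed and all GEPA rewrites omit "solution_pad".
REQUIRED_FIELDS = {"solution_pad", "final_answer"}

def missing_fields(raw_output: str) -> set:
    """Return the required JSON fields absent from a model's raw output."""
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError:
        return set(REQUIRED_FIELDS)  # unparseable output is missing everything
    return REQUIRED_FIELDS - obj.keys()

# Every GEPA iteration in the trace emits only "final_answer":
print(missing_fields('{"final_answer": 42}'))  # -> {'solution_pad'}
```

A label-free reflective loop sees only scalar scores, so even this one-line structural diagnosis never surfaces in its trajectory.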
G.3.2 VISTA

Minimal Seed, VISTA, Iteration 1 (Parent: Seed, Strategy: heuristics)

Hyp.  Label                     ∆acc
H1    cot_field_ordering        -2.0
H2    format_and_syntax          0.0
H3    task_instruction_clarity  -1.0

No multi-hypothesis candidate improved over parent. No proposal returned.

Minimal Seed, VISTA, Iteration 2 (Parent: Seed, Strategy: heuristics)

Hyp.  Label                     ∆acc
H1    cot_field_ordering        +6.0 ✓
H2    task_instruction_clarity  -1.0
H3    missing_domain_knowledge   0.0

Selected: H1 [cot_field_ordering] — Candidate Accepted
Solve the mathematical word problem by first generating a detailed "solution_pad" field that outlines all intermediate steps and calculations, then providing the final answer in the "final_answer" field. The JSON output must strictly follow this structure: "solution_pad": "<step-by-step reasoning with calculations>", "final_answer": "<correct final answer>". For problems involving multiple steps, ensure each calculation is explicitly written in the solution_pad (e.g., "3/4 * 40,000 = «3/4*40000=30000»30000"). For percentage or ratio problems, show all conversion steps. For multi-part problems, break down each component into separate calculations. Only after fully documenting the reasoning process should the final_answer be provided. The final_answer must match the exact numerical value and formatting specified in the problem's expected solution.

Minimal Seed, VISTA, Iteration 3 (Parent: Seed, Strategy: heuristics)

Hyp.  Label                     ∆acc
H1    format_and_syntax         +3.0 ✓
H2    task_instruction_clarity  +2.0
H3    missing_domain_knowledge  -1.0

Selected: H1 [format_and_syntax] — Candidate Accepted
Solve the mathematical problem and output a strictly formatted JSON object with the following structure: "final_answer": <numeric_value>
**Rules**:
1. **Numeric Final Answer**: The value in the "final_answer" field must be a single numeric value (e.g., 42, 3.14, -5). Do not include expressions, calculations, or text (e.g., 5+3, 10*2).
2. **JSON Validation**: Ensure the output is valid JSON with proper syntax (e.g., commas, brackets).
3. **No Extra Fields**: Only include the "final_answer" key. Do not add additional fields, explanations, or formatting (e.g., markdown, code blocks).
4. **Accuracy**: Compute the exact numerical result of the problem, ensuring alignment with the problem's context and mathematical principles.
**Example**: For a problem like "What is 10 + 20?", the correct output is:
"final_answer": 30

Minimal Seed, VISTA, Iteration 4 (Parent: Iteration 2, Strategy: heuristics)

Hyp.  Label                     ∆acc
H1    format_and_syntax          0.0
H2    task_instruction_clarity  -1.0
H3    reasoning_strategy        -1.0

No multi-hypothesis candidate improved over parent. No proposal returned.

Minimal Seed, VISTA, Iteration 5 (Parent: Iteration 2, Strategy: heuristics)

Hyp.  Label                     ∆acc
H1    format_and_syntax         +2.0 ✓
H2    reasoning_strategy        +2.0
H3    task_instruction_clarity  -5.0

Selected: H1 [format_and_syntax] — Candidate Accepted
Solve the mathematical word problem by first generating a detailed "solution_pad" field that outlines all intermediate steps and calculations, then providing the final answer in the "final_answer" field. The JSON output must strictly follow this structure: "solution_pad": "<step-by-step reasoning with calculations>", "final_answer": "<correct final answer>". For problems involving multiple steps, ensure each calculation is explicitly written in the solution_pad (e.g., "3/4 * 40,000 = «3/4*40000=30000»30000"). For percentage or ratio problems, show all conversion steps. For multi-part problems, break down each component into separate calculations. Only after fully documenting the reasoning process should the final_answer be provided. The final_answer must match the exact numerical value and formatting specified in the problem's expected solution.

**Strict JSON Formatting Requirements**:
1. All strings must be enclosed in double quotes ("), not single quotes (').
2. Special characters (e.g., mathematical symbols, parentheses) must be properly escaped if needed.
3. Validate the JSON structure before outputting to ensure no syntax errors (e.g., missing commas, unmatched braces).
4. Use the exact field names "solution_pad" and "final_answer" in the JSON output.
5. Avoid any trailing commas or incomplete JSON fragments.
Example of valid JSON:
"solution_pad": "Step 1: Calculate 1/3 of 18. 18 * (1/3) = «6» Step 2: Subtract from total. 18 - 6 = «12»12.", "final_answer": "12"

Minimal Seed, VISTA, Iteration 6 (Parent: Iteration 5, Strategy: heuristics)

Hyp.  Label                ∆acc
H1    format_and_syntax    +3.0 ✓
H2    reasoning_strategy   +2.0
H3    edge_case_handling   +2.0

Selected: H1 [format_and_syntax] — Candidate Accepted
Solve the mathematical word problem by first generating a detailed "solution_pad" field that outlines all intermediate steps and calculations, then providing the final answer in the "final_answer" field. The JSON output must strictly follow this structure: "solution_pad": "<step-by-step reasoning with calculations>", "final_answer": "<correct final answer>". For problems involving multiple steps, ensure each calculation is explicitly written in the solution_pad (e.g., "3/4 * 40,000 = «3/4*40000=30000»30000"). For percentage or ratio problems, show all conversion steps. For multi-part problems, break down each component into separate calculations. Only after fully documenting the reasoning process should the final_answer be provided. The final_answer must match the exact numerical value and formatting specified in the problem's expected solution, including decimal places, currency symbols, or unit labels.

**Critical Clarifications**:
1. **Arithmetic Order**: Always perform operations in the correct sequence (PEMDAS/BODMAS). For subtraction-based problems (e.g., discounts, differences), explicitly write the operation as "original value - final value" to avoid reversed calculations. Example: "Discount = Original Price - Paid Price".
2. **Explicit Steps**: For any subtraction, addition, multiplication, or division, write the full expression with operands and operator first, then the result in the placeholder format. Example: "22 - 16 = «22-16=6»6".
3. **Unit Consistency**: Ensure all units are explicitly converted and documented in the solution_pad if required (e.g., minutes to hours).
4.
**Equation Solving**: For algebraic problems, isolate variables step-by-step, showing each transformation (e.g., "22 - x = 16 → x = 22 - 16").

**Additional Requirements**:
- The JSON output must be syntactically valid, with all opening and closing braces properly matched and no trailing commas.
- The "solution_pad" and "final_answer" fields must be enclosed in double quotes and separated by a comma.
- The final_answer must exactly match the problem's expected solution, including formatting (e.g., currency symbols, decimal places, unit labels).
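The per-iteration selection step visible throughout this trace, score each semantically labeled hypothesis by its minibatch accuracy change ∆acc, pick the best, and accept it only if it strictly improves over the parent (otherwise "No proposal returned"), can be sketched as follows. The function name and tie-breaking behavior are illustrative assumptions, not the paper's implementation:

```python
from typing import Optional

def select_hypothesis(deltas: dict) -> Optional[str]:
    """Greedy accept-if-improved step, mirroring the VISTA trace.

    `deltas` maps a hypothesis label (e.g. "cot_field_ordering") to its
    minibatch accuracy change vs. the parent prompt. Returns the accepted
    label, or None when no candidate improves over the parent.
    """
    best_label = max(deltas, key=deltas.get)  # ties resolve by insertion order
    if deltas[best_label] <= 0.0:
        return None  # "No multi-hypothesis candidate improved over parent."
    return best_label

# Iteration 2 above: H1 (cot_field_ordering, +6.0) is selected and accepted.
print(select_hypothesis({"cot_field_ordering": 6.0,
                         "task_instruction_clarity": -1.0,
                         "missing_domain_knowledge": 0.0}))
```

Because each accepted candidate carries its label (here cot_field_ordering, the change that finally introduces the solution_pad field GEPA never finds), the resulting optimization trace stays interpretable step by step.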