Paper deep dive
SEMAG: Self-Evolutionary Multi-Agent Code Generation
Yulin Peng, Haowen Hou, Xinxin Zhu, Ying Tiffany He, F. Richard Yu
Abstract
Large Language Models (LLMs) have made significant progress in handling complex programming tasks. However, current methods rely on manual model selection and fixed workflows, which limit their ability to adapt to changing task complexities. To address this, we propose SEMAG, a Self-Evolutionary Multi-Agent code Generation framework that mimics human coding practices. It decomposes programming tasks into stages, including planning, coding, debugging, and discussion, while adapting workflows to task difficulty. Its self-evolutionary agents can access the latest models in real time and automatically upgrade the backbone model. SEMAG sets new state-of-the-art Pass@1 accuracy across benchmarks. Using identical backbone models, SEMAG outperforms prior methods by 3.3% on CodeContests. When augmented with self-evolutionary model selection that automatically identifies optimal backbones, SEMAG reaches 52.6%, showcasing both framework effectiveness and adaptability to evolving LLM capabilities.
Links
- Source: https://arxiv.org/abs/2603.15707v1
- Canonical: https://arxiv.org/abs/2603.15707v1
Full Text
SEMAG: Self-Evolutionary Multi-Agent Code Generation

Yulin Peng (1), Haowen Hou (2), Xinxin Zhu (1, 2), Ying Tiffany He (1), F. Richard Yu (3)
(1) College of Computer Science and Software Engineering, Shenzhen University, China
(2) Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), China
(3) School of Information Technology, Carleton University, Canada

Abstract

Large Language Models (LLMs) have made significant progress in handling complex programming tasks. However, current methods rely on manual model selection and fixed workflows, which limit their ability to adapt to changing task complexities. To address this, we propose SEMAG, a Self-Evolutionary Multi-Agent code Generation framework that mimics human coding practices. It decomposes programming tasks into stages, including planning, coding, debugging, and discussion, while adapting workflows to task difficulty. Its self-evolutionary agents can access the latest models in real time and automatically upgrade the backbone model. SEMAG sets new state-of-the-art Pass@1 accuracy across benchmarks. Using identical backbone models, SEMAG outperforms prior methods by 3.3% on CodeContests. When augmented with self-evolutionary model selection that automatically identifies optimal backbones, SEMAG reaches 52.6%, showcasing both framework effectiveness and adaptability to evolving LLM capabilities.

1 Introduction

Large Language Models (LLMs) have demonstrated substantial progress in code generation and completion, driven by large-scale pretraining on diverse codebases. The GPT series (Achiam et al., 2023; Hurst et al., 2024), CodeLLaMA-2 (Roziere et al., 2023), Qwen2.5-Coder (Hui et al., 2024), and DeepSeek-v3 (Liu et al., 2024) exhibit strong coding capabilities, unlocking new avenues for automated software development.
In parallel, multi-agent frameworks and debugging-enhanced methodologies, such as planning-centric workflows (Lei et al., 2024), self-debugging paradigms (Chen et al., 2023), and collaborative agent systems (Zhong et al., 2024), have shown promising performance on standard benchmarks. Nonetheless, real-world scenarios present open-ended tasks, constrained computational budgets, and evolving specifications, revealing critical limitations in current approaches.

[Figure 1: Overview workflow of Self-Evolution Agents. Agents integrate insights from recent research, news, and community discussions, dynamically identify and deploy the most suitable models.]

First, frameworks such as Self-Debugging (Chen et al., 2023) and LDB (Zhong et al., 2024) typically adopt a fixed reasoning depth. On simple tasks, they introduce unnecessarily complex workflows, leading to redundant computation and excessive token usage, while on difficult tasks, the shallow reasoning depth results in poor success rates. Although hierarchical prompting has been shown to mitigate unnecessary reasoning (Budagam et al., 2025), these approaches still lack a principled mechanism to adapt reasoning depth dynamically to task complexity.

Second, current pipelines utilize a single debugging iteration. When initial outputs diverge significantly from the target, systems are prone to local minima. Though advanced reasoning paradigms such as Chain-of-Thought (Wei et al., 2022), Tree-of-Thoughts (Yao et al., 2023a), and parallel candidate exploration (Li et al., 2025) enhance complex reasoning, they lack explicit discussion–decision phases that aggregate diverse reasoning trajectories for improved synthesis.

[Figure 2: Overview of SEMAG. (1) Self-Evolve: Agents dynamically select optimal backbone LLMs per task requirements. (2) Plan: Planning Agent creates solution plans validated by Plan Verifying Agent through I/O simulation. (3) Debug: Coding Agent generates code; upon failure, specialized agents (Embedding Trace, Code Explaining, Suggesting, Debugging) collaboratively refine using trace logs. (4) Debate: When debugging stalls, Debating Agents propose alternatives with Discriminating Agent selecting the optimal configuration.]

Third, most systems are tightly coupled to a single backbone model. Frameworks built on GPT (Achiam et al., 2023; Hurst et al., 2024), Gemini (Team et al., 2023, 2024), or Claude (Anthropic, 2024) typically depend on a static model throughout execution. As task characteristics shift or new models emerge, backbone switching often requires manual intervention, limiting adaptability and scalability.

To address these challenges, we propose SEMAG, a Self-Evolutionary Multi-Agent code Generation framework. Our contributions are summarized as follows:

• Adaptive hierarchical prompting: We propose a dynamic strategy that adjusts reasoning depth based on task complexity.
• Collaborative self-evolution: We introduce a discussion–decision module enabling escape from local optima and adaptive backbone switching.
• Empirical gains: SEMAG achieves state-of-the-art performance on seven benchmarks. With a controlled backbone comparison, SEMAG improves 3.3% over the previous best method on CodeContests; with self-evolutionary model selection, it further reaches 52.6%.

We evaluate SEMAG across seven text-to-code benchmarks, including four foundational datasets (HumanEval, MBPP, HumanEval-ET, MBPP-ET) and three competition-level benchmarks (APPS, LiveCode, CodeContests). Experimental results show that SEMAG achieves new state-of-the-art performance, including 98.8% Pass@1 (Chen et al., 2021; Dong et al., 2024) on HumanEval, 87.6% on MBPP, and 65.0% on LiveCode. Most notably, on the most challenging dataset, CodeContests, SEMAG achieves 38.0% Pass@1 accuracy with GPT-4o (a 3.3% improvement over LPW under the same backbone). When augmented with self-evolutionary model selection that automatically identifies the optimal backbone, SEMAG further reaches 52.6%. These results demonstrate that SEMAG achieves superior performance and resource efficiency, while offering strong adaptability to evolving programming tasks.

2 Related Work

2.1 Traditional Approaches to Program Synthesis

Program synthesis has a long-standing research foundation in artificial intelligence (Waldinger and Lee, 1969; Manna and Waldinger, 1971). Traditional methods leverage search strategies and data flow analysis (McCarthy, 1978). Early efforts aimed to advance automatic programming and to identify viable approaches (Balzer, 1985; Soloway, 1986) or explore large program spaces through domain-specific languages (Mernik et al., 2005; Gu et al., 2021). These approaches struggle with generalization and scalability due to search space complexity.
2.2 Large Language Models for Code Synthesis

Pretrained language models have enhanced code synthesis, with specialized models such as Qwen2.5-Coder (Hui et al., 2024), CodeLLaMA-2 (Roziere et al., 2023), Mistral (Jiang et al., 2024a), and DeepSeek-v3 (Liu et al., 2024) excelling in programming tasks. General-purpose models, including GPT (Achiam et al., 2023; Hurst et al., 2024), Gemini (Team et al., 2023, 2024), and Claude (Anthropic, 2024), also demonstrate robust code generation capabilities. However, these models still face challenges related to syntactic correctness, semantic alignment, generation robustness, and version conflicts. As a result, more refined control and evaluation mechanisms for code generation are necessary.

2.3 Prompting and Debugging Techniques

Researchers have proposed various prompting and debugging techniques to improve code generation. Prompting strategies generally fall into three categories: retrieval-based (Islam et al., 2024), planning-based (Yao et al., 2023b), and debugging-based (Chen et al., 2023) approaches. These aim to guide LLMs in decomposing complex tasks into manageable parts through step-by-step reasoning. Techniques such as Chain-of-Thought (Wei et al., 2022), Tree-of-Thoughts (Yao et al., 2023a), and cumulative reasoning mimic human problem-solving paths, significantly enhancing model performance on complex tasks (Zhou et al., 2022; Zhang et al., 2023). More advanced methods simulate the software development process by constructing multiple candidate programs and exploring the solution space in parallel (Li et al., 2025; Antoniades et al., 2025).

Debugging systems such as Self-Debugging (Chen et al., 2023) and LDB (Zhong et al., 2024) iteratively refine code using model explanations, execution, and human feedback. However, their effectiveness decreases when the initial code diverges from the intended function.
To improve generation quality with limited supervision, some methods break down the coding task by incorporating visible test cases, step-by-step verification (Hu et al., 2025; Li and Yuan, 2024; Mathews and Nagappan, 2024), and natural language instructions to improve controllability and alignment.

Previous methods either fix reasoning depth, wasting compute on simple tasks and underperforming on hard ones, or rely on a single LLM, limiting recovery from failures. SEMAG tackles both with three mechanisms: a hierarchical controller that scales from one-shot to multi-step planning based on feedback; a discussion–decision phase where agents critique and merge solutions to avoid local optima; and an automatic model selector that switches to a more capable backbone, boosting Pass@1 accuracy as difficulty rises.

3 Method

We present a hierarchical multi-agent framework for code synthesis that adapts to task complexity through progressive refinement levels, coupled with a self-evolution mechanism for dynamic model selection. The overview of SEMAG is shown in Figure 2.

3.1 Problem Formulation

We define a code generation task as T = (P, S, C), where P ∈ 𝒫 is the problem description, S = {(x_i, y_i)}_{i=1}^{n} are input–output examples, and C is the program space. The core agent operations are:

    CODER : P × S × Π × Θ → C,
    PLANNER : P × S → Π,
    VERIFIER : Π × P × S → {0, 1} × Π × L,
    DEBUGGER : C × Σ → C,    (1)

where Π is the plan space, Θ the parameter space, L the verification logs, and Σ the refinement suggestions. Additional agents include EMBEDTRACE (C → T), EXPLAINER (C × P → E), and SUGGESTOR (T × L × E → Σ).

3.2 Hierarchical Code Synthesis Framework

Our framework employs a four-level hierarchical architecture that progressively increases computational effort based on task complexity.
Level 1 (Direct Generation): The system initially attempts direct code synthesis using minimal prompting:

    Y = CODER(P, S, ∅, ∅),    (2)

where ∅ indicates no plan or parameters.

Algorithm 1: Hierarchical workflow of SEMAG
Input: problem P, examples S
Output: program Y
     1: Y ← CODER(P, S)                              ▷ Level 1
     2: if TEST(Y, S) then return Y
     3: π ← PLANNER(P, S)                            ▷ Level 2
     4: for i = 1 to M_plan do
     5:     (ν, π, ℓ) ← VERIFIER(π, P, S)
     6:     if ν = 1 then break
     7: end for
     8: Y ← CODER(P, S, π)
     9: if TEST(Y, S) then return Y
    10: for t = 1 to M_try do                        ▷ Level 3
    11:     τ_prev ← ∅
    12:     for d = 1 to M_debug do
    13:         τ ← EMBEDTRACE(Y)
    14:         σ ← SUGGESTOR(τ, ℓ, EXPLAINER(Y, P))
    15:         Y ← DEBUGGER(Y, σ)
    16:         if TEST(Y, S) then return Y
    17:         if ρ(τ, τ_prev) > δ(d, T) then break
    18:         τ_prev ← τ
    19:     end for
    20:     H ← {DEBATER_j(P, τ, Y)}_{j=1..N_debater}   ▷ Level 4
    21:     Y ← CODER(P, S, DECIDER(H))
    22:     if TEST(Y, S) then return Y
    23: end for
    24: return Y

Level 2 (Planning and Verification): Upon Level 1 failure, the system generates and iteratively refines a structured solution plan. The planning process operates as:

    π_0 = PLANNER(P, S),    (3)

followed by iterative verification:

    (ν_i, π_i, ℓ_i) = VERIFIER(π_{i−1}, P, S),  i ∈ [1, M_plan],    (4)

where ν_i ∈ {0, 1} indicates verification status, π_i is the refined plan, and ℓ_i contains verification logs. The process terminates when ν_i = 1 or i = M_plan, with the final plan π* guiding code generation:

    Y = CODER(P, S, π*, ∅).    (5)

Level 3 (Trace-Guided Debugging): When Level 2 fails, the system enters an iterative debugging phase with K_pass passes and M_try attempts per pass. For each attempt, the debugging process consists of:

    τ = EMBEDTRACE(Y),
    ε = EXPLAINER(Y, P),
    σ = SUGGESTOR(τ, ℓ*, ε),
    Y′ = DEBUGGER(Y, σ).    (6)

This process repeats for M_debug iterations, where τ captures runtime variable states, ε provides semantic analysis, and σ synthesizes targeted modifications.
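Algorithm 1 reads as a plain control loop over the four levels. The sketch below is a schematic reconstruction, not the authors' code: every agent is an injected callable stub in an `agents` dict, the ρ/δ stall check of Section 3.3 is collapsed into a hypothetical `stalled` callback, and the defaults M_try = 5 and M_debug = 4 follow Section 4.3 while `m_plan = 3` is an arbitrary placeholder.

```python
def semag(problem, examples, agents, m_plan=3, m_try=5, m_debug=4):
    """Schematic driver for Algorithm 1 (Levels 1-4). `agents` maps role
    names to callables; all roles and signatures are illustrative."""
    test = agents["test"]                              # TEST(Y, S)

    y = agents["coder"](problem, examples, None)       # Level 1: direct
    if test(y, examples):
        return y

    plan = agents["planner"](problem, examples)        # Level 2: plan
    logs = ""
    for _ in range(m_plan):                            # iterative verification
        ok, plan, logs = agents["verifier"](plan, problem, examples)
        if ok:
            break
    y = agents["coder"](problem, examples, plan)
    if test(y, examples):
        return y

    for _ in range(m_try):                             # Level 3: debug passes
        tau = None
        tau_prev = None
        for d in range(m_debug):
            tau = agents["trace"](y)                   # EMBEDTRACE
            suggestion = agents["suggestor"](
                tau, logs, agents["explainer"](y, problem))
            y = agents["debugger"](y, suggestion)
            if test(y, examples):
                return y
            if tau_prev is not None and agents["stalled"](tau, tau_prev, d):
                break                                  # trace stopped changing
            tau_prev = tau
        # Level 4: debate, then regenerate from the winning proposal
        proposals = [debate(problem, tau, y) for debate in agents["debaters"]]
        y = agents["coder"](problem, examples, agents["decider"](proposals))
        if test(y, examples):
            return y
    return y
```

With stub agents the control flow is easy to exercise: a coder that fails its visible tests falls through to planning, and a debugger that repairs the program causes an early return from Level 3.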
Level 4 (Multi-Agent Collaborative Refinement): When iterative debugging stalls, the system employs collaborative multi-agent discussion. Each of N_debater agents generates proposals incorporating discussion history:

    d_j = DEBATER_j(P, τ, Y, H_{j−1}),  j ∈ [1, N_debater],    (7)

where H_{j−1} = {d_1, ..., d_{j−1}} represents the accumulated discussion history. The decision aggregation employs weighted consensus:

    (α*, θ*) = argmax_{(α, θ)} Σ_{j=1}^{N_debater} w_j · φ(d_j, α, θ),
    w_j = exp(η_j / τ_w) / Σ_k exp(η_k / τ_w),    (8)

where η_j represents historical performance and φ evaluates proposal alignment.

3.3 Adaptive Level Transition Mechanism

Rather than using fixed iteration thresholds, we employ an adaptive transition mechanism based on execution trace similarity. The transition decision is formulated as:

    Transition(t) = True if ρ(τ_t, τ_{t−1}) > δ(t, T), else False,    (9)

where ρ measures trace similarity using normalized edit distance:

    ρ(τ_t, τ_{t−1}) = 1 − EditDist(τ_t, τ_{t−1}) / max(|τ_t|, |τ_{t−1}|).    (10)

The adaptive threshold δ(t, T) adjusts based on task complexity and iteration count:

    δ(t, T) = δ_0 · exp(−λ · (t / T_max) · complexity(T)),    (11)

| Method (GPT-3.5) | HumanEval | MBPP | HumanEval-ET | MBPP-ET |
|---|---|---|---|---|
| Direct | 72.0% ± 1.2% | 55.2% ± 0.8% | 62.8% ± 0.6% | 45.6% ± 0.6% |
| Self-Planning | 77.4% ± 1.8% | 69.2% ± 0.4% | 69.5% ± 0.6% | 52.4% ± 1.0% |
| MapCoder | 77.4% ± 0.6% | 72.0% ± 0.6% | 66.5% ± 1.2% | 56.6% ± 0.8% |
| LDB | 81.1% ± 0.6% | 72.4% ± 0.2% | 72.6% ± 1.8% | 55.6% ± 0.4% |
| LPW | 89.0% ± 0.8% | 76.0% ± 0.2% | 77.4% ± 0.8% | 57.6% ± 0.2% |
| SEMAG (Ours) | 91.5% ± 1.8% | 76.2% ± 0.8% | 79.9% ± 0.6% | 64.4% ± 0.4% |
| | (+27.1%) | (+38.0%) | (+27.2%) | (+41.2%) |

Table 1: Pass@1 accuracy comparison of different methods using GPT-3.5 on code generation benchmarks. The values enclosed in parentheses represent the improvement over the Direct Prompting approach. The standard deviation (±) is calculated based on the results of three independent runs and applies to the data analysis of subsequent experiments.
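Eqs. (9)–(11) amount to comparing a normalized-edit-distance similarity between consecutive traces against a decaying threshold. A minimal sketch assuming traces are plain strings; δ_0 = 0.85 and λ = 0.5 are the paper's values, while `t_max = 5` and the scalar `complexity` argument are illustrative placeholders for quantities the paper does not fully specify.

```python
import math

def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def trace_similarity(tau_t: str, tau_prev: str) -> float:
    """Eq. (10): rho = 1 - EditDist / max(|tau_t|, |tau_prev|)."""
    longest = max(len(tau_t), len(tau_prev))
    if longest == 0:
        return 1.0  # two empty traces are identical
    return 1.0 - edit_distance(tau_t, tau_prev) / longest

def adaptive_threshold(t: int, complexity: float,
                       delta0: float = 0.85, lam: float = 0.5,
                       t_max: int = 5) -> float:
    """Eq. (11): delta = delta0 * exp(-lam * (t / T_max) * complexity(T)).
    delta0 and lam follow the paper; t_max is an assumed cap."""
    return delta0 * math.exp(-lam * (t / t_max) * complexity)

def should_transition(tau_t: str, tau_prev: str,
                      t: int, complexity: float) -> bool:
    """Eq. (9): leave the current level once consecutive traces are so
    similar that debugging has effectively stalled."""
    return trace_similarity(tau_t, tau_prev) > adaptive_threshold(t, complexity)
```

The threshold shrinks with iteration count and task complexity, so harder tasks and later iterations escalate to the next level more readily.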
In Eq. (11), δ_0 = 0.85 is the initial threshold, λ = 0.5 is the decay rate, t ∈ [1, T_max] is the current iteration count within the active level, and T_max represents the maximum iterations before mandatory level transition.

3.4 Self-Evolution Mechanism

To enable dynamic adaptation to evolving LLMs, we propose an automated model selection framework employing N_selectors parallel agents. Each selector i performs four operations. First, it generates task-specific keywords κ_i = KEYWORDGEN(T, context) and retrieves recent information L_i = SEARCH(κ_i) using search tools. Then, relevant links are filtered and summarized:

    L′_i = {l ∈ L_i : relevance(l, T) > θ_r},    (12)
    C_i = ⋃_{ℓ ∈ L′_i} SUMMARIZE(ℓ).    (13)

Third, each selector proposes a model m_i with a confidence score:

    (m_i, r_i, s_i) = SELECTOR(C_i, Perf(m_i, T_sample)),    (14)

where s_i reflects sampled performance on a task subset T_sample. Finally, consensus is achieved through weighted voting:

    m* = argmax_{m ∈ M} Σ_{i=1}^{N_selectors} s_i · I[m_i = m].    (15)

This mechanism ensures optimal model selection without manual intervention while maintaining adaptability to emerging LLMs.

4 Experiments

4.1 Experimental Setup

Evaluation Datasets. We evaluate SEMAG on seven text-to-code benchmarks across two categories. The foundational datasets include HumanEval (Chen et al., 2021) and HumanEval-ET (164 problems each), and MBPP (Austin et al., 2021) and MBPP-ET (500 problems each). The ET variants (Dong et al., 2025) extend their counterparts with additional edge test cases. For MBPP/MBPP-ET, which lack sample input-output pairs, we follow previous work (Zhong et al., 2024; Lei et al., 2024) by randomly selecting one test case from the hidden test set as a sample (excluded from evaluation). The competition-level datasets consist of APPS (Hendrycks et al., 2021) (139 problems), LiveCode (Jain et al., 2025) (140 problems), and CodeContests (Li et al., 2022) (150 problems).
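Returning briefly to the self-evolution mechanism of Section 3.4: the consensus rule in Eq. (15) is simply a score-weighted plurality vote over the selectors' proposals. A minimal sketch, where the retrieval, summarization, and scoring stages of Eqs. (12)–(14) are elided and the `(model, score)` proposal format is an assumption:

```python
from collections import defaultdict

def select_model(proposals: list[tuple[str, float]]) -> str:
    """Eq. (15): m* = argmax_m sum_i s_i * 1[m_i = m].
    Each proposal is (model_name, confidence s_i) from one selector
    agent; ties are broken arbitrarily by max()."""
    votes: dict[str, float] = defaultdict(float)
    for model, score in proposals:
        votes[model] += score   # accumulate s_i for the voted model
    return max(votes, key=votes.get)
```

For instance, proposals ("claude-3.7-sonnet", 0.9), ("gpt-4.1", 0.8), ("claude-3.7-sonnet", 0.5) give claude-3.7-sonnet a total weight of 1.4, so it wins despite gpt-4.1 holding the single highest-confidence vote.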
LiveCode, released after the LLM training cutoff, ensures uncontaminated evaluation.

Baseline Methods. We compare SEMAG against several baseline approaches: Direct inputs tasks directly into an LLM; Self-Planning (Jiang et al., 2024b) decomposes tasks into subgoals; MapCoder (Islam et al., 2024) employs four agents for retrieval, planning, execution, and debugging; LDB (Zhong et al., 2024) utilizes control flow diagrams for program decomposition and error localization; and LPW (Lei et al., 2024), the state-of-the-art approach, verifies plans step-by-step and uses print statements for debugging.

4.2 Main Results

Comparison with Baselines. Tables 1 and 2 present results using GPT-3.5 and GPT-4o as backbone models. With GPT-3.5, SEMAG achieves the highest Pass@1 accuracy across all benchmarks, outperforming the strongest baseline LPW by 2.5%, 0.2%, 2.5%, and 6.8% on HumanEval, MBPP, HumanEval-ET, and MBPP-ET, respectively.

| Method (GPT-4o) | HumanEval | MBPP | HumanEval-ET | MBPP-ET |
|---|---|---|---|---|
| Direct | 91.5% ± 1.8% | 62.8% ± 0.4% | 79.3% ± 1.2% | 51.0% ± 0.2% |
| LDB | 92.1% ± 1.2% | 82.4% ± 0.8% | 81.7% ± 1.8% | 65.4% ± 1.0% |
| LPW | 98.2% ± 0.6% | 84.8% ± 0.6% | 84.8% ± 1.2% | 65.8% ± 0.8% |
| SEMAG (Ours) | 98.8% ± 0.6% | 87.6% ± 0.4% | 86.6% ± 0.6% | 71.8% ± 0.2% |
| | (+8.0%) | (+38.9%) | (+9.2%) | (+40.8%) |

| Method (GPT-4o) | APPS | LiveCode | CodeContests | Overall Avg. |
|---|---|---|---|---|
| Direct | 47.5% ± 0.3% | 46.4% ± 0.8% | 24.6% ± 1.3% | 57.6% |
| LDB | 53.2% ± 0.7% | 54.3% ± 0.7% | 29.3% ± 0.7% | 65.5% |
| LPW | 62.6% ± 0.3% | 59.3% ± 1.4% | 34.7% ± 0.7% | 70.0% |
| SEMAG (Ours) | 67.6% ± 0.8% | 65.0% ± 0.7% | 38.0% ± 1.3% | 73.6% |
| | (+42.3%) | (+40.1%) | (+54.5%) | (+27.7%) |

Table 2: Pass@1 accuracy comparison of different methods using GPT-4o (2024-05-13) across multiple benchmarks. The values enclosed in parentheses represent the improvement over the Direct Prompting approach.

| Level | HumanEval | MBPP | HumanEval-ET | MBPP-ET | APPS | LiveCode | CodeContests |
|---|---|---|---|---|---|---|---|
| Level 1 | 148 | 314 | 130 | 255 | 66 | 65 | 37 |
| Level 2 | 8 | 18 | 6 | 10 | 9 | 16 | 6 |
| Level 3 | 4 | 48 | 2 | 46 | 7 | 4 | 5 |
| Level 4 | 4 | 120 | 26 | 189 | 57 | 55 | 102 |

Table 3: Distribution of prompt difficulty levels across multiple benchmarks using GPT-4o (2024-05-13).

Using GPT-4o, SEMAG establishes new state-of-the-art results across all seven benchmarks, achieving 98.8% accuracy on HumanEval (solving 162/164 problems). Compared to LPW, SEMAG demonstrates consistent improvements of 1.8-6.0% on foundational benchmarks and 3.3-5.7% on competition-level benchmarks, with particularly significant gains of 40-54% over Direct prompting.

Self-Evolution Agents in Code Task. To evaluate self-evolution capability, we deploy agents on the CodeContests benchmark to select optimal LLMs autonomously. Agents analyze real-time information to identify three candidate models: Claude-3.7-Sonnet, GPT-4.1, and DeepSeek-v3. Figure 3 shows that Claude-3.7-Sonnet achieves 52.6% Pass@1 accuracy, establishing a new state-of-the-art and significantly outperforming GPT-4o's 38.0%. GPT-4.1 and DeepSeek-v3 both achieve 48.7%, demonstrating that the self-evolution mechanism effectively identifies and evaluates task-optimized models for continuous improvement.

[Figure 3: Pass@1 accuracy on CodeContests using GPT-4o (2024-05-13), GPT-4.1 (2025-04-14), DeepSeek-v3 (2025-03-24), and Claude-3.7-Sonnet (2025-02-19): 38.0%, 48.7%, 48.7%, and 52.6%, respectively.]

4.3 Ablation Studies and Analyses

Token Efficiency Analysis. Table 3 presents the distribution of prompt difficulty levels (1–4, indicating increasing complexity) across benchmarks using GPT-4o. Simpler datasets (HumanEval, MBPP) predominantly use Level 1 prompts (90.2% and 62.8%, respectively), while complex datasets (APPS, CodeContests) require more Level 3–4 prompts (46.0% and 71.3%, respectively). Figure 4 compares token consumption between LPW and SEMAG. Our hierarchical prompt strategy reduces token usage while improving accuracy across all datasets.
On simpler tasks (HumanEval, MBPP), SEMAG achieves 19.3% and 15.5% token reduction compared to LPW, respectively. For complex tasks (APPS, CodeContests), where Level 4 prompts dominate, token reduction is 9.3% and 5.1%, respectively, constrained by inherent task complexity. This demonstrates that SEMAG's hierarchical decomposition effectively optimizes both performance and efficiency.

[Figure 4: Comparison of Pass@1 accuracy and average token count per question for LPW and SEMAG across benchmarks, using GPT-4o as the LLM backbone (K = 10^3). Per benchmark, LPW vs. SEMAG: HumanEval 98.2% vs. 98.8%; MBPP 84.8% vs. 87.6%; APPS 62.6% vs. 67.6%; LiveCode 59.3% vs. 65.0%; CodeContests 34.7% vs. 38.0%.]

Impact of Different Agents. We conduct an ablation study on HumanEval using GPT-3.5 to evaluate each agent's contribution. As shown in Table 4, excluding any component reduces Pass@1 accuracy. Individual agents achieve limited improvements: Plan Verification alone reaches 77.4% (+5.5% from the 71.9% baseline), Refine Suggestion 80.5%, and Discussion and Decision 81.7%. Dual-agent configurations perform better (82.9%-83.5%) but remain 8.7%-9.4% below the full implementation. The complete SEMAG achieves 91.5% Pass@1, demonstrating the synergistic importance of all three components.

| Plan Verification | Refine Suggestion | Discussion and Decision | Pass@1 accuracy |
|---|---|---|---|
| ✗ | ✗ | ✗ | 71.9% (−21.4%) |
| ✓ | ✗ | ✗ | 77.4% (−15.4%) |
| ✗ | ✓ | ✗ | 80.5% (−12.0%) |
| ✗ | ✗ | ✓ | 81.7% (−10.7%) |
| ✗ | ✓ | ✓ | 83.5% (−8.7%) |
| ✓ | ✗ | ✓ | 83.5% (−8.7%) |
| ✓ | ✓ | ✗ | 82.9% (−9.4%) |
| ✓ | ✓ | ✓ | 91.5% |

Table 4: Pass@1 accuracy of different component combinations in SEMAG, showing relative decreases from the full implementation (91.5% baseline). Results obtained using GPT-3.5 on the HumanEval benchmark.
Impact of Tool Using. In the planning stage, the planning agent can choose to utilise external tools, such as search engines, to enhance decision-making. We conduct an experiment on the HumanEval benchmark with GPT-3.5. Table 5 shows that when the planning agent uses tools, SEMAG achieves a Pass@1 accuracy of 91.5%. Without tools, the accuracy decreases to 87.8%. This 3.7% decline emphasizes the importance of external tools in planning. The results demonstrate that these tools help the planning agent access more relevant information, improving the quality of plans and SEMAG's overall performance.

| With Tool Using | Without Tool Using |
|---|---|
| 91.5% | 87.8% |

Table 5: Pass@1 accuracy of SEMAG with and without tool usage in the planning stage. Results are obtained using GPT-3.5 on the HumanEval benchmark.

Analysis of Self-Evolution Agents. To calibrate the crawler depth of self-evolution agents, we vary the number of returned pages, N_links ∈ {10, 15, 20, 25, 30}, while fixing all other variables (five random seeds, identical search prompts, temperature = 0.1). After summarizing the first N URLs (published ≤ 30 days ago), the agents ranked the evidence and proposed 3 candidate LLMs for the given code task. Table 6 reports (i) the probability that Claude-3.7-Sonnet appears in the Top-3 list, (ii) average token consumption during summarization and reasoning, and (iii) end-to-end selection latency, all averaged over the five seeds.

| N_links | Pr (%) ↑ | Tokens (K) ↓ | Latency (min) ↓ |
|---|---|---|---|
| 10 | 40.0 | 30.4 | 3.5 |
| 15 | 60.0 | 39.1 | 4.6 |
| 20 | 80.0 | 45.9 | 6.0 |
| 25 | 80.0 | 65.2 | 7.8 |
| 30 | 80.0 | 78.3 | 9.2 |

Table 6: Impact of crawl depth on the probability (%) of discovering Claude-3.7-Sonnet in Top-3 and associated resource costs (averaged over five runs, 30-day window).

The results show that shallower crawls with 10–15 pages often miss key benchmark posts, yielding a lower than 70% probability of identifying Claude-3.7-Sonnet and defaulting to weaker models, albeit at lower cost.
Scaling to N_links = 20 reaches the plateau discovery rate (80% probability) with modest overhead (45K tokens, 6 minutes). Further increases add little value but inflate costs by 30–55%.

This highlights uncertainties in search-dependent model selection: online information may be incomplete or biased due to search algorithms, recency effects, or uneven coverage. In our experiments, insufficient depth (N_links ≤ 15) omitted Claude-3.7-Sonnet in up to 60% of runs, risking suboptimal choices. Thus, N_links = 20 balances reliability and efficiency, ensuring top performers are captured while minimizing resources.

Parameter Details. We experiment on how different LLM temperatures influence the accuracy of SEMAG. Figure 5 shows the variation in Pass@1 accuracy on the HumanEval benchmark using GPT-3.5. The highest mean Pass@1 accuracy (91.1%) is achieved at T = 0.1 and T = 0.8, with T = 0.1 exhibiting the lowest variance. To improve the reproducibility and consistency of our experimental results, we maintain a constant temperature of T = 0.1 throughout all stages of SEMAG.

[Figure 5: Pass@1 accuracy (right y-axis) and its variance (left y-axis, scaled by 10^−4) on the HumanEval benchmark using GPT-3.5 as the backbone, measured over three independent runs for each temperature setting (0.1 to 1.0).]

To further quantify the influence of the number of candidate generations (M_try) and debugging iterations (M_debug), we conduct a grid search over (M_try, M_debug) ∈ {0, 1, ..., 6}². Figure 6 shows the variation in Pass@1 accuracy on the HumanEval benchmark using GPT-3.5. Increasing either M_try or M_debug consistently improves performance. Starting from (0, 0), where only 71.3% accuracy is achieved, the Pass@1 accuracy increases steadily with higher values of both parameters.

[Figure 6: Pass@1 accuracy on the HumanEval benchmark with GPT-3.5 as the backbone, evaluated under different combinations of M_try and M_debug values. Each cell represents the mean Pass@1 accuracy for a specific parameter pair.]

The performance begins to plateau near (M_try = 5, M_debug = 4), where SEMAG reaches 91.5%, representing a near-optimal balance between solution diversity and iterative refinement. Although the highest accuracy observed (92.1%) occurs at (5, 6), the gain over (5, 4) is minimal and comes with increased inference costs. As a result, we set M_try = 5 and M_debug = 4 for all subsequent experiments, as these values have been empirically shown to optimize SEMAG's performance.

5 Conclusion

We introduce SEMAG, a Self-Evolutionary Multi-Agent framework designed for code generation. By employing a division of labour with hierarchical prompting mechanisms, the coding agents of SEMAG significantly enhance the performance of LLMs across diverse programming tasks. The self-evolutionary agents of SEMAG can access the latest models in real time and automatically upgrade the backbone model. The coding agents of SEMAG achieve state-of-the-art Pass@1 accuracy across seven benchmarks, including 98.8% on HumanEval, 87.6% on MBPP, and 38.0% on CodeContests, while substantially reducing computational resource overhead and token consumption. With a controlled backbone, SEMAG improves 3.3% over LPW on CodeContests. With self-evolutionary model selection, it further reaches 52.6%, demonstrating the benefit of adaptive backbone switching.
Future work will explore finer-grained decomposition, cross-modal collaboration, and efficient model selection strategies.

6 Limitations

Among the limitations of our work: firstly, SEMAG involves inference-time hyperparameters (M_try and M_debug) that affect the trade-off between accuracy and cost; however, our experiments in Section 4.3 identify a stable configuration that generalizes across benchmarks, and adaptive tuning strategies are left for future work. Secondly, the hierarchical multi-agent design invests more computation on challenging problems through iterative refinement, which may increase latency in time-sensitive scenarios; our adaptive level transition mechanism partially addresses this by reducing token consumption by 15–20% on simpler tasks compared to fixed-depth baselines. Thirdly, the self-evolutionary model selection component relies on real-time information retrieval to identify optimal backbones; we note that this module is optional: the core framework operates independently with any fixed model, as shown in Tables 1 and 2, and offline model recommendation could be explored in future work. Finally, as with any system executing machine-generated code, running outputs inside a sandbox environment is advisable to mitigate potential security risks.

References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, and 1 others. 2023. GPT-4 technical report. Preprint, arXiv:2303.08774.

Anthropic. 2024. Introducing the next generation of Claude. https://www.anthropic.com/news/claude-3-family. Accessed: 2024-03-04.

Antonis Antoniades, Albert Örwall, Kexun Zhang, Yuxi Xie, Anirudh Goyal, and William Wang. 2025. SWE-search: Enhancing software agents with Monte Carlo tree search and iterative refinement. In International Conference on Representation Learning, volume 2025, pages 64485–64515.
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and 1 others. 2021. Program synthesis with large language models. Preprint, arXiv:2108.07732.

R. Balzer. 1985. A 15 year perspective on automatic programming. IEEE Transactions on Software Engineering, 11(11):1257–1268.

Devichand Budagam, Ashutosh Kumar, Mahsa Khoshnoodi, Sankalp KJ, Vinija Jain, and Aman Chadha. 2025. Hierarchical prompting taxonomy: A universal evaluation framework for large language models aligned with human cognitive principles. In First International KDD Workshop on Prompt Optimization.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, and 1 others. 2021. Evaluating large language models trained on code. Preprint, arXiv:2107.03374.

Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching large language models to self-debug. Preprint, arXiv:2304.05128.

Yihong Dong, Jiazheng Ding, Xue Jiang, Ge Li, Zhuo Li, and Zhi Jin. 2025. CodeScore: Evaluating code generation by learning code execution. ACM Transactions on Software Engineering and Methodology, 34(3):1–22.

Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li. 2024. Self-collaboration code generation via ChatGPT. ACM Transactions on Software Engineering and Methodology, 33(7):1–38.

Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2021. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH), 3(1):1–23.

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. 2021. Measuring coding challenge competence with APPS. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1.

Yaojie Hu, Qiang Zhou, Qihong Chen, Xiaopeng Li, Linbo Liu, Dejiao Zhang, Amit Kachroo, Talha Oz, and Omer Tripp. 2025. QualityFlow: An agentic workflow for program synthesis controlled by LLM quality checks. Preprint, arXiv:2501.17167.

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, and 1 others. 2024. Qwen2.5-Coder technical report. Preprint, arXiv:2409.12186.

Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, and 1 others. 2024. GPT-4o system card. Preprint, arXiv:2410.21276.

Md Ashraful Islam, Mohammed Eunus Ali, and Md Rizwan Parvez. 2024. MapCoder: Multi-agent code generation for competitive problem solving. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4912–4944.

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2025. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. In International Conference on Learning Representations, pages 58791–58831.

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, and 1 others. 2024a. Mixtral of experts. Preprint, arXiv:2401.04088.

Xue Jiang, Yihong Dong, Lecheng Wang, Zheng Fang, Qiwei Shang, Ge Li, Zhi Jin, and Wenpin Jiao. 2024b. Self-planning code generation with large language models. ACM Transactions on Software Engineering and Methodology, 33(7):1–30.

Chao Lei, Yanchuan Chang, Nir Lipovetzky, and Krista A. Ehinger. 2024. Planning-driven programming: A large language model programming workflow. Preprint, arXiv:2411.14503.

Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, and Ion Stoica. 2025. S*: Test time scaling for code generation. Preprint, arXiv:2502.14382.

Kefan Li and Yuan Yuan. 2024. Large language models as test case generators: Performance evaluation and enhancement. Preprint, arXiv:2404.13340.

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, and 1 others. 2022. Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097.

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, and 1 others. 2024. DeepSeek-V3 technical report. Preprint, arXiv:2412.19437.

Zohar Manna and Richard J. Waldinger. 1971. Toward automatic program synthesis. Communications of the ACM, 14(3):151–165.

Noble Saji Mathews and Meiyappan Nagappan. 2024. Test-driven development for code generation. Preprint, arXiv:2402.13521.

John McCarthy. 1978. History of Lisp. SIGPLAN Not., 13(8):217–223.

Marjan Mernik, Jan Heering, and Anthony M. Sloane. 2005. When and how to develop domain-specific languages. ACM Computing Surveys (CSUR), 37(4):316–344.

Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, and 1 others. 2023. Code Llama: Open foundation models for code. Preprint, arXiv:2308.12950.

Elliot Soloway. 1986. Learning to program = learning to construct mechanisms and explanations. Communications of the ACM, 29(9):850–858.

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, and 1 others. 2023. Gemini: A family of highly capable multimodal models. Preprint, arXiv:2312.11805.

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, and 1 others. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. Preprint, arXiv:2403.05530.

Richard J. Waldinger and Richard C. T. Lee. 1969. PROW: A step toward automatic program writing. In Proceedings of the 1st International Joint Conference on Artificial Intelligence, pages 241–252.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, Denny Zhou, and 1 others. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2023a. Tree of Thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36:11809–11822.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023b. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations.

Yifan Zhang, Jingqin Yang, Yang Yuan, and Andrew Chi-Chih Yao. 2023. Cumulative reasoning with large language models. Preprint, arXiv:2308.04371.

Li Zhong, Zilong Wang, and Jingbo Shang. 2024. Debug like a human: A large language model debugger via verifying runtime execution step by step. In Findings of the Association for Computational Linguistics: ACL 2024, pages 851–870.

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, and 1 others. 2022. Least-to-most prompting enables complex reasoning in large language models. Preprint, arXiv:2205.10625.
A Analysis of Solving Different Levels

A.1 APPS

APPS is a well-established dataset for evaluating algorithmic problem-solving capabilities, categorising programming problems into three distinct difficulty levels: Introductory, Interview, and Competition. These levels range from basic coding exercises to advanced competitive programming challenges, providing a structured framework to assess the performance of LLM-based methods across varying complexities.

Figure 7: Pass@1 accuracy on the APPS benchmark across different difficulty levels (Introductory, Interview, and Competition) of Direct, LDB, LPW and SEMAG, when using GPT-4o as the LLM backbone. The plotted values are:

Method        | Introductory | Interview | Competition
Direct        | 78.7         | 50.0      | 13.0
LDB           | 78.7         | 52.2      | 28.3
LPW           | 87.2         | 65.2      | 34.8
SEMAG (Ours)  | 89.4         | 80.4      | 32.6

Figure 7 compares accuracy on the APPS benchmark across different levels of problems. SEMAG demonstrates superior performance at the Introductory and Interview levels, achieving 89.4% and 80.4% respectively, a significant margin over existing approaches. Specifically, SEMAG surpasses the next-best LPW approach by 2.2% at the Introductory level and establishes a notable 15.2% advantage at the Interview level. However, at the Competition level, SEMAG (32.6%) shows slightly reduced effectiveness compared to LPW's 34.8%, suggesting potential areas for optimization on Competition-level problems. The hierarchical prompting strategy affects model performance, resulting in success on visible tests but failure on hidden tests. The baseline Direct exhibits fundamental limitations, particularly in competition contexts (13.0%), while LDB demonstrates moderate improvements at the Interview (52.2%) and Competition (28.3%) levels compared to Direct. These results collectively highlight SEMAG's exceptional capability at the Introductory and Interview levels.
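The Pass@1 figures above follow the standard execution-based pass@k metric. The paper does not spell out its estimator, but the widely used unbiased formulation from Chen et al. (2021), cited in the references, is pass@k = 1 - C(n-c, k) / C(n, k) for n generated samples of which c pass. A minimal sketch (the helper name `pass_at_k` is ours):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per problem (assumes 1 <= k <= n)
    c: number of samples that pass all tests
    Returns the probability that at least one of k drawn samples passes:
        pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset
        # must contain a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k = 1 this reduces to the fraction of correct samples, c / n,
# which is what a benchmark-level Pass@1 averages over problems.
```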
A.2 LiveCode

The LiveCode benchmark focuses on real-time coding scenarios reflective of practical software development tasks. Its problems are classified into Easy, Medium, and Hard levels, capturing varying degrees of complexity encountered in applied settings.

Figure 8: Pass@1 accuracy on the LiveCode benchmark across different difficulty levels (Easy, Medium, and Hard) of Direct, LDB, LPW and SEMAG, when using GPT-4o as the LLM backbone. The plotted values are:

Method        | Easy | Medium | Hard
Direct        | 80.0 | 35.0   | 25.0
LDB           | 82.2 | 41.8   | 35.0
LPW           | 86.7 | 47.3   | 40.0
SEMAG (Ours)  | 86.7 | 60.0   | 47.5

Figure 8 compares accuracy on the LiveCode benchmark across different levels of problems. At the Easy level, both SEMAG and LPW achieve the highest accuracy of 86.7%, which is 6.7% higher than the Direct prompting approach (80.0%). This indicates that both methods are highly effective in low-complexity scenarios. At the Medium level, SEMAG demonstrates a significant advantage, achieving an accuracy of 60.0%, which surpasses the second-best method, LPW (47.3%), by 12.7%. At the most challenging Hard level, SEMAG continues to lead with an accuracy of 47.5%, outperforming LPW (40.0%) and LDB (35.0%). This validates the strong robustness of SEMAG on extremely complex problems.

B Prompt of SEMAG

Here, we list the prompts of SEMAG in detail as follows.

Plan Agent Prompt

Persona: Your primary role is to decompose complex programming tasks into modular components and provide guidance on how to structure them for reusability and maintainability.
Problem: problem
Instructions:
• Break down the coding task into modules and recommend a clear and concise structure for each module.
• Advise on data structures, algorithms, and modularization techniques.
• Guide the Coder Agent step-by-step to implement code.
• You have access to the following tools: tool_des.
Output Format:
• Action: The action you take, must be one of [tool_names]. (Learn tool usage first, e.g., the parameters of the tool.)
• Action input: The input to the action.
• Final plan: Your concise, structured final plan for the task.
Remember: Strictly follow the JSON Output Format and give only the plan to solve the problem. Avoid extra explanation or words. Tools are optional. If using tools, specify 'Action' and 'Action input', then await output before providing the 'Final plan'. If not using tools, leave 'Action' and 'Action input' empty.

Figure 9: The prompt of Planning Agent.

Plan Verify Agent Prompt

Persona: Your primary role is to verify the solution plan for the given programming task. You must perform a step-by-step analysis of the provided solution plan, record intermediate variable values, and compare the derived results with the expected test outcomes.
Problem: problem
Plan: plan
Instructions:
• Review the provided solution plan for the given Python problem.
• For each test case, begin by recording all necessary intermediate variables.
• As you process the plan, update any intermediate variable values.
• After executing the verification steps for a test case, compare the derived result with the expected test output.
• If the derived result matches the expected output, output the plan as the 'Final plan' and mark it as correct.
• If the derived result does not match, provide a revised solution plan as the 'Final plan'.
Output Format:
• Verification: A detailed, step-by-step verification process.
• Correctness: The given solution plan is correct or not (True/False).
• Final plan: The final plan you provide for the task.
Remember: Strictly follow the JSON Output Format and include only the verification process and final plan. Do not include any extra explanations or words.

Figure 10: The prompt of Plan Verifying Agent.

Code Agent Prompt

Persona: Your primary role is to generate Python 3 code to solve the given coding problem based on the given plan and the problem description.
Problem: problem
Plan: plan
Plan verification: plan_verification
Instructions:
• Use the entry point entry_point of the problem; do not add a main function.
Output Format:
• Code: The code you generate.
Remember: Start with ```Code and end with ```. Write all the code into a single .py file. Avoid extra explanation, words or """ in your output.

Figure 11: The prompt of Coding Agent.
Add Trace Agent Prompt

Persona: Your primary role is to add print statements to the given code to solve the problem.
Code: code
Plan verification: plan_verification
Instructions:
• Incorporate debug print statements to trace intermediate variable changes as described in the plan verifications.
Output Format:
• Code: The program with PRINT statements added to trace variable changes.
Remember: Start with ```Code and end with ```. Write all the code into a single .py file. Avoid extra explanation, words or """ in your output.

Figure 12: The prompt of Adding Trace Agent.

Code Explain Agent Prompt

Persona: Your primary role is to explain each line of a given Python program, describing the effect of each line.
Problem: problem
Code: code
Instructions:
• You will receive a new problem description and a generated Python program intended to solve the problem.
• Generate a detailed explanation for each line of the provided Python program.
Output Format:
• Code explanation: A detailed explanation for each line of the Python program. Each line's explanation should describe its effect on the program's behaviour.
Remember: Strictly follow the JSON Output Format. Provide only the explanation for the Python program as specified, without extra explanation or additional words.

Figure 13: The prompt of Code Explaining Agent.

Refinement Suggest Agent Prompt

Persona: Your primary role is to verify error execution traces, comparing the provided 'Error Execution Trace' with the 'Correct Plan Verification' for a Python problem and identify any discrepancies.
Problem: problem
Code with Error: code
Correct Plan Verification: plan_verification
Error Execution Trace: test_log
Instructions:
• The 'Error Execution Trace' is the output of the 'Code with Error' when it fails to meet the expected output.
• Compare the 'Error Execution Trace' with the 'Correct Plan Verification' and output the differences and your analysis of the errors.
• Provide a summary of the errors, including reasons for the discrepancies and suggestions for corrections.
Output Format:
• Analysis: Your output including a detailed discrepancy analysis.
• Refine suggestion: Your analysis of the errors and suggestions on how to correct them.
Remember: Strictly follow the JSON Output Format. Provide only the explanation for the Python program as specified, without extra explanation or additional words.

Figure 14: The prompt of Suggesting Agent.

Debug Agent Prompt

Persona: Your primary role is to correct an erroneous Python program based on the provided error analysis and related explanations.
Problem: problem
Code with Error: code
Code Explanation: code_explanation
Refine Suggestion: refine_suggestion
Instructions:
• You will be presented with a new problem description, the code with error, its explanation, and analysis of the errors and suggestions on how to correct them.
• Generate the refined Python program based on the 'Refine Suggestion'.
• Output your refined code with only the Python code, and provide a refinement explanation that details the modifications made.
Output Format:
• Refined Explanation: A detailed explanation describing the modifications made to the code.
• Code: The refined code according to the error analysis.
Remember: Start with ```Code and end with ```. Write all code into a single .py file. Avoid extra explanation, words or """ in your output.

Figure 15: The prompt of Debugging Agent.

Algorithm Suggest Agent Prompt

Persona: Your primary role is to analyze code failures on sample tests and suggest an improved algorithm, method, or parameters.
Instructions:
• Target problem: problem.
• The given code is not working as expected: test_log.
• Provide a new algorithm, method, or parameters to address the problem.
• The existing code is: code.
Output Format:
• Algorithm: The proposed algorithm or method.
• Parameters: The suggested parameters.
• Reason: Explanation for choosing this algorithm, method, or parameters.
Remember: Strictly follow the JSON Output Format. Avoid additional explanations or text outside the format.

Figure 16: The prompt of Discussing Agent.

Algorithm Decide Agent Prompt

Persona: Your primary role is to decide which algorithm and parameters to use on the problem based on the discussion.
Instructions:
• Evaluate the performance of the algorithm and parameters for the problem problem and decide whether to adopt them.
• The given code is not working as expected: test_log.
• The discussion is: discussion.
Output Format:
• Algorithm: The selected algorithm or method.
• Parameters: The chosen parameters.
Remember: Strictly follow the JSON Output Format. Avoid additional explanations or text outside the format.

Figure 17: The prompt of Discriminating Agent.

Code Refinement Agent Prompt

Instructions:
• Refine the existing code to match the expected output using the algorithm/method: algorithm and parameters: parameters provided by the discriminator for improved performance.
• Target problem: problem.
• The plan is: plan.
• Use the entry point entry_point of the problem; do not add a main function.
• Handle edge cases such as invalid inputs, empty values, or boundary conditions.
Remember: Start with ```Code and end with ```. Write all code into a single .py file. Avoid extra explanation, words or """ in your output.

Figure 18: The prompt of Code Refining Agent.

C Prompt of Self-Evolution Agent

Here, we list the prompt of the Self-Evolution agent as follows.

LLM Select Agent Prompt

Persona: Your primary role is to search and decide on the best large language models for the given task.
Date: date.
Task: task.
Instructions:
• You need to select the 3 best large language models for the task: task.
• You have access to the following tools: tool_des.
Output Format:
• Action: The action you take, must be one of [tool_names]. (Remember to learn how to use the tool first, e.g., the parameters of the tool.)
• Action input: The input to the action.
• Model name: The names of the large language models you select.
• Reason: The reason for selecting the model.
Remember: Strictly follow the JSON Output Format. Avoid additional explanations or text outside the format. You must take action first, waiting for the output. After receiving the output, you should make a decision on which model is best suited for the task.

Figure 19: The prompt of LLM Selecting Agent.

Link Select Agent Prompt

Persona: Your primary role is to select the 10 links most relevant to the given question.
Question: question.
Links with relevant information: links.
Instructions:
• You need to select the links most relevant to the question based on the information provided.
Output Format:
• Links: The selected links.
Remember: Strictly follow the JSON Output Format. Avoid additional explanations or text outside the format.

Figure 20: The prompt of Link Selecting Agent.

Content Summary Agent Prompt

Persona: Your primary role is to summarize the content of the given article.
Article: content.
Instructions:
• You need to summarize the content of the article.
• The summary should be short and informative.
Output Format:
• Summary: The summary of the article.
Remember: Strictly follow the JSON Output Format. Avoid additional explanations or text outside the format.

Figure 21: The prompt of Content Summarizing Agent.

LLM Decide Agent Prompt

Persona: Your primary role is to decide which large language model is best suited for the given task.
Task: task.
Instructions:
• You need to decide which large language model is best suited for the task: task.
• The discussion is: discussion.
Output Format:
• Model name: The name of the large language model you select.
• Reason: The reason for selecting the model.
Remember: Strictly follow the JSON Output Format. Avoid additional explanations or text outside the format.

Figure 22: The prompt of LLM Deciding Agent.

Model Match Agent Prompt

Persona: Your primary role is to match the given model name with the given model list.
Model name: model_name.
Model list: model_list.
Instructions:
• You need to find the correct model name from the given list based on the given model name.
• You must use the newest model version (add date of release or '-latest') of the model if available in the model list.
Output Format:
• Model name: The name of the large language model you select.
Remember: Strictly follow the JSON Output Format. Avoid additional explanations or text outside the format.

Figure 23: The prompt of Model Matching Agent.

D Example Problem

Here, we show how SEMAG works on an example problem (the 51st problem) from the HumanEval benchmark. The detailed prompts and responses are given as follows.
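The walkthrough that follows centres on a Caesar-style shift cipher. As a self-contained reference, here is a minimal sketch of the problem's encode function together with the corrected decode the transcripts eventually arrive at: shifting forward by 21, which is congruent to shifting back by 5 modulo 26 (the type hints on the return values are ours):

```python
def encode_shift(s: str) -> str:
    # Shift every lowercase character forward by 5 in the alphabet,
    # wrapping around with modulo 26.
    return "".join(chr(((ord(ch) + 5 - ord("a")) % 26) + ord("a")) for ch in s)


def decode_shift(s: str) -> str:
    # Adding 21 is congruent to subtracting 5 modulo 26
    # (26 - 5 = 21), so this exactly inverts encode_shift.
    return "".join(chr(((ord(ch) + 21 - ord("a")) % 26) + ord("a")) for ch in s)
```

Note that subtracting 5 directly, as the first generated attempt in the transcripts does, is mathematically equivalent under Python's modulo; the paper's Debugging Agent nonetheless settles on the +21 form suggested by the search result "take 26 minus the shift value".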
================================== User ==================================
"Persona": "Your primary role is to decompose complex programming tasks into modular components and provide guidance on how to structure them for reusability and maintainability.",
"Problem": "
def encode_shift(s: str):
    """
    returns encoded string by shifting every character by 5 in the alphabet.
    """
    return "".join([chr(((ord(ch) + 5 - ord("a")) % 26) + ord("a")) for ch in s])

def decode_shift(s: str):
    """
    takes as input string encoded with encode_shift function. Returns decoded string.
    """
",
"Instructions": [
"Break down the coding task into modules and recommend a clear and concise structure for each module.",
"Advise on data structures, algorithms, and modularization techniques.",
"Guide the Coder Agent step-by-step to implement code.",
"You have access to the following tools: google_search: Use this tool to interact with the Google API. Purpose: Use this tool to perform a Google search. The input must be a valid JSON object. Required Input Format (JSON): "search_query": "coffee""
],
"Output Format": [
"Action": "The action you take, must be one of [google_search] (learn tool usage first, e.g., the parameters of the tool).",
"Action input": "The input to the action.",
"Final plan": "Your concise, structured final plan for the task."
],
"Remember": "Strictly follow the json Output Format and give only the plan to solve the problem. Avoid extra explanation or words. Tools are optional. If using tools, specify Action and Action input, then await output before providing the Final plan. If not using tools, leave Action and Action input empty."
================================ Assistant =================================
"Action": "google_search",
"Action input": "search_query": "how to decode a Caesar cipher"
=============================== Search tool =================================
Take the value of 26 minus the shift value, and apply that new value to shift the encoded message back to its original form.
================================ Assistant =================================
"Final plan":
"1. Create a function decode_shift(s: str) that takes an encoded string as input.
2. Iterate through each character in the input string.
3. For each character, calculate the new shift value by subtracting 5 from 26.
4. Apply the new shift value to decode the character back to its original form.
5. Return the decoded string."

Figure 24: An example of Planning Agent.

================================== User ==================================
"Persona": "Your primary role is to verify the solution plan for the given programming task. You must perform a step-by-step analysis of the provided solution plan, record intermediate variable values, and compare the derived results with the expected test outcomes.",
"Problem": "
def encode_shift(s: str):
    """
    returns encoded string by shifting every character by 5 in the alphabet.
    """
    return "".join([chr(((ord(ch) + 5 - ord("a")) % 26) + ord("a")) for ch in s])

def decode_shift(s: str):
    """
    takes as input string encoded with encode_shift function. Returns decoded string.
    """
",
"Plan": "1. Create a function decode_shift(s: str) that takes an encoded string as input.
2. Iterate through each character in the input string.
3. For each character, calculate the new shift value by subtracting 5 from 26.
4. Apply the new shift value to decode the character back to its original form.
5. Return the decoded string.",
"Instructions": [
"Review the provided solution plan for the given Python problem.",
"For each test case, begin by recording all necessary intermediate variables.",
"As you process the plan, update any intermediate variable values.",
"After executing the verification steps for a test case, compare the derived result with the expected test output.",
"If the derived result matches the expected output, output the plan as the Final plan and mark it as correct.",
"If the derived result does not match, provide a revised solution plan as the Final plan."
],
"Output Format": [
"Verification": "A detailed, step-by-step verification process.",
"Correctness": "The given solution plan is correct or not (True/False)."
"Final plan": "The final plan you provide for the task."
],
"Remember": "Strictly follow the JSON Output Format and include only the verification process and final plan. Do not include any extra explanations or words."
================================ Assistant =================================
"Verification":
"Test Case 1: s = 'fghij'"
"Intermediate variables: None"
"1. Iterating through each character in 'fghij':"
"- For 'f': ((ord('f') + 5 - ord('a')) % 26) + ord('a') = ((102 + 5 - 97) % 26) + 97 = 106 -> 'j'"
"- For 'g': ((ord('g') + 5 - ord('a')) % 26) + ord('a') = ((103 + 5 - 97) % 26) + 97 = 107 -> 'k'"
"- For 'h': ((ord('h') + 5 - ord('a')) % 26) + ord('a') = ((104 + 5 - 97) % 26) + 97 = 108 -> 'l'"
"- For 'i': ((ord('i') + 5 - ord('a')) % 26) + ord('a') = ((105 + 5 - 97) % 26) + 97 = 109 -> 'm'"
"- For 'j': ((ord('j') + 5 - ord('a')) % 26) + ord('a') = ((106 + 5 - 97) % 26) + 97 = 110 -> 'n'"
"2. Decoded string: 'jklmn'"
"Expected output: 'fghij'"
"Derived result matches the expected output."
"Final plan: Given solution plan is correct.",
"Correctness": true,

Figure 25: An example of Plan Verifying Agent.

================================== User ==================================
"Persona": "Your primary role is to generate Python 3 code to solve the given coding problem based on the given plan and the problem description.",
"Problem": "
def encode_shift(s: str):
    """
    returns encoded string by shifting every character by 5 in the alphabet.
    """
    return "".join([chr(((ord(ch) + 5 - ord("a")) % 26) + ord("a")) for ch in s])

def decode_shift(s: str):
    """
    takes as input string encoded with encode_shift function. Returns decoded string.
    """
",
"Plan": "1. Create a function decode_shift(s: str) that takes an encoded string as input.
2. Iterate through each character in the input string.
3. For each character, calculate the new shift value by subtracting 5 from 26.
4. Apply the new shift value to decode the character back to its original form.
5. Return the decoded string.",
"Plan verification": "Test Case 1: s = 'fghij'"
"Intermediate variables: None"
"1. Iterating through each character in 'fghij':"
"- For 'f': ((ord('f') + 5 - ord('a')) % 26) + ord('a') = ((102 + 5 - 97) % 26) + 97 = 106 -> 'j'"
"- For 'g': ((ord('g') + 5 - ord('a')) % 26) + ord('a') = ((103 + 5 - 97) % 26) + 97 = 107 -> 'k'"
"- For 'h': ((ord('h') + 5 - ord('a')) % 26) + ord('a') = ((104 + 5 - 97) % 26) + 97 = 108 -> 'l'"
"- For 'i': ((ord('i') + 5 - ord('a')) % 26) + ord('a') = ((105 + 5 - 97) % 26) + 97 = 109 -> 'm'"
"- For 'j': ((ord('j') + 5 - ord('a')) % 26) + ord('a') = ((106 + 5 - 97) % 26) + 97 = 110 -> 'n'"
"2. Decoded string: 'jklmn'"
"Expected output: 'fghij'",
"Instructions": [
"Use the entry point decode_shift of the problem, do not add main function."
],
"Output Format": [
"Code": "The code you generate."
],
"Remember": "Start with "```Code" and end with "```". Write all code into a single .py file. Avoid extra explanation, words or """ in your output."
================================ Assistant =================================
```Code
def decode_shift(s: str):
    return "".join([chr(((ord(ch) - 5 - ord("a")) % 26) + ord("a")) for ch in s])
```

Figure 26: An example of Coding Agent.

================================== User ==================================
"Persona": "Your primary role is to add print statements to the given code to solve the problem.",
"Code": "
def decode_shift(s: str):
    return "".join([chr(((ord(ch) - 5 - ord("a")) % 26) + ord("a")) for ch in s])
",
"Plan verification": "Test Case 1: s = 'fghij'"
"Intermediate variables: None"
"1. Iterating through each character in 'fghij':"
"- For 'f': ((ord('f') + 5 - ord('a')) % 26) + ord('a') = ((102 + 5 - 97) % 26) + 97 = 106 -> 'j'"
"- For 'g': ((ord('g') + 5 - ord('a')) % 26) + ord('a') = ((103 + 5 - 97) % 26) + 97 = 107 -> 'k'"
"- For 'h': ((ord('h') + 5 - ord('a')) % 26) + ord('a') = ((104 + 5 - 97) % 26) + 97 = 108 -> 'l'"
"- For 'i': ((ord('i') + 5 - ord('a')) % 26) + ord('a') = ((105 + 5 - 97) % 26) + 97 = 109 -> 'm'"
"- For 'j': ((ord('j') + 5 - ord('a')) % 26) + ord('a') = ((106 + 5 - 97) % 26) + 97 = 110 -> 'n'"
"2. Decoded string: 'jklmn'"
"Expected output: 'fghij'",
"Instructions": [
"Incorporate debug print statements to trace intermediate variable changes as described in the plan verifications."
],
"Output Format": [
"Code": "The program with PRINT statements added to trace variable changes."
],
"Remember": "Start with "```Code" and end with "```". Write all code into a single .py file. Avoid extra explanation, words or """ in your output."
================================ Assistant =================================
```Code
def decode_shift(s: str):
    for ch in s:
        print(f"Character: {ch}")
        print(f"ord(ch): {ord(ch)}")
        print(f"ord('a'): {ord('a')}")
        print(f"((ord(ch)-5-ord('a'))%26): {((ord(ch)-5-ord('a'))%26)}")
        print(f"chr(((ord(ch)-5-ord('a'))%26)+ord('a')): {chr(((ord(ch)-5-ord('a'))%26)+ord('a'))}")
    return "".join([chr(((ord(ch) - 5 - ord("a")) % 26) + ord("a")) for ch in s])
```

Figure 27: An example of Embedding Trace Statement Agent.

================================== User ==================================
"Persona": "Your primary role is to provide an explanation for each line of a given Python program, describing the effect of each line.",
"Problem": "
def encode_shift(s: str):
    """
    returns encoded string by shifting every character by 5 in the alphabet.
    """
    return "".join([chr(((ord(ch) + 5 - ord("a")) % 26) + ord("a")) for ch in s])

def decode_shift(s: str):
    """
    takes as input string encoded with encode_shift function. Returns decoded string.
    """
",
"Code": "
def decode_shift(s: str):
    return "".join([chr(((ord(ch) - 5 - ord("a")) % 26) + ord("a")) for ch in s])
",
"Instructions": [
"You will receive a new problem description and a generated Python program intended to solve the problem.",
"Generate a detailed explanation for each line of the provided Python program.",
],
"Output Format": [
"Code explanation": "A detailed explanation for each line of the Python program. Each line's explanation should describe its effect on the program's behavior."
],
"Remember": "Strictly follow the JSON Output Format. Provide only the explanation for the Python program as specified, without extra explanation or additional words."
================================ Assistant =================================
"Code explanation":
"1": "Define a function named decode_shift that takes a string s as input.",
"2": "Return a string created by joining the characters generated by a list comprehension.",
"3": "For each character ch in the input string s, calculate the new character by shifting it back by 5 in the alphabet.",
"4": "Convert the shifted character back to its ASCII value using ord() function.",
"5": "Subtract 5 from the ASCII value of the character and subtract the ASCII value of 'a' to normalize the range.",
"6": "Take the modulo 26 of the result to handle wrapping around the alphabet.",
"7": "Add the ASCII value of 'a' to the result to get the final ASCII value of the decoded character.",
"8": "Convert the final ASCII value back to a character using chr() function.",
"9": "Return the decoded character and continue the process for each character in the input string."

Figure 28: An example of Code Explaining Agent.

====================================== User ======================================
"Persona": "Your primary role is to verify error execution traces, comparing the provided 'Error Execution Trace' with the 'Correct Plan Verification' for a Python problem and identify any discrepancies.",
"Problem": "
def encode_shift(s: str):
    """
    returns encoded string by shifting every character by 5 in the alphabet.
    """
    return "".join([chr(((ord(ch) + 5 - ord("a")) % 26) + ord("a")) for ch in s])

def decode_shift(s: str):
    """
    takes as input string encoded with encode_shift function. Returns decoded string.
    """
",
"Code with Error": "
def decode_shift(s: str):
    for ch in s:
        print(f"Character: {ch}")
        print(f"ord(ch): {ord(ch)}")
        print(f"ord('a'): {ord('a')}")
        print(f"((ord(ch)-5-ord('a'))%26): {((ord(ch)-5-ord('a'))%26)}")
        print(f"chr(((ord(ch)-5-ord('a'))%26)+ord('a')): {chr(((ord(ch)-5-ord('a'))%26)+ord('a'))}")
    return "".join([chr(((ord(ch) - 5 - ord("a")) % 26) + ord("a")) for ch in s])
",
"Correct Plan Verification": "Test Case 1: s = 'fghij'"
"Intermediate variables: None"
"1. Iterating through each character in 'fghij':"
"- For 'f': ((ord('f') + 5 - ord('a')) % 26) + ord('a') = ((102 + 5 - 97) % 26) + 97 = 106 -> 'j'"
"- For 'g': ((ord('g') + 5 - ord('a')) % 26) + ord('a') = ((103 + 5 - 97) % 26) + 97 = 107 -> 'k'"
"- For 'h': ((ord('h') + 5 - ord('a')) % 26) + ord('a') = ((104 + 5 - 97) % 26) + 97 = 108 -> 'l'"
"- For 'i': ((ord('i') + 5 - ord('a')) % 26) + ord('a') = ((105 + 5 - 97) % 26) + 97 = 109 -> 'm'"
"- For 'j': ((ord('j') + 5 - ord('a')) % 26) + ord('a') = ((106 + 5 - 97) % 26) + 97 = 110 -> 'n'"
"2. Decoded string: 'jklmn'"
"Expected output: 'fghij'",
"Error Execution Trace": "Error in test case: assert decode_shift(encode_shift('abc')) == 'abc'.
Status output:
Error in test case: assert decode_shift(encode_shift('abcabc')) == 'abcabc'.
Status output:
",
"Instructions": [
"The 'Error Execution Trace' is the output of the 'Code with Error' when it fails to meet the expected output."
"Compare the 'Error Execution Trace' with the 'Correct Plan Verification' and output the differences and your analysis of the errors.",
"Provide a summary of the errors, including reasons for the discrepancies and suggestions for corrections."
],
"Output Format": [
"Analysis": "Your output including a detailed discrepancy analysis."
"Refine suggestion": "Your analysis of the errors and suggestions how to correct them."
],
"Remember": "Strictly follow the JSON Output Format. Provide only the analysis as specified, without any extra explanations or words."
==================================== Assistant ====================================
"Analysis": "The error in the code lies in the calculation of the decoded character. The formula used to decode the character is incorrect, leading to discrepancies in the output. The code subtracts 5 from the character value and then takes the modulo 26, which is not the correct approach for decoding. This results in incorrect decoded characters and hence the mismatch in the output.",
"Refine suggestion": "To correct the error, the decoding formula should be adjusted to correctly reverse the encoding process. Instead of subtracting 5 and taking modulo 26, the code should add 21 (since 5 is the shift value and 26 - 5 = 21) and then take modulo 26 to get the correct decoded character. By making this adjustment, the decoding process will be accurate and match the expected output."

Figure 29: An example of Suggesting Agent.

================================== User ==================================
"Persona": "Your primary role is to correct an erroneous Python program based on the provided error analysis and related explanations."
"Problem": "
def encode_shift(s: str):
    """
    returns encoded string by shifting every character by 5 in the alphabet.
    """
    return "".join([chr(((ord(ch) + 5 - ord("a")) % 26) + ord("a")) for ch in s])

def decode_shift(s: str):
    """
    takes as input string encoded with encode_shift function. Returns decoded string.
""" ", "CodewithError":" defdecode_shift(s:str): return"".join([chr(((ord(ch)-5-ord("a"))%26)+ord("a"))forchins]) ", "CodeExplanation":"'1':'Defineafunctionnameddecode_shiftthattakesastringsasinput.','2':'Returna stringcreatedbyjoiningthecharactersgeneratedbyalistcomprehension.','3':'Foreachcharacterchinthe inputstrings,calculatethenewcharacterbyshiftingitbackby5inthealphabet.','4':'Converttheshifted characterbacktoitsASCIIvalueusingord()function.','5':"Subtract5fromtheASCIIvalueofthecharacterand subtracttheASCIIvalueof'a'tonormalizetherange.",'6':'Takethemodulo26oftheresulttohandle wrappingaroundthealphabet.','7':"AddtheASCIIvalueof'a'totheresulttogetthefinalASCIIvalueofthe decodedcharacter.",'8':'ConvertthefinalASCIIvaluebacktoacharacterusingchr()function.','9':'Returnthe decodedcharacterandcontinuetheprocessforeachcharacterintheinputstring.'", "RefineSuggestion":"Tocorrecttheerror,thedecodingformulashouldbeadjustedtocorrectlyreversethe encodingprocess.Insteadofsubtracting5andtakingmodulo26,thecodeshouldadd21(since5istheshift valueand26-5=21)andthentakemodulo26togetthecorrectdecodedcharacter.Bymakingthis adjustment,thedecodingprocesswillbeaccurateandmatchtheexpectedoutput." "Instructions":[ "Youwillbepresentedwithanewproblemdescription,thecodewitherror,itsexplanation,andanalysisof theerrorsandsuggestionshowtocorrectthem.", "GeneratetherefinedPythonprogrambasedonthe'RefineSuggestion'.", "OutputyourrefinedcodewithonlythePythoncode,andprovidearefinementexplanationthatdetailsthe modificationsmade." ], "OutputFormat":[ "RefinedExplanation":"Adetailedexplanationdescribingthemodificationsmadetothecode.", "Code":"Therefinedcodethataccordingtotheerroranalysis." ], "Remember":"Startwith"```Code"andendwith"```".Writeallcodeintoasingle.pyfile.Avoidextraexplanation, wordsor"""inyouroutput." 
================================Assistant=================================
```Code
def decode_shift(s: str):
    return "".join([chr(((ord(ch) + 21 - ord("a")) % 26) + ord("a")) for ch in s])
```
Refined Explanation:
To correct the error in the decoding function, I adjusted the formula to correctly reverse the encoding process. Instead of subtracting 5 and taking modulo 26, I added 21 (since 5 is the shift value and 26 - 5 = 21) and then took modulo 26 to get the correct decoded character. This adjustment ensures that the decoding process accurately matches the expected output.

Figure 30: An example of Debugging Agent.

==================================User==================================
"Persona": "Your primary role is to analyze code failures on sample tests and suggest an improved algorithm, method, or parameters."
"Instructions": [
    "Target problem:
def encode_shift(s: str):
    """
    returns encoded string by shifting every character by 5 in the alphabet.
    """
    return "".join([chr(((ord(ch) + 5 - ord("a")) % 26) + ord("a")) for ch in s])

def decode_shift(s: str):
    """
    takes as input string encoded with encode_shift function. Returns decoded string.
    """
.",
    "The given code is not working as expected: Error in test case: assert decode_shift(encode_shift('abc')) == 'abc'.
Status output:
Error in test case: assert decode_shift(encode_shift('abcabc')) == 'abcabc'.
Status output:
.",
    "Provide a new algorithm, method, or parameters to address the problem.",
    "The existing code is:
def decode_shift(s: str):
    for ch in s:
        print(f"Character: ch")
        print(f"ord(ch): ord(ch)")
        print(f"ord('a'): ord('a')")
        print(f"((ord(ch)-5-ord('a'))%26): ((ord(ch)-5-ord('a'))%26)")
        print(f"chr(((ord(ch)-5-ord('a'))%26)+ord('a')): chr(((ord(ch)-5-ord('a'))%26)+ord('a'))")
    return "".join([chr(((ord(ch) - 5 - ord("a")) % 26) + ord("a")) for ch in s])
."
],
"Output Format": [
    "Algorithm": "The proposed algorithm or method.",
    "Parameters": "The suggested parameters.",
    "Reason": "Explanation for choosing this algorithm, method, or parameters."
],
"Remember": "Strictly follow the JSON Output Format. Avoid additional explanations or text outside the format."
================================Assistant=================================
"Algorithm": "To decode the shifted string, we need to reverse the encoding process. We can achieve this by subtracting 5 from the ASCII value of each character and wrapping around if the result is less than the ASCII value of 'a'.",
"Parameters": "None",
"Reason": "This algorithm directly reverses the encoding process by undoing the shift of 5 positions in the alphabet. It ensures that the decoded string matches the original input."

Figure 31: An example of Debating Agent.

================================Assistant=================================
"Algorithm": "To decode the shifted string, we can modify the existing code by adding 5 to the ASCII value of each character and wrapping around if the result exceeds the ASCII value of 'z'.",
"Parameters": "None",
"Reason": "This algorithm directly reverses the encoding process by undoing the shift of 5 positions in the alphabet. It ensures that the decoded string matches the original input."

================================Assistant=================================
"Algorithm": "To decode the shifted string, we can modify the existing code by subtracting 5 from the ASCII value of each character and wrapping around if the result is less than the ASCII value of 'a'.",
"Parameters": "None",
"Reason": "This algorithm directly reverses the encoding process by undoing the shift of 5 positions in the alphabet. It ensures that the decoded string matches the original input."

Figure 32: An example of Debating Agent, following Figure 31.

==================================User==================================
"Persona": "Your primary role is to decide which algorithm and parameters to use on the problem based on the discussion."
"Instructions": [
    "Evaluate the performance of the algorithm and parameters for the problem
def encode_shift(s: str):
    """
    returns encoded string by shifting every character by 5 in the alphabet.
    """
    return "".join([chr(((ord(ch) + 5 - ord("a")) % 26) + ord("a")) for ch in s])

def decode_shift(s: str):
    """
    takes as input string encoded with encode_shift function. Returns decoded string.
    """
and decide whether to adopt them."
"ThegivencodeisnotworkingasexpectedErrorintestcase:assertdecode_shift(encode_shift('abc'))== 'abc'. Statusoutput: Errorintestcase:assertdecode_shift(encode_shift('abcabc'))=='abcabc'. Statusoutput: .", "Thediscussionis: "Algorithm":"Todecodetheshiftedstring,weneedtoreversetheencodingprocess.Wecanachievethis bysubtracting5fromtheASCIIvalueofeachcharacterandwrappingaroundiftheresultislessthantheASCII valueof'a'.", "Parameters":"None", "Reason":"Thisalgorithmdirectlyreversestheencodingprocessbyundoingtheshiftof5positionsinthe alphabet.Itensuresthatthedecodedstringmatchestheoriginalinput." , "Algorithm":"Todecodetheshiftedstring,wecanmodifytheexistingcodebyadding5totheASCIIvalue ofeachcharacterandwrappingaroundiftheresultexceedstheASCIIvalueof'z'.", "Parameters":"None", "Reason":"Thisalgorithmdirectlyreversestheencodingprocessbyundoingtheshiftof5positionsinthe alphabet.Itensuresthatthedecodedstringmatchestheoriginalinput." , "Algorithm":"Todecodetheshiftedstring,wecanmodifytheexistingcodebysubtracting5fromthe ASCIIvalueofeachcharacterandwrappingaroundiftheresultislessthantheASCIIvalueof'a'.", "Parameters":"None", "Reason":"Thisalgorithmdirectlyreversestheencodingprocessbyundoingtheshiftof5positionsinthe alphabet.Itensuresthatthedecodedstringmatchestheoriginalinput." ." ], "OutputFormat":[ "Algorithm":"Theselectedalgorithmormethod." "Parameters":"Thechosenparameters." ], "Remember":"StrictlyfollowtheJSONOutputFormat.Avoidadditionalexplanationsortextoutsidetheformat." ================================Assistant================================= "Algorithm":"Todecodetheshiftedstring,weneedtoreversetheencodingprocess.Wecanachievethis bysubtracting5fromtheASCIIvalueofeachcharacterandwrappingaroundiftheresultislessthantheASCII valueof'a'.", "Parameters":"None" Figure 33: An example of Discriminating Agent. 
==================================User==================================
"Instructions": [
    "Refine the existing code to match the expected output using the algorithm/method: "To decode the shifted string, we need to reverse the encoding process. We can achieve this by subtracting 5 from the ASCII value of each character and wrapping around if the result is less than the ASCII value of 'a'.", and parameters: "None" provided by the discriminator for improved performance.",
    "Target problem:
def encode_shift(s: str):
    """
    returns encoded string by shifting every character by 5 in the alphabet.
    """
    return "".join([chr(((ord(ch) + 5 - ord("a")) % 26) + ord("a")) for ch in s])

def decode_shift(s: str):
    """
    takes as input string encoded with encode_shift function. Returns decoded string.
    """
.",
    "The plan is: The provided solution plan for the encode_shift function is correct.",
    "Using the entry point decode_shift of the problem, do not add main function.",
    "Handle edge cases such as invalid inputs, empty values, or boundary conditions."
],
"Remember": "Start with "```Code" and end with "```". Write all code into a single .py file. Avoid extra explanation, words or """ in your output."

================================Assistant=================================
```python
def encode_shift(s: str):
    return "".join([chr(((ord(ch) + 5 - ord("a")) % 26) + ord("a")) for ch in s])

def decode_shift(s: str):
    return "".join([chr(((ord(ch) - 5 - ord("a")) % 26) + ord("a")) for ch in s])
```

Figure 34: An example of Coding Agent (Refine stage).
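The two refined decoders shown in the figures can be checked directly. The sketch below (the names `decode_shift_add21` and `decode_shift_sub5` are ours, introduced only to compare the two variants) verifies that both the Debugging Agent's `+21` refinement (Figure 30) and the Refine-stage `-5` version (Figure 34) round-trip correctly on lowercase input: for a positive modulus, Python's `%` always yields a non-negative result, so `(x - 5) % 26 == (x + 21) % 26`.

```python
def encode_shift(s: str) -> str:
    # Shift every lowercase character forward by 5, wrapping within a-z.
    return "".join(chr(((ord(ch) + 5 - ord("a")) % 26) + ord("a")) for ch in s)

def decode_shift_add21(s: str) -> str:
    # Figure 30's refinement: add 21 (i.e. 26 - 5) and take modulo 26.
    return "".join(chr(((ord(ch) + 21 - ord("a")) % 26) + ord("a")) for ch in s)

def decode_shift_sub5(s: str) -> str:
    # Figure 34's variant: subtract 5 directly. Python's % returns a
    # non-negative result for a positive modulus, so this wraps correctly too.
    return "".join(chr(((ord(ch) - 5 - ord("a")) % 26) + ord("a")) for ch in s)

for s in ("abc", "abcabc", "fghij", "xyz"):
    assert decode_shift_add21(encode_shift(s)) == s
    assert decode_shift_sub5(encode_shift(s)) == s
```

Both decoders therefore satisfy the failing sample tests `decode_shift(encode_shift('abc')) == 'abc'` and `decode_shift(encode_shift('abcabc')) == 'abcabc'` quoted in the transcripts above.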