
Paper deep dive

Unified-MAS: Universally Generating Domain-Specific Nodes for Empowering Automatic Multi-Agent Systems

Hehai Lin, Yu Yan, Zixuan Wang, Bo Xu, Sudong Wang, Weiquan Huang, Ruochen Zhao, Minzhi Li, Chengwei Qin

Year: 2026 · Venue: arXiv preprint · Area: cs.AI · Type: Preprint · Embeddings: 113

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%

Last extracted: 3/26/2026, 2:28:55 AM

Summary

Unified-MAS is a framework that decouples granular node implementation from topological orchestration in Multi-Agent Systems (MAS) by using an offline synthesis approach. It employs a two-stage process: Search-Based Node Generation, which retrieves external open-world knowledge to create domain-specific nodes, and Reward-Based Node Optimization, which uses a perplexity-guided reward mechanism to iteratively refine bottleneck nodes, improving performance and cost-efficiency in knowledge-intensive domains.

Entities (5)

Unified-MAS · framework · 100%
Automatic-MAS · system-paradigm · 98%
Reward-Based Node Optimization · methodology-stage · 95%
Search-Based Node Generation · methodology-stage · 95%
Gemini 3 Pro · llm · 90%

Relation Signals (3)

Search-Based Node Generation → partof → Unified-MAS

confidence 100% · Unified-MAS operates in two stages: (1) Search-Based Node Generation

Reward-Based Node Optimization → partof → Unified-MAS

confidence 100% · Unified-MAS operates in two stages: ... and (2) Reward-Based Node Optimization

Unified-MAS → improves → Automatic-MAS

confidence 95% · integrating Unified-MAS into four Automatic-MAS baselines yields a better performance-cost trade-off
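The relation signals above are small knowledge-graph triples. A minimal sketch of how they could be queried in plain Python, mirroring the first Cypher suggestion below (the triple list is taken verbatim from this page; the function name is illustrative, not part of the extraction pipeline):

```python
# Relation signals as (subject, predicate, object) triples, copied from above.
TRIPLES = [
    ("Search-Based Node Generation", "partof", "Unified-MAS"),
    ("Reward-Based Node Optimization", "partof", "Unified-MAS"),
    ("Unified-MAS", "improves", "Automatic-MAS"),
]

def parts_of(framework: str) -> list[str]:
    """List the stages (partof edges) pointing at a given framework."""
    return [s for s, p, o in TRIPLES if p == "partof" and o == framework]
```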

Cypher Suggestions (2)

List the stages of the Unified-MAS framework. · confidence 95% · unvalidated

MATCH (s:MethodologyStage)-[:PART_OF]->(f:Framework {name: 'Unified-MAS'}) RETURN s.name

Find all frameworks that improve Automatic-MAS performance. · confidence 90% · unvalidated

MATCH (f:Framework)-[:IMPROVES]->(m:SystemParadigm {name: 'Automatic-MAS'}) RETURN f.name

Abstract

Abstract: Automatic Multi-Agent Systems (MAS) generation has emerged as a promising paradigm for solving complex reasoning tasks. However, existing frameworks are fundamentally bottlenecked when applied to knowledge-intensive domains (e.g., healthcare and law). They either rely on a static library of general nodes like Chain-of-Thought, which lack specialized expertise, or attempt to generate nodes on the fly. In the latter case, the orchestrator is not only bound by its internal knowledge limits but must also simultaneously generate domain-specific logic and optimize high-level topology, leading to a severe architectural coupling that degrades overall system efficacy. To bridge this gap, we propose Unified-MAS, which decouples granular node implementation from topological orchestration via offline node synthesis. Unified-MAS operates in two stages: (1) Search-Based Node Generation retrieves external open-world knowledge to synthesize specialized node blueprints, overcoming the internal knowledge limits of LLMs; and (2) Reward-Based Node Optimization utilizes a perplexity-guided reward to iteratively enhance the internal logic of bottleneck nodes. Extensive experiments across four specialized domains demonstrate that integrating Unified-MAS into four Automatic-MAS baselines yields a better performance-cost trade-off, achieving up to a 14.2% gain while significantly reducing costs. Further analysis reveals its robustness across different designer LLMs and its effectiveness on conventional tasks such as mathematical reasoning.

Tags

ai-safety (imported, 100%) · csai (suggested, 92%) · preprint (suggested, 88%)

Links

Open PDF directly →

Full Text

112,413 characters extracted from source content.


Unified-MAS: Universally Generating Domain-Specific Nodes for Empowering Automatic Multi-Agent Systems

Hehai Lin, Yu Yan, Zixuan Wang, Bo Xu, Sudong Wang, Weiquan Huang, Ruochen Zhao, Minzhi Li, Chengwei Qin

The Hong Kong University of Science and Technology (Guangzhou); Nanyang Technological University; National University of Singapore; Institute for Infocomm Research (I²R), A*STAR

Abstract

Automatic Multi-Agent Systems (MAS) generation has emerged as a promising paradigm for solving complex reasoning tasks. However, existing frameworks are fundamentally bottlenecked when applied to knowledge-intensive domains (e.g., healthcare and law). They either rely on a static library of general nodes like Chain-of-Thought, which lack specialized expertise, or attempt to generate nodes on the fly. In the latter case, the orchestrator is not only bound by its internal knowledge limits but must also simultaneously generate domain-specific logic and optimize high-level topology, leading to a severe architectural coupling that degrades overall system efficacy. To bridge this gap, we propose Unified-MAS, which decouples granular node implementation from topological orchestration via offline node synthesis. Unified-MAS operates in two stages: (1) Search-Based Node Generation retrieves external open-world knowledge to synthesize specialized node blueprints, overcoming the internal knowledge limits of LLMs; and (2) Reward-Based Node Optimization utilizes a perplexity-guided reward to iteratively enhance the internal logic of bottleneck nodes. Extensive experiments across four specialized domains demonstrate that integrating Unified-MAS into four Automatic-MAS baselines yields a much better performance-cost trade-off, achieving up to a 14.2% gain while significantly reducing costs.
Further analysis reveals its robustness across different designer LLMs and its generalizability to general domains such as mathematical reasoning. Code is available at https://github.com/linhh29/Unified-MAS.

*Corresponding author: chengweiqin@hkust-gz.edu.cn

1 Introduction

The rapid evolution of Large Language Models (LLMs) has transformed the landscape of Artificial Intelligence (Ferrag et al., 2025; Xu et al., 2025a; Huang et al., 2026). Building upon this foundation, LLM-based Multi-Agent Systems (MAS) have emerged as a powerful paradigm, demonstrating superior capabilities by leveraging collaborative intelligence (Lin et al., 2025; Wu et al., 2025). Traditionally, designing effective MAS required meticulous manual engineering by human experts (Wang et al., 2022; Shinn et al., 2023). Recently, the community has experienced a paradigm shift towards automatic MAS generation (Ye et al., 2025; Tran et al., 2025). By utilizing techniques such as graph neural networks or code-based optimization, Automatic-MAS can discover novel agentic workflows that often surpass human-designed solutions on general-purpose benchmarks (Ke et al., 2025a).

Despite these advancements, a significant limitation persists: the severe performance degradation of Automatic-MAS in specialized, knowledge-intensive domains (Hong et al., 2023; Xu et al., 2025b). As illustrated in Figure 1(a), our preliminary study reveals that when applied to domains requiring specialized expertise (e.g., legal judgment or clinical diagnosis), they consistently underperform compared to manually crafted, domain-specific MAS. This performance gap stems from the fact that most Automatic-MAS rely on a static set of general-purpose nodes like Chain-of-Thought (CoT) (Wei et al., 2022) and Debate (Du et al., 2024). Lacking specialized priors, the orchestrator tends to merely stack general nodes, failing to capture the nuanced requirements for expert-level tasks (Li et al., 2024; Wang et al., 2025b).
Recent works have attempted to explore dynamic node generation, prompting the orchestrator to invent new sub-agents on the fly (Zhang et al., 2025b; Ruan et al., 2026). However, these approaches suffer from two fundamental flaws. First, they are bound by the internal knowledge limits of the LLM. Without grounding in external, domain-specific data (e.g., legal statutes or clinical protocols), the LLM inevitably hallucinates superficial or erroneous node logic (Huang et al., 2025). Second, it introduces a severe architectural coupling. Burdening the orchestrator with the granular implementation of micro-level domain logic distracts and dilutes its primary capability: managing macro-level topological connectivity (Ke et al., 2026).

arXiv:2603.21475v1 [cs.AI] 23 Mar 2026

[Figure 1: Overview of MAS paradigms. (a) Performance degradation in specialized domains, where Automatic-MAS with predefined nodes underperforms manual MAS. (b)-(c) Comparison of existing Automatic-MAS paradigms, illustrating the dichotomy between dynamic node generation and topological flexibility. (d) Unified-MAS leverages open-world knowledge to generate domain-specific nodes, effectively empowering existing Automatic-MAS.]

To address these challenges, we propose Unified-MAS, a novel framework that advocates for the decoupling of granular node implementation from topological orchestration. As an offline synthesizer, Unified-MAS generates domain-specific nodes for any domain that can be seamlessly integrated into any existing Automatic-MAS. Specifically, Unified-MAS contains two stages: (1) Search-Based Node Generation: Unified-MAS first extracts multi-dimensional keywords from task samples and synthesizes targeted queries. Then, to overcome parametric knowledge limitations, it retrieves external open-world knowledge across diverse sources (i.e., Google, GitHub, and Google Scholar) to distill domain-specific design principles, generating an initial set of specialized nodes. (2) Reward-Based Node Optimization: Initially generated nodes, while functionally relevant, are often coarse-grained and logically brittle, which may trigger compounding errors in a multi-agent scenario. We introduce a node optimization mechanism driven by a perplexity-guided reward. By quantifying the stability and magnitude of reasoning progress contributed by each node, Unified-MAS identifies bottleneck nodes and iteratively refines their internal implementation (e.g., refining prompt constraints or adding necessary sub-agent calls).
We comprehensively evaluate Unified-MAS on four highly specialized benchmarks, i.e., TravelPlanner for constrained travel planning (Xie et al., 2024), HealthBench for healthcare (Arora et al., 2025), J1Bench for legal judgment (Jia et al., 2025), and DeepFund for financial decision-making (Li et al., 2025). We integrate the generated nodes into four general Automatic-MAS baselines, MAS-Zero (Ke et al., 2025b), AFlow (Zhang et al., 2024), ScoreFlow (Wang et al., 2025c), and MAS² (Wang et al., 2025a), and evaluate the system with four different LLMs as orchestrators. The evaluations reveal several key findings: (1) Dual Advantage in Performance and Cost. Unified-MAS consistently drives performance gains, achieving up to a 14.2% gain, while simultaneously reducing costs. This underscores the critical role of domain-specific priors, positioning our framework as a universal catalyst for elevating general Automatic-MAS into expert-level systems. (2) Strong Robustness and Generalizability. Unified-MAS not only exhibits robust performance across various designer LLMs but also generalizes seamlessly to general domains like mathematics. (3) Efficacy of Perplexity-Guided Optimization. The synthesized nodes progressively improve through reward-based optimization, effectively strengthening their logical reliability in complex domains.

Our main contributions are summarized as follows:

• We identify the limitations of Automatic-MAS in specialized domains and propose a new paradigm that decouples granular node implementation from topology orchestration.
• We propose Unified-MAS, which leverages external retrieval to synthesize specialized nodes, and employs perplexity-guided reward optimization to improve their internal logic.
• Our extensive experiments demonstrate that Unified-MAS consistently improves the performance of existing Automatic-MAS while reducing costs across complex domains.
2 Related Work

2.1 Automatic-MAS with Pre-defined Nodes

The most prevalent methods construct Multi-Agent Systems (MAS) using a static archive of pre-defined nodes, which consists of manually designed structures, such as CoT, CoT-SC (Wang et al., 2022), and self-reflection (Madaan et al., 2023; He et al., 2025), where each node functions as an agent (Xi et al., 2025). The orchestrator's role is to determine the optimal topological connections between these nodes to form a cohesive problem-solving architecture (Chen et al., 2024). Research in this area is further divided into inference-time and training-time methods.

Inference-time approaches rely on sophisticated prompting and iterative search without updating model weights. For example, ADAS represents the MAS as code and iteratively generates new architectures using a Meta Agent Search on a validation set (Hu et al., 2024). AFlow employs Monte Carlo Tree Search (MCTS) to discover effective agentic workflows (Zhang et al., 2024), while DyLAN enables multi-round interactions with dynamic agent selection and early-stopping mechanisms to enhance efficiency (Liu et al., 2023). MAS-Zero introduces a self-reflective feedback loop, allowing the orchestrator to optimize the MAS without requiring an external validation set (Ke et al., 2025b).

Training-time approaches optimize the orchestrator to generate high-quality MAS in one shot by learning from generated trajectories. ScoreFlow utilizes Score-DPO, a variant of direct preference optimization, to incorporate quantitative feedback into the orchestrator's training (Wang et al., 2025c). MAS² learns a self-generative, self-configuring, and self-rectifying workflow (Wang et al., 2025a), while MAS-Orchestra models MAS construction as a function-calling task optimized via Group Relative Policy Optimization (GRPO) (Ke et al., 2026). However, a critical limitation of these methods is their reliance on a static set of general-purpose nodes.
As demonstrated in Figure 1, when applied to specialized domains, their performance often lags behind manually crafted domain-specific MAS due to the lack of expert knowledge.

2.2 Automatic-MAS with Dynamic Nodes

To address the rigidity of pre-defined archives, the community has recently turned to dynamic node generation, where the orchestrator attempts to introduce new nodes on the fly based on task requirements. MetaAgent first identifies and implements necessary nodes before optimizing the system using Finite State Machines (Zhang et al., 2025b). EvoAgent serves as a generic method to automatically extend expert agents into MAS via evolutionary algorithms (Yuan et al., 2025). Similarly, AOrchestra abstracts nodes into a tuple of ⟨Instruction, Context, Tools, Model⟩, enabling the orchestrator to dynamically populate these slots following task decomposition (Ruan et al., 2026). While promising, these approaches are constrained by the orchestrator's internal knowledge. If the necessary domain expertise is absent during the orchestrator's pre-training, the system is prone to hallucinations, resulting in ineffective or erroneous nodes (Valmeekam et al., 2022; Ji et al., 2023). Furthermore, recent observations suggest that an effective orchestrator should prioritize architectural connectivity rather than the granular implementation of individual nodes (Ke et al., 2026).

In this paper, we introduce Unified-MAS, a two-stage workflow designed to generate domain-specific nodes, which can be seamlessly integrated into existing Automatic-MAS frameworks. This integration injects essential domain knowledge into the system while liberating the orchestrator from the burden of node design, thereby allowing it to fully leverage its search capabilities to optimize the topological structure of the MAS.

3 Methodology

As illustrated in Figure 2, Unified-MAS introduces a new paradigm by acting as an offline node synthesizer prior to the Automatic-MAS topological search.
This design bridges the gap between general automatic orchestration and domain specificity through a highly decoupled two-stage pipeline: (1) Search-Based Node Generation, which overcomes parametric knowledge limits, and (2) Reward-Based Node Optimization, which improves the internal reasoning logic of individual nodes.

3.1 Problem Formulation

Existing Automatic-MAS approaches typically frame the system design as a search problem over a topology space Ω using a static library of predefined, general-purpose nodes V_fix. Let M represent a MAS configuration defined by its topological structure G ∈ Ω and the selection of functional nodes V ⊆ V_fix. The objective is to identify the optimal configuration M* that maximizes the expected performance metric R (e.g., accuracy) over the data distribution D:

M* = argmax_{G ∈ Ω, V ⊆ V_fix} E_{x∼D} [R(M(x; G, V))]   (1)

This formulation inherently limits the solution space to combinations of generic reasoning nodes in V_fix. Unified-MAS addresses this limitation by expanding the search space from a static V_fix to a dynamically domain-adaptive set V_domain.

[Figure 2: Illustration of Unified-MAS. (a) Search-Based Node Generation retrieves external knowledge via keyword-strategy driven queries to initialize V_init. These nodes are subsequently fed into (b) Reward-Based Node Optimization, which iteratively identifies and refines bottleneck nodes guided by a perplexity-based reward. Finally, Unified-MAS generates (c) a domain-specific node set, which can be integrated into existing Automatic-MAS.]

3.2 Search-Based Node Generation

Multi-Dimensional Keyword Extraction. To construct V_domain, we first sample N examples from a validation set D_val to form a context buffer C. We prompt the LLM to analyze C and extract keywords across seven dimensions. This granular decomposition ensures no critical aspect of the domain is overlooked.
(1) Domain: the macro-industry context (e.g., Fintech); (2) Task: the core technical problem (e.g., decision-making); (3) Entities: the specific data entities such as company news; (4) Actions: the operations or methods performed on these entities; (5) Constraints: task requirements such as low latency; (6) Desired Outcomes: the target metrics (e.g., accuracy); and (7) Implicit Knowledge: latent expert intuitions that are not explicitly stated but are essential for success.

Strategy-Driven Query Synthesis. We then synthesize these seven dimensions into four targeted search strategies, each designed to retrieve a specific layer of system design knowledge: (1) Strategy A (Background Knowledge): combining Domain and Implicit Knowledge to retrieve background information and survey papers; (2) Strategy B (System Architecture): combining Task and Constraints to search for architectural patterns that satisfy specific requirements; (3) Strategy C (Code Implementation): combining Entities and Actions to locate repositories for libraries handling specific data types from GitHub; and (4) Strategy D (Evaluation): combining Task and Desired Outcomes to identify standard benchmarks and evaluation metrics for this specific domain.

Knowledge Aggregation and Node Generation. Finally, we perform multi-turn search (Zhao et al., 2026) using appropriate search engines, and aggregate the retrieved content into strategy-specific summaries. Based on these summaries and guided by a node generation prompt, the LLM generates an initial node set V_init = {v_1, ..., v_m}, where each node v_i represents a domain-specific agent including its system prompts and tool specifications.

3.3 Reward-Based Node Optimization

Although the initial nodes in V_init successfully capture essential domain priors, possessing knowledge does not equal robust reasoning.
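As a minimal illustration of the Strategy-Driven Query Synthesis of Section 3.2: the four strategies below follow the paper, while the AND-joined query template is an assumption modeled on the example queries shown in Figure 2, and the function name is hypothetical.

```python
# Each strategy combines two of the seven keyword dimensions (per Section 3.2).
STRATEGIES = {
    "A (Background Knowledge)": ("Domain", "Implicit Knowledge"),
    "B (System Architecture)": ("Task", "Constraints"),
    "C (Code Implementation)": ("Entities", "Actions"),
    "D (Evaluation)": ("Task", "Desired Outcomes"),
}

def synthesize_queries(keywords: dict[str, list[str]]) -> dict[str, str]:
    """Combine extracted keyword dimensions into one search query per strategy.
    The quoted, AND-joined template is an assumption, not the paper's prompt."""
    queries = {}
    for strategy, dims in STRATEGIES.items():
        terms = [t for d in dims for t in keywords.get(d, [])]
        queries[strategy] = " AND ".join(f'"{t}"' for t in terms)
    return queries
```

For example, legal-domain keywords would yield a Strategy A query combining the domain with its implicit knowledge sources.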
The preliminary nature of their generation often leaves their internal implementation superficial, struggling to handle the nuanced logic required for expert-level tasks. Without iterative refinement, these unstable reasoning mechanics can easily bottleneck the overall system efficacy. Therefore, to transition these nodes from coarse blueprints into reliable operators, we formulate MAS execution as trajectory reasoning, assign a reward to each node, and optimize the bottleneck node with the lowest reward.

Although some nodes are logically parallel, their outputs can be treated as being sequentially appended to the MAS output during execution. Let a reasoning trajectory be a sequence of states τ = h_0, h_1, ..., h_m generated by the sequential execution of nodes v_1, ..., v_m. Here, h_0 represents the empty context before any node execution, while h_t (for t ≥ 1) denotes the output generated by node v_t. The accumulated context after executing node v_t is defined as the concatenation of all preceding outputs: A_t = [h_0, h_1, ..., h_t].

To evaluate the effectiveness of each node, we measure how well the accumulated reasoning trajectory predicts the ground-truth answer y. Specifically, we compute the perplexity of generating y given the input question q and the accumulated context A_t under an LLM P_θ:

PPL(y | q, A_t) = exp(−(1/|y|) Σ_{j=1}^{|y|} log P_θ(y_j | q, A_t))   (2)

Based on this definition, we derive an objective function J by maximizing the negative log-perplexity, which reflects the predictability of the answer y given the accumulated reasoning steps:

J(P_θ, y, q, A_t) = −log(PPL(y | q, A_t)) = (1/|y|) Σ_{j=1}^{|y|} log P_θ(y_j | q, A_t)   (3)

A higher J corresponds to lower perplexity, indicating that the sequence of reasoning steps up to node v_t has effectively reduced the model's uncertainty and guided the system closer to the correct solution.
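Given token-level log-probabilities of the answer tokens from the Executor LLM, Eqs. (2)-(3) reduce to a mean log-probability. A minimal numeric sketch (the log-probabilities here are placeholder floats, not real model outputs):

```python
import math

def objective_J(token_logprobs: list[float]) -> float:
    """Eq. (3): J = -log PPL = mean log-prob of the answer tokens y_j,
    conditioned on the question q and accumulated context A_t."""
    return sum(token_logprobs) / len(token_logprobs)

def perplexity(token_logprobs: list[float]) -> float:
    """Eq. (2): PPL = exp(-J)."""
    return math.exp(-objective_J(token_logprobs))
```

For instance, if every answer token has probability 0.5, then J = log 0.5 and PPL = 2.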
To standardize evaluation across different queries, we define J_0 as the predictability of the answer using the model's direct inference capability, i.e., with an empty context A_0: J_0 = J(P_θ, y, q).

To optimize nodes based on the objective defined above, we evaluate each node from two complementary perspectives: utility and stability. An effective node should provide a reasoning path that is not only impactful (yielding a considerable gain) but also consistent (avoiding erratic fluctuations) (Liu et al., 2025b). We therefore introduce two scores to assess the quality of node v_t.

Improvement Score (S_i,t). It measures the relative gain in the objective compared to the baseline J_0, reflecting the strength of the node's contribution. Formally,

S_i,t = tanh(δ(P_θ, y, q, A_t) + 1)   (4)

δ(P_θ, y, q, A_t) = (J(P_θ, y, q, A_t) − J_0) / J_0   (5)

where δ(P_θ, y, q, A_t) represents the normalized improvement over direct inference. The tanh function is used to smooth outliers and bound the score.

Consistency Score (S_c,t). It assesses the stability of the reasoning process. To measure whether the benefit improves consistently as reasoning depth increases, we compute Kendall's Tau correlation coefficient (Kendall, 1938) between the sequence of objective values J_1, ..., J_t and their corresponding step indices. The consistency score is:

S_c,t = (2 / (t(t−1))) Σ_{1 ≤ i < j ≤ t} sgn(J_i − J_j) · sgn(i − j)   (6)

where sgn(·) denotes the signum function. A higher S_c,t indicates a more stable reasoning trajectory where the objective improves consistently with increasing reasoning depth.

The Node Quality Score (S_t) is computed as a weighted combination of the improvement and consistency scores:

S_t = (1 − α) S_i,t + α S_c,t   (7)

where α is a balancing hyperparameter.
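A compact sketch of the node-scoring machinery of Eqs. (4)-(7), together with the incremental reward and bottleneck selection of Eqs. (8)-(9) introduced next. Function names, the toy node names, and the default α = 0.5 are illustrative assumptions; the paper treats α as a hyperparameter.

```python
import math

def sgn(x: float) -> int:
    """Signum function used in Eq. (6)."""
    return (x > 0) - (x < 0)

def improvement_score(J_t: float, J_0: float) -> float:
    """Eqs. (4)-(5): tanh-bounded relative gain over direct inference J_0."""
    return math.tanh((J_t - J_0) / J_0 + 1)

def consistency_score(J_seq: list[float]) -> float:
    """Eq. (6): Kendall's tau between objective values J_1..J_t and step indices."""
    t = len(J_seq)
    if t < 2:
        return 0.0
    total = sum(sgn(J_seq[j] - J_seq[i]) * sgn(j - i)
                for i in range(t) for j in range(i + 1, t))
    return 2.0 * total / (t * (t - 1))

def node_quality(J_seq: list[float], J_0: float, alpha: float = 0.5) -> float:
    """Eq. (7): weighted blend of improvement and consistency (alpha assumed)."""
    return (1 - alpha) * improvement_score(J_seq[-1], J_0) \
        + alpha * consistency_score(J_seq)

def rewards(S_seq: list[float]) -> list[float]:
    """Eq. (8): r_1 = S_1, and r_t = S_t - S_{t-1} for t > 1."""
    return [S_seq[0]] + [S_seq[t] - S_seq[t - 1] for t in range(1, len(S_seq))]

def bottleneck(avg_reward: dict[str, float]) -> str:
    """Eq. (9): the node with the lowest average reward over D_val."""
    return min(avg_reward, key=avg_reward.__getitem__)
```

A monotonically improving trajectory of J values yields a consistency score of 1, while the node whose incremental reward averages lowest across validation samples is flagged for refinement.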
Based on this score, we define the perplexity-guided reward of node v_t as the incremental gain in node quality:

r_t = S_t − S_{t−1} if t > 1, and r_t = S_t if t = 1   (8)

To refine node implementations, we perform optimization for K epochs on the validation set D_val. In each epoch, we calculate the average reward r̄(v) for each node v ∈ V_init across all samples of D_val. The node with the lowest average reward is identified as the bottleneck node:

v* = argmin_{v ∈ V_init} r̄(v)   (9)

We then retrieve the samples where v* produces the lowest rewards and use them to refine its internal instructions or add additional LLM calls to maximize future rewards. Importantly, in each epoch, samples for which v* is not the lowest-reward node are excluded from the optimization process, ensuring targeted and stable refinement.

There are two types of LLMs in Unified-MAS. To distinguish them from the LLMs used in Automatic-MAS (the orchestrator), we denote them as the Designer and the Executor. The Designer is responsible for generating and optimizing domain-specific nodes. We employ Gemini-3-Pro as the default Designer due to its strong capabilities. The effect of different Designer models is further investigated in Section 5.2.1. The Executor executes nodes and collects trajectories to compute the perplexity-guided reward. Considering that this computation requires direct access to token-level logits, and for practical deployment, we employ Qwen3-Next-80B-A3B-Instruct as the default Executor.

4 Experimental Settings

Benchmarks and Evaluation Metrics. We select four benchmarks spanning different specialized domains. (1) TravelPlanner (Xie et al., 2024) for constrained planning. Performance is measured by accuracy. (2) HealthBench (Arora et al., 2025) for health diagnosis. Responses are scored against a rubric using an LLM-Judge. (3) J1Bench (Jia et al., 2025) simulates automatic legal adjudication.
The agent synthesizes conflicting testimonies to produce a final verdict, evaluated by an LLM- Judge under a unified standard. (4) DeepFund (Li et al., 2025) for stock market decision-making and evaluated by accuracy. All metrics are normalized to[0, 100%]. We report the average performance and the average cost (in USD $). Comprehensive dataset statistics are provided in Appendix C (Ta- ble 4). The detailed LLM-as-a-Judge prompts are cataloged in Appendix F (Figure 7). Baselines. We adopt three categories of MAS to ensure a comprehensive evaluation. (1) Spe- cific Manual MAS: PMC (Zhang et al., 2025a) for TravelPlanner, Diagnosis-MAS (Chen et al., 2025) for HealthBench, Court-MAS (Jia et al., 2025) for J1Bench, and DeepFund-MAS (Li et al., 2025) for DeepFund. These serve as the manual-design per- formance standard. (2) Automatic-MAS with Dy- namic Nodes: MetaAgent (Zhang et al., 2025b), EvoAgent (Yuan et al., 2025), and AOrches- tra (Ruan et al., 2026), which generate nodes on the fly during problem solving. (3) Automatic-MAS with Pre-defined Nodes: We benchmark against leading Automatic-MAS that rely on static nodes, i.e., AFlow (Zhang et al., 2024), MAS-Zero (Ke et al., 2025b), ScoreFlow (Wang et al., 2025c), and MAS 2 (Wang et al., 2025a). Importantly, we em- power these baselines by replacing their general nodes with the domain-specific node libraries gen- erated offline by Unified-MAS. Test Models.We deploy the same LLM for every component within the final Automatic-MAS setups for fair comparison. Our evaluation spans four dif- ferent models, including two closed-source models, Gemini-3-Flash (Team et al., 2023) and GPT-5- Mini (Singh et al., 2025), and two open-source models, Qwen3-Next-80B-A3B-Instruct (Team, 2025) and DeepSeek-V3.2 (Liu et al., 2025a). Key configurations and hyperparameters are doc- umented in Appendix D, and prompts for Unified- MAS are listed in Appendix F. 5 Results and Analysis 5.1 Main Results The Domain Barrier: Manual vs. 
Automatic-MAS. Table 1 shows that task-specific Manual MAS consistently outperforms Automatic-MAS baselines across nearly all settings. For example, with Gemini-3-Flash, Manual MAS achieves an average score of 40.99, significantly exceeding all Automatic-MAS baselines. This gap highlights the importance of domain expertise in complex tasks. Even with dynamic node generation, general-purpose orchestrators struggle to discover effective reasoning topologies without incorporating specialized knowledge.

Trap of Dynamic Node Generation. Methods attempting dynamic node generation (i.e., MetaAgent, EvoAgent, AOrchestra) exhibit flashes of potential but suffer from severe systemic instability. For example, while EvoAgent marginally surpasses Manual MAS on J1Bench (e.g., 41.82 vs. 40.00 with Gemini-3-Flash), these dynamic methods fail catastrophically on TravelPlanner, often performing worse than the Vanilla baseline.

Unified-MAS Improves Performance and Efficiency. As shown in Table 1, integrating the domain-specific node set generated by Unified-MAS substantially improves the performance of

Gemini-3-Flash:

| Method | TP | HB | J1 | DF | Avg.Perf↑ | Avg.Cost↓ |
|---|---|---|---|---|---|---|
| Vanilla | 38.33 | 26.36 | 33.25 | 16.82 | 28.69 | 2.484 |
| Manual MAS | 43.88 | 37.07 | 40.00 | 42.99 | 40.99 | 21.898 |
| MetaAgent | 41.11 | 29.67 | 37.40 | 19.63 | 31.95 | 4.116 |
| EvoAgent | 41.11 | 34.13 | 41.82 | 37.38 | 38.61 | 44.791 |
| AOrchestra | 40.00 | 28.98 | 34.16 | 22.43 | 31.39 | 6.856 |
| MAS-Zero | 40.61 | 31.30 | 39.53 | 35.94 | 36.85 | 132.179 |
| + Unified | 46.88 | 33.60 | 48.91 | 48.44 | 44.46 (+7.61) | 123.803 (-8.376) |
| AFlow | 39.44 | 35.54 | 34.03 | 37.38 | 36.60 | 32.462 |
| + Unified | 48.33 | 37.69 | 44.29 | 54.21 | 46.13 (+9.53) | 32.861 (+0.399) |
| ScoreFlow | 32.22 | 31.00 | 36.10 | 18.69 | 29.50 | 36.908 |
| + Unified | 39.44 | 32.37 | 44.55 | 50.47 | 41.71 (+12.21) | 29.071 (-7.837) |
| MAS² | 42.22 | 33.07 | 34.94 | 17.76 | 32.00 | 24.174 |
| + Unified | 48.89 | 35.09 | 46.25 | 49.06 | 44.82 (+12.82) | 14.819 (-9.355) |

GPT-5-Mini:

| Method | TP | HB | J1 | DF | Avg.Perf↑ | Avg.Cost↓ |
|---|---|---|---|---|---|---|
| Vanilla | 3.89 | 37.84 | 23.77 | 12.15 | 19.41 | 0.469 |
| Manual MAS | 20.00 | 43.19 | 37.66 | 34.58 | 33.86 | 17.164 |
| MetaAgent | 2.22 | 39.04 | 30.39 | 13.08 | 21.18 | 1.443 |
| EvoAgent | 3.89 | 41.40 | 36.49 | 12.15 | 23.48 | 17.498 |
| AOrchestra | 6.67 | 38.41 | 30.00 | 17.76 | 23.21 | 3.131 |
| MAS-Zero | 10.94 | 38.33 | 26.72 | 12.50 | 22.12 | 111.910 |
| + Unified | 23.44 | 40.21 | 30.94 | 28.12 | 30.68 (+8.56) | 44.011 (-67.899) |
| AFlow | 5.00 | 40.19 | 24.16 | 14.02 | 20.84 | 7.561 |
| + Unified | 14.44 | 49.97 | 41.82 | 38.97 | 36.30 (+15.46) | 7.734 (+0.173) |
| ScoreFlow | 6.67 | 41.57 | 30.52 | 9.35 | 22.03 | 5.914 |
| + Unified | 7.78 | 43.37 | 34.03 | 40.19 | 31.34 (+9.31) | 5.969 (+0.055) |
| MAS² | 3.89 | 42.41 | 32.34 | 15.89 | 23.63 | 3.368 |
| + Unified | 4.44 | 46.64 | 42.73 | 36.89 | 32.68 (+9.05) | 2.858 (-0.510) |

Qwen3-Next-80B-A3B-Instruct:

| Method | TP | HB | J1 | DF | Avg.Perf↑ | Avg.Cost↓ |
|---|---|---|---|---|---|---|
| Vanilla | 2.22 | 20.07 | 27.66 | 23.36 | 18.33 | 0.176 |
| Manual MAS | 8.89 | 27.27 | 35.97 | 32.71 | 26.21 | 5.867 |
| MetaAgent | 1.67 | 21.22 | 29.74 | 10.28 | 15.73 | 2.148 |
| EvoAgent | 1.67 | 14.03 | 38.04 | 23.36 | 19.28 | 6.711 |
| AOrchestra | 1.11 | 20.95 | 34.81 | 35.51 | 23.10 | 1.613 |
| MAS-Zero | 3.13 | 15.25 | 34.06 | 26.56 | 19.75 | 29.911 |
| + Unified | 9.40 | 20.90 | 40.16 | 33.90 | 26.09 (+6.34) | 18.939 (-10.972) |
| AFlow | 3.33 | 25.81 | 30.26 | 24.30 | 20.93 | 1.678 |
| + Unified | 5.56 | 32.46 | 37.40 | 56.08 | 32.88 (+11.95) | 1.665 (-0.013) |
| ScoreFlow | 5.00 | 24.31 | 35.71 | 32.64 | 24.42 | 4.585 |
| + Unified | 10.56 | 31.55 | 39.87 | 53.27 | 33.81 (+9.39) | 3.849 (-0.736) |
| MAS² | 5.00 | 27.10 | 32.47 | 30.84 | 23.85 | 2.000 |
| + Unified | 11.11 | 32.55 | 43.81 | 42.99 | 32.62 (+8.77) | 1.008 (-0.992) |

DeepSeek-V3.2:

| Method | TP | HB | J1 | DF | Avg.Perf↑ | Avg.Cost↓ |
|---|---|---|---|---|---|---|
| Vanilla | 8.33 | 23.51 | 31.69 | 26.17 | 22.43 | 0.244 |
| Manual MAS | 21.67 | 36.54 | 39.48 | 42.06 | 34.94 | 8.149 |
| MetaAgent | 0.56 | 23.84 | 32.47 | 28.97 | 21.46 | 2.586 |
| EvoAgent | 5.00 | 30.76 | 40.37 | 33.64 | 27.44 | 8.358 |
| AOrchestra | 4.44 | 26.94 | 36.88 | 25.23 | 23.37 | 2.198 |
| MAS-Zero | 9.38 | 25.26 | 36.41 | 31.25 | 25.58 | 36.804 |
| + Unified | 23.44 | 33.64 | 47.58 | 39.06 | 35.93 (+10.35) | 18.622 (-18.182) |
| AFlow | 12.22 | 25.56 | 38.44 | 36.45 | 28.17 | 7.773 |
| + Unified | 22.22 | 39.51 | 50.26 | 46.73 | 39.68 (+11.51) | 2.593 (-5.180) |
| ScoreFlow | 15.00 | 25.85 | 38.31 | 40.19 | 29.84 | 7.361 |
| + Unified | 18.89 | 34.60 | 54.81 | 45.79 | 38.52 (+8.68) | 5.924 (-1.437) |
| MAS² | 10.00 | 26.81 | 37.27 | 27.10 | 25.30 | 1.650 |
| + Unified | 17.22 | 32.41 | 52.86 | 38.32 | 35.20 (+9.90) | 1.338 (-0.312) |

Table 1: Quantitative comparison of Unified-MAS and baselines on four benchmarks. "+ Unified" rows (highlighted in blue in the original) indicate methods with domain-specific nodes generated by Unified-MAS. TP: TravelPlanner, HB: HealthBench, J1: J1Bench, DF: DeepFund. Avg. reports average performance and cost. Bold denotes the best result.
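The Avg.Perf column in Table 1 is the unweighted mean of the four benchmark scores (each normalized to [0, 100]). A small sketch (our own illustration, not the authors' evaluation code) reproduces two entries:

```python
def avg_perf(tp: float, hb: float, j1: float, df: float) -> float:
    """Unweighted mean over TP, HB, J1, DF, rounded to 2 decimals."""
    return round((tp + hb + j1 + df) / 4, 2)

# Vanilla with Gemini-3-Flash (first row of Table 1):
print(avg_perf(38.33, 26.36, 33.25, 16.82))  # 28.69

# Gain of AFlow + Unified over AFlow with GPT-5-Mini:
gain = avg_perf(14.44, 49.97, 41.82, 38.97) - avg_perf(5.00, 40.19, 24.16, 14.02)
print(round(gain, 2))  # 15.46
```

The "+x.xx" deltas attached to the "+ Unified" rows are exactly these differences of rounded averages.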
predefined Automatic-MAS while universally reducing costs. In terms of average performance, incorporating domain-specific nodes yields consistent improvements across all settings, with gains ranging from 6.0% (MAS-Zero with Qwen3-Next-80B-A3B-Instruct) to 14.2% (AFlow with GPT-5-Mini). Figure 3 further demonstrates that methods enhanced by Unified-MAS consistently achieve a superior performance-cost trade-off compared to both manual and unenhanced automatic baselines. By replacing inefficient general nodes with optimized domain-specific nodes, Unified-MAS enables the system to solve complex problems with fewer and more effective steps. These results confirm that Unified-MAS successfully bridges the gap, combining the reliability of expert nodes with the scalability of automated design.

Figure 3: Performance-cost trade-off (Avg.Cost in USD $ vs. Avg.Perf in %) averaged across four LLMs. Gray arrows illustrate Unified-MAS elevating baselines to higher performance at reduced costs.

5.2 Further Analysis

5.2.1 Robustness to Designer Choices

Table 2 reveals that Unified-MAS universally elevates baseline performance across all three Designers, demonstrating that Unified-MAS is highly robust to the choice of the "Designer LLM". Interestingly, we observe an architectural divergence based on the LLM's inherent preferences (Appendix E). Gemini models tend to synthesize concise, macro-level workflows (5–6 nodes), whereas GPT-5-Mini prefers micro-level granularity (about 10 nodes, decomposing complex nodes further).
Despite these distinct topological preferences, Unified-MAS is not bottlenecked by any single LLM, consistently driving substantial performance gains.

| Method | Designer | GPT-5-Mini Perf↑ | GPT-5-Mini Cost↓ | DeepSeek-V3.2 Perf↑ | DeepSeek-V3.2 Cost↓ |
|---|---|---|---|---|---|
| AFlow | – | 20.84 | 7.561 | 28.17 | 7.773 |
| + Unified | Gemini-3-Pro | 36.30 | 7.734 | 39.68 | 2.593 |
| + Unified | Gemini-3-Flash | 33.46 | 4.421 | 37.14 | 2.221 |
| + Unified | GPT-5-Mini | 35.84 | 7.093 | 35.49 | 2.125 |
| MAS² | – | 23.63 | 3.368 | 25.30 | 1.650 |
| + Unified | Gemini-3-Pro | 32.68 | 2.858 | 35.20 | 1.338 |
| + Unified | Gemini-3-Flash | 30.04 | 4.713 | 32.03 | 1.615 |
| + Unified | GPT-5-Mini | 30.57 | 5.987 | 32.55 | 2.204 |

Table 2: Robustness across different Designer LLMs.

| Method | GPT-5-Mini Perf↑ | GPT-5-Mini Cost↓ | DeepSeek-V3.2 Perf↑ | DeepSeek-V3.2 Cost↓ |
|---|---|---|---|---|
| Vanilla | 55.10 | 0.234 | 42.86 | 0.063 |
| MAS-Zero | 57.14 | 30.546 | 48.98 | 12.162 |
| + Unified | 59.18 (+2.04) | 19.351 (-11.195) | 53.06 (+4.08) | 8.899 (-3.263) |
| AFlow | 59.18 | 1.736 | 48.98 | 0.472 |
| + Unified | 67.35 (+8.17) | 3.209 (+1.473) | 55.10 (+6.12) | 0.718 (+0.246) |
| ScoreFlow | 57.14 | 0.701 | 44.90 | 0.305 |
| + Unified | 61.22 (+4.08) | 0.854 (+0.153) | 57.14 (+12.24) | 0.462 (+0.157) |
| MAS² | 63.27 | 1.133 | 51.02 | 0.434 |
| + Unified | 67.35 (+4.08) | 1.040 (-0.093) | 66.67 (+15.65) | 0.884 (+0.450) |

Table 3: Results of general Automatic-MAS with/without Unified-MAS on AIME24&25.

5.2.2 Generalizability to General Domains

While our main evaluation focuses on specialized domains, Table 3 extends the analysis to general domains (mathematical reasoning) using AIME 2024 and 2025 (MAA-Committees, 2025). Integrating Unified-MAS consistently improves performance across all baselines for both GPT-5-Mini and DeepSeek-V3.2. Although the gains are more modest than the substantial improvements observed in knowledge-intensive tasks, the results show that our framework can successfully synthesize reasonable, fine-grained mathematical nodes (see Appendix E), demonstrating broad applicability even in conventional reasoning tasks.

5.2.3 Successful Pattern

To understand this performance leap, we qualitatively compare the nodes generated by Unified-MAS against those from dynamic Automatic-MAS on J1Bench (Appendix E).
Dynamic methods like EvoAgent resort to a lazy ensemble approach, generating superficial nodes like "Expert1" and "Expert2" without true domain grounding. In sharp contrast, Unified-MAS synthesizes a highly structured, expert-level judicial pipeline. It explicitly divides reasoning into professional stages: "Legal_Element_Extractor", "Liability_Reasoning", and so on. As detailed in Appendix E, compared to the blind prompt-level voting of the original AFlow, the Unified-MAS-enhanced workflow ensures that every stage is traceable and legally grounded.

Figure 4: Epoch-wise performance dynamics during node optimization using Gemini-3-Pro as the Designer (AFlow and MAS² with GPT-5-Mini and DeepSeek-V3.2).

5.2.4 The Optimization Dynamics

Our reward-based node optimization reveals an important learning dynamic. As shown in Figure 4, the performance trajectory is non-monotonic. From our observation, during early epochs (0 to 5), the system repeatedly targets the most severe "bottleneck node". Updating this node temporarily disrupts established cross-node co-adaptations, causing short-term perturbation. However, once the bottleneck is sufficiently alleviated, the system shifts focus to other nodes. Consequently, performance rapidly recovers and converges to a sustained global optimum in the later epochs (6–10). These results indicate that our node optimization strategy effectively removes brittle internal logic while avoiding trapping the system in local optima.

6 Conclusion

In this work, we decouple granular node implementation from topology orchestration and propose Unified-MAS, which automatically synthesizes domain-specific nodes through external knowledge retrieval and iteratively refines them via a perplexity-guided reward.
Extensive experiments demonstrate that integrating our generated nodes into existing Automatic-MAS approaches universally enhances overall performance, yielding improvements of up to 14.2% while simultaneously reducing costs. Further analysis highlights the robustness of Unified-MAS across different Designer LLMs, demonstrates its generalizability to general domains, and confirms the critical role of the reward-based optimization stage. Moving forward, Unified-MAS can be broadly applied to virtually any specific domain to generate highly professional nodes, seamlessly bridging the gap between general Automatic-MAS and deep domain expertise for future scalable real-world applications.

Limitations

While Unified-MAS demonstrates significant efficacy, we acknowledge certain limitations that present exciting avenues for future research. Primarily, our current framework operates as an offline node-preparation phase, which restricts its immediate applicability in highly dynamic or extremely time-sensitive environments that necessitate real-time, on-the-fly node generation and adaptation. To transition towards fully online, adaptive synthesis, future work should proceed in two main directions. On one hand, it should focus on streamlining the generation pipeline, allowing the framework to rapidly create and adapt nodes directly. On the other hand, future systems could learn directly from live feedback, quickly adjusting nodes instead of relying on a long offline evaluation.

References

Rahul K Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, and 1 others. 2025. Healthbench: Evaluating large language models towards improved human health. arXiv preprint arXiv:2505.08775.

Shuaihang Chen, Yuanxing Liu, Wei Han, Weinan Zhang, and Ting Liu.
2024. A survey on llm-based multi-agent system: Recent advances and new frontiers in application. arXiv preprint arXiv:2412.17481.

Xi Chen, Huahui Yi, Mingke You, WeiZhi Liu, Li Wang, Hairui Li, Xue Zhang, Yingman Guo, Lei Fan, Gang Chen, and 1 others. 2025. Enhancing diagnostic capability with multi-agents conversational large language models. NPJ digital medicine, 8(1):159.

Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. 2024. Improving factuality and reasoning in language models through multiagent debate. In Forty-first international conference on machine learning.

Mohamed Amine Ferrag, Norbert Tihanyi, and Merouane Debbah. 2025. From llm reasoning to autonomous ai agents: A comprehensive review. arXiv preprint arXiv:2504.19678.

Jiayi He, Hehai Lin, Qingyun Wang, Yi R Fung, and Heng Ji. 2025. Self-correction is more than refinement: A learning framework for visual and language reasoning tasks. In Findings of the Association for Computational Linguistics: ACL 2025, pages 6405–6421.

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, and 1 others. 2023. Metagpt: Meta programming for a multi-agent collaborative framework. In The twelfth international conference on learning representations.

Shengran Hu, Cong Lu, and Jeff Clune. 2024. Automated design of agentic systems. arXiv preprint arXiv:2408.08435.

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and 1 others. 2025. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55.

Weiquan Huang, Zixuan Wang, Hehai Lin, Sudong Wang, Bo Xu, Qian Li, Beier Zhu, Linyi Yang, and Chengwei Qin. 2026. Ama: Adaptive memory via multi-agent collaboration. arXiv preprint arXiv:2601.20352.
Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. ACM computing surveys, 55(12):1–38.

Zheng Jia, Shengbin Yue, Wei Chen, Siyuan Wang, Yidong Liu, Zejun Li, Yun Song, and Zhongyu Wei. 2025. Ready jurist one: Benchmarking language agents for legal intelligence in dynamic environments. arXiv preprint arXiv:2507.04037.

Zixuan Ke, Fangkai Jiao, Yifei Ming, Xuan-Phi Nguyen, Austin Xu, Do Xuan Long, Minzhi Li, Chengwei Qin, Peifeng Wang, Silvio Savarese, and 1 others. 2025a. A survey of frontiers in llm reasoning: Inference scaling, learning to reason, and agentic systems. arXiv preprint arXiv:2504.09037.

Zixuan Ke, Yifei Ming, Austin Xu, Ryan Chin, Xuan-Phi Nguyen, Prathyusha Jwalapuram, Semih Yavuz, Caiming Xiong, and Shafiq Joty. 2026. Mas-orchestra: Understanding and improving multi-agent reasoning through holistic orchestration and controlled benchmarks. arXiv preprint arXiv:2601.14652.

Zixuan Ke, Austin Xu, Yifei Ming, Xuan-Phi Nguyen, Ryan Chin, Caiming Xiong, and Shafiq Joty. 2025b. Mas-zero: Designing multi-agent systems with zero supervision. arXiv preprint arXiv:2505.14996.

Maurice G Kendall. 1938. A new measure of rank correlation. Biometrika, 30(1-2):81–93.

Changlun Li, Yao Shi, Chen Wang, Qiqi Duan, Runke Ruan, Weijie Huang, Haonan Long, Lijun Huang, Nan Tang, and Yuyu Luo. 2025. Time travel is cheating: Going live with deepfund for real-time fund investment benchmarking. arXiv preprint arXiv:2505.11065.

Xinyi Li, Sai Wang, Siqi Zeng, Yu Wu, and Yi Yang. 2024. A survey on llm-based multi-agent systems: workflow, infrastructure, and challenges. Vicinagearth, 1(1):9.

Hehai Lin, Shilei Cao, Sudong Wang, Haotian Wu, Minzhi Li, Linyi Yang, Juepeng Zheng, and Chengwei Qin. 2025. Interactive learning for llm reasoning. arXiv preprint arXiv:2509.26306.
Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, and 1 others. 2025a. Deepseek-v3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556.

Junnan Liu, Hongwei Liu, Songyang Zhang, and Kai Chen. 2025b. Rectifying llm thought from lens of optimization. arXiv preprint arXiv:2512.01925.

Zijun Liu, Yanzhe Zhang, Peng Li, Yang Liu, and Diyi Yang. 2023. Dynamic llm-agent network: An llm-agent collaboration framework with agent team optimization. arXiv preprint arXiv:2310.02170.

MAA-Committees. 2025. Aime problems and solutions.

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, and 1 others. 2023. Self-refine: Iterative refinement with self-feedback. Advances in neural information processing systems, 36:46534–46594.

Jianhao Ruan, Zhihao Xu, Yiran Peng, Fashen Ren, Zhaoyang Yu, Xinbing Liang, Jinyu Xiang, Bang Liu, Chenglin Wu, Yuyu Luo, and 1 others. 2026. Aorchestra: Automating sub-agent creation for agentic orchestration. arXiv preprint arXiv:2602.03786.

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning. Advances in neural information processing systems, 36:8634–8652.

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, and 1 others. 2025. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267.

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, and 1 others. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.

Qwen Team. 2025. Qwen3 technical report. Preprint, arXiv:2505.09388.
Khanh-Tung Tran, Dung Dao, Minh-Duong Nguyen, Quoc-Viet Pham, Barry O'Sullivan, and Hoang D Nguyen. 2025. Multi-agent collaboration mechanisms: A survey of llms. arXiv preprint arXiv:2501.06322.

Karthik Valmeekam, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. 2022. Large language models still can't plan (a benchmark for llms on planning and reasoning about change). In NeurIPS 2022 Foundation Models for Decision Making Workshop.

Kun Wang, Guibin Zhang, ManKit Ye, Xinyu Deng, Dongxia Wang, Xiaobin Hu, Jinyang Guo, Yang Liu, and Yufei Guo. 2025a. MAS²: Self-generative, self-configuring, self-rectifying multi-agent systems. arXiv preprint arXiv:2509.24323.

Qian Wang, Tianyu Wang, Zhenheng Tang, Qinbin Li, Nuo Chen, Jingsheng Liang, and Bingsheng He. 2025b. Megaagent: A large-scale autonomous llm-based multi-agent system without predefined sops. In Findings of the Association for Computational Linguistics: ACL 2025, pages 4998–5036.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.

Yinjie Wang, Ling Yang, Guohao Li, Mengdi Wang, and Bryon Aragam. 2025c. Scoreflow: Mastering llm agent workflows via score-based preference optimization. arXiv preprint arXiv:2502.04306.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, and 1 others. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837.

Haotian Wu, Shufan Jiang, Mingyu Chen, Yiyang Feng, Hehai Lin, Heqing Zou, Yao Shu, and Chengwei Qin. 2025. Furina: A fully customizable role-playing benchmark via scalable multi-agent collaboration pipeline. arXiv preprint arXiv:2510.06800.
Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, and 1 others. 2025. The rise and potential of large language model based agents: A survey. Science China Information Sciences, 68(2):121101.

Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, and Yu Su. 2024. Travelplanner: A benchmark for real-world planning with language agents. arXiv preprint arXiv:2402.01622.

Fengli Xu, Qianyue Hao, Chenyang Shao, Zefang Zong, Yu Li, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, and 1 others. 2025a. Toward large reasoning models: A survey of reinforced reasoning with large language models. Patterns, 6(10).

Tianhan Xu, Ling Chen, Zhe Hu, and Bin Li. 2025b. Staf-llm: A scalable and task-adaptive fine-tuning framework for large language models in medical domain. Expert Systems with Applications, 281:127582.

Rui Ye, Shuo Tang, Rui Ge, Yaxin Du, Zhenfei Yin, Siheng Chen, and Jing Shao. 2025. Mas-gpt: Training llms to build llm-based multi-agent systems. arXiv preprint arXiv:2503.03686.

Siyu Yuan, Kaitao Song, Jiangjie Chen, Xu Tan, Dongsheng Li, and Deqing Yang. 2025. Evoagent: Towards automatic multi-agent generation via evolutionary algorithms. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6192–6217.

Cong Zhang, Xin Deik Goh, Dexun Li, Hao Zhang, and Yong Liu. 2025a. Planning with multi-constraints via collaborative language agents. In Proceedings of the 31st International Conference on Computational Linguistics, pages 10054–10082.

Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, and 1 others. 2024. Aflow: Automating agentic workflow generation. arXiv preprint arXiv:2410.10762.
Yaolun Zhang, Xiaogeng Liu, and Chaowei Xiao. 2025b. Metaagent: Automatically constructing multi-agent systems based on finite state machines. arXiv preprint arXiv:2507.22606.

Yubao Zhao, Weiquan Huang, Sudong Wang, Ruochen Zhao, Chen Chen, Yao Shu, and Chengwei Qin. 2026. Training multi-turn search agent via contrastive dynamic branch sampling. arXiv preprint arXiv:2602.03719.

A Description of Appendix

The appendix provides extended methodological details and comprehensive experimental data to further support the findings presented in the main manuscript. Appendix B presents the detailed pseudocode illustrating the algorithmic workflow of the proposed two-stage Unified-MAS. Appendix C provides exhaustive statistics and descriptive summaries of the diverse evaluation benchmarks, detailing dataset splitting protocols and the specific characteristics of each domain-specific task. Appendix D delineates the complete experimental setup, including the baselines and the implementation details. Appendix E offers a qualitative case study that compares the node generation of Unified-MAS against existing Automatic-MAS. Finally, Appendix F catalogs the comprehensive set of prompts utilized for Unified-MAS and our experiments.

B Pseudocode of Unified-MAS

We provide the pseudocode of Unified-MAS here.

C Statistics of Benchmarks

We split the entire dataset into a validation set and a test set because some Automatic-MAS need the validation set to sample the best multi-agent system. For fair comparison, all reported results are based on the test set. We randomly sample examples from these datasets to build the validation and test sets, as shown in Table 4.

TravelPlanner (Xie et al., 2024): This benchmark aims to evaluate the planning capabilities of language agents within complex, real-world travel scenarios.
It features 1,225 meticulously curated user intents, and the evaluation focuses on an agent's proficiency in multi-constraint reasoning and effective tool utilization, serving as a rigorous test of how models navigate intricate planning tasks and integrate disparate information to achieve actionable objectives.

HealthBench (Arora et al., 2025): This benchmark is designed to evaluate the clinical proficiency and safety of AI agents in healthcare. Drawing upon the expertise of 262 practicing physicians across 60 countries, the dataset encompasses 5,000 authentic clinical dialogue scenarios ranging from acute emergencies to global health issues. Utilizing a physician-curated rubric, HealthBench moves beyond simple outcome metrics to rigorously assess models across critical dimensions, including clinical accuracy, communication quality, situational awareness, and safety, thereby ensuring robust performance in high-stakes medical applications.

```
Algorithm 1: Unified-MAS
Require: Validation set D_val, LLM P_θ, max epochs K, balance factor α, sample size N
Ensure: Domain-specific node set V_domain

Stage 1: Search-Based Node Generation
 1: Sample N examples from D_val to form C
 2: Extract keywords across 7 dimensions from C
 3: Synthesize search queries for 4 strategies
 4: Retrieve external knowledge
 5: Generate initial node set V_init = {v_1, ..., v_m}

Stage 2: Reward-Based Node Optimization
 6: V_domain ← V_init
 7: for k = 1 to K do
 8:   Initialize R[v] ← ∅ for all v ∈ V_domain
 9:   for each sample (q, y) ∈ D_val do
10:     Initialize empty context A_0 ← [h_0]
11:     Compute baseline predictability: J_0 = −log(PPL(y | q, A_0))
12:     for t = 1 to m do
13:       Execute node v_t, obtain reasoning h_t
14:       Update accumulated context: A_t ← [h_0, h_1, ..., h_t]
15:       Compute: J_t = −log(PPL(y | q, A_t))
16:       Calculate relative gain: δ_t = (J_t − J_0) / J_0
17:       Compute Improvement Score: S_{i,t} = tanh(δ_t + 1)
18:       Compute Consistency Score S_{c,t} using Eq. (6)
19:       Node Quality Score: S_t = (1 − α) S_{i,t} + α S_{c,t}
20:       if t > 1 then
21:         Node reward: r_t = S_t − S_{t−1}
22:       else
23:         Node reward: r_t = S_t
24:       end if
25:       Append r_t to R[v_t]
26:     end for
27:   end for
28:   for each node v ∈ V_domain do
29:     Calculate average reward r̄(v) from R[v]
30:   end for
31:   Identify v* = argmin_{v ∈ V_domain} r̄(v)
32:   Retrieve samples where v* yielded the lowest reward and refine its implementation
33: end for
34: return V_domain
```

J1Bench (Jia et al., 2025): This benchmark focuses on automated legal adjudication by simulating court proceedings. The input consists of 93 comprehensive cases, including formal complaints, defendant arguments, and evidentiary materials derived from actual judicial records. The agent is required to synthesize these conflicting testimonies and legal documents to produce a reasoned, final judicial judgment. Evaluation is based on the alignment of the agent's verdict with ground truth, measuring the model's capacity to interpret legal arguments and arrive at legally sound conclusions.

DeepFund (Li et al., 2025): This benchmark evaluates the financial intelligence of agents in stock market decision-making. The input features a rich, time-sensitive dataset comprising corporate fundamental data, historical price trends, and real-time financial news streams. For a targeted list of stocks, the agent is tasked with outputting a categorical decision, specifically "Buy", "Sell", or "Hold". The full dataset contains 139 cases to assess the agent's ability to effectively integrate heterogeneous information into actionable investment strategies.
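The per-node reward computation at the core of Algorithm 1 (lines 10–25) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: we assume the perplexities PPL(y | q, A_t) have already been computed by an external scorer and are passed in as a list, and the Consistency Score S_{c,t} (Eq. 6) is likewise supplied precomputed.

```python
import math

def node_rewards(ppls, consistency, alpha=0.6):
    """Sketch of Algorithm 1's inner loop for one validation sample.

    ppls: [PPL_0, PPL_1, ..., PPL_m], where index 0 is the baseline
          with empty context A_0 and index t follows executing node v_t.
    consistency: [S_{c,1}, ..., S_{c,m}] from Eq. (6), assumed precomputed.
    Returns the reward r_t for each node t = 1..m.
    """
    j0 = -math.log(ppls[0])                 # baseline predictability J_0
    rewards, prev_score = [], None
    for t in range(1, len(ppls)):
        jt = -math.log(ppls[t])             # J_t after node v_t
        delta = (jt - j0) / j0              # relative gain δ_t
        s_imp = math.tanh(delta + 1)        # Improvement Score S_{i,t}
        s = (1 - alpha) * s_imp + alpha * consistency[t - 1]  # Quality S_t
        # r_t is the marginal change in quality contributed by node v_t
        rewards.append(s if prev_score is None else s - prev_score)
        prev_score = s
    return rewards
```

Averaging these rewards over the validation set per node and taking the argmin identifies the bottleneck node v* that the optimizer refines in the next epoch.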
AIME24&25 (MAA-Committees, 2025): This benchmark collection contains 57 questions and derives from the 2024 and 2025 editions of the American Invitational Mathematics Examination (AIME), comprising two distinct problem sets. Each set contains rigorously vetted mathematical questions characterized by high cognitive demand. The evaluative focus lies in probing advanced mathematical competencies, with particular emphasis on multifaceted problem-solving strategies that require integration of complex conceptual frameworks.

D Experimental Details

D.1 Specific Manual MAS Baselines

PMC (Zhang et al., 2025a): PMC employs a hierarchical planning framework where a centralized planner decomposes complex tasks into sub-tasks, which are then executed by specialized agents with predefined roles. Incorporating a structured collaboration protocol, it ensures systematic problem-solving across multi-stage reasoning chains.

Diagnosis-MAS (Chen et al., 2025): Diagnosis-MAS utilizes a multi-stage diagnostic workflow where agents engage in iterative feedback loops to identify and mitigate noise in reasoning processes. This approach systematically filters out erroneous information, thereby significantly enhancing the reliability of medical diagnosis.

Court-MAS (Jia et al., 2025): Court-MAS adopts an adversarial interaction model inspired by judicial processes, where agents act as competing parties to present evidence and verify claims. A central judge-agent then adjudicates these contributions based on the simulated interaction.

DeepFund-MAS (Li et al., 2025): DeepFund-MAS implements a multi-agent architecture tailored for financial analysis, where agents are partitioned into functional units such as data acquisition, sentiment analysis, and risk assessment. The system allows agents to correlate disparate financial signals into coherent investment insights.
D.2 Implementation Details

For cost considerations, we set AFlow's maximum number of iterations to 10 and run the validation set once each round. For all other baselines, we strictly follow the original settings. Table 5 lists the important hyperparameters used in Unified-MAS. We set GPT-5-Mini to "low" reasoning effort, while leveraging the standard instruction versions of the other three LLMs. We use GPT-4o as the default LLM-judge following (Ke et al., 2025b). We also show the cost of Unified-MAS's two stages in Table 6.

| Split | TravelPlanner | HealthBench | J1Bench | DeepFund | AIME24&25 |
|---|---|---|---|---|---|
| Validation | 45 | 32 | 16 | 32 | 8 |
| Test | 180 | 168 | 77 | 107 | 49 |

Table 4: Data size for each split in each dataset.

| Category | Hyperparameter | Description | Value |
|---|---|---|---|
| Unified-MAS | N | The number of samples used to build the context buffer | 10 |
| Unified-MAS | Turn | The turn number of multi-turn search | 10 |
| Unified-MAS | α | The weight used to aggregate S_{i,t} and S_{c,t} | 0.6 |
| Unified-MAS | K | The epoch number of the node optimization stage | 10 |
| LLM calls | temperature | The sampling temperature of calling the LLM | 1.0 |
| LLM calls | max_tokens | The maximum number of output tokens | 32768 |

Table 5: The description and value of important hyperparameters.

| Dataset | Generation | Optimization |
|---|---|---|
| TravelPlanner | 10.780 | 4.001 |
| HealthBench | 8.033 | 1.793 |
| J1Bench | 11.093 | 1.737 |
| DeepFund | 10.170 | 3.113 |
| AIME24&25 | 8.891 | 0.255 |

Table 6: Cost (USD $) of Unified-MAS using Gemini-3-Pro as the Designer.

E Case Study

Table 7 lists the generated nodes of Unified-MAS using Gemini-3-Pro on AIME24&25. Table 8 shows the generated nodes using Unified-MAS and other Automatic-MAS with dynamic nodes. It indicates that although these Automatic-MAS can introduce new nodes to some extent, their performance in different specialized fields is not stable enough. For example, EvoAgent generates an excessive number of "Expert" nodes to solve the problem in parallel, which is more like an ensemble than introducing real agentic elements.

Figure 5 and Figure 6 compare the different MAS generated by MAS-Zero using Gemini-3-Flash, for the same example shown in Table 8, with/without nodes generated by Unified-MAS. Compared with the original AFlow, the Unified-MAS version is more structured, transparent, and reliable. It explicitly separates case structuring, legal retrieval, fact verification, damages calculation, and judgment drafting, so each reasoning stage is traceable and easier to validate. By contrast, the original AFlow relies more on prompt-level reasoning and ensemble voting, offering less explicit alignment between evidence, legal rules, and quantified outcomes.

| Node Name | Node Description |
|---|---|
| Math_Domain_Analyzer | Understands the problem type and key constraints. |
| Theorem_Strategy_Retriever | Finds relevant theorems and solving strategies. |
| Step_by_Step_Solver | Builds a full solution draft step by step. |
| Constraint_Logic_Verifier | Checks and fixes logic/math mistakes. |
| Final_Answer_Formatter | Extracts and formats the final answer correctly. |

Table 7: Unified-MAS's generated nodes using Gemini-3-Pro on AIME24&25.

F Prompt Details

We elaborate on the prompts used in Unified-MAS from Figure 7 to Figure 18. These comprehensive instructions cover evaluation and the entire framework pipeline, including keyword extraction, search query generation, strategy analysis, node generation, and node optimization.

Table 8 input question: You are a rigorous and impartial presiding judge. Your task is to generate legal reasoning and deliver the final ruling based on the plaintiff's and defendant's statements and the additional information provided. Maintain a neutral, professional, and fair judicial tone at all times, without favoring either side. You are given the following information: "category": "Personality rights dispute", "plaintiff": "x Song", "defendant": "A kindergarten in Beijing", "incident": "2023-04-21 kite activity injury (facial cut near eye)", "claims": [medical 6900¥, lost wages 4000¥, transport 3000¥, mental distress 10000¥, future treatment 40000¥], ...
AOrchestra (Gemini-3-Flash) MainAgentTop-level orchestrator deciding next action. errorRuntime error transition captured in trajectory log. delegate_taskDelegates current sub-problem to a sub-agent. finishSub-agent final answer step for delegated task. completeMainAgent composes and returns final answer. SubAgentDelegated worker agent that executes subtask reasoning. EvoAgent (Gemini-3-Flash) MainAgentControls iterative expert evolution and selection. Expert#1Expert role (tort-law doctrine and social public policy). Expert#2Expert role (protection of minors’ rights and mental-health assessment). Expert#3 Expert role for refining disputed issues (future treatment and long-term impact). ExpertGroup(3)Aggregated 3-expert panel output per iteration. MetaAgent (Gemini-3-Flash) Presiding_JudgePerforms legal analysis: statute search, liability split, claim acceptance/rejection, and summary for downstream actuar- ial calculation. Unified_MAS (Gemini-3-Flash) Rhetorical_SegmenterSegments legal input into modules: plaintiff claims, defense, findings, and evidence. Legal_Element_Extractor Extracts legal-technical elements such as claim items, amounts, and injury/contract details. Statutory_RetrieverRetrieves applicable PRC Civil Code statutes based on extracted legal elements. Evidence_EvaluatorEvaluates evidentiary support using civil “high probability” proof standard. Liability_Reasoning_EngineApplies law to verified facts to infer liability ratio and compensation basis. Final_Judgement_SynthesizerProduces final judicial reasoning and verdict in required output format. Unified_MAS (GPT-5-Mini) Ingest_and_NormalizeNormalizes input into canonical text blocks with metadata and offsets. Document_ClassifierClassifies document/domain type and extracts top remedies. Party_and_Role_ExtractionExtracts parties/roles with provenance. Claims_and_Remedies_Extraction Extracts requested claims/remedies and maps them to claimants. 
  Evidence_Enumeration                   Enumerates/classifies evidence and links evidence to claims/events.
  Timeline_and_Causation                 Builds a chronological event timeline and causal links to damages.
  Retrieve_Statutes_and_Precedents       Retrieves legal statutes and precedent snippets (RAG).
  Statute_to_Fact_Linking                Links facts/claims to statute or case references with justifications.
  Liability_Reasoning                    Infers party liability allocation with legal rationale.
  Damage_Calculation_and_Reconciliation  Performs component-level damage calculation and reconciliation.
  Validation_and_Consistency_Checks      Runs consistency/constraint checks on full structured output.
  Final_Judgment_Synthesis               Synthesizes full Chinese judgment text and structured verdict.
  Final_Answer_Line                      Emits the final one-line verdict beginning with "Answer:".

Unified_MAS (Gemini-3-Pro)
  Case_Structurer      Parses raw case JSON into parties, cause of action, claims, and dispute summary.
  Legal_Search_Engine  Retrieves statutes/judicial interpretations relevant to the dispute type.
  Fact_Analyzer        Verifies facts and causality from conflicting statements and evidence.
  Damages_Calculator   Validates and computes monetary compensation items.
  Judgment_Drafter     Drafts the final formal judgment text from structured reasoning.

Table 8: Comparison of generated nodes using MetaAgent, EvoAgent, AOrchestra, and Unified-MAS on J1Bench.
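The node decompositions in Table 8 can be read as linear pipelines over a shared case state. A minimal sketch of that reading, where the node names follow the Unified_MAS (Gemini-3-Pro) rows but the node bodies, runner, and all concrete values are illustrative stand-ins rather than the paper's implementation:

```python
# Hypothetical sketch of running a generated node pipeline sequentially.
# Node names follow the Unified_MAS (Gemini-3-Pro) rows of Table 8; the
# node bodies and values below are stand-ins, not the paper's code.

def case_structurer(state):
    state["structured"] = {"cause_of_action": "personality rights dispute"}
    return state

def legal_search_engine(state):
    state["statutes"] = ["PRC Civil Code Art. 1165"]  # illustrative only
    return state

def fact_analyzer(state):
    state["verified_facts"] = {"liability_ratio": 0.8}  # illustrative only
    return state

def damages_calculator(state):
    # Apply the inferred liability ratio to an illustrative claimed amount.
    state["damages"] = 6900 * state["verified_facts"]["liability_ratio"]
    return state

def judgment_drafter(state):
    state["judgment"] = f"Defendant pays {state['damages']:.0f} yuan."
    return state

PIPELINE = [case_structurer, legal_search_engine, fact_analyzer,
            damages_calculator, judgment_drafter]

def run(problem: str) -> dict:
    state = {"problem": problem}
    for node in PIPELINE:  # each node reads and extends the shared case state
        state = node(state)
    return state

result = run("2023-04-21 kite activity injury")
```

The point of the sketch is only the shape: every node consumes the state produced upstream, which is what makes each reasoning stage traceable in isolation.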
The MAS generated by AFlow with Unified-MAS

class Workflow:
    def __init__(
        self,
        name: str,
        llm_config,
        dataset: DatasetType,
    ) -> None:
        self.name = name
        self.dataset = dataset
        self.llm = create_llm_instance(llm_config)
        self.custom = operator.Custom(self.llm)
        self.case_structurer = operator.CaseStructurer(self.llm)
        self.fact_analyzer = operator.FactAnalyzer(self.llm)
        self.legal_search_engine = operator.LegalSearchEngine(self.llm)
        self.damages_calculator = operator.DamagesCalculator(self.llm)
        self.judgment_drafter = operator.JudgmentDrafter(self.llm)

    async def __call__(self, problem: str):
        """
        Implementation of a structured judicial workflow using specialized operators.
        """
        # Step 1: Extract and structure the key legal elements from raw case text
        s_res = await self.case_structurer(input_data=problem)

        # Prepare inputs for fact analysis (ensuring dictionary types for keys expected to be objects)
        fact_arguments = s_res.get("arguments", {})
        fact_parties = s_res.get("parties", {})

        # Step 2: Retrieve relevant laws and perform forensic fact analysis
        l_res = await self.legal_search_engine(
            cause_of_action=str(s_res.get("cause_of_action", "")),
            arguments=str(s_res.get("arguments", "")),
        )
        # Ensure legal search result is a dictionary before accessing "relevant_laws"
        relevant_laws = l_res.get("relevant_laws", str(l_res)) if isinstance(l_res, dict) else str(l_res)
        f_res = await self.fact_analyzer(
            arguments=fact_arguments,
            evidence_summary=str(s_res.get("evidence_summary", "")),
            cause_of_action=str(s_res.get("cause_of_action", "")),
            parties=fact_parties,
        )

        # Step 3: Calculate valid damages based on verified facts and legal rules
        c_res = await self.damages_calculator(
            financial_claims=s_res.get("financial_claims", []),
            verified_facts=f_res,
            relevant_laws=relevant_laws,
            cause_of_action=str(s_res.get("cause_of_action", "")),
        )

        # Step 4: Generate a formal draft judgment
        dr_res = await self.judgment_drafter(
            structured_brief=s_res,
            relevant_laws=relevant_laws,
            verified_facts=f_res,
            damages_table=c_res,
        )

        # Step 5: Final synthesis and rule enforcement (Statute of Limitations, Divorce grounds, Causality)
        # Safely extract the judgment text from the drafting operator result
        draft_judgment = dr_res.get("final_judgment", str(dr_res)) if isinstance(dr_res, dict) else str(dr_res)
        context = f"Original Case: {problem}, Verified Facts: {f_res}, Legal Draft: {draft_judgment}"
        solution = await self.custom(input=context, instruction=prompt_custom.JUDGE_PROMPT)

        # Handle final solution format
        if isinstance(solution, dict):
            final_response = solution.get("response", str(solution))
        else:
            final_response = str(solution)

        return final_response, self.llm.get_usage_summary()["total_cost"]

Figure 5: The MAS generated by AFlow with Unified-MAS using Gemini-3-Flash as Orchestrator.

The MAS generated by AFlow without Unified-MAS

class Workflow:
    def __init__(
        self,
        name: str,
        llm_config,
        dataset: DatasetType,
    ) -> None:
        self.name = name
        self.dataset = dataset
        self.llm = create_llm_instance(llm_config)
        self.custom = operator.Custom(self.llm)
        self.sc_ensemble = operator.ScEnsemble(self.llm)
        self.answer_generate = operator.AnswerGenerate(self.llm)

    async def __call__(self, problem: str):
        """
        Implementation of the workflow with Evidence checking, Logical accounting, and Ensemble judgment.
        """
        # Step 1: Objective evaluation of the evidentiary chain and proof status
        evidence = await self.custom(input=problem, instruction=prompt_custom.EVIDENCE_PROMPT)
        evidence_res = evidence.get("response", "")

        # Step 2: Step-by-step logical accounting of liability and proven amounts
        # We use AnswerGenerate to enforce a CoT structure (thought) and a summary (answer)
        logic = await self.answer_generate(
            input=f"Evidence review result: {evidence_res}, Case information: {problem}, "
                  "Please strictly deduct amounts not admitted according to the \"dispute\" field, "
                  "then analyze the liability ratio based on the defendant's defense and calculate the final amount."
        )

        # Safe extraction to prevent KeyError: "answer" if the LLM output doesn't match the expected tags
        logic_thought = logic.get("thought", "No detailed logical analysis")
        logic_answer = logic.get("answer", logic.get("response", "No clear core conclusion"))

        # Step 3: Ensemble multiple candidates to reach the most consistent final judgment
        candidates = []
        for _ in range(3):
            # We pass both the detailed reasoning and the specific conclusion to the judge
            cand_res = await self.custom(
                input=f"Logical analysis: {logic_thought}, Core conclusion: {logic_answer}, Case information: {problem}",
                instruction=prompt_custom.JUDGMENT_PROMPT,
            )
            candidates.append(cand_res.get("response", ""))

        # Perform self-consistency ensemble to select the most reliable verdict
        solution = await self.sc_ensemble(problem=problem, solutions=candidates)
        return solution.get("response", ""), self.llm.get_usage_summary()["total_cost"]

Figure 6: The MAS generated by AFlow without Unified-MAS using Gemini-3-Flash as Orchestrator.

Epoch | Node Internal Implementation

Epoch 0 (Initialization)

def Fact_Analyzer(self, input_data):
    args = input_data.get("arguments", {})
    evid = input_data.get("evidence_summary", {})
    node_messages = [
        {"role": "system", "content": "You are a Senior Investigator..."},
        {"role": "user", "content": f"Plaintiff's Story & Evidence: {args}, {evid}..."},
    ]
    response = self.llm_client.chat(node_messages, response_format="json_object")
    return response

Epoch 10 (Optimized)

def Fact_Analyzer(self, input_data):
    # omitted detailed prompt bodies for space
    cause_of_action = input_data.get('cause_of_action', 'Civil Dispute')
    parties = input_data.get('parties') or input_data.get('specific_characters', {})
    arguments = input_data.get('arguments', {})
    evidence = input_data.get('evidence', {})
    evidence_str = json.dumps(evidence, ensure_ascii=False, indent=2)
    fact_finder_messages = [{system: category-aware rules for Tort/Financial/Divorce}, {user: case context}]
    fact_analysis = self.llm_client.chat(fact_finder_messages)
    judge_messages = [{system: liability matrix}, {user: Fact Analysis}]
    response = self.llm_client.chat(judge_messages, response_format='json_object')
    return response

Table 9: Comparison of Fact_Analyzer implementation across epochs on J1Bench. Compared to the unoptimized Epoch 0, Epoch 10 contains a two-stage, category-aware reasoning pipeline.

Prompt for LLM-as-a-judge Evaluation

HealthBench:
Your job is to score a model's reply against a rubric item. Decide whether the reply meets the criterion described in the rubric item.
Model's reply: <reply>
Rubric item: <rubric_item>
# Instructions
Return a json object with the following fields: "explanation" and "criteria_met".
The "explanation" field should be a string explaining why the reply does or does not meet the criteria of the rubric item.
The "criteria_met" field should be a boolean indicating whether the reply meets the criteria of the rubric item.
If a rubric item has multiple sentences or criteria, you should consider all of them. If any of the criteria is not met, the answer should be false. Only return true if all of the criteria are met.
One important exception: if a criteria says "such as", "for example", or "including", the reply does not have to include all of the examples listed to meet the criteria.
# Final instruction
Return just the json object in markdown format.
Do not include any other text in the response.

J1Bench:
You are a legal expert. Please determine whether the "Answer to be evaluated" fully and accurately covers all the core points of the "Standard answer" based on the "legal issue" and the "Standard Answer", and then give a score ranging from 0 to 10.
You don't need to consider non-substantive factors such as whether the answer to be evaluated is expressed concisely, whether the key points are highlighted, whether small talk is used, or whether the structure is lengthy. You don't need to deduct points for being insufficiently concise, nor do you need to consider the difference in language forms between Chinese and English. You only need to consider whether the content meaning is consistent.
Standard answer: <gt_answer>
Answer to be evaluated: <mode_answer>
Output your results in the following format (no line breaks, no parentheses): <Rating: ..., Reason: ...>

Figure 7: Prompt for LLM-as-a-judge Evaluation.

Prompt for Keyword Extraction

System_prompt:
You are an expert dataset and task analyst. You are given multiple samples from a benchmark dataset. Your task is to carefully read all samples, analyze this Specific Domain Task, and extract keywords across six specific dimensions required to solve this task. The extracted keywords should be concise and representative, and should not focus on the specific data samples, but on the general domain and task.

User_prompt:
Task samples: <samples_text>
Analyze the description above and reason to extract keywords for the following six dimensions. For each dimension, provide 5-10 most representative terms:
1. Domain: The macro industry background (e.g., Fintech, Supply Chain, Bioinformatics, etc.).
2. Task: The core technical problem to solve (e.g., Anomaly Detection, Named Entity Recognition, Summarization, etc.).
3. Entities: The specific data objects or physical entities involved (e.g., Transaction Logs, PDF Contracts, Protein Sequences, Sensor Data, etc.).
4.
Actions: The specific operations performed on the data (e.g., Classify, Extract, Reason, Optimize, Verify, etc.).
5. Constraints: Performance metrics or limitations (e.g., Low Latency, Privacy Preserving, Explainability, Offline Inference, etc.).
6. Desired Outcomes: The expected results or metrics (e.g., Accuracy, Precision, Recall, F1 Score, AUC, MAP, NDCG, etc.).
7. Implicit Knowledge: Based on your expert knowledge, infer specific jargon, SOTA techniques, common challenges, or potential risks that are not explicitly mentioned but are essential for solving this problem (e.g., "Imbalanced Data" for fraud, "Hallucination" for GenAI, "Bullwhip Effect" for supply chain, etc.).
# Output Format
Please output both your thinking and answer in the JSON format.
"thinking" entry: [Your thinking process, how you arrive at your answer]
"answer" entry: [your answer in the JSON format]
For the "thinking" entry, you need to first carefully read the <samples_text>, summarize the task description, and then reason step by step to arrive at your answer.
For the "answer" entry, please output a valid JSON object. Do not include any conversational filler or markdown formatting outside the JSON code block. Format as follows:
{"Domain": ["..."], "Task": ["..."], "Entities": ["..."], "Actions": ["..."], "Constraints": ["..."], "Desired_Outcomes": ["..."], "Implicit_Knowledge": ["..."]}

Figure 8: Prompt for Keyword Extraction.

Prompt for Search Query Generation

System_prompt:
You are an expert in Information Retrieval (IR) and Multi-Agent System Design. You know how to construct precise search queries to retrieve background knowledge, high-quality academic papers, code implementations, and industry Standard Operating Procedures (SOPs).
User_prompt:
Based on the provided [Structured Keywords (Domain, Task, Entities, Actions, Constraints, Desired_Outcomes, Implicit_Knowledge)], apply four specific search strategies to generate a list of search queries for Google Scholar, GitHub, and General Web Search.
Structured Keywords JSON: <keywords_json_str>
Apply the following four strategies to construct your queries for each dimension:
1. Strategy A: Background Knowledge
   Logic: Domain + Implicit_Knowledge
   Aim: Use domain jargon to find background knowledge, cutting-edge solutions, theoretical frameworks, and surveys.
2. Strategy B: High-quality Academic Papers about System Architecture (Workflow & Design)
   Logic: Task + Constraints
   Aim: Find architectural designs (e.g., Router, Pipeline, Map-Reduce) that satisfy specific constraints (e.g., Privacy, Real-time).
3. Strategy C: Technical Code Implementation
   Logic: Entities + Actions
   Aim: Find code repositories, libraries, or preprocessing tools for specific data types.
4. Strategy D: Evaluation & Metrics
   Logic: Task + Desired_Outcomes
   Aim: Find standard datasets and quantitative metrics to evaluate the Agent's performance.
# Output Instructions
Generate 5-10 search queries for EACH strategy. Use Boolean operators (AND, OR) where appropriate to optimize results.
Please output ONLY a valid JSON object with the following structure:
{"strategy_A": [{"query": "...", "reasoning": "Using [Implicit Term] to find advanced patterns"}], "strategy_B": [{"query": "...", "reasoning": "To find architectures satisfying [Constraint]"}], "strategy_C": [{"query": "...", "reasoning": "To find tools for processing [Object] via [Action]"}], "strategy_D": [{"query": "...", "reasoning": "To find benchmarks for [Outcome]"}]}

Figure 9: Prompt for Search Query Generation.

Prompt for Multi-turn Search

System_prompt:
You are a web search controller. Your job is to decide, step by step, how to search the web so that the user can find content that matches the target description.
At each round, you will see the target description and a summary of past searches and results, and you must decide whether more searching is needed. You MUST respond with a valid JSON object only, no extra text.

User_prompt:
Target description: <target_description>
Search round: <round_idx> / <max_rounds>
Past search rounds: <history_str>
Search engine context:
Backend type: <engine_type>
Instructions:
Github_engine_hint: Current search backend: GitHub repository search. You MUST construct queries that look like GitHub repo searches, NOT natural language questions. Focus on a few core keywords: domain, task, entities, and techniques. Prefer short keyword-style queries, optionally with GitHub qualifiers such as 'language:python', 'in:name,description,readme', 'stars:>10'. Avoid 'survey of', 'methods for', 'towards', or very long sentences in the query.
Scholar_engine_hint: Current search backend: Google Scholar (academic papers). You should construct queries that look like paper titles or combinations of technical terms. It is good to include phrases like 'survey', 'review', 'state of the art' when searching for overviews. Focus on scientific keywords (task, domain, methodology) rather than implementation details.
Google_engine_hint: Current search backend: general Google web search. You may mix natural language with key technical terms. Focus on retrieving background knowledge, blog posts, documentation, or tutorials relevant to the target description.
Your task in THIS round:
1. Carefully read the target description and past search results.
2. Decide whether we already have enough information that clearly matches the target description.
3. If yes, set "done": true and summarize the useful information we already have.
4. If no, set "done": false and propose the NEXT web search query to run.
Output JSON schema (you must strictly follow):
{
  "done": bool,         // true if we already have enough matching information
  "need_search": bool,  // whether to run another web search in this round
  "next_query": str,    // the next search query to run (empty if done=true)
  "reasoning": str,     // your reasoning for this decision
  "summary": str        // if done=true, summarize what has been found and why it matches
}

Figure 10: Prompt for Multi-turn Search.

Prompt for Strategy_A Analysis

System_prompt:
You are an expert technical analyst. Your task is to analyze multiple documents (PDFs and TXTs) that were retrieved through a web search for background knowledge related to a specific task, and provide a comprehensive summary.

User_prompt:
Task Description (from task_keywords thinking): <task_thinking>
Strategy: <strategy_name>
Documents Retrieved: <files_text>
IMPORTANT: Please provide an EXTREMELY DETAILED and COMPREHENSIVE analysis. The more detailed, the better. Include specific examples, step-by-step explanations, concrete details, and thorough descriptions.
Your task:
1. Analyze all the documents above and identify which aspects of the background knowledge related to the task described above they discuss. Be very specific and detailed about each aspect.
2. Summarize the key background information that is needed to solve this task. Provide EXTREMELY DETAILED descriptions, including but not limited to:
- Overall task workflow and processes: Provide a DETAILED, step-by-step workflow with specific stages, decision points, inputs/outputs at each stage, and the complete process flow. Include concrete examples and detailed explanations of each step.
- Key points and important considerations: List ALL important points with detailed explanations, why they matter, and how they impact the task. Be thorough and comprehensive.
- Domain-specific knowledge and terminology: Provide detailed definitions, explanations, and context for each term.
Include how these concepts relate to each other and their significance in the domain.
- Relevant frameworks, methodologies, or approaches: Describe each framework/methodology in DETAIL, including their components, how they work, when to use them, and their advantages/disadvantages. Provide specific examples.
- Common challenges and solutions: Detail each challenge with specific scenarios, root causes, and provide detailed solutions with step-by-step approaches. Include real-world examples.
- Best practices and standards: Provide detailed best practices with specific guidelines, checklists, and detailed explanations of why each practice is important.
3. Provide a structured summary that clearly explains:
- What background knowledge aspects are covered in these documents (with detailed descriptions)
- What specific background information is needed to solve the task (be very specific and detailed)
- How this background knowledge relates to the task at hand (provide detailed connections and relationships)
Remember: The more detailed and comprehensive your analysis, the better. Include specific examples, detailed explanations, step-by-step processes, and thorough descriptions throughout.
Please provide a comprehensive and well-structured analysis in JSON format:
{
  "aspects_covered": ["detailed aspect1 with explanation", "detailed aspect2 with explanation", ...],
  "background_information": {
    "task_workflow": "DETAILED step-by-step workflow with all stages, inputs/outputs, decision points, and complete process flow. Be extremely thorough.",
    "key_points": ["detailed point1 with full explanation", "detailed point2 with full explanation", ...],
    "domain_knowledge": "DETAILED explanation of domain-specific knowledge, terminology, concepts, and their relationships.
Be comprehensive and thorough.",
    "frameworks_methodologies": ["detailed framework1 with components and usage", "detailed framework2 with components and usage", ...],
    "challenges_solutions": "DETAILED description of common challenges with specific scenarios, root causes, and detailed step-by-step solutions with examples.",
    "best_practices": "DETAILED best practices with specific guidelines, checklists, and explanations of importance. Be comprehensive."
  },
  "summary": "EXTREMELY DETAILED and comprehensive summary of the background knowledge, including all key points, detailed workflows, and thorough explanations..."
}

Figure 11: Prompt for Strategy_A Analysis.

Prompt for Strategy_B Analysis

System_prompt:
You are an expert system architect and technical analyst. Your task is to analyze academic papers and documents about system architecture, workflow, and design related to a specific task, and provide insights on architectural patterns and design approaches.

User_prompt:
Task Description (from task_keywords thinking): <task_thinking>
Strategy: <strategy_name>
Documents Retrieved: <files_text>
IMPORTANT: Please provide an EXTREMELY DETAILED and COMPREHENSIVE analysis. The more detailed, the better. Include specific architectural diagrams, descriptions, detailed workflow steps, component interactions, and thorough explanations.
Your task:
1. Analyze all the documents above and identify the system architectures, workflows, and design patterns they discuss. Be very specific and detailed about each pattern and architecture.
2. Summarize the key architectural and design information relevant to solving this task. Provide EXTREMELY DETAILED descriptions, including but not limited to:
- System architecture patterns and structures: Provide DETAILED descriptions of each architecture pattern, including components, their roles, data flow, communication patterns, and how they work together. Include specific examples and detailed explanations.
- Workflow designs and process flows: Provide EXTREMELY DETAILED, step-by-step workflow descriptions with all stages, transitions, decision points, data flows, error handling, and complete process flows. Include detailed diagrams, descriptions, and specific examples.
- Component interactions and interfaces: Detail how components interact, what interfaces they use, data formats, protocols, and communication mechanisms. Be very specific and thorough.
- Design principles and constraints: Provide detailed explanations of each design principle (e.g., privacy, real-time, scalability) with specific implementation strategies, trade-offs, and detailed guidelines. Include concrete examples.
- Architectural trade-offs and decisions: Detail each trade-off with specific scenarios, pros/cons, decision criteria, and detailed explanations of why certain choices are made. Be comprehensive.
- Best practices for system design: Provide detailed best practices with specific guidelines, patterns to follow, anti-patterns to avoid, and detailed explanations. Include real-world examples.
3. Provide a structured summary that clearly explains:
- What architectural patterns and workflows are covered in these documents (with detailed descriptions)
- What specific architectural/design information is needed to solve the task (be very specific and detailed)
- How these architectural approaches relate to the task requirements (provide detailed connections and relationships)
Remember: The more detailed and comprehensive your analysis, the better. Include specific architectural details, detailed workflow steps, component interactions, and thorough explanations throughout.
Please provide a comprehensive and well-structured analysis in JSON format:
{
  "architectural_patterns": ["detailed pattern1 with components and structure", "detailed pattern2 with components and structure", ...],
  "design_information": {
    "system_architectures": "DETAILED description of system architectures with components, data flows, communication patterns, and how they work together. Be extremely thorough.",
    "workflow_designs": ["DETAILED step-by-step workflow1 with all stages and transitions", "DETAILED step-by-step workflow2 with all stages and transitions", ...],
    "component_interactions": "DETAILED description of component interactions, interfaces, data formats, protocols, and communication mechanisms. Be comprehensive.",
    "design_constraints": ["detailed constraint1 with implementation strategies", "detailed constraint2 with implementation strategies", ...],
    "architectural_tradeoffs": "DETAILED description of trade-offs with specific scenarios, pros/cons, decision criteria, and explanations. Be thorough.",
    "design_best_practices": "DETAILED best practices with specific guidelines, patterns, anti-patterns, and explanations. Include examples. Be comprehensive."
  },
  "summary": "EXTREMELY DETAILED and comprehensive summary of the architectural and design knowledge, including all patterns, detailed workflows, and thorough explanations..."
}

Figure 12: Prompt for Strategy_B Analysis.

Prompt for Strategy_C Analysis

System_prompt:
You are an expert AI system architect and LLM prompt engineer. Your task is to analyze code repositories and design frameworks for solving tasks using Large Language Models (LLMs). Focus on high-level architecture, operation design, and how to migrate traditional ML/small model approaches to LLM-based solutions.
User_prompt:
Task Description (from task_keywords thinking): <task_thinking>
Strategy: <strategy_name>
Documents Retrieved (Code Repositories): <files_text>
IMPORTANT: Focus on FRAMEWORK DESIGN and LLM MIGRATION, not on specific libraries or dependencies. Think about how to solve the task at a high level using LLMs.
Your task:
1. Analyze the overall framework and architecture in the provided code:
- What is the high-level workflow and operation flow?
- How are different components organized and connected?
- What are the key operations/steps needed to solve the task?
- How can these operations be efficiently designed and orchestrated?
2. Design LLM-based solutions to replace or enhance the small model implementations:
- Operation Design: How to break down the task into well-defined operations that can be executed by LLMs? What operations are needed and how should they be structured?
- Prompt Engineering: For each operation that was previously done by small models, design detailed prompts for LLMs. What should be the input format, what instructions should be given, and what output format is expected?
- Model-level Mechanisms: How to implement global constraint checking, validation, error handling, and other model-level controls? What mechanisms are needed to ensure the LLM operations work correctly together?
- Data Flow: What is the input/output format for each LLM operation? How should data flow between different operations? What transformations are needed?
3. Migration Strategy:
- How can the existing small model code be adapted to use LLMs instead?
- What are the key differences in approach between small models and LLMs for this task?
- How to design the system to leverage LLM capabilities while maintaining the original workflow structure?
4. Framework Considerations:
- What is the overall system architecture needed to solve this task?
- How should operations be orchestrated and sequenced?
- What are the critical decision points and branching logic?
- How to handle state management and context passing between operations?
Focus Areas (in order of importance):
(1) Overall Framework & Architecture: How to structure the solution at a high level.
(2) Operation Design: How to break down the task into LLM-executable operations.
(3) Prompt Design: Detailed prompt templates for each LLM operation.
(4) Data Processing & Flow: Input/output formats and data transformations between operations.
(5) Model-level Mechanisms: Global constraints, validation, error handling.
(6) Migration Strategy: How to adapt small model code to LLM-based approach.
Do NOT focus on:
- Specific library dependencies or installation requirements.
- Environment setup details.
- Low-level implementation details of non-LLM components.
Please provide a comprehensive and well-structured analysis in JSON format:
{
  "overall_framework": {
    "architecture": "DETAILED description of the overall system architecture and framework design needed to solve this task. Explain the high-level structure, component organization, and how different parts work together.",
    "workflow": "DETAILED step-by-step workflow description. Explain the sequence of operations, decision points, and how the system processes the task from start to finish.",
    "key_operations": ["operation1: detailed description of what it does and how it fits in the framework", "operation2: ...", ...]
  },
  "llm_migration": {
    "operation_design": "DETAILED description of how to design operations for LLM execution.
Explain how to break down the task into operations, how operations should be structured, and how they should interact.",
    "prompt_templates": [
      {
        "operation_name": "name of the operation",
        "purpose": "what this operation does in the overall framework",
        "input_format": "detailed description of input format and structure",
        "prompt_template": "detailed prompt template with placeholders and instructions",
        "output_format": "detailed description of expected output format",
        "constraints": "any constraints or validation rules for this operation"
      },
      ...
    ],
    "model_level_mechanisms": "DETAILED description of model-level mechanisms needed: global constraint checking, validation rules, error handling strategies, state management, context passing, etc. Be very specific about how these mechanisms work.",
    "migration_strategy": "DETAILED explanation of how to migrate from small model code to LLM-based approach. What changes are needed, what can be reused, and how to adapt the existing workflow."
  },
  "data_processing": {
    "input_output_formats": "DETAILED description of input/output formats for LLM operations. What data structures are needed, what format should be used, and how data should be structured.",
    "data_flow": "DETAILED description of how data flows between operations. What transformations are needed, how to pass context between operations, and how to maintain data consistency.",
    "preprocessing": "DETAILED description of any preprocessing needed before sending data to LLMs (if any).",
    "postprocessing": "DETAILED description of any postprocessing needed after receiving LLM outputs (if any)."
  },
  "summary": "EXTREMELY DETAILED and comprehensive summary of the framework design, operation structure, LLM migration strategy, and how to solve this task using LLMs. Include specific examples of prompt designs, operation flows, and architectural decisions."
}

Figure 13: Prompt for Strategy_C Analysis.
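The "prompt_templates" entries requested by the Strategy_C schema are concrete enough to instantiate mechanically at runtime. A minimal sketch of that step, where the example entry and the build_messages helper are illustrative assumptions, not the paper's implementation:

```python
# Hypothetical sketch: turn one "prompt_templates" entry from a Strategy_C
# analysis into chat messages for an LLM call. The example entry below and
# the build_messages helper are illustrative, not the paper's code.

entry = {
    "operation_name": "Legal_Element_Extractor",
    "purpose": "Extract claim items and amounts from raw case text.",
    "input_format": "plain case text under key 'case_text'",
    "prompt_template": "Extract all legal elements from the case below.\nCase: {case_text}",
    "output_format": "JSON list of element/amount objects",
    "constraints": "Output valid JSON only; do not invent amounts.",
}

def build_messages(entry: dict, **inputs) -> list[dict]:
    """Fill the template's placeholders, then append its output format and constraints."""
    user = entry["prompt_template"].format(**inputs)
    user += "\nOutput format: " + entry["output_format"]
    user += "\nConstraints: " + entry["constraints"]
    return [
        {"role": "system", "content": entry["purpose"]},
        {"role": "user", "content": user},
    ]

msgs = build_messages(entry, case_text="Plaintiff claims 6900 yuan in medical fees.")
```

Keeping the constraints field separate from the template body mirrors the schema's intent: validation rules stay declarative and can be checked against the operation's output independently of the prompt wording.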
Prompt for Strategy_D Analysis

System_prompt:
You are an expert evaluator and metrics analyst. Your task is to analyze documents about evaluation metrics, benchmarks, and assessment methods related to a specific task, and provide insights on evaluation approaches and standards.

User_prompt:
Task Description (from task_keywords thinking): <task_thinking>
Strategy: <strategy_name>
Documents Retrieved: <files_text>
IMPORTANT: Please provide EXTREMELY DETAILED and COMPREHENSIVE analysis. The more detailed, the better. Include specific metric definitions, detailed evaluation procedures, step-by-step assessment workflows, and thorough explanations.
Your task:
1. Analyze all the documents above and identify the evaluation metrics, benchmarks, and assessment methods they discuss. Be very specific and detailed about each metric and method.
2. Summarize the key evaluation information relevant to solving this task. Provide EXTREMELY DETAILED descriptions, including but not limited to:
- Standard evaluation metrics and their definitions: Provide DETAILED definitions for each metric, including mathematical formulas, calculation methods, interpretation guidelines, and specific use cases. Include examples and detailed explanations.
- Benchmark datasets and evaluation protocols: Detail each dataset with size, format, structure, quality, and provide DETAILED evaluation protocols with step-by-step procedures, data splits, evaluation criteria, and complete assessment workflows. Be extremely thorough.
- Assessment methodologies and procedures: Provide DETAILED, step-by-step assessment workflows with all stages, evaluation criteria, scoring methods, and complete procedures. Include specific examples and detailed explanations.
- Performance standards and baselines: Detail performance benchmarks with specific numbers, comparison methods, baseline implementations, and detailed explanations of what constitutes good performance. Be comprehensive.
- Evaluation best practices and guidelines: Provide detailed best practices with specific guidelines, common mistakes to avoid, validation procedures, and detailed explanations. Include real-world examples.
- Metrics interpretation and analysis methods: Detail how to interpret each metric, what values indicate good/bad performance, statistical analysis methods, and detailed interpretation guidelines. Be thorough.
3. Provide a structured summary that clearly explains:
- What evaluation metrics and benchmarks are covered in these documents (with detailed descriptions)
- What specific evaluation information is needed to assess task performance (be very specific and detailed)
- How these evaluation approaches relate to the task requirements (provide detailed connections and relationships)

Remember: The more detailed and comprehensive your analysis, the better. Include specific metric definitions, detailed evaluation procedures, step-by-step workflows, and thorough explanations throughout.

Please provide a comprehensive and well-structured analysis in JSON format:
"evaluation_metrics": ["detailed metric1 with definition and formula", "detailed metric2 with definition and formula", ...],
"evaluation_information":
  "standard_metrics": ["detailed metric1 with calculation method", "detailed metric2 with calculation method", ...],
  "benchmark_datasets": ["detailed dataset1 with protocol", "detailed dataset2 with protocol", ...],
  "assessment_methodologies": "DETAILED step-by-step assessment workflow with all stages, criteria, scoring methods, and complete procedures. Be extremely thorough.",
  "performance_standards": "DETAILED performance benchmarks with specific numbers, comparison methods, baselines, and explanations. Be comprehensive.",
  "evaluation_best_practices": "DETAILED best practices with guidelines, common mistakes, validation procedures, and explanations. Include examples.
Be comprehensive.",
  "metrics_interpretation": "DETAILED interpretation guidelines with analysis methods, value meanings, and statistical considerations. Be thorough.",
"summary": "EXTREMELY DETAILED and comprehensive summary of the evaluation and metrics knowledge, including all metrics, detailed procedures, and thorough explanations..."

Figure 14: Prompt for Strategy_D Analysis.

Prompt for Node Template

def node_name(self, input_data):
    """
    node_id: node_id
    node_type: node_type
    description: description
    dependencies: dependencies
    input: input
    output: output
    """
    # ---- Step 1: Process the input data
    # input_data is a dictionary with the keys as the input names and the values as the input values
    # First, extract the input values from the input_data dictionary
    # Fill your code here
    # ---- Step 2: Implement the node logic for one of the node types (LLM_Generator, Retrieval_RAG)
    # Second, for LLM_Generator nodes, use LLMs to process the input data
    # For example, define the system prompt and user prompt:
    # node_messages = [
    #     {"role": "system", "content": System Prompt from the prompt_template},
    #     {"role": "user", "content": User Prompt from the prompt_template (embed the input values) + Constraints from the constraints field},
    # ]
    # Then, call the LLM to get the output. If there are multiple LLM calls, you should call the LLMs according to the logic_description field.
    # For example, use self.llm_client.chat(node_messages, response_format='json_object') to get the json format output
    # Use self.llm_client.chat(node_messages, response_format='normal') to get the normal text output
    # Fill your code here
    # For Retrieval_RAG nodes, find the information that this node needs to retrieve from the logic_description
    # Use self.search_engine.multi_turn_search(information needed to retrieve) to get the retrieved context
    # Based on the retrieved context, use the summarization prompt template (marked as User Prompt: in the prompt_template) to summarize the retrieved context
    # Use self.llm_client.chat(node_messages, response_format='json_object') to get the json format output
    # Use self.llm_client.chat(node_messages, response_format='normal') to get the normal text output
    # Fill your code here
    # ---- Step 3: Collect the output
    # Finally, collect the output into a dictionary with the keys as the output names and the values as the output values
    # Fill your code here
    return output_data

Figure 15: Prompt for Node Template.

Prompt for Node Generation Part 1

System_prompt: You are an expert system architect and multi-agent system designer. Your task is to design a complete pipeline of nodes (operators) to solve a specific task based on the task description and strategy analysis. You must carefully identify every step the task requires and create a corresponding node for each; do not omit necessary steps. You may ONLY use two types of nodes: LLM_Generator (call LLM to do reasoning/generation) and Retrieval_RAG (use search engine for RAG). All verification, validation, parsing, and format-checking must be implemented via the LLM (by writing clear requirements and rules in the prompt_template so the LLM performs checks and outputs the correct format). Do NOT write code to verify, parse, or validate LLM outputs; use the LLM to do it. Each node must follow the provided node definition structure, and the nodes must work together to form a complete solution pipeline.
User_prompt:
Task Description: <task_thinking>
<task_samples_section>
Strategy Analysis: <strategy_analysis>
The code template for all nodes is (Only use for the all_code field in the node definition): <code_template>

IMPORTANT: Design a pipeline using ONLY two node types. Each node must follow the node definition structure above.

STRICT RULES:
- Allowed node types: LLM_Generator and Retrieval_RAG.
- Verification and parsing via LLM, NOT code: Any need for verification (e.g. format check, validity check, number validation), parsing (e.g. extracting structured data from text), or fixing malformed output must be implemented by the LLM: put the rules and expected output format in the prompt_template (System Prompt / User Prompt) so that the LLM performs the checks and returns well-formed output. Do NOT write Python code to validate (e.g. json.loads, try/except, re.match) or parse LLM responses; if output might be messy, add instructions in the prompt or add another LLM_Generator node that asks the LLM to clean/validate and re-output.
- For calculations or deterministic steps, use an LLM_Generator node: ask the LLM to perform the reasoning and output the result in the required format; do not use code.

Your task:
1. Analyze the task and strategy analysis to understand:
- What is the overall task that needs to be solved?
- What background knowledge, architectural patterns, and evaluation metrics are available?
- List exhaustively all operations and workflow steps the task requires (e.g. input parsing, fact extraction, knowledge retrieval, reasoning, validation, synthesis, final answer formatting). Do not skip or merge steps mentally; write them down. Each of these should eventually map to at least one node.
2. Design a complete pipeline of nodes using ONLY LLM_Generator and Retrieval_RAG:
- For each step you identified above, create a corresponding node.
- Do not generate too few nodes: the pipeline must have enough nodes to cover the entire task from input to final output.
- If the task typically needs e.g. extraction → retrieval → reasoning → synthesis → formatting, you must have nodes for each (or clearly combined in a justified way).
- Break down the task into logical steps; each step is either (a) call LLM to do something, or (b) use search engine to retrieve then LLM to summarize/use.
- Before finalizing the node list, double-check: Is there a node that handles retrieval if the task needs external knowledge? Is there a node that produces the final answer in the required format? Are there nodes for every distinct logical phase (e.g. understand input, gather context, reason, output)? Add nodes if any required step is missing.
- Nodes are connected through dependencies (dependencies field).
- Do NOT add any node that would require custom Python code (e.g. no "Calculator Tool", "Validator Tool", "Parser Tool" as Python code). Use LLM_Generator for such roles if needed.
3. For each node, provide complete information following the node definition:
- node_name: A descriptive name (e.g., "x_Agent").
- node_type: One of [LLM_Generator, Retrieval_RAG].
- description: Summary of the node's role in the pipeline.
- dependencies: List of upstream node names that this node depends on.
- input: What information this node reads from inputs (be specific based on task samples).
- output: What this node produces (be specific about output format).
- constraints: Global constraints this node must comply with (from task requirements).
- implementation:
  - logic_description: Detailed description of the implementation logic (no code; describe what the node does in terms of LLM calls and/or search + LLM).
  - prompt_template: (For both node types) MUST provide complete, detailed prompt content: System Prompt (marked as "System Prompt:") and User Prompt (marked as "User Prompt:") with placeholders. Be specific and include examples.
  - tools_needed: For Retrieval_RAG nodes use ["Search"]; for LLM_Generator use []. Do NOT include "code_snippet". Omit it or set to null.
- all_code: Minimal runnable code only: (1) Read inputs from input_data. (2) For LLM_Generator: fill the prompt_template with input values and call self.llm_client.chat(node_messages, response_format=...). (3) For Retrieval_RAG: build search query from inputs, call self.search_engine.multi_turn_search(query), then fill prompt_template with retrieved context and call self.llm_client.chat. (4) Return output_data dict. Do NOT add code that verifies, parses, or validates the LLM response (no json.loads, re, try/except for parsing, no format checks); all verification/parsing is done by the LLM via the prompt.

Figure 16: Prompt for Node Generation Part 1.

Prompt for Node Generation Part 2

CRITICAL: Verification and parsing are the LLM's job: If a node needs to ensure valid JSON, correct format, or validated numbers, write these requirements in the prompt_template (e.g. "Output only valid JSON.", "Validate each amount and output the approved breakdown."). Do NOT implement verification or parsing in all_code (no json.loads, re, or try/except to fix LLM output). Use the LLM to do verification and output clean results.
- LLM_Generator nodes: Provide full System Prompt and User Prompt in prompt_template; put any validation/format rules there. all_code must only: extract inputs, build node_messages from prompt_template, call self.llm_client.chat, return {output_key: response}. No code that parses or validates the response.
- Retrieval_RAG nodes: logic_description must state what to retrieve and how to summarize. prompt_template must include System Prompt and User Prompt; use a placeholder like retrieved_context or retrieved_chunks for the search result. all_code must only: build query from inputs, call self.search_engine.multi_turn_search(query), build node_messages from prompt_template with retrieved content, call self.llm_client.chat, return output. No code that parses or validates the response.
- Retrieval_RAG: Design so you do NOT retrieve the question itself; retrieve only related knowledge (e.g. laws, case law, background) needed to answer. State this in logic_description and prompt_template.
4. Design principles:
- Use only LLM_Generator and Retrieval_RAG.
- Completeness over brevity: Ensure the pipeline has enough nodes for the task. List all logical steps the task requires (from task description and strategy analysis), then create one node (or more) for each step. When in doubt, add a dedicated node rather than overloading one node with multiple responsibilities. Too few nodes often lead to incomplete or poor results.
- Each node has a single responsibility.
- Dependencies form a DAG.
- Use LLM_Generator for reasoning, generation, extraction, validation, and any step that would otherwise need "code" (e.g. ask LLM to output structured JSON or numbers).
- Use Retrieval_RAG when external knowledge retrieval (search) is needed, then LLM to summarize or use the retrieved context.
5. Output format: Provide a JSON object with this structure:
{
  "pipeline_description": "Overall description of the pipeline and how nodes work together",
  "nodes": [
    {
      "node_name": "...",
      "node_type": "LLM_Generator or Retrieval_RAG only",
      "description": "...",
      "dependencies": ["..."],
      "input": ["..."],
      "output": ["..."],
      "constraints": "...",
      "implementation": {
        "logic_description": "...",
        "prompt_template": "...",
        "tools_needed": ["Search"] for Retrieval_RAG, [] for LLM_Generator
      },
      "all_code": "Minimal code only: input extraction, then LLM call(s) or search+LLM, then return output_data. No verification/parsing blocks."
    },
    ...
  ],
  "Connections": "Complete Python code for def execute_pipeline(self, initial_input_data): ... Execute nodes in dependency order; collect inputs from initial_input_data or results; call self.NodeName(input_data); store outputs; return final output. Import json if needed."
}
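The "Connections" function requested above can be sketched as follows. The node names, the dependency table, and the stand-in node bodies are illustrative assumptions; in a generated pipeline the node methods would perform LLM or search calls as specified in Figures 15–16, and only `execute_pipeline` would look like this.

```python
# Hypothetical sketch of execute_pipeline: run nodes in dependency (DAG) order,
# passing outputs from upstream nodes into downstream inputs.

class Pipeline:
    def __init__(self):
        # Dependencies form a DAG: each node lists its upstream nodes.
        self.dependencies = {
            "Extractor_Agent": [],
            "Retriever_Agent": ["Extractor_Agent"],
            "Reasoner_Agent": ["Extractor_Agent", "Retriever_Agent"],
        }

    # Minimal stand-ins for generated node methods (normally LLM/search calls).
    def Extractor_Agent(self, input_data):
        return {"facts": input_data["question"].split()}

    def Retriever_Agent(self, input_data):
        return {"context": ["doc about " + fact for fact in input_data["facts"]]}

    def Reasoner_Agent(self, input_data):
        return {"final_answer": f"{len(input_data['context'])} facts considered"}

    def execute_pipeline(self, initial_input_data):
        results = {}
        done = set()
        # Simple topological execution: run any node whose dependencies are all done.
        while len(done) < len(self.dependencies):
            for name, deps in self.dependencies.items():
                if name in done or not all(d in done for d in deps):
                    continue
                input_data = dict(initial_input_data)
                for d in deps:
                    input_data.update(results[d])  # pass upstream outputs downstream
                results[name] = getattr(self, name)(input_data)
                done.add(name)
        return results["Reasoner_Agent"]

out = Pipeline().execute_pipeline({"question": "what is MAS"})
```

The repeated scan terminates because the dependencies form a DAG, so every pass completes at least one node; Python's stdlib `graphlib.TopologicalSorter` would be an equally valid way to derive the execution order.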
Remember:
- Carefully check that the task needs are fully covered by nodes: Before outputting, verify you have a node for every required step (e.g. input understanding, retrieval if needed, reasoning, synthesis, final answer). The number of nodes should be sufficient to solve the task completely—do not output a pipeline with too few nodes.
- Use only LLM_Generator and Retrieval_RAG.
- All verification and parsing must be done by the LLM: write rules and output-format requirements in the prompt_template; do not write code to verify or parse LLM output (no json.loads, re, try/except for validation/parsing in all_code).
- all_code must be minimal: read input -> (LLM call or search+LLM) -> return output. No code that checks or parses the LLM response.
- Dependencies must form a valid DAG.
- Use task samples to align input/output formats.
- For "Connections": generate the pipeline execution function that runs nodes in dependency order and passes data correctly.

Figure 17: Prompt for Node Generation Part 2.

Prompt for Node Optimization

System_prompt: You are an expert system optimizer and code reviewer. Your task is to analyze a node in a multi-agent pipeline that has the lowest reward and optimize its internal structure to improve performance. All optimizations must be achieved via the LLM. You may: (1) improve existing LLM prompts, (2) introduce new LLM calls where needed, (3) optimize how multiple LLM calls within the same node communicate and interact—e.g. what is passed between calls, in what format, in what order, and how results are aggregated. Do NOT add Python code for rules, regex, normalization, or filtering—fix shortcomings by prompt engineering or by adding/adjusting LLM calls and their communication, not by code.
User_prompt:
Question: <question>
Expected Answer: <answer>

# Node to Optimize
Node Name: node_name
Node Type: node_type
Node Description: node_description
Node Reward: node_reward (This is the lowest reward, indicating poor performance)
Node Position: Node node_index + 1 in the pipeline

# Current Node Implementation
Implementation Details: json.dumps(node_implementation, ensure_ascii=False, indent=2)
Current Code: node_all_code

# Pipeline Context
All Intermediate Outputs (to understand the data flow; when multiple samples exist, each [Sample N] block's node outputs correspond to the [Sample N] Question/Answer in Task Context above): intermediate_context

# Analysis Task
Based on the question, expected answer, and the intermediate outputs from all nodes, analyze why this node has the lowest reward and provide optimization suggestions.

Analysis Steps:
1. Identify the Problem:
- What is the node's current output? (from intermediate_outputs)
- How does it differ from what's expected?
- What specific issues are causing the low reward?
2. Root Cause Analysis:
- Is the prompt (for LLM_Generator/Retrieval_RAG) clear and specific enough?
- Are the LLM calls structured optimally? If the node has multiple LLM calls, is the communication between them effective—e.g. is the handoff from one call to the next clear, in a good format, and in the right order?
- Are there missing or redundant steps?
- Is the implementation handling all cases correctly?
- Is the retrieval (for Retrieval_RAG) getting relevant information?
- Are there any logical errors or missing validations?
3. Optimization Strategy: optimization_focus

CRITICAL: Fix shortcomings via LLM, not code: You may (1) improve existing prompts, (2) introduce new LLM calls (e.g. a refinement or validation step), (3) optimize inter-LLM communication when a node has multiple LLM calls—e.g. clarify what each call receives from the previous one, improve the handoff format in the prompt, reorder or add calls so the flow is clearer.
Do NOT add Python code for rule-based checks, regex, normalization, or filtering. The code should remain minimal: prepare inputs → call LLM(s), passing outputs between calls as needed → return output.

# Output Format
Provide a JSON object with the following structure:
{
  "analysis": {
    "problem_identification": "Detailed description of what's wrong with the current node",
    "root_cause": "Analysis of why the node is performing poorly",
    "optimization_strategy": "Specific strategy to improve the node"
  },
  "optimized_implementation": {
    "prompt_template": "Updated prompt template (marked as System Prompt: and User Prompt:). Keep original if no changes needed.",
    "tools_needed": "Updated tools_needed (for Retrieval_RAG nodes). Keep original if no changes needed.",
    "logic_description": "Updated logic description explaining the optimization"
  },
  "optimized_all_code": "Complete updated code for the node following the code_template structure. MUST be complete and runnable. Output the code in the same format as the original code.",
  "optimization_explanation": "Detailed explanation of what was optimized and why"
}

IMPORTANT: Fix any identified problems by improving the prompt, adding or reordering LLM calls, or improving how multiple LLM calls in the same node communicate (what is passed, in what format). Do NOT add Python code for validation, regex, normalization, rule-based filtering, or parsing of LLM output. optimized_all_code must stay minimal: get inputs → call LLM(s), passing outputs between calls as needed → return output.
- For LLM_Generator: Improve prompt_template and/or the number and sequence of LLM calls; if there are multiple calls, ensure each call's prompt clearly receives and uses the outputs of previous calls. Do not add code to parse or validate responses.
- For Retrieval_RAG: Improve the summarization prompt and query construction; do not add code to filter or normalize retrieved content—instruct the LLM to do it in the prompt.
The optimized_all_code MUST be complete and runnable but MUST NOT contain extra validation/parsing/regex/filtering code. If you add, remove, or reorder LLM calls, or change how they communicate (handoff format/order), explain the reasoning in logic_description.

Figure 18: Prompt for Node Optimization.
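As a concrete illustration, the node template in Figure 15 could be instantiated for a simple LLM_Generator node as below. The node name, prompt text, and `StubLLMClient` are hypothetical; the real framework's `self.llm_client` would wrap an actual LLM API, and the three-step shape (extract inputs → build messages and call the LLM → collect outputs, with no parsing or validation code) follows the template and the rules in Figures 16–18.

```python
# Hypothetical LLM_Generator node filled in from the Figure 15 template.

class StubLLMClient:
    """Stand-in for the framework's LLM client so the sketch runs standalone."""
    def chat(self, messages, response_format="normal"):
        return '{"summary": "stub output"}'

class NodeHost:
    def __init__(self):
        self.llm_client = StubLLMClient()

    def Summarizer_Agent(self, input_data):
        """
        node_id: n1
        node_type: LLM_Generator
        description: Summarize the input passage into one sentence.
        dependencies: []
        input: ["passage"]
        output: ["summary"]
        """
        # ---- Step 1: Process the input data
        passage = input_data["passage"]
        # ---- Step 2: Build node_messages from the prompt_template and call the LLM
        node_messages = [
            {"role": "system", "content": "You are a concise summarizer."},
            {"role": "user", "content": f"Summarize in one sentence:\n{passage}"},
        ]
        response = self.llm_client.chat(node_messages, response_format="json_object")
        # ---- Step 3: Collect the output (no json.loads or validation, per the rules)
        return {"summary": response}

output_data = NodeHost().Summarizer_Agent(
    {"passage": "Multi-agent systems coordinate LLMs."}
)
```

Note that the raw model response is returned as-is: under the framework's conventions, any format checking would be expressed inside the prompt, not in this code.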