Paper deep dive
CABTO: Context-Aware Behavior Tree Grounding for Robot Manipulation
Yishuai Cai, Xinglin Chen, Yunxin Mao, Kun Hu, Minglong Li, Yaodong Yang, Yuanpei Chen
Abstract
Behavior Trees (BTs) offer a powerful paradigm for designing modular and reactive robot controllers. BT planning, an emerging field, provides theoretical guarantees for the automated generation of reliable BTs. However, BT planning typically assumes that a well-designed BT system is already grounded -- comprising high-level action models and low-level control policies -- which often requires extensive expert knowledge and manual effort. In this paper, we formalize the BT Grounding problem: the automated construction of a complete and consistent BT system. We analyze its complexity and introduce CABTO (Context-Aware Behavior Tree grOunding), the first framework to efficiently solve this challenge. CABTO leverages pre-trained Large Models (LMs) to heuristically search the space of action models and control policies, guided by contextual feedback from BT planners and environmental observations. Experiments spanning seven task sets across three distinct robotic manipulation scenarios demonstrate CABTO's effectiveness and efficiency in generating complete and consistent behavior tree systems.
Links
- Source: https://arxiv.org/abs/2603.16809v1
- Canonical: https://arxiv.org/abs/2603.16809v1
Full Text
CABTO: Context-Aware Behavior Tree Grounding for Robot Manipulation

Yishuai Cai 1,2,4 *†, Xinglin Chen 1,2,4 *, Yunxin Mao 1, Kun Hu 1, Minglong Li 1‡, Yaodong Yang 3,4, Yuanpei Chen 2,3,4
1 National University of Defense Technology, 2 PsiBot, 3 Peking University, 4 PKU-Psibot Lab
caiyishuai, chenxinglin, liminglong10@nudt.edu.cn

Abstract

Behavior Trees (BTs) offer a powerful paradigm for designing modular and reactive robot controllers. BT planning, an emerging field, provides theoretical guarantees for the automated generation of reliable BTs. However, BT planning typically assumes that a well-designed BT system is already grounded -- comprising high-level action models and low-level control policies -- which often requires extensive expert knowledge and manual effort. In this paper, we formalize the BT Grounding problem: the automated construction of a complete and consistent BT system. We analyze its complexity and introduce CABTO (Context-Aware Behavior Tree grOunding), the first framework to efficiently solve this challenge. CABTO leverages pre-trained Large Models (LMs) to heuristically search the space of action models and control policies, guided by contextual feedback from BT planners and environmental observations. Experiments spanning seven task sets across three distinct robotic manipulation scenarios demonstrate CABTO's effectiveness and efficiency in generating complete and consistent behavior tree systems.

Introduction

Robot manipulation necessitates both reliable high-level planning and robust low-level control policies. Recently, Behavior Trees (BTs) (Ögren and Sprague 2022; Colledanchise and Ögren 2018) have emerged as a highly reliable and robust control architecture for intelligent robots, recognized for their modularity, interpretability, reactivity, and safety.
Many methods have been proposed to automatically generate BTs for task execution, including evolutionary computing (Neupane and Goodrich 2019; Colledanchise, Parasuraman, and Ögren 2019) and machine learning approaches (Banerjee 2018; French et al. 2019). In particular, BT planning (Cai et al. 2021; Chen et al. 2024; Cai et al. 2025a) has shown significant promise, primarily due to its strong theoretical guarantees: the BTs generated by such methods are provably successful in achieving goals within a finite time horizon. Despite these advancements, BT planning critically assumes the prior existence of a well-grounded BT system.

* These authors contributed equally. † This work was completed during an internship at PsiBot. ‡ Corresponding Author. Copyright © 2026, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Constructing such a system, encompassing both high-level action models and their corresponding low-level control policies, typically requires substantial human expertise and effort. Specifically, for high-level planning, the BT system must contain a sufficient and appropriately modeled set of condition and action nodes, enabling their assembly into BTs capable of accomplishing diverse tasks. Concurrently, for low-level execution, these nodes must be reliably linked to executable control policies that ensure environmental transitions occur precisely as specified by the action models, ideally with high success rates. In this paper, we formally define the BT grounding problem: the automated construction of a complete and consistent BT system for a given task set. We characterize a well-designed BT system by two critical properties: (1) Completeness: a complete BT system can generate solution BTs for all tasks within the specified task set through high-level BT planning, based on its action models.
(2) Consistency: a consistent BT system ensures that its control policies lead to state transitions that precisely match their corresponding action models during low-level BT execution. Figure 1 illustrates these concepts. For instance, a BT system with action set {a_2, a_3} is incomplete if it can only produce a solution BT for a subset of tasks, such as {p_2, p_3}. Conversely, an action set like {a_1, a_2} that successfully generates solution BTs for all three tasks exemplifies completeness. However, a_1 would be inconsistent if its control policy fails to achieve Holding(apple) as declared by its action model. Furthermore, a_2 is also inconsistent because its policy cannot put the apple in the drawer without the precondition IsOpen(drawer). In contrast, an action like a_3, whose policy perfectly aligns with its action model's state transitions, is considered consistent.

We demonstrate a naive algorithm that solves the BT grounding problem via exhaustive search. While this approach effectively illustrates the core concepts, its exponential time complexity renders it impractical for deployment. Large Models (LMs), pre-trained on extensive corpora, images, or datasets in other modalities, have shown significant abilities in searching, reasoning, and planning (Zhou et al. 2024; Valmeekam et al. 2023; Cai et al. 2025b). Leveraging the advantages of LMs, we propose the first framework for efficiently solving the BT grounding problem, named Context-Aware Behavior Tree grOunding (CABTO).

arXiv:2603.16809v1 [cs.RO] 17 Mar 2026

Figure 1: Concepts involved in the BT grounding problem. (a) A BT is a directed rooted tree with behavior nodes and control nodes. (b) The solution is a complete and consistent BT system for the given task set. (c) A complete BT system can generate solution BTs for all tasks during the high-level BT planning based on action models.
(d) A consistent BT system ensures its control policies result in state transitions consistent with their action models during low-level BT execution.

CABTO mainly utilizes pre-trained LMs to heuristically search the space of action models and control policies based on the contexts of BT planning and environmental feedback. CABTO includes three phases: (1) High-level model proposal. Given a task set, we first use Large Language Models (LLMs) to generate promising action models and use a sound and complete BT planning algorithm to evaluate their completeness. The contexts in this phase are planning details. (2) Low-level policy sampling. We then employ Vision-Language Models (VLMs) to sample promising policy types as well as their hyperparameters for explored action models. A matching policy and its corresponding action model together form a consistent action. The contexts in this phase are environment feedback. (3) Cross-level refinement. If the algorithm fails to find any policy for a given action model, the model is deemed inconsistent. In this case, the contexts of both high-level planning information and low-level environment feedback can be combined to help VLMs refine the action model and generate more promising action models.

The key contributions of this work are as follows:
• We formally define the BT grounding problem as the construction of a complete and consistent BT system for a given task set. We provide a formal analysis and present a naive algorithm that elucidates the foundational concepts for solving this problem.
• We propose CABTO, the first framework for efficiently solving the BT grounding problem. CABTO strategically utilizes pre-trained LMs to heuristically explore the space of action models and control policies, informed by both BT planning contexts and environmental feedback.
• We empirically validate CABTO's superior effectiveness and efficiency in automatically generating complete and consistent BT systems across 7 diverse task sets in 3 distinct robotic manipulation scenarios. Comprehensive ablation studies further investigate the impact of LMs, control policy types, and cross-level refinement.

Related Work

Behavior Tree Generation. Most existing BT generation methods focus on constructing the BT structure while assuming predefined execution policies. Heuristic search approaches, including grammatical evolution (Neupane, Goodrich, and Mercer 2018), genetic programming (Lim, Baumgarten, and Colton 2010), and Monte Carlo DAG Search (Scheide, Best, and Hollinger 2021), have been widely studied. Machine learning methods, such as reinforcement learning (Banerjee 2018; Pereira and Engel 2015) and imitation learning (French et al. 2019), as well as formal synthesis approaches like LTL (Neupane and Goodrich 2023) and its variants (Tadewos, Newaz, and Karimoddini 2022), have also been explored. However, these methods often require complex environment modeling or cannot guarantee BT reliability. In contrast, BT planning (Cai et al. 2025b; Chen et al. 2024; Cai et al. 2025a) based on STRIPS-style action models (Fikes and Nilsson 1971) provides interpretable environment modeling while ensuring both reliability and robustness.

High-Level Action Models. Action models define the blueprints of actions that drive state transitions in a system (Arora et al. 2018). They are widely used in classical planning (Hoffmann and Nebel 2001; Bonet, Loerincs, and Geffner 1997), task and motion planning (TAMP) (Yang et al. 2024; Kumar et al. 2024), and symbolic problem solving (Pan et al. 2023; Fikes and Nilsson 1971). To reduce expert design effort, many methods learn action models from plan execution traces (Mahdavi et al. 2024; Bachor and Behnke 2024; Mordoch et al. 2024; Liu et al. 2023), employing inductive learning (Liang et al.
2025), evolutionary algorithms (Newton and Levine 2010), reinforcement learning (Rodrigues et al. 2012), and transfer learning (Zhuo and Yang 2014). However, these approaches typically assume the traces are already available, overlooking how to obtain them through low-level execution -- an obstacle to practical deployment.

Low-Level Control Policies. Modern low-level robot manipulation policies can be broadly categorized into three types: (1) End-to-end policies, which directly map proprioceptive inputs to joint controls via reinforcement learning (Bai et al. 2025; Chen et al. 2023), imitation learning (Zare et al. 2024), and, more recently, Vision-Language-Action Models (VLAs) fine-tuned from large vision-language models (Zhong et al. 2025; Kim et al. 2024; Zhen et al. 2024). (2) Hierarchical policies, which decompose control into structured modules leveraging representations such as rigid-body poses (Kaelbling and Lozano-Pérez 2011), constraints (Huang et al. 2025), affordances (Huang et al. 2023), waypoints (Zhang et al. 2024), or skills and symbolic codes (Haresh et al. 2024; Mu et al. 2024). These approaches exploit expert knowledge to improve interpretability and extend long-horizon capabilities. (3) Rule-based policies, built solely on expert-designed control algorithms (Thomason, Kingston, and Kavraki 2024; Sundaralingam et al. 2023), offer strong robustness for specific tasks but struggle to generalize to unseen scenarios.

Preliminaries

Behavior Tree. A BT T is a rooted directed tree where internal nodes are control flow nodes and leaf nodes are execution nodes (Colledanchise and Ögren 2018). The tree is executed via periodic "ticks" from the root. The core nodes include: (1) Condition: returns success if a state proposition holds, else failure. (2) Action: performs tasks and returns success, failure, or running. (3) Sequence (→): succeeds only if all children succeed (AND logic).
(4) Fallback (?): fails only if all children fail (OR logic).

BT System. Following (Cai et al. 2021), a BT can be represented as a four-tuple T = ⟨n, h, π, r⟩, where n is the number of binary propositions describing the world state. Here, h : 2^n → 2^n denotes the action model representing the intended state transition; π : 2^n → 2^n denotes the control policy representing the actual execution effect; and r : 2^n → {success, running, failure} partitions the state space according to the BT's return status. A BT system is defined as Φ = ⟨C, A⟩. Each action a ∈ A is a tuple ⟨h_a, π_a⟩, where h_a = ⟨pre_h(a), add_h(a), del_h(a)⟩ is its action model (intended effect) and π_a = ⟨pre_π(a), add_π(a), del_π(a)⟩ is its control policy (actual effect). The preconditions pre_h(a), pre_π(a), add effects add_h(a), add_π(a), and delete effects del_h(a), del_π(a) are all subsets of the condition node set C. In a well-designed BT system, provided that the current state s_t satisfies the precondition (i.e., s_t ⊇ pre_h(a)), the state transition upon completion of action a after k time steps satisfies:

    s_{t+k} = h_a(s_t) = π_a(s_t) = s_t ∪ add(a) \ del(a)    (1)

where h_a(s_t) and π_a(s_t) denote the states resulting from the action model and the control policy execution, respectively.

BT Planning. Given a BT system Φ, a BT planning problem is defined as p = ⟨S, s_0, g⟩, where S is the finite set of environment states, s_0 is the initial state, and g is the goal condition.
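The transition in Equation (1) is plain set algebra over proposition sets; a minimal sketch makes the semantics concrete (the proposition names and the helper below are illustrative, not the authors' code):

```python
# Sketch of Equation (1): for an action with precondition pre, add effects add,
# and delete effects delete, a completed execution maps s_t to (s_t ∪ add) \ delete,
# provided s_t ⊇ pre. Proposition names are illustrative only.

def apply_action(state, pre, add, delete):
    """Return h_a(s_t) if the precondition holds in `state`, else None."""
    if not pre <= state:              # s_t must satisfy pre_h(a)
        return None
    return (state | add) - delete

s0 = {"IsOpen(drawer)", "Holding(apple)"}
s1 = apply_action(
    s0,
    pre={"IsOpen(drawer)", "Holding(apple)"},
    add={"InDrawer(apple)"},
    delete={"Holding(apple)"},
)
# s1 == {"IsOpen(drawer)", "InDrawer(apple)"}
```

Note that when the precondition fails, no transition is defined, which is exactly why a BT guards each action node with condition nodes before ticking it.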
A condition c ⊆ C is a subset of a state s, and can be an atom condition node or a sequence node with atom condition nodes as children. If c ⊆ s, then c holds in s. A sound and complete BT planning algorithm, like BT Expansion (Cai et al. 2021), ensures a solution BT T in finite time if p is solvable. Such a BT T can transition the state from s_0 to s_n = π_T(s_0) ⊇ g in a finite number of steps n.

Algorithm 1: Naive Algorithm for BT Grounding
Input: Problem ⟨P, C_P, H_P, Π_P⟩
Output: Solution Φ = ⟨C, A⟩
 1: A ← ∅                          ▷ initialize grounded actions
 2: for pre ∈ 2^{C_P}, add ∈ 2^{C_P}, del ∈ 2^{C_P} do
 3:   h ← ⟨pre, add, del⟩          ▷ create an action model
 4:   if h ∈ H_P then
 5:     for each policy π ∈ Π_P do
 6:       if Consistent(h, π) then
 7:         a ← ⟨h, π⟩             ▷ create a consistent action
 8:         A ← A ∪ {a}            ▷ add the consistent action
 9:         break
10:       end if
11:     end for
12:   end if
13: end for
14: C ← ⋃_{a∈A} pre_h(a) ∪ add_h(a) ∪ del_h(a)
15: return Φ = ⟨C, A⟩

Problem Formulation

In this paper, we focus on the automatic construction of the BT system, and therefore need to formally define the properties that describe a well-designed BT system.

Definition 0.1 (Completeness). A BT system Φ is complete in the task set P if, ∀p ∈ P, any complete BT planning algorithm can produce a BT T that solves the task p according to its action models.

The completeness of the BT system Φ describes whether its condition nodes C and action nodes A are sufficient to solve all of the tasks in the given task set at the planning level.

Definition 0.2 (Consistency). An action a is consistent if pre_π(a) ⊆ pre_h(a), add_π(a) = add_h(a), del_π(a) = del_h(a). That is, the control policy π_a is capable of inducing state transitions that match its action model. A BT system Φ is consistent if ∀a ∈ A, a is consistent.

The consistency of the BT system Φ describes whether all action nodes can be successfully executed and cause the state to transition as desired, just as specified by their action models.
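Definition 0.2 amounts to three set comparisons, which a short sketch can state directly (a hypothetical helper over plain proposition sets, not the paper's implementation):

```python
# Consistency per Definition 0.2: pre_π(a) ⊆ pre_h(a), add_π(a) = add_h(a),
# del_π(a) = del_h(a). Model and policy triples are dicts of proposition sets.

def is_consistent(h, pi):
    return (pi["pre"] <= h["pre"]
            and pi["add"] == h["add"]
            and pi["del"] == h["del"])

# Illustrative model for putting an apple in a drawer; the policy additionally
# requires the drawer to be open, so pre_π ⊄ pre_h and the action is
# inconsistent (mirroring the a_2 example in Figure 1).
h = {"pre": {"Holding(apple)"},
     "add": {"InDrawer(apple)"},
     "del": {"Holding(apple)"}}
pi = {"pre": {"Holding(apple)", "IsOpen(drawer)"},
      "add": {"InDrawer(apple)"},
      "del": {"Holding(apple)"}}
# is_consistent(h, pi) -> False; is_consistent(h, h) -> True
```

The asymmetry in the definition is deliberate: the policy may demand fewer preconditions than the model declares, but its effects must match the model exactly.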
Both completeness and consistency are essential for constructing a BT system for embodied robots to complete tasks. We then define the BT grounding problem as follows:

Problem 1 (BT Grounding). A BT grounding problem is a tuple ⟨P, C_P, H_P, Π_P⟩, where P is the finite task set, C_P is the finite set of valid condition nodes, H_P is the finite set of valid action models, and Π_P is the set of valid control policies. A solution to this problem is a BT system Φ = ⟨C, A⟩ that is complete and consistent in the task set P, where C ⊆ C_P and ∀a ∈ A, a = ⟨h_a, π_a⟩, h_a ∈ H ⊆ H_P, π_a ∈ Π ⊆ Π_P.

Figure 2: The framework of CABTO includes three phases: (1) High-level model proposal leverages the planning contexts for the LLMs to heuristically explore the space of action models; (2) Low-level policy sampling leverages the execution contexts for the VLMs to heuristically explore the space of control policies; (3) Cross-level refinement leverages both planning and execution contexts for refining inconsistent action models.

Methodology

This section first presents a naive algorithm and a formal analysis to establish the foundational principles of the BT grounding problem. We then detail the CABTO framework, encompassing high-level model proposal, low-level policy sampling, and cross-level refinement. Finally, we provide the implementation details of the CABTO system.

Naive Algorithm for BT Grounding. Algorithm 1 outlines a naive approach to BT grounding. Given the problem tuple ⟨P, C_P, H_P, Π_P⟩, the algorithm initializes an empty action set A (line 1) and exhaustively traverses the power set of action components (line 2). For each candidate action model h, the algorithm first verifies its validity (lines 3–4). It excludes models based on domain-independent constraints (e.g., add ∩ del ≠ ∅) or domain-dependent constraints (e.g., mutually exclusive preconditions), though the latter typically require extensive expert knowledge.
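The exhaustive traversal with domain-independent filtering (lines 2–4 of Algorithm 1) can be sketched as a generator over candidate ⟨pre, add, del⟩ triples. The specific filters below (non-empty effect, add ∩ del = ∅, add disjoint from pre) are illustrative choices, not the paper's exact constraint set:

```python
from itertools import combinations

def powerset(props):
    """All subsets of a list of propositions, as frozensets."""
    return [frozenset(c) for r in range(len(props) + 1)
            for c in combinations(props, r)]

def candidate_models(props):
    """Yield ⟨pre, add, del⟩ triples surviving simple domain-independent filters."""
    for pre in powerset(props):
        for add in powerset(props):
            for dele in powerset(props):
                if not (add or dele):   # no-op model: no effect at all
                    continue
                if add & dele:          # contradictory effect (add ∩ del ≠ ∅)
                    continue
                if add & pre:           # adds a proposition already required true
                    continue
                yield (pre, add, dele)

# With n = 2 propositions the raw space already has 2^(3*2) = 64 triples;
# the filters prune some, but growth in n remains exponential (O(2^(3n))).
survivors = list(candidate_models(["p", "q"]))
```

Even aggressive domain-independent pruning only shaves a constant factor off the 2^{3n} space, which is the motivation for replacing this enumeration with LM-guided search.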
Even when restricted to domain-independent constraints, exploring the model space entails an exponential complexity of O(2^{3n}). The algorithm then retrieves a control policy π ∈ Π_P (line 5) and verifies its consistency with h (line 6). Upon a successful match, it instantiates a consistent action a = ⟨h, π⟩ (line 7) and appends it to A. Finally, the algorithm induces the condition set C from the union of all atomic conditions in A (line 14) and returns the resulting BT system Φ. While exhaustive and correct, this algorithm faces significant limitations: (1) the exponential complexity of exploring H_P, and (2) the practical difficulty of designing Π_P and verifying policy consistency. Notably, automatically synthesizing low-level control policies to achieve specific effects remains a fundamental challenge in robotics (Kumar et al. 2023).

To overcome these limitations, we propose CABTO, a principled framework designed for efficient BT grounding. As illustrated in Algorithm 2, CABTO decomposes the grounding process into three phases, leveraging multi-modal contexts to circumvent exhaustive search. The following sections detail the implementation and context acquisition strategies employed in each phase.

High-Level Model Proposal. CABTO first initializes the grounded action set A as empty (Line 1) and defines the unexplored model space H_U as the complete set of potential action models H_P (Line 2). The process commences with an initial proposal phase, where the LLM receives a structured textual prompt defining the task set P. This context encapsulates goal states and initial conditions formalized as first-order logic propositions, alongside the semantic descriptions of scene objects.
Leveraging this task-specific context, the LLM identifies a subset of promising models H_E from H_P by specifying their symbolic preconditions and effects in a programmatic format (Line 3):

    H_E = LLM(P, H_P)    (2)

Empirical results demonstrate that for simple task sets, this initial proposal phase often yields sufficient action models to satisfy the majority of requirements in P.

To accommodate complex scenarios where initial proposals may be incomplete, CABTO employs a refinement loop that iterates until the task set P is verified as fully solvable using the validated grounded actions in A (Line 4).

Algorithm 2: CABTO
Input: Problem ⟨P, C_P, H_P, Π_P⟩
Output: Solution Φ = ⟨C, A⟩
 1: A ← ∅                                  ▷ initialize grounded actions
 2: H_U ← H_P                              ▷ initialize model search spaces
 3: H_E ← LLM(P, H_P)                      ▷ Equation 2
 4: while H_U ≠ ∅ and not AllSolvable(P, A) do
 5:   // high-level model proposal
 6:   repeat
 7:     I_fail ← {I_p | p ∈ P, BTPlanning(p, H_E) fails}
 8:     if I_fail ≠ ∅ then
 9:       H' ← LLM(P, H_U, I_fail)         ▷ Equation 3
10:       H_U ← H_U \ H', H_E ← H_E ∪ H'
11:     end if
12:   until I_fail = ∅ or H_U = ∅
13:   // low-level policy sampling
14:   for each h ∈ H_E do
15:     n ← 0, π ← null, Consistent(h, π) ← False
16:     while n < N_max and not Consistent(h, π) do
17:       π ← VLM(h, Π_P, I_e)             ▷ Equation 4
18:       Sample a scenario s_0 where pre(h) ⊆ s_0
19:       s_t, I_e ← Execute(π, s_0)
20:       if s_t ⊇ (pre(h) ∪ add(h) \ del(h)) then
21:         Consistent(h, π) ← True
22:         A ← A ∪ {⟨h, π⟩}, H ← H ∪ {h}
23:       end if
24:       n ← n + 1
25:     end while
26:     // cross-level refinement
27:     if not Consistent(h, π) then
28:       h' ← VLM(h, H_U, I_p, Π_P, I_e)  ▷ Equation 5
29:       H_U ← H_U \ {h'}, H_E ← H_E ∪ {h'}
30:     end if
31:   end for
32:   H_E ← H                              ▷ prune H_E to validated set
33: end while
34: C ← ⋃_{⟨h,π⟩∈A} (pre(h) ∪ add(h) ∪ del(h))
35: return Φ = ⟨C, A⟩

Within this loop, the algorithm assesses the completeness of the current candidate set H_E by attempting to synthesize BTs for all tasks in P through BT Planning. Any planning failure triggers the aggregation of diagnostic data into a failure set I_fail (Line 7).
Each entry I_p ∈ I_fail encapsulates critical diagnostics, such as the topological sketches of incomplete BTs and the count of expanded conditions. These metrics provide essential semantic cues, aiding the LLM in identifying symbolic gaps to propose more promising action models:

    H' ← LLM(P, H_U, I_fail)    (3)

Subsequently, proposed models are transferred from H_U to the candidate set H_E (Line 10). This heuristic search iterates until the task set P is logically spanned by a complete BT system.

Low-Level Policy Sampling. This phase verifies the physical consistency of candidate action models h ∈ H_E (Line 14). For each model h, we initialize the trial counter n = 0 and the policy π as null. To bridge the gap between abstract symbolic reasoning and precise physical execution, we propose a hierarchical framework that integrates Molmo (Deitke et al. 2025) with programmatic code generation. Specifically, within a budget of N_max attempts (Line 16), the VLM acts as a programmatic sampler (Line 17) that translates high-level semantic intentions into grounded control policies:

    π ← VLM(h, Π_P, I_e)    (4)

where Π_P represents the set of available control interfaces. These interfaces comprise Molmo-based (Deitke et al. 2024) perception APIs for extracting environmental keypoints, cuRobo-based (Sundaralingam et al. 2023) motion control APIs (a 7-DoF IK solver) for the robotic arm, and gripper actuation commands. The execution context I_e serves as a critical nexus for closed-loop iterative refinement. It encapsulates multi-modal diagnostic data, including egocentric visual observations, previously synthesized control code, post-hoc visual feedback, and categorical success/failure signals. By maintaining this high-fidelity temporal trace, the VLM can effectively anchor its subsequent sampling within the physical constraints evidenced by prior execution attempts.
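The sample-execute-verify loop (Algorithm 2, lines 14–25) can be sketched as follows; `propose_policy`, `execute`, and `sample_initial_state` are hypothetical stand-ins for the VLM sampler, the simulator rollout, and scenario initialization, not the authors' APIs:

```python
# Sketch of low-level policy sampling: propose a policy, roll it out from a
# state satisfying pre(h), and accept it if the symbolic effects hold
# (s_t ⊇ pre(h) ∪ add(h) \ del(h), Algorithm 2 line 20). Stubs are illustrative.

def ground_policy(h, propose_policy, execute, sample_initial_state, n_max=3):
    context = []                                # execution context I_e
    for _ in range(n_max):
        pi = propose_policy(h, context)         # VLM(h, Π_P, I_e), Eq. (4)
        s0 = sample_initial_state(h["pre"])     # scenario with pre(h) ⊆ s_0
        s_t, feedback = execute(pi, s0)
        expected = (h["pre"] | h["add"]) - h["del"]
        if s_t >= expected:                     # symbolic effects achieved
            return pi                           # ⟨h, π⟩ is a grounded action
        context.append(feedback)                # feed failure back to the sampler
    return None                                 # exhausts budget: refine h
```

Returning `None` after the N_max budget is exactly the condition that hands the model over to cross-level refinement in the next phase.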
Specifically, the VLM selectively invokes Molmo-based perception tools conditioned on the logical semantics of h. When precise spatial grounding is necessitated, the VLM leverages Molmo to extract functional affordances and task-relevant keypoints, such as optimal grasp points or target placement coordinates, directly from the visual observation V. Subsequently, the VLM synthesizes these grounded keypoints and parameterized APIs into executable Pythonic code that instantiates the specific control policy π.

To validate the policy π, we initialize a simulation s_0 such that the initial state satisfies the precondition pre(h) (Line 18). After execution (Line 19), we check if the terminal state s_t achieves the expected symbolic effects: s_t ⊇ (pre(h) ∪ add(h) \ del(h)) (Line 20). Upon verification, the grounded action ⟨h, π⟩ is appended to the action set A.

Cross-Level Refinement. If N_max sampling attempts fail to yield a valid policy, the action model h is deemed physically inconsistent. While a naive approach would be to discard the model and restart the high-level proposal in the next iteration, we instead leverage both planning and execution contexts to refine h (Line 28):

    h' ← VLM(h, H_U, I_p, Π_P, I_e)    (5)

Here, the VLM synthesizes the planning context I_p, which defines the functional necessity of h within successful symbolic sequences (for each planning context I_p, h ∈ T_p), and the execution context I_e, which comprises multi-modal diagnostic data such as egocentric pre/post-action imagery and binary feedback. By integrating these cross-level insights, the VLM identifies underlying failures, such as omitted spatial preconditions or inaccurate symbolic effects, to synthesize a rectified action model h'. Upon completing the refinement loop, a knowledge synchronization step (Line 32) updates H_E ← H, ensuring the explored pool consists exclusively of models verified by physical policies.
This update provides a grounded, reliable action library for subsequent planning iterations (Line 7). The cycle repeats until the set of grounded actions A renders all tasks in P solvable, after which the condition set C is extracted to define the final grounded state space (Line 34).

| Robot | Task Set | Acts | Conds | Steps | GPT-3.5-Turbo ASR (w/o → w) | GPT-3.5-Turbo CSR (w/o → w) | GPT-3.5-Turbo FC | GPT-4o ASR (w/o → w) | GPT-4o CSR (w/o → w) | GPT-4o FC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Franka | Cover | 2.0 | 4.4 | 4.0 | 60.0% → 66.7% | 40% → 50% | 1.6 | 100.0% → 100.0% | 100% → 100% | 0.0 |
| Franka | Blocks | 2.3 | 3.1 | 4.1 | 70.0% → 70.0% | 30% → 50% | 2.0 | 60.0% → 80.0% | 50% → 80% | 1.1 |
| Dual-Franka | Pour | 5.5 | 8.1 | 3.6 | 80.0% → 96.7% | 70% → 90% | 0.5 | 66.7% → 100.0% | 60% → 100% | 0.6 |
| Dual-Franka | Handover | 5.0 | 3.3 | 2.7 | 80.0% → 90.0% | 70% → 90% | 0.5 | 56.7% → 90.0% | 30% → 90% | 1.3 |
| Dual-Franka | Storage | 6.0 | 5.4 | 2.2 | 56.7% → 73.3% | 0% → 60% | 2.0 | 53.3% → 76.7% | 20% → 70% | 1.7 |
| Fetch | Tidy Home | 5.6 | 6.0 | 2.9 | 53.3% → 56.7% | 40% → 50% | 0.7 | 53.3% → 90.0% | 30% → 90% | 1.3 |
| Fetch | Cook Meal | 6.8 | 7.9 | 5.1 | 70.0% → 70.0% | 50% → 60% | 0.7 | 73.3% → 100.0% | 60% → 100% | 0.4 |
| | Total | 4.7 | 5.5 | 3.5 | 67.1% → 74.8% | 42.9% → 64.3% | 1.1 | 66.2% → 91.0% | 50% → 90.0% | 0.9 |

Table 1: High-level model proposal results (averaged over 10 trials, max FC = 3) for GPT-3.5-Turbo vs. GPT-4o. Note: "w/o" denotes without planning contexts, and "w" denotes with planning contexts.

Figure 3: Configurations of the single-arm and dual-arm Franka manipulation tasks in Isaac Sim.

Figure 4: The deployment of CABTO in OmniGibson: Given a task set, CABTO generates a complete and consistent BT system. For a specific task, BT planning is used to generate the solution BT. Then the BT is executed, enabling the robot to successfully achieve the goal.

Experimental Setup

Task Sets. We evaluate the robustness and adaptability of CABTO on a comprehensive suite of seven robotic manipulation task sets, encompassing 21 unique goals (three goals per task set) across three distinct robotic platforms. These scenarios are strategically designed to cover a spectrum of physical and logical challenges: Single-Arm Franka (T1: Cover, T2: Blocks), Dual-Arm Franka (T3: Pour, T4: Handover, T5: Storage), and Mobile Fetch (T6: Tidy Home, T7: Cook Meal).
As summarized in Table 1, these tasks range from fundamental pick-and-place and stacking (T1–T2) to complex bimanual coordination for cooperative transport and exchange (T3–T5), and long-horizon mobile manipulation involving articulated objects and semantic state changes (T6–T7). To quantify solution complexity, Table 1 reports the resulting BT attributes for each task set, including the number of unique action predicates (Acts), condition predicates (Conds), and the total execution steps (Steps).

Environment. Fetch robot experiments were conducted in OmniGibson (Li et al. 2023) for its realistic physics, while Franka tasks were designed in Isaac Sim to enable flexible object configuration (Figures 3 and 4). All experiments are conducted on a single NVIDIA RTX 4090 GPU.

Metrics. We evaluate the completeness of the high-level model using two primary metrics: (1) Average Planning Success Rate (ASR): the mean planning success rate across all individual tasks within a given task set. (2) Complete Planning Success Rate (CSR): the rate at which all tasks within the set are successfully planned simultaneously. We also report the average number of Feedback Cycles (FC).

| Action | End-to-end: OpenVLA | Hierarchical: VoxPoser | Hierarchical: ReKep | Hierarchical: Molmo+cuRobo | Rule-based: APIs | Molmo+cuRobo+APIs (w/o Contexts) | Molmo+cuRobo+APIs (with Contexts) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Pick(obj) | 4/10 | 4/10 | 6/10 | 5/10 | 6/10 | 6/10 | 7/10 |
| Place(obj, loc) | 5/10 | 3/10 | 7/10 | 5/10 | 6/10 | 6/10 | 8/10 |
| Open(container) | 1/10 | 1/10 | 1/10 | 3/10 | 1/10 | 2/10 | 4/10 |
| Close(container) | 2/10 | 2/10 | 3/10 | 4/10 | 2/10 | 3/10 | 5/10 |
| Toggle(switch) | 2/10 | 1/10 | 4/10 | 6/10 | 5/10 | 5/10 | 7/10 |
| Total | 28% | 22% | 42% | 46% | 40% | 44% | 62% |

Table 2: Evaluation results of low-level policy sampling using the VLM for 5 typical action models.

| Action | Defect Type & Description | Textual Baseline | w/o Feedback | with Feedback | Avg. FC |
| --- | --- | --- | --- | --- | --- |
| PutIn(obj, container) | Pre: missing IsOpen(container) due to closed lid | 10% | 40% | 80% | 1.1 |
| Stack(obj_a, obj_b) | Pre: missing Clear(obj_b) due to surface obstruction | 20% | 30% | 70% | 2.1 |
| Lift(box_big, r_1, r_2) | Pre: missing Holding(r_2, box_big) in dual-arm coordination | 10% | 80% | 90% | 0.3 |
| Pick(robot, obj) | Add: unverified InReach(robot, obj) (kinematic constraint) | 20% | 50% | 90% | 0.8 |
| Put(obj, loc) | Del: stale At(obj, loc_old) resulting in location redundancy | 0% | 20% | 40% | 2.4 |
| Total | | 12% | 44% | 74% | 1.3 |

Table 3: Success rate (SR%) of VLM-based cross-level refinement for action models. Results are averaged over 10 trials (N_FC ≤ 3). Pre, Add, and Del denote the action precondition, add effect, and delete effect, respectively.

Evaluation of High-Level Model Proposal

Ablating Planning Contexts. Planning context feedback proved crucial for performance (Table 1). Its inclusion consistently boosted goal success rates and system completeness, most notably for GPT-4o, where completeness jumped from 50% to over 90%. The performance gains were most significant in the complex dual-arm and mobile manipulation tasks, demonstrating that structured, symbolic feedback from a formal planner can empower LLMs to resolve intricate logical challenges.

Comparison of LLMs. As shown in Table 1, while GPT-3.5 and GPT-4o performed comparably without planning context feedback, GPT-4o's superiority became evident with it. Guided by this feedback, GPT-4o achieved over a 90% complete planning success rate, in stark contrast to approximately 60% for GPT-3.5. This underscores GPT-4o's advanced capacity for leveraging contextual feedback in complex reasoning tasks like BT grounding.

Evaluation of Low-Level Policy Sampling

We evaluate the performance of three policy types for low-level policy sampling; details of these policies are given in the Appendix. We select five typical action models to test their performance, as shown in Table 2. The algorithms show different strengths across actions: ReKep and the rule-based methods excel in grasping, while Molmo+cuRobo performs better in Open/Close and Toggle actions, owing to semantics-based keypoint extraction that accurately identifies object handles and hinges. We use GPT-4o as the VLM in this experiment.

Ablating Execution Contexts. Table 2 presents the success rate (SR) of control policies for five typical action models, where the VLM samples the policy type and its hyperparameters based on the execution contexts. The results show the SR without execution contexts and with up to three sampling attempts. It is evident that the VLM can effectively sample low-level policies, and with execution contexts, the SR of the actions improves.

Evaluation of Cross-Level Refinement

Ablating Environment Feedback. Table 3 catalogs action models that exhibited inconsistencies, where the predicted high-level effect diverged from the low-level execution outcome or resulted in an error. Through an iterative feedback process, the VLM demonstrated the potential to successfully correct these high-level representations, underscoring the critical role of direct environmental feedback. However, the efficacy of this approach is currently limited for abstract concepts lacking direct visual correlates, such as the symbolic target in Put(obj, loc).

Deployment. Figure 4 depicts the deployment of our pipeline, where CABTO generates a complete and consistent BT system for the given task set. The robot successfully executes the planned BT actions sequentially for every task.

Conclusion

In this work, we first formalize the BT grounding problem and propose CABTO, a framework that leverages LMs to automatically construct complete and consistent BT systems guided by planning and environmental feedback. The effectiveness of our approach is validated across seven robotic manipulation task sets.
Future work will focus on enhancing LM inference and low-level robotic skills via fine-tuning, and on addressing the transfer to physical systems.

Acknowledgments

This work was supported by the National Science Fund for Distinguished Young Scholars (Grant No. 62525213), the National Natural Science Foundation of China (Grant No. 62572480), and the University Youth Independent Innovation Science Foundation (Grant No. ZK25-11).

References

Arora, A.; Fiorino, H.; Pellier, D.; Métivier, M.; and Pesty, S. 2018. A Review of Learning Planning Action Models. The Knowledge Engineering Review, 33: e20.

Bachor, P.; and Behnke, G. 2024. Learning Planning Domains from Non-Redundant Fully-Observed Traces: Theoretical Foundations and Complexity Analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 20028–20035.

Bai, F.; Li, Y.; Chu, J.; Chou, T.; Zhu, R.; Wen, Y.; Yang, Y.; and Chen, Y. 2025. Retrieval Dexterity: Efficient Object Retrieval in Clutters with Dexterous Hand. arXiv preprint arXiv:2502.18423.

Banerjee, B. 2018. Autonomous Acquisition of Behavior Trees for Robot Control. 3460–3467.

Bonet, B.; Loerincs, G.; and Geffner, H. 1997. A Robust and Fast Action Selection Mechanism for Planning. In AAAI, 714–719.

Cai, Y.; Chen, X.; Cai, Z.; Mao, Y.; Li, M.; Yang, W.; and Wang, J. 2025a. MRBTP: Efficient Multi-Robot Behavior Tree Planning and Collaboration. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, 14548–14557.

Cai, Y.; Chen, X.; Mao, Y.; Li, M.; Yang, S.; Yang, W.; and Wang, J. 2025b. HBTP: Heuristic Behavior Tree Planning with Large Language Model Reasoning. In 2025 IEEE International Conference on Robotics and Automation (ICRA), 13706–13713.

Cai, Z.; Li, M.; Huang, W.; and Yang, W. 2021. BT Expansion: A Sound and Complete Algorithm for Behavior Planning of Intelligent Robots with Behavior Trees. In AAAI, 6058–6065. AAAI Press.
Chen, X.; Cai, Y.; Mao, Y.; Li, M.; Yang, W.; Xu, W.; and Wang, J. 2024. Integrating Intent Understanding and Optimal Behavior Planning for Behavior Tree Generation from Human Instructions. In IJCAI.

Chen, Y.; Wang, C.; Fei-Fei, L.; and Liu, C. K. 2023. Sequential Dexterity: Chaining Dexterous Policies for Long-Horizon Manipulation. arXiv preprint arXiv:2309.00987.

Colledanchise, M.; and Ögren, P. 2018. Behavior Trees in Robotics and AI: An Introduction. CRC Press.

Colledanchise, M.; Parasuraman, R.; and Ögren, P. 2019. Learning of Behavior Trees for Autonomous Agents. IEEE Transactions on Games, 11(2): 183–189.

Deitke, M.; Clark, C.; Lee, S.; Tripathi, R.; Yang, Y.; Park, J. S.; Salehi, M.; Muennighoff, N.; Lo, K.; Soldaini, L.; et al. 2024. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models. arXiv preprint arXiv:2409.17146.

Deitke, M.; Clark, C.; Lee, S.; Tripathi, R.; Yang, Y.; Park, J. S.; Salehi, M.; Muennighoff, N.; Lo, K.; Soldaini, L.; et al. 2025. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models. In Proceedings of the Computer Vision and Pattern Recognition Conference, 91–104.

Fikes, R. E.; and Nilsson, N. J. 1971. STRIPS: A New Approach to the Application of Theorem Proving to Problem Solving. Artificial Intelligence, 2(3-4): 189–208.

French, K.; Wu, S.; Pan, T.; Zhou, Z.; and Jenkins, O. C. 2019. Learning Behavior Trees from Demonstration. In 2019 International Conference on Robotics and Automation (ICRA), 7791–7797. IEEE.

Haresh, S.; Dijkman, D.; Bhattacharyya, A.; and Memisevic, R. 2024. ClevrSkills: Compositional Language and Visual Reasoning in Robotics.

Hoffmann, J.; and Nebel, B. 2001. The FF Planning System: Fast Plan Generation through Heuristic Search. Journal of Artificial Intelligence Research, 14: 253–302.

Huang, W.; Wang, C.; Li, Y.; Zhang, R.; and Fei-Fei, L. 2025. ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation.
In Conference on Robot Learning, 4573–4602. PMLR.

Huang, W.; Wang, C.; Zhang, R.; Li, Y.; Wu, J.; and Fei-Fei, L. 2023. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models. arXiv preprint arXiv:2307.05973.

Kaelbling, L. P.; and Lozano-Pérez, T. 2011. Hierarchical Task and Motion Planning in the Now. In 2011 IEEE International Conference on Robotics and Automation, 1470–1477. IEEE.

Kim, M. J.; Pertsch, K.; Karamcheti, S.; Xiao, T.; Balakrishna, A.; Nair, S.; Rafailov, R.; Foster, E.; Lam, G.; Sanketi, P.; et al. 2024. OpenVLA: An Open-Source Vision-Language-Action Model. arXiv preprint arXiv:2406.09246.

Kumar, N.; Ramos, F.; Fox, D.; and Garrett, C. R. 2024. Open-World Task and Motion Planning via Vision-Language Model Inferred Constraints. arXiv preprint arXiv:2411.08253.

Kumar, V.; Shah, R.; Zhou, G.; Moens, V.; Caggiano, V.; Gupta, A.; and Rajeswaran, A. 2023. RoboHive: A Unified Framework for Robot Learning. In Oh, A.; Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; and Levine, S., eds., Advances in Neural Information Processing Systems, volume 36, 44323–44340. Curran Associates, Inc.

Li, C.; Zhang, R.; Wong, J.; Gokmen, C.; Srivastava, S.; Martín-Martín, R.; Wang, C.; Levine, G.; Lingelbach, M.; Sun, J.; Anvari, M.; Hwang, M.; Sharma, M.; Aydin, A.; Bansal, D.; Hunter, S.; Kim, K.-Y.; Lou, A.; Matthews, C. R.; Villa-Renteria, I.; Tang, J. H.; Tang, C.; Xia, F.; Savarese, S.; Gweon, H.; Liu, K.; Wu, J.; and Fei-Fei, L. 2023. BEHAVIOR-1K: A Benchmark for Embodied AI with 1,000 Everyday Activities and Realistic Simulation. In Liu, K.; Kulic, D.; and Ichnowski, J., eds., Proceedings of The 6th Conference on Robot Learning, volume 205 of Proceedings of Machine Learning Research, 80–93. PMLR.

Liang, Y.; Kumar, N.; Tang, H.; Weller, A.; Tenenbaum, J. B.; Silver, T.; Henriques, J. F.; and Ellis, K. 2025. VisualPredicator: Learning Abstract World Models with Neuro-Symbolic Predicates for Robot Planning.
In The Thirteenth International Conference on Learning Representations.

Lim, C.-U.; Baumgarten, R.; and Colton, S. 2010. Evolving Behaviour Trees for the Commercial Game DEFCON. In Applications of Evolutionary Computation: EvoApplications 2010: EvoCOMPLEX, EvoGAMES, EvoIASP, EvoINTELLIGENCE, EvoNUM, and EvoSTOC, Istanbul, Turkey, April 7-9, 2010, Proceedings, Part I, 100–110. Springer.

Liu, B.; Jiang, Y.; Zhang, X.; Liu, Q.; Zhang, S.; Biswas, J.; and Stone, P. 2023. LLM+P: Empowering Large Language Models with Optimal Planning Proficiency. arXiv.

Mahdavi, S.; Aoki, R.; Tang, K.; and Cao, Y. 2024. Leveraging Environment Interaction for Automated PDDL Generation and Planning with Large Language Models. arXiv preprint arXiv:2407.12979.

Mordoch, A.; Scala, E.; Stern, R.; and Juba, B. 2024. Safe Learning of PDDL Domains with Conditional Effects. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 34, 387–395.

Mu, Y.; Chen, J.; Zhang, Q.; Chen, S.; Yu, Q.; Ge, C.; Chen, R.; Liang, Z.; Hu, M.; Tao, C.; et al. 2024. RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis. arXiv preprint arXiv:2402.16117.

Neupane, A.; and Goodrich, M. 2019. Learning Swarm Behaviors Using Grammatical Evolution and Behavior Trees. In IJCAI, 513–520. IJCAI Organization.

Neupane, A.; and Goodrich, M. A. 2023. Designing Behavior Trees from Goal-Oriented LTLf Formulas. arXiv preprint arXiv:2307.06399.

Neupane, A.; Goodrich, M. A.; and Mercer, E. G. 2018. GEESE: Grammatical Evolution Algorithm for Evolution of Swarm Behaviors. 999–1006.

Newton, M. A.; and Levine, J. 2010. Implicit Learning of Compiled Macro-Actions for Planning. In ECAI 2010, 323–328. IOS Press.

Ögren, P.; and Sprague, C. I. 2022. Behavior Trees in Robot Control Systems. Annual Review of Control, Robotics, and Autonomous Systems, 5: 81–107.

Pan, L.; Albalak, A.; Wang, X.; and Wang, W. Y. 2023.
Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning. arXiv preprint arXiv:2305.12295.

Pereira, R. d. P.; and Engel, P. M. 2015. A Framework for Constrained and Adaptive Behavior-Based Agents. arXiv preprint arXiv:1506.02312.

Rodrigues, C.; Gérard, P.; Rouveirol, C.; and Soldano, H. 2012. Active Learning of Relational Action Models. In Inductive Logic Programming: 21st International Conference, ILP 2011, Windsor Great Park, UK, July 31–August 3, 2011, Revised Selected Papers 21, 302–316. Springer.

Scheide, E.; Best, G.; and Hollinger, G. A. 2021. Behavior Tree Learning for Robotic Task Planning through Monte Carlo DAG Search over a Formal Grammar. 4837–4843. Xi'an, China: IEEE. ISBN 978-1-72819-077-8.

Sundaralingam, B.; Hari, S. K. S.; Fishman, A.; Garrett, C.; Van Wyk, K.; Blukis, V.; Millane, A.; Oleynikova, H.; Handa, A.; Ramos, F.; et al. 2023. cuRobo: Parallelized Collision-Free Robot Motion Generation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), 8112–8119. IEEE.

Tadewos, T. G.; Newaz, A. A. R.; and Karimoddini, A. 2022. Specification-Guided Behavior Tree Synthesis and Execution for Coordination of Autonomous Systems. Expert Systems with Applications, 201: 117022.

Thomason, W.; Kingston, Z.; and Kavraki, L. E. 2024. Motions in Microseconds via Vectorized Sampling-Based Planning. In 2024 IEEE International Conference on Robotics and Automation (ICRA), 8749–8756.

Valmeekam, K.; Marquez, M.; Olmo, A.; Sreedharan, S.; and Kambhampati, S. 2023. PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change. In Oh, A.; Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; and Levine, S., eds., Advances in Neural Information Processing Systems, volume 36, 38975–38987. Curran Associates, Inc.

Yang, Z.; Garrett, C.; Fox, D.; Lozano-Pérez, T.; and Kaelbling, L. P. 2024.
Guiding Long-Horizon Task and Motion Planning with Vision Language Models. arXiv preprint arXiv:2410.02193.

Zare, M.; Kebria, P. M.; Khosravi, A.; and Nahavandi, S. 2024. A Survey of Imitation Learning: Algorithms, Recent Developments, and Challenges. IEEE Transactions on Cybernetics.

Zhang, K.; Ren, P.; Lin, B.; Lin, J.; Ma, S.; Xu, H.; and Liang, X. 2024. PIVOT-R: Primitive-Driven Waypoint-Aware World Model for Robotic Manipulation. arXiv preprint arXiv:2410.10394.

Zhen, H.; Qiu, X.; Chen, P.; Yang, J.; Yan, X.; Du, Y.; Hong, Y.; and Gan, C. 2024. 3D-VLA: A 3D Vision-Language-Action Generative World Model. arXiv preprint arXiv:2403.09631.

Zhong, Y.; Bai, F.; Cai, S.; Huang, X.; Chen, Z.; Zhang, X.; Wang, Y.; Guo, S.; Guan, T.; Lui, K. N.; et al. 2025. A Survey on Vision-Language-Action Models: An Action Tokenization Perspective. arXiv preprint arXiv:2507.01925.

Zhou, A.; Yan, K.; Shlapentokh-Rothman, M.; Wang, H.; and Wang, Y.-X. 2024. Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models. In Salakhutdinov, R.; Kolter, Z.; Heller, K.; Weller, A.; Oliver, N.; Scarlett, J.; and Berkenkamp, F., eds., Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, 62138–62160. PMLR.

Zhuo, H. H.; and Yang, Q. 2014. Action-Model Acquisition for Planning via Transfer Learning. Artificial Intelligence, 212: 80–103.