← Back to papers

Paper deep dive

AtomicVLA: Unlocking the Potential of Atomic Skill Learning in Robots

Likui Zhang, Tao Tang, Zhihao Zhan, Xiuwei Chen, Zisheng Chen, Jianhua Han, Jiangtong Zhu, Pei Xu, Hang Xu, Hefeng Wu, Liang Lin, Xiaodan Liang

Year: 2026 · Venue: arXiv preprint · Area: cs.RO · Type: Preprint · Embeddings: 76

Abstract

Recent advances in Visual-Language-Action (VLA) models have shown promising potential for robotic manipulation tasks. However, real-world robotic tasks often involve long-horizon, multi-step problem-solving and require generalization for continual skill acquisition, extending beyond single actions or skills. These challenges present significant barriers for existing VLA models, which use monolithic action decoders trained on aggregated data, resulting in poor scalability. To address these challenges, we propose AtomicVLA, a unified planning-and-execution framework that jointly generates task-level plans, atomic skill abstractions, and fine-grained actions. AtomicVLA constructs a scalable atomic skill library through a Skill-Guided Mixture-of-Experts (SG-MoE), where each expert specializes in mastering generic yet precise atomic skills. Furthermore, we introduce a flexible routing encoder that automatically assigns dedicated atomic experts to new skills, enabling continual learning. We validate our approach through extensive experiments. In simulation, AtomicVLA outperforms π0 by 2.4% on LIBERO, 10% on LIBERO-LONG, and outperforms π0 and π0.5 by 0.22 and 0.25 in average task length on CALVIN. Additionally, our AtomicVLA consistently surpasses baselines by 18.3% and 21% in real-world long-horizon tasks and continual learning. These results highlight the effectiveness of atomic skill abstraction and dynamic expert composition for long-horizon and lifelong robotic tasks. The project page is linked from the source listing.

Tags

ai-safety (imported, 100%) · csro (suggested, 92%) · preprint (suggested, 88%)

Links

PDF not stored locally. Use the link above to view on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/13/2026, 12:35:48 AM

Summary

AtomicVLA is a unified robotic framework that integrates task planning and action execution using a Skill-Guided Mixture-of-Experts (SG-MoE) architecture. It addresses scalability and continual learning challenges in long-horizon robotic tasks by decomposing complex instructions into atomic skill abstractions, which are then executed by specialized experts, preventing catastrophic forgetting and improving performance in both simulation and real-world settings.

Entities (5)

AtomicVLA · framework · 100%
CALVIN · benchmark · 95%
LIBERO · benchmark · 95%
SG-MoE · architecture · 95%
InternVideo2.5 · model · 90%

Relation Signals (4)

AtomicVLA utilizes SG-MoE

confidence 100% · AtomicVLA constructs a scalable atomic skill library through a Skill-Guided Mixture-of-Experts (SG-MoE)

AtomicVLA evaluated on LIBERO

confidence 95% · In simulation, AtomicVLA outperforms π0 by 2.4% on LIBERO

AtomicVLA evaluated on CALVIN

confidence 95% · outperforms π0 and π0.5 by 0.22 and 0.25 in average task length on CALVIN.

AtomicVLA integrates InternVideo2.5

confidence 90% · we employ the InternVideo2.5 model [52] to interpret the corresponding video clips

Cypher Suggestions (2)

Find all benchmarks used to evaluate the AtomicVLA framework. · confidence 95% · unvalidated

MATCH (f:Framework {name: 'AtomicVLA'})-[:EVALUATED_ON]->(b:Benchmark) RETURN b.name

Identify the architecture components utilized by AtomicVLA. · confidence 95% · unvalidated

MATCH (f:Framework {name: 'AtomicVLA'})-[:UTILIZES]->(a:Architecture) RETURN a.name

Full Text

75,557 characters extracted from source content.


AtomicVLA: Unlocking the Potential of Atomic Skill Learning in Robots

Likui Zhang¹, Tao Tang¹, Zhihao Zhan¹, Xiuwei Chen¹, Zisheng Chen¹, Jianhua Han³, Jiangtong Zhu³, Pei Xu³, Hang Xu³, Hefeng Wu¹, Liang Lin¹†, Xiaodan Liang¹,²†
¹Sun Yat-sen University  ²Peng Cheng Laboratory  ³Yinwang Intelligent Technology Co. Ltd.
zhanglk9@mail2.sysu.edu.cn

Abstract

Recent advances in Visual-Language-Action (VLA) models have shown promising potential for robotic manipulation tasks. However, real-world robotic tasks often involve long-horizon, multi-step problem-solving and require generalization for continual skill acquisition, extending beyond single actions or skills. These challenges present significant barriers for existing VLA models, which use monolithic action decoders trained on aggregated data, resulting in poor scalability. To address these challenges, we propose AtomicVLA, a unified planning-and-execution framework that jointly generates task-level plans, atomic skill abstractions, and fine-grained actions. AtomicVLA constructs a scalable atomic skill library through a Skill-Guided Mixture-of-Experts (SG-MoE), where each expert specializes in mastering generic yet precise atomic skills. Furthermore, we introduce a flexible routing encoder that automatically assigns dedicated atomic experts to new skills, enabling continual learning. We validate our approach through extensive experiments. In simulation, AtomicVLA outperforms π0 by 2.4% on LIBERO, 10% on LIBERO-LONG, and outperforms π0 and π0.5 by 0.22 and 0.25 in average task length on CALVIN. Additionally, our AtomicVLA consistently surpasses baselines by 18.3% and 21% in real-world long-horizon tasks and continual learning. These results highlight the effectiveness of atomic skill abstraction and dynamic expert composition for long-horizon and lifelong robotic tasks. The project page is here.
1. Introduction

Building on powerful Vision-Language Models [2, 6, 9, 21, 38, 50], Vision-Language-Action (VLA) models [3, 4, 22, 30] unify visual perception, language understanding, and action generation into a single framework, achieving significant advances in robotic manipulation tasks. Despite this progress, current VLA models still face challenges in real-world deployments for complex long-horizon tasks and the continual acquisition of new skills.

(† Co-corresponding author)

[Figure 1. Overview of AtomicVLA. Bar charts compare π0, π0.5, and AtomicVLA variants on LIBERO-LONG, CALVIN-5 Tasks, and real-world tasks. Unlike previous VLA models with a single action head, which suffer from limited scalability and severe interference among mixed skills, AtomicVLA employs a SG-MoE architecture to build a scalable skill expert library. By unifying task planning and action execution within this framework, it achieves strong performance on long-horizon and continual learning tasks in both simulation and real-world settings.]

To overcome these challenges, a robotic model must support both high-level reasoning and fine-grained action generation, while enabling scalable continual learning. To support high-level reasoning and task planning, some existing approaches employ a two-stage architecture [1, 13, 20, 35, 41, 46], where a pretrained vision-language model (VLM) serves as a high-level planner to generate subtask instructions, while a separate VLA-based controller translates these instructions into executable actions. However, recent studies [26, 59, 60] suggest that modular decoupling leads to a lack of mutual awareness between the planner and controller, causing suboptimal task coordination.
Moreover, in real-world applications, this can result in the generation of outdated or irrelevant instructions due to system delays. In addition, most existing VLA models rely on a single action-decoding module, limiting their scalability. Incrementally learning new skills requires fine-tuning existing models, which demands substantial computational resources and large datasets. Given the current scarcity of robot data, fully leveraging well-pretrained VLA model weights is essential during the scaling process. Moreover, when learning new skills incrementally, these models often interfere with previously acquired skills, leading to catastrophic forgetting and thereby hindering lifelong learning capabilities.

To this end, we propose AtomicVLA, as illustrated in Fig. 1, an end-to-end framework that unifies task planning and action execution by adaptively generating either natural language instructions or latent actions. AtomicVLA first infers the current execution state from the input observations and dynamically activates either its thinking module or its acting module. At task initialization or during transitions between sub-skills, the model triggers thinking to produce a task-chain plan based on the current state and to output atomic skill abstractions. In the acting execution phase, it dynamically selects the corresponding skill-specific expert based on the most recent skill abstraction to generate precise robot control signals. Furthermore, to endow AtomicVLA with continual learning capability, we introduce a Skill-Guided Mixture-of-Experts (SG-MoE) architecture that constructs a scalable library of atomic skills. This library comprises a shared expert and multiple dedicated skill experts, each focusing on mastering a specific atomic skill.
Through a well-designed skill encoding mechanism and an extensible routing encoder, each atomic skill abstraction is mapped to a fixed embedding vector, allowing the routing module to rapidly adapt to new skills even as the skill library grows. When a new skill is introduced, only the corresponding expert and associated routing parameters need to be trained, leaving existing experts unchanged. This effectively prevents catastrophic forgetting, ensuring efficient and stable lifelong skill growth.

We conducted extensive experiments to validate the effectiveness of AtomicVLA both on simulation platforms and on real-world robots. On the LIBERO [28] benchmark, AtomicVLA achieved an average performance improvement of 2.4% over baseline models, with a notable 10% improvement on LIBERO-LONG. On the CALVIN [33] benchmark, specifically on the ABC-D training split, our method increased the average successful execution length by 0.22 and 0.25. Furthermore, we performed long-horizon task execution and continual learning experiments on a real-world Franka robot, where we observed performance improvements of 18.3% and 21%, respectively. These results further validate the potential of AtomicVLA's atomic skill dynamic combination mechanism in supporting long-term task completion and lifelong skill accumulation.

Overall, our contributions are as follows:
• We introduce AtomicVLA, an end-to-end framework that unifies task planning and action execution for long-horizon tasks and continual skill expansion.
• We propose a Skill-Guided Mixture-of-Experts (SG-MoE) architecture and a scalable skill router for building a library of atomic skills.
• We validate the effectiveness of AtomicVLA through extensive experiments conducted both in simulated environments and on real-world robots.

2. Related Work
2.1. Vision-Language-Action Models

Vision-Language-Action (VLA) models have emerged as a dominant paradigm in general-purpose robotic learning by leveraging the rich semantic priors and strong cross-modal generalization of large-scale Vision-Language Models (VLMs) pretrained on internet-scale data. Recent works [3, 4, 17, 22, 24, 36, 55, 66] fine-tune VLMs [2, 9, 21, 50] on diverse robotic datasets to directly map visual and linguistic inputs to motor actions, demonstrating impressive generalization to novel environments and tasks.

However, constrained to some extent by the VLM's inherent hierarchical planning capability, most current VLAs exhibit limitations in structured task decomposition and long-horizon task planning. Several approaches introduce external high-level planners [10, 14, 41, 46, 62, 67] that decompose long-horizon tasks into subgoals, which are then executed by a separate low-level policy. However, such modular approaches often fail to unify action with vision and language in a shared latent space, resulting in misaligned decisions whose errors compound. To address this problem, recent works [5, 11, 26, 29, 59, 60] propose integrated frameworks that jointly perform hierarchical reasoning and action generation within a unified model. Our work aligns with this direction: we adopt a Think-Act unified architecture, where a VLM simultaneously performs high-level task planning and atomic action abstraction, thereby directly guiding a specialized action expert to produce executable, temporally coherent action sequences.

2.2. Multimodal Mixture-of-Experts

The Sparse Mixture-of-Experts (MoE) architecture has become a mainstream approach for scaling large language models (LLMs). By replacing the standard feed-forward layers with expert modules [8, 19], MoE improves task specialization and representation capability through conditional computation, while maintaining inference efficiency.
In the field of autonomous driving, models such as [57, 61] design specialized MoE architectures tailored to multi-view observations and action skills, improving both trajectory prediction accuracy and inference efficiency. Similarly, in robotics, some works [15, 39, 51, 58, 64, 68] employ MoE to tackle task heterogeneity and long-tailed data distributions. While these approaches demonstrate the utility of MoE for representation learning, they largely treat experts as interchangeable components within fixed architectural slots, without explicitly modeling structured, composable behaviors. In contrast, we reinterpret the MoE paradigm through the lens of skill modularity: we construct a dynamically scalable atomic skill library, where each expert corresponds to a semantically meaningful, reusable action primitive. Integrated with a pre-trained VLM that encodes atomic action abstractions, our approach enables a universal VLA model capable of both fine-grained skill decomposition and coherent long-horizon task composition.

[Figure 2. (a) AtomicVLA Pipeline. AtomicVLA is a framework that unifies task planning and action execution. The VLM adaptively predicts the atomic skill abstraction and the latent action; the action decoder in the SG-MoE architecture receives both the latent action and the newly inferred atomic skill abstraction and generates fine-grained motor actions. (b) Skill-Guided Mixture of Experts. SG-MoE includes a skill router, a shared expert, and multiple atomic-skill experts. The router selects the top skill expert based on the atomic skill, and the action token is processed by both the activated skill expert and the shared expert. (c) Continual Learning with Skill Expansion. New skills are added by training only the new expert and extending the router. (d) Task Planning Embodied Data Generation. High-quality embodied reasoning data are generated using principal-axis analysis with the InternVideo2.5 [52] model.]

2.3. Continual Learning with Skill Abstractions

To adapt to new tasks that emerge in dynamic environments, continual learning has become essential for developing general-purpose intelligent agents. Prior studies [7, 12, 32, 42, 43, 49] have leveraged unsupervised learning and hierarchical imitation learning to enable autonomous skill discovery from continuous data, which allows an agent to expand its skill set over time. Furthermore, to learn from streaming data without suffering from catastrophic forgetting, several approaches [23, 34, 40, 63] introduce latent action representations that abstract different skills and preserve previously acquired capabilities without relying on experience replay. Current VLA models primarily focus on learning generalizable skills from broad pretraining, while dedicated investigations into continual learning remain limited. Although many VLAs [53, 54] have explored various motion decoding methods, such as diffusion models [30], flow matching [27], and discrete encoding [34, 48], they all use a single decoder. Their core focus is on the model's accuracy on the current task rather than its scalability.
We construct an expandable library of skill experts by using atomic units of robotic behavior together with a specialized routing module, which enhances the scalability of such models in skill acquisition.

3. Method

3.1. Overview

As illustrated in Fig. 2, AtomicVLA integrates the thinking modality for task planning and the acting modality for action execution within a unified framework (Sec. 3.2). Building upon this architecture, we develop a skill-guided library of atomic action experts (Sec. 3.3) based on π0 and introduce an extensible skill router that facilitates continual learning of new skills (Sec. 3.4) in real-world environments. To further ensure the generation of high-quality task planning data, we introduce an embodied data generation pipeline (Sec. 3.5) grounded in principal-axis analysis, which provides structured and consistent data to support effective task planning and execution.

Algorithm 1: Inference Pipeline of AtomicVLA
Require: VLA model π_θ, language instruction ℓ
 1: t ← 0; O_t^{1:n} ← initial image; Atomic ← none
 2: while task not done do
 3:   M ∼ π_θ.Predict(· | O_t^{1:n}, ℓ)
 4:   if M = [think] then
 5:     [C_{0−k}, C_t, σ] ∼ π_θ.Thinking(· | O_t^{1:n}, ℓ)
 6:     Atomic ← σ
 7:   else if M = [act] then
 8:     w_k ∼ Router(embed(Atomic))
 9:     A_t ∼ π_θ.Acting(· | O_t^{1:n}, ℓ, s_t, w_k)
10:     Execute A_t
11:   end if
12:   t ← t + 1
13: end while

3.2. Unified Task Planning and Action Execution

Problem formulation. The central problem addressed in this section is to design a robot policy that simultaneously possesses task planning (thinking) and action execution (acting) capabilities, and can autonomously decide its output modality based on the current state. Specifically, in thinking mode, the policy takes multi-camera observations O_t^{1:n} and a language instruction ℓ as input and outputs a high-level task plan [C_{0−k}, C_t, σ] in textual form.
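For illustration, the think/act alternation of Algorithm 1 can be sketched as a small control loop. This is a minimal sketch only: `DummyPolicy` and every method on it (`predict_mode`, `thinking`, `acting`, `route`, `execute`, `task_done`) are hypothetical stand-ins for the real VLM and action decoder, not the authors' code.

```python
def run_episode(policy, observation, instruction, max_steps=20):
    """Alternate between [think] and [act] until the task is done (sketch)."""
    atomic = None   # most recent atomic skill abstraction (sigma)
    trace = []      # record of emitted modes, for inspection
    for _ in range(max_steps):
        mode = policy.predict_mode(observation, instruction)
        if mode == "[think]":
            # Thinking: produce the task chain, progress marker, atomic skill.
            chain, step, sigma = policy.thinking(observation, instruction)
            atomic = sigma
            trace.append(("think", sigma))
        elif mode == "[act]":
            # Acting: route on the last atomic skill, emit an action chunk.
            weight = policy.route(atomic)
            action = policy.acting(observation, instruction, weight)
            observation = policy.execute(action)
            trace.append(("act", atomic))
        if policy.task_done(observation):
            break
    return trace


class DummyPolicy:
    """Scripted stand-in: think at each sub-skill boundary, then act once."""
    def __init__(self, plan):
        self.plan = list(plan)   # e.g. ["turn", "pick", "place"]
        self.idx = -1            # index of the current sub-skill
        self.need_think = True
    def predict_mode(self, obs, instr):
        return "[think]" if self.need_think else "[act]"
    def thinking(self, obs, instr):
        self.idx += 1
        self.need_think = False
        return self.plan, self.idx, self.plan[self.idx]
    def route(self, atomic):
        return 1.0               # placeholder routing weight
    def acting(self, obs, instr, weight):
        self.need_think = True   # next step is a sub-skill transition
        return ("chunk", self.plan[self.idx])
    def execute(self, action):
        return action            # pretend the world advanced
    def task_done(self, obs):
        return self.idx == len(self.plan) - 1 and self.need_think
```

Running `run_episode(DummyPolicy(["turn", "pick", "place"]), None, "make coffee")` interleaves one think and one act step per sub-skill, mirroring the "thinking only at key time steps" behavior described above.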
In contrast, in acting mode, the policy generates a concrete action command conditioned on the robot's proprioceptive state s_t and the most recent planning output σ.

Adaptive thinking and acting. To enable seamless switching between the two output modalities, we introduce two special output tokens: [think] and [act]. As illustrated in Algorithm 1, given the current visual observations O_t^{1:n} and task instruction ℓ, the model first predicts either the [think] or the [act] identifier. When the model outputs [think], it enters the thinking mode, in which it generates a task chain C_{0−k} that outlines the high-level plan, tracks the current execution progress C_t, and specifies the atomic skill abstraction σ to be performed. Typically, this mode is activated only at key time steps, such as task initiation or the transition between sub-skills. Conversely, when [act] is predicted, the model switches to acting mode, where it produces a low-level action chunk A_t based on the atomic skill abstraction σ obtained in the most recent [think] step and the current proprioceptive state.

3.3. Skill-guided Mixture-of-Experts Architecture

Atomic skill abstraction embedding. To enhance the representational distinctiveness among atomic skills, we adopt an encoding strategy inspired by noise scheduling in diffusion-based denoising models. Specifically, each atomic skill abstraction is mapped to a scalar noise level σ ∈ [0, 100], which is then embedded into a high-dimensional vector. This continuous and structured embedding space facilitates semantic separation across skills and enables robust routing to the corresponding skill-specific experts:

Z_σ = E(norm(log σ)),   (1)

where σ denotes the assigned noise level for the skill, and E(·) is an embedding function that maps the normalized scalar to a high-dimensional embedding vector Z_σ.

Skill-Guided dynamic routing.
We build upon the π0 vision-language-action (VLA) foundation model, a generalist robotic policy pretrained on large-scale multimodal data, and extend it with an atomic-action-abstraction-guided Mixture-of-Experts (MoE) architecture to construct a scalable atomic skill library. As illustrated in Fig. 2(b), our skill library consists of three key components: (1) a skill router, (2) a shared expert that maintains the pre-trained action generation capabilities of π0, and (3) multiple atomic skill experts, each specialized in executing a distinct atomic skill.

To maintain the specialized skills of individual atomic experts, we first derive an atomic action abstraction from the high-level task instruction and environmental observation via the thinking pipeline. This abstraction is deterministically mapped to a fixed high-dimensional embedding Z_σ ∈ R^d, which serves as the conditioning signal for the skill router. The router computes a probability distribution over experts as:

w_k = Router(Z_σ),  k ∈ {1, 2, ..., K},   (2)

where K denotes the number of atomic skill experts. We adopt a sparse activation strategy: only the top-scoring expert is selected for action generation. Let k be the index of the activated expert, and let its raw score be w_k. The final action chunk A_t is computed as a weighted combination of the shared expert and the selected atomic expert:

F_out = (1 − w_k) · F_share(x_t) + w_k · F_k(x_t),   (3)

where x_t denotes the current multimodal input [O_t^{1:n}, ℓ, s_t]. This architecture enables the system to retain the strong generalization capability of π0 while achieving high-fidelity execution of specific skills through dedicated experts.

3.4. Continual Learning with Skill Expansion

In real-world deployments, robots inevitably encounter new tasks that require atomic skills not previously observed during training.
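Read together, Eqs. (1)-(3) form an embed-route-blend pipeline: map the skill's noise level σ to a vector, score the experts, and mix the top expert with the shared one. The following is a minimal numerical sketch under stated assumptions: the sinusoidal form of E, the linear softmax router, and the toy experts are all invented for illustration, not the paper's parameters.

```python
import math
import random

random.seed(0)

D, K = 8, 3  # embedding dimension and expert count (illustrative sizes)

def embed_sigma(sigma, dim=D):
    """Eq. (1): Z_sigma = E(norm(log sigma)). A sinusoidal map stands in for
    E here; the paper does not specify its exact form. Assumes sigma > 0."""
    x = math.log(sigma) / math.log(100.0)  # normalize log(sigma) into [0, 1]
    return [math.sin((i + 1) * math.pi * x) for i in range(dim)]

def route(z, router):
    """Eq. (2): score experts from the skill embedding (softmax over a linear
    router) and keep only the top-scoring expert (sparse activation)."""
    logits = [sum(w * zi for w, zi in zip(row, z)) for row in router]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    k = probs.index(max(probs))
    return k, probs[k]

def sgmoe_forward(x, sigma, router, experts, shared):
    """Eq. (3): F_out = (1 - w_k) * F_share(x) + w_k * F_k(x)."""
    k, w_k = route(embed_sigma(sigma), router)
    return [(1 - w_k) * s + w_k * e for s, e in zip(shared(x), experts[k](x))]

# Toy instantiation: a random linear router and experts that shift the input
# by a constant, so the blend in Eq. (3) is easy to verify by hand.
router = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]
experts = [lambda x, c=c: [v + c for v in x] for c in range(K)]
shared = lambda x: list(x)
```

With these toy experts the output is elementwise `x_i + w_k * k`, which makes the shared/expert interpolation of Eq. (3) directly visible.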
Directly incorporating these novel skills into the existing skill library and retraining the entire model often leads to catastrophic forgetting, significantly impairing the performance of previously learned skills. AtomicVLA adopts a modular skill-expert mechanism, which enables continual scalability of the skill library. Specifically, as introduced in Sec. 3.3, each atomic skill is mapped to a fixed high-dimensional embedding vector Z_σ, providing an explicit semantic abstraction of the skill.

[Figure 3. Inference Example of AtomicVLA. We visualize two tasks from LIBERO-LONG ("Turn on the stove and put the moka pot on it"; "Put the yellow mug in the microwave and close it"). For each task, the top row shows the task progression, and the bottom row shows AtomicVLA's inferred outputs. Gray blocks denote Thinking, while colored blocks indicate Acting, with colors corresponding to the activated skill experts. The left row shows the initial task state (top) and the skill-expert activation during inference (bottom).]

This
design inherently enables incremental learning in lifelong settings: when a new atomic skill is introduced, it is sufficient to add a corresponding expert module to the existing architecture and extend the routing network.

To ensure smooth integration, the expanded router is initialized by copying weights from the original router, while the new routing branch is initialized with small random values. This initialization strategy allows the model to adapt to the enlarged skill set with minimal fine-tuning, while preserving the performance of previously acquired skills. Consequently, AtomicVLA achieves efficient and interference-free expansion of its atomic skill library, a crucial requirement for scalable lifelong robotic learning.

3.5. Task Planning Embodied Data Generation

To obtain accurate and reliable annotations of atomic actions, we propose a trajectory-based atomic decomposition method grounded in principal-axis analysis. Traditional approaches often rely on Vision-Language Models for video understanding or on optical-flow-based motion features to segment action sequences. However, these methods are prone to ambiguity and noise and typically require extensive manual post-processing to correct and refine the results.

In contrast, our method analyzes the key kinematic dimensions of the end-effector trajectory, including translational displacements (Δx, Δy, Δz), rotational changes (Δroll, Δpitch, Δyaw), and binary gripper states, to achieve coarse but semantically meaningful segmentation of atomic actions. Specifically, for each short motion chunk, we identify the dominant mode of motion by comparing the magnitudes of the translational and rotational components. Concurrently, gripper state transitions are tracked to infer action semantics and execution progress.
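The dominant-motion comparison plus gripper tracking just described can be sketched as a small labeling rule. The thresholds, feature names, and the "move"/"hold" fallback labels below are illustrative guesses, not the paper's calibrated values.

```python
def label_chunk(dz, trans_mag, rot_mag, gripper_closed, gripper_event,
                trans_eps=0.02, rot_eps=0.1):
    """Heuristic atomic-action label for one short motion chunk (sketch).

    dz: net change in end-effector height over the chunk;
    trans_mag / rot_mag: translational (m) and rotational (rad) magnitudes;
    gripper_event: "close", "open", or None within the chunk.
    Thresholds are made-up illustrative values.
    """
    if dz < 0 and gripper_event == "close":
        return "pick"        # descending while the gripper closes
    if gripper_event == "open":
        return "place"       # releasing the object
    if rot_mag > rot_eps and trans_mag < trans_eps and gripper_closed:
        return "turn"        # rotation-dominant motion, gripper closed
    if trans_mag >= trans_eps:
        return "move"        # translation-dominant transit
    return "hold"            # no significant motion

def segment(chunks):
    """Label each chunk and merge consecutive duplicates into segments."""
    labels = [label_chunk(**c) for c in chunks]
    merged = []
    for lab in labels:
        if not merged or merged[-1] != lab:
            merged.append(lab)
    return merged
```

A refinement pass (the paper uses InternVideo2.5 on the corresponding clips) would then correct and enrich these coarse labels.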
For instance, a continuous decrease in the z-coordinate combined with a gripper closing event indicates a "pick" action, whereas limited translational movement accompanied by significant rotation with a closed gripper is classified as a "turn" operation. This physics-informed decomposition produces temporally precise and semantically interpretable boundaries for atomic actions, substantially reducing the reliance on manual refinement.

Based on the output of principal-axis analysis, we decompose a full task trajectory into a temporally ordered sequence of atomic action segments. To refine and validate the semantic labels of these segments, we employ the InternVideo2.5 model [52] to interpret the corresponding video clips, enabling automatic correction and enrichment of the initial atomic action annotations. By aligning these refined labels with the full trajectory, we construct a structured reasoning chain comprising the sequence of executed atomic actions and the associated high-level plan for subsequent steps. This integrated representation not only improves the fidelity of atomic action annotation but also provides interpretable, step-by-step execution guidance that supports robust long-horizon task planning and decision-making.

4. Experiments

4.1. Experiments Setup

Benchmarks. We evaluate AtomicVLA and AtomicVLA* on two widely adopted robotic manipulation benchmarks: LIBERO [28] and CALVIN [33]. For the LIBERO benchmark, we assess model performance across all four task suites. To further examine the model's capability in long-horizon planning and compositional generalization, we perform additional experiments on the CALVIN benchmark using the ABC-D split.

Table 1. Comparison of Different Methods on the LIBERO Benchmark (%).

| Method | Spatial | Object | Goal | Long | Avg. |
|---|---|---|---|---|---|
| Octo [47] | 78.9 | 85.7 | 84.6 | 51.1 | 75.1 |
| OpenVLA [22] | 84.9 | 88.4 | 79.2 | 53.7 | 76.5 |
| SpatialVLA [37] | 88.2 | 89.9 | 78.6 | 55.5 | 78.1 |
| CoT-VLA [65] | 87.5 | 91.6 | 87.6 | 69.0 | 81.1 |
| π0 [3] | 96.4 | 98.8 | 95.8 | 85.2 | 94.2 |
| π0.5 [17] | 98.8 | 98.2 | 98.0 | 92.4 | 96.9 |
| AtomicVLA (Ours) | 96.8 | 98.0 | 96.4 | 95.2 | 96.6 |
| AtomicVLA* (Ours) | 98.8 | 98.8 | 97.2 | 96.2 | 97.8 |

Table 2. Long-horizon Robotic Manipulation Evaluation on the CALVIN Benchmark (%). Columns 1-5 give the success rate of completing that many tasks in a row.

| Method | Task | 1 | 2 | 3 | 4 | 5 | Avg. Len ↑ |
|---|---|---|---|---|---|---|---|
| π0 [3] | ABC→D | 94.3 | 87.0 | 77.9 | 68.5 | 59.4 | 3.87 |
| π0.5 [17] | ABC→D | 91.9 | 84.6 | 79.4 | 75.5 | 71.0 | 4.02 |
| AtomicVLA (Ours) | ABC→D | 95.0 | 87.8 | 81.9 | 75.0 | 69.1 | 4.09 |
| AtomicVLA* (Ours) | ABC→D | 94.1 | 88.7 | 85.2 | 81.7 | 77.6 | 4.27 |

Training setup. We build AtomicVLA and AtomicVLA* upon the pretrained π0 and π0.5 foundation models, respectively. The models were trained using robot trajectory data formatted according to the LeRobot standard. We use 5 skill experts for both the LIBERO benchmark suite and the real-world robot experiments. For the CALVIN benchmark, we employ 8 skill experts to cover its broader task vocabulary. Further implementation details are provided in the Appendix.

Real-world robot. We conduct real-world experiments using a Franka robotic arm, covering three long-horizon tasks and five different types of short tasks. For each short-horizon task, we collect 50 trajectories, while each long-horizon task contains 100 trajectories, resulting in a total of 550 real-world demonstration trajectories. The five short tasks cover different categories of manipulation actions, including Grasp block, Stack blocks, Close microwave, Press button, and Open drawer. The long-horizon tasks include:
• Objects in plate: place all blocks on the table into a green plate.
• Object into drawer: open the top drawer and place the block inside.
• Object into microwave: place the plate into the microwave and close the door.

[Figure 4. Error Recovery Capability Demonstration. When encountering a skill execution failure, AtomicVLA automatically assesses the progress and re-executes the current skill.]

4.2. Results on Simulation

Results on LIBERO. As shown in Tab. 1, AtomicVLA achieves an average success rate of 96.6% across the four
suites, outperforming the strong baseline by 2.4%. Notably, on the most challenging LIBERO-LONG suite, AtomicVLA attains a success rate of 95.2%, representing a 10% improvement over π0. Furthermore, AtomicVLA* demonstrates even stronger performance, reaching an average success rate of 97.8% overall and 96.2% on LIBERO-LONG. This superior performance can be attributed to the core mechanism of AtomicVLA, which explicitly decomposes long-horizon tasks into a sequence of atomic skill abstractions and dynamically activates the corresponding skill experts. The "decompose-plan-compose" paradigm naturally aligns with the structure of multi-stage robotic tasks. As illustrated in Fig. 3, at the beginning of each atomic subtask, AtomicVLA generates a precise skill-level action abstraction to guide the selection of the appropriate expert. Importantly, when an execution failure occurs, for example, when the butter is grasped but subsequently dropped as illustrated in Fig. 4, AtomicVLA can detect the task anomaly, regenerate a new atomic skill abstraction, and recover from the error to resume task execution.

[Figure 5. Demonstrations show the execution process of AtomicVLA* (second row) and the baseline π0.5 (first row) on the prompts "place the plate into the microwave and close the door" and "open the top drawer and place the block inside".]

Table 3. Long-horizon Multi-task Experiments (%). InP, IntoD, and IntoM stand for Objects in plate, Object into drawer, and Object into microwave, respectively.

| Method | InP | IntoD | IntoM | Avg. | ΔAvg. |
|---|---|---|---|---|---|
| π0 [3] | 45 | 55 | 10 | 36.7 | – |
| π0.5 [17] | 65 | 35 | 35 | 45.0 | – |
| AtomicVLA | 65 | 60 | 45 | 56.7 | +20.0 ↑ |
| AtomicVLA* | 75 | 60 | 55 | 63.3 | +18.3 ↑ |

Results on CALVIN. As shown in Tab. 2, AtomicVLA achieves an average task length of 4.09, outperforming the π0 baseline by 0.22, while AtomicVLA* reaches an average task length of 4.27, outperforming the π0.5 baseline by 0.25.
Notably, AtomicVLA* demonstrates a superior overall task completion rate, with relative improvements of 5.8%, 6.2%, and 6.6% on the last three stages of the evaluation sequence. These results indicate that AtomicVLA is particularly effective at handling temporally extended, sequential manipulation tasks. As illustrated in Fig. 4, we also observe that AtomicVLA exhibits a capability for error recovery in experiments. However, due to the evaluation constraints of the CALVIN benchmark, successful recoveries after failures are not counted as valid completions, which prevents subsequent tasks from being executed. As a result, the reported performance metrics may slightly underestimate the true capability of the model.

4.3. Results on Real-world Robot

Long-horizon Tasks. We perform mixed training on the collected data from the three long-horizon tasks. As shown in Tab. 3, AtomicVLA and AtomicVLA* outperform their respective baselines by 20% and 18.3%. As illustrated in Fig. 5, we present two representative long-horizon tasks. AtomicVLA* reliably completes the experimental configurations that π0.5 fails to accomplish, and this advantage becomes more evident in tasks involving door-closing operations. Building on this observation, AtomicVLA* demonstrates stronger robustness and execution stability across complex manipulation sequences.

Previous real-world studies on robotic manipulation typically focus on training and evaluating a single specific task, while joint training across multiple heterogeneous tasks has been relatively uncommon. Our observations indicate that combining tasks with large differences can lead to mutual interference, which in turn limits overall performance. This effect becomes particularly pronounced in tasks that involve significant changes in gripper state across different execution stages.
For instance, in the "Object into drawer" task, the drawer-opening subtask does not require gripper closure, which can adversely affect the model's behavior on other grasping-related tasks, resulting in unintended gripper opening or closing actions, as illustrated in Fig. 6. By constructing an explicit library of atomic skills, AtomicVLA effectively mitigates such cross-task interference: each skill precisely activates its corresponding expert to execute the required operation, which substantially alleviates interference between heterogeneous skills and overcomes the performance bottleneck of mixed multi-task training.

Continual learning skills. To evaluate the effectiveness of our proposed lifelong skill expansion mechanism in real-world scenarios, we conduct training and evaluation on a short-horizon task dataset consisting of five diverse manipulation categories. In this experiment, the "open" operation is treated as a new atomic skill, introduced as an additional capability during the continual learning phase after the initial training stage. Specifically, we first perform mixed training on four short-horizon tasks and then train the "open" skill independently on top of the pretrained model. In conventional baseline models, learning a new skill often causes substantial interference with previously acquired abilities, leading to noticeable performance degradation. As illustrated in Fig. 6, a case that was originally expected to succeed could not be completed after continual learning: the gripper failed to close promptly after reaching the target position. As shown in Tab. 4, the average success rate of π0.5 decreases by approximately 15%, with the stack task exhibiting the most severe interference, showing a 20% decrease. In contrast, AtomicVLA* maintains stable performance after continual learning: owing to its structured skill library management, the previously learned skills remain largely unaffected.
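The skill-expansion mechanism described above, where earlier experts stay untouched while a fresh expert is added for the new "open" skill, can be sketched as follows. This is an illustrative toy, not the paper's implementation: the class, the freezing flag, and the two-layer MLP experts are our own simplification of one routed expert per atomic skill.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_expert(dim, hidden):
    """A toy two-layer MLP standing in for one skill expert."""
    return {"W1": rng.standard_normal((dim, hidden)) * 0.02,
            "W2": rng.standard_normal((hidden, dim)) * 0.02,
            "trainable": True}

class SkillLibrary:
    """Toy atomic-skill expert library with continual expansion."""

    def __init__(self, dim, hidden, skills):
        self.dim, self.hidden = dim, hidden
        self.experts = {s: make_expert(dim, hidden) for s in skills}

    def add_skill(self, name):
        # Continual expansion: freeze every existing expert so earlier
        # skills stay untouched, then register a fresh trainable expert.
        for expert in self.experts.values():
            expert["trainable"] = False
        self.experts[name] = make_expert(self.dim, self.hidden)

    def forward(self, tokens, skill):
        # All tokens of the current skill stage go through one expert,
        # so heterogeneous skills never share (or interfere through) weights.
        e = self.experts[skill]
        return np.maximum(tokens @ e["W1"], 0.0) @ e["W2"]

# Base training on four short-horizon skills, then add "open" continually.
lib = SkillLibrary(dim=8, hidden=16, skills=["grasp", "stack", "close", "press"])
lib.add_skill("open")
```

After `add_skill`, gradient updates would touch only the new expert, which is why the base-task performance in Tab. 4 is largely preserved.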
Table 4. Continual Learning with Skill Expansion (%). ΔAvg. denotes the average performance change on the four base tasks after learning the new skill, compared with their performance before learning. CL denotes continual learning.

| Method | Grasp | Stack | Close | Press | Open (new) | Avg. | ΔAvg. |
|---|---|---|---|---|---|---|---|
| π0.5 [17] | 85 | 65 | 70 | 90 | – | 77.5 | – |
| π0.5 [17] (CL) | 70 | 45 | 60 | 75 | 55 | 61.0 | −15.0↓ |
| AtomicVLA* | 95 | 80 | 70 | 100 | – | 86.3 | – |
| AtomicVLA* (CL) | 90 | 80 | 70 | 100 | 70 | 82.0 | −1.3↓ |

Figure 6. Mixed-Training Skill Interference and Continual-Learning Degradation. The top two rows illustrate skill interference in long-horizon tasks: the first shows successful single-skill executions, while the second shows failures after mixed training. The bottom two rows show degradation after continual learning: the first row presents the performance of π0.5 before learning new skills, and the second shows its performance afterward. Red and green boxes highlight the key differences.

Moreover, under the same number of training steps, AtomicVLA* acquires new skills more efficiently and achieves an overall improvement of 21% across all five tasks compared to π0.5. These findings highlight the effectiveness of our approach for continual learning.

4.4. Ablation Study

We conduct ablation experiments on the LIBERO-LONG benchmark to evaluate the effectiveness of our skill-aware routing mechanism. Specifically, we compare AtomicVLA against three baselines: (i) a non-MoE π0-based baseline, (ii) a standard token-level Mixture-of-Experts (MoE) that selects experts independently for each action token, and (iii) a variant adapted from MoDE [39], which conditions expert selection on the denoising timestep t (i.e., using t as the routing signal).

Table 5. Results on the LIBERO Benchmark (%).

| Method | LIBERO-LONG |
|---|---|
| π0 [3] | 85.2 |
| + MoE | 88.6 |
| + MoDE [39] | 89.5 |
| + SG-MoE (Ours) | 95.2 |

As shown in Tab. 5, AtomicVLA achieves a success rate of 95.2%, outperforming the MoE baseline by 6.6% and the timestep-conditioned MoDE variant by 5.7%. The experimental results indicate that the performance gap between the MoE-based and MoDE-based methods is relatively small. This is primarily because both approaches rely on token-level expert routing, where the improvements largely stem from load balancing that distributes tokens across experts. As a result, each expert still learns a mixture of skills without clear specialization. In contrast, SG-MoE employs atomic skill abstractions as the routing criterion, ensuring that all tokens associated with a specific skill stage are consistently processed by the corresponding expert network. Consequently, each expert focuses on a single skill with a similar action distribution, reducing interference among different skills. Moreover, this notable performance gain demonstrates that routing experts based on semantically meaningful atomic skills, rather than on individual action tokens or denoising steps, leads to more coherent and efficient skill execution in long-horizon tasks.

5. Conclusion

In this paper, we introduce AtomicVLA, an end-to-end framework that unifies task planning and action execution for long-horizon tasks and continual skill expansion. We design a unified architecture capable of adaptively deciding task plans and generating latent action outputs, and construct an atomic skill-guided expert library based on our proposed SG-MoE architecture and the specialized skill router. AtomicVLA is inherently scalable: when learning new skills, it only requires extending the skill router and adding the corresponding new skill experts to rapidly acquire the novel capabilities. We validate AtomicVLA in both simulated and real-world robotic environments, demonstrating its superior performance in long-horizon tasks and continual learning.
Notably, it effectively mitigates skill interference arising from joint training and alleviates knowledge forgetting and performance degradation during continual skill acquisition, highlighting its significant potential for scalable continual learning in vision-language-action models.

Acknowledgments

This work was supported by the National Key Research and Development Program of China (2024YFE0203100), the Scientific Research Innovation Capability Support Project for Young Faculty (No. ZYGXQNJSKYCXNLZCXM-I28), the National Natural Science Foundation of China (NSFC) under Grants No. 62476293, No. 62372482, and No. 62272494, and in part by the Major Key Project of PCL (Grant No. PCL2025A17) and the General Embodied AI Center of Sun Yat-sen University.

References

[1] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee, Sergey Levine, Yao Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao, Kanishka Rao, Jarek Rettinghouse, Diego Reyes, Pierre Sermanet, Nicolas Sievers, Clayton Tan, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Mengyuan Yan, and Andy Zeng. Do as I can, not as I say: Grounding language in robotic affordances, 2022.
[2] Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthias Bauer, Matko Bošnjak, Xi Chen, Matthias Minderer, Paul Voigtlaender, Ioana Bica, Ivana Balazevic, Joan Puigcerver, Pinelopi Papalampidi, Olivier Henaff, Xi Xiong, Radu Soricut, Jeremiah Harmsen, and Xiaohua Zhai. Paligemma: A versatile 3b vlm for transfer, 2024.
[3] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A vision-language-action flow model for general robot control, 2024.
[4] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Lisa Lee, Tsang-Wei Edward Lee, Sergey Levine, Yao Lu, Henryk Michalewski, Igor Mordatch, Karl Pertsch, Kanishka Rao, Krista Reymann, Michael Ryoo, Grecia Salazar, Pannag Sanketi, Pierre Sermanet, Jaspiar Singh, Anikait Singh, Radu Soricut, Huong Tran, Vincent Vanhoucke, Quan Vuong, Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Jialin Wu, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, and Brianna Zitkovich. Rt-2: Vision-language-action models transfer web knowledge to robotic control, 2023.
[5] Hao Chen, Jiaming Liu, Chenyang Gu, Zhuoyang Liu, Renrui Zhang, Xiaoqi Li, Xiao He, Yandong Guo, Chi-Wing Fu, Shanghang Zhang, and Pheng-Ann Heng. Fast-in-slow: A dual-system foundation model unifying fast manipulation within slow reasoning, 2025.
[6] Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, and Radu Soricut. Pali-3 vision language models: Smaller, faster, stronger, 2023.
[7] Rohan Chitnis, Shubham Tulsiani, Saurabh Gupta, and Abhinav Gupta. Efficient bimanual manipulation using learned task schemas, 2020.
[8] Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models, 2024.
[9] Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvonne Chou, Arnavi Chheda, Jenna Sparks, Sam Skjonsberg, Michael Schmitz, Aaron Sarnat, Byron Bischoff, Pete Walsh, Chris Newell, Piper Wolters, Tanmay Gupta, Kuo-Hao Zeng, Jon Borchardt, Dirk Groeneveld, Crystal Nam, Sophie Lebrecht, Caitlin Wittlif, Carissa Schoenick, Oscar Michel, Ranjay Krishna, Luca Weihs, Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, and Aniruddha Kembhavi. Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models, 2024.
[10] Lutfi Eren Erdogan, Nicholas Lee, Sehoon Kim, Suhong Moon, Hiroki Furuta, Gopala Anumanchipalli, Kurt Keutzer, and Amir Gholami. Plan-and-act: Improving planning of agents for long-horizon tasks. arXiv preprint arXiv:2503.09572, 2025.
[11] Huang Fang, Mengxi Zhang, Heng Dong, Wei Li, Zixuan Wang, Qifeng Zhang, Xueyun Tian, Yucheng Hu, and Hang Li. Robix: A unified model for robot interaction, reasoning and planning, 2025.
[12] Roy Fox, Richard Shin, William Paul, Yitian Zou, Dawn Song, Ken Goldberg, Pieter Abbeel, and Ion Stoica. Hierarchical variational imitation learning of control programs, 2019.
[13] Yingdong Hu, Fanqi Lin, Tong Zhang, Li Yi, and Yang Gao. Look before you leap: Unveiling the power of gpt-4v in robotic vision-language planning, 2023.
[14] Haoxu Huang, Fanqi Lin, Yingdong Hu, Shengjie Wang, and Yang Gao. Copa: General robotic manipulation through spatial constraints of parts with foundation models, 2024.
[15] Suning Huang, Zheyu Zhang, Tianhai Liang, Yihan Xu, Zhehao Kou, Chenhao Lu, Guowei Xu, Zhengrong Xue, and Huazhe Xu. Mentor: Mixture-of-experts network with task-oriented perturbation for visual reinforcement learning, 2025.
[16] Physical Intelligence, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared DiCarlo, Danny Driess, Michael Equi, Adnan Esmail, Yunhao Fang, Chelsea Finn, Catherine Glossop, Thomas Godden, Ivan Goryachev, Lachy Groom, Hunter Hancock, Karol Hausman, Gashon Hussein, Brian Ichter, Szymon Jakubczak, Rowan Jen, Tim Jones, Ben Katz, Liyiming Ke, Chandra Kuchi, Marinda Lamb, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Yao Lu, Vishnu Mano, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z.
Ren, Charvi Sharma, Lucy Xiaoyang Shi, Laura Smith, Jost Tobias Springenberg, Kyle Stachowicz, Will Stoeckle, Alex Swerdlow, James Tanner, Marcel Torne, Quan Vuong, Anna Walling, Haohuan Wang, Blake Williams, Sukwon Yoo, Lili Yu, Ury Zhilinsky, and Zhiyuan Zhou. π*0.6: a vla that learns from experience, 2025.
[17] Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren, Lucy Xiaoyang Shi, Laura Smith, Jost Tobias Springenberg, Kyle Stachowicz, James Tanner, Quan Vuong, Homer Walke, Anna Walling, Haohuan Wang, Lili Yu, and Ury Zhilinsky. π0.5: a vision-language-action model with open-world generalization, 2025.
[18] Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, Xinda Xue, Qinghang Su, Huaihai Lyu, Xiaolong Zheng, Jiaming Liu, Zhongyuan Wang, and Shanghang Zhang. Robobrain: A unified brain model for robotic manipulation from abstract to concrete, 2025.
[19] Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mixtral of experts, 2024.
[20] Tao Jiang, Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Jianning Cui, Xiao Liu, Shuiqi Cheng, Jiyang Gao, Huazhe Xu, and Hang Zhao. Galaxea open-world dataset and g0 dual-system vla model, 2025.
[21] Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Prismatic vlms: Investigating the design space of visually-conditioned language models, 2024.
[22] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model, 2024.
[23] Daehee Lee, Minjong Yoo, Woo Kyung Kim, Wonje Choi, and Honguk Woo. Incremental learning of retrievable skills for efficient continual task adaptation, 2025.
[24] Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, Winson Han, Wilbert Pumacay, Angelica Wu, Rose Hendrix, Karen Farley, Eli VanderBilt, Ali Farhadi, Dieter Fox, and Ranjay Krishna. Molmoact: Action reasoning models that can reason in space, 2025.
[25] Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, Dehui Wang, Dingxiang Luo, Yuchen Fan, Youbang Sun, Jia Zeng, Jiangmiao Pang, Shanghang Zhang, Yu Wang, Yao Mu, Bowen Zhou, and Ning Ding. Simplevla-rl: Scaling vla training via reinforcement learning, 2025.
[26] Fanqi Lin, Ruiqian Nai, Yingdong Hu, Jiacheng You, Junming Zhao, and Yang Gao. Onetwovla: A unified vision-language-action model with adaptive reasoning, 2025.
[27] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling, 2023.
[28] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning, 2023.
[29] Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, Chengkai Hou, Mengdi Zhao, KC alex Zhou, Pheng-Ann Heng, and Shanghang Zhang. Hybridvla: Collaborative diffusion and autoregression in a unified vision-language-action model, 2025.
[30] Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: A diffusion foundation model for bimanual manipulation, 2025.
[31] Guanxing Lu, Wenkai Guo, Chubin Zhang, Yuheng Zhou, Haonan Jiang, Zifeng Gao, Yansong Tang, and Ziwei Wang. Vla-rl: Towards masterful and general robotic manipulation with scalable reinforcement learning, 2025.
[32] Xiaofeng Mao, Gabriele Giudici, Claudio Coppola, Kaspar Althoefer, Ildar Farkhatdinov, Zhibin Li, and Lorenzo Jamone. Dexskills: Skill segmentation using haptic data for learning autonomous long-horizon robotic manipulation tasks, 2024.
[33] Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks, 2022.
[34] Atharva Mete, Haotian Xue, Albert Wilcox, Yongxin Chen, and Animesh Garg. Quest: Self-supervised skill abstractions for learning continuous control, 2024.
[35] NVIDIA: Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi "Jim" Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith Llontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, You Liang Tan, Guanzhi Wang, Zu Wang, Jing Wang, Qi Wang, Jiannan Xiang, Yuqi Xie, Yinzhen Xu, Zhenjia Xu, Seonghyeon Ye, Zhiding Yu, Ao Zhang, Hao Zhang, Yizhou Zhao, Ruijie Zheng, and Yuke Zhu. Gr00t n1: An open foundation model for generalist humanoid robots, 2025.
[36] Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models, 2025.
[37] Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, and Xuelong Li. Spatialvla: Exploring spatial representations for visual-language-action model, 2025.
[38] Qwen: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi Tang, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. Qwen2.5 technical report, 2025.
[39] Moritz Reuss, Jyothish Pari, Pulkit Agrawal, and Rudolf Lioutikov. Efficient diffusion transformer policies with mixture of expert denoisers for multitask learning, 2024.
[40] Kaushik Roy, Akila Dissanayake, Brendan Tidd, and Peyman Moghadam. M2distill: Multi-modal distillation for lifelong imitation learning, 2025.
[41] Lucy Xiaoyang Shi, Brian Ichter, Michael Equi, Liyiming Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, Adrian Li-Bell, Danny Driess, Lachy Groom, Sergey Levine, and Chelsea Finn. Hi robot: Open-ended instruction following with hierarchical vision-language-action models, 2025.
[42] Robin Strudel, Alexander Pashevich, Igor Kalevatykh, Ivan Laptev, Josef Sivic, and Cordelia Schmid. Learning to combine primitive skills: A step towards versatile robotic manipulation, 2020.
[43] Jiankai Sun, Aidan Curtis, Yang You, Yan Xu, Michael Koehle, Qianzhong Chen, Suning Huang, Leonidas Guibas, Sachin Chitta, Mac Schwager, and Hui Li. Arch: Hierarchical hybrid learning for long-horizon contact-rich robotic assembly, 2025.
[44] Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Xiansheng Chen, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models, 2025.
[45] BAAI RoboBrain Team, Mingyu Cao, Huajie Tan, Yuheng Ji, Xiansheng Chen, Minglan Lin, Zhiyu Li, Zhou Cao, Pengwei Wang, Enshen Zhou, Yi Han, Yingbo Tang, Xiangqi Xu, Wei Guo, Yaoxu Lyu, Yijie Xu, Jiayu Shi, Mengfei Du, Cheng Chi, Mengdi Zhao, Xiaoshuai Hao, Junkai Zhao, Xiaojie Zhang, Shanyu Rong, Huaihai Lyu, Zhengliang Cai, Yankai Fu, Ning Chen, Bolun Zhang, Lingfeng Zhang, Shuyi Zhang, Dong Liu, Xi Feng, Songjing Wang, Xiaodan Liu, Yance Jiao, Mengsi Lyu, Zhuo Chen, Chenrui He, Yulong Ao, Xue Sun, Zheqi He, Jingshu Zheng, Xi Yang, Donghai Shi, Kunchang Xie, Bochao Zhang, Shaokai Nie, Chunlei Men, Yonghua Lin, Zhongyuan Wang, Tiejun Huang, and Shanghang Zhang. Robobrain 2.0 technical report, 2025.
[46] Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, Steven Bohez, Konstantinos Bousmalis, Anthony Brohan, Thomas Buschmann, Arunkumar Byravan, Serkan Cabi, Ken Caluwaerts, Federico Casarini, Oscar Chang, Jose Enrique Chen, Xi Chen, Hao-Tien Lewis Chiang, Krzysztof Choromanski, David D'Ambrosio, Sudeep Dasari, Todor Davchev, Coline Devin, Norman Di Palo, Tianli Ding, Adil Dostmohamed, Danny Driess, Yilun Du, Debidatta Dwibedi, Michael Elabd, Claudio Fantacci, Cody Fong, Erik Frey, Chuyuan Fu, Marissa Giustina, Keerthana Gopalakrishnan, Laura Graesser, Leonard Hasenclever, Nicolas Heess, Brandon Hernaez, Alexander Herzog, R.
Alex Hofer, Jan Humplik, Atil Iscen, Mithun George Jacob, Deepali Jain, Ryan Julian, Dmitry Kalashnikov, M. Emre Karagozler, Stefani Karp, Chase Kew, Jerad Kirkland, Sean Kirmani, Yuheng Kuang, Thomas Lampe, Antoine Laurens, Isabel Leal, Alex X. Lee, Tsang-Wei Edward Lee, Jacky Liang, Yixin Lin, Sharath Maddineni, Anirudha Majumdar, Assaf Hurwitz Michaely, Robert Moreno, Michael Neunert, Francesco Nori, Carolina Parada, Emilio Parisotto, Peter Pastor, Acorn Pooley, Kanishka Rao, Krista Reymann, Dorsa Sadigh, Stefano Saliceti, Pannag Sanketi, Pierre Sermanet, Dhruv Shah, Mohit Sharma, Kathryn Shea, Charles Shu, Vikas Sindhwani, Sumeet Singh, Radu Soricut, Jost Tobias Springenberg, Rachel Sterneck, Razvan Surdulescu, Jie Tan, Jonathan Tompson, Vincent Vanhoucke, Jake Varley, Grace Vesom, Giulia Vezzani, Oriol Vinyals, Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Fei Xia, Ted Xiao, Annie Xie, Jinyu Xie, Peng Xu, Sichun Xu, Ying Xu, Zhuo Xu, Yuxiang Yang, Rui Yao, Sergey Yaroshenko, Wenhao Yu, Wentao Yuan, Jingwei Zhang, Tingnan Zhang, Allan Zhou, and Yuxiang Zhou. Gemini robotics: Bringing ai into the physical world, 2025.
[47] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy, 2024.
[48] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning, 2018.
[49] Weikang Wan, Yifeng Zhu, Rutav Shah, and Yuke Zhu. Lotus: Continual imitation learning for robot manipulation through unsupervised skill discovery, 2024.
[50] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin.
Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution, 2024.
[51] Yixiao Wang, Mingxiao Huo, Zhixuan Liang, Yushi Du, Lingfeng Sun, Haotian Lin, Jinghuan Shang, Chensheng Peng, Mohit Bansal, Mingyu Ding, and Masayoshi Tomizuka. Ver: Vision expert transformer for robot learning via foundation distillation and dynamic routing, 2025.
[52] Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, Min Dou, Kai Chen, Wenhai Wang, Yu Qiao, Yali Wang, and Limin Wang. Internvideo2.5: Empowering video mllms with long and rich context modeling, 2025.
[53] Yating Wang, Haoyi Zhu, Mingyu Liu, Jiange Yang, Hao-Shu Fang, and Tong He. Vq-vla: Improving vision-language-action models via scaling vector-quantized action tokenizers, 2025.
[54] Junjie Wen, Minjie Zhu, Yichen Zhu, Zhibin Tang, Jinming Li, Zhongyi Zhou, Chengmeng Li, Xiaoyu Liu, Yaxin Peng, Chaomin Shen, and Feifei Feng. Diffusion-vla: Generalizable and interpretable robot foundation model via self-generated reasoning, 2025.
[55] Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, and Feifei Feng. Dexvla: Vision-language model with plug-in diffusion expert for general robot control, 2025.
[56] Philipp Wu, Yide Shentu, Zhongke Yi, Xingyu Lin, and Pieter Abbeel. Gello: A general, low-cost, and intuitive teleoperation framework for robot manipulators, 2024.
[57] Lu Xu, Jiaqian Yu, Xiongfeng Peng, Yiwei Chen, Weiming Li, Jaewook Yoo, Sunghyun Chunag, Dongwook Lee, Daehyun Ji, and Chao Zhang. Mose: Skill-by-skill mixture-of-experts learning for embodied autonomous machines, 2025.
[58] Jiange Yang, Haoyi Zhu, Yating Wang, Gangshan Wu, Tong He, and Limin Wang. Tra-moe: Learning trajectory prediction model from multiple domains for adaptive policy conditioning, 2025.
[59] Shuai Yang, Hao Li, Yilun Chen, Bin Wang, Yang Tian, Tai Wang, Hanqing Wang, Feng Zhao, Yiyi Liao, and Jiangmiao Pang. Instructvla: Vision-language-action instruction tuning from understanding to manipulation, 2025.
[60] Yi Yang, Jiaxuan Sun, Siqi Kou, Yihan Wang, and Zhijie Deng. Lohovla: A unified vision-language-action model for long-horizon embodied tasks, 2025.
[61] Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, and Junchi Yan. Drivemoe: Mixture-of-experts for vision-language-action model in end-to-end autonomous driving, 2025.
[62] Zhutian Yang, Caelan Garrett, Dieter Fox, Tomás Lozano-Pérez, and Leslie Pack Kaelbling. Guiding long-horizon task and motion planning with vision language models. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 16847–16853. IEEE, 2025.
[63] Yuanqi Yao, Siao Liu, Haoming Song, Delin Qu, Qizhi Chen, Yan Ding, Bin Zhao, Zhigang Wang, Xuelong Li, and Dong Wang. Think small, act big: Primitive prompt learning for lifelong robot manipulation, 2025.
[64] Jiawen Yu, Hairuo Liu, Qiaojun Yu, Jieji Ren, Ce Hao, Haitong Ding, Guangyu Huang, Guofan Huang, Yan Song, Panpan Cai, Cewu Lu, and Wenqiang Zhang. Forcevla: Enhancing vla models with a force-aware moe for contact-rich manipulation, 2025.
[65] Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Ming-Yu Liu, Donglai Xiang, Gordon Wetzstein, and Tsung-Yi Lin. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models, 2025.
[66] Jinliang Zheng, Jianxiong Li, Dongxiu Liu, Yinan Zheng, Zhihao Wang, Zhonghong Ou, Yu Liu, Jingjing Liu, Ya-Qin Zhang, and Xianyuan Zhan. Universal actions for enhanced embodied foundation models, 2025.
[67] Zhehua Zhou, Jiayang Song, Kunpeng Yao, Zhan Shu, and Lei Ma.
Isr-llm: Iterative self-refined large language model for long-horizon sequential task planning. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 2081–2088. IEEE, 2024.
[68] Zhongyi Zhou, Yichen Zhu, Junjie Wen, Chaomin Shen, and Yi Xu. Chatvla-2: Vision-language-action model with open-world embodied reasoning from pretrained knowledge, 2025.

AtomicVLA: Unlocking the Potential of Atomic Skill Learning in Robots
Supplementary Material

Contents
A.1. Video Demonstration
A.2. Future Work and Limitations
A.3. Additional Details
  A.3.1. Training Setup
  A.3.2. Simulations Setting
  A.3.3. Real-world Setting
  A.3.4. Continual Learning Setting
  A.3.5. Data Generation Setting
A.4. Additional Results
A.5. Additional Visualizations

A.1. Video Demonstration

Please refer to the video file in the attachment for a quick overview of AtomicVLA.

A.2. Future Work and Limitations

Most current vision-language-action (VLA) models are typically trained and evaluated on individual tasks. In this work, we investigate skill interference arising from multi-skill joint training through controlled experiments and introduce a Skill-Guided Mixture-of-Experts (SG-MoE) framework to construct a scalable atomic skill library, thereby exploring the potential of VLA models in long-horizon tasks and continual learning. Although this paradigm shows clear promise, many advantages remain insufficiently explored.
• AtomicVLA relies on a task planning module that produces accurate atomic skill abstractions and on a set of well-trained skill experts.
The skill router relies on the VLM to produce accurate atomic skill abstractions during task execution, a capability constrained by the VLM's reasoning and planning fidelity. Recent studies such as Embodied Brain [18, 44, 45] and π0.5 [17] indicate that combining large-scale web data with embodied experience can effectively train VLMs capable of skill decomposition and task planning, while also enabling the construction of a high-quality expert skill library, which can further enhance the performance of AtomicVLA.
• By decoupling skill learning, AtomicVLA substantially mitigates interference during multi-skill training and demonstrates strong adaptability to new skills. However, acquiring new tasks still requires collecting substantial human demonstration data for imitation learning (IL). Notably, recent works like π*0.6 [16], SimpleVLA-RL [25], and VLA-RL [31] have shown that reinforcement learning (RL) can effectively train VLA models and achieve strong performance. Integrating a pre-trained skill expert library with RL may empower AtomicVLA to generalize to novel tasks under few-shot or even zero-shot settings.

Table 6. Atomic skill distribution in the LIBERO dataset.

Atomic Skill   Count
Pick           2462
Place          761
Open           201
Close          152
Turn           175

A.3. Additional Details

A.3.1. Training Setup
For all experiments, we construct the skill library using one shared expert together with multiple skill experts. Each skill expert follows the Gemma architecture, where the feedforward module is implemented with an independent SwiGLU-activated MLP. All skill experts are randomly initialized at the beginning of training to enable disentangled skill representations and support incremental learning. The model configuration is width = 2048, mlp_dim = 4096, depth = 18, num_heads = 8, and head_dim = 256.
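The expert feed-forward block described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class and variable names, the NumPy backend, and the Gaussian initialization are our assumptions; only the SwiGLU structure and the width/mlp_dim shapes come from the stated configuration.

```python
import numpy as np

def silu(x):
    # SiLU (swish) activation: x * sigmoid(x)
    return x * (1.0 / (1.0 + np.exp(-x)))

class SwiGLUExpert:
    """Sketch of one skill expert's SwiGLU-activated MLP (hypothetical names)."""

    def __init__(self, width=2048, mlp_dim=4096, rng=None):
        rng = rng or np.random.default_rng(0)
        scale = width ** -0.5
        # Gated feedforward: two input projections, one output projection.
        self.w_gate = rng.normal(0.0, scale, (width, mlp_dim))
        self.w_up = rng.normal(0.0, scale, (width, mlp_dim))
        self.w_down = rng.normal(0.0, scale, (mlp_dim, width))

    def __call__(self, x):
        # x: (..., width) token activations; output has the same shape.
        return (silu(x @ self.w_gate) * (x @ self.w_up)) @ self.w_down

# Small dimensions for a quick shape check.
expert = SwiGLUExpert(width=32, mlp_dim=64)
out = expert(np.ones((4, 32)))
assert out.shape == (4, 32)
```

Because each expert owns an independent set of such weights, randomly initializing a new expert leaves previously trained experts untouched, which is what makes the incremental-learning setup possible.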
Building on this configuration, the learning rate follows a cosine decay schedule with a warm-up phase of 1,000 steps, a peak learning rate of 2.5 × 10⁻⁵, and a final learning rate of 5 × 10⁻⁶. The optimizer is AdamW with a gradient clipping norm of 1.0. To stabilize training, an exponential moving average (decay = 0.999) is used throughout optimization. Following this setup, we train the model for 100k iterations on both the LIBERO and CALVIN simulation platforms, and for 30k iterations in real-world robotic experiments, with a batch size of 64. All training is performed on 8× H200 GPUs, and inference is conducted on a single NVIDIA RTX PRO 6000 GPU.

A.3.2. Simulation Setting
LIBERO Setting. We use the public dataset provided by LIBERO and convert it into the LeRobot format for all experiments. Following the data processing method introduced in Sec. 3.5, we perform fine-grained annotation and organize the collected data into five atomic action abstractions: Pick, Place, Open, Close, and Turn. The data distribution for these action categories is presented in Tab. 6. All skills are trained in a mixed manner, so maintaining balanced data becomes essential. To achieve this, we increase the sampling frequency of the less represented actions, specifically Open, Close, and Turn, in order to equalize the data distribution and prevent insufficient training of the corresponding skill experts. For a fair comparison, AtomicVLA follows the evaluation protocol of the baseline methods, testing each task 50 times and reporting the average results.

CALVIN Setting. We use the ABC-D public dataset provided by CALVIN and divide the data according to the instruction annotations and the corresponding frame intervals. Each trajectory is capped at 64 frames and is converted into the LeRobot format for our experiments. Following the data processing method introduced in Sec. 3.5, we perform fine-grained annotation and organize the data into eight atomic action abstractions: Rotate, Push, Move, Open&Close, Lift, Place, Turn, and Stack. Based on these categories, we construct a skill expert library consisting of 8 skill experts. Building on this configuration, we ensure fair comparison by keeping the AtomicVLA evaluation protocol consistent with that of the baseline methods. In this setting, the robot executes 1,000 task sequences, and each sequence contains five consecutive tasks. We report the average success rates together with the average length of the completed sequences.

Role: You are an expert in robotics data analysis.
Task: You are analyzing a video clip of a robot performing a task, based on given task instructions. The clip was detected based on the robot's movement patterns and segmented into basic skill segments. Your goal is to determine the task progress of the current segment and identify the specific atomic actions.
Input: 1. The complete task instructions and coarse labels for the video clip. 2. Image frames from a video clip (sampled every three frames) ...... Instruction: Turn on the stove and put the mokapot on it. Coarse label: Turn.
Instructions: Your task is to provide a complete task chain based on the task instructions, and to analyze the current task progress and the corresponding atomic tasks and actions based on the coarse labels and video content. The task chain is a list of multiple atomic tasks. For each atomic task, the formatting constraints are:
1. Output one imperative sentence starting with an action verb.
2. Use exactly one verb from this set: [Pick/Place/Turn/Open/Close/Push/Pull/Adjust].
3. Focus on the final positions of the manipulated objects to avoid errors or repeated identification of atomic actions.
4. Sentence length limit: no more than 15 words.
5. Specify the manipulated object and key attributes (color, category, location/support surface/container).
6. For Place/Move, specify the destination or target container.
7. Describe only one atomic action and the final action (no multi-step sequences, no plans or intentions).
8. If this is the final step (N/N), consider the completeness of the task and avoid erroneous judgments such as "open" or "pick."
The atomic action must be one verb from this set: [Pick/Place/Turn/Open/Close/Push/Pull/Adjust].
Examples:
1. The task chain is [pick up the yellow cup, place the yellow cup in microwave, close the microwave], This is step 1/3, pick up the yellow cup, and the atomic abstraction is pick.
2. The task chain is [pick up the butter, place the butter in the basket], This is step 2/2, place the butter in the basket, and the atomic abstraction is place.
3. The task chain is [open the top drawer, pick up the block, place the block into the top drawer], This is step 2/3, open the top drawer, and the atomic abstraction is open.
Based on the text and video information above, please provide the task chain for this task, as well as the task progress and atomic actions corresponding to the video clip. Your judgment should be as detailed and accurate as possible, with reasoning supported by the video clip and task instructions. If the coarse label is incorrect, ignore it and provide the correct label.
Output Format: The task chain is <the list of atomic tasks>, This is step x/N, <current atomic task>, and the atomic abstraction is <choose one atomic action>.
Thought process and examples: The task chain is [turn on the stove, pick up the mokapot, place the mokapot on the stove], This is step 1/3, turn on the stove, and the atomic abstraction is turn.

Figure 7. The prompts and examples of InternVideo2.5.

Table 7. List of tasks and prompts used in our real-world experiments.

Long-horizon Tasks
Objects in plate: Place all blocks on the table into a green plate.
Object into drawer: Open the top drawer and place the block inside.
Object into microwave: Place the plate into the microwave and close the door.

Short Tasks
Grasp: Grasp the block from the table.
Stack: Stack the red block on the orange block.
Close: Close the microwave on the table.
Press: Press the button on the table.
Open: Open the top drawer.

Complex Scenes
Objects in plate: Put the pepper and corn into the green plate.
Objects in plate: Put the carrot and cucumber into the green plate.
Objects in plate: Put the potato and eggplant into the green plate.

Table 8. Results on Complex Scenes (success rate, %).

Method        Pepper/Corn   Carrot/Cucumber   Potato/Eggplant   Avg.
π0.5 [17]     25            40                35                33.3
AtomicVLA*    40            45                45                43.3

A.3.3. Real-world Setting
Hardware. Our real-world experimental setup consists of a Franka Research 3 robotic arm with two RealSense D435i cameras: one mounted on the wrist to provide a first-person perspective, and the other positioned opposite the robotic arm to offer a third-person view.
Evaluation Tasks. In the real world, we collected three long-horizon tasks and five short tasks, and additionally gathered three long-horizon tasks in more complex scenarios to evaluate the performance of AtomicVLA. We employed Gello [56] to control the Franka arm and record demonstration data. We collected 100 trajectories per long-horizon task and 50 per short task. The results reported in this paper were obtained using a multi-task mixed training protocol. Each task was evaluated 20 times with randomized object placements, and the average performance across these trials was reported as the final test result. The full list of tasks is presented in Tab. 7.

A.3.4. Continual Learning Setting
We conducted experiments on continual learning for short tasks. Specifically, we used four tasks for mixed training, iterating for 20k steps. Then, we applied "open the top drawer" as a new skill for continual learning, fine-tuning on the weights learned from the four tasks.
We used a learning rate of 5 × 10⁻⁶, iterated for 7k steps, and report the results by averaging over 20 validation runs for each of the five tasks.

A.3.5. Data Generation Setting
We use principal component analysis to obtain precise video segmentation and coarse labels. By analyzing the motion changes across five consecutive frames, we determine the dominant motion axis. Specifically, the threshold for the translation axes (∆x, ∆y, ∆z) is set to 3 cm, the threshold for the rotation axes (∆roll, ∆pitch, ∆yaw) is set to 0.05 radians, and the gripper change (∆Grip) threshold is set to 0.1. In Fig. 7, we provide the detailed prompts and examples for the VLM (InternVideo2.5 [52]). The VLM analyzes video clips and generates task chains, task progress, and atomic actions based on the input text instructions.

A.4. Additional Results
Detailed Results on CALVIN. As shown in Tab. 9, we report the performance of AtomicVLA* on the 34 evaluation tasks of the CALVIN ABC-D dataset. The results indicate that the model achieves success rates close to 100 percent on most tasks. However, performance on several "push block right" tasks is considerably lower, with average success rates only between 20 and 30 percent. Building on this observation, we find that in the training set the relevant blocks are typically placed near the center of the table, whereas during evaluation the blocks are often positioned on the right side of the table. This distribution shift leads the model to push the block in the correct direction while failing to push it far enough to satisfy the success criterion, which results in task failure and prevents the execution of subsequent steps.

Table 9. Success rates for all evaluated tasks on the CALVIN ABC-D dataset.

Task Name                  SR (%)    Task Name                SR (%)    Task Name                SR (%)
rotate blue block right    97.4      lift red block table     99.4      lift blue block table    99.4
move slider right          100.0     lift pink block table    94.5      place in drawer          100.0
lift red block slider      99.3      move slider left         100.0     rotate red block left    98.5
place in slider            98.6      turn on lightbulb        100.0     push pink block left     93.5
turn off lightbulb         100.0     rotate blue block left   100.0     lift blue block slider   95.6
turn off led               98.8      push blue block left     94.2      lift pink block drawer   100.0
push into drawer           86.0      turn on led              100.0     rotate pink block right  98.6
lift blue block drawer     100.0     stack block              98.4      unstack block            98.6
close drawer               100.0     push pink block right    33.8      push blue block right    22.2
lift pink block slider     97.8      push red block right     29.2      rotate pink block left   100.0
open drawer                100.0     push red block left      89.9      lift red block drawer    100.0
rotate red block right     97.3

Figure 8. Demonstrations of LIBERO and CALVIN experiments, comparing AtomicVLA and π0 on "put both mokapots on the stove" and "rotate blue block right, move slider right, lift red block, place in slider, turn off lightbulb".

Results on Complex Scenes. As shown in Tab. 8, we report the performance of AtomicVLA* and π0.5 on three additional real-world experiments designed to evaluate the ability to handle complex scenes and grasp irregular objects. AtomicVLA* achieved an average accuracy of 43.3%, which is 10% higher than the π0.5 average. In addition, when picking corn, whose color is similar to the table background, AtomicVLA* was able to make multiple corrections as it approached the target, resulting in a 15% improvement.

Parameters and inference time. Tab. 10 shows the parameter counts and inference time on a single H20 GPU. Even with 12 experts, the inference latency is only 160 ms, which is fully practical for real-world use.

Table 10. Parameters and inference time.

           π0        K=5       K=8       K=12
Params     3.24B     4.17B     4.81B     5.65B
Act        71 ms     92 ms     126 ms    160 ms
Think      -         104 ms    104 ms    104 ms

A.5. Additional Visualizations
In Fig. 8, we present a comparison between AtomicVLA and π0 across simulation environments. Representative task cases are selected from both LIBERO and CALVIN. As shown, AtomicVLA successfully completes several task instances where π0 fails, demonstrating its stronger robustness and execution reliability in simulated settings.

In Fig. 9, we further illustrate AtomicVLA's real-world error recovery capability. When a subtask fails during execution, AtomicVLA automatically replans and corrects its behavior to ensure successful completion of the overall task. Specifically, as highlighted in the red box in the figure, when execution errors occur, such as misgrasps due to inaccurate positioning or visual ambiguity between the target object and the background, AtomicVLA can assess the current task state, generate an updated task plan, and reattempt the failed subtask, thereby ensuring robust completion of the overall task.

Figure 9. Error recovery cases in real-world experiments.

Additionally, we show more demonstrations of real-world experiments in Fig. 10. These experiments span a wide spectrum of scenarios, from simple to highly complex tasks and from regular to irregular objects. Across all settings, AtomicVLA consistently exhibits strong performance and robust generalization.

Task: Place the plate into the microwave and close the door.
Task: Place all blocks on the table into the green plate.
Task: Open the top drawer and place the block inside.
Task: Put the pepper and corn into the green plate.
Task: Put the carrots and cucumbers into the green plate.
Task: Put the potatoes and eggplants into the green plate.

Figure 10. Demonstrations of real-world experiments (long-horizon tasks).
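As a concrete illustration of the motion-based segmentation described in Sec. A.3.5, the following minimal sketch applies the stated thresholds (3 cm translation, 0.05 rad rotation, 0.1 gripper) over five-frame windows. This is a simplified stand-in, not the actual pipeline: the function names, the per-threshold normalization, and the axis-change boundary rule are our illustrative assumptions; the paper's method additionally uses principal component analysis.

```python
import numpy as np

# Thresholds from Sec. A.3.5: 3 cm translation, 0.05 rad rotation, 0.1 gripper.
TRANS_THRESH, ROT_THRESH, GRIP_THRESH = 0.03, 0.05, 0.1
WINDOW = 5  # motion is assessed over five consecutive frames

def dominant_axis(traj, t):
    """Return the dominant motion axis over frames [t, t + WINDOW), or None.

    traj: (T, 7) array of (x, y, z, roll, pitch, yaw, grip) per frame.
    Each channel's motion is normalized by its own threshold so that
    translation, rotation, and gripper changes are comparable; an axis
    only counts as dominant if its motion exceeds its threshold.
    """
    end = min(t + WINDOW, len(traj) - 1)
    delta = np.abs(traj[end] - traj[t])
    thresh = np.array([TRANS_THRESH] * 3 + [ROT_THRESH] * 3 + [GRIP_THRESH])
    score = delta / thresh
    axis = int(np.argmax(score))
    return axis if score[axis] >= 1.0 else None

def segment(traj):
    """Emit a coarse segment boundary whenever the dominant axis changes."""
    boundaries, prev = [0], None
    for t in range(0, len(traj) - 1, WINDOW):
        axis = dominant_axis(traj, t)
        if axis is not None:
            if prev is not None and axis != prev:
                boundaries.append(t)
            prev = axis
    return boundaries

# Synthetic check: 10 frames of x-translation, then 10 frames of yaw rotation,
# should produce one boundary at the transition.
traj = np.zeros((20, 7))
traj[:10, 0] = np.arange(10) * 0.02   # x moves 2 cm per frame
traj[10:, 0] = traj[9, 0]             # x then stays fixed
traj[10:, 5] = np.arange(10) * 0.05   # yaw rotates 0.05 rad per frame
assert segment(traj) == [0, 10]
```

Segments found this way receive coarse labels from the dominant axis and are then refined by the VLM prompt of Fig. 7.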