Paper deep dive
Scaling Sim-to-Real Reinforcement Learning for Robot VLAs with Generative 3D Worlds
Andrew Choi, Xinjie Wang, Zhizhong Su, Wei Xu
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%
Last extracted: 3/22/2026, 6:05:42 AM
Summary
The paper introduces a method to scale sim-to-real reinforcement learning for Vision-Language-Action (VLA) models by using generative 3D world models to create diverse, interactive training environments. By leveraging a language-driven scene designer and PPOFlow (a PPO-based variant of ReinFlow), the authors demonstrate that increasing scene diversity significantly improves zero-shot generalization and sim-to-real transfer, achieving substantial performance gains in both simulation and real-world robot manipulation tasks.
Entities (5)
Relation Signals (3)
EmbodiedGen → generates → 3D interactive environments
confidence 98% · EmbodiedGen takes a structured scene tree layout as input and produces a fully interactive physical 3D world
PPOFlow → finetunes → VLA
confidence 95% · PPOFlow—a PPO-based variant of ReinFlow—effectively fine-tunes large pretrained flow matching VLAs.
ManiSkill 3 → hosts → RL fine-tuning
confidence 94% · we employ ManiSkill 3 as our GPU-parallelized simulator and extend EmbodiedGen into a comprehensive simulation environment generation backend.
Cypher Suggestions (2)
Identify the relationship between generative models and simulation environments · confidence 95% · unvalidated
MATCH (g:GenerativeModel)-[:GENERATES]->(e:Environment) RETURN g.name, e.type
Find all models used for robot manipulation fine-tuning · confidence 90% · unvalidated
MATCH (m:Model)-[:USED_FOR]->(t:Task {name: 'robot manipulation'}) RETURN m.name
Abstract
The strong performance of large vision-language models (VLMs) trained with reinforcement learning (RL) has motivated similar approaches for fine-tuning vision-language-action (VLA) models in robotics. Many recent works fine-tune VLAs directly in the real world to avoid addressing the sim-to-real gap. While real-world RL circumvents sim-to-real issues, it inherently limits the generality of the resulting VLA, as scaling scene and object diversity in the physical world is prohibitively difficult. This leads to the paradoxical outcome of transforming a broadly pretrained model into an overfitted, scene-specific policy. Training in simulation can instead provide access to diverse scenes, but designing those scenes is also costly. In this work, we show that VLAs can be RL fine-tuned without sacrificing generality and with reduced labor by leveraging 3D world generative models. Using these models together with a language-driven scene designer, we generate hundreds of diverse interactive scenes containing unique objects and backgrounds, enabling scalable and highly parallel policy learning. Starting from a pretrained imitation baseline, our approach increases simulation success from 9.7% to 79.8% while achieving a 1.25× speedup in task completion time. We further demonstrate successful sim-to-real transfer enabled by the quality of the generated digital twins together with domain randomization, improving real-world success from 21.7% to 75% and achieving a 1.13× speedup. Finally, we further highlight the benefits of leveraging the effectively unlimited data from 3D world generative models through an ablation study showing that increasing scene diversity directly improves zero-shot generalization.
Tags
Links
- Source: https://arxiv.org/abs/2603.18532v1
- Canonical: https://arxiv.org/abs/2603.18532v1
Full Text
63,624 characters extracted from source content.
Scaling Sim-to-Real Reinforcement Learning for Robot VLAs with Generative 3D Worlds

Andrew Choi, Xinjie Wang, Zhizhong Su, and Wei Xu† (firstname.lastname@horizon.auto)
Horizon Robotics. †Corresponding author.

Figure 1: Overall pipeline diagram. Real-world data trains an imitation policy π_pre. Task descriptions are fed to a language-driven scene designer, which forwards layouts to a 3D world generative model to produce digital twin scenes. π_θ is trained across these scenes, initialized from π_pre, with massive parallelization and domain randomization. Finally, the trained π_θ is deployed in the real world.

Abstract: The strong performance of large vision-language models (VLMs) trained with reinforcement learning (RL) has motivated similar approaches for fine-tuning vision-language-action (VLA) models in robotics. Many recent works fine-tune VLAs directly in the real world to avoid addressing the sim-to-real gap. While real-world RL circumvents sim-to-real issues, it inherently limits the generality of the resulting VLA, as scaling scene and object diversity in the physical world is prohibitively difficult. This leads to the paradoxical outcome of transforming a broadly pretrained model into an overfitted, scene-specific policy. Training in simulation can instead provide access to diverse scenes, but designing those scenes is also costly. In this work, we show that VLAs can be RL fine-tuned without sacrificing generality and with reduced labor by leveraging 3D world generative models. Using these models together with a language-driven scene designer, we generate hundreds of diverse interactive scenes containing unique objects and backgrounds, enabling scalable and highly parallel policy learning. Starting from a pretrained imitation baseline, our approach increases simulation success from 9.7% to 79.8% while achieving a 1.25× speedup in task completion time. We further demonstrate successful sim-to-real transfer enabled by the quality of the generated digital twins together with domain randomization, improving real-world success from 21.7% to 75% and achieving a 1.13× speedup. Finally, we further highlight the benefits of leveraging the effectively unlimited data from 3D world generative models through an ablation study showing that increasing scene diversity directly improves zero-shot generalization.

arXiv:2603.18532v1 [cs.RO] 19 Mar 2026

Keywords: sim-to-real reinforcement learning, vision-language-action models, robot manipulation, generative simulation

1 Introduction

Though internet-scale training of VLMs has seen explosive success, the performance of VLAs has comparatively lagged behind.
Recently, the dominant pipeline for training robot foundation models has consisted of leveraging all layers of the "data pyramid": 1) pretraining a VLM backbone on internet-scale data, 2) further pretraining via imitation learning on large-scale robot datasets [1,2], and 3) task- and embodiment-specific fine-tuning via imitation learning, real-world reinforcement learning, or sim-to-real reinforcement learning. While the first two stages are largely addressed by training on publicly available data, the third stage often requires significant human involvement. For imitation learning and offline RL, this involvement comes in the form of manual real-world data collection. For real-world RL, human-in-the-loop (HIL) supervision is often required to achieve sufficient sample efficiency [3,4]. Conversely, sim-to-real RL eliminates the need for manual data collection and HIL by instead generating an abundance of data through massive parallelization [5], but introduces new challenges in the form of visual and dynamic sim-to-real gaps. Furthermore, though elements of the learning pipeline such as environmental resets and reward detection are significantly easier in simulation, designing the 3D environments themselves—albeit a one-time cost—also requires substantial human effort.

An often-overlooked drawback of real-world fine-tuning is the loss of generality that arises when training on only a small number of scenes. Because scaling scene diversity in the physical world is prohibitively expensive, most prior work fine-tunes VLAs in narrowly scoped settings. As a result, fine-tuning frequently transforms a broadly pretrained VLA into an overfitted, scene-specific policy. In this paper, we show that generative 3D world models can automatically construct large training distributions of interactive environments for RL fine-tuning of VLAs, enabling scalable sim-to-real learning beyond the small number of hand-designed environments used in prior work. A snapshot of the overall pipeline can be seen in Fig. 1. Overall, our main contributions involve showing that:

1. Reinforcement learning and 3D world generative models strongly complement the large capacity of VLAs. We introduce a language-driven scene designer that converts task descriptions into structured scene layouts, which are then used by a 3D world generative model to produce fully interactive environments. Paired with highly parallelized RL, this enables VLAs to be fine-tuned across a wide variety of scenes without requiring manual scene design. In this work, we fine-tune a pretrained VLA in simulation, achieving a 70.1-percentage-point improvement in success rate and a 1.25× task completion speedup across 100 unique generated scenes.

2. Increasing scene diversity substantially improves zero-shot generalization. Through extensive ablations in simulation, we show that zero-shot generalization scales positively with training distribution diversity. Results show that fine-tuning on 50 scenes can increase zero-shot success rate by 24.8 percentage points compared to fine-tuning on a single scene.

3. Techniques for sim-to-real transfer of shallow models carry over to VLAs. We achieve positive sim-to-real deployment results by leveraging 1) the digital twin-like quality of the generated 3D objects and scenes, 2) simple domain randomization, and 3) PD control with gravity compensation. We achieve a 53.3-percentage-point improvement in success rate and a 1.13× task completion speedup across 12 scenes and 240 real-world experiments.
4. RL fine-tuning flow matching VLAs into Gaussian policies is effective. We demonstrate that PPOFlow—a PPO-based [6] variant of ReinFlow [7]—effectively fine-tunes large pretrained flow matching VLAs. PPOFlow can transform a multi-step flow matching policy into a single-step Gaussian policy, which was previously thought to harm performance [8]. Instead, our findings directly support the recent work of Pan et al. [9] in that the multimodality of diffusion models is not an important factor for a high-performing robot policy. Furthermore, we demonstrate an inference latency speedup of 2.36× through PPOFlow by removing the need for iterative denoising of actions, similar to consistency policies [10,11,3]. More details can be seen in Sec. B.

2 Related Work

VLA models. Motivated by the successes of VLMs in computer vision and NLP, large open-source robot datasets, both physical [12,1] and synthetic [2,13], have been created to facilitate supervised pretraining of VLAs. Using such datasets, a series of open-source robot VLAs have been released, such as OpenVLA [14], SpatialVLA [15], GraspVLA [2], π_0 [16], π_0.5 [17], and many more. A common strategy is then to fine-tune a pretrained model via supervised imitation learning on task-specific demonstration data [18]. Though straightforward, such approaches often suffer in performance when encountering out-of-distribution states [19], and recent work has suggested that dataset fragmentation introduces unwanted shortcut learning in VLAs [20]. Given this, there has been an increasing trend of instead using reinforcement learning to fine-tune VLAs.

Learning in real. One popular strategy for RL fine-tuning VLAs has been to perform RL directly in the real world [3,21,4,22,23,24,25]. Though RL in real circumvents issues arising from the sim-to-real gap, such methods must handle tedious issues that don't arise in simulation. One key consideration is how to administer the reward signal. For simplicity, most methods rely on a sparse success reward signal, which is administered by either a trained reward model [21,23,3,26], human-engineered checkers [24], or a human labeler [4]. Some approaches have used denser reward formulations such as steps-to-go [22]. Another consideration is how to perform scene resets, which can be handled by manual resets [3,4], scripted resets [3,25], or learning the reset task [21,23,24]. Lastly, it has become increasingly evident that a combination of offline RL and human-in-the-loop learning is a key necessity for reasonable sample efficiency when executing RL on real hardware [23,3,4]. Though such methods have shown impressive results for learning tasks on a set scene and objects in a reasonable amount of time, the ability to maintain the generality of such policies in both scene and object space through scaling is a major question, especially given the necessity for human guidance.

Learning in simulation. The other method of RL fine-tuning VLAs is to do so in simulation [27,28,5]. Compared to real, training in simulation offers numerous benefits, such as oracle success detection, automatic resets, and access to privileged information for teacher-student training setups [29,30,31,32]. Due to the domain difference, special consideration must be taken to ensure that such policies transfer from sim to real successfully.
Several strategies for handling the sim-to-real gap exist, including domain randomization [29,33], real-to-sim delta action models [34], real-to-sim physical property estimation [35], real-to-sim scene reconstruction [36,31,37], and more (albeit many of these methods have been applied to smaller models rather than at foundation scale). Training shallow models from scratch often requires heavily engineered reward functions [29,34,33]. With this, fine-tuning VLAs has immense benefits over shallow models, as a well-trained VLA prior requires only a sparse reward and massive parallelization to learn [5]. Recent works for RL fine-tuning VLAs include leveraging real-to-sim-to-real robot video synthesis to combine real data with the scalability of simulation [27], as well as learning within learned world models [28]. In contrast to these prior works, our main focus in this manuscript is to explore a different angle: can RL fine-tuning be done across as wide a scene distribution as possible? And if so, how does increasing scene diversity improve zero-shot generalization?

3D world generative models. Constructing large-scale, realistic 3D simulation environments remains a fundamental bottleneck for sim-to-real reinforcement learning. Existing 3D generative paradigms (e.g., TRELLIS [38], WorldGen [39]) primarily focus on producing static visual assets that lack physical interactivity. Although recent works such as Holodeck [40] and RoboGen [41] attempt to procedurally construct interactive environments, end-to-end generation of task-level physical scenes remains challenging. Recently, emerging generative engines, exemplified by EmbodiedGen [42], have enabled natural-language-driven generation of interactive 3D scenes. In this work, we explore the integration of generative 3D engines as a low-cost, highly parallelizable digital twin pipeline for scalable data generation, enabling large-scale RL fine-tuning of VLA models in simulation.

Figure 2: Overview of the simulation environment generation pipeline. A GPT-4o-powered scene designer converts task descriptions into structured scene graphs over semantic roles and spatial relations, which are instantiated into fully interactive 3D worlds. A quality assurance loop filters physically implausible configurations before simulation, enabling scalable, on-demand generation of diverse environments for RL training.

3 Methodology

Generative simulation. To construct large-scale, physically interactive 3D environments for sim-to-real reinforcement learning, we employ ManiSkill 3 [43] as our GPU-parallelized simulator and extend EmbodiedGen [42] into a comprehensive simulation environment generation backend. As illustrated in Fig. 2, EmbodiedGen takes a structured scene tree layout as input and produces a fully interactive physical 3D world via its text-to-3D asset generation and layout composition interfaces. We develop a scene designer powered by GPT-4o [44] that automatically produces valid scene layouts from task descriptions. Given an instruction such as "put the pen in the pen holder", the scene designer parses it into a structured scene graph comprising core semantic roles (background, context, distractors, manipulation targets, and the robot) along with their spatial relations (e.g., ON, IN). This graph-based representation enables flexible, compositional control over scene complexity and distractor density, allowing systematic modulation of training difficulty.
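For intuition, a scene graph for this instruction might look like the sketch below. The field names are purely illustrative and do not reflect EmbodiedGen's actual schema; only the role and relation vocabulary comes from the description above.

```python
# Hypothetical scene-graph layout for "put the pen in the pen holder".
# Field names are illustrative, not EmbodiedGen's real input format.
scene_graph = {
    "task": "put the pen in the pen holder",
    "background": {"prompt": "tidy home office"},
    "robot": {"asset": "widowx_250s", "relation": ("ON", "table")},
    "nodes": [
        {"id": "table",      "role": "context",    "prompt": "simple wooden table"},
        {"id": "pen",        "role": "target",     "prompt": "blue ballpoint pen",
         "relation": ("ON", "table")},
        {"id": "pen_holder", "role": "target",     "prompt": "black mesh pen holder",
         "relation": ("ON", "table")},
        {"id": "notebook",   "role": "distractor", "prompt": "spiral notebook",
         "relation": ("ON", "table")},
    ],
}
```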
Each generated scene is then passed through an automated GPT-4o-driven quality assurance loop that checks physical plausibility and geometric consistency, discarding configurations that would cause simulation instability.

Markov decision process. We formulate our learning problem as a finite-horizon partially observable Markov decision process (POMDP) with state space S, action space A, and discount factor γ ∈ [0, 1]. At each timestep t, the agent receives an observation o_t and selects an action a_t according to a policy π(a_t | o_t). An observation consists of an RGB image I ∈ R^{H×W×3}, a language instruction e ∈ R^L, and proprioceptive information in the form of the end-effector pose q ∈ R^7, where L denotes the instruction length and q is in (xyz, rpy, gripper) form. The full observation is given by o_t = [I_t, e_t, q_t]. Actions are represented as (end-effector delta-pose, binary gripper) action chunks A ∈ R^{C×7}, where C denotes the action chunk length. We treat each action chunk as a single decision step, resulting in a continuous action space A ⊂ R^{C×7}. The objective is to learn a policy that maximizes the expected discounted return $J(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^{T-1} \gamma^t r_t\right]$, where T denotes the episode horizon.

Flow matching models. For RL fine-tuning to work with solely a sparse reward signal, we look to use a pretrained robot foundation model to boost sample efficiency [5]. In this work, we use π_0 [16] as our baseline imitation model, initialized from a checkpoint pretrained on the BridgeV2 dataset [12]. The π_0 model is composed of a VLM backbone E_θ and an action expert head v_θ (left of Fig. 3). The π_0 model is trained using a rectified flow-matching objective

$$\mathcal{L}_{\text{flow}}(\theta) = \mathbb{E}_{o_t, A_t^1 \sim \mathcal{D},\, \varepsilon \sim \mathcal{N}(0,I),\, \tau \sim \mathcal{U}(0,1)}\, \big\| v_\theta(A_t^\tau, KV_\theta(o_t), \tau) - (A_t^1 - \varepsilon) \big\|_2^2, \quad (1)$$

where D denotes the demonstration dataset, τ ∈ [0, 1] is the continuous integration time, KV_θ(o_t) denotes the key-value tensors of E_θ(o_t) for cross-attention, and A_t^τ = τ A_t^1 + (1 − τ)ε.

Figure 3: Architecture diagram of the π_0 model as a pretrained imitation model π_pre (left) and then modified for RL fine-tuning, π_θ (right).

Robot actions are generated by numerically integrating the learned ODE

$$\frac{dA_t^\tau}{d\tau} = v_\theta(A_t^\tau, KV_\theta(o_t), \tau), \quad (2)$$

where A_t^0 ∼ N(0, I). We denote the resulting pretrained imitation policy as π_pre.

PPOFlow. As flow matching models are deterministic, the importance ratio cannot be computed, preventing standard PPO updates. ReinFlow solves this by injecting learnable noise σ_φ(A_t^τ, z_t, τ) into the numerical integration process, where z_t denotes the stop-gradient hidden state z_t = sg(E_θ(o_t)). This effectively converts each integration step A_t^{τ+Δτ} = A_t^τ + v_θ(A_t^τ, KV_θ(o_t), τ)Δτ into a Gaussian sample

$$\bar{A} = A_t^\tau + v_\theta(A_t^\tau, KV_\theta(o_t), \tau)\,\Delta\tau, \quad (3)$$
$$A_t^{\tau+\Delta\tau} \sim \mathcal{N}\big(\bar{A},\ \sigma_\phi(A_t^\tau, z_t, \tau)\big). \quad (4)$$

Let us denote the number of numerical integration steps as K = 1/Δτ. This then allows us to compute the joint log probability of the denoising process for a particular step t as

$$\log \pi(A_t^0, \ldots, A_t^1 \mid o_t) = \log \mathcal{N}(0, I) + \sum_{k=0}^{K-1} \log \pi\big(A_t^{(k+1)\Delta\tau} \mid A_t^{k\Delta\tau}, o_t\big). \quad (5)$$

In addition to the noise head σ_φ, we also add a value head V_ψ(z_t) for estimating the state value. A full diagram of π_θ can be observed on the right side of Fig. 3.
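As an illustration of Eqs. (3)-(5), the following PyTorch sketch performs the noisy Euler integration and accumulates the joint log-probability of the denoising chain. Here `velocity_head` and `noise_head` are placeholder callables standing in for v_θ and σ_φ (the real heads condition on the cached VLM key-values and the stop-gradient hidden state); shapes follow the (C × 7) action-chunk convention above.

```python
import torch

B, C = 4, 4  # illustrative batch size; action chunk length C = 4 as in this work

def sample_action_with_logprob(velocity_head, noise_head, obs_kv, z, K=1):
    """Noisy Euler integration of the flow ODE (Eqs. 3-5)."""
    dt = 1.0 / K
    A = torch.randn(B, C, 7)  # A^0_t ~ N(0, I): start the chain from pure noise
    logp = torch.distributions.Normal(0.0, 1.0).log_prob(A).sum(dim=(-2, -1))
    for k in range(K):
        tau = k * dt
        mean = A + velocity_head(A, obs_kv, tau) * dt      # Eq. (3)
        std = noise_head(A, z, tau)                        # learnable sigma_phi > 0
        step = torch.distributions.Normal(mean, std)
        A = step.sample()                                  # Eq. (4)
        logp = logp + step.log_prob(A).sum(dim=(-2, -1))   # accumulate Eq. (5)
    return A, logp  # final action chunk and joint log-probability

# Smoke test with dummy heads:
# A, lp = sample_action_with_logprob(lambda A, kv, t: torch.zeros_like(A),
#                                    lambda A, z, t: torch.full_like(A, 0.1),
#                                    obs_kv=None, z=None, K=1)
```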
We can then directly use the log probability (Eq. 5) to compute the following power-scaled importance ratio, which is then used along with the value from V_ψ in the original PPO clipped objective:

$$\hat{r}_t = \left( \frac{\pi_\theta(A_t^0, \ldots, A_t^1 \mid o_t)}{\pi_{\theta,\text{old}}(A_t^0, \ldots, A_t^1 \mid o_t)} \right)^{s}, \quad (6)$$

$$\mathcal{L}_{\text{PPOFlow}}(\theta, \phi) = \mathbb{E}_t\Big[ \min\big( \hat{r}_t \hat{A}_t,\ \operatorname{clip}(\hat{r}_t, 1-\varepsilon, 1+\varepsilon)\, \hat{A}_t \big) \Big], \quad (7)$$

where Â_t is the advantage estimate computed with GAE [45], ε is the clipping range, and s ∈ (0, 1] is a scaling factor that reduces variance and helps keep the importance ratio within a more stable numerical range [46].
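To make Eqs. (6)-(7) concrete, below is a minimal PyTorch-style sketch of the PPOFlow surrogate loss. It assumes the joint denoising log-probabilities of Eq. (5) have already been computed under the current and behavior policies; the default values of s and ε match Table A.1, and the sign is flipped so the objective can be minimized with gradient descent.

```python
import torch

def ppoflow_loss(logp_new, logp_old, adv, s=0.2, clip_eps=0.2):
    """Clipped surrogate with the power-scaled ratio of Eqs. (6)-(7).

    logp_new, logp_old: joint log-probs from Eq. (5) under pi_theta and
    pi_theta_old; adv: GAE advantage estimates. All are 1-D tensors.
    """
    ratio = torch.exp(s * (logp_new - logp_old))        # Eq. (6): (pi/pi_old)^s
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = torch.min(ratio * adv, clipped * adv)   # Eq. (7)
    return -surrogate.mean()                            # negate: maximize via SGD

# Example: loss = ppoflow_loss(torch.randn(8), torch.randn(8), torch.randn(8))
```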
4 Experiments

Experiment design. To study the effect of scaling the number of scenes N, we generate 100 unique scenes and denote this full set as W (see Sec. C for details and metrics). In this work, we focus on a pick-and-place task in which, given a language command, the policy must grasp an object and place it on top of another amid several distractors. For embodiment, we use the same Interbotix WidowX 250S manipulator (Fig. 4) as in Bridge.

Figure 4: Experiment overview. A single external Logitech C922 webcam is used for vision. Real-world evaluation is conducted on an NVIDIA RTX 4090 GPU.

To evaluate out-of-distribution (OOD) generalization, we construct three randomly sampled subsets of W, each containing 50 scenes, denoted H_i for i ∈ {0, 1, 2}. For each subset, the corresponding OOD set is defined as the complement H̄_i = W \ H_i. Using each H_i, we train three independent runs for each N ∈ {1, 3, 10, 25, 50}, where training uses the first N scenes under a fixed ordering of H_i. In addition, we train a single run on the full set W. The initial flow-matching imitation policy π_pre serves as a baseline and uses K = 10 and action chunk size C = 4. To investigate the benefits of training on generated scenes, we introduce another baseline where we RL fine-tune a policy on three manually designed Bridge tabletop scenes from SimplerEnv [47].
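The subset construction amounts to a few lines of bookkeeping; a sketch under the assumption that scenes are indexed 0-99 (the seed handling is illustrative):

```python
import random

ALL_SCENES = list(range(100))  # W: the full set of generated scenes

def make_splits(seed, n_train):
    """Sample a 50-scene subset H_i; its complement in W is the OOD set.
    Training uses the first n_train scenes under a fixed ordering of H_i."""
    rng = random.Random(seed)
    H = rng.sample(ALL_SCENES, 50)
    ood = [s for s in ALL_SCENES if s not in set(H)]  # OOD set = W \ H_i
    return H[:n_train], ood

train_scenes, ood_scenes = make_splits(seed=0, n_train=10)  # e.g., N = 10
```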
Training details. For training, we use 8 NVIDIA RTX 6000 Ada GPUs to LoRA fine-tune the VLM E_θ and fully fine-tune the action expert v_θ over 5 days. The value head V_ψ and noise head σ_φ are shallow MLPs and are also fully fine-tuned. Training curves are shown in Fig. 5. To enable successful sim-to-real transfer, we rely on three factors: (1) the digital twin-like fidelity of generated objects and scenes, (2) domain randomization, and (3) PD control with gravity compensation.

Figure 5: Training curves for all N.

With gravity compensation enabled, we found that explicit system identification was largely unnecessary, provided that both simulation and real-world controllers could reach the majority of target poses with sufficiently low tracking error, thereby minimizing the dynamics sim-to-real gap. We use an action chunk size of C = 4 and set the number of integration steps to K = 1. All other training hyperparameters and domain randomization ranges are provided in Tables A.1 and A.2, respectively. Note that in Fig. 5, the final training success rates decrease monotonically as N increases. This behavior is expected, as we use the same batch size and mini-batch size for all runs; consequently, each scene receives fewer samples, scaling linearly with 1/N. We hypothesize that scaling the batch size and mini-batch size proportionally with N could mitigate this effect. However, as shown in the next section, higher training success rates with low N do not necessarily translate to improved OOD performance.

Effect of the number of scenes N. Table 1 reports simulation results as a function of the number of training scenes N. We evaluate both average success rate (SR) and average time to finish (TF) across four different sets of evaluation scenes. Results reported with standard deviation are averaged over three independent runs trained on different subsets H_i. Each scene is evaluated over 100 episodes.

Table 1: Simulation evaluation results. EG: EmbodiedGen. SE: SimplerEnv. Each cell reports SR (%) [↑] / TF (s) [↓].

Policy | EG ID scenes H_i (N) | EG OOD scenes H̄_i (50) | EG all scenes W (100) | SE scenes (3)
N = 0 (π_pre) | − | 9.6 / 10.3 | 9.7 / 10.0 | 23.7 / 9.6
N = 3 (SE) | − | 36.5 / 8.5 | 36.0 / 8.6 | 96.7 / 6.7
N = 1 | 94.3 ± 2.6 / 7.7 ± 0.9 | 53.2 ± 5.5 / 9.0 ± 0.5 | 51.6 ± 5.3 / 9.0 ± 0.3 | 36.1 ± 10.4 / 9.3 ± 1.7
N = 3 | 88.7 ± 4.7 / 7.6 ± 0.5 | 61.3 ± 1.8 / 8.7 ± 0.5 | 60.0 ± 4.6 / 8.8 ± 0.5 | 47.4 ± 2.3 / 9.8 ± 0.6
N = 10 | 81.7 ± 9.5 / 8.3 ± 0.8 | 72.4 ± 0.4 / 8.2 ± 0.2 | 72.1 ± 0.9 / 8.2 ± 0.1 | 54.3 ± 9.2 / 9.2 ± 0.4
N = 25 | 85.2 ± 4.1 / 8.0 ± 0.4 | 77.6 ± 0.8 / 8.1 ± 0.2 | 78.3 ± 0.5 / 8.1 ± 0.1 | 70.1 ± 5.4 / 8.4 ± 0.1
N = 50 | 78.9 ± 12.1 / 8.1 ± 0.8 | 77.9 ± 0.9 / 8.0 ± 0.2 | 79.2 ± 0.3 / 7.9 ± 0.1 | 68.4 ± 5.5 / 8.5 ± 0.6
N = 100 | − | − | 79.8 / 8.0 | 74.3 / 8.4

For policies trained on EmbodiedGen (EG) scenes, increasing N produces a clear trade-off between in-distribution (ID) and out-of-distribution (OOD) performance. As N increases, the success rate on the training scenes H_i decreases, which is expected given the fixed mini-batch size used during training, while success rates on both EG OOD scenes H̄_i and the SimplerEnv (SE) scenes increase steadily. For example, ID success decreases from 94.3% at N = 1 to 78.9% at N = 50, while EG OOD success improves from 53.2% to 77.9%. Similarly, SE success increases from 36.1% to 68.4%. Increasing N also substantially reduces the gap between ID and OOD performance. At N = 1, the gap between EG ID and EG OOD success rates is 41.1 percentage points (94.3% vs. 53.2%). This gap shrinks to 27.4 points at N = 3, 9.3 points at N = 10, and only 1.0 point at N = 50. A similar trend is observed for SE evaluation, where the gap between EG ID and SE success decreases from 58.2 points at N = 1 to 10.5 points at N = 50. These results indicate that increasing scene diversity reduces over-specialization to training environments and improves cross-scene generalization. These trends suggest that scene diversity encourages the policy to learn task-level manipulation strategies rather than scene-specific behaviors. Notably, the N = 100 policy achieves the best performance on SE scenes (74.3%) despite never being trained on those environments. In contrast, the policy trained on the three manually designed SE scenes achieves very high success on those same scenes (96.7%) but exhibits the largest performance drop when evaluated on EG environments (36.5% OOD SR), a 60.2-point gap. This highlights the limited coverage of the SE scenes compared to the greater diversity provided by EG-generated environments. Finally, training on the full scene set yields the strongest overall performance. Compared to the imitation baseline π_pre (N = 0), the N = 100 policy improves success rate on EG scenes from 9.7% to 79.8%, while reducing average completion time from about 10 s to 8 s.

Sim-to-real performance. Table 2 reports sim-to-real evaluation results across 12 scenes and 240 real-world trials. Each experiment consists of 10 trials for both the imitation baseline π_pre and the N = 100 RL fine-tuned policy. We report partial success rate (PSR), defined as correctly grasping and lifting the target object; overall task success rate (SR); dynamics failure rate (DFR); semantics failure rate (SFR); and time to finish (TF). DFR measures failures caused by execution errors such as inaccurate grasp attempts or dropped objects, while SFR measures failures caused by incorrect task interpretation (e.g., interacting with the wrong object). These failure categories are not mutually exclusive, as a single failure may involve both semantic and dynamics-related errors.

Overall performance improves substantially after RL fine-tuning. Partial success increases from 45% for π_pre to 88.3%, indicating a large improvement in reliable object acquisition. Overall task success improves even more, increasing from 21.7% to 75%. In addition to higher success rates, the RL policy produces more efficient behavior, reducing the average completion time from 11.5 s to 10.2 s. Examining the failure breakdown reveals that most errors of the imitation baseline arise from dynamics-related issues. Across all trials, the baseline exhibits a DFR of 66.7%, which decreases to 18.3% after RL fine-tuning, indicating a substantial improvement in manipulation robustness. Semantic failures are comparatively less frequent but also decrease from 18.3% to 6.7%, suggesting improved task grounding and object selection. Performance gains are consistent across nearly all scenes. For example, scene 10 introduces a screwdriver that does not appear in the RL training distribution, yet the RL policy achieves an SR of 50% compared to 0% for the baseline. Similarly, scene 11 evaluates a teacup stacking task involving unseen object instances and task composition, where the RL policy improves SR from 20% to 50% while maintaining perfect partial success (100%). Overall, these results demonstrate that policies fine-tuned in large-scale generative simulation transfer effectively to real-world manipulation. RL fine-tuning improves both low-level execution robustness and high-level task success while maintaining generalization to previously unseen objects, attributes, and task variations. Qualitative examples of the imitation and RL policy rollouts across different scenes are shown in Sec. H.
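For clarity, the reported metrics aggregate per-trial outcomes roughly as follows (the record format is hypothetical; note that DFR and SFR are not mutually exclusive, and TF is averaged over completed trials):

```python
from statistics import mean

# Hypothetical trial records: (grasped, succeeded, dyn_fail, sem_fail, time_s);
# time_s is None when the trial never completes the task.
trials = [(True, True, False, False, 9.8),
          (True, False, True, False, None)]

psr = mean(t[0] for t in trials)  # partial success rate (grasp-and-lift)
sr  = mean(t[1] for t in trials)  # overall task success rate
dfr = mean(t[2] for t in trials)  # dynamics failure rate
sfr = mean(t[3] for t in trials)  # semantics failure rate
tf  = mean(t[4] for t in trials if t[4] is not None)  # time to finish (s)
```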
Table 2: Sim-to-real evaluation results. Attributes in parentheses were not included in the language command. PSR: partial success rate [↑]; SR: success rate [↑]; DFR: dynamics failure rate [↓]; SFR: semantics failure rate [↓]; TF: time to finish (s) [↓].

Scene | Language Command | Scene Distractors
0 | "put the banana on the bowl." | broccoli, strawberry, plate
1 | "put the broccoli on the mug." | knife, cutting board
2 | "put the mushroom on the bowl." | banana, broccoli, plate
3 | "put the spoon on the napkin." | fork, plate
4 | "put the (pink) eraser on the notebook." | black pen, mug
5 | "put the (yellow) brush on the bowl." | banana, yellow marker
6 | "put the white eraser on the mug." | pink eraser, black pen, blue pen
7 | "put the red knife on the cutting board." | regular knife
8 | "put the blue pen on the (blue) bowl." | black pen, white eraser, basket
9 | "put the green marker on the basket." | red marker, yellow marker, dry erase eraser, cardboard box
10 | "put the screwdriver on the basket." | knife, black marker
11 | "put the blue teacup on the yellow teacup." | purple teacup, pink teapot, blue plate, yellow plate

Scene | π_pre: PSR | SR | DFR | SFR | TF (s) | N = 100: PSR | SR | DFR | SFR | TF (s)
0 | 0.8 | 0.6 | 0.4 | 0.0 | 10.7 ± 6.3 | 0.9 | 0.9 | 0.1 | 0.0 | 7.4 ± 1.5
1 | 0.4 | 0.2 | 0.8 | 0.0 | 13.3 ± 2.0 | 1.0 | 0.7 | 0.3 | 0.0 | 10.4 ± 7.5
2 | 0.3 | 0.1 | 0.9 | 0.0 | 11.5 | 0.9 | 0.9 | 0.1 | 0.0 | 8.4 ± 2.5
3 | 0.2 | 0.1 | 0.7 | 0.5 | 6.1 | 0.8 | 0.7 | 0.1 | 0.2 | 12.2 ± 5.1
4 | 0.4 | 0.3 | 0.7 | 0.1 | 11.0 ± 4.9 | 1.0 | 1.0 | − | − | 10.2 ± 4.2
5 | 0.6 | 0.3 | 0.7 | 0.0 | 10.5 ± 1.1 | 1.0 | 1.0 | − | − | 9.5 ± 3.9
6 | 0.4 | 0.2 | 0.8 | 0.0 | 7.8 ± 0.1 | 1.0 | 0.7 | 0.3 | 0.0 | 11.1 ± 5.5
7 | 0.7 | 0.2 | 0.2 | 0.6 | 14.6 ± 0.6 | 0.8 | 0.8 | 0.0 | 0.2 | 12.2 ± 6.5
8 | 0.3 | 0.1 | 0.9 | 0.0 | 29.6 | 0.9 | 0.7 | 0.2 | 0.1 | 12.2 ± 7.4
9 | 0.6 | 0.3 | 0.6 | 0.1 | 9.2 ± 1.8 | 0.7 | 0.6 | 0.4 | 0.0 | 8.8 ± 2.0
10 | 0.2 | 0.0 | 0.8 | 0.3 | − | 0.6 | 0.5 | 0.2 | 0.3 | 10.4 ± 5.5
11 | 0.5 | 0.2 | 0.5 | 0.6 | 11.6 ± 2.0 | 1.0 | 0.5 | 0.5 | 0.0 | 11.0 ± 2.0
All | 0.45 | 0.217 | 0.667 | 0.183 | 11.5 ± 5.5 | 0.883 | 0.75 | 0.183 | 0.067 | 10.2 ± 5.1

5 Conclusion

In this work, we explored the use of 3D world generative models for scaling sim-to-real reinforcement learning of vision-language-action (VLA) policies. By leveraging generative simulation to automatically construct diverse interactive environments, we fine-tuned a pretrained VLA across 100 unique scenes using massively parallel RL while avoiding the cost of manual scene design. Our results show that increasing scene diversity significantly improves zero-shot generalization across multiple OOD evaluation scenes. Furthermore, RL fine-tuning on the generated scenes improves simulation success rates by 70.1 percentage points, while sim-to-real deployment yields a 53.3-percentage-point improvement in real-world task success. Beyond improved performance, our findings highlight the complementary roles of generative simulation and RL. Generative 3D world models provide a scalable source of diverse training environments, while RL enables efficient adaptation of large pretrained policies using only sparse rewards. Our language-driven scene designer further reduces engineering effort by automatically translating task descriptions into structured environments for simulation. Together, these components form a practical pipeline for scaling robot foundation models while minimizing the need for additional real-world data collection or human-in-the-loop training.

Limitations and future work. We note that the experiments in this work focus on pick-and-place tasks, primarily due to current limitations in the types of objects and interactions that EmbodiedGen [42] can reliably generate. Looking forward, we aim to extend EmbodiedGen to support richer manipulation behaviors such as articulated object manipulation, tool use, and multi-stage tasks. Scaling the distribution of automatically generated scenes and tasks will enable RL fine-tuning of VLAs on a broader range of interaction distributions, further improving generalization and robustness.
Nevertheless, we believe this work represents an important first step toward demonstrating the potential of combining generative simulation with RL to scale the fine-tuning of VLA models.

References

[1] O. X.-E. Collaboration. Open X-Embodiment: Robotic learning datasets and RT-X models. https://arxiv.org/abs/2310.08864, 2023.
[2] S. Deng, M. Yan, S. Wei, H. Ma, Y. Yang, J. Chen, Z. Zhang, T. Yang, X. Zhang, W. Zhang, H. Cui, Z. Zhang, and H. Wang. GraspVLA: A grasping foundation model pre-trained on billion-scale synthetic action data, 2025. URL https://arxiv.org/abs/2505.03233.
[3] Y. Chen, S. Tian, S. Liu, Y. Zhou, H. Li, and D. Zhao. ConRFT: A reinforced fine-tuning method for VLA models via consistency policy. In Proceedings of Robotics: Science and Systems (RSS 2025), Los Angeles, CA, USA, Jun 21-25, 2025. doi:10.15607/RSS.2025.XXI.019.
[4] P. Intelligence, A. Amin, R. Aniceto, A. Balakrishna, K. Black, K. Conley, G. Connors, J. Darpinian, K. Dhabalia, J. DiCarlo, D. Driess, M. Equi, A. Esmail, Y. Fang, C. Finn, C. Glossop, T. Godden, I. Goryachev, L. Groom, H. Hancock, K. Hausman, G. Hussein, B. Ichter, S. Jakubczak, R. Jen, T. Jones, B. Katz, L. Ke, C. Kuchi, M. Lamb, D. LeBlanc, S. Levine, A. Li-Bell, Y. Lu, V. Mano, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, C. Sharma, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, W. Stoeckle, A. Swerdlow, J. Tanner, M. Torne, Q. Vuong, A. Walling, H. Wang, B. Williams, S. Yoo, L. Yu, U. Zhilinsky, and Z. Zhou. π*_0.6: A VLA that learns from experience, 2025. URL https://arxiv.org/abs/2511.14759.
[5] H. Li, Y. Zuo, J. Yu, Y. Zhang, Z. Yang, K. Zhang, X. Zhu, Y. Zhang, T. Chen, G. Cui, D. Wang, D. Luo, Y. Fan, Y. Sun, J. Zeng, J. Pang, S. Zhang, Y. Wang, Y. Mu, B. Zhou, and N. Ding. SimpleVLA-RL: Scaling VLA training via reinforcement learning, 2025. URL https://arxiv.org/abs/2509.09674.
[6] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms, 2017. URL https://arxiv.org/abs/1707.06347.
[7] T. Zhang, C. Yu, S. Su, and Y. Wang. ReinFlow: Fine-tuning flow matching policy with online reinforcement learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=ACagRwCCqu.
[8] A. Z. Ren, J. Lidard, L. L. Ankile, A. Simeonov, P. Agrawal, A. Majumdar, B. Burchfiel, H. Dai, and M. Simchowitz. Diffusion policy policy optimization. arXiv preprint arXiv:2409.00588, 2024.
[9] C. Pan, G. Anantharaman, N.-C. Huang, C. Jin, D. Pfrommer, C. Yuan, F. Permenter, G. Qu, N. Boffi, G. Shi, and M. Simchowitz. Much ado about noising: Dispelling the myths of generative robotic control, 2025. URL https://arxiv.org/abs/2512.01809.
[10] Y. Chen, H. Li, and D. Zhao. Boosting continuous control with consistency policy, 2024. URL https://arxiv.org/abs/2310.06343.
[11] A. Prasad, K. Lin, J. Wu, L. Zhou, and J. Bohg. Consistency policy: Accelerated visuomotor policies via consistency distillation, 2024. URL https://arxiv.org/abs/2405.07503.
[12] H. Walke, K. Black, A. Lee, M. J. Kim, M. Du, C. Zheng, T. Zhao, P. Hansen-Estruch, Q. Vuong, A. He, V. Myers, K. Fang, C. Finn, and S. Levine. BridgeData V2: A dataset for robot learning at scale. In Conference on Robot Learning (CoRL), 2023.
[13] T. Chen, Z. Chen, B. Chen, Z. Cai, Y. Liu, Q. Liang, Z. Li, X. Lin, Y. Ge, Z. Gu, et al. RoboTwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088, 2025.
[14] M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. OpenVLA: An open-source vision-language-action model, 2024. URL https://arxiv.org/abs/2406.09246.
[15] D. Qu, H. Song, Q. Chen, Y. Yao, X. Ye, Y. Ding, Z. Wang, J. Gu, B. Zhao, D. Wang, and X. Li. SpatialVLA: Exploring spatial representations for visual-language-action model, 2025. URL https://arxiv.org/abs/2501.15830.
[16] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky. π_0: A vision-language-action flow model for general robot control, 2024. URL https://arxiv.org/abs/2410.24164.
[17] P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, M. Y. Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. Vuong, H. Walke, A. Walling, H. Wang, L. Yu, and U. Zhilinsky. π_0.5: A vision-language-action model with open-world generalization, 2025. URL https://arxiv.org/abs/2504.16054.
[18] M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success, 2025. URL https://arxiv.org/abs/2502.19645.
[19] J. Liu, F. Gao, B. Wei, X. Chen, Q. Liao, Y. Wu, C. Yu, and Y. Wang. What can RL bring to VLA generalization? An empirical study, 2026. URL https://arxiv.org/abs/2505.19789.
[20] Y. Xing, X. Luo, J. Xie, L. Gao, H. Shen, and J. Song. Shortcut learning in generalist robot policies: The role of dataset diversity and fragmentation, 2025. URL https://arxiv.org/abs/2508.06426.
[21] A. Sharma, A. M. Ahmed, R. Ahmad, and C. Finn. Self-improving robots: End-to-end autonomous visuomotor reinforcement learning. In 7th Annual Conference on Robot Learning, 2023. URL https://openreview.net/forum?id=ApxLUk8U-l.
[22] S. K. S. Ghasemipour, A. Wahid, J. Tompson, P. R. Sanketi, and I. Mordatch. Self-improving embodied foundation models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=KXMIIVUB9U.
[23] J. Yang, M. S. Mark, B. Vu, A. Sharma, J. Bohg, and C. Finn. Robot fine-tuning made easy: Pre-training rewards and policies for autonomous real-world reinforcement learning. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 4804-4811, 2024. doi:10.1109/ICRA57147.2024.10610421.
[24] H. R. Walke, J. H. Yang, A. Yu, A. Kumar, J. Orbik, A. Singh, and S. Levine. Don't start from scratch: Leveraging prior data to automate robotic reinforcement learning. In K. Liu, D. Kulic, and J. Ichnowski, editors, Proceedings of The 6th Conference on Robot Learning, volume 205 of Proceedings of Machine Learning Research, pages 1652-1662. PMLR, 14-18 Dec 2023. URL https://proceedings.mlr.press/v205/walke23a.html.
[25] R. Mendonca and D. Pathak. Continuously improving mobile manipulation with autonomous real-world RL. In RSS 2024 Workshop: Data Generation for Robotics, 2024. URL https://openreview.net/forum?id=qASoq07bXh.
[26] Y. Du, K. Konyushkova, M. Denil, A. Raju, J. Landon, F. Hill, N. de Freitas, and S. Cabi. Vision-language models as success detectors, 2023. URL https://arxiv.org/abs/2303.07280.
[27] Y. Fang, Y. Yang, X. Zhu, K. Zheng, G. Bertasius, D. Szafir, and M. Ding. ReBot: Scaling robot learning with real-to-sim-to-real robotic video synthesis, 2025. URL https://arxiv.org/abs/2503.14526.
[28] H. Li, P. Ding, R. Suo, Y. Wang, Z. Ge, D. Zang, K. Yu, M. Sun, H. Zhang, D. Wang, and W. Su. VLA-RFT: Vision-language-action reinforcement fine-tuning with verified rewards in world simulators, 2025. URL https://arxiv.org/abs/2510.00406.
[29] H. Zhang, H. Yu, L. Zhao, A. Choi, Q. Bai, Y. Yang, and W. Xu. Learning multi-stage pick-and-place with a legged mobile manipulator. IEEE Robotics and Automation Letters, 10(11):11419-11426, 2025. doi:10.1109/LRA.2025.3608425.
[30] R. Singh, A. Allshire, A. Handa, N. Ratliff, and K. V. Wyk. Dextrah-RGB: Visuomotor policies to grasp anything with dexterous hands, 2025. URL https://arxiv.org/abs/2412.01791.
[31] P. Dan, K. Kedia, A. Chao, E. Duan, M. A. Pace, W.-C. Ma, and S. Choudhury. X-Sim: Cross-embodiment learning via real-to-sim-to-real. In 9th Annual Conference on Robot Learning, 2025. URL https://openreview.net/forum?id=BO7qo66YJ2.
[32] Z. Chen, Q. Yan, Y. Chen, T. Wu, J. Zhang, Z. Ding, J. Li, Y. Yang, and H. Dong. ClutterDexGrasp: A sim-to-real system for general dexterous grasping in cluttered scenes. In 9th Annual Conference on Robot Learning, 2025. URL https://openreview.net/forum?id=4XKKUifQ9c.
[33] T. He, Z. Wang, H. Xue, Q. Ben, Z. Luo, W. Xiao, Y. Yuan, X. Da, F. Castañeda, S. Sastry, C. Liu, G. Shi, L. Fan, and Y. Zhu. VIRAL: Visual sim-to-real at scale for humanoid loco-manipulation, 2025. URL https://arxiv.org/abs/2511.15200.
[34] T. He, J. Gao, W. Xiao, Y. Zhang, Z. Wang, J. Wang, Z. Luo, G. He, N. Sobanbab, C. Pan, Z. Yi, G. Qu, K. Kitani, J. Hodgins, L. J. Fan, Y. Zhu, C. Liu, and G. Shi. ASAP: Aligning simulation and real-world physics for learning agile humanoid whole-body skills, 2025. URL https://arxiv.org/abs/2502.01143.
[35] M. Wang, S. Tian, A. Swann, O. Shorinwa, J. Wu, and M. Schwager. Phys2Real: Fusing VLM priors with interactive online adaptation for uncertainty-aware sim-to-real manipulation, 2025. URL https://arxiv.org/abs/2510.11689.
[36] H. Sun, H. Wang, C. Ma, S. Zhang, J. Ye, X. Chen, and X. Lan. PRISM: Projection-based reward integration for scene-aware real-to-sim-to-real transfer with few demonstrations, 2025. URL https://arxiv.org/abs/2504.20520.
[37] T. G. W. Lum, O. Y. Lee, K. Liu, and J. Bohg. Crossing the human-robot embodiment gap with sim-to-real RL using one human demonstration. In 9th Annual Conference on Robot Learning, 2025. URL https://openreview.net/forum?id=CgGSFtjplI.
[38] J. Xiang, Z. Lv, S. Xu, Y. Deng, R. Wang, B. Zhang, D. Chen, X. Tong, and J. Yang. Structured 3D latents for scalable and versatile 3D generation, 2025. URL https://arxiv.org/abs/2412.01506.
[39] Z. Xie. WorldGen: Generate any 3D scene in seconds. https://github.com/ZiYang-xie/WorldGen, 2025.
[40] Y. Yang, F.-Y. Sun, L. Weihs, E. VanderBilt, A. Herrasti, W. Han, J. Wu, N. Haber, R. Krishna, L. Liu, C. Callison-Burch, M. Yatskar, A. Kembhavi, and C. Clark. Holodeck: Language guided generation of 3D embodied AI environments, 2024. URL https://arxiv.org/abs/2312.09067.
[41] Y. Wang, Z. Xian, F. Chen, T.-H. Wang, Y. Wang, K. Fragkiadaki, Z. Erickson, D. Held, and C. Gan. RoboGen: Towards unleashing infinite data for automated robot learning via generative simulation, 2024. URL https://arxiv.org/abs/2311.01455.
[42] X. Wang, L. Liu, Y. Cao, R. Wu, W. Qin, D. Wang, W. Sui, and Z. Su. EmbodiedGen: Towards a generative 3D world engine for embodied intelligence, 2025. URL https://arxiv.org/abs/2506.10600.
[43] S. Tao, F. Xiang, A. Shukla, Y. Qin, X. Hinrichsen, X. Yuan, C. Bao, X. Lin, Y. Liu, T. kai Chan, Y. Gao, X. Li, T. Mu, N. Xiao, A. Gurha, V. N. Rajesh, Y. W. Choi, Y.-R. Chen, Z. Huang, R. Calandra, R. Chen, S. Luo, and H. Su. ManiSkill3: GPU parallelized robotics simulation and rendering for generalizable embodied AI. Robotics: Science and Systems, 2025.
[44] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[45] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation, 2018. URL https://arxiv.org/abs/1506.02438.
[46] C. Zheng, S. Liu, M. Li, X.-H. Chen, B. Yu, C. Gao, K. Dang, Y. Liu, R. Men, A. Yang, J. Zhou, and J. Lin. Group sequence policy optimization, 2025. URL https://arxiv.org/abs/2507.18071.
[47] X. Li, K. Hsu, J. Gu, K. Pertsch, O. Mees, H. R. Walke, C. Fu, I. Lunawat, I. Sieh, S. Kirmani, S. Levine, J. Wu, C. Finn, H. Su, Q. Vuong, and T. Xiao. Evaluating real-world robot manipulation policies in simulation, 2024. URL https://arxiv.org/abs/2405.05941.
[48] A. Wagenmaker, M. Nakamoto, Y. Zhang, S. Park, W. Yagoub, A. Nagabandi, A. Gupta, and S. Levine. Steering your diffusion policy with latent space reinforcement learning, 2025. URL https://arxiv.org/abs/2506.15799.
[49] D. McAllister, S. Ge, B. Yi, C. M. Kim, E. Weber, H. Choi, H. Feng, and A. Kanazawa. Flow matching policy gradients, 2025. URL https://arxiv.org/abs/2507.21053.

A Training Details

In this section, we list the training parameters used for all policies in Table A.1. We also summarize several practical observations encountered during training:

1. Several training strategies were evaluated, including:
(a) Freezing the VLM and LoRA fine-tuning the action head.
(b) Freezing the VLM and fully fine-tuning the action head.
(c) Freezing the SigLip vision encoder, LoRA fine-tuning Gemma, and LoRA fine-tuning the action head.
(d) LoRA fine-tuning the VLM and LoRA fine-tuning the action head.
(e) LoRA fine-tuning the VLM and fully fine-tuning the action head.
Overall, we found that freezing the VLM (a, b) significantly lowers the training ceiling and often leads to model collapse during prolonged training, where task success rates initially peak, then saturate, and eventually decline toward zero. As a result, updating the VLM is crucial for stable learning. Fully fine-tuning the action head consistently outperformed the corresponding configuration that LoRA fine-tunes it. Freezing the SigLip vision encoder consistently reduced success rates by a few percent compared to LoRA fine-tuning it. Since LoRA fine-tuning the entire VLM did not noticeably affect sim-to-real performance, we adopt option (e) in this work (a minimal sketch follows this list).

2. Adding the scaling exponent s to the importance ratio (Eq. 6) is crucial for stabilizing training and preventing immediate collapse. Without it, the computed log probabilities can spike to very large values, causing many updates to fall near the clipping threshold and resulting in the majority of gradients being clipped.
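As a rough illustration of option (e), the sketch below applies LoRA to the VLM while leaving the action expert fully trainable, using the HuggingFace PEFT library. The attribute names (`vlm`, `action_expert`) and the `target_modules` choice are assumptions for illustration; only the rank matches Table A.1.

```python
import torch.nn as nn
from peft import LoraConfig, get_peft_model  # HuggingFace PEFT

def prepare_option_e(policy: nn.Module) -> nn.Module:
    """Option (e): LoRA fine-tune the VLM, fully fine-tune the action expert."""
    lora_cfg = LoraConfig(r=32, target_modules=["q_proj", "v_proj"])  # rank 32 (Table A.1)
    policy.vlm = get_peft_model(policy.vlm, lora_cfg)  # adapters on the VLM only
    for p in policy.action_expert.parameters():        # v_theta: all weights trainable
        p.requires_grad_(True)
    return policy
```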
Table A.1: Training parameters.

Parameter | Value
number of environments | 192
batch size | 19200
mini batch size | 1920
episode length | 25
discount factor γ | 0.99
UTD ratio | 1
LoRA rank | 32
control frequency (Hz) | 5
action chunk size C | 4
number of integration steps K | 1
learning rate | 2e−5
gradient global norm clip | 0.5
clipping ratio ε | 0.2
importance ratio scale s | 0.2
σ_φ log min | −2.5
σ_φ log max | −2.0

Domain randomization. Domain randomization ranges can be seen in Table A.2. For lighting, we use the same lighting setup as in SimplerEnv's Bridge setting [47]. For camera position randomization, we first record the intersection point between the camera's principal ray and the table in the default configuration. We then compute a new camera orientation such that the principal ray of the perturbed camera re-aligns with this original intersection point. We found that control-delay randomization was unnecessary due to the use of action chunks (C = 4), which allow the majority of actions to be executed with minimal latency.

Table A.2: Domain randomization ranges.

Parameter | Value
object x-position (m) | [0.2, 0.4]
object y-position (m) | [−0.15, 0.15]
object yaw orientation (rad) | [0, 2π]
ambient light RGB color | [0, 0.6]^3
robot z-height perturb (m) | [0, 0.05]
robot joint pos perturb (rad) | [−0.1, 0.1]^6
camera xyz-position (m) | [−0.05, 0.05]^3
directional light brightness | [0.5, 1.5]

Rewards. We employ a sparse success reward, enabled by the strong pretraining of the VLA during the imitation phase. An episode is considered successful if

$$\text{success} = \text{contact}(A, B) \wedge \neg\,\text{contact}(A, \text{table}) \wedge \neg\,\text{contact}(A, \text{robot}), \quad (8)$$

where A denotes the manipulated object and B denotes the target object onto which A is placed.
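A minimal sketch of the success check in Eq. (8), assuming the simulator reports contacts as unordered body-name pairs (this representation is illustrative; ManiSkill exposes richer contact APIs):

```python
def is_success(contacts: set) -> bool:
    """Eq. (8): A rests on B while touching neither the table nor the robot."""
    touching = lambda x, y: frozenset((x, y)) in contacts
    return (touching("A", "B")
            and not touching("A", "table")
            and not touching("A", "robot"))

reward = float(is_success({frozenset(("A", "B"))}))  # sparse reward: 1.0 on success
```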
B Does Multimodality Matter? Effect of K

In this work, we set K = 1, effectively converting the flow-matching policy π_pre into a single-step Gaussian policy π_θ during RL fine-tuning. While prior work such as Diffusion Steering [48] and Flow Policy Optimization (FPO) [49] demonstrates that flow-matching models can be fine-tuned without collapsing into a Gaussian policy, we found that: (1) PPOFlow led to significantly more stable and repeatable training, and (2) single-step (K = 1) RL fine-tuning exhibited no observable degradation in policy performance compared to maintaining multimodality (K > 1). We validate the latter through an ablation over K ∈ {1, 2, 4} on the entire scene-task set W, where larger K enables greater expressivity by chaining multiple Gaussian transitions, with K = 1 corresponding to a unimodal policy. From the left side of Table B.1, increasing K does not improve performance; in fact, K = 1 achieves the highest success rate. Although the margin is modest (approximately 2-2.5%) and based on a single run, these results suggest that multimodality does not meaningfully benefit RL fine-tuning in this setting. Our findings align with recent work by Pan et al. [9], which argues that multimodality is not the primary driver of diffusion-based policy performance in robot manipulation. We hypothesize that multimodality is most beneficial during imitation learning when the demonstration distribution itself is multimodal [8]. In contrast, under reward-guided optimization, unimodal Gaussian policies appear sufficient for effective policy improvement.

Beyond maintaining competitive performance, reducing K substantially improves computational efficiency. As shown in Table B.1, K = 1 achieves a 1.17× speedup in backward pass time relative to K = 2 and a 1.45× speedup relative to K = 4. Inference gains are even more pronounced. Compared to the original imitation setting of K = 10, K = 1 yields a 2.72× speedup in inference latency under a regular forward pass and a 2.36× speedup when using torch.compile. These improvements stem from eliminating iterative denoising steps, effectively reducing the policy to a single forward pass, similar in spirit to consistency policies [10,11,3]. Overall, these results suggest that multimodality offers limited benefit during RL fine-tuning while incurring significant computational overhead.

Table B.1: Effect of K on RL fine-tuning and inference. Inference latency is reported for a regular forward pass (Reg.) and with torch.compile.

K | RL SR (%) [↑] | RL TF (s) [↓] | Backward Time (s) [↓] | Inference Latency, Reg. (s) [↓] | Inference Latency, torch.compile (s) [↓]
K = 10 | N/A | N/A | N/A | 0.267 | 0.172
K = 4 | 77.29 | 7.76 | 108.80 | 0.153 | 0.107
K = 2 | 77.23 | 8.36 | 87.55 | 0.120 | 0.086
K = 1 | 79.75 | 7.94 | 74.74 | 0.098 | 0.073
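In sketch form, K = 1 inference collapses to one VLM encode plus one Euler step; the method names on `policy` below are assumptions for illustration:

```python
import torch

@torch.no_grad()
def act_single_step(policy, obs):
    """K = 1: a single forward pass, no iterative denoising."""
    kv = policy.encode(obs)                   # VLM forward (dominates latency)
    A0 = torch.randn(1, 4, 7)                 # A^0 ~ N(0, I); chunk length C = 4
    return A0 + policy.velocity(A0, kv, 0.0)  # A^1 = A^0 + v_theta * (1/K), K = 1

# policy = torch.compile(policy)  # optional; compiled latencies are in Table B.1
```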
C Generative Simulation Metrics

Table C.1 reports the profiling statistics of the generative simulation pipeline illustrated in Fig. 2. Across 100 generated environments, the system produces 516 unique object assets, averaging 5.16 interactive objects per scene. Because the asset generation pipeline proceeds from text to a background-removed foreground image and then to a 3D asset, we deploy a GPT-4o-based automated quality assurance (QA) loop at multiple stages. The Semantic Appearance checker serves as an early filter on the intermediate foreground image, verifying whether it matches the target object category and key visual attributes. If this check fails, the system immediately resamples the text-to-image seed and retries, preventing semantically incorrect intermediate outputs from propagating into the downstream image-to-3D stage. The Mesh Geometry checker then evaluates whether the generated 3D mesh is complete and free of major geometric defects. Finally, the Cross-modal Text-to-3D Alignment checker assesses whether the final 3D asset remains semantically consistent with the original text description, thereby capturing semantic drift introduced during 3D generation. Combined with this GPT-4o-based QA-driven rejection-and-retry mechanism, assets require only 1.37 generation attempts on average to satisfy all constraints. Manual inspection shows that 85% of the generated environments are directly usable for end-to-end reinforcement learning without human intervention. The remaining failures are mainly due to scale mismatches or imperfect initial object placements (for representative failure cases, see Fig. C.1), which are generally minor and can be rectified with minimal manual adjustment. Under fully online sequential generation on a single NVIDIA RTX 4090 GPU, the pipeline requires 46.8 ± 5.0 minutes per scene. When reusable interactive assets are drawn from a pre-built asset library, the scene generation time is reduced to approximately 2 minutes per environment.

Table C.1: Profiling statistics of the generative simulation pipeline (Fig. 2). Results are aggregated over 100 generated scenes under sequential execution on a single NVIDIA RTX 4090 GPU. QA pass rates denote stage-wise single-attempt pass rates, and average attempts are measured per accepted object asset.

Category | Metric | Value
Generation Scale | Total generated environments | 100
Generation Scale | Total unique 3D object assets | 516
Generation Scale | Avg. interactive assets per scene | 5.16
Generation Scale | Total background assets | 100
Generation Efficiency | Total time per scene | 46.8 ± 5.0 min
Generation Efficiency | Time per background asset | 25.0 ± 3.2 min
Generation Efficiency | Time per object asset | 3.9 ± 1.6 min
Automated QA Pass Rates | Semantic Appearance | 83.3%
Automated QA Pass Rates | Mesh Geometry | 75.2%
Automated QA Pass Rates | Cross-modal Text-to-3D Alignment | 91.9%
Automated QA Pass Rates | Average attempts per valid asset | 1.37
Manual Inspection | Final environment acceptance rate | 85%

Auto-scaling of objects. Since the WidowX 250S manipulator (Fig. 4) used in this work has a narrow gripper width of at most 74 mm, many generated objects are too large to be grasped. To address this, we automatically scale down oversized objects (based on their mesh bounding boxes) to ensure they are graspable. We also reduce mesh resolution to simplify contact computation.
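A heuristic sketch of such auto-scaling using the trimesh library; the safety margin and the use of the smallest bounding-box extent are illustrative assumptions, not the paper's exact rule:

```python
import trimesh

GRIPPER_MAX_WIDTH_M = 0.074  # WidowX 250S maximum gripper opening (74 mm)

def autoscale_for_grasp(mesh_path: str, margin: float = 0.8) -> trimesh.Trimesh:
    """Uniformly shrink a mesh until its smallest extent fits the gripper."""
    mesh = trimesh.load(mesh_path, force="mesh")  # force a single Trimesh
    smallest_extent = min(mesh.extents)           # axis-aligned bounding-box extents
    limit = margin * GRIPPER_MAX_WIDTH_M
    if smallest_extent > limit:
        mesh.apply_scale(limit / smallest_extent)
    return mesh
```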
Table C.1: Profiling statistics of the generative simulation pipeline (Fig. 2). Results are aggregated over 100 generated scenes under sequential execution on a single NVIDIA RTX 4090 GPU. QA pass rates denote stage-wise single-attempt pass rates, and average attempts are measured per accepted object asset.

| Category | Metric | Value |
|---|---|---|
| Generation Scale | Total generated environments | 100 |
| | Total unique 3D object assets | 516 |
| | Avg. interactive assets per scene | 5.16 |
| | Total background assets | 100 |
| Generation Efficiency | Total time per scene | 46.8 ± 5.0 min |
| | └ Time per background asset | 25.0 ± 3.2 min |
| | └ Time per object asset | 3.9 ± 1.6 min |
| Automated QA Pass Rates | Semantic Appearance | 83.3% |
| | Mesh Geometry | 75.2% |
| | Cross-modal Text-to-3D Alignment | 91.9% |
| | Average attempts per valid asset | 1.37 |
| Manual Inspection | Final environment acceptance rate | 85% |

Auto-scaling of objects. Since the WidowX 250S manipulator (Fig. 4) used in this work has a narrow gripper opening of at most 74 mm, many generated objects are too large to be grasped. To address this, we automatically scale down oversized objects (based on their mesh bounding boxes) to ensure they are graspable. We also reduce mesh resolution to simplify contact computation.
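A minimal sketch of this bounding-box-based rescaling, under a conservative reading of "graspable" (the object's smallest bounding-box extent must fit inside the 74 mm opening). The helper name and the clearance margin are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

MAX_GRASP_WIDTH_M = 0.074  # WidowX 250S maximum gripper opening
CLEARANCE = 0.8            # assumed margin: use ~80% of the opening

def graspable_scale(vertices: np.ndarray) -> float:
    """Return a uniform scale factor so the object's smallest bounding-box
    extent fits the gripper opening (1.0 if it already fits)."""
    extents = vertices.max(axis=0) - vertices.min(axis=0)
    # The gripper only needs to span the object's thinnest dimension, so this
    # sketch conservatively uses the smallest bounding-box extent.
    width = float(np.min(extents))
    target = CLEARANCE * MAX_GRASP_WIDTH_M
    return min(1.0, target / width)

# Example: a 12 cm cube is scaled down to fit the 74 mm opening.
cube = np.array([[0.0, 0.0, 0.0], [0.12, 0.12, 0.12]])
print(graspable_scale(cube))  # ~0.49
```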
Figure C.1: Representative failure cases in the generative simulation pipeline. Top: failures detected by the automated QA modules, including semantic appearance mismatch (e.g., two objects generated for a single-jar prompt), mesh defects (e.g., overlapping, redundant table-leg geometry), and text-to-3D semantic drift (e.g., a remote-control prompt yielding handle-like geometry). Bottom: residual failures identified by manual inspection after all automated checks pass, including incorrect initialization (an object spawned directly on its goal receptacle), scale mismatch (an object too large for the gripper), and unstable object placement (an object at the table edge that rolls out of reach).

D Foundation Model Candidates

In addition to π0, we also considered OpenVLA [14], but observed a significant real-to-sim performance gap.¹ Beyond this gap, the larger size of OpenVLA (7B parameters) compared to π0 (3B) made the latter the more practical choice. We also evaluated SpatialVLA [15]; although it transferred successfully from real to sim, we found its inference latency to be substantially higher than that of π0. Overall, we chose π0 for its small model size (3B), good real-to-sim transfer, and low inference latency. We note that, rather than using the π0 model released by Physical Intelligence, we adopt a model provided by a third-party repository (https://github.com/allenzren/open-pi-zero), as it includes a checkpoint pretrained on BridgeV2 data.

¹ See https://github.com/openvla/openvla/issues/7#issuecomment-2330572696 for the authors' analysis of the OpenVLA real-to-sim gap.

E SimplerEnv Scenes

Figure E.2: SimplerEnv scenes with domain randomization used for training and evaluation.

We use the three manually designed tabletop scenes from SimplerEnv [47] both to train an N = 3 baseline policy and as OOD evaluation scenes for policies trained on EmbodiedGen scenes. To apply the same domain randomization techniques as those in Table A.2, we remove the static PNG background so that camera pose and lighting randomization can properly take effect (Fig. E.2).
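For concreteness, a per-reset randomization draw might look like the sketch below. The perturbation ranges are placeholder assumptions for illustration, not the values from Table A.2, and the dictionary keys are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_randomization() -> dict:
    """Per-episode domain randomization sketch (ranges are assumptions)."""
    return {
        # Small rotation/translation noise around the nominal camera pose.
        "cam_rot_noise_rad": rng.uniform(-0.05, 0.05, size=3),
        "cam_pos_noise_m": rng.uniform(-0.02, 0.02, size=3),
        # Lighting: random intensity and a slight color shift.
        "light_intensity": rng.uniform(0.5, 1.5),
        "light_color": 1.0 + rng.uniform(-0.1, 0.1, size=3),
    }

# Applied once per reset; with the static PNG background removed, these
# perturbations actually change the rendered observation.
params = sample_randomization()
```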
F Simulation Scenes and Per-scene Results

Figure F.3: Success rate and time to finish as a function of N. EG: EmbodiedGen. SE: SimplerEnv. Left: success rate for EG all scenes (100), EG OOD scenes (50), and SE OOD scenes (3). Right: time to finish (s). Both panels sweep the number of training scenes N ∈ {0, 1, 3, 10, 25, 50, 100}.

In this section, we provide a textual description of each scene in W (see Tables F.1 and F.2). We additionally report the individual success rates for each scene in W for the policies listed in Table 1, shown in Figs. F.4, F.5, F.6, and F.7. Finally, Fig. F.3 plots a subset of Table 1, showcasing the monotonic increase in average success rate as N increases.

Table F.1: Generated scenes 0-49.

| Scene | Language Command | Scene Distractors |
|---|---|---|
| 0 | green cube → yellow cube | keyboard |
| 1 | carrot → plate | knife, bowl |
| 2 | ceramic teapot → tray | mug, plate |
| 3 | orange → blue napkin | plate |
| 4 | banana → red napkin | mug, plate |
| 5 | broccoli → white dish | cutting board |
| 6 | tomato → saucepan | – |
| 7 | cucumber → metal colander | knife, cutting board |
| 8 | knife → round container | spoon, plate |
| 9 | orange → blue napkin | spoon, plate |
| 10 | red cup → tray | fork, plate |
| 11 | spoon → black jar | plate |
| 12 | pen → round container | book |
| 13 | pear → pot | mug, plate |
| 14 | apple → plate | book |
| 15 | wooden spoon → white plate | glass cup, napkin |
| 16 | teacup → plate | spoon, napkin |
| 17 | orange → bowl | spoon, plate |
| 18 | marker → round container | coffee mug, notebook |
| 19 | yellow cup → wooden tray | remote control, book |
| 20 | green apple → basket | knife, plate |
| 21 | white ping-pong ball → blue cup | remote control, book |
| 22 | cucumber → basket | knife, cutting board |
| 23 | carrot → plate | knife, cutting board |
| 24 | mushroom → pot | knife, cutting board |
| 25 | ceramic teapot → tray | mug, plate |
| 26 | cucumber → metal colander | knife, cutting board |
| 27 | knife → utensil holder | salt shaker, plate |
| 28 | orange → napkin | knife, plate |
| 29 | lemon → plate | fork, napkin |
| 30 | orange → bowl | knife, cutting board |
| 31 | tennis ball → gray basket | coffee mug, book |
| 32 | marker → round container | coffee mug, notebook |
| 33 | yellow cup → wooden tray | fork, napkin |
| 34 | green eraser → red box | pen, notebook |
| 35 | fork → metal tray | napkin, plate |
| 36 | apple → basket | knife, plate |
| 37 | spoon → black jar | napkin, plate |
| 38 | pen → pen holder | notebook |
| 39 | pear → pot | knife, plate |
| 40 | teacup → plate | spoon |
| 41 | banana → napkin | fork, plate |
| 42 | wooden spoon → white plate | salt shaker, napkin |
| 43 | marker → pen holder | coffee mug, notebook |
| 44 | green apple → basket | salt shaker, plate |
| 45 | apple → basket | fork, plate |
| 46 | red cup → tray | salt shaker, napkin |
| 47 | cucumber → basket | knife, cutting board |
| 48 | green apple → fruit bowl | red apple, knife, plate |
| 49 | purple cup → tray | spoon, red cup, plate |

Table F.2: Generated scenes 50-99.

| Scene | Language Command | Scene Distractors |
|---|---|---|
| 50 | white mouse → mouse pad | black mouse, keyboard |
| 51 | water bottle → tray | mug, plate |
| 52 | apple → fruit plate | glass cup |
| 53 | orange → round plate | fork, square plate |
| 54 | spoon → plate | glass, fork, napkin |
| 55 | cup → tray | spoon, plate, bowl |
| 56 | cup → tray | napkin, plate |
| 57 | banana → plate | fork, orange, napkin |
| 58 | potato → basket | knife, onion, cutting board |
| 59 | tomato → bowl | spoon, potato, plate |
| 60 | black pen → round container | stapler, red pencil, notebook |
| 61 | cucumber → plate | knife, carrot, cutting board |
| 62 | blue cup → tray | spoon, plate |
| 63 | spoon → white plate | fork, glass |
| 64 | apple → fruit bowl | knife, plate |
| 65 | orange → green napkin | salt shaker, plate |
| 66 | banana → plate | fork, glass |
| 67 | lemon → metal bowl | knife, plate |
| 68 | tomato → white plate | fork, napkin |
| 69 | pear → basket | mug, plate |
| 70 | red marker → pen holder | mug, notebook |
| 71 | eraser → notebook | pen |
| 72 | green apple → tray | fork, plate |
| 73 | teacup → saucer | spoon, napkin |
| 74 | knife → wooden cutting board | spoon, plate |
| 75 | banana → basket | mug, plate |
| 76 | lemon → napkin | fork, plate |
| 77 | mushroom → saucepan | spoon, plate |
| 78 | potato → plate | fork, glass |
| 79 | yellow cup → basket | spoon, plate |
| 80 | apple → cutting board | knife, plate |
| 81 | spoon → saucer | cup, plate |
| 82 | spatula → plate | salt shaker, bowl |
| 83 | red block → blue box | lamp, book |
| 84 | banana → bowl | spoon, plate |
| 85 | pear → napkin | spoon, plate |
| 86 | red apple → wicker basket | mug, plate |
| 87 | green lime → white bowl | knife, plate |
| 88 | yellow banana → wooden cutting board | knife, bowl |
| 89 | strawberry → small saucer | fork, napkin |
| 90 | carrot → soup pot | knife, cutting board |
| 91 | soup spoon → empty bowl | plate, napkin |
| 92 | pepper grinder → metal tray | salt shaker, napkin |
| 93 | red marker → white mug | mouse, keyboard |
| 94 | eraser → open notebook | pen |
| 95 | glue stick → plastic bin | stapler, mouse pad |
| 96 | pencil sharpener → green tray | book |
| 97 | lego brick → plastic bucket | remote control, book |
| 98 | chess pawn → chessboard | mug, book |
| 99 | sponge → sink basin | towel |

Figure F.4: Success rates for scenes 0-24.

Figure F.5: Success rates for scenes 25-49.

Figure F.6: Success rates for scenes 50-74.

Figure F.7: Success rates for scenes 75-99.
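Each row in Tables F.1 and F.2 corresponds to a full natural-language instruction of the kind shown in Fig. F.8 (e.g., "Put the carrot on the plate on the table" or "Put the pear in the pot on the table"). As a purely illustrative sketch (the template and the preposition heuristic are assumptions, not the paper's scene-designer code), such commands could be rendered from a (target, receptacle) pair like this:

```python
# Receptacles that contain objects take "in"; surfaces take "on".
# This keyword list is an illustrative assumption.
CONTAINER_WORDS = {"bowl", "pot", "basket", "saucepan", "box", "bin",
                   "bucket", "jar", "colander", "container", "mug", "holder"}

def render_command(target: str, receptacle: str) -> str:
    """Render a Table F.1/F.2 row into a full language command."""
    prep = "in" if any(w in receptacle for w in CONTAINER_WORDS) else "on"
    return f"Put the {target} {prep} the {receptacle} on the table"

print(render_command("carrot", "plate"))  # Put the carrot on the plate on the table
print(render_command("pear", "pot"))      # Put the pear in the pot on the table
```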
Figure F.8: Generated example simulation scenes for RL fine-tuning; each panel is annotated with its language command (e.g., "Move the green cube and put it into the yellow cube", "Put the sponge in the sink basin"). Zoom in for details.

G Sim-to-real Scenes

Figure G.9: All objects used in the sim-to-real experiments.

Figure G.10: Example of 10-trial object randomization (scene 1 in Table 2).

H Real-world Rollout Examples

Figure H.11: IL: semantic failure. RL: success. A trial example of the red knife → cutting board task (scene 7 from Table 2). The imitation learning (IL) policy first grasps the red knife but immediately drops it after entering an out-of-distribution state. It then grasps and lifts the knife once more before dropping it again and eventually timing out. In comparison, the reinforcement learning (RL) policy grasps the red knife and deliberately places it on the cutting board.

Figure H.12: IL: dynamics failure. RL: success. A trial example of the green marker → basket task (scene 9 from Table 2). The IL policy initially misses the grasp on the green marker and then pushes the basket away. It makes two regrasp attempts before finally lifting the marker, but places it next to the basket instead and eventually times out. In comparison, the RL policy correctly places the green marker inside the basket.

Figure H.13: IL: dynamics and semantic failure. RL: success. A trial example of the screwdriver → basket task (scene 10 from Table 2). The IL policy attempts to grasp the screwdriver but misses. It then becomes confused and grasps the knife instead, placing it in the basket and failing due to incorrect task execution. In comparison, the RL policy correctly places the screwdriver in the basket.

Figure H.14: IL: dynamics and semantic failure. RL: success. A trial example of the blue teacup → yellow teacup stacking task (scene 11 from Table 2). The IL policy first misses the grasp on the blue teacup and then moves toward the yellow teacup twice. It subsequently becomes confused and attempts to grasp the purple teacup instead, eventually timing out. In comparison, the RL policy quickly grasps the correct teacup and stacks it onto the yellow one.

Figure H.15: IL: dynamics failure. RL: dynamics failure. A trial example of the white eraser → mug task (scene 6 from Table 2). The IL policy misses the grasp on the white eraser several times before ultimately hovering, leading to a timeout. In contrast, the RL policy successfully grasps and lifts the white eraser, but it slips from the grasp and falls onto the black pen, confusing the policy. The policy then grasps and clears the black pen before returning to the white eraser, but still times out.

Figure H.16: IL: dynamics failure. RL: dynamics failure. A trial example of the broccoli → mug task (scene 1 from Table 2). The IL policy repeatedly misses the grasp on the broccoli and eventually times out. In contrast, the RL policy quickly grasps the broccoli and attempts to place it into the mug, but the broccoli hits the rim and rolls out of reach of the manipulator. A significant portion of failures was caused by objects moving out of reach, preventing regrasp attempts.