← Back to papers

Paper deep dive

From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation

Yibin Liu, Yaxing Lyu, Daqi Gao, Zhixuan Liang, Weiliang Tang, Shilong Mu, Xiaokang Yang, Yao Mu

Year: 2026 | Venue: arXiv preprint | Area: cs.RO | Type: Preprint | Embeddings: 85

Abstract

Accurate process supervision remains a critical challenge for long-horizon robotic manipulation. A primary bottleneck is that current video MLLMs, trained primarily under a Supervised Fine-Tuning (SFT) paradigm, function as passive "Observers" that recognize ongoing events rather than evaluating the current state relative to the final task goal. In this paper, we introduce PRIMO R1 (Process Reasoning Induced Monitoring), a 7B framework that transforms video MLLMs into active "Critics". We leverage outcome-based Reinforcement Learning to incentivize explicit Chain-of-Thought generation for progress estimation. Furthermore, our architecture constructs a structured temporal input by explicitly anchoring the video sequence between initial and current state images. Supported by the proposed PRIMO Dataset and Benchmark, extensive experiments across diverse in-domain environments and out-of-domain real-world humanoid scenarios demonstrate that PRIMO R1 achieves state-of-the-art performance. Quantitatively, our 7B model achieves a 50% reduction in mean absolute error relative to specialized reasoning baselines, demonstrating significant relative accuracy improvements over 72B-scale general MLLMs. Furthermore, PRIMO R1 exhibits strong zero-shot generalization on difficult failure detection tasks. We establish state-of-the-art performance on the RoboFail benchmark with 67.0% accuracy, surpassing closed-source models like OpenAI o1 by 6.0%.

Tags

ai-safety (imported, 100%) · cs.RO (suggested, 92%) · preprint (suggested, 88%)

Links

Open PDF directly →

Intelligence

Status: not_run | Model: - | Prompt: - | Confidence: 0%

Entities (0)

No extracted entities yet.

Relation Signals (0)

No relation signals yet.

Cypher Suggestions (0)

No Cypher suggestions yet.

Full Text

84,372 characters extracted from source content.


From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation

Yibin Liu 1,2⋆, Yaxing Lyu 3⋆, Daqi Gao 1⋆, Zhixuan Liang 4, Weiliang Tang 5, Shilong Mu 6, Xiaokang Yang 1, and Yao Mu 1⋆

1 Shanghai Jiao Tong University  2 Northeastern University  3 Xiamen University Malaysia  4 The University of Hong Kong  5 The Chinese University of Hong Kong  6 Xspark AI

liuyibin@stumail.neu.edu.cn, muyao@sjtu.edu.cn
Keywords: Vision-Language Models · Language Models Reasoning · Embodied AI · Task Progress Estimation

⋆ Equal contribution. This work was done during Yibin Liu's and Yaxing Lyu's internships at ScaleLab in Shanghai Jiao Tong University.
⋆ Corresponding author: muyao@sjtu.edu.cn
arXiv:2603.15600v1 [cs.RO] 16 Mar 2026

1 Introduction

The pursuit of general-purpose robots capable of performing long-horizon manipulation tasks remains a central challenge in embodied AI. A critical bottleneck in acquiring such skills is to derive effective reward signals. While sparse rewards (e.g., binary success/failure) are easy to specify, they are often insufficient for efficient policy learning in complex environments. Conversely, dense rewards, which provide granular feedback on task progress, typically rely on laborious manual engineering or privileged access to ground-truth states unavailable in the real world. Recent advances in Vision-Language Models (VLMs) and Multimodal Large Language Models (MLLMs) [19, 20, 35] have sparked hope for learning universal reward functions directly from visual observations. However, deploying these models as reliable "process supervisors" reveals a fundamental limitation in their current paradigm. Existing video MLLMs, despite their impressive capabilities in captioning and QA, primarily function as passive "Observers".

[Fig. 1: Paradigm comparison: Prior approaches vs. our PRIMO R1. Image-only estimators lack temporal content, video-only estimators lack state anchors, and reference-based estimators require pre-defined demonstrations; our progress-reasoning VLM anchors the video sequence between initial and current state images, is guided by a language instruction, and supports language-guided generalization and zero-shot use.]
They excel at describing what is happening but struggle with the rigorous quantitative reasoning required to judge how well the task is proceeding. Most prior approaches treat progress estimation as a standard regression or classification problem via supervised fine-tuning. Within this paradigm, models are optimized to recognize and describe ongoing events, rather than to measure the actual distance between the current state and the final task goal. Consequently, these "Observers" are brittle: they fail to generalize to unseen objects, cannot explain their predictions, and crucially, often assign high progress scores to failed attempts simply because the visual trajectory resembles a successful motion. This exposes a critical structural deficit: without explicit temporal boundary anchoring and continuous reasoning pathways, models are unable to align continuous visual trajectories with the discrete logical conditions required for task success.

To bridge the gap between passive perception and active evaluation, we argue that a reliable reward model must evolve from an Observer into an active "Critic". We introduce PRIMO R1 (Process Reasoning Induced MOnitoring), a 7B model framework that elicits explicit process reasoning from video MLLMs. Instead of supervising the model with a single scalar label, we leverage Reinforcement Learning (RL) to incentivize the generation of Chain-of-Thought. Furthermore, to address the loss of detail in continuous dynamic feature spaces, our architecture employs a structural prompting strategy by explicitly anchoring the video sequence between initial and current state images. This design provides clear visual boundary conditions that transform the reasoning task from generic temporal perception into a structured state-alignment verification.

By conditioning this reasoning process on diverse natural language task goals, we establish a structural connection between the objective input and the reasoning execution, effectively exploiting the inherent linguistic generalization capabilities of foundational MLLMs. To support this paradigm, we construct the PRIMO Dataset, encompassing both SFT and RL post-training data with CoT annotations, and the PRIMO Benchmark, designed to systematically evaluate out-of-domain generalization across cross-task and cross-environment settings.

Our experiments reveal that optimizing a policy model for continuous progress reasoning intrinsically constructs the temporal context representations required for discrete failure detection. By enforcing rigorous temporal alignment and self-reasoning, PRIMO R1 achieves state-of-the-art performance across multiple domains. Quantitatively, our 7B model attains an average Mean Relative Accuracy (MRA) of 82.90 and a Mean Absolute Error (MAE) of 15.52, effectively outperforming 72B-scale general MLLMs by a margin of +9.10 absolute MRA points. Furthermore, it exhibits robust zero-shot generalization in execution anomaly verification, achieving 67.0% accuracy on the RoboFail benchmark and surpassing parameter-heavy closed-source models including GPT-4o and OpenAI o1.

Our contributions can be summarized as follows:

– We introduce PRIMO R1, a 7B reasoning model that effectively transforms video MLLMs from passive Observers into interpretable Critics. It achieves SOTA performance in task progress estimation and failure detection.
– We present the PRIMO Dataset for task progress detection, covering both SFT and RL post-training data with CoT annotations, alongside PRIMO Bench, which systematically evaluates the out-of-domain generalization capabilities of post-training methods in video-based MLLMs.
– We propose a structured temporal input strategy that explicitly anchors video sequences between initial and current state images. This boundary anchoring facilitates high-precision state alignment, achieving a 50% reduction in the mean absolute error compared to specialized baselines.
– We demonstrate that optimizing for progress reasoning intrinsically enables robust zero-shot generalization for failure detection. This is validated on the RoboFail benchmark, where PRIMO R1 achieves a state-of-the-art 67.0% accuracy, surpassing closed-source models like OpenAI o1 by 6.0%.

2 Related Work

2.1 Multimodal Large Language Models for Video Understanding

Early video MLLMs adapted static architectures via temporal aggregation [15, 21] and mitigated memory bottlenecks through context compression and hierarchical structures [6, 14, 24, 28, 38]. These architectures operate predominantly as passive "Observers," excelling at perceptual QA but lacking quantitative temporal reasoning. The structural transition toward an active "Critic" paradigm necessitates explicit temporal localization, prompting recent designs to integrate evidence searching and timestamp encoding [10, 23, 34]. Our framework completes this transition by enabling progress reasoning for rigorous temporal judgment.

2.2 Vision-Based State Estimation and Reward Modeling

Semantic reward modeling leverages VLMs to encode universal value functions via representation distances [19, 20] or frozen embeddings [29]. For explicit progress estimation, recent mechanisms employ frame-ordering [18], multi-modal integration [35, 37], and synthetic trajectory augmentation [36]. A primary structural limitation of VLAC [35], Robo-Dopamine [30], and PROGRESSLM [37] is their functional dependency on explicit reference demonstrations. Furthermore, framing estimation as direct regression via Supervised Fine-Tuning (SFT) restricts the model's capacity for causal failure analysis. Our method bypasses reference dependency by explicitly eliciting process reasoning chains.
2.3 Reinforcement Learning for Reasoning Elicitation

Inference-time scaling and outcome-based Reinforcement Learning (RL) induce Chain-of-Thought (CoT) behaviors without dense annotations [9]. This "R1 paradigm" has expanded into multimodal domains, enhancing static visual reasoning [27, 33] and dynamic temporal grounding [7, 31]. Parallel architectural optimizations include dynamic frame sampling [8] and current-state image anchoring for planning [5]. We map this capability to robotic process supervision, formulating task completion metrics as outcome rewards to elicit verifiable, self-correcting reasoning paths for task assessment.

3 Method

3.1 Problem Formulation

We formalize the task of robotic process supervision as a multi-modal state estimation problem. The input space consists of four key variables: an initial state image I_init capturing the environment prior to execution, a process video sequence V_seq = (v_1, v_2, ..., v_T) representing the temporal state transitions, a current state image I_curr reflecting the latest observed outcome, and a language instruction I specifying the task goal (the specific structure and content of I are detailed in Figure 10). The objective is to learn a mapping function F that evaluates the visual tuple (I_init, V_seq, I_curr) conditioned on the instruction I as the semantic reference, outputting a scalar progress indicator y ∈ [0, 100], where y = 0 denotes the initial state and y = 100 signifies the successful state. As demonstrated in our ablation study (Table 4), explicitly modeling both boundary states (I_init and I_curr) alongside the temporal transition (V_seq) is a necessary structural prerequisite for accurate progress estimation across varying task horizons.

In the standard paradigm, existing video MLLMs function as passive "Observers", treating progress estimation as a direct regression or classification task.
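As a minimal sketch of the formulation in Sec. 3.1 (the type and helper names here are illustrative, not from the paper, and image inputs are stood in for by file paths):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ProcessSupervisionInput:
    """Input tuple (I_init, V_seq, I_curr, instruction) from Sec. 3.1.
    All names are illustrative; frames are represented by file paths."""
    init_image: str          # I_init: environment prior to execution
    video_frames: List[str]  # V_seq = (v_1, ..., v_T): temporal state transitions
    curr_image: str          # I_curr: latest observed outcome
    instruction: str         # language goal I, used as the semantic reference

def clamp_progress(y: float) -> float:
    """Confine the progress indicator to [0, 100]:
    0 denotes the initial state, 100 the successful state."""
    return max(0.0, min(100.0, y))
```

The mapping F then scores a `ProcessSupervisionInput` and returns a scalar clamped to this range.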
[Fig. 2: Overall framework of PRIMO R1. Evaluated across in-domain simulations (AgiBot, BEHAVIOR, RoboTwin) and OOD real humanoid environments, the model processes a video sequence (V_seq) anchored by initial (I_init) and current (I_curr) states. It generates an explicit Chain-of-Thought to output the final progress estimate; the figure shows an example trace (Planning → Observation → Reasoning → Answer: 85.7%).]

They directly model the distribution of the target y_gt conditioned on the input tuple (I_init, V_seq, I_curr, I) via Supervised Fine-Tuning (SFT). This formulation isolates visual features at a surface level, bypassing the underlying causal structure of the state transitions.

To transform the model into an active "Critic", we reformulate the prediction process from direct scalar regression into a multi-step generative reasoning task. We define a policy π_θ that sequentially generates a latent reasoning chain (Chain-of-Thought) C, followed by the final progress estimate ŷ.
Rather than relying on dense annotations to supervise the intermediate variable C, we optimize π_θ using Reinforcement Learning. The optimization objective maximizes the expected reward R(ŷ, y_gt), which is computed solely based on the accuracy of the final prediction ŷ. This structural dependency incentivizes the policy to self-organize the intermediate reasoning C to accurately align the temporal transition V_seq between the boundary states I_init and I_curr. Crucially, conditioning this generative reasoning process on diverse natural language task goals (I) establishes a direct structural mapping between the semantic objective and the visual execution logic, explicitly exploiting the linguistic generalization capabilities of foundational MLLMs to process varying evaluation criteria. The complete architectural workflow of this framework is illustrated in Fig. 2.

[Fig. 3: Dataset distribution for SFT (left), RL (middle), and PRIMO Bench (right). PRIMO-R1-CoT-116k (SFT): BEHAVIOR 20,001 (17.1%), AgiBot 18,269 (15.6%), RoboTwin 22,419 (19.2%), EgoPlan 15,311 (13.1%), RoboVQA 15,953 (13.7%), ShareRobot 6,408 (5.5%), STAR 7,843 (6.7%), NExT-QA 6,309 (5.4%), Perception Test 4,278 (3.7%). PRIMO-R1-182k (RL): BEHAVIOR 60,000 (32.9%), AgiBot 38,536 (21.1%), RoboTwin 45,436 (24.9%), RoboVQA 26,453 (14.5%), ShareRobot 12,000 (6.6%). PRIMO Bench: In-Domain (same task) BEHAVIOR, AgiBot, RoboTwin (3,000 each, 13.2% each); Out-of-Domain (cross task, environment) BEHAVIOR, AgiBot, RoboTwin (3,000 each, 13.2% each); Real Humanoid 4,791 (21.0%). See Appendix A for details.]
3.2 PRIMO Dataset and Benchmark

To systematically elicit and evaluate the temporal reasoning capabilities of video MLLMs for robotic process supervision, we present the PRIMO Dataset and the accompanying PRIMO Bench.

PRIMO Dataset for Post-Training. The PRIMO Dataset is meticulously constructed to support our two-stage post-training paradigm, covering both Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) data. Unlike standard video QA datasets, our data features fine-grained progress indicators annotated with Chain-of-Thought reasoning paths. The training corpus aggregates multi-source trajectories from a real-world environment (AgiBot) and two high-fidelity simulations (BEHAVIOR-1k and RoboTwin). Additionally, to maintain data diversity during the SFT phase, we incorporate several general video reasoning datasets to augment the training corpus, yielding a comprehensive collection partitioned into a 116k-sample SFT dataset (PRIMO-R1-CoT-116k) and a 182k-sample RL dataset (PRIMO-R1-182k), as illustrated in Figure 3.

PRIMO Bench for Generalization Evaluation. To systematically evaluate the robustness of post-training methods against varying degrees of distribution shift, we introduce PRIMO Bench, which categorizes evaluation into two splits:

– In-Domain (ID) - Same Task: Evaluates the model's estimation accuracy on task categories that were exposed during the training phase within the three seen environments.
– Out-of-Domain (OOD) - Cross-Task & Cross-Environment: Designed to test zero-shot generalization. Cross-Task evaluates the model on entirely unseen tasks within the familiar environments. Cross-Environment introduces a stringent unseen domain transfer challenge, evaluating the model on real-world trajectories collected via teleoperation of a different physical humanoid robot (Leju KUAVO-MY) in unstructured physical environments (e.g., factories and service scenarios).
Comprehensive details regarding raw data collection, semantic annotation synthesis, data processing methodologies, and exact dataset statistics of our dataset and benchmark are provided in Appendix A.

3.3 Process Reasoning RL with Group Relative Policy Optimization

To transform the video MLLM from a passive observer into an active critic capable of self-correction, we employ Group Relative Policy Optimization (GRPO) [9]. Unlike standard Proximal Policy Optimization (PPO) [25], which relies on a computationally expensive value function critic to estimate the baseline, GRPO leverages the group statistics of sampled outputs to estimate the baseline. This is particularly advantageous for video MLLMs, where the memory overhead of maintaining a separate value network alongside the policy model is prohibitive.

Group Sampling and Advantage Estimation. Formally, for a given task tuple (I_init, V_seq, I_curr, I), we sample a group of G outputs {o_1, o_2, ..., o_G} from the policy π_θ_old. Each output o_i consists of a reasoning chain C_i (enclosed in <think> tags) and a final progress estimate ŷ_i. Instead of training a value function V(x), GRPO computes the advantage A_i for each output o_i by normalizing its reward r_i against the group's distribution:

$$A_i = \frac{r_i - \mathrm{mean}(r_1, \dots, r_G)}{\mathrm{std}(r_1, \dots, r_G) + \varepsilon}, \qquad (1)$$

where ε is a small constant for numerical stability. This relative advantage encourages the model to generate reasoning paths that yield higher rewards than the average of its current stochastic explorations, effectively filtering out "hallucinated" progress estimates.

Rule-Based Reward Design. A core challenge in eliciting reasoning is defining an effective reward signal without dense annotation. We define a composite reward function R(o_i, y_gt) = r_fmt + r_acc targeting both structure and precision:

(1) Format Reward (r_fmt).
To explicitly induce a Chain-of-Thought, we enforce a strict structural constraint. The model receives a positive reward (e.g., +1) only if the output strictly follows the pattern <think>reasoning</think> followed by <answer>prediction</answer>. This prevents the policy from collapsing into direct guessing.

(2) Accuracy Reward (r_acc). Since the task progress y is continuous, treating it as a binary outcome is insufficient. To provide dense feedback for numerical reasoning, we adopt a bounded linear decay reward function:

$$r_{\mathrm{acc}} = \max\left(0,\; 1 - \frac{|\hat{y}_i - y_{gt}|}{R_{\max}}\right), \qquad (2)$$

where R_max (e.g., 100.0) represents the maximum error range. This formulation ensures the reward starts at 1.0 for an exact match and linearly decreases to 0.0 as the error approaches R_max, strictly confining the score to the [0, 1] interval.

Optimization Objective. The policy π_θ is updated to maximize the expected advantage while remaining close to the reference policy π_ref to prevent reward hacking or language degeneration. The GRPO objective is formulated as:

$$\mathcal{L}_{\mathrm{GRPO}}(\theta) = -\frac{1}{G}\sum_{i=1}^{G}\Big[\min\big(\rho_i A_i,\; \mathrm{clip}(\rho_i,\, 1-\varepsilon,\, 1+\varepsilon)\, A_i\big) - \beta \cdot D_{\mathrm{KL}}\big(\pi_\theta(o_i \mid x)\,\|\,\pi_{\mathrm{ref}}(o_i \mid x)\big)\Big], \qquad (3)$$

where ρ_i = π_θ(o_i | x) / π_θ_old(o_i | x) is the probability ratio, and β controls the strength of the KL divergence penalty. By optimizing this objective, PRIMO R1 implicitly learns that generating detailed, causal reasoning in C is the most reliable strategy to maximize the accuracy reward in ŷ, thereby emerging as a robust Critic.

4 Experiments

In this section, we systematically evaluate the performance of PRIMO R1. The experiments are structured to assess two primary capabilities: continuous task progress estimation across both in-domain simulations and out-of-domain real-world environments, and zero-shot generalization in discrete execution failure detection.
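Before turning to the results, the rule-based reward of Sec. 3.3 and the group-relative advantage of Eq. (1) can be sketched as follows. This is a minimal sketch under stated assumptions: the exact matching regex, the whitespace tolerance, and the use of population standard deviation are assumptions, since the source specifies only the output pattern and the formulas; function names are illustrative.

```python
import re
import statistics

def format_reward(output: str) -> float:
    """r_fmt: +1 only if the output follows the pattern
    <think>reasoning</think><answer>prediction</answer>.
    Whitespace tolerance between tags is an assumption."""
    pattern = r"^\s*<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    return 1.0 if re.match(pattern, output, flags=re.DOTALL) else 0.0

def accuracy_reward(y_pred: float, y_gt: float, r_max: float = 100.0) -> float:
    """r_acc = max(0, 1 - |y_pred - y_gt| / R_max), Eq. (2):
    1.0 for an exact match, decaying linearly to 0.0 at R_max."""
    return max(0.0, 1.0 - abs(y_pred - y_gt) / r_max)

def group_advantages(rewards, eps: float = 1e-6):
    """Eq. (1): normalize each sampled output's reward against the group
    mean and standard deviation (population std assumed here; eps keeps
    the division finite when all rewards in the group coincide)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

A rollout's total reward is then r_fmt + r_acc, and outputs scoring above their group's mean receive a positive advantage in the GRPO update.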
Furthermore, we conduct ablation studies to isolate the impact of temporal context modalities on estimation accuracy, followed by qualitative case studies analyzing the structural logic of the generated reasoning chains. Throughout our evaluations, we employ Qwen2.5-VL-7B-Instruct as the foundation model for all training phases. Detailed experimental setups, including hardware specifications, training parameters, and inference configurations like sampled frame count and frame resolution, are provided in Appendix G.

4.1 Evaluation Metrics

We evaluate task progress estimation using Mean Relative Accuracy (MRA) and Mean Absolute Error (MAE).

Mean Relative Accuracy (MRA). Given a prediction ŷ, ground-truth progress y, and a set of accuracy thresholds T, Mean Relative Accuracy (MRA) is defined as

$$\mathrm{MRA} = \frac{1}{|T|}\sum_{\tau \in T} \mathbb{I}\left(\frac{|\hat{y} - y|}{|y|} < 1 - \tau\right), \qquad (4)$$

where 𝕀(·) denotes the indicator function.

Mean Absolute Error (MAE). MAE is defined as

$$\mathrm{MAE} = \mathbb{E}\big[\,|\hat{y} - y|\,\big], \qquad (5)$$

and is reported to provide a clear measure of absolute prediction error.

4.2 Main Results: Generalization in Progress Estimation

We present the evaluation of our proposed method against state-of-the-art baselines. The results are analyzed in two parts: a comprehensive performance comparison across all domains (Table 1) and an ablation study focusing on the impact of SFT and RL strategies on generalization (Table 3).

Table 1: Comparison on Progress Estimation. We report the Mean Relative Accuracy (MRA↑, higher is better) and Mean Absolute Error (MAE↓, lower is better) across four distinct environments. The best results are highlighted in bold.
| Model | AgiBot MRA↑ / MAE↓ | Behavior MRA↑ / MAE↓ | RoboTwin MRA↑ / MAE↓ | Real Humanoid MRA↑ / MAE↓ | Average MRA↑ / MAE↓ |
|---|---|---|---|---|---|
| Closed-Source Models | | | | | |
| GPT-5 mini | 74.27 / 24.81 | 79.60 / 20.08 | 80.52 / 18.34 | 67.14 / 32.59 | 75.38 / 23.96 |
| GPT-4o | 81.01 / 18.99 | 79.92 / 20.08 | 81.73 / 18.27 | 74.65 / 25.35 | 79.33 / 20.67 |
| Gemini 2.5 Flash | 73.54 / 26.41 | 78.01 / 21.99 | 81.04 / 18.58 | 67.37 / 32.63 | 74.99 / 24.90 |
| Claude-Haiku-4.5 | 74.40 / 25.59 | 70.93 / 29.07 | 74.13 / 25.87 | 72.68 / 27.32 | 73.04 / 26.96 |
| Open-Source General MLLMs | | | | | |
| Qwen2.5-VL-7B | 77.43 / 22.56 | 69.91 / 30.06 | 67.37 / 32.62 | 56.46 / 34.73 | 67.79 / 29.99 |
| InternVL 3.5 8B | 78.52 / 21.47 | 72.19 / 27.15 | 70.81 / 29.18 | 65.44 / 34.55 | 71.74 / 28.09 |
| Qwen2.5-VL-72B | 78.00 / 22.00 | 79.49 / 20.50 | 75.41 / 24.59 | 62.29 / 28.10 | 73.80 / 23.80 |
| Reasoning & Video MLLMs | | | | | |
| ProgressLM-3B-RL | 42.90 / 30.83 | 46.43 / 28.23 | 31.84 / 36.48 | 32.00 / 34.90 | 38.29 / 32.61 |
| Video R1 7B | 72.42 / 27.58 | 70.63 / 29.37 | 70.86 / 29.14 | 53.57 / 31.87 | 66.87 / 29.49 |
| Robobrain 7B | 72.99 / 25.91 | 72.52 / 26.97 | 70.41 / 28.85 | 55.83 / 28.51 | 67.94 / 27.56 |
| Cosmos-Reasoning 7B | 72.48 / 27.01 | 67.06 / 32.35 | 73.14 / 25.85 | 59.39 / 31.41 | 66.52 / 29.12 |
| PRIMO R1 (Ours) | 87.67 / 12.33 | 87.08 / 12.90 | 84.52 / 15.48 | 72.32 / 21.37 | 82.90 / 15.52 |

Overall Performance. Table 1 reports the performance of all models across the four evaluation environments: AgiBot, Behavior, RoboTwin, and Real Humanoid. We utilize Mean Relative Accuracy (MRA) and Mean Absolute Error (MAE) as defined in Sec. 4.1. As shown, our method, PRIMO R1, consistently outperforms all evaluated open-source baselines, achieving the highest average MRA (82.90) and the lowest average MAE (15.52) across all domains.

When compared to Open-Source General MLLMs, PRIMO demonstrates significant superiority. Notably, despite being built upon a 7B parameter foundation, PRIMO surpasses the massive Qwen2.5-VL-72B model (average MRA of 73.80) by a substantial margin of 9.10 absolute points in MRA. Furthermore, against specialized reasoning & video MLLMs such as Video R1 7B and Robobrain 7B, PRIMO R1 effectively halves the absolute estimation error, dropping the average MAE from approximately 27-29 down to 15.52.
To further investigate the source of this error reduction, Figure 4 provides a fine-grained MAE analysis across five distinct task completion intervals. While baseline models exhibit severe error spikes and hallucinations during the final execution phase (80-100%), PRIMO maintains consistently low error rates across all stages. This demonstrates that explicit process reasoning effectively prevents the model from prematurely hallucinating task completion based on superficial visual similarities at the end of a trajectory. A critical highlight of PRIMO is its robustness in the Sim-to-Real transfer setting. In the unseen, highly unstructured "Real Humanoid" environment, general MLLMs and video models experience a severe performance drop (e.g., Qwen2.5-VL-7B drops to 56.46 MRA). In contrast, PRIMO maintains a strong MRA of 72.32, underscoring the effectiveness of generating an explicit reasoning chain before progress prediction to bridge the distribution gap between simulation and the real world.

Table 2: Failure Detection Capabilities. Accuracy (%) on the RoboFail benchmark [17]. The evaluation measures the capability of models to effectively detect and quantify task execution failures. Benchmark details are provided in Appendix A.

| Closed-Source | RoboFail↑ | Open-Source | RoboFail↑ | Ours | RoboFail↑ |
|---|---|---|---|---|---|
| Gemini 2.0 Flash | 67.0 | Qwen2.5-VL-7B | 57.6 | PRIMO (SFT) | 51.0 |
| GPT-4o | 63.0 | Nemotron-H-56B | 64.0 | PRIMO (RL) | 63.0 |
| OpenAI o1 | 61.0 | Cosmos-Reason1-7B | 60.0 | PRIMO R1 | 67.0 |
| Claude-haiku-4.5 | 59.0 | Cosmos-Reason1-56B | 66.2 | | |

Table 3: Ablation and Generalization Analysis. All reported metrics represent Mean Relative Accuracy (MRA↑). We compare the Base model, SFT-only model, RL-only model, and our final model (SFT + RL). The results are split into In-Domain (ID) tasks and Out-of-Domain (OOD) tasks to highlight generalization capabilities.

| Model | ID Agibot | ID Behavior | ID RoboTwin | Cross-Task Agibot | Cross-Task Behavior | Cross-Task RoboTwin | Cross-Env Real Humanoid | Avg. |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B (Base) | 70.83 | 69.13 | 71.19 | 74.45 | 77.47 | 61.01 | 48.12 | 67.46 |
| Our Model (SFT) | 83.37 | 80.38 | 80.63 | 82.02 | 82.61 | 79.13 | 67.30 | 79.35 |
| Our Model (RL) | 86.05 | 85.82 | 73.27 | 82.95 | 81.39 | 75.43 | 52.12 | 76.72 |
| PRIMO R1 (SFT+RL) | 87.83 | 89.42 | 88.15 | 87.67 | 87.08 | 84.52 | 72.32 | 85.28 |

Impact of RL on Generalization. To validate the effectiveness of our training pipeline, we analyze the performance evolution from the Base model to the SFT stage, and finally to the RL-finetuned stage. Table 3 details the performance on In-Domain (ID) seen tasks versus Out-of-Distribution (OOD) unseen tasks.

The Base Qwen2.5-VL-7B model exhibits weak zero-shot capability for precise progress estimation (average MRA of 67.46). While Supervised Fine-Tuning (SFT) significantly improves overall performance to 79.35, it primarily overfits to the semantic features of the training distribution. This is evident in the performance degradation observed in the Cross-Environment (Real Humanoid) OOD setting, where the SFT model only achieves 67.30. Interestingly, applying RL directly without SFT (RL-only) yields suboptimal results (76.72 average MRA), as the model struggles to autonomously discover the correct output format and structural reasoning paths from scratch. However, with the integration of Group Relative Policy Optimization (GRPO) after the SFT phase, our complete PRIMO (SFT+RL) pipeline creates a powerful synergy.

The RL phase pushes ID performance to near 90% (e.g., 89.42% on Behavior) and, more importantly, drastically enhances generalization capabilities. The self-correction and rigorous causal reasoning learned via RL transfer effectively to OOD settings, boosting Cross-Task performance across all simulated environments and lifting the Cross-Environment accuracy to 72.32%. This confirms that RL process supervision fundamentally shifts the model from a passive pattern-matcher to an active, generalizing critic.
Table 4: Ablation on Input Modalities. We analyze the necessity of temporal context by varying the input information. I_init: initial state image. V_seq: process video clip. I_curr: current state image. Results show that temporal context (V_seq) is crucial for reducing error (MAE).

| I_init | V_seq | I_curr | Agibot MAE↓ / Acc@10↑ | Behavior MAE↓ / Acc@10↑ | Robotwin MAE↓ / Acc@10↑ | Avg MAE↓ / Acc@10↑ |
|---|---|---|---|---|---|---|
|  |  | ✓ | 59.64 / 0.00 | 51.91 / 8.17 | 66.94 / 0.77 | 59.50 / 2.98 |
| ✓ |  |  | 43.97 / 18.52 | 49.59 / 9.73 | 45.93 / 11.45 | 46.50 / 13.23 |
|  | ✓ |  | 27.58 / 31.34 | 34.85 / 18.33 | 47.41 / 8.07 | 36.61 / 19.25 |
|  | ✓ | ✓ | 25.04 / 35.29 | 27.59 / 29.21 | 40.24 / 17.37 | 30.96 / 27.29 |
| ✓ | ✓ |  | 24.94 / 33.98 | 32.55 / 23.01 | 45.37 / 11.97 | 34.29 / 22.99 |
| ✓ | ✓ | ✓ | 29.39 / 27.15 | 22.73 / 31.83 | 42.16 / 27.69 | 31.43 / 28.89 |

4.3 Generalization Enhancement in Failure Detection

For a Vision-Language Model engaged in process supervision, tracking continuous task progress and detecting discrete execution failures represent coupled dimensions of temporal reasoning. The capability to identify physical constraints or execution errors structurally depends on an underlying representation of intended state transitions. To evaluate the zero-shot generalization of this capability, we test our model on the RoboFail benchmark (details in Appendix A), a completely unseen dataset designed to evaluate "action affordance" and "task completion verification" under complex physical constraints.

Table 2 details the quantitative performance across different model architectures. The base Qwen2.5-VL-7B model exhibits a baseline accuracy of 57.6%. Applying Supervised Fine-Tuning (SFT) alone results in a performance regression to 51.0%, indicating that isolating continuous progress estimation during SFT compromises the discrete failure identification capabilities. The integration of Process Supervision RL with GRPO corrects this degradation, elevating the accuracy to 63.0%.
The final PRIMO R1 formulation achieves an accuracy of 67.0%, matching the closed-source Gemini 2.0 Flash and outperforming larger parameter models, including GPT-4o (63.0%), OpenAI o1 (61.0%), and Cosmos-Reason1-56B (66.2%).

Prior benchmark analyses, such as those conducted for Cosmos-Reason1, indicate that standard Reinforcement Learning targeting physical AI yields limited improvements on RoboFail. The core difficulty of the benchmark stems from the prerequisite for highly observant perception and comprehensive temporal context processing, which operate as distinct variables alongside static physical common sense. The performance delta between Cosmos-Reason1-7B (60.0%) and PRIMO R1 (67.0%) establishes a specific functional relationship: optimizing a policy model for continuous progress reasoning explicitly constructs the temporal context representations necessary for failure verification. The capacity for embodied error correction structurally necessitates process reasoning capability as a parallel condition to physical common sense.

Fig. 4: Fine-Grained Error Analysis Across Task Progress Intervals. MAE evaluation across five completion stages (0-20% through 80-100%) in four environments: (a) Real Humanoid, (b) Behavior-1K, (c) RoboTwin, (d) AgiBot. Baselines compared: Qwen2.5-VL-7B, Qwen2.5-VL-72B, RoboBrain-7B, Video-R1-7B, InternVL-8B, Cosmos-Reasoning-7B, ProgressLM-3B-RL, and our SFT-only and RL-only ablations. Compared to baselines, our RL-finetuned model (Ours_SFT_RL) maintains lower error rates, particularly mitigating severe hallucinations in the final execution stage (80-100%).
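The error metrics behind Table 4 and Fig. 4 are straightforward to reproduce. A minimal sketch, assuming progress values on a 0-100 scale and reading Acc@10 as "within 10 percentage points of the ground truth"; the binning helper mirrors the five completion stages of Fig. 4, and all function names are ours:

```python
def mae(preds, targets):
    """Mean absolute error in percentage points (0-100 scale)."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

def acc_at_k(preds, targets, k=10.0):
    """Fraction of predictions within k points of the ground truth."""
    hits = sum(abs(p - t) <= k for p, t in zip(preds, targets))
    return hits / len(preds)

def mae_by_bin(preds, targets, edges=(0, 20, 40, 60, 80, 100)):
    """Per-interval MAE, binned by ground-truth progress, as in a
    per-stage error breakdown; empty bins map to None."""
    out = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        errs = [abs(p - t) for p, t in zip(preds, targets)
                if lo <= t < hi or (hi == edges[-1] and t == hi)]
        out[f"{lo}-{hi}%"] = sum(errs) / len(errs) if errs else None
    return out
```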
4.4 Ablation Study: The Necessity of Temporal Context

To isolate the impact of temporal context and state representations on progress estimation, we conduct an ablation study analyzing three input variables: the initial state image I_init, the process video sequence V_seq, and the current state image I_curr. The quantitative relationships and performance trade-offs are detailed in Table 4.

Relying exclusively on the current state I_curr yields the highest estimation error, marked by an Average MAE of 59.50. This indicates that isolated static snapshots lack the prerequisite causal context for accurate progress evaluation. Substituting the static image with the temporal sequence V_seq reduces the Average MAE to 36.61. However, its performance remains suboptimal across the majority of tasks, demonstrating that pure temporal processing without anchored reference states is structurally insufficient for precise estimation.

Prior research [4,5] identifies the advantages of explicitly incorporating I_curr in video-centric planning tasks. Consistent with this premise, our results show that introducing either I_init or I_curr alongside V_seq enhances estimation capabilities across different task dimensions. Specifically, the dual combination of V_seq and I_curr achieves the lowest MAE on RoboTwin at 40.24, while pairing I_init with V_seq minimizes MAE on AgiBot to 24.94.

Fig. 5: Continuous Progress Estimation. Average predicted progress trajectory over 105 episodes for the “Clear the countertop waste” task (sub-stages: Start → Open → Pick and Drop → Pick and Drop → Pick, HandOver and Drop → Close), comparing temporal state alignment against baselines.

The final architecture of PRIMO R1 integrates all three modalities: I_init, V_seq, and I_curr. The empirical rationale for this configuration relies on the variable relationship observed in long-horizon tasks.
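The triad input can be assembled as a simple structure that anchors a uniformly subsampled clip between the two boundary images. A sketch under our own assumptions; the uniform frame sampling and the dict layout are illustrative, not the paper's exact preprocessing:

```python
def build_triad_input(frames, num_video_frames=8):
    """Anchor the video between explicit boundary states: initial
    image, uniformly sampled process clip, current image."""
    assert len(frames) >= 2
    step = max(1, (len(frames) - 1) // num_video_frames)
    v_seq = frames[::step][:num_video_frames]
    return {
        "i_init": frames[0],    # initial state image
        "v_seq": v_seq,         # process video clip
        "i_curr": frames[-1],   # current state image
    }
```

Dropping `i_init`, `v_seq`, or `i_curr` from such a structure is exactly what the input-modality ablation varies row by row.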
While dual-modality combinations exhibit marginally better MAE on shorter-horizon datasets like AgiBot, explicitly modeling both the initial baseline state and the current state yields a substantial performance gain on the long-horizon Behavior dataset, reducing MAE to 22.73 and increasing Acc@10 to 31.83. This establishes that explicit boundary state modeling is a necessary prerequisite for tracking progress over extended temporal sequences.

The necessity of the triad configuration highlights a representation trade-off within the model architecture. Continuous video features optimized for tracking temporal dynamics dilute the precise spatial details critical for evaluating static images. Progress estimation fundamentally requires clear representations of both the starting conditions and the current observed outcome. The explicit inclusion of I_init and I_curr anchors the temporal transition encoded by V_seq between defined spatial boundaries.

Video: [Fold shorts task]
Question: Task info: Fold shorts. Estimate the proportion of the completed task.
Answer format: Please output a numerical number between 1 and 100 indicating the percentage of task completion.

Qwen2.5-VL-7B: <think>The task involves folding the shorts. The video shows the robotic arms starting to fold the shorts, but the process is not yet complete. The shorts are still laid out flat, and the robotic arms are in the initial stages of folding. The task is not yet 50% complete as the folding process has only begun.</think> <answer>25</answer>

RoboBrain-7B: <think>In the visual input, the pair of shorts is laid out horizontally on the bed, with the fabric appearing flat and smooth. The task involves folding the shorts, which typically involves bringing the sides of the shorts together to form a compact shape. Observing the current state, the shorts are positioned such that the legs are parallel to the bed's surface, and the waistband is at the top. The robotic arms are positioned on either side of the shorts, suggesting they are ready to engage with the fabric. The arms are equipped with grippers, which can be used to fold the shorts by manipulating the fabric. The task's progress can be assessed by observing whether the shorts have been folded or not. Given the current state, the shorts have not been folded yet, as they remain flat and unaltered. The task requires bringing the legs together, which would involve folding the fabric inwards. This action would reduce the shorts' length and width, making them more compact. Therefore, based on the visual analysis and the task requirements, the shorts are currently at the initial stage of the folding process. The proportion of the task completed is estimated to be 0%, as the shorts have not been folded yet. This conclusion is supported by the visual evidence of the shorts remaining flat and unaltered.</think> <answer>0</answer>

Ours: <think>planning: The high-level goal is to fold the shorts. The initial state shows the shorts laid out horizontally on the bed. Successful completion would be the shorts neatly folded, likely in a compact form. The logical sequence of steps for folding shorts would be: 1. Gather the bottom hem of the shorts. 2. Fold the bottom hem upwards. 3. Gather the top hem of the shorts. 4. Fold the top hem downwards. 5. Adjust the folds for neatness. This plan will be used to interpret the actions in the video and assess the progress towards the goal. observation: The video shows the shorts initially laid out on the bed. The robotic arms then move into the frame and begin interacting with the shorts. The arms gather the bottom hem of the shorts and fold it upwards. This action corresponds to the first two steps of the plan. The arms then move to the top hem, but the folding action is not completed yet. The shorts are partially folded, indicating that the task is in progress but not yet complete. reasoning: Let me think. The robotic arms have successfully gathered and folded the bottom hem of the shorts, which covers about half of the task. However, the top hem has not been folded yet, and the shorts are not fully compacted. This suggests that the task is about halfway complete. The arms appear to be in the process of folding the top hem, but it's not clear if they will be successful in completing the task. Reflecting on the observations, it seems that the task is partially completed, but there's still work to be done. The progress so far is significant, but the task is not yet finished.</think> <answer>50.0</answer>

Fig. 6: Qualitative Comparison of Reasoning Processes. This case study illustrates the generated reasoning chains of Qwen2.5-VL-7B, RoboBrain-7B and our PRIMO R1 during the Fold shorts task in the AgiBot environment.

4.5 Case Study

To evaluate continuous tracking capabilities in long-horizon scenarios, Figure 5 plots the predicted progress trajectory for the composite task “Clear the countertop waste”, mapping the average outputs across 105 episodes. The evaluation isolates the relationship between the predicted progress variable and the ground-truth temporal execution sequence. The baseline RoboBrain-7B demonstrates a decoupling from actual physical states; its prediction overshoots during the initial 0-10s phase and subsequently loses sensitivity to further temporal advancement. Qwen2.5-VL-7B tracks the initial sub-stages accurately, but its prediction variable plateaus near 60%-70% during the latter half, failing to map visual convergence to the final task state. Our model maintains a monotonically increasing trajectory that correlates linearly with the ground truth across discrete sub-task transitions.
In the terminal phase, it accurately maps the visual state change of the final action to a progress metric approaching 100%, verifying a stable alignment between long-range temporal sequences and progress estimation.

Figure 6 details the structural decomposition of the explicit reasoning chain generated by the model for a “Fold shorts” video. Baselines struggle with fine-grained state tracking: RoboBrain-7B overlooks ongoing dynamic manipulations (0% progress), while Qwen2.5-VL-7B lacks a structured evaluation metric (25% progress). Conversely, our PRIMO R1 generates an explicit reasoning chain via three modules. The Planning module establishes a reference topology by breaking down the high-level semantic goal into a linear five-step execution plan (Gather bottom hem → Fold upwards → Gather top hem → Fold downwards → Adjust). The Observation module discretizes the continuous visual input, extracting specific dynamic variables and verifying part-level object state changes (e.g., isolating the state of the bottom hem from the top hem). Finally, the Reasoning module executes state alignment by mapping the extracted visual primitives against the planned execution topology. It identifies precise execution boundaries: specifically, it confirms the successful manipulation of the bottom hem while explicitly verifying the incomplete status of the top hem, which acts as a structural constraint for quantitative evaluation. The final numerical prediction (50.0%) is formulated by calculating the ratio of the verified execution steps against the complete reference plan. Since inference latency and real-time performance are critical for robotic manipulation, we also provide an analysis and comparison of reasoning chain lengths and inference times in Appendix C.2.
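The structured rollout format above is easy to post-process. A sketch of pulling the numeric prediction out of the `<answer>` tag, plus a hypothetical helper (`step_ratio_progress` is our name, not the paper's) that mirrors the ratio-of-verified-steps computation:

```python
import re

def parse_answer(rollout):
    """Extract the numeric progress prediction from a rollout shaped
    as <think>...</think> <answer>NUMBER</answer>; None if absent."""
    m = re.search(r"<answer>\s*([0-9]+(?:\.[0-9]+)?)\s*</answer>", rollout)
    return float(m.group(1)) if m else None

# Reference plan for the "Fold shorts" example (step names are ours).
PLAN = ["gather bottom hem", "fold bottom hem upwards",
        "gather top hem", "fold top hem downwards", "adjust folds"]

def step_ratio_progress(verified, plan=PLAN):
    """Progress (0-100) as the share of reference-plan steps whose
    completion the observation stage has verified."""
    done = sum(step in verified for step in plan)
    return 100.0 * done / len(plan)
```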
5 Conclusion

In this work, we introduced PRIMO R1, a 7B framework that transforms video MLLMs into active critics for robotic process supervision via outcome-based reinforcement learning (GRPO). By explicitly anchoring temporal sequences between initial and current state images and incentivizing Chain-of-Thought generation, our approach mitigates spatial detail dilution and enables rigorous temporal reasoning. Furthermore, conditioning this reasoning process on diverse natural language task goals explicitly exploits the language generalization capabilities of foundational LLMs. Experimental results across simulation and real-world humanoid domains demonstrate that PRIMO R1 achieves state-of-the-art performance, empirically establishing that optimizing continuous progress tracking intrinsically constructs the prerequisite representations for zero-shot discrete failure detection. This suggests a pathway toward deriving the reward signals essential for future autonomous policy learning in long-horizon manipulation.

References

1. Azzolini, A., Bai, J., Brandon, H., Cao, J., Chattopadhyay, P., Chen, H., Chu, J., Cui, Y., Diamond, J., Ding, Y., et al.: Cosmos-reason1: From physical common sense to embodied reasoning. arXiv preprint arXiv:2503.15558 (2025)
2. Bu, Q., Cai, J., Chen, L., Cui, X., Ding, Y., Feng, S., Gao, S., He, X., Hu, X., Huang, X., et al.: Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669 (2025)
3. Chen, T., Chen, Z., Chen, B., Cai, Z., Liu, Y., Li, Z., Liang, Q., Lin, X., Ge, Y., Gu, Z., et al.: Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088 (2025)
4. Chen, Y., Ge, Y., Ge, Y., Ding, M., Li, B., Wang, R., Xu, R., Shan, Y., Liu, X.: Egoplan-bench: Benchmarking multimodal large language models for human-level planning.
International Journal of Computer Vision 134(3), 118 (2026)
5. Chen, Y., Ge, Y., Wang, R., Ge, Y., Cheng, J., Shan, Y., Liu, X.: Grpo-care: Consistency-aware reinforcement learning for multimodal reasoning. arXiv preprint arXiv:2506.16141 (2025)
6. Cheng, Z., Leng, S., Zhang, H., Xin, Y., Li, X., Chen, G., Zhu, Y., Zhang, W., Luo, Z., Zhao, D., et al.: Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476 (2024)
7. Feng, K., Gong, K., Li, B., Guo, Z., Wang, Y., Peng, T., Wu, J., Zhang, X., Wang, B., Yue, X.: Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776 (2025)
8. Ge, H., Wang, Y., Chang, K.W., Wu, H., Cai, Y.: Framemind: Frame-interleaved video reasoning via reinforcement learning. arXiv preprint arXiv:2509.24008 (2025)
9. Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al.: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)
10. Huang, B., Wang, X., Chen, H., Song, Z., Zhu, W.: Vtimellm: Empower llm to grasp video moments. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. p. 14271–14280 (2024)
11. Ji, Y., Tan, H., Shi, J., Hao, X., Zhang, Y., Zhang, H., Wang, P., Zhao, M., Mu, Y., An, P., et al.: Robobrain: A unified brain model for robotic manipulation from abstract to concrete. arXiv preprint arXiv:2502.21257 (2025)
12. Li, C., Zhang, R., Wong, J., Gokmen, C., Srivastava, S., Martín-Martín, R., Wang, C., Levine, G., Lingelbach, M., Sun, J., et al.: Behavior-1k: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation. In: Conference on Robot Learning. p. 80–93. PMLR (2023)
13.
Li, Y., Wang, L., Wang, T., Yang, X., Luo, J., Wang, Q., Deng, Y., Wang, W., Sun, X., Li, H., et al.: Star: A first-ever dataset and a large-scale benchmark for scene graph generation in large-size satellite imagery. IEEE Trans. Pattern Anal. Mach. Intell. 47(3), 1832–1849 (2025)
14. Li, Y., Wang, C., Jia, J.: Llama-vid: An image is worth 2 tokens in large language models. In: European Conference on Computer Vision. p. 323–340. Springer (2024)
15. Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., Yuan, L.: Video-llava: Learning united visual representation by alignment before projection. In: Proceedings of the 2024 conference on empirical methods in natural language processing. p. 5971–5984 (2024)
16. Liu, Y., Liang, Z., Chen, Z., Chen, T., Hu, M., Dong, W., Xu, C., Han, Z., Qin, Y., Mu, Y.: Hycodepolicy: Hybrid language controllers for multimodal monitoring and decision in embodied agents. arXiv preprint arXiv:2508.02629 (2025)
17. Liu, Z., Bahety, A., Song, S.: Reflect: Summarizing robot experiences for failure explanation and correction. arXiv preprint arXiv:2306.15724 (2023)
18. Ma, Y.J., Hejna, J., Fu, C., Shah, D., Liang, J., Xu, Z., Kirmani, S., Xu, P., Driess, D., Xiao, T., et al.: Vision language models are in-context value learners. In: The Thirteenth International Conference on Learning Representations (2024)
19. Ma, Y.J., Kumar, V., Zhang, A., Bastani, O., Jayaraman, D.: Liv: Language-image representations and rewards for robotic control. In: International Conference on Machine Learning. p. 23301–23320. PMLR (2023)
20. Ma, Y.J., Sodhani, S., Jayaraman, D., Bastani, O., Kumar, V., Zhang, A.: Vip: Towards universal visual reward and representation via value-implicit pre-training. arXiv preprint arXiv:2210.00030 (2022)
21. Maaz, M., Rasheed, H., Khan, S., Khan, F.: Video-chatgpt: Towards detailed video understanding via large vision and language models.
In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). p. 12585–12602 (2024)
22. Patraucean, V., Smaira, L., Gupta, A., Recasens, A., Markeeva, L., Banarse, D., Koppula, S., Malinowski, M., Yang, Y., Doersch, C., et al.: Perception test: A diagnostic benchmark for multimodal video models. Advances in Neural Information Processing Systems 36, 42748–42761 (2023)
23. Ren, S., Yao, L., Li, S., Sun, X., Hou, L.: Timechat: A time-sensitive multimodal large language model for long video understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. p. 14313–14323 (2024)
24. Ren, W., Yang, H., Min, J., Wei, C., Chen, W.: Vista: Enhancing long-duration and high-resolution video understanding by video spatiotemporal augmentation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. p. 3804–3814 (2025)
25. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
26. Sermanet, P., Ding, T., Zhao, J., Xia, F., Dwibedi, D., Gopalakrishnan, K., Chan, C., Dulac-Arnold, G., Maddineni, S., Joshi, N.J., et al.: Robovqa: Multimodal long-horizon reasoning for robotics. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). p. 645–652. IEEE (2024)
27. Shen, H., Liu, P., Li, J., Fang, C., Ma, Y., Liao, J., Shen, Q., Zhang, Z., Zhao, K., Zhang, Q., et al.: Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv preprint arXiv:2504.07615 (2025)
28. Song, E., Chai, W., Wang, G., Zhang, Y., Zhou, H., Wu, F., Chi, H., Guo, X., Ye, T., Zhang, Y., et al.: Moviechat: From dense token to sparse memory for long video understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. p. 18221–18232 (2024)
29.
Sontakke, S., Zhang, J., Arnold, S., Pertsch, K., Bıyık, E., Sadigh, D., Finn, C., Itti, L.: Roboclip: One demonstration is enough to learn robot policies. Advances in Neural Information Processing Systems 36, 55681–55693 (2023)
30. Tan, H., Chen, S., Xu, Y., Wang, Z., Ji, Y., Chi, C., Lyu, Y., Zhao, Z., Chen, X., Co, P., et al.: Robo-dopamine: General process reward modeling for high-precision robotic manipulation. arXiv preprint arXiv:2512.23703 (2025)
31. Wang, Y., Wang, Z., Xu, B., Du, Y., Lin, K., Xiao, Z., Yue, Z., Ju, J., Zhang, L., Yang, D., et al.: Time-r1: Post-training large vision language model for temporal video grounding. arXiv preprint arXiv:2503.13377 (2025)
32. Xiao, J., Shang, X., Yao, A., Chua, T.S.: Next-qa: Next phase of question-answering to explaining temporal actions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. p. 9777–9786 (2021)
33. Yang, Y., He, X., Pan, H., Jiang, X., Deng, Y., Yang, X., Lu, H., Yin, D., Rao, F., Zhu, M., et al.: R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization. arXiv preprint arXiv:2503.10615 (2025)
34. Yu, S., Cho, J., Yadav, P., Bansal, M.: Self-chained image-language model for video localization and question answering. Advances in Neural Information Processing Systems 36, 76749–76771 (2023)
35. Zhai, S., Zhang, Q., Zhang, T., Huang, F., Zhang, H., Zhou, M., Zhang, S., Liu, L., Lin, S., Pang, J.: A vision-language-action-critic model for robotic real-world reinforcement learning. arXiv preprint arXiv:2509.15937 (2025)
36. Zhang, J., Luo, Y., Anwar, A., Sontakke, S.A., Lim, J.J., Thomason, J., Biyik, E., Zhang, J.: Rewind: Language-guided rewards teach robot policies without new demonstrations. arXiv preprint arXiv:2505.10911 (2025)
37. Zhang, J., Qian, C., Sun, H., Lu, H., Wang, D., Xue, L., Liu, H.: Progresslm: Towards progress reasoning in vision-language models. arXiv preprint arXiv:2601.15224 (2026)
38.
Zhang, P., Zhang, K., Li, B., Zeng, G., Yang, J., Zhang, Y., Wang, Z., Tan, H., Li, C., Liu, Z.: Long context transfer from language to vision. arXiv preprint arXiv:2406.16852 (2024)

Appendix

A Benchmark and Datasets Details

A.1 Dataset Construction Methodology

To build a comprehensive and diverse benchmark for embodied agents, we construct our dataset by aggregating multi-source data from both high-fidelity simulations and real-world humanoid robot manipulation. The dataset composition covers a wide range of complexity, from atomic actions to long-horizon composite tasks.

Simulation Data Collection. The simulation data is derived from two primary high-fidelity sources: BEHAVIOR-1k and RoboTwin.

– BEHAVIOR-1k: We source data from the 2025 BEHAVIOR Challenge [12]. To enrich the semantic annotations, we employ Large Language Models (LLMs) to convert the original BDDL-based annotations into natural language captions, followed by timestamp-based segmentation to derive fine-grained sub-tasks (ranging from 4 to 76 steps).
– RoboTwin: We adopt the code generation methodology proposed in Hycodepolicy [16] for the RoboTwin [3] simulator. By automatically injecting sub-task and timestamp markers into the generated code, we efficiently synthesize and split the data into trajectory segments.

Real-World Data Collection. To capture the complexity of physical environments and bridge the sim-to-real gap, we incorporate real-world data from two distinct platforms, serving different phases of our post-training and evaluation paradigm:

– AgiBot (Training & In-Domain): Serving as the primary real-world component of our training corpus, we utilize the AgiBot dataset [2]. We process the raw real-world teleoperation data by utilizing timestamps to segment task progress and extract sub-task demonstrations.
– Real Humanoid (OOD Evaluation): To construct a stringent Cross-Environment generalization benchmark, we collect a supplementary real-world, multi-task dataset via teleoperation of the Kuavo 4 Pro full-size humanoid robot from LejuRobotics Technology Co., Ltd. This dataset encompasses multi-scenario and multi-type operations targeting robot manipulation, locomotion, and interaction tasks. It is designed to support scalable robot learning in diverse unstructured physical environments, including hotel services, manufacturing factories, fast-moving consumer goods (FMCG) scenarios, and automotive assembly lines.

A.2 Dataset Statistics and Distribution

Table 5 summarizes the statistics of the constructed dataset, detailing the task distribution, video counts, and the scale of processed trajectory segments across different domains. Figure 3 visualizes the data distribution splits across the SFT phase, RL phase, and the PRIMO Bench evaluation sets.

Table 5: Statistics of the constructed dataset. The table details the number of tasks, raw video demonstrations, sub-task complexity, and the final volume of processed data samples for training and evaluation.

| Dataset | Source | # Tasks | Split (Train / Test) | Raw Videos | Sub-tasks (Min-Max) | Processed Samples |
|---|---|---|---|---|---|---|
| AgiBot | AgiBot World [2] | 36 | 30 / 6 | 7,576 | 1 - 16 | 48,276 |
| Behavior | BEHAVIOR-1k [12] | 50 | 40 / 10 | 9,992 | 4 - 76 | 235,441 |
| RoboTwin | RoboTwin [3] | 49 | 35 / 14 | 24,500 | 1 - 9 | 71,708 |
| Real Humanoid | Real World (KUAVO-MY) | 7 | - / 7 | 2,800 | 2 - 5 | 2,800* |
| Total | - | 150 | - | 32,868 | - | 326,453 |

*For Real Humanoid, processed samples represent the validation set count utilized for evaluation.

A.3 Other Benchmark and Datasets Details

RoboFail Benchmark: Curated and annotated by Cosmos-Reason1 [1], this benchmark originates from the RoboFail dataset introduced in REFLECT [17]. It comprises an evaluation split of 100 examples focusing on harder “action affordance” and “task completion verification” scenarios.
The hardness of these samples is dictated by the necessity for highly observant perception or comprehensive temporal context processing, requiring models to identify physical constraints blocking the follow-through for an action and to reason about nuanced questions.

Beyond the primary benchmarks used for progress estimation, we incorporate several multimodal datasets during the training process to enhance the model’s capabilities in task planning, temporal reasoning, and scene understanding. These datasets provide the diverse semantic and structural supervision necessary for the transition from a passive observer to an active critic.

– ShareRobot Dataset [11]: A high-quality heterogeneous dataset featuring multi-dimensional annotations including object affordance and end-effector trajectories. In our framework, we exclusively utilize the task planning data, which includes high-quality heterogeneous labels used to enhance the model’s abstract reasoning and goal decomposition capabilities.
– EgoPlan-Bench [4]: A comprehensive benchmark designed to evaluate the planning abilities of MLLMs in real-world scenarios from an egocentric perspective. It focuses on human-level planning through diverse action plans and intricate visual observations to mirror human perception.
– RoboVQA [26]: A large-scale, diverse dataset containing video-text pairs for robotics-focused visual question answering. It supports the development of models capable of grounded, high-level reasoning across long-horizon tasks and multiple embodiments.
– Perception Test [22]: A diagnostic benchmark that evaluates perception and reasoning skills—such as memory, abstraction, physics, and semantics—using real-world videos densely annotated with multiple-choice and grounded video question-answers.
– STAR [13]: A large-scale dataset for scene graph generation in high-resolution satellite imagery.
It promotes geospatial scenario understanding by requiring long-range contextual reasoning to mine triplets of subjects, relationships, and objects.
– NExT-QA [32]: A video question-answering benchmark designed to advance video understanding beyond shallow descriptions toward explaining temporal actions. It specifically targets causal action reasoning, temporal action reasoning, and common scene comprehension.

Fig. 7: Training Dynamics of GRPO Across Input Modalities. (Left) Average reward curves during GRPO training. The triad modality configuration (I_init + V_seq + I_curr) accelerates convergence and achieves higher optimal rewards compared to the pure temporal baseline. (Right) Standard deviation of rewards. The integration of static boundary states explicitly limits the policy variance, ensuring a more stable optimization trajectory.

B Training Dynamics of GRPO Across Input Modalities

To further validate the necessity of explicit boundary state modeling in reinforcement learning, we analyze the training dynamics of Group Relative Policy Optimization (GRPO) under two distinct input modality conditions: the pure temporal sequence (V_seq) and our proposed triad configuration (I_init + V_seq + I_curr). The evaluation metrics focus on the average reward convergence and the standard deviation of rewards across training steps.

As illustrated in Figure 7, the inclusion of static boundary images fundamentally alters the optimization landscape.

Reward Convergence: Figure 7(a) demonstrates that the triad configuration achieves faster convergence and a higher final reward plateau compared to the V_seq-only baseline.
The explicit inclusion of I init and I curr provides the policy model with structural spatial anchors, mitigating the credit assignment difficulties typically associated with evaluating long-horizon temporal features. Training Stability: Figure 7(b) plots the standard deviation of the rewards within the GRPO sampling groups. The V seq -only model exhibits higher vari- ance, indicating instability in policy updates due to the lack of static alignment 22Y. Liu et al. constraints. In contrast, explicitly tracking the transition from I init to I curr confines the policy search space, significantly reducing the reward variance and yielding a more stable fine-tuning process. Video: Question:Task info:Fold shorts.Estimate the proportion of the completed task.Answer format:Please output a numerical number between 1 and 100 indicating the percentage of task completion. Qwen2.5-VL-7B:<think>The task involves folding the shorts. The video shows the robotic arms starting to fold the shorts, but the process is not yet complete. The shorts are still laid out flat, and the robotic arms are in the initial stages of folding. The task is not yet 50% complete as the folding process has only begun.</think> <answer>25</answer> RoboBrain-7B:<think>In the visual input, the pair of shorts is laid out horizontally on the bed, with the fabric appearing flat and smooth. The task involves folding the shorts, which typically involves bringing the sides of the shorts together to form a compact shape. Observing the current state, the shorts are positioned such that the legs are parallel to the bed's surface, and the waistband is at the top. The robotic arms are positioned on either side of the shorts, suggesting they are ready to engage with the fabric. The arms are equipped with grippers, which can be used to fold the shorts by manipulating the fabric. The task's progress can be assessed by observing whether the shorts have been folded or not. 
Given the current state, the shorts have not been folded yet, as they remain flat and unaltered. The task requires bringing the legs together, which would involve folding the fabric inwards. This action would reduce the shorts' length and width, making them more compact. Therefore, based on the visual analysis and the task requirements, the shorts are currently at the initial stage of the folding process. The proportion of the task completed is estimated to be 0%, as the shorts have not been folded yet. This conclusion is supported by the visual evidence of the shorts remaining flat and unaltered.</think> <answer>0</answer> Ours:<think>planning:The high-level goal is to fold the shorts. The initial state shows the shorts laid out horizontally on the bed. Successful completion would be the shorts neatly folded, likely in a compact form. The logical sequence of steps for folding shorts would be: 1. Gather the bottom hem of the shorts. 2. Fold the bottom hem upwards. 3. Gather the top hem of the shorts. 4. Fold the top hem downwards. 5. Adjust the folds for neatness. This plan will be used to interpret the actions in the video and assess the progress towards the goal. observation:The video shows the shorts initially laid out on the bed. The robotic arms then move into the frame and begin interacting with the shorts. The arms gather the bottom hem of the shorts and fold it upwards. This action corresponds to the first two steps of the plan. The arms then move to the top hem, but the folding action is not completed yet. The shorts are partially folded, indicating that the task is in progress but not yet complete. reasoning:Let me think. The robotic arms have successfully gathered and folded the bottom hem of the shorts, which covers about half of the task. However, the top hem has not been folded yet, and the shorts are not fully compacted. This suggests that the task is about halfway complete. 
The arms appear to be in the process of folding the top hem, but it's not clear if they will be successful in completing the task. Reflecting on the observations, it seems that the task is partially completed, but there's still work to be done. The progress so far is significant, but the task is not yet finished.</think> <answer>50.0</answer>

Fig. 8: Qualitative Comparison of Reasoning Processes. This case study illustrates the generated reasoning chains of Qwen2.5-VL-7B, RoboBrain-7B and our PRIMO R1 during the Fold shorts task in the AgiBot environment.

C Detailed Analysis of Reasoning Processes

C.1 Qualitative Comparison of Reasoning Processes

Figure 8 compares the reasoning processes during the "Fold shorts" task. Baselines struggle with fine-grained state tracking: RoboBrain-7B overlooks ongoing dynamic manipulations (0% progress), while Qwen2.5-VL-7B lacks a structured evaluation metric (25% progress). Conversely, our PRIMO R1 generates an explicit reasoning chain via three modules. The planning module decomposes the semantic goal into a five-step reference topology (Gather bottom hem → Fold upwards → Gather top hem → Fold downwards → Adjust). The observation module discretizes visual inputs, verifying the execution of the first two steps. Finally, the reasoning module performs state alignment by mapping visual primitives against the planned topology. By confirming the successful manipulation of the bottom hem alongside the incomplete top hem, PRIMO R1 calculates the ratio of verified steps to formulate a precise and interpretable prediction of 50.0%.

Figure 9 further demonstrates PRIMO R1's reasoning capability in a physical Real Humanoid environment during a "Sequential Part Sorting" task. Unlike the rigid sequential constraints of the previous example, this task requires assessing iterative cyclic actions.
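The ratio-of-verified-steps calculation can be made concrete with a small sketch. Note that in Figure 8 the model predicts 50.0% rather than a uniform 2/5 = 40%, because its reasoning weights the bottom-hem fold as roughly half the task; the uniform-weight version below is only an illustration, and all names (`progress_from_plan`, `plan`, `observed`) are ours, not identifiers from the paper.

```python
def progress_from_plan(plan_steps, verified):
    """Estimate task progress as the fraction of canonical plan steps
    whose execution has been visually verified, scaled to 0-100.

    Illustrative sketch only: PRIMO R1 performs this alignment inside
    its generated reasoning chain, not via an explicit counter.
    """
    if not plan_steps:
        raise ValueError("plan must contain at least one step")
    done = sum(1 for step in plan_steps if step in verified)
    return round(100.0 * done / len(plan_steps), 1)

# "Fold shorts" plan from Figure 8: five canonical steps, first two verified.
plan = [
    "gather bottom hem", "fold bottom hem upwards",
    "gather top hem", "fold top hem downwards", "adjust folds",
]
observed = {"gather bottom hem", "fold bottom hem upwards"}
print(progress_from_plan(plan, observed))  # 40.0
```

A weighted variant (assigning each step an importance learned or asserted by the reasoning chain) would recover the 50.0% figure; the uniform version is kept here for simplicity.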
The planning module formulates a loop-based execution strategy (Identify → Pick → Locate → Move → Repeat). The observation module continuously monitors this iterative process, successfully recognizing the recurring pick-and-place actions alongside the changing state of the source crate. During the reasoning phase, the model synthesizes these visual cues to evaluate global progress. By logically recognizing that multiple parts have been successfully sorted while others visibly remain in the crate, PRIMO R1 deduces the partial completion state, outputting a grounded 50.0% progress estimation without requiring explicit part counting.

C.2 Quantitative Analysis: Chain Length and Inference Latency

To comprehensively evaluate the feasibility of models for real-time robotic operations, we analyze the inference efficiency (Table 6) in conjunction with the task progress estimation performance (MRA) from our main results. Relying solely on raw latency or token count is insufficient, as an optimal model must strike a balance between reasoning efficiency and predictive accuracy.

By jointly considering MRA and inference costs, a clear performance-efficiency trade-off emerges. Notably, Cosmos-Reasoning 7B demonstrates severe inefficiency; despite generating the most verbose reasoning chains (averaging 1109.75 tokens) and incurring the highest latency (1.30 s), it yields the lowest average MRA (66.52) among the evaluated reasoning MLLMs. This indicates that its prolonged generation fails to translate into effective structural reasoning. Conversely, while RoboBrain 7B is the fastest (0.50 s) with the shortest token lengths (242.3), it sacrifices substantial accuracy, trailing our model by nearly 15% in average MRA.

Video:
Question: Task info: Sequential Part Sorting. How complete is the operation displayed?
Answer format: Please output a numerical number between 1 and 100 indicating the percentage of task completion.
Ours: <think>
planning: The high-level goal of the agent is to sort parts for an automobile. The initial state shows a robotic arm interacting with a blue crate containing various parts. Successful completion would involve the robotic arm accurately sorting these parts into their respective categories or locations. The logical sequence of steps for this task could be: 1. Identify the parts in the crate. 2. Use the robotic arm to pick up a part. 3. Determine the correct sorting location for the part. 4. Move the part to the correct location. 5. Repeat steps 2-4 for all parts in the crate. This plan will be used to interpret the actions in the video and assess the progress of the task.
observation: The video shows the robotic arm interacting with the blue crate. Initially, the arm picks up a part from the crate and moves it to a different location. The arm then places the part on a surface, possibly a table or a different part of the crate. The arm repeats this process, picking up and moving parts from the crate. The relevant objects are the robotic arm, the blue crate, and the parts inside the crate. The arm's movements indicate that it is in the process of sorting the parts. The video shows the arm successfully moving parts from the crate to their intended locations.
reasoning: Let me think. The robotic arm has been actively sorting parts from the crate. It has picked up parts and moved them to different locations, which suggests that the task is progressing as planned. However, the video does not show the completion of the task, as there are still parts remaining in the crate. Considering the progress made so far, the task appears to be about halfway complete. The robotic arm has successfully sorted some parts, but there are still parts left in the crate that need to be sorted. Oh, I see. The task is not yet complete, but significant progress has been made.
The robotic arm is on the right track, and it seems likely that it will continue to sort the remaining parts.</think> <answer>50.0</answer>

Fig. 9: An example of our PRIMO R1's reasoning output in the Real Humanoid environment.

Our PRIMO R1 achieves an optimal balance. With an average inference latency of 0.62 s and a concise reasoning chain of 359.18 tokens, its computational overhead remains strictly competitive with fast baselines like Video R1 7B. However, PRIMO R1 delivers a massive leap in accuracy, achieving an average MRA of 82.90. When evaluating efficiency as the accuracy gained per unit of inference time or token, PRIMO R1 stands out as the most cost-effective solution, proving highly effective and responsive for real-time robotic manipulation tasks.

Table 6: Comparison of Inference Efficiency. We report the inference latency (time in seconds) and reasoning chain length (token count) across four distinct environments. This quantitative analysis evaluates the feasibility and efficiency of models for real-time robotic operations.

Model                 |    AgiBot     |   Behavior    |   RoboTwin    |   Humanoid     |    Average
                      | time  token   | time  token   | time  token   | time  token    | time  token
Video R1 7B           | 0.53  381.3   | 0.58  364.9   | 0.56  375.5   | 0.50  383.74   | 0.54  376.36
RoboBrain 7B          | 0.50  247.32  | 0.47  201.3   | 0.53  262.74  | 0.48  257.84   | 0.50  242.3
Cosmos-Reasoning 7B   | 1.42  1394.40 | 1.30  868.92  | 1.13  795.70  | 1.36  1379.96  | 1.30  1109.75
PRIMO R1 (Ours)       | 0.64  360.16  | 0.61  363.9   | 0.60  350.98  | 0.61  361.66   | 0.62  359.18

Prompt Template: Task Progress Estimation
Task info: task_info
Init Scene: init_scene_text (Optional)
Question: Estimate the completion percentage of the task. (0-100%)
Answer format: Please output a numerical value between 1 and 100 indicating the percentage of task completion.

Fig. 10: The prompt template used for querying the Video MLLM to estimate task progress. Dynamic variables are highlighted in blue. The displayed question "Estimate the completion percentage of the task.
(0-100%)" serves as a representative example. To ensure prompt robustness and improve instruction generalization, we synthesized 100 distinct question variations for querying task progress. The comprehensive list of these variations is provided in Appendix F.

D Inference and RL Training Prompt

In this section, we detail the comprehensive prompt structure used to elicit Chain-of-Thought (CoT) reasoning for Process Reasoning Induced Monitoring (PRIMO R1). This prompt is designed to enforce a rigorous internal thought process across three specific dimensions: planning, observation, and reasoning.

E Inference and RL Training Prompt

The System Prompt defines the foundational persona of PRIMO R1 and establishes the multi-modal input processing protocol. It explicitly instructs the model to synthesize the initial state, the temporal video sequence, and the current state to ensure a grounded understanding of the task progression.

System Prompt for Embodied Procedure Reasoning

A conversation between User and Assistant. The Assistant is an expert AI specializing in embodied procedure and event reasoning based on visual input. You will be provided with three types of visual information: (1) Initial State - an image showing the starting condition, (2) Video - capturing the procedure from Initial State to Current State, (3) Current State - an image showing the ending condition. You must analyze all three inputs together to understand the complete task progression and answer the question. The assistant must strictly follow a specific thought process and output format. The reasoning process is enclosed within <think> </think> tags, and the final answer is within <answer> </answer> tags. The <think> block must contain three ordered subsections: <planning>, <observation>, and <reasoning>. The <answer> block must contain only the final output required by the question type and no other commentary.
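Because Table 9 lists accuracy_reward + format_reward among the RL reward functions, the structural constraints above lend themselves to a simple binary check. The sketch below is our illustration of what such a format reward could look like, not the authors' released reward code; the names `PATTERN` and `format_reward` are assumptions.

```python
import re

# Illustrative format reward: 1.0 if a completion follows the required
# <think>(<planning>/<observation>/<reasoning>)</think><answer> layout,
# else 0.0. The paper's actual reward implementation is not shown here.
PATTERN = re.compile(
    r"<think>\s*"
    r"<planning>.*?</planning>\s*"
    r"<observation>.*?</observation>\s*"
    r"<reasoning>.*?</reasoning>\s*"
    r"</think>\s*"
    r"<answer>.*?</answer>\s*$",
    re.DOTALL,
)

def format_reward(completion: str) -> float:
    """Return 1.0 iff the completion matches the required output template."""
    return 1.0 if PATTERN.search(completion) else 0.0

good = ("<think><planning>fold the shorts in five steps</planning>"
        "<observation>bottom hem folded</observation>"
        "<reasoning>about half done</reasoning></think>"
        "<answer>50.0</answer>")
print(format_reward(good))                     # 1.0
print(format_reward("<answer>50.0</answer>"))  # 0.0
```

Such a binary check is easy to verify automatically, which is the property the text emphasizes: outputs must be parsable so that rewards can be computed without human judgment.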
User Prompt for Embodied Procedure Reasoning

QUESTION: Question
QUESTION TYPE: question_type

Analyze the provided visual data and reason about the ongoing task. Please think about this question as if you were a human pondering deeply. Provide your detailed reasoning between the <think> and </think> tags, following the subsections <planning>, <observation>, and <reasoning>. Then give your final answer between the <answer> and </answer> tags. Below is the required template:

<think>
<planning>
Identify the high-level goal of the agent: what is the initial state? What does successful completion look like? Break down the high-level goal into a logical sequence of canonical steps. This serves as your mental plan for interpreting the task. Use this plan to interpret actions, map observed behaviors to steps, assess progress, detect anomalies, and predict what happens next.
</planning>
<observation>
View the video as a temporal sequence of actions contributing to the procedure. Objectively describe what is occurring in the current moment, noting evidence of progress or state changes. Identify fine-grained actions and explain how they move the task forward. List relevant objects, tools, and environmental context, emphasizing functional states and transformations. Note cues—repetition, transitions, or completion indicators—that situate the action in the procedural script.
</observation>
<reasoning>
Think through the question as a human would; engage in an internal dialogue using expressions such as 'let me think', 'wait', 'hmm', 'oh, I see', 'let's break it down', etc. Connect observations to the procedural plan to determine which step is being executed, progress, correctness, or anomalies. Reflect on assumptions, verify interpretations, and, if appropriate, predict the agent's next likely action.
Synthesize understanding of what the agent is doing, how it fits into the broader task, and whether the process seems successful. You are encouraged to include self-reflection or verification in your reasoning process.
</reasoning>
</think>
<answer>
Final answer here — strictly follow the question_type output format and include no extra commentary.
</answer>

To ensure the model outputs are verifiable and parsable for reinforcement learning rewards, we enforce strict output constraints based on the task category. Table 7 lists the specific instructions injected into the question_type variable.

Table 7: Question Type Instructions (TYPE_TEMPLATE).

Category        | Appended Instruction within <answer> Block
Multiple Choice | Please provide only the single option letter (e.g., A, B, C, D, etc.).
Numerical       | Please provide the numerical value (e.g., 42 or 3.14).
OCR             | Please transcribe text from the image/video clearly.
Free-form       | Please provide your text answer directly.
Boolean         | Please provide only 'Yes' or 'No'.
Progress        | Please output a numerical number between 1 and 100.

F Question Variations for Task Progress Estimation

To enhance the robustness of PRIMO R1 and ensure its generalization across diverse linguistic phrasings, we curated a set of 100 distinct question variations. These prompts range from direct inquiries to context-aware evaluations, preventing the model from over-fitting to a single instruction template. The full list of questions used during training and evaluation is provided below:

1. How much of the task has been completed?
2. What percentage of the task is finished?
3. How complete is the task in the video?
4. Estimate the completion percentage of the task.
5. How far along is the agent in completing the task (in percent)?
6. To what extent has the task been completed?
7. Please estimate how much of the task has been done (0-100%).
8. What fraction of the task appears to be finished?
9. How much progress has been made toward completing the task?
10.
Give the approximate percentage of task completion.
11. Based on the video, what is the task's completion percentage?
12. Considering the ongoing actions, how complete is the task execution?
13. From the current progress shown, estimate how much of the task is done.
14. According to the visual evidence, what is the completion rate of the task?
15. Based on the observed steps, how far has the task progressed?
16. Judging from the video, how much of the overall task has been achieved?
17. Based on the actions shown, estimate the percentage of task completion.
18. Using the video context, determine how much progress has been made.
19. According to the current situation, what percent of the task is completed?
20. What is the estimated completion rate of the task shown in this clip?
21. Task completion percentage?
22. Estimate task progress (0-100%).
23. Completion rate of the task?
24. Task progress percentage based on the video?
25. How much of the task is done (in %)?
26. Approximate percent of task completion?
27. Predicted completion level (0-100)?
28. What's the completion percentage?
29. Estimate progress ratio (0% or 100%)?
30. Task progress estimation in percentage?
31. How complete is the overall procedure in the video?
32. What's the current progress percentage for this task?
33. Evaluate the current completion level of the task.
34. How much has the agent accomplished in this task?
35. Determine the completion percentage of the process.
36. Provide an estimate of how much of the task is done.
37. What's the current progress ratio of the operation?
38. Estimate how complete the ongoing task is.
39. What is the approximate progress achieved so far?
40. Based on the video evidence, how much of the task is finished?
41. According to the observed actions, what percentage is complete?
42. How far has the agent advanced in completing the task?
43. Quantify the level of task completion (0-100%).
44. Provide a numeric estimate of task completion.
45.
Indicate how much of the task is completed.
46. What portion of the task has been done so far?
47. Compute the completion percentage for the current task.
48. Estimate the proportion of the completed task.
49. Evaluate the current progress made toward completion.
50. How progressed is the task shown in this video?
51. Based on this clip, what's the completion percentage?
52. How much progress has the agent made so far?
53. Indicate the task completion rate as a percentage.
54. What's the estimated completion percentage of the shown task?
55. Approximately what percentage of the task is complete?
56. How advanced is the task execution in this clip?
57. What is the current task progress in numeric terms?
58. From the visual information, estimate the completion percent.
59. Provide an approximate completion percentage.
60. How far along toward completion is the task?
61. Based on the actions, how complete is the task process?
62. What is the overall completion rate of this task?
63. Estimate the progress level of the operation (0-100).
64. To what degree is the task completed according to the video?
65. Provide an estimation of the task completion level.
66. How much work has been completed in the task so far?
67. How complete is the process illustrated in the video?
68. What's the approximate task completion ratio?
69. How much of the procedure has been achieved?
70. Provide a numerical estimate of progress toward completion.
71. Based on what's shown, estimate the completion level.
72. How much of the total work has been finished?
73. Provide a completion score between 0 and 100.
74. What is the predicted task completion rate?
75. Please quantify how much progress the agent has made.
76. How much of the defined task has already been accomplished?
77. What's the expected percentage of task completion?
78. From this video, estimate how much the task has progressed.
79.
How much progress can be observed in the task execution?
80. What is the level of completion observed?
81. According to the video, what's the completion score?
82. How complete is the operation displayed?
83. Determine the degree of completion (in percentage).
84. How far toward full completion has the agent progressed?
85. Report the completion rate inferred from the video.
86. Provide a completion estimate between 0 and 100 percent.
87. What is the overall completion percentage observed?
88. How much of the ongoing task is done so far?
89. What is the measured completion proportion?
90. Estimate the current percentage of finished work.
91. Quantify the extent of completion visible in the video.
92. How far along is the process in percentage terms?
93. What percentage of the work has been achieved?
94. Approximate how complete the shown procedure is.
95. Indicate how much of the task remains unfinished.
96. How close to full completion is the task right now?
97. What percentage of the total task goal has been reached?
98. How much of the intended activity has been completed?
99. Give an estimated completion rate (0-100%).
100. Estimate the degree of completion based on the given video.

Table 8: SFT Training Config

Configuration                    | Value
Algorithm
  trainer                        | TRL SFTTrainer
  seed                           | 42
Model
  freeze_vision_tower            | FALSE
  enable_gradient_checkpointing  | TRUE
  attn_implementation            | flash_attention_2
  precision                      | bf16
Batching
  nproc_per_node                 | 8
  per_device_train_batch_size    | 1
  gradient_accumulation_steps    | 8
  global_batch_size              | 1 × 8 × 8 = 64
Optimization
  strategy                       | adamw
  lr                             | 1.0e-6
  weight_decay                   | 0.0
  lr_warmup_ratio                | 0.0
  lr_scheduler_type              | linear
  num_train_epochs               | 1
Rollout / inference
  num_generations                | 8
  max_turns                      | 3
  top_p / temperature            | 0.9 / 0.7

G Experimental Setup and Config

We conduct all training experiments on a compute node equipped with 8 NVIDIA A100 (80GB) GPUs.
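As a quick consistency check on the batching entries in Tables 8 and 9, the derived quantities follow directly from the listed values; the variable names below mirror the config keys but the snippet itself is just arithmetic, not configuration code.

```python
# SFT stage (Table 8): one sample per device, 8 accumulation steps, 8 GPUs.
nproc_per_node = 8
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
global_batch_size = (nproc_per_node
                     * per_device_train_batch_size
                     * gradient_accumulation_steps)
print(global_batch_size)  # 64

# RL stage (Table 9): 4 prompts per step, 8 sampled generations per prompt.
global_prompt_batch = 4
num_generations = 8
rollouts_per_step = global_prompt_batch * num_generations
print(rollouts_per_step)  # 32
```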
During the training phase, to balance computational efficiency and temporal modeling capabilities, we limit the input video sequence to a maximum of 16 frames. The frame resolution is configured to 128 × 28 × 28 pixels. Detailed hyperparameters and specific experimental configurations for the Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) stages are summarized in Table 8 and Table 9, respectively.

To ensure a fair and rigorous comparison, we standardize the input configurations across all evaluated models during the inference stage, including both our proposed method and other open-source baselines. Specifically, we increase the temporal density by raising the video length to 32 frames. Correspondingly, the frame resolution is set to 256 × 28 × 28 pixels to capture finer visual details for precise progress estimation.

Table 9: RL Training Config

Configuration                    | Value
Data constraints
  min_pixels / max_pixels        | 3136 / 401408
  max_prompt_length              | 16384
  max_completion_length          | 4096
Algorithm
  algorithm                      | grpo
  reward functions               | accuracy_reward + format_reward
  kl_coef (beta)                 | 0.04
  temporal / len_control         | false / true
Model
  freeze_vision_tower            | FALSE
  enable_gradient_checkpointing  | TRUE
  attn_implementation            | flash_attention_2
  precision                      | bf16
Batching
  per_device_train_batch_size    | 1
  gradient_accumulation_steps    | 1
  number of GPUs                 | 4
  global prompt batch            | 4
  num_generations (G)            | 8
  rollouts per step              | 32
  max_grad_norm                  | 5
Optimization
  strategy                       | adamw
  lr                             | 1.0e-6
  weight_decay                   | 0.01
  lr_warmup_ratio                | 0.0
  lr_scheduler_type              | cosine
  num_train_epochs               | 1
Rollout / Inference
  n (generation views)           | 8
  generation: do_sample          | TRUE
  generation: max_new_tokens     | 4096
  generation: top_p / temperature| 0.95 / 1.0
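To connect the GRPO settings above with the reward-variance discussion in Section B, the group-relative advantage for each of the G = 8 sampled rollouts can be sketched as follows. This is a minimal illustration assuming the standard mean/std group normalization; the function name and reward values are ours, and the KL term (kl_coef = 0.04) enters the loss separately and is not modeled here.

```python
def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize each rollout's reward by the
    mean and standard deviation of its own sampling group (standard GRPO).
    Illustrative sketch only, assuming scalar rewards per rollout.
    """
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# One sampling group of G = 8 rollouts; each reward is a stand-in for the
# combined accuracy_reward + format_reward signal from Table 9.
group = [1.8, 1.2, 0.9, 1.5, 0.0, 1.1, 1.6, 0.7]
adv = grpo_advantages(group)
print([round(a, 2) for a in adv])
```

When the rewards within a group vary widely (high std), the normalized advantages become noisy; this is exactly the instability Figure 7(b) attributes to the V_seq-only input, and which anchoring on I_init and I_curr reduces.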