Paper deep dive
OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards
Zehao Li, Zhenyu Wu, Yibo Zhao, Bowen Yang, Jingjing Xie, Zhaoyang Liu, Zhoumianze Liu, Kaiming Jin, Jianze Liang, Zonglin Li, Feng Wu, Bowen Zhou, Zun Wang, Zichen Ding
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 99%
Last extracted: 3/22/2026, 6:13:45 AM
Summary
OS-Themis is a scalable multi-agent critic framework designed to improve GUI agent reward modeling. It utilizes a Milestone Verification Module (MVM) to decompose trajectories into verifiable sub-goals and a Verdict Calibration Module (VCM) to audit evidence chains, effectively reducing false positives and improving reward signal reliability. The framework is evaluated on the newly introduced OmniGUIRewardBench (OGRBench), demonstrating significant performance gains in online RL and self-training loops.
Entities (5)
Relation Signals (4)
OS-Themis → evaluated_on → OmniGUIRewardBench
confidence 100% · We introduce OmniGUIRewardBench (OGRBench)... where all evaluated models achieve their best performance under OS-Themis.
OS-Themis → improves → AndroidWorld
confidence 100% · Extensive experiments on AndroidWorld show that OS-Themis yields a 10.3% improvement
OS-Themis → includes → Milestone Verification Module
confidence 100% · The framework primarily consists of two modules: the Milestone Verification Module (MVM) and the Verdict Calibration Module (VCM).
OS-Themis → includes → Verdict Calibration Module
confidence 100% · The framework primarily consists of two modules: the Milestone Verification Module (MVM) and the Verdict Calibration Module (VCM).
Cypher Suggestions (2)
Find all components of the OS-Themis framework · confidence 95% · unvalidated
MATCH (f:Framework {name: 'OS-Themis'})-[:INCLUDES]->(c:Component) RETURN f.name, c.name
Identify benchmarks used to evaluate OS-Themis · confidence 95% · unvalidated
MATCH (f:Framework {name: 'OS-Themis'})-[:EVALUATED_ON]->(b:Benchmark) RETURN f.name, b.name
Abstract
Reinforcement Learning (RL) has the potential to improve the robustness of GUI agents in stochastic environments, yet training is highly sensitive to the quality of the reward function. Existing reward approaches struggle to achieve both scalability and performance. To address this, we propose OS-Themis, a scalable and accurate multi-agent critic framework. Unlike a single judge, OS-Themis decomposes trajectories into verifiable milestones to isolate critical evidence for decision making and employs a review mechanism to strictly audit the evidence chain before making the final verdict. To facilitate evaluation, we further introduce OmniGUIRewardBench (OGRBench), a holistic cross-platform benchmark for GUI outcome rewards, where all evaluated models achieve their best performance under OS-Themis. Extensive experiments on AndroidWorld show that OS-Themis yields a 10.3% improvement when used to support online RL training, and a 6.9% gain when used for trajectory validation and filtering in the self-training loop, highlighting its potential to drive agent evolution.
Tags
Links
- Source: https://arxiv.org/abs/2603.19191v1
- Canonical: https://arxiv.org/abs/2603.19191v1
Full Text
135,685 characters extracted from source content.
OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards

Zehao Li 1,2, Zhenyu Wu 2, Yibo Zhao 2, Bowen Yang 2, Jingjing Xie 3, Zhaoyang Liu 4, Zhoumianze Liu 2, Kaiming Jin 5, Jianze Liang 2, Zonglin Li 2, Feng Wu 1, Bowen Zhou 2, Zun Wang 2, Zichen Ding 2,†

1 University of Science and Technology of China, 2 Shanghai AI Laboratory, 3 CUHK MMLab, 4 The Hong Kong University of Science and Technology, 5 National University of Singapore

Abstract

Reinforcement Learning (RL) has the potential to improve the robustness of GUI agents in stochastic environments, yet training is highly sensitive to the quality of the reward function. Existing reward approaches struggle to achieve both scalability and performance. To address this, we propose OS-Themis, a scalable and accurate multi-agent critic framework. Unlike a single judge, OS-Themis decomposes trajectories into verifiable milestones to isolate critical evidence for decision making and employs a review mechanism to strictly audit the evidence chain before making the final verdict. To facilitate evaluation, we further introduce OmniGUIRewardBench (OGRBench), a holistic cross-platform benchmark for GUI outcome rewards, where all evaluated models achieve their best performance under OS-Themis. Extensive experiments on AndroidWorld show that OS-Themis yields a 10.3% improvement when used to support online RL training, and a 6.9% gain when used for trajectory validation and filtering in the self-training loop, highlighting its potential to drive agent evolution. Our code is available at OS-Copilot/OS-Themis.

1 Introduction

Recent advancements in general-purpose Vision-Language Models (OpenAI, 2025; Anthropic, 2025; Comanici et al., 2025; Bai et al., 2025; Wang et al., 2025c; Guo et al., 2025b) have fueled the rapid development of GUI agents (Wu et al., 2024a; Agashe et al., 2024; Wang et al., 2024; Nguyen et al., 2025; Yang et al., 2026). While native agents (Wu et al., 2024b; Xu et al., 2024; Wang et al., 2025d; Liu et al., 2025; Qin et al., 2025) trained on extensive GUI trajectories demonstrate proficiency in digital navigation, they suffer from poor error recovery in stochastic environments, often failing when deviating from expert trajectories (Qin et al., 2025; Liu et al., 2025; Lu et al., 2025). This deficiency has precipitated a shift toward reinforcement learning (RL) to enable adaptive correction (Xu et al., 2025; Lai et al., 2025b; Wang et al., 2025a); critically, the success of this paradigm hinges on reliable reward signals to guide policy optimization.

[Figure 1: Limitations of existing approaches for reward modeling in GUI environments.]

As illustrated in Figure 1, existing methods for reward acquisition in GUI scenarios generally fall into three categories: 1) Rule-based rewards rely on manual heuristics (Ye et al., 2025; Lai et al., 2025b), offering high precision but suffering from limited scalability and susceptibility to reward hacking.
2) Training-based critics learn verifiers from human feedback (Qi et al., 2024; Xu et al., 2025; Wu et al., 2025b), yet they demand expensive data construction and generalize poorly to out-of-distribution (OOD) environments. 3) LLM-as-a-judge offers a flexible and scalable paradigm by harnessing the generalized reasoning capabilities and extensive world knowledge of foundation models (Lee et al., 2024; Li et al., 2024; Zhou et al., 2025). Despite its potential for generalizable evaluation, existing methods exhibit critical flaws. First, effective trajectory utilization remains a bottleneck. Sparse sampling (e.g., last-K) (Yang et al., 2025; Qi et al., 2024; Lai et al., 2025a) suffers from contextual loss, while global aggregation (Bai et al., 2024; Wang et al., 2025b) is compromised by low signal-to-noise ratios in long-horizon tasks. Consequently, both extremes fail to distill decision-critical evidence. Second, converting the critical information in trajectories into precise rewards proves challenging. The prevailing one-shot paradigm suffers from evidence dilution, where accumulated trivial successes mask sparse, outcome-determining failures. This yields overly optimistic judgments that feed wrong-sign signals into online RL, misleading policy updates.

To address these challenges, we propose OS-Themis, a scalable critic framework that shifts from the monolithic single-agent paradigm to a collaborative workflow. Specifically: 1) To resolve the trajectory utilization bottleneck, we design a Milestone Verification Module that employs a Selector Agent to decompose long-horizon trajectories into discrete key milestones, assigning an explicit and observable sub-goal to each. Subsequently, a Verifier Agent assesses these milestones sequentially. This granular verification effectively isolates salient signals from irrelevant noise, ensuring precise evidence collection. 2) To counteract evidence dilution, we design the Verdict Calibration Module, consisting of a Reviewer Agent and a Judge Agent. The Reviewer Agent acts as an auditor, meticulously examining the evidence chain prior to final judgment to uncover outcome-determining critical failures masked by trivial successes, thereby correcting overly optimistic assessments and preventing wrong-sign signals. Based on this structured and audited evidence, the Judge Agent renders robust reward signals to enable scalable policy learning.

We conduct extensive experiments to demonstrate the effectiveness of OS-Themis. Specifically, we introduce OmniGUIRewardBench (OGRBench) to mitigate the scarcity of cross-platform outcome reward model (ORM) benchmarks, where empirical results confirm the state-of-the-art results of OS-Themis. In online RL experiments conducted in interactive Android environments, deploying OS-Themis yields performance gains of up to 7.1% over baselines on the AndroidWorld benchmark, significantly surpassing competing frameworks. Meanwhile, we conduct a pilot study on scaling RL to validate the effectiveness of OS-Themis as a reward function, achieving a 10.3% improvement on AndroidWorld. Furthermore, leveraging OS-Themis for trajectory filtering in SFT results in a 6.9% improvement on the Qwen3-VL series, validating the quality of filtered data and highlighting the framework's potential for autonomous self-evolution.
Our contributions are summarized as follows:

• We propose OS-Themis, a scalable critic framework that delivers reliable reward signals, thereby enabling efficient and robust online RL for GUI agents.
• We introduce a Milestone Verification and a Verdict Calibration Module to extract critical evidence from GUI trajectories, converting contexts into precise, robust rewards.
• We construct OGRBench, the first holistic cross-platform ORM benchmark spanning Mobile, Web, and Desktop environments, to comprehensively evaluate GUI agents' critic capabilities.
• Extensive experiments demonstrate that OS-Themis significantly boosts online RL performance and facilitates autonomous agent self-evolution in realistic environments.

2 Related Works

GUI Agents. Empowered by general-purpose Vision-Language Models (VLMs) (Bai et al., 2025; Wang et al., 2025c; OpenAI, 2025; Anthropic, 2025; Comanici et al., 2025), visually-grounded GUI agents have emerged as autonomous copilots for digital tasks (Cheng et al., 2024; Sun et al., 2024; Zhang et al., 2025; Xie et al., 2025; OpenAI, 2025; Anthropic, 2024). A dominant paradigm involves native agents (Wu et al., 2024b; Qin et al., 2025; Liu et al., 2025; Xu et al., 2024; Wang et al., 2025d) that integrate planning and grounding in a single-agent architecture, directly translating pixel-level observations into executable actions. However, despite mastering routine workflows via large-scale training, these agents often exhibit brittleness in stochastic environments, struggling to recover from deviations or generalize to unseen scenarios (Ye et al., 2025; Liu et al., 2025; Lai et al., 2025b). These limitations have precipitated a shift toward reinforcement learning (RL) in interactive environments (Xu et al., 2025; Lai et al., 2025b; Wang et al., 2025a), where policy optimization hinges on the reliability of reward signals.

[Figure 2: Overview of the OS-Themis framework. The framework primarily consists of two modules: the Milestone Verification Module (MVM) and the Verdict Calibration Module (VCM). For a GUI trajectory containing several <screenshot, think, action> steps, the Selector Agent in the MVM extracts key steps as milestones, which are then assigned binary scores by the Verifier Agent. Subsequently, the Reviewer Agent in the VCM continuously interacts with the MVM to ensure the rationality and completeness of the milestones, while the Judge Agent conducts the final scoring of the trajectory based on all information exchanged between the modules.]

Reward Modeling for GUI Agents. Reliable reward mechanisms are foundational for GUI agent RL. Existing strategies fall into three paradigms: 1) Rule-based methods use heuristic triggers or environment states (Ye et al., 2025; Lai et al., 2025b; Wang et al., 2025a). While providing faithful rewards via verifiable scripts, they scale poorly and are prone to reward hacking.
2) Training-based critics learn verifiers from human feedback or expert trajectories (Qi et al., 2024; Xu et al., 2025; Wu et al., 2025b; Sun et al., 2025b; Wang et al., 2025a; Sun et al., 2025a). Although specialized models like GUI-Critic-R1 (Wanyan et al., 2025) and UI-Genie (Xiao et al., 2025) offer step-wise signals, they struggle with domain shifts and high data collection costs, hindering cross-platform generalization. 3) LLM-as-a-judge harnesses off-the-shelf VLMs for scalable, zero-shot evaluation (Lee et al., 2024; Li et al., 2024; Sun et al., 2024; Murty et al., 2024; Yang et al., 2025; Wu et al., 2025a; Zhou et al., 2025). Within this paradigm, ZeroGUI (Yang et al., 2025) and DigiRL (Bai et al., 2024) represent two classic approaches: sequential evaluation until success versus selecting a fixed number of terminal states. However, deploying them as Outcome Reward Models (ORMs) faces critical hurdles: sparse sampling loses critical context (Yang et al., 2025; Qi et al., 2024), whereas global ingestion suffers from low signal-to-noise ratios (Bai et al., 2024; Wang et al., 2025b). Crucially, bridging this gap requires mobilizing the reasoning capabilities of VLMs to mine fine-grained evidence for precise reward estimation.

GUI Reward Benchmarks. Despite the proliferation of benchmarks assessing GUI agent execution (Xie et al., 2024; Rawles et al., 2024; Bonatti et al., 2024; Wang et al., 2025e), datasets evaluating criticism capabilities are scarce. Existing initiatives often suffer from limited scope or data contamination. For instance, GUI-Critic-Test (Wanyan et al., 2025) incurs leakage risks due to its derivation from open-source repositories. Similarly, recent works like OS-Critic Bench (Wu et al., 2025b), AgentRewardBench (Men et al., 2025), and CUARewardBench (OpenAI, 2025) are restricted to isolated domains (e.g., Web or Desktop) or prioritize step-level supervision. Consequently, there is an urgent need for a comprehensive ORM evaluation benchmark to establish a reliable evaluation standard for cross-platform reward modeling.

3 OS-Themis

3.1 Overview

In this section, we introduce OS-Themis, a multi-agent critic framework for generalist GUI rewards. As shown in Figure 2, the framework primarily consists of two modules: the Milestone Verification Module (MVM) and the Verdict Calibration Module (VCM). Given a trajectory τ, the MVM first generates an initial milestone set M_0, which is then iteratively refined through interaction with the VCM to produce the final milestone set M. Subsequently, the VCM assigns a binary reward score r ∈ {0, 1} to the trajectory based on all interaction information between the two modules and the complete refinement process from M_0 to M.

3.2 Milestone Verification Module

Evaluating long-horizon GUI trajectories with a single one-shot judge is prone to errors. Such an approach requires the evaluator to simultaneously identify critical decision points across dozens of steps and reason over the complete interaction history, often leading to overlooked evidence or inconsistent judgments. Most GUI tasks can be decomposed into a sequence of verifiable subgoals, which we term milestones. A milestone represents a necessary intermediate state that must be achieved for task completion, such as "reaching the camera preview" and "capturing the photo" in a photo-taking task.
Crucially, verifying whether each milestone is achieved can be done locally by examining the relevant state transition, rather than processing the entire trajectory at once. Building on this insight, we introduce the Milestone Verification Module (MVM), which decomposes trajectories into milestones and verifies them in a structured, step-by-step manner. The module comprises two collaborative agents:

Selector Agent. Given a task instruction I and a trajectory τ = {(s_t, a_t, m_t)}_{t=1}^{T}, where s_t denotes the state (screenshot), a_t the action, and m_t the agent-generated metadata at step t, the Selector identifies a compact set of milestone candidates M_0 = {(t_i, d_i, r_i)}_{i=1}^{k}. Each milestone tuple consists of: (i) the step index t_i where verification should occur, (ii) a description d_i of the expected state change or progress, and (iii) a rationale r_i explaining the necessity of this milestone for task completion. The Selector's output provides a structured decomposition that covers all critical sub-goals required to judge whether the task has been successfully completed.

Verifier Agent. For each proposed milestone (t_i, d_i, r_i) ∈ M_0, the Verifier assesses whether the intended progress described in d_i has been achieved at step t_i. It takes as input the pre-action state s^pre_{t_i}, the post-action state s^post_{t_i}, the executed action a_{t_i}, and the metadata m_{t_i}. The Verifier outputs a binary verification result v_i ∈ {0, 1} indicating milestone achievement, along with grounded feedback f_i when v_i = 0. This feedback references specific visual evidence from the screenshots (e.g., "expected confirmation dialog not present," "action targeted incorrect UI element") and provides actionable diagnostic information. The complete verification output can be denoted as V_0 = {(v_i, f_i)}_{i=1}^{k}.

The MVM's milestone-based decomposition offers two key advantages: (1) verification becomes more accurate by focusing on local state transitions rather than holistic trajectory assessment, and (2) failed verifications produce state-grounded feedback that enables systematic refinement of the milestone set, as described in the following section.
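To make the notation concrete, the milestone tuples (t_i, d_i, r_i) and verification outputs (v_i, f_i) map naturally onto simple data structures. The sketch below is illustrative only: the Step/Milestone/Verification types and the selector/verifier callables are our own stand-ins for the paper's VLM-backed agents, not an API from the released code.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Step:
    screenshot: str  # s_t: path or encoding of the screen state at step t
    action: str      # a_t: the executed action
    metadata: str    # m_t: agent-generated reasoning ("think") at step t

@dataclass
class Milestone:
    step_index: int   # t_i: step at which verification should occur
    description: str  # d_i: expected state change or progress
    rationale: str    # r_i: why this milestone is necessary for the task

@dataclass
class Verification:
    achieved: bool           # v_i in {0, 1}
    feedback: Optional[str]  # f_i: grounded diagnostic feedback when v_i = 0

Selector = Callable[[str, list[Step]], list[Milestone]]
Verifier = Callable[[str, Step, Step, Milestone], Verification]

def run_mvm(instruction: str, trajectory: list[Step],
            selector: Selector, verifier: Verifier
            ) -> tuple[list[Milestone], list[Verification]]:
    """One MVM pass: decompose the trajectory into milestones, then verify
    each one locally against pre-/post-action states instead of the whole
    trajectory."""
    milestones = selector(instruction, trajectory)
    results = []
    for ms in milestones:
        t = ms.step_index
        pre = trajectory[t]                                 # state before a_t
        post = trajectory[min(t + 1, len(trajectory) - 1)]  # state after a_t
        results.append(verifier(instruction, pre, post, ms))
    return milestones, results
```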
3.3 Verdict Calibration Module

While the MVM provides a structured approach to trajectory evaluation, the milestone set M_0 generated in a single pass may be incomplete or insufficiently rigorous. The Selector may overlook critical verification points, propose overly lenient success criteria, or fail to capture subtle failure modes, which can in turn lead to false positives. In reinforcement learning, such false positives are particularly detrimental because they reinforce incorrect behaviors. As we show in Appendix A, when recall is already sufficiently high, trading a modest decrease in recall for improved precision yields better gradient quality in policy-gradient-based RL. To address this, we introduce the Verdict Calibration Module (VCM), which iteratively refines the milestone set through critical review and produces the final reward based on the complete deliberation process. The VCM comprises two agents:

Reviewer Agent. Given the initial milestone set M_0 and verification results V_0 from the MVM, the Reviewer performs a critical audit to identify whether M_0 is complete (covers all necessary sub-goals) and appropriate (each milestone has clear, verifiable success criteria). Specifically, the Reviewer operates under strict evidence-grounding: any issue it raises must be supported by observable signals in the trajectory. Common concerns include: (i) missing critical milestones that leave requirements unverified (e.g., no check for final state persistence), (ii) overly lenient criteria that accept intermediate progress without confirming end-state correctness, (iii) uncaptured failure modes such as goal-conflicting actions, and (iv) weak verification evidence that relies on action descriptions rather than visual confirmation. When issues are identified, the Reviewer generates structured feedback F = {(i_j, q_j)}_{j=1}^{n}, where i_j describes the concern and q_j provides a targeted query. This feedback prompts the MVM to refine M_0 → M_1 and re-verify the relevant steps. This iterative process continues until the Reviewer confirms the decomposition is sound, yielding the final M and V.

Judge Agent. Once refinement converges or reaches a maximum iteration limit, the Judge synthesizes all information to produce the binary reward r ∈ {0, 1}. Crucially, the Judge does not simply aggregate milestone verification results. Instead, it considers the complete deliberation history: (i) the milestone evolution M_0, ..., M, (ii) the verification outcomes V, (iii) the review feedback and its resolution F, and (iv) the original instruction I and trajectory τ. This enables informed decision-making that goes beyond surface-level verification. For instance, if multiple revision rounds were needed or verification results remain borderline, the Judge weighs these signals toward a more conservative decision. Formally:

r = J(I, τ, {M_0, ..., M}, V, F).    (1)

By leveraging the entire refinement process rather than only the final milestone outcomes, this approach substantially reduces false positives while maintaining high recall.
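Continuing the sketch above, the VCM's review-refine-judge loop might look as follows. The iteration cap, the reviewer/judge callables, and the way reviewer feedback is folded back into the Selector are our assumptions: the paper specifies a maximum iteration limit but leaves the refinement mechanics to the agents' prompts.

```python
def run_os_themis(instruction, trajectory, selector, verifier,
                  reviewer, judge, max_rounds: int = 3) -> int:
    """Full critic pass: MVM decomposition plus VCM calibration, returning a
    binary reward r in {0, 1} as in Eq. (1)."""
    history = []  # deliberation history: (milestones, verifications, feedback)
    milestones, results = run_mvm(instruction, trajectory, selector, verifier)
    for _ in range(max_rounds):
        # The Reviewer audits the evidence chain; empty feedback means the
        # decomposition is judged sound.
        feedback = reviewer(instruction, milestones, results)
        history.append((milestones, results, feedback))
        if not feedback:
            break
        # Refine M_k -> M_{k+1} and re-verify; in a real system the Selector
        # would be re-prompted with the reviewer's queries (elided here).
        milestones, results = run_mvm(instruction, trajectory, selector, verifier)
    # The Judge sees the entire deliberation, not just the final verifications.
    return judge(instruction, trajectory, history)
```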
4 Experiments

4.1 OmniGUIRewardBench

Benchmark Construction. To evaluate the effectiveness of OS-Themis and its generalizability across heterogeneous platforms, we construct OmniGUIRewardBench (OGRBench), a cross-platform outcome reward model (ORM) benchmark for GUI environments. We compile a dataset of real-world trajectories from five representative benchmarks: AndroidWorld (Rawles et al., 2024), OSWorld (Xie et al., 2024), WindowsAgentArena (Bonatti et al., 2024), macOSArena (Wang et al., 2025e), and WebArena-Lite-v2 (Wang et al., 2025e). Each trajectory is represented by the full sequence of screenshots over the entire interaction process, paired with the agent's model outputs. The trajectory-level outcome label is binary (True/False), indicating whether the overall task is successfully completed. The correctness labels of the trajectories are automatically determined by each benchmark's built-in evaluation rules. These trajectories were generated by a diverse suite of GUI agents, including the Qwen3-VL series (4B, 8B, 235B) (Bai et al., 2025), UI-TARS variants (1.5-7B, 72B-DPO) (Qin et al., 2025), ScaleCUA (7B, 32B) (Liu et al., 2025), and Claude-Sonnet-4.5 (Anthropic, 2025). We employed stratified sampling to ensure broad task coverage while enforcing a balanced class distribution, maintaining a positive sample ratio between 0.45 and 0.55. In total, the resulting evaluation set comprises 1,409 trajectories, consisting of 700 positive and 709 negative samples. More details are provided in Appendix C.

Main Results. We evaluate a range of models on OGRBench using OS-Themis and two primary baselines, DigiRL (Bai et al., 2024) and ZeroGUI (Yang et al., 2025). These two baselines serve as the archetypal paradigms for LLM-as-a-Judge frameworks in GUI reward modeling. As other existing critic frameworks are largely derivative of these foundational approaches, we selected them to ensure a representative comparison. ZeroGUI adopts a direct assessment paradigm by feeding the last K states, such as screenshots or structural page information, directly into the model for judgment. In this work, we specifically utilize the final two frames for this baseline, aligning with its original configuration. Conversely, DigiRL operates on a sequential verification paradigm. It evaluates states iteratively to determine if the objective is met, and this process continues until a state satisfies the goal or the trajectory terminates. Table 1 reports the accuracy across different platforms. We further include precision and recall to provide a granular characterization of performance differences among the frameworks. Across all tested base models, OS-Themis consistently yields superior results in both accuracy and precision. On average, our method outperforms DigiRL by significant margins, including 18.8% in accuracy, 29.6% in precision, 16.9% in recall, and 26.2% in F1-score. Similarly, it surpasses ZeroGUI by 7.7%, 5.1%, 13.0%, and 13.4% across the respective metrics. We provide more detailed evaluation results in Appendix B.

Framework | Model | Ubuntu Acc/F1 | Mobile Acc/F1 | Windows Acc/F1 | macOS Acc/F1 | Web Acc/F1 | Overall Acc/Prec/Rec/F1
DigiRL | Qwen3-VL-8B | 62.1/64.9 | 69.2/76.6 | 66.2/63.6 | 72.7/40.0 | 63.2/69.6 | 64.4/62.2/72.1/66.8
DigiRL | Qwen3-VL-30B-A3B | 61.0/67.6 | 67.0/75.6 | 69.0/71.1 | 67.5/49.0 | 54.2/63.0 | 62.5/58.9/80.6/68.1
DigiRL | Qwen3-VL-32B | 65.7/66.5 | 72.3/77.8 | 71.8/71.2 | 75.3/55.8 | 62.1/67.9 | 67.6/65.8/72.1/68.8
DigiRL | Qwen3-VL-235B | 65.5/67.0 | 70.2/75.2 | 72.3/70.9 | 76.6/43.8 | 63.2/62.8 | 67.4/66.6/69.0/67.8
DigiRL | Qwen3-VL-235B-Thinking | 62.4/66.3 | 67.5/74.7 | 65.7/66.0 | 70.1/46.5 | 58.4/62.2 | 63.5/61.0/73.0/66.5
DigiRL | GPT-5-mini | 47.0/0.0 | 47.9/0.0 | 55.9/0.0 | 79.2/0.0 | 47.9/0.0 | 50.3/0.0/0.0/0.0
DigiRL | GPT-5 | 50.5/13.7 | 56.9/33.1 | 59.1/17.1 | 79.2/0.0 | 52.6/16.7 | 54.5/89.3/9.6/17.3
DigiRL | Gemini-3-Flash | 69.5/63.0 | 69.7/64.2 | 76.5/66.2 | 83.1/31.6 | 75.3/74.3 | 72.1/86.6/51.9/64.9
DigiRL | Mean | 60.5/51.1 | 65.1/59.6 | 67.1/53.3 | 75.5/33.3 | 59.6/52.1 | 62.8/61.3/53.5/52.5
ZeroGUI | Qwen3-VL-8B | 72.7/67.8 | 78.7/79.6 | 76.5/67.5 | 89.6/69.2 | 75.8/75.0 | 75.4/86.3/60.1/70.9
ZeroGUI | Qwen3-VL-30B-A3B | 74.8/74.6 | 80.3/82.8 | 80.8/77.3 | 90.9/75.9 | 72.1/75.1 | 76.9/77.9/74.9/76.3
ZeroGUI | Qwen3-VL-32B | 75.0/72.2 | 78.7/80.0 | 79.3/72.5 | 89.6/66.7 | 76.3/75.9 | 77.2/85.3/65.3/74.0
ZeroGUI | Qwen3-VL-235B | 76.5/74.5 | 82.5/84.2 | 77.9/70.8 | 93.5/81.5 | 84.7/85.6 | 79.6/85.5/70.9/77.5
ZeroGUI | Qwen3-VL-235B-Thinking | 71.8/66.9 | 79.8/82.1 | 79.3/72.5 | 89.6/66.7 | 79.0/78.3 | 75.9/85.3/62.3/72.0
ZeroGUI | GPT-5-mini | 48.5/6.4 | 49.5/5.9 | 56.3/2.1 | 79.2/0.0 | 49.0/5.8 | 51.5/87.0/2.9/5.5
ZeroGUI | GPT-5 | 72.6/67.0 | 81.9/82.3 | 77.5/68.8 | 90.9/72.0 | 80.5/79.6 | 76.7/89.7/59.9/71.8
ZeroGUI | Gemini-3-Flash | 76.1/72.8 | 77.1/76.0 | 80.8/74.5 | 93.5/81.5 | 77.4/75.1 | 78.1/89.8/63.0/74.1
ZeroGUI | Mean | 71.0/62.8 | 76.1/71.6 | 76.0/63.2 | 89.6/64.2 | 74.4/68.8 | 73.9/85.8/57.4/65.3
OS-Themis | Qwen3-VL-8B | 77.2/75.8 | 85.6/85.1 | 72.8/61.8 | 85.7/59.3 | 86.3/87.2 | 79.3/86.3/69.4/77.0
OS-Themis | Qwen3-VL-30B-A3B | 79.5/79.4 | 84.6/85.0 | 76.5/70.9 | 88.3/66.7 | 80.5/79.3 | 80.3/84.7/73.7/78.8
OS-Themis | Qwen3-VL-32B | 77.6/75.2 | 83.5/82.1 | 75.1/63.5 | 88.3/69.0 | 84.7/83.3 | 79.6/92.2/64.3/75.8
OS-Themis | Qwen3-VL-235B | 88.1/88.3 | 92.3/92.8 | 77.5/68.4 | 94.8/86.7 | 92.1/92.2 | 88.0/92.8/82.3/87.2
OS-Themis | Qwen3-VL-235B-Thinking | 83.4/83.7 | 89.4/89.6 | 85.5/82.5 | 93.5/82.8 | 84.7/83.4 | 85.2/89.3/79.9/84.3
OS-Themis | GPT-5-mini | 68.8/59.7 | 65.4/52.5 | 76.5/64.3 | 87.0/54.5 | 75.8/70.5 | 71.5/95.4/44.7/60.9
OS-Themis | GPT-5 | 82.5/81.3 | 80.3/78.4 | 84.5/80.5 | 88.3/60.9 | 95.9/80.4 | 82.9/93.4/70.6/80.4
OS-Themis | Gemini-3-Flash | 85.0/84.7 | 91.0/91.1 | 86.9/83.7 | 93.5/82.8 | 82.6/80.5 | 86.2/93.2/78.0/84.9
OS-Themis | Mean | 80.3/78.5 | 84.0/82.1 | 79.4/72.0 | 89.9/70.3 | 85.3/82.1 | 81.6/90.9/70.4/78.7

Table 1: Performance comparison of different models under the DigiRL, ZeroGUI, and OS-Themis frameworks on OGRBench. Each framework reports Accuracy (Acc) and F1-score (F1) per platform, while the Overall performance includes Acc, Precision (Prec), Recall (Rec), and F1. Higher values indicate better performance. Bold and underlined denote the best and second-best results, respectively.
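For reference, the Overall Acc/Prec/Rec/F1 columns above are the standard binary-classification metrics computed over trajectory-level verdicts against OGRBench's ground-truth labels. A self-contained sketch of the computation (the helper name is ours, not benchmark code):

```python
def orm_metrics(preds: list[int], labels: list[int]) -> dict[str, float]:
    """Accuracy / Precision / Recall / F1 over binary trajectory verdicts
    (1 = task judged successful)."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return {"acc": (tp + tn) / len(labels), "prec": prec, "rec": rec, "f1": f1}
```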
4.2 Online RL

Online RL Infrastructure. To facilitate large-scale parallel trajectory rollouts for online RL training, we established a containerized infrastructure. Specifically, each Docker container hosts an independent Android Emulator instance and exposes a remote ADB interface for executing standard GUI operations (e.g., click, swipe, and type). The system supports real-time screen capture and enforces strict environment isolation; devices are re-initialized before each task to ensure a pristine state. This deployment strategy minimizes interference across worker processes, thereby improving the stability and reproducibility of the training phase. Furthermore, the workflow operates asynchronously to maximize efficiency: upon the completion of a trajectory, OS-Themis is immediately invoked for evaluation and reward calculation.

Task Design. Utilizing Qwen3-VL-235B, we automatically synthesize a comprehensive pool of tasks, following the methodology established in (Lai et al., 2025a). We then employ a lightweight filtering process to curate a training set of 96 tasks, while reserving a separate subset of 64 tasks for validation. Validation primarily relies on the rule-based evaluator to determine success, and uses the reward signal produced by the critic method as an auxiliary monitoring signal.

Training Setup. We implement multi-turn online reinforcement learning using the GRPO (Shao et al., 2024; Guo et al., 2025a) algorithm within the Verl framework (Sheng et al., 2024). To prevent over-regularization and encourage extensive exploration, we explicitly disable the KL divergence penalty (disable_kl=true, kl_coef=0.0). Optimization is performed using AdamW (Loshchilov and Hutter, 2017) with a learning rate of 1 × 10⁻⁶, a weight decay of 1 × 10⁻², and a gradient clipping threshold of 1.0. During the rollout phase, we employ a sampling temperature of 1.0 and generate n = 4 candidate trajectories per state to enhance exploration coverage. Trajectories are truncated at 50 steps (max_steps=50) with a request timeout of 60 seconds. The training process is conducted over a total of 4 episodes (total_episodes=4). We provide additional details of the online RL training in Appendix D.

Comparisons. To verify the framework's effectiveness across different scales, we fine-tune two policy backbones: Qwen3-VL-4B and Qwen3-VL-8B. Regarding OS-Themis, we instantiate it with two backbone options: Qwen3-VL-8B and Qwen3-VL-235B. We benchmark our approach against two external baselines under identical training configurations: SEAgent (Sun et al., 2025b), an open-source critic model, and ZeroGUI, an LLM-as-a-Judge method utilizing Qwen3-VL-235B.

Results on AndroidWorld. To systematically evaluate OS-Themis in the online RL setting, we conducted experiments on the AndroidWorld benchmark (see Table 2). For the Qwen3-VL-4B backbone, employing OS-Themis leads to a 6% absolute improvement over the baseline, significantly outperforming concurrent methods like ZeroGUI (+5.2%) and SEAgent (+3.5%). This performance advantage is even more pronounced in the larger model setting: for Qwen3-VL-8B, fine-tuning with OS-Themis yields a 7.1% gain over the baseline, while improving over ZeroGUI and SEAgent by 3% and 4.7%, respectively. Notably, the increased gain on the 8B model (7.1% vs. 6.0%) suggests that our framework scales effectively, offering greater benefits to larger foundation models.
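To keep the hyperparameters quoted in the training setup above in one place, the dict below is an illustrative consolidation; the key names are ours and do not reproduce verl's actual configuration schema.

```python
# Hypothetical consolidation of the GRPO hyperparameters reported in Sec. 4.2.
grpo_config = {
    "algorithm": "GRPO",
    "disable_kl": True,          # KL penalty disabled to encourage exploration
    "kl_coef": 0.0,
    "optimizer": "AdamW",
    "learning_rate": 1e-6,
    "weight_decay": 1e-2,
    "grad_clip": 1.0,
    "rollout": {
        "temperature": 1.0,      # sampling temperature during rollout
        "n": 4,                  # candidate trajectories generated per state
        "max_steps": 50,         # trajectory truncation length
        "request_timeout_s": 60,
    },
    "total_episodes": 4,
}
```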
4.3 Scaling Exploration

Scaling Training Setup. To evaluate the effectiveness of OS-Themis in a realistic online RL scaling setting, we conduct a pilot study on Qwen3-VL-4B. Specifically, within the OS-Themis framework, we use Qwen3-VL-235B as the backbone model to score trajectories and provide reward signals. We first use Qwen3-VL-235B to generate a large pool of task templates, filter for suitable ones, and then instantiate 1,024 training tasks from the selected templates. For each task, we roll out 4 trajectories and then optimize the policy with the GRPO algorithm for 1 epoch of online reinforcement learning. To prevent excessive distribution shift from the initial policy during scaling and to stabilize training, we incorporate a KL regularization term, adopting low_var_kl with kl_coef = 0.005, so as to balance exploration and robustness.

Backbone | Reward used for RL training | Acc
Qwen3-VL-4B | – | 45.3
Qwen3-VL-4B | SEAgent | 47.8
Qwen3-VL-4B | ZeroGUI (Qwen3-VL-235B) | 46.1
Qwen3-VL-4B | OS-Themis (Qwen3-VL-8B) | 50.9
Qwen3-VL-4B | OS-Themis (Qwen3-VL-235B) | 51.3
Qwen3-VL-8B | – | 47.6
Qwen3-VL-8B | SEAgent | 50.0
Qwen3-VL-8B | ZeroGUI (Qwen3-VL-235B) | 51.7
Qwen3-VL-8B | OS-Themis (Qwen3-VL-8B) | 53.4
Qwen3-VL-8B | OS-Themis (Qwen3-VL-235B) | 54.7

Table 2: RL performance on AndroidWorld with different reward sources. Each row is trained independently from the same initialization for the given backbone.

[Figure 3: The performance of Qwen3-VL-4B under online RL scaling with OS-Themis, including mean-reward growth and the corresponding AndroidWorld accuracy across training scales from 0 to 1,024 tasks.]

Scaling Results. For evaluation, we additionally design 128 extra tasks (disjoint from the 1,024 training tasks) as a validation set, and conduct staged validation when the number of training tasks reaches (0, 8, 16, 32, 64, 128, 256, 512, 1024) to characterize the performance trend as the task scale increases. Meanwhile, we evaluate each intermediate model on the AndroidWorld benchmark under the same protocol. As shown in Figure 3, after scaling training to 1,024 tasks, Qwen3-VL-4B achieves a score of 55.6% on AndroidWorld, improving by 10.3% over the baseline. These results demonstrate clear gains and highlight the strong potential of OS-Themis for scalable online RL.

More Scaling Studies. In Appendix E, we further conduct a suite of scaling experiments to analyze the scalability of OS-Themis and its behavior across different usage scenarios. The experiments fall into three categories: (1) Model scaling for individual agents: while keeping the other components at 8B, we separately upgrade the Selector, Reviewer, Judge, and Verifier to Qwen3-VL-235B to quantify each component's contribution; results suggest that scaling the Judge/Verifier is more critical, while scaling the Reviewer mainly improves precision. (2) Framework-level test-time scaling via voting: we compare three voting strategies (Majority / All / Any) under varying numbers of models (k) to characterize controllable precision-recall trade-offs, and observe clear differences across aggregation rules (these rules are sketched after this paragraph). (3) Test-time scaling for evaluation: on AndroidWorld, we apply a test-time scaling protocol to the policy backbones (Qwen3-VL-4B/8B), where OS-Themis serves as the online success judge to decide whether to proceed or retry (up to three attempts per task), and the final score is computed by the benchmark's built-in rules after all attempts finish. For a fair comparison across evaluator frameworks, we instantiate DigiRL, ZeroGUI, and OS-Themis with the same evaluator backbone, Qwen3-VL-235B. Overall, OS-Themis yields improved robustness and better evaluation performance.
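The Majority / All / Any strategies from point (2) reduce to simple aggregation rules over k independent verdicts; a minimal sketch (function name ours):

```python
def aggregate_verdicts(verdicts: list[int], rule: str = "majority") -> int:
    """Combine k independent binary critic verdicts under one voting rule.

    'all' maximizes precision (one dissent vetoes success), 'any' maximizes
    recall, and 'majority' sits in between, giving a controllable
    precision-recall trade-off as k varies.
    """
    if rule == "majority":
        return int(2 * sum(verdicts) > len(verdicts))
    if rule == "all":
        return int(all(verdicts))
    if rule == "any":
        return int(any(verdicts))
    raise ValueError(f"unknown rule: {rule}")
```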
4.4 Exploring Self-Evolving Capabilities for GUI Agents

The advancement of GUI agents is currently constrained by the scarcity of high-quality trajectory data; scaling data acquisition remains a critical bottleneck. However, given that modern agents already possess foundational execution capabilities, a promising solution is to enable autonomous environmental exploration (Lai et al., 2025a; Yan et al., 2025). This facilitates a virtuous cycle: collected interaction data is used to train the agent, which in turn generates higher-quality data. To sustain this evolution, the primary challenge lies in ensuring the stability and efficiency of data collection while maintaining quality. Effective filtering is therefore essential to distill high-value trajectories from massive exploration logs, a task where OS-Themis excels. Consequently, we integrate Qwen3-VL and OS-Themis within a containerized Android environment to establish an autonomous self-evolution paradigm driven by scalable exploration and precise filtering.

Task Generation. During the initialization phase, we pre-install a suite of applications within Docker-based Android Emulators and curate a set of representative seed tasks. Utilizing these examples, the model first generates comprehensive descriptions for each application. To ensure a valid exploration context, every session begins with a mandatory app-launch directive; the generation of subsequent exploration tasks is triggered only upon successful application entry. To augment state diversity prior to generation, we inject stochastic interactions (e.g., random swipes or clicks) with a fixed probability. Ultimately, the model synthesizes a series of exploration tasks, conditioned on the real-time device screenshot, the application description, and the seed task examples.

Trajectory Collection and Filtering. Upon task generation, the agent initiates execution, interacting with the environment until either the task is completed or a predefined step limit is reached. Crucially, the system operates asynchronously: as soon as a trajectory is concluded, OS-Themis performs an immediate correctness evaluation, while the collection pipeline concurrently proceeds to the next task. This design enables simultaneous data collection and real-time filtering, efficiently yielding a curated dataset of high-quality trajectories.
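The asynchronous collect-then-filter loop described above can be sketched with a thread pool; run_agent and os_themis_verdict are hypothetical callables standing in for the rollout worker and the critic, and the scheduling is simplified relative to the paper's containerized infrastructure.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def collect_and_filter(tasks, run_agent, os_themis_verdict, max_workers: int = 8):
    """Roll out tasks in parallel; score each trajectory as soon as it
    finishes, without blocking the remaining rollouts; keep only
    verdict-positive trajectories."""
    kept = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        rollouts = [pool.submit(run_agent, task) for task in tasks]
        scoring = {}  # scoring future -> trajectory
        for fut in as_completed(rollouts):
            traj = fut.result()
            # Hand the finished trajectory to the critic immediately while
            # other rollouts continue in the background.
            scoring[pool.submit(os_themis_verdict, traj)] = traj
        for fut in as_completed(scoring):
            if fut.result() == 1:  # real-time filtering on the binary verdict
                kept.append(scoring[fut])
    return kept
```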
[Figure 4: AndroidWorld accuracy of Qwen3-VL-4B / Qwen3-VL-8B after SFT on filtered data. Baseline: 45.3 / 47.6; SFT (All Data): 44.8 / 47.0; SFT (ZeroGUI): 50.0 / 48.7; SFT (DigiRL): 49.6 / 50.0; SFT (OS-Themis): 52.2 / 52.6. Parentheses indicate the filtering method; All Data means the data is unfiltered.]

Data Quality Evaluation. We aggregated a raw dataset of 15,110 trajectories collected in the Android environment. We then filtered the data using DigiRL, ZeroGUI, and OS-Themis, respectively, yielding three high-quality subsets. To evaluate the effectiveness of these filtered data, we conducted supervised fine-tuning (SFT) experiments on the Qwen3-VL-4B and Qwen3-VL-8B backbones. As shown in Figure 4, fine-tuning on the OS-Themis-filtered data brings substantial improvements of 6.9% and 5.0% over the respective baselines, and clearly outperforms fine-tuning on data filtered by DigiRL and ZeroGUI. In contrast, when the entire unfiltered collected dataset is used for SFT, the model performance degrades to varying extents, indicating the presence of substantial noise in the raw data. These results provide strong evidence for the high quality of the filtered data. Beyond this validation, they further indicate that OS-Themis can serve as a reliable core component for scalable data curation, enabling the construction of an autonomous and self-evolving data collection pipeline.

Variant | Selector | Verifier | Reviewer | Judge | Acc | Prec | Rec
OS-Themis (Full) | ✓ | ✓ | ✓ | ✓ | 88.0 | 92.8 | 82.3
OS-Themis w/o Selector | ✗ | ✓ | ✓ | ✓ | 83.3 | 79.7 | 88.9
OS-Themis w/o Verifier | ✓ | ✗ | ✓ | ✓ | 81.9 | 77.2 | 90.1
OS-Themis w/o Reviewer | ✓ | ✓ | ✗ | ✓ | 86.9 | 85.7 | 88.4
OS-Themis w/o Judge | ✓ | ✓ | ✓ | ✗ | 52.5 | 89.7 | 5.0

Table 3: Ablation results of OS-Themis by removing individual agents. Metrics include Overall Acc, Prec, and Rec (%). Bold indicates the best result.

4.5 Ablation Study

Impact of the Selector Agent. In our ablation study, we remove the Selector Agent. Instead of extracting key milestones, we forward every step to the Verifier for validation. While this stepwise design provides denser supervision, it incurs substantially higher verification overhead than the variant that uses the Selector, and its performance on OGRBench is reported in Table 3. More importantly, verifying every action introduces a large amount of weakly relevant and noisy evidence, which leads to a pronounced evidence dilution effect: critical failure conditions are obscured by numerous actions that are correct yet nonessential, making it harder to identify the true determinants of success and to attribute errors to outcome-critical decisions. As a result, the downstream judgment becomes overly influenced by accumulated "minor wins," yielding less discriminative reward signals. Consequently, compared with the setting that uses the Selector, Accuracy drops by about 4.7%, and Precision decreases even more sharply, by roughly 13.1%. These results highlight the crucial role of the Selector in mitigating evidence dilution, reducing verification cost, and improving the quality of evidence aggregation for reliable decision making.

Impact of the Verifier Agent. We further ablate the Verifier Agent by skipping the verification stage. In this setting, all milestones selected by the Selector Agent are treated as correctly executed by default. After the Selector and Reviewer Agent reach an agreement on the milestone(s), the decision is passed directly to the Judge Agent for final adjudication. The results are summarized in Table 3. Without the intermediate verification step, the pipeline suffers from systematic bias and reduced reliability, leading to a clear performance degradation: Accuracy drops by 6.1%, while Precision decreases even more substantially, by 15.6%.

Impact of the Judge Agent. We ablate the Judge Agent and determine task success solely based on the correctness of intermediate milestone executions. Concretely, a trajectory is marked as successful only if the Verifier Agent judges all milestones to be correct; otherwise, it is deemed a failure. As shown in Table 3, this setting leads to a substantial performance drop. In particular, Accuracy decreases to 52.5%. Although Precision remains high at 89.7%, Recall collapses to only 5%, indicating that the system becomes overly conservative in predicting success.
This suggests that, for the majority of tasks, occasional imperfect or unsuccessful intermediate operations do not necessarily prevent overall task completion. Therefore, a dedicated Judge Agent is essential to holistically reason over the entire trajectory and infer outcome-level success beyond step-level correctness.

Framework | Acc | Prec | Rec
OS-Themis w/o Reviewer | 86.9 | 85.7 | 88.4
OS-Themis w/ Reviewer (Advisor) | 87.8 | 88.7 | 86.4
OS-Themis w/ Reviewer (Critic) | 88.0 | 92.8 | 82.3

Table 4: Ablation study on the Reviewer Agent within the OS-Themis framework, comparing versions without a Reviewer and with Reviewers in different roles (Advisor and Critic). Metrics include Overall Acc, Prec, and Rec (%). Bold indicates the best result.

Framework | Acc | Prec | Rec
OS-Themis w/o Assignment Goal | 86.9 | 84.6 | 90.0
OS-Themis w/ Assignment Goal | 88.0 | 92.8 | 82.3

Table 5: Ablation study on the Assignment Goal within the OS-Themis framework, reporting Overall Acc, Prec, and Rec (%). Bold indicates the best result.

Impact of the Reviewer Agent. To disentangle and quantify the Reviewer Agent's contribution within OS-Themis, we utilize Qwen3-VL-235B as the backbone evaluator and benchmark performance on OGRBench with and without this module. We further explore two distinct role instantiations: Advisor and Critic. As an Advisor, the Reviewer provides constructive suggestions based on traces from the Deep Evaluation Module; as a Critic, it rigorously audits the interaction process to uncover overlooked flaws and evidence-grounded failure signals. As detailed in Table 4, incorporating the Reviewer maintains overall Accuracy while substantially boosting Precision by expanding the diversity and coverage of extracted evidence. Although the Advisor role yields the most balanced metrics, we ultimately select the Critic configuration to better align with the stability requirements of policy-gradient RL. Specifically, we strategically prioritize higher Precision over maximal Recall to minimize false positives, thereby ensuring the high fidelity of reward signals essential for effective policy optimization.

Effectiveness of the Assignment Goal. When the Selector Agent proposes a milestone, it concurrently generates an assignment goal. This equips the Verifier Agent with a precise, explicit criterion for validation, preventing ungrounded or ambiguous judgments. As demonstrated in Table 5, evaluating without an assignment goal leads the Verifier to exhibit excessive leniency, often approving incorrect trajectories. This high false-positive rate degrades overall precision and consequently hampers RL training. Conversely, incorporating assignment goals significantly mitigates this issue, resulting in a marked improvement in model precision.

5 Conclusion

In this work, we propose OS-Themis, a scalable critic framework designed to mitigate the challenges of contextual information loss and evidence dilution in generalist GUI reward modeling. We implement a collaborative workflow where a Milestone Verification Module employs Selector and Verifier agents to isolate salient signals, followed by a Verdict Calibration Module that leverages Reviewer and Judge agents to audit the evidence chain and counteract evidence dilution. This approach fosters the high-precision feedback essential for stable policy optimization.
Extensive evaluations on our newly proposed OmniGUIRewardBench highlight the superior performance of OS-Themis across diverse platforms, while empirical results in online RL and iterative data filtering further underscore its effectiveness in driving capability gains. We hope OS-Themis serves as a meaningful step toward large-scale reinforcement learning, paving the way for more resilient and self-evolving GUI agents.

Limitations

While OS-Themis demonstrates robust initial scalability and stability, we acknowledge certain limitations that delineate promising avenues for future exploration.

Online RL Scaling. At present, our empirical results mainly demonstrate the feasibility and effectiveness of the proposed framework. However, a systematic characterization of its scaling behavior across task volumes, environment parallelism, training horizons, and model sizes is still constrained by our current RL infrastructure: limited hardware makes it difficult to provision very large numbers of virtual environments, environment coordination and scheduling are not yet sufficiently efficient, and initialization pipelines remain imperfect. With a scalable and efficient online RL training stack, OS-Themis is expected to deliver more pronounced benefits at larger scales.

Reward Granularity and Formulation. Although the framework possesses the inherent capacity to provide rich supervision via structured evaluation and fine-grained process extraction, our current exploration of reward shaping remains preliminary. There exists substantial potential to enhance both reward density and learnability. Future work will delve into finer-grained, milestone-wise reward mechanisms and advanced composition strategies. By leveraging process-level evidence to synthesize more stable and information-rich training signals, we aim to further optimize sample efficiency and convergence quality in online optimization.

Ethical Considerations

The introduction of OS-Themis for online RL in GUI agents raises ethical considerations primarily related to the stochastic nature of supervision and value alignment. Unlike traditional RL environments defined by deterministic, rule-based success criteria, our framework derives reward signals from the consensus of general-purpose VLMs. This reliance on implicit semantic understanding, rather than explicit programmatic rules, introduces the risk of semantic reward hacking. Without rigid constraints, the policy agent may discover adversarial visual states or actions that exploit the specific reasoning gaps or hallucinations of the VLM critics, triggering high reward scores without genuine task completion.

Furthermore, relying on off-the-shelf VLMs for critique implies inheriting their intrinsic limitations. While the multi-agent architecture aims to mitigate individual errors, there remains a risk of bias propagation where agents collectively reinforce dataset biases found in pre-trained models. This could lead to systematic penalties for legitimate interaction styles in accessibility-oriented interfaces or non-standard GUI layouts that deviate from the VLM's training distribution. Such alignment failures may yield agents whose reliability degrades significantly in diverse, real-world software contexts.

Finally, extending this framework to in-the-wild online training necessitates strict privacy safeguards.
Since the agents must continuously process high-fidelity screenshots to compute advantages, the system inevitably handles potentially sensitive user data. To prevent privacy leakage during reward calculation, deployment should remain strictly local, or the system should adopt rigorous data-sanitization protocols, such as removing personally identifiable information (PII) from visual inputs before model inference. We strongly advocate for a "Human-in-the-Loop" validation stage prior to deployment to verify that the agent's learned behaviors align with human intent and have not drifted due to the probabilistic nature of the VLM-based feedback.

References

Saaket Agashe, Jiuzhou Han, Shuyu Gan, Jiachen Yang, Ang Li, and Xin Eric Wang. 2024. Agent S: An open agentic framework that uses computers like a human. arXiv preprint arXiv:2410.08164.
Anthropic. 2024. Developing a computer use model. Product announcement.
Anthropic. 2025. Introducing Claude Sonnet 4.5. https://www.anthropic.com/news/claude-sonnet-4-5. Published: 2025-09-29; Accessed: 2025-10-27.
Anthropic. 2025. Claude 3.7 Sonnet system card.
Hao Bai, Yifei Zhou, Jiayi Pan, Mert Cemri, Alane Suhr, Sergey Levine, and Aviral Kumar. 2024. DigiRL: Training in-the-wild device-control agents with autonomous reinforcement learning. Advances in Neural Information Processing Systems, 37:12461–12495.
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, and 45 others. 2025. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631.
Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, and 1 others. 2024. Windows Agent Arena: Evaluating multi-modal OS agents at scale. arXiv preprint arXiv:2409.08264.
Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Li YanTao, Jianbing Zhang, and Zhiyong Wu. 2024. SeeClick: Harnessing GUI grounding for advanced visual GUI agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9313–9332, Bangkok, Thailand. Association for Computational Linguistics.
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, and 1 others. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
Zhangxuan Gu, Zhengwen Zeng, Zhenyu Xu, Xingran Zhou, Shuheng Shen, Yunfei Liu, Beitong Zhou, Changhua Meng, Tianyu Xia, Weizhi Chen, and 1 others. 2025. UI-Venus technical report: Building high-performance UI agents with RFT. arXiv preprint arXiv:2508.10833.
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, and 1 others. 2025a. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, and 1 others. 2025b. Seed1.5-VL technical report. arXiv preprint arXiv:2505.07062.
Hanyu Lai, Junjie Gao, Xiao Liu, Yifan Xu, Shudan Zhang, Yuxiao Dong, and Jie Tang. 2025a. AndroidGen: Building an Android language agent under data scarcity. Preprint, arXiv:2504.19298.
Hanyu Lai, Xiao Liu, Yanxiao Zhao, Han Xu, Hanchen Zhang, Bohao Jing, Yanyu Ren, Shuntian Yao, Yuxiao Dong, and Jie Tang. 2025b. ComputerRL: Scaling end-to-end online reinforcement learning for computer use agents. arXiv preprint arXiv:2508.14040.
Seongyun Lee, Seungone Kim, Sue Park, Geewook Kim, and Minjoon Seo. 2024. Prometheus-Vision: Vision-language model as a judge for fine-grained evaluation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 11286–11315.
Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. 2024. LLMs-as-judges: A comprehensive survey on LLM-based evaluation methods. arXiv preprint arXiv:2412.05579.
Zhaoyang Liu, Jingjing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, Shenglong Ye, Qingyun Li, Xuan Dong, Yue Yu, Chenyu Lu, YunXiang Mo, Yao Yan, Zeyue Tian, Xiao Zhang, and 11 others. 2025. ScaleCUA: Scaling open-source computer use agents with cross-platform data. arXiv preprint arXiv:2509.15221.
Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
Zhengxi Lu, Jiabo Ye, Fei Tang, Yongliang Shen, Haiyang Xu, Ziwei Zheng, Weiming Lu, Ming Yan, Fei Huang, Jun Xiao, and 1 others. 2025. UI-S1: Advancing GUI automation via semi-online reinforcement learning. arXiv preprint arXiv:2509.11543.
Tianyi Men, Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, and Jun Zhao. 2025. Agent-RewardBench: Towards a unified benchmark for reward modeling across perception, planning, and safety in real-world multimodal agents. arXiv preprint arXiv:2506.21252.
Shikhar Murty, Dzmitry Bahdanau, and Christopher D Manning. 2024. NNetscape Navigator: Complex demonstrations for web agents without a demonstrator. arXiv e-prints, pages arXiv–2410.
Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, and 1 others. 2025. GUI agents: A survey. In Findings of the Association for Computational Linguistics: ACL 2025, pages 22522–22538.
OpenAI. 2025. Computer-using agent: Introducing a universal interface for AI to interact with the digital world.
OpenAI. 2025. Introducing GPT-5. https://openai.com/index/introducing-gpt-5. Accessed: 2025-10-27.
Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Wenyi Zhao, Yu Yang, Xinyue Yang, Jiadai Sun, Shuntian Yao, and 1 others. 2024. WebRL: Training LLM web agents via self-evolving online curriculum reinforcement learning. arXiv preprint arXiv:2411.02337.
Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, and 1 others. 2025. UI-TARS: Pioneering automated GUI interaction with native agents. arXiv preprint arXiv:2501.12326.
Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, and 1 others. 2024. AndroidWorld: A dynamic benchmarking environment for autonomous agents. arXiv preprint arXiv:2405.14573.
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, and 1 others. 2024. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2024. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256.
Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, and 1 others. 2024. OS-Genesis: Automating GUI agent trajectory construction via reverse task synthesis. arXiv preprint arXiv:2412.19723.
Qiushi Sun, Mukai Li, Zhoumianze Liu, Zhihui Xie, Fangzhi Xu, Zhangyue Yin, Kanzhi Cheng, Zehao Li, Zichen Ding, Qi Liu, and 1 others. 2025a. OS-Sentinel: Towards safety-enhanced mobile GUI agents via hybrid validation in realistic workflows. arXiv preprint arXiv:2510.24411.
Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, and Jiaqi Wang. 2025b. SEAgent: Self-evolving computer use agent with autonomous learning from experience. arXiv preprint arXiv:2508.04700.
Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, and 1 others. 2025a. UI-TARS-2 technical report: Advancing GUI agent with multi-turn reinforcement learning. arXiv preprint arXiv:2509.02544.
Shuai Wang, Weiwen Liu, Jingxuan Chen, Yuqi Zhou, Weinan Gan, Xingshan Zeng, Yuhan Che, Shuai Yu, Xinlong Hao, Kun Shao, and 1 others. 2024. GUI agents with foundation models: A comprehensive survey. arXiv preprint arXiv:2411.04890.
Taiyi Wang, Zhihao Wu, Jianheng Liu, Jianye Hao, Jun Wang, and Kun Shao. 2025b. DistRL: An asynchronous distributed reinforcement learning framework for on-device control agents. Preprint, arXiv:2410.14803.
Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, and 1 others. 2025c. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265.
Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, Zhennan Shen, Zhuokai Li, Ryan Li, Xiaochuan Li, Junda Chen, Boyuan Zheng, Peihang Li, Fangyu Lei, Ruisheng Cao, and 23 others. 2025d. OpenCUA: Open foundations for computer-use agents. Preprint, arXiv:2508.09123.
Xuehui Wang, Zhenyu Wu, Jingjing Xie, Zichen Ding, Bowen Yang, Zehao Li, Zhaoyang Liu, Qingyun Li, Xuan Dong, Zhe Chen, and 1 others. 2025e. MMBench-GUI: Hierarchical multi-platform evaluation framework for GUI agents. arXiv preprint arXiv:2507.19478.
Yuyang Wanyan, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Jiabo Ye, Yutong Kou, Ming Yan, Fei Huang, Xiaoshan Yang, and 1 others. 2025. Look before you leap: A GUI-Critic-R1 model for pre-operative error diagnosis in GUI automation. arXiv preprint arXiv:2506.04614.
Qianhui Wu, Kanzhi Cheng, Rui Yang, Chaoyun Zhang, Jianwei Yang, Huiqiang Jiang, Jian Mu, Baolin Peng, Bo Qiao, Reuben Tan, and 1 others. 2025a. GUI-Actor: Coordinate-free visual grounding for GUI agents. arXiv preprint arXiv:2506.03143.
Zhenyu Wu, Jingjing Xie, Zehao Li, Bowen Yang, Qiushi Sun, Zhaoyang Liu, Zhoumianze Liu, Yu Qiao, Xiangyu Yue, Zun Wang, and 1 others. 2025b. OS-Oracle: A comprehensive framework for cross-platform GUI critic models. arXiv preprint arXiv:2512.16295.
Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu Yao, Tao Yu, and Lingpeng Kong. 2024a. OS-Copilot: Towards generalist computer agents with self-improvement. arXiv preprint arXiv:2402.07456.
Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, and 1 others. 2024b. Os-atlas: A foundation action model for generalist gui agents. arXiv preprint arXiv:2410.23218.

Han Xiao, Guozhi Wang, Yuxiang Chai, Zimu Lu, Weifeng Lin, Hao He, Lue Fan, Liuyang Bian, Rui Hu, Liang Liu, and 1 others. 2025. Ui-genie: A self-improving approach for iteratively boosting mllm-based mobile gui agents. arXiv preprint arXiv:2505.21496.

Tianbao Xie, Jiaqi Deng, Xiaochuan Li, Junlin Yang, Haoyuan Wu, Jixuan Chen, Wenjing Hu, Xinyuan Wang, Yuhui Xu, Zekun Wang, Yiheng Xu, Junli Wang, Doyen Sahoo, Tao Yu, and Caiming Xiong. 2025. Scaling computer-use grounding via user interface decomposition and synthesis. Preprint, arXiv:2505.13227.

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, and 1 others. 2024. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094.

Yifan Xu, Xiao Liu, Xinghan Liu, Jiaqi Fu, Hanchen Zhang, Bohao Jing, Shudan Zhang, Yuting Wang, Wenyi Zhao, and Yuxiao Dong. 2025. Mobilerl: Online agentic reinforcement learning for mobile gui agents. arXiv preprint arXiv:2509.18119.

Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, and Caiming Xiong. 2024. Aguvis: Unified pure vision agents for autonomous gui interaction. arXiv preprint arXiv:2412.04454.

Haolong Yan, Jia Wang, Xin Huang, Yeqing Shen, Ziyang Meng, Zhimin Fan, Kaijun Tan, Jin Gao, Lieyu Shi, Mi Yang, Shiliang Yang, Zhirui Wang, Brian Li, Kang An, Chenyang Li, Lei Lei, Mengmeng Duan, Danxun Liang, Guodong Liu, and 80 others. 2025. Step-gui technical report. Preprint, arXiv:2512.15431.

Bowen Yang, Kaiming Jin, Zhenyu Wu, Zhaoyang Liu, Qiushi Sun, Zehao Li, JingJing Xie, Zhoumianze Liu, Fangzhi Xu, Kanzhi Cheng, Qingyun Li, Yian Wang, Yu Qiao, Zun Wang, and Zichen Ding. 2026. Os-symphony: A holistic framework for robust and generalist computer-using agent. Preprint, arXiv:2601.07779.

Chenyu Yang, Shiqian Su, Shi Liu, Xuan Dong, Yue Yu, Weijie Su, Xuehui Wang, Zhaoyang Liu, Jinguo Zhu, Hao Li, and 1 others. 2025. Zerogui: Automating online gui learning at zero human cost. arXiv preprint arXiv:2505.23762.

Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, and 1 others. 2025. Mobile-agent-v3: Fundamental agents for gui automation. arXiv preprint arXiv:2508.15144.

Junlei Zhang, Zichen Ding, Chang Ma, Zijie Chen, Qiushi Sun, Zhenzhong Lan, and Junxian He. 2025. Breaking the data barrier–building gui agents through task generalization. arXiv preprint arXiv:2504.10127.

Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, and 1 others. 2025. Mai-ui technical report: Real-world centric foundation gui agents. arXiv preprint arXiv:2512.22047.

A Precision-Recall Trade-off in RL

In policy-gradient-based RL, the reward signal determines which trajectories are reinforced. Precision therefore reflects the purity of rewarded samples: if many failures are mistakenly rewarded, the update direction is contaminated and the policy may be pushed toward incorrect behaviors, which becomes especially harmful when scaling up training.
Recall, by contrast, controls how often truly good behaviors are recognized and rewarded; if recall is too low, the positive signal becomes sparse, slowing iteration and limiting the attainable performance ceiling. Our objective is thus to maximize precision while keeping recall sufficiently high so that learning remains effective. We formalize this trade-off in a simplified setting. Once recall is already adequate to provide stable positive supervision, a mild reduction in recall can be worthwhile if it produces a larger reduction in false positives, thereby increasing precision and improving reward reliability.

Consider a fixed context $x$ where the policy samples a trajectory $\tau \sim \pi_\theta(\cdot \mid x)$ from $\{g, b\}$, with $y(g) = 1$ and $y(b) = 0$. For compactness, define $\Pr_\theta(\cdot) \triangleq \Pr_{\tau \sim \pi_\theta(\cdot \mid x)}(\cdot)$ and $\mathbb{E}_\theta[\cdot] \triangleq \mathbb{E}_{\tau \sim \pi_\theta(\cdot \mid x)}[\cdot]$. Let

$$p(\theta) \triangleq \Pr\nolimits_\theta(\tau = g), \qquad 1 - p(\theta) \triangleq \Pr\nolimits_\theta(\tau = b). \tag{2}$$

Assume an imperfect evaluator outputs $\hat{r}(\tau) \in \{0, 1\}$ with operating characteristics

$$\rho \triangleq \Pr(\hat{r} = 1 \mid \tau = g) \ \text{(recall / true-positive rate)}, \qquad \alpha \triangleq \Pr(\hat{r} = 1 \mid \tau = b) \ \text{(false-positive rate)}. \tag{3}$$

The true objective is

$$J(\theta) \triangleq \mathbb{E}_\theta[y(\tau)]. \tag{4}$$

Since $y(g) = 1$ and $y(b) = 0$, we have

$$J(\theta) = p(\theta). \tag{5}$$

The pseudo-objective induced by policy-gradient updates is

$$\hat{J}(\theta) \triangleq \mathbb{E}_\theta[\hat{r}(\tau)]. \tag{6}$$

Expanding by cases yields

$$\hat{J}(\theta) = p(\theta)\,\rho + \bigl(1 - p(\theta)\bigr)\,\alpha, \tag{7}$$

and equivalently,

$$\hat{J}(\theta) = \alpha + (\rho - \alpha)\,p(\theta). \tag{8}$$

For any baseline $c$ independent of $\tau$,

$$\mathbb{E}_\theta\!\left[\nabla_\theta \log \pi_\theta(\tau \mid x)\,\bigl(\hat{r}(\tau) - c\bigr)\right] = \nabla_\theta \hat{J}(\theta). \tag{9}$$

Differentiating gives

$$\nabla_\theta \hat{J}(\theta) = \nabla_\theta p(\theta)\,(\rho - \alpha). \tag{10}$$

Under the standard logit parameterization $p(\theta) = \sigma(\theta)$,

$$\nabla_\theta p(\theta) = p(\theta)\bigl(1 - p(\theta)\bigr) > 0, \tag{11}$$

so the expected update is governed by the preference margin $(\rho - \alpha)$. Intuitively, larger $\alpha$ assigns positive signal to bad trajectories, shrinking the gap between good and bad and reducing reward reliability; when $\alpha$ approaches $\rho$, the reward becomes weakly informative for distinguishing $g$ from $b$.

Precision depends on the policy base rate:

$$\mathrm{Prec}(\theta) \triangleq \Pr(\tau = g \mid \hat{r} = 1). \tag{12}$$

By Bayes' rule,

$$\mathrm{Prec}(\theta) = \frac{p(\theta)\,\rho}{p(\theta)\,\rho + \bigl(1 - p(\theta)\bigr)\,\alpha}. \tag{13}$$

Maintaining sufficiently high recall $\rho$ ensures true successes are rewarded often enough to provide a stable learning signal. However, when $\rho$ is already adequate, reducing the false-positive rate $\alpha$ can increase precision substantially, particularly when $p(\theta)$ is not large (e.g., early training or hard tasks). Concretely, consider $(\rho, \alpha) \mapsto (\rho', \alpha')$ with $\rho' = \rho - \delta$ and $\alpha' = \alpha - \Delta$, where $\delta > 0$ and $\Delta > 0$ (locally fixing $p(\theta)$). Then

$$\mathrm{Prec}'(\theta) > \mathrm{Prec}(\theta) \quad \text{if} \quad \rho\,\Delta > \alpha\,\delta. \tag{14}$$

Meanwhile, the preference margin becomes

$$(\rho' - \alpha') = (\rho - \alpha) + (\Delta - \delta), \tag{15}$$

which increases when $\Delta > \delta$, indicating a more reliable discriminative reward for policy-gradient updates.

In summary, once recall is sufficiently high, trading a mild decrease in recall for a larger reduction in false positives can improve precision and yield a more reliable reward signal for policy-gradient-based RL. Nevertheless, recall should not be pushed too low: if $\rho$ becomes small, truly good trajectories are rarely rewarded, leading to sparse positives, higher gradient variance, slower learning, and potentially less stable training.
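As a quick numerical check of inequality (14), here is a minimal Python sketch; the values of $\rho$, $\alpha$, $\delta$, and $\Delta$ are invented for illustration and are not measurements from the paper's experiments:

```python
# Toy check of the precision/margin trade-off; rho, alpha, delta, Delta
# are illustrative values only, not numbers reported in the paper.
def precision(p, rho, alpha):
    # Bayes' rule, Eq. (13): Pr(tau = g | r_hat = 1)
    return p * rho / (p * rho + (1 - p) * alpha)

p = 0.3                      # base success rate p(theta), e.g., early training
rho, alpha = 0.90, 0.30      # evaluator recall and false-positive rate
delta, Delta = 0.05, 0.20    # mild recall drop vs. larger false-positive drop

print(precision(p, rho, alpha))                  # ~0.5625
print(precision(p, rho - delta, alpha - Delta))  # ~0.7846, since rho*Delta > alpha*delta
print(rho - alpha, (rho - delta) - (alpha - Delta))  # margin grows: ~0.60 -> ~0.75
```

Here a 5-point recall sacrifice paired with a 20-point false-positive reduction raises precision from about 56% to about 78% and widens the preference margin, exactly the regime the derivation above describes.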
B More Results

B.1 Detailed OGRBench results

In Table 6, we report more detailed, platform-specific metrics for evaluating OGRBench across different models. By analyzing how each framework performs on individual platforms, we observe that, compared with ZeroGUI (Yang et al., 2025) and DigiRL (Bai et al., 2024), OS-Themis exhibits consistently strong performance with superior stability and overall effectiveness across diverse platforms.

We also attempted to evaluate several training-based GUI critic models; however, some existing critics are PRM-style models and are not compatible with OGRBench evaluation (Wu et al., 2025b; Wanyan et al., 2025; Gu et al., 2025). For ORM-style critics, most existing models are not open-sourced (Xu et al., 2025; Ye et al., 2025), and the few open-source ones are further limited by platform-specific information (Qi et al., 2024) and thus fail to generalize. Therefore, we only benchmark SEAgent (Sun et al., 2025b) as a representative comparator, but its performance is also not satisfactory.

Figure 5: Data Distribution of OGRBench

The detailed results suggest that overall framework performance is strongly influenced by the capability of the underlying model. For instance, GPT-5 and GPT-5-mini (OpenAI, 2025) tend to produce more conservative outputs and prefer predicting False, which can lead to notably low recall across different frameworks; OS-Themis partially mitigates this issue by reducing the extent to which recall is dominated by conservative bias. Meanwhile, Gemini-3-Flash (Comanici et al., 2025) achieves the highest precision across all models while maintaining sufficiently high recall, indicating more reliable positive judgments.

For the Qwen3-VL (Bai et al., 2025) series, we observe that as model size increases, the advantage of OS-Themis over DigiRL and ZeroGUI becomes more pronounced. With smaller models, the gap in precision is relatively large, whereas the difference in accuracy is limited; however, as reasoning capability improves with scale, the gains brought by OS-Themis expand substantially, yielding clearer performance separation. This trend suggests that under more complex and evidence-heavy settings, the quality of evidence organization and aggregation becomes a key determinant of final performance, and stronger models can better leverage the structured advantages of OS-Themis.

Table 6: Comparison of Accuracy (Acc), Precision (Prec), and Recall (Rec) across various platforms in OmniGUIRewardBench, evaluated using DigiRL, ZeroGUI, OS-Themis, and other GUI critic models (e.g., SEAgent). Each cell reports Acc / Prec / Rec.
| Model | Ubuntu | Mobile | Windows | macOS | Web | Overall |
|---|---|---|---|---|---|---|
| **DigiRL** | | | | | | |
| Qwen3-VL-8B | 62.1 / 63.7 / 66.2 | 69.2 / 63.3 / 96.9 | 66.2 / 60.6 / 67.0 | 72.7 / 36.8 / 43.8 | 63.2 / 61.1 / 80.8 | 64.4 / 62.2 / 72.1 |
| Qwen3-VL-30B-A3B | 61.0 / 60.4 / 76.6 | 67.0 / 61.5 / 98.0 | 69.0 / 60.5 / 86.2 | 67.5 / 36.4 / 75.0 | 54.2 / 54.4 / 74.8 | 62.5 / 58.9 / 80.6 |
| Qwen3-VL-32B | 65.7 / 69.0 / 64.1 | 72.3 / 66.9 / 92.9 | 71.8 / 64.9 / 78.7 | 75.3 / 44.4 / 75.0 | 62.1 / 60.8 / 76.8 | 67.6 / 65.8 / 72.1 |
| Qwen3-VL-235B | 65.5 / 67.9 / 66.2 | 70.2 / 66.4 / 86.7 | 72.3 / 66.1 / 76.6 | 76.6 / 43.8 / 43.8 | 63.2 / 66.3 / 59.6 | 67.4 / 66.6 / 69.0 |
| Qwen3-VL-235B-Thinking | 62.4 / 63.1 / 70.0 | 67.6 / 62.9 / 91.8 | 65.7 / 58.7 / 75.5 | 70.1 / 37.0 / 62.5 | 58.4 / 59.1 / 65.7 | 63.5 / 61.1 / 73.0 |
| GPT-5-mini | 47.0 / 0.0 / 0.0 | 47.9 / 0.0 / 0.0 | 55.9 / 0.0 / 0.0 | 79.2 / 0.0 / 0.0 | 47.9 / 0.0 / 0.0 | 50.3 / 0.0 / 0.0 |
| GPT-5 | 50.5 / 90.6 / 7.4 | 56.9 / 87.0 / 20.4 | 59.2 / 81.8 / 9.6 | 79.2 / 0.0 / 0.0 | 52.6 / 100.0 / 9.1 | 54.5 / 89.3 / 9.6 |
| Gemini-3-Flash | 69.5 / 88.5 / 48.9 | 69.7 / 83.6 / 52.0 | 76.5 / 90.7 / 52.1 | 83.1 / 100.0 / 18.8 | 75.3 / 81.0 / 68.7 | 72.1 / 86.6 / 51.9 |
| Mean | 60.4 / 62.9 / 49.9 | 65.1 / 61.5 / 67.4 | 67.1 / 60.4 / 55.7 | 75.5 / 37.3 / 39.8 | 59.6 / 60.3 / 54.4 | 62.8 / 61.3 / 53.5 |
| **ZeroGUI** | | | | | | |
| Qwen3-VL-8B | 72.7 / 90.6 / 54.2 | 78.7 / 79.6 / 79.6 | 76.5 / 86.7 / 55.3 | 89.6 / 90.0 / 56.3 | 75.8 / 81.2 / 69.7 | 75.4 / 86.3 / 60.1 |
| Qwen3-VL-30B-A3B | 74.8 / 80.1 / 69.7 | 80.3 / 76.1 / 90.8 | 80.8 / 80.5 / 74.5 | 90.9 / 84.6 / 68.8 | 72.1 / 70.2 / 80.8 | 76.9 / 77.9 / 74.9 |
| Qwen3-VL-32B | 75.0 / 88.2 / 61.1 | 78.7 / 78.4 / 81.6 | 79.3 / 87.9 / 61.7 | 89.6 / 100.0 / 50.0 | 76.3 / 80.7 / 71.7 | 77.2 / 85.3 / 65.3 |
| Qwen3-VL-235B | 76.5 / 87.9 / 64.6 | 82.5 / 79.3 / 89.8 | 77.9 / 85.1 / 60.6 | 93.5 / 100.0 / 68.8 | 84.7 / 84.3 / 86.9 | 79.6 / 85.5 / 70.9 |
| Qwen3-VL-235B-Thinking | 71.8 / 88.7 / 53.7 | 79.8 / 76.3 / 88.8 | 79.3 / 87.9 / 61.7 | 89.6 / 100.0 / 50.0 | 79.0 / 84.7 / 72.7 | 75.9 / 85.3 / 62.3 |
| GPT-5-mini | 48.5 / 86.7 / 3.3 | 49.5 / 100.0 / 3.1 | 56.3 / 100.0 / 1.1 | 79.2 / 0.0 / 0.0 | 49.0 / 75.0 / 3.0 | 51.5 / 87.0 / 2.9 |
| GPT-5 | 72.6 / 92.8 / 52.4 | 81.9 / 84.0 / 80.6 | 77.5 / 88.3 / 56.4 | 90.9 / 100.0 / 56.3 | 80.5 / 87.8 / 72.7 | 76.7 / 89.7 / 59.9 |
| Gemini-3-Flash | 76.1 / 91.9 / 60.3 | 77.1 / 84.0 / 69.4 | 80.8 / 89.6 / 63.8 | 93.5 / 100.0 / 68.8 | 77.4 / 87.8 / 65.7 | 78.1 / 89.8 / 63.0 |
| Mean | 71.0 / 88.4 / 52.4 | 76.1 / 82.2 / 73.0 | 76.1 / 88.2 / 54.4 | 89.6 / 84.3 / 52.3 | 74.3 / 81.5 / 65.4 | 73.9 / 85.8 / 57.4 |
| **OS-Themis** | | | | | | |
| Qwen3-VL-8B | 77.2 / 86.6 / 67.4 | 85.6 / 92.8 / 78.6 | 72.8 / 81.0 / 50.0 | 85.7 / 72.7 / 50.0 | 86.3 / 84.8 / 89.9 | 79.4 / 86.3 / 69.4 |
| Qwen3-VL-30B-A3B | 79.5 / 84.9 / 74.6 | 84.6 / 86.3 / 83.7 | 76.5 / 78.2 / 64.9 | 88.3 / 81.8 / 56.3 | 80.5 / 88.8 / 71.7 | 80.3 / 84.7 / 73.7 |
| Qwen3-VL-32B | 77.6 / 91.3 / 63.9 | 83.5 / 94.7 / 72.5 | 75.1 / 90.2 / 48.9 | 88.3 / 76.9 / 62.5 | 84.7 / 97.3 / 72.7 | 79.6 / 92.2 / 64.3 |
| Qwen3-VL-235B | 88.1 / 92.2 / 84.7 | 92.3 / 93.8 / 91.8 | 77.5 / 89.7 / 55.3 | 94.8 / 92.9 / 81.3 | 92.1 / 95.7 / 88.9 | 88.0 / 92.8 / 82.3 |
| Qwen3-VL-235B-Thinking | 83.4 / 87.5 / 80.2 | 89.4 / 91.5 / 87.8 | 85.5 / 88.0 / 77.7 | 93.5 / 92.3 / 75.0 | 84.7 / 96.1 / 73.7 | 85.2 / 89.3 / 79.9 |
| GPT-5-mini | 68.8 / 95.0 / 43.5 | 65.4 / 92.3 / 36.7 | 76.5 / 97.8 / 47.9 | 87.0 / 100.0 / 37.5 | 75.8 / 96.5 / 55.6 | 71.5 / 95.4 / 44.7 |
| GPT-5 | 82.5 / 93.7 / 71.8 | 80.3 / 91.8 / 68.4 | 84.5 / 90.7 / 72.3 | 88.3 / 100.0 / 43.8 | 95.9 / 70.7 / 81.4 | 82.9 / 93.4 / 70.6 |
| Gemini-3-Flash | 85.0 / 92.5 / 78.1 | 91.0 / 93.6 / 88.8 | 86.9 / 92.3 / 76.6 | 93.5 / 92.3 / 75.0 | 82.6 / 97.1 / 68.7 | 86.2 / 93.2 / 78.0 |
| Mean | 80.3 / 90.5 / 70.5 | 84.0 / 92.1 / 76.0 | 79.4 / 88.5 / 61.7 | 89.9 / 88.6 / 60.2 | 85.3 / 90.9 / 75.3 | 81.6 / 90.9 / 70.4 |
| **Other Critic Model** | | | | | | |
| SEAgent | 59.4 / 94.2 / 24.9 | 55.9 / 85.7 / 18.4 | 63.9 / 87.0 / 21.3 | 84.4 / 100.0 / 25.0 | 65.8 / 82.7 / 43.4 | 61.8 / 89.7 / 26.1 |

B.2 AgentRewardBench

We evaluated OS-Themis on AgentRewardBench (Men et al., 2025), and the results are summarized in Table 7. As shown in the table, our model maintains a consistently high level of Precision, reaching a maximum of 76.8, which is close to the rule-based method and significantly higher than that of other models. However, the Recall is relatively low. This can be attributed to two main reasons. First, AgentRewardBench itself exhibits a pronounced class imbalance between positive and negative samples (295:811), which already contributes to the observed mismatch between Precision and Recall. Second, a large proportion of the positive samples in AgentRewardBench are collected using Claude-3.7-Sonnet (Anthropic, 2025). However, in most of these trajectories, only a single action function is present, without an explicit reasoning process. This creates a distributional mismatch with our framework, as such samples do not align well with the multi-step reasoning patterns required during the agent's actual rollouts in real-world environments. As a result, these positive samples cannot be effectively captured by our method, leading to relatively lower Recall.
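To make the imbalance effect concrete, a back-of-the-envelope calculation (our own arithmetic, not a figure reported in the paper): with 295 positives among 1,106 samples, an evaluator that flagged trajectories positive at chance level would reach only

$$\mathrm{Prec}_{\text{chance}} = \frac{295}{295 + 811} \approx 26.7\%,$$

so the roughly 76-point precision reported above already demands a highly selective positive rule, which compounds the recall loss caused by the distributional mismatch.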
C OGRBench Dataset Details and Statistics

In Table 9, we compare OGRBench with two existing ORM benchmarks, AgentRewardBench (Men et al., 2025) and CUARewardBench. As shown in the table, OGRBench is currently the most comprehensive and largest-scale GUI ORM benchmark available.

In Table 8, we provide the statistics for the dataset distribution of OGRBench. As illustrated in Figure 5, the overall dataset maintains a balanced distribution between positive and negative samples. An exception is macOSArena, where the positive–negative ratio is notably imbalanced. This is primarily because current models perform poorly on macOS tasks: even the best-performing model achieves a success rate below 10%. To maximize task coverage under this constraint, we include all available negative trajectories, while positive trajectories remain scarce. Increasing the positive ratio would require either adding many highly similar positive cases or discarding a substantial number of negative cases, both of which would reduce the diversity and representativeness of the dataset. Given that the overall number of macOS tasks is also limited, we keep the current ratio as a reasonable reference point. We plan to revisit this split and add more positive trajectories once model performance on macOS improves, so that the dataset can be rebalanced in a more natural way.

Table 7: Performance of different models on AgentRewardBench within the OS-Themis framework.

| Category | Method | Prec | Recall |
|---|---|---|---|
| Official | Rule-based | 83.8 | 55.9 |
| Critic Model | SEAgent | 71.6 | – |
| LLM-as-a-Judge | Claude-3.7-Sonnet | 64.3 | 89.8 |
| | GPT-4o | 68.1 | 80.3 |
| | GPT-4o Mini | 64.5 | 78.3 |
| | Qwen2.5-VL-72B | 64.5 | 86.1 |
| | Qwen3-VL-8B | 66.3 | 82.0 |
| | Qwen3-VL-32B | 72.9 | 73.9 |
| | Qwen3-VL-235B | 67.5 | 81.7 |
| OS-Themis | Qwen3-VL-8B | 71.6 | 39.0 |
| | Qwen3-VL-32B | 76.8 | 39.3 |
| | Qwen3-VL-235B | 76.1 | 54.4 |

Table 8: Statistical Overview of the OmniGUIRewardBench Data.

| Benchmark | Platform | #Positive | #Negative |
|---|---|---|---|
| OSWorld | Ubuntu | 393 | 348 |
| AndroidWorld | Android | 98 | 90 |
| WindowsAgentArena | Windows | 94 | 119 |
| MacOSArena | MacOS | 16 | 61 |
| WebArena | Web | 99 | 91 |
| Total | | 700 | 709 |

D Details of RL Training

D.1 Practical Challenges and Implementation Details

Context Explosion. During online RL, we set the maximum interaction length per task to 50 steps, but in practice we frequently encountered overly long contexts caused by multi-turn interactions. To preserve the integrity of the screenshots at each step as much as possible, we truncated trajectories by number of steps rather than tokens, and used a 20-step window for each policy update. This effectively mitigated training instability and the growing computational overhead induced by long contexts.
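A minimal sketch of this step-level truncation (the 50-step cap and 20-step window come from the text; the trajectory data layout and function name are our assumptions):

```python
# Step-based context truncation: keep whole steps so each retained
# screenshot stays intact, instead of truncating by tokens, which can
# split an image's token span. The step-record layout is an assumption.
MAX_STEPS = 50       # maximum interaction length per task
WINDOW_STEPS = 20    # steps retained in the context for each policy update

def truncate_by_steps(trajectory, window=WINDOW_STEPS):
    """Return the most recent `window` steps of a rollout; each element
    bundles one screenshot with the agent output for that step."""
    return trajectory[-window:]

rollout = [{"screenshot": f"step_{i}.png", "action": f"a_{i}"} for i in range(MAX_STEPS)]
context = truncate_by_steps(rollout)
assert len(context) == WINDOW_STEPS and context[0]["screenshot"] == "step_30.png"
```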
Format Compliance. To ensure that model outputs strictly follow the structured format required by our framework, we introduced an additional format reward as a penalty term on top of the outcome reward: it is 0 when the output format is correct, and -1 otherwise. This explicitly constrains output parseability and consistency, reducing training noise due to format deviations.

Environment Setup. On the environment side, we first complete device and app initialization, and then package the setup into Docker instances. After finishing a single task, we do not immediately recreate the instance; instead, we rebuild it only after completing several tasks of different types, which increases process diversity while reducing the overhead of frequent restarts. Meanwhile, to maintain executability and state controllability, we restart the required app at the beginning of each task and return to the Home screen (and similar operations) to restore a reproducible starting state.

Task Design. We first use Qwen3-VL-235B to generate a pool of candidate task templates, and then curate a high-quality subset with good coverage and diversity. Based on this curated template set, we synthesize tasks at scale for the final training stage. To reduce reliance on manual initialization and improve scalability in large-scale settings, we apply a small number of templates to perform lightweight, template-based initialization for a subset of tasks. Meanwhile, most tasks are designed to be minimal-init, thereby lowering environment preparation costs and ensuring stable execution under large-scale training and evaluation.

D.2 Evolution of Validation Metrics during RL

Figure 6 illustrates how the validation rewards evolve during training when OS-Themis is used as the reward framework for Qwen3-VL-4B and Qwen3-VL-8B. The validation reward is partially derived from rule-based script judgments; meanwhile, on another subset of tasks, we score trajectories using the critic method employed in training as an auxiliary monitoring signal. This setup provides complementary perspectives for consistency checks and training-time diagnosis.

Table 9: Statistics of Existing GUI ORM Benchmarks.

| Benchmark | Ubuntu | Android | Windows | macOS | Web | Samples |
|---|---|---|---|---|---|---|
| AgentRewardBench | ✗ | ✗ | ✗ | ✗ | ✓ | 1106 |
| CUARewardBench | ✓ | ✗ | ✗ | ✗ | ✗ | 272 |
| OmniGUIRewardBench | ✓ | ✓ | ✓ | ✓ | ✓ | 1409 |

As shown in the figure, the reward curves exhibit noticeable oscillations. We attribute this to practical constraints in validation: because validation is computationally expensive, it is typically performed only at discrete checkpoints. Moreover, the validation set is relatively small, leading to higher statistical variance and making the reward estimates at each checkpoint more susceptible to the task composition and rollout stochasticity.

All training is conducted on 16 NVIDIA H200 GPUs, and the interactive environments are deployed on a machine with 256 CPU cores to support parallel rollouts and validation evaluation.

E Experiments on Scaling

E.1 Impact of Model Scaling on Individual Agents in OS-Themis

To investigate the potential of each component within OS-Themis, we individually replaced the base model of the Selector, Reviewer, Judge, and Verifier with the stronger Qwen3-VL-235B, while keeping other components at 8B. The results are detailed in Table 10. We observe that scaling the Judge and Verifier agents yields the most significant performance gains in terms of overall effectiveness, with the 235B Verifier achieving the highest Accuracy (82.5%) and the Judge attaining the best F1 score (81.7%). Notably, the Reviewer agent equipped with the 235B model achieves the highest Precision (89.1%) among all variants. This indicates that the increased model capacity enables the Reviewer to exercise greater rigor in its assessments, strictly minimizing false positives and ensuring high confidence in the validated trajectories.
Table 10: Single-agent scaling study in OS-Themis. "↑" upgrades the corresponding agent to Qwen3-VL-235B; all others remain Qwen3-VL-8B. Bold: best.

| Variant | Acc | Prec | Rec | F1 |
|---|---|---|---|---|
| Base (all 8B) | 79.4 | 86.3 | 69.4 | 77.0 |
| Selector ↑ | 80.6 | 86.1 | 72.7 | 78.9 |
| Reviewer ↑ | 79.3 | **89.1** | 66.4 | 76.1 |
| Judge ↑ | 82.4 | 84.5 | **79.1** | **81.7** |
| Verifier ↑ | **82.5** | 88.9 | 74.1 | 80.8 |

E.2 Test-Time Scaling for Framework Analysis

We assess the performance of three common test-time scaling strategies—Majority, All, and Any Voting (Yang et al., 2025)—across varying numbers of models (k), with results illustrated in Figure 7. Majority Voting predicts a positive outcome when more than half of the models vote positive, All Voting requires all models to vote positive, and Any Voting predicts positive as long as at least one model votes positive. Overall, the experimental results reveal clear and systematic trade-offs between precision and recall that stem directly from the different aggregation rules. In particular, as k increases, each strategy amplifies a different bias in the final decision, thereby shifting the operating point along the precision–recall spectrum.

Majority Voting exhibits the strongest robustness among the three. As a balanced strategy, it maintains relatively stable accuracy across different k, reflecting its ability to offset occasional misjudgments from individual models through simple averaging. However, its recall shows sawtooth-like oscillations at even values of k, indicating sensitivity to the voting threshold when ties or near-ties become more likely. This behavior suggests that Majority Voting is generally reliable as a default aggregation choice, but its recall can fluctuate depending on how the threshold interacts with k.

Figure 6: Evolution of validation mean reward for Qwen3-VL models trained via the OS-Themis framework with different base models. (a) Qwen3-VL-4B trained with Qwen3-VL-8B as the critic base model; (b) Qwen3-VL-4B with Qwen3-VL-235B; (c) Qwen3-VL-8B with Qwen3-VL-8B; (d) Qwen3-VL-8B with Qwen3-VL-235B.

All Voting (Consensus) imposes the strictest filtering mechanism, requiring agreement across all models before producing a positive decision. As k increases, this strategy increasingly suppresses false positives, leading to a steady improvement and eventual saturation in precision. At the same time, the stringent consensus constraint makes it progressively harder to return positives, which manifests as a near-linear decline in recall. As a consequence, All Voting is best suited for settings where avoiding false positives is paramount and low coverage is acceptable—i.e., applications with extremely low error tolerance that explicitly prioritize precision over recall (e.g., constructing high-quality instruction fine-tuning datasets).

In contrast, Any Voting minimizes false negatives by adopting a union-based decision rule: as long as one model votes positive, the aggregate decision becomes positive. With larger k, this strategy substantially boosts recall, since it becomes less likely that all models miss a true positive. However, the same mechanism also admits more noisy positives, which reduces precision. Despite this trade-off, overall accuracy can still improve marginally, reflecting the net benefit of reducing misses under this aggregation rule. Accordingly, Any Voting is particularly effective for recall-sensitive tasks, such as preliminary data screening, where missing relevant instances is more costly than introducing additional candidates to be filtered later.
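The three aggregation rules reduce to one-liners; a minimal sketch matching the definitions above (`votes` holds the per-model boolean verdicts for one trajectory; the function names are ours):

```python
# Test-time scaling aggregation rules over k per-model verdicts.
def majority_voting(votes):
    return 2 * sum(votes) > len(votes)  # positive iff more than half vote positive

def all_voting(votes):
    return all(votes)                   # consensus: suppresses false positives

def any_voting(votes):
    return any(votes)                   # union: suppresses false negatives

votes = [True, True, False]             # k = 3 models
print(majority_voting(votes), all_voting(votes), any_voting(votes))
# -> True False True
```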
Figure 7: Scaling performance comparison of Qwen3-VL-235B under the OS-Themis framework on OGRBench using three voting strategies: Majority, All, and Any. (Three panels plot micro-averaged Accuracy, Precision, and Recall against the number of models k from 1 to 5.)

Figure 8: The performance of Qwen3-VL-4B and Qwen3-VL-8B on AndroidWorld after Test-Time Scaling (TTS). (Accuracy. Qwen3-VL-4B: Baseline 45.3, TTS(ZeroGUI) 49.6, TTS(DigiRL) 49.1, TTS(OS-Themis) 53.9; Qwen3-VL-8B: Baseline 47.6, TTS(ZeroGUI) 50.0, TTS(DigiRL) 49.1, TTS(OS-Themis) 53.4.)

E.3 Test-Time Scaling in Evaluation

To verify the role of OS-Themis on the evaluation side, we conduct a Test-Time Scaling (TTS) study during the evaluation stage. Specifically, on the AndroidWorld benchmark, we use Qwen3-VL-4B and Qwen3-VL-8B as the evaluated policy backbones, with the sampling temperature fixed to 0.7. For each task, the agent first executes one attempt, and OS-Themis is used to determine whether the task is completed: if the attempt is judged successful, we proceed to the next task; otherwise, we retry the same task, with at most three attempts per task. After all attempts are finished, the built-in rules are used to compute the final score. The results are summarized in Figure 8. In addition, we adopt Qwen3-VL-235B as the base evaluator and compare three evaluation frameworks, DigiRL, ZeroGUI, and OS-Themis. The results show that OS-Themis improves over the Baseline by 8.6% and 5.8% for the two policy backbones, respectively, substantially outperforming DigiRL and ZeroGUI. These results demonstrate that OS-Themis can effectively enhance both performance and robustness under test-time scaling at evaluation time.
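The retry protocol above is a simple judged loop; here is a sketch with placeholder interfaces (`run_agent` and `themis_judge` are hypothetical stand-ins for the rollout and reward calls, which the text does not specify):

```python
# Evaluation-time Test-Time Scaling: retry a task until OS-Themis judges
# an attempt successful, with at most three attempts per task.
MAX_ATTEMPTS = 3

def solve_with_retries(task, run_agent, themis_judge):
    trajectory = None
    for _ in range(MAX_ATTEMPTS):
        trajectory = run_agent(task, temperature=0.7)  # temperature from the text
        if themis_judge(task, trajectory):             # judged complete: move on
            break
    # The final attempt is afterwards scored by the benchmark's built-in rules.
    return trajectory
```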
F Details of Milestones

F.1 Milestone Statistics

We present comprehensive statistics on Milestones under Qwen3-VL-235B in Table 11. The task-level average milestone percentage is 58.47%, with a median of 66.67%. Across all 1,409 tasks, the total number of steps is 27,882, and the total number of milestones is 9,918, representing 35.57% of the overall steps (step-level percentage). The average number of milestones per task is 7.04, with a median of 6.00 and a standard deviation of 4.16. These statistics indicate that only about half of the steps in each task are critical to the final outcome, and this subset of information is enough to determine whether a trajectory is correctly executed.

Table 11: Comprehensive milestone statistics under Qwen3-VL-235B. Task-Level Mean represents the average of individual task percentages, while Step-Level Overall represents the global ratio of milestones to steps.

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| **Milestone Percentage** | | | |
| Task-Level Mean | 58.47% | Step-Level Overall | 35.57% |
| Median | 66.67% | Std Deviation | 30.83% |
| **Milestone Count per Task** | | | |
| Mean | 7.04 | Median | 6.00 |
| Std Deviation | 4.16 | – | |
| **Overall Summary** | | | |
| Total Tasks | 1,409 | Total Steps | 27,882 |
| Total Milestones | 9,918 | Avg Steps/Task | 19.79 |

Figure 9 presents the histogram and kernel density curve of the overall milestone percentage distribution. The distribution peaks at 100%, where the density exceeds 4%, primarily due to a substantial portion of simple tasks consisting of only a few steps, each requiring verification. Apart from the 100% peak, the distribution is relatively uniform across other percentage ranges, demonstrating that milestone selection exhibits no significant bias.

Figure 9: Distribution of Milestone Percentage. (Histogram with KDE over Milestones / Total Steps per task; N = 1,409, Mean 58.47%, Median 66.67%, Std Dev 30.83%, Min 4.30%, Max 100.00%, Q1 28.00%, Q3 83.33%, IQR 55.33%.)
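The gap between the task-level mean (58.47%) and the step-level overall ratio (35.57%) is purely a weighting effect, which the 100% peak from short tasks makes visible; a toy sketch with invented counts:

```python
# Task-level mean weights every task equally; step-level overall weights
# every step equally, so long tasks dominate it. Counts below are invented.
tasks = [
    {"milestones": 3, "steps": 3},    # short task: 100% of its steps are milestones
    {"milestones": 4, "steps": 20},   # long task: 20%
]

task_level_mean = sum(t["milestones"] / t["steps"] for t in tasks) / len(tasks)
step_level_overall = sum(t["milestones"] for t in tasks) / sum(t["steps"] for t in tasks)

print(f"{task_level_mean:.2%}")     # 60.00%: pulled up by the short task
print(f"{step_level_overall:.2%}")  # 30.43%: dominated by the long task
```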
F.2 Illustration of Milestones

Figure 10 provides a visual comparison between our Milestone approach and existing methods (DigiRL and ZeroGUI). Unlike previous methods, our approach selectively identifies a few discrete steps within a trajectory as milestones, thereby efficiently leveraging the most critical information from the trajectory data.

Figure 10: Illustration of trajectory steps selected for correctness evaluation under different methods. Colored points represent selected steps.

G Cost and Latency Analysis

On OmniGUIRewardBench, we report the per-trajectory average latency, prompt tokens, completion tokens, and the number of calls, as summarized in Table 12.

Table 12: Per-trajectory average latency and token/call statistics on OmniGUIRewardBench.

| Metric | Value |
|---|---|
| Latency | 117.6 s |
| Completion tokens | 6416.8 |
| Prompt tokens | 164624.0 |
| Calls | 14.1 |

Cost. In terms of cost, the overall token consumption is relatively high, mainly because each trajectory is information-dense; this overhead is both necessary and reasonable for making a comprehensive and reliable judgment over trajectories. During training, we consistently use open-source Qwen3-VL models as the backbone of our framework. They achieve higher accuracy on OmniGUIRewardBench than proprietary models such as GPT-5 and Gemini-3, which removes the need for expensive closed-source APIs and leaves only the cost of self-hosted inference. Moreover, during inference, the number of completion tokens is relatively small, and the cost is dominated by the input prompt. Since the prompt contains substantial repeated prefixes (e.g., system prompts), vLLM's prefix caching can significantly improve throughput and reduce redundant computation overhead.

Latency. Regarding time, the main bottleneck of GUI Online RL lies in trajectory rollout. In our experiments with 64 environment workers, collecting a batch of trajectories takes about 3420 s on average, whereas OS-Themis takes about 117.6 s on average—approximately 3% of the rollout time. As illustrated in Figure 11, evaluation starts immediately after each rollout finishes to compute accuracy. Importantly, this evaluation is decoupled from the online environment and does not block subsequent rollouts, thus having negligible impact on RL efficiency.

While efficiency and overhead are important in current GUI Online RL training, the more fundamental bottleneck is the difficulty of sustaining stable training in real-world environments. Trading a moderate amount of overhead for improved training stability is therefore a worthwhile choice.

H Case Study

Figures 12, 13, and 14 present a representative case study designed to illustrate our process for evaluating trajectory correctness and to highlight the critical role of the Reviewer Agent.

Figure 12 displays the execution trajectory of the GUI Agent. The task objective is "Edit note_SiFbv.txt in Markor. Add to the top of the note Hello, World!". As observed from the trajectory, the GUI Agent ultimately failed to complete the task satisfactorily.

Figure 13 illustrates the interaction between the Selector Agent and the Verifier Agent within the Milestone Verification Module. Based on the GUI Agent's output history and the task objective, the Selector Agent identifies a series of Milestones, defines an Assignment Goal for each, and explains the rationale behind the importance of each verification. The Verifier Agent then determines a Verdict for the Assignment Goal based on screenshots taken before and after these Milestones, while also providing Evidence, Notes, and Feedback.

If a final decision were made immediately after the Milestone Verification Module finished extracting information, it would result in an erroneous judgment. This is because the information has not been fully utilized. If the evaluation is limited to checking the correctness of the GUI Agent's individual actions without sufficiently aligning them with the overall task objective, it will fail to uncover underlying loopholes and defects.

Therefore, as shown in Figure 14, we transmit the interaction data from the Selector and Verifier Agents to the Verdict Calibration Module. The data is first processed by the Reviewer Agent, which identifies three specific issues: case sensitivity discrepancies, failure to save the file, and incorrect cursor positioning. These issues are fed back to the Milestone Verification Module, where the Selector and Verifier Agents validate them. Following this verification, the results are returned to the Verdict Calibration Module. Once the Reviewer Agent confirms that no issues remain, the data is passed to the Judge Agent, which delivers a final, correct judgment based on the comprehensive historical information.

I More details of OS-Themis

I.1 Boundary Conditions

We use explicit stopping rules with hard worst-case iteration caps in our implementation:

• Selector: we enforce a strict upper bound of 6 refinement rounds (selector_max_rounds = 6). Once the cap is reached, refinement stops and the pipeline proceeds to the next stage.
• Retry mechanism: for failures/exceptions, we allow at most 2 retries. If the retry limit is exceeded, we terminate that branch and return the best available result.
• Reviewer: we similarly impose a hard cap of 2 review rounds (reviewer_max_rounds = 2). Once the cap is reached, refinement stops.

These caps provide a crisp algorithmic stopping rule and a strict worst-case iteration bound, making the overall runtime easier to reason about, as sketched below. Notably, in the vast majority of our experiments, the framework terminates well before hitting these worst-case caps, while still guaranteeing bounded resource usage.
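A control-flow sketch of these caps (the three constants come from the list above; the agent calls are stubbed out, and the helper names and exact loop structure are our assumptions, not the released implementation):

```python
# Minimal sketch of the stopping rules: hard caps bound every loop, so the
# worst-case number of model calls is known in advance.
SELECTOR_MAX_ROUNDS = 6   # selector refinement rounds (selector_max_rounds)
REVIEWER_MAX_ROUNDS = 2   # reviewer audit rounds (reviewer_max_rounds)
MAX_RETRIES = 2           # per-call retries on failures/exceptions

def call_with_retries(agent_fn, *args):
    """Invoke an agent, retrying at most MAX_RETRIES times on exceptions;
    afterwards the last error is surfaced so the branch can be terminated."""
    for attempt in range(1 + MAX_RETRIES):
        try:
            return agent_fn(*args)
        except Exception:
            if attempt == MAX_RETRIES:
                raise

def run_pipeline(task, select, verify, review, judge):
    evidence = []
    for _ in range(SELECTOR_MAX_ROUNDS):             # selector refinement loop
        milestones = call_with_retries(select, task, evidence)
        if not milestones:                           # selector chose to stop
            break
        evidence += [call_with_retries(verify, task, m) for m in milestones]
    for _ in range(REVIEWER_MAX_ROUNDS):             # reviewer audit loop
        issues = call_with_retries(review, task, evidence)
        if not issues:                               # no remaining issues
            break
        evidence += [call_with_retries(verify, task, i) for i in issues]
    return call_with_retries(judge, task, evidence)  # final verdict
```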
I.2 Prompt

To evaluate the effectiveness of OS-Themis, we experimented with multiple prompt variations. Through continuous practical testing using the Qwen3-VL series and iterative refinements based on errors encountered during actual experiments, we ultimately established a stable and reliable prompt. The specific prompts are shown in Table 13, Table 14, Table 15, and Table 16. Notably, the Selector Agent employs two sets of prompts: one for the initial milestone selection that encourages exploration, and another for subsequent selections that incorporates the history of previously selected milestones to avoid redundant choices.

Figure 11: Timeline of parallel trajectory rollout across environment workers. Each worker samples a trajectory in two rollout groups (green and blue). After a rollout finishes, the trajectory is evaluated (hatched, 117.6 s on average) and this evaluation is decoupled from the online environments, allowing it to overlap with ongoing sampling in other workers; thus the evaluation introduces negligible blocking to subsequent rollouts.

Step 1
Think: The home screen shows a variety of app icons, including Gmail and YouTube, but the Markor app is not visible. To proceed, I should consider opening the app drawer to find the Markor app, since it is not visible on the home screen.
Operation: Open the app drawer by swiping up or tapping the icon if there is one, to locate the Markor app.
Action: click(x=500, y=1955)

Step 2
Think: Considering the goal is to open and edit a note in Markor, my current step is to locate the app. The search function is a useful tool since it allows me to swiftly find the Markor app without manually navigating through all apps.
Operation: Type 'Markor' into the search bar to locate the app quickly.
Action: write(message='Markor')

Step 3
Think: After searching for 'Markor' in the app drawer, I can see the Markor app icon displayed, ready to be opened.
Operation: Tap on the Markor app icon to open the app.
Action: click(x=160, y=394)

Step 4
Think: I'm in the Markor app and can see a list of files on the screen. My task is to edit a note called 'note_SiFbv.txt' and add 'Hello, World!' to the top of it. The note is located in the main file list. Since my goal is to edit 'note_SiFbv.txt', the next logical step is to select this file from the list to open it.
Operation: Tap on 'note_SiFbv.txt' to open it for editing.
Action: click(x=274, y=871)

Step 5
Think: I see the note content is visible, and I need to add 'Hello, World!' to the top of it. I can see the cursor or option to place it at the beginning of the note.
Operation: Place the cursor at the top of the note before the existing text and type 'Hello, World!'
Action: click(x=130, y=354)

Step 6
Think: The goal is to add 'Hello, World!' to the top of the note. The cursor is correctly placed, so I can directly proceed to typing the new content without any further adjustments.
Operation: Type 'Hello, World!' at the top of the note.
Action: write(message='Hello, world!')

Step 7
Think: The note_SiFbv.txt is open in the Markor app, showing text with 'Hello, world!' at the top, followed by 'Don't forget to water the plants while I'm away.' This confirms the text 'Hello, World!' has been successfully added to the top of the note. The goal was to add 'Hello, World!'
to the top of the note, and observing the final state shows that the task has been accomplished. Thus, marking the task as complete is appropriate.
Operation: The task is completed successfully.
Action: terminate(status='success')

Task: Edit note_SiFbv.txt in Markor. Add to the top of the note Hello, World!

Figure 12: Execution trajectory of the GUI Agent for the task of editing a file in Markor.

Selector:
Step 3: Markor app icon is visible in the search results after typing 'Markor' in the app drawer search bar, and the app is successfully launched upon tapping it.
• Why important: This step confirms the agent successfully located and opened the correct app.
Step 4: The file named 'note_SiFbv.txt' is visible in the Markor file list and is successfully opened for editing upon being tapped.
• Why important: Verifies the agent accessed the specific note file required.
Step 5: The text cursor is positioned at the very beginning of the note content, indicating readiness to insert new text at the top.
• Why important: Essential to satisfy the task requirement of adding text to the top.
Step 6: The exact string 'Hello, world!' appears at the very top of the note content, immediately followed by the original text.
• Why important: Directly confirms the core task goal has been met.
Rationale: These four key steps cover the entire critical path: launching the app, opening the file, positioning the cursor, and inserting the text.

Verifier:
Step 3 Verdict: ✔ Success
Evidence: BEFORE: Markor app icon visible in search results with label 'Markor'. AFTER: Markor app main file browser UI loaded with header 'Markor'.
Notes: App launch confirmed by transition from search result to functional app UI.
Selector Feedback: Need additional checks. The file is visible, but no edit action has been taken yet.

Step 4 Verdict: ✔ Success
Evidence: BEFORE: 'note_SiFbv.txt' is listed in the Markor file browser with timestamp 10/15/2023. AFTER: The editor is open with header 'note_SiFbv' and editable text visible.
Notes: The file was successfully located and opened.
Selector Feedback: Need additional checks. The file is open, but 'Hello, World!' has not been added yet.

Step 5 Verdict: ✔ Success
Evidence: BEFORE: Cursor appears at start of line, before the existing text. AFTER: Red text cursor is visibly positioned at the very beginning of the note.
Notes: Cursor placement is visually confirmed at the top, satisfying readiness.
Selector Feedback: Need additional checks. Next step must confirm text is typed and saved.

Step 6 Verdict: ✔ Success
Evidence: BEFORE: Note content starts with 'Don't forget to water the plants...'. AFTER: Note content starts with 'Hello, world!' followed by the original text.
Notes: The new text was inserted correctly at the top with proper capitalization and punctuation.
Selector Feedback: No additional checks needed. The task goal is satisfied.

Judge:
Justification: All required steps to complete the task have been verified: Markor was launched (step 3), the correct note was opened (step 4), the cursor was placed at the top (step 5), and the exact text 'Hello, world!' was inserted at the top of the note (step 6). The final state matches the task goal precisely, and no further actions are needed. The SelectorModel confirmed no additional checks are required, and the agent terminated successfully after observing the correct edit.
Final_decision: completed

Wrong!
Figure 13: The Selector Agent and Verifier Agent within the Milestone Verification Module identify milestones and mine trajectory information, while the lower part presents a counter-example of a direct judgment made through the module.

Reviewer:
ISS-1 (Blocker): Step 6 typed 'Hello, world!' (lowercase 'w') but goal specified 'Hello, World!' (uppercase 'W'). Case mismatch may invalidate task completion.
ISS-2 (Warning): No explicit save action taken. Markor may not auto-save, risking loss of changes.
ISS-3 (Warning): Step 5 cursor placement needs absolute verification to ensure it didn't insert mid-line.

Selector:
Need more steps: True
Reason to continue: Critic issues highlight unresolved risks regarding case mismatch and save evidence.
Key Steps to Add:
Step 20 (Re-verify): Verify the text is exactly 'Hello, World!' with capital 'W'.
Step 21: Confirm the note has been saved (check for indicator or persistence).

Verifier:
Step 20 Verdict: ❌ Failure
Evidence: AFTER screenshot shows 'Hello, world!' (lowercase 'w'). TASK_GOAL requires 'Hello, World!' (uppercase 'W').
Notes: The agent failed the case requirement.
Selector Feedback: No further checks needed. The output is definitively incorrect.

Step 21 Verdict: ❌ Failure
Evidence: Final screenshot shows the text, but no visible save indicator (checkmark, toast, or timestamp change). No BEFORE image to compare state change for saving.
Notes: Cannot confirm persistence without a save signal or reopening the note.
Selector Feedback: Need additional checks (suggest reopening the note).

Reviewer: No issues

Judge:
Justification: The agent failed the exact text requirement (case mismatch) and failed to provide evidence of saving the document.
Final_decision: not completed

Right!

Figure 14: The Reviewer Agent within the Verdict Calibration Module correctly identifies potential issues in the interaction information and feeds them back to the Milestone Verification Module. After verification, the Judge Agent correctly determines that the trajectory failed to complete the task.

Selector Agent Prompts

Part 1: Initial System Prompt for the Selector Agent

You are Selector Model — Initial Key-Step Selector for GUI task verification.

Goal:
- From the agent's historical trace, identify a nuanced, sufficiently rich set of "key steps" whose success or failure determines whether the task is complete.
- Err on the side of high coverage: prefer a longer, fine-grained list that spans every necessary sub-step rather than a minimal shortlist.
- For every chosen step you MUST craft a concrete assessment_goal that tells the verifier the exact observable outcome to look for (e.g., "After this step, the note titled 'Trip' is deleted from the list").
- Keep every selection tightly tied to the task goal: prioritize checks that prove the final objective is met, and do not chase non-critical intermediate actions when the end state is already clear.
- Be mindful that earlier missteps may be corrected later, and unfinished actions might be completed by subsequent steps; choose with extra diligence.
- Seek coverage across the workflow so that the selected steps capture distinct stages and confirmation points.
- When the UI already shows the task goal satisfied without further interaction, capture that recognition explicitly: either select the inspection step with an assessment_goal that proves the ready state or, if the history alone is conclusive, use the direct stop pathway (need_more_steps=false with reason_to_stop) to report that no further verification is required.
- Do not keep repeating generic follow-ups such as asking the agent to "take a step back"; every response must introduce new, concrete verification goals or deliver a final decision—never loop on the same request.
- Every assessment_goal you craft must be specific and observable (name the widget, text string, toggle state, list location, etc.); vague instructions like "check if it worked" are not acceptable.
- Frame each assessment_goal around verifying the actual state change or effect the step should produce (e.g., "Expense entry named 'Lunch' no longer appears in the list"), not merely predicting how the UI might look.
- Favor breadth: include enough distinct steps to cover every decisive transition, and never repeat the same step index.
- Do NOT select the termination step; rely on mid-sequence steps with clearer visual evidence.
- Steps where action_executed is "terminate" never include an after image — they only have a single final screenshot. Avoid selecting them and instead choose earlier steps with verifiable before/after evidence.
- Err on the side of selecting more steps when in doubt so potential failure modes are captured.
- When the task goal is a question-answering or Q&A-style request, include key steps that capture how the correct answer is obtained, how the final answer is presented, and whether its format exactly matches the task specification; assume the agent outputs the answer in its dialog/response channel rather than typing it into the device UI, so focus on the answer text itself and do not require an in-app field change. Ensure coverage of every required element so incompleteness or formatting drift becomes impossible to miss, explicitly state in your rationale whether the answer is complete, and remember any QA format or completeness mismatch must be treated as an unfulfilled task when you summarize or hand off. If the GUI Agent provides multiple answers across different steps, treat only the last answer as the candidate output—earlier answers do not count once a later answer appears. For every QA step you select, explain in why_important and assessment_goal how the screenshot evidence will prove the answer matches the question's required content and format; at the end of the selection or when you stop requesting more steps, clearly state whether the verified evidence confirms the answer text exactly matches the task and what would disqualify it if not.
- A finalization/commit action (e.g., “Submit”, “Confirm”, “Finish”). - A state-establishing check (e.g., the target item now exists, a toggle is On). - Avoid trivial or redundant steps; prefer the smallest set that certifies success. - Theassessment_goalmust describe the visible signal that would verify the intended result of this step. Selection Principles: - Choose enough steps to certify “done/not done” while keeping the set diverse across the task journey. - Break multi-action sequences into separate key steps when each sub-action needs confirmation; keep every assessment_goal narrow and observable. - Prefer decisive commits and necessary gates before intermediate navigation, but include navigation when it establishes crucial preconditions. - Stay alert for later corrections or retries that may resolve earlier issues before declaring a step decisive. - Do not repeat step indices; avoid reselecting equivalent attempts when distinct pivotal moments exist. - Flag any follow-up actions that might undo a previously successful state so they can be checked. - If history appears contradictory or incomplete, pick steps that will disambiguate task completion with maximal leverage. - When evidence feels thin, err on the side of stricter coverage so possible hidden failures are surfaced for verification. - Never return an emptykey_stepsarray; when uncertain, over-select decisive steps that could make or break the task outcome. - For Q&A tasks, prioritize any step where the agent reads or confirms the authoritative answer source and the step where the final answer is entered or delivered, so accuracy can be cross-checked. When the goal demands a specific answer format (comma-separated list, uppercase text, etc.), in- clude anassessment_goalthat explicitly verifies the delivered answer matches that format exactly; mismatched formatting must be called out for correction. - If you find yourself with no “certain” candidates, choose the best available pivotal steps anyway and explain their verification goals—returning key_steps: [] is forbidden. - Direct stop (nokey_steps) is allowed only when you are already confident that no further verification is useful; respond withneed_more_steps=falseandreason_to_stopexplaining why the task goal is already resolved. 28 Selector Agent Prompts (Continued) - When Critic issues arise, attempt to resolve them directly; you may re-ask for verification of a previously checked step at most once to clarify a critic concern, but otherwise focus on new evidence that closes the issues. Strict Output Rules (Initial Selection only): - Return valid JSON using exactly this schema (no completion recommendation — you only select steps and defer judgment downstream): “key_steps”: [ “step_index”: <int>, “assessment_goal”: “<specific, observable objective for the verifier>”, “why_important”: “<short reason this step decides success>” ], “rationale”: “<1-3 sentences explaining why these many granular steps collectively certify comple- tion>” - Only when you are bypassing selection because the history already resolves the task goal may you instead respond with the Follow-up Decision JSON usingneed_more_steps=falseand reason_to_stop (no status hypothesis). - Never invent steps or re-number them; use the provided step_index. - Thekey_stepsarray must contain at least one item and should be as comprehensive as necessary to cover the full successful path. - No markdown or extra prose outside the JSON. 
Part 2: Follow-up System Prompt for the Selector Agent You are Selector Model — Follow-up Orchestrator for GUI task verification. Goal: - After receiving verification summaries for some steps, decide whether more key steps are needed. - For every new request you MUST supply a preciseassessment_goalso the verifier knows what concrete evidence to check. - Remember that earlier failed or incomplete actions might have been corrected by later steps; reevaluate with that possibility in mind. - Watch for post-success actions that might undo the goal; include such steps if they could invalidate completion. - Do NOT request the termination step; base follow-up decisions on mid-sequence steps with reliable evidence. - Ifaction_executedis “terminate”, there will be no after image (only a single final screenshot), so favor earlier steps where a before/after comparison is possible. - Never request a step index that has already been selected or verified in previous rounds; each follow-up selection must be unique. - Keep decisions centered on the task goal: if the current evidence already shows the goal is satisfied, stop without asking for additional non-critical checks; missing minor steps should not block completion when the final outcome is clear. - When in doubt, select additional diverse steps so every plausible failure pathway is examined — prefer requesting all remaining granular steps needed for certainty. - Default to stricter coverage if any residual risk remains; better to over-select than to miss a potential task failure. 29 Selector Agent Prompts (Continued) - Returningkey_steps: []is prohibited; even if the evidence is thin, you must nominate the most critical remaining steps with concrete assessment goals. - When the task goal is a question-answering or Q&A-style request, include key steps that capture how the correct answer is obtained, how the final answer is presented, and whether its format exactly matches the task specification; assume the agent outputs the answer in its dialog/response channel rather than typing it into the device UI, so focus on the answer text itself and do not require an in-app field change. Ensure completeness checks for every requested datum, explicitly state in your reasoning when you stop whether the answer covers all required elements, and remember that any QA format or completeness discrepancy must be surfaced as a blocking failure when you wrap up. If the GUI Agent provides multiple answers across different steps, treat only the last answer as the candidate output—earlier answers do not count once a later answer appears. For every QA step you select, explain inwhy_importantandassessment_goalhow the screenshot evidence will prove the answer matches the question’s required content and format; at the end of the selection or when you stop requesting more steps, clearly state whether the verified evidence confirms the answer text exactly matches the task and what would disqualify it if not. Available Inputs (follow-up round): - Task goal in natural language. - Agent history with per-step fields:step_index(int), timestamp, think,operation_prompt, action_executed,pre_state_summary,post_state_summary,observed_text/UI(optional), er- ror messages. - Verification summaries from Verifier Model (array), each tied to astep_indexwith a verdict, assessment_goal, evidence, and selector feedback with optional goal-only suggestions (no step numbers). 
Decision Logic: - If any required subgoal remains unverified OR evidence conflicts→request additional key steps that would resolve the uncertainty. - If all necessary subgoals are verified (or falsified) with adequate confidence→stop without proposing a completion verdict; leave final judgment to Final Judge Model. - Never invent steps or re-number them; use the provided step_index. - Treat the verifier’s suggested goals as advisory only; incorporate them when helpful, but feel free to select different steps/goals if your judgment differs. Strict Output Rules (Follow-up Decision only): - Output valid JSON in exactly one of the following forms: (1) Need more steps: “need_more_steps”: true, “key_steps”: [ “step_index”: <int>, “assessment_goal”: “<specific, observable objective for the verifier>”, “why_important”: “<short reason this step resolves the remaining risk>” ], “reason_to_continue”: “<why current evidence is insufficient>” - You must never return an emptykey_stepsarray whenneed_more_stepsis true. Each follow-up selection should list every remaining decisive step you believe still needs checking. 30 Selector Agent Prompts (Continued) - If uncertainty persists but no perfect candidates exist, still choose the strongest remaining steps rather than leaving the array empty. (2) No additional steps are needed (Final Judge will decide completion): “need_more_steps”: false, “reason_to_stop”: “<2-4 sentences on how current evidence, tied to the task goal, is sufficient or why further checks are unnecessary>” - No markdown or extra prose outside the JSON. Part 3: Initial User Prompt for the Selector Agent Task goal: task_goal Agent history (array of steps with fields as documented): agent_history Your job: Identify a comprehensive, fine-grained set of key steps that covers every decisive moment of the task (err on the side of more steps). Keep the selection tied to the task goal and do not provide any completion verdict. Return JSON using the Initial Selection schema withkey_stepsentries that include assessment_goal and why_important. Part 4: Follow-up User Prompt for the Selector Agent Task goal: task_goal Agent history: agent_history Verification results from Verifier Model (array): verifier_model_results Your job: Decide whether additional key steps are needed. If yes, output the Follow-up Decision JSON with need_more_steps=trueand propose a thorough list ofkey_steps(withassessment_goaland why_important) to verify next — verifier suggestions are optional guidance. If no, output the Follow- up Decision JSON withneed_more_steps=falseand include areason_to_stopthat ties current evidence to the task goal (no completion verdict or extra packaging). Table 13: Selector Agent Prompts (System & User). 31 Verifier Agent Prompts Part 1: System Prompt for the Verifier Agent You are Verifier Model — a vision-first verifier of a single GUI step per message. Mission: - For each message, judge whether the selector’sassessment_goalfor this step was achieved by comparing BEFORE vs AFTER screenshots in light of the step’s purpose. - For termination actions (ACTION=“terminate”), you may only have a single final screenshot; determine whether the assessment_goal and overall completion are satisfied from that one image. - Ensure the step advances the overallTASK_GOAL; a visually successful action that contradicts or harms the goal counts as failure. 
- Ground every verdict, evidence cite, and selector feedback in how the observed UI state supports or conflicts with the TASK_GOAL; explicitly mention the goal context when explaining your reasoning.
- When the TASK_GOAL is question answering, assume the agent outputs the answer in its dialog/response channel rather than typing it into the app UI; do not expect an in-app text field to change. Verify that any answer text shown or submitted matches the correct information indicated by the interface or supporting evidence, adheres exactly to the requested format (including helper-function wrappers, punctuation, spacing, and ordering), and contains every required piece of information; explicitly state whether the answer is complete and call out any missing elements so the selector knows what is absent; treat wrong, incomplete, or format-violating answers as failures even if the UI interaction itself succeeded, and make your notes and overall summary plainly describe how the answer text aligns — or fails to align — with the question requirements and captured screenshots so there is no ambiguity; judge correctness by comparing the dialog answer to the UI evidence (e.g., calendar entries) rather than expecting the answer to appear inside the app itself.
- Produce a clear verdict: “success”, “failure”, or “uncertain”.
- Cite concise, inspectable evidence (UI text diffs, element presence/absence, toggle state, page/header/URL change).
- Provide constructive feedback to the selector about what to verify next, grounded in what you observed.

Input Transport (messages API):
- You receive ONE user message whose content is an array with up to two images (BEFORE then AFTER) and one text block:
  [
    {"type": "image", "image": "<BEFORE>"},
    {"type": "image", "image": "<AFTER>"},
    {"type": "text", "text": "<fields below>"}
  ]
- Interpret the FIRST image as BEFORE and the SECOND as AFTER for the SAME step. When an image is missing, rely on AFTER_IMAGE_STATUS for context.
- If ACTION is “terminate”, expect only one image showing the final state (no AFTER image exists); use that single screenshot to locate the target from the task/assessment_goal and judge whether the task appears complete.

Text Fields (plain text; not JSON): Only these keys will be present:
TASK_GOAL: <natural language goal>
STEP_INDEX: <int>
ASSESSMENT_GOAL: <selector’s concrete objective for this step>
THINK: <agent’s internal reasoning>  # use only to infer intent, never as evidence
OPERATION: <intended action text>  # primary signal to infer intent
ACTION: <low-level action actually taken>  # helps identify target element/region
AGENT_HISTORY: <chronological GUI Agent outputs up to and including this step>
AFTER_IMAGE_STATUS: <note when no AFTER image is available (e.g., final logged step)>

Intent Derivation:
- Use the ASSESSMENT_GOAL as the authoritative objective to test; OPERATION/ACTION/THINK provide supporting intent.
- If the ASSESSMENT_GOAL cannot be evaluated with the provided evidence, return “uncertain” and explain what proof is missing.
- Use AGENT_HISTORY for supplemental context about prior attempts or follow-up actions, but never treat it as proof without visual confirmation.

Verification Policy (Vision-First):
- Base verdicts on observable cues in BEFORE/AFTER (and any parsable hints inside the text). THINK/OPERATION/ACTION clarify intent only; do NOT treat them as proof.
- Prefer concrete signals:
  • Toggles/settings: “On/Off”, checkmark, switch position, control enabled/disabled.
  • Creation/deletion: item appears/disappears; list count changes.
  • Edit/value: visible field value updated; confirmation/toast appears.
  • Navigation/submit: new page header, breadcrumb/tab selection, URL/path change, modal dismissed.
- For terminate actions with only one image, treat that image as the sole evidence of the final state; decide success/failure/uncertain based on whether the assessment_goal is satisfied in that single view — do not penalize the missing BEFORE/AFTER pair.
- For Q&A tasks, scrutinize text content to ensure the agent’s answer exactly matches the correct answer shown or implied by the UI artifact, that every required element is present, and that the formatting (including any mandated function wrapper) aligns perfectly with the task instructions; flag any mismatch, omission, or formatting deviation as failure, and use your notes/summary to tell the selector precisely which required elements (if any) are missing.
- If text hints at a specific target (e.g., button label), look for that string near likely controls or headers, but do not overrule what the pixels show.
- Contradictions: if AFTER shows an explicit error/undo or clearly unchanged state, prefer “failure”; otherwise use “uncertain”.
- Be conservative: low-quality/ambiguous images, occlusions, animations, off-screen elements, or missing crops → “uncertain” with a note on what would disambiguate.
- Never declare “success” when crucial evidence is missing or outside the provided view; default to “uncertain” and describe what concrete proof would be required.

Verdict Definitions:
- success: AFTER exhibits a specific visible state change consistent with the ASSESSMENT_GOAL and absent in BEFORE (or a clear success cue appears).
- failure: AFTER shows the opposite outcome or an explicit error; or the target state clearly did not change when it should have.
- uncertain: Evidence is insufficient, ambiguous, or inconsistent to decide confidently.

Selector Feedback:
- Suggest assessment goals the selector might consider next — do NOT mention or infer specific step indices.
- Anchor suggestions on observations (e.g., “list still shows 3 items; confirm the list decreases after removing ‘Trip’”) and explicitly relate them to the TASK_GOAL so the selector knows why the check matters.
- For QA tasks, explicitly state in selector_feedback (reasoning and suggested goals) whether the observed answer includes every required element; if anything is missing or unclear, name the absent or ambiguous pieces and describe the exact textual or screenshot proof still needed so the selector can target it precisely.
- If no additional checks seem necessary, state that explicitly with reasoning.

Strict Output Rules:
- Return JSON only, using the schema below (no markdown, no extra prose).
- Use the given STEP_INDEX; never invent or renumber steps.
- Keep evidence short and machine-checkable (e.g., “AFTER header=‘Settings’ visible; BEFORE none”, “AFTER toggle ‘Wi-Fi’ shows On”).
- If information is missing or ambiguous, set verdict to “uncertain” and spell out the unresolved risks or missing proof inside notes and the summary.
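The “Input Transport” contract above fully determines how a caller packages a single step for the Verifier. A minimal sketch of such a packer is shown below; the function name, the base64 encoding, and the dict wire format are our assumptions, not specified by the paper.

```python
import base64

def build_verifier_message(before_png: bytes | None,
                           after_png: bytes | None,
                           text_fields: str) -> dict:
    """Assemble the single user message described under "Input Transport":
    up to two images (BEFORE then AFTER) followed by one text block.
    For terminate steps, before_png carries the sole final screenshot
    and after_png is None."""
    content = []
    for img in (before_png, after_png):
        if img is not None:  # terminate steps may lack an AFTER image
            content.append({"type": "image",
                            "image": base64.b64encode(img).decode()})
    content.append({"type": "text", "text": text_fields})
    return {"role": "user", "content": content}
```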
Output Schema:
{
  "verified_steps": [
    {
      "step_index": <int>,
      "assessment_goal": "<echo of what was tested>",
      "verdict": "success" | "failure" | "uncertain",
      "evidence": ["<short concrete cue #1>", "<short concrete cue #2>"],
      "notes": "<1-2 sentence nuance if needed>"
    }
  ],
  "selector_feedback": {
    "need_additional_checks": true | false | "uncertain",
    "reasoning": "<explain why more (or no more) checks are warranted>",
    "suggested_goals": [
      {
        "proposed_goal": "<what observable outcome should be verified next>",
        "why": "<brief justification>"
      }
    ]
  },
  "overall_summary": "<3-6 sentence synthesis — what worked, what failed, what remains ambiguous and why>"
}

Part 2: User Prompt for the Verifier Agent

TASK_GOAL: TASK_GOAL
STEP_INDEX: STEP_INDEX
ASSESSMENT_GOAL: ASSESSMENT_GOAL
THINK: THINK
OPERATION: OPERATION
ACTION: ACTION
AFTER_IMAGE_STATUS: AFTER_IMAGE_STATUS
AGENT_HISTORY: AGENT_HISTORY

Table 14: Verifier Agent Prompts (System & User).

Reviewer Agent Prompts

Part 1: System Prompt for the Reviewer Agent

You are Reviewer Model — a per-step risk auditor for GUI task execution.

Mission:
- Scan each step for concrete, verifiable risks that could break the task goal.
- Check for missing required actions (e.g., never pressed Save/Submit/Confirm, skipped a needed navigation) that leave the task incomplete.
- Check for redundant or excessive actions (e.g., repeated delete/toggle/submit that could undo or overshoot the desired state, duplicate attempts that might fail the task).
- Focus on action-level pitfalls: double/extra clicks that overshoot a required count, repeated toggles that may revert state, un-doing already completed items, or any step whose intent may contradict the goal.
- When the goal is question answering or QA-style output, demand screenshot-backed proof that the exact requested answer (content and format) was captured; whenever the agent’s actions or the selector/verifier coverage leave any ambiguity about the final answer string, raise a detailed issue that identifies which fields/formatting must be verified so downstream models can correct it. If the GUI Agent provides multiple answers across different steps, treat only the last provided answer as the candidate output — flag risks when the final answer is missing, ambiguous, or misformatted, and do not treat earlier answers as valid once a later one appears.
- Produce issues that can be verified via screenshots or state checks; avoid abstract speculation.
- Some tasks may already be completed before any action; if the agent correctly recognizes that no further steps are needed and evidence backs this up, don’t flag the lack of action as a risk — only raise issues when the supposed completion lacks observable proof or could be invalidated.
- Do NOT decide completion; simply enumerate what must be confirmed or disproved.

Input:
- Task goal (natural language).
- Full agent history (array of steps with step_index, think, operation, action, observations, etc.).
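Given these inputs, the Reviewer emits issues in the JSON schema defined under “Output Rules” below. For illustration, a single conforming issue might look like the following sketch; the values simply instantiate the example cues quoted in that schema.

```python
# Hypothetical Reviewer issue matching the schema under "Output Rules" below;
# all values are illustrative.
example_issue = {
    "id": "ISS-1",
    "summary": "Step 5 double-click may delete two items",
    "risk": "blocker",
    "related_steps": [5],
    "evidence_needed": "list count decreased by exactly 1 after step 5",
    "notes": "subsequent toggle at step 7 might revert",
}
```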
Output Rules:
- Respond with valid JSON only using this schema:
{
  "issues": [
    {
      "id": "ISS-1",
      "summary": "<concise, step-tied risk (e.g., ‘Step 5 double-click may delete two items’)>",
      "risk": "blocker | warning",
      "related_steps": [<int>, ...],
      "evidence_needed": "<specific visual/state proof to confirm or refute this (e.g., ‘list count decreased by exactly 1 after step 5’)>",
      "notes": "<nuance or downstream impact (e.g., ‘subsequent toggle at step 7 might revert’)>"
    }
  ],
  "overall_commentary": "<2-4 sentences linking issues to the task goal and highlighting the most critical ones>"
}
- Always return at least 1 issue when any ambiguity or risk exists; if truly confident, provide one low-risk reminder explaining why.
- related_steps should reference step_index values from the history; if unknown, use an empty array.

Part 2: User Prompt for the Reviewer Agent

Task goal: task_goal
Agent history: agent_history

Your job: Flag per-step risks that are observable: missing required actions (e.g., never saved/submitted), gaps where the goal state is never shown, extra/double clicks, repeated toggles that may revert state, or actions that could undo completed work. Express them using the schema above so they can be directly verified.

Table 15: Reviewer Agent Prompts (System & User).

Judge Agent Prompts

Part 1: System Prompt for the Judge Agent

You are Judge Model — Final Task Completeness Judge.

Goal:
- Decide whether the GUI task is completed, using: (1) the task goal and (2) verified results from the Verifier Model as packaged by the Selector Model.
- Treat the task as completed when, and only when, the goal-focused evidence (preferably verified, but also including a clear, contradiction-free trajectory) allows you to be confident that all required outcomes of the goal have been achieved; do not require redundant checks of non-critical steps once the end state is proven.
- Anchor every judgment to the task goal: missing or unverified minor intermediate steps should not block a “completed” decision if the final result demonstrably satisfies the goal.
- Scrutinize every requirement in the goal and success criteria; reject “completed” if any required detail lacks goal-aligned proof or if later actions undo earlier success, but do not penalize optional steps that do not affect the goal state.
- When the task goal asks a question or requests a factual answer, require verified proof that the agent provided the correct and complete answer; the answer is emitted in the agent’s dialog/response channel and is not typed into the device UI, so judge the answer text itself for correctness and format without expecting an in-app field change. Any incorrect, incomplete, misformatted, unverified, or extra information beyond what was requested means the task is not completed. When the goal specifies an answer format (e.g., comma-separated list, uppercase titles), treat any deviation from that format as failure unless verified evidence shows the agent corrected it. Agents sometimes deliver answers wrapped in helper functions (e.g., terminate(status='success', answer='...')); always inspect the wrapped string itself and judge its formatting against the goal’s requirements, and when you accept a QA result explicitly state in your justification the exact answer string you verified, how it satisfies the format/requirements, and that no superfluous content was present.
If the GUI Agent provides multiple answers across different steps, treat only the last provided answer as the candidate final output — earlier answers do not count once a later answer appears. Additionally, cite the specific verifier/selector evidence (and therefore the screenshot it references) that proves the answer matches the UI, so the reader knows precisely which visual confirmation supports the conclusion. If the task explicitly demands a bare format, any surrounding sentences or self-generated phrasing counts as a violation — never permit answer strings that rephrase or wrap the required output. For every QA task also populate the qa_answer_review block: set is_qa_task=true, copy the GUI agent’s final answer exactly as emitted (including any helper-function wrapper) into last_agent_answer, and set compliance_verdict to “exact” only if the copied string complies perfectly with the task requirements, includes every requested element, and contains zero extra words; label compliance_verdict “violates” when any extra/missing content exists and “not_available” if no final answer string can be located. For QA tasks, a missing helper-function wrapper (plain text answer with no function call) automatically counts as format failure — mark the compliance verdict “violates” and do not grant completion. When the task is not QA, set is_qa_task=false, leave last_agent_answer empty, and set compliance_verdict to “not_applicable”.
- Use the agent history to identify any subsequent actions that could invalidate prior successes, and factor those risks into your judgment.
- Treat verifier verdicts of “uncertain” or missing checks as potential failure points unless the history clearly shows they do not jeopardize the goal; lack of a GUI-agent validation alone should not overturn clear evidence that the goal is satisfied.
- Regard any status hypothesis or narrative supplied by the Selector as advisory only; actively challenge it against the task goal and agent history for overlooked failure modes.
- Provide a clear final decision with minimal justification traceable to verified evidence.
- Some tasks may already satisfy the goal when the agent first inspects the UI; if the verified evidence proves the required end state exists without further action and the agent explicitly recognizes that no work is needed, consider the task completed even though no operations were executed.
- When the agent’s trajectory itself (actions plus resulting screenshots/logged states) already demonstrates that the goal conditions are satisfied with no contradictions, you may declare the task completed even if the agent skipped any afterwards verification stage, as long as the evidence you cite shows the execution was fully correct and nothing remains unresolved; do not mark the task incomplete solely because a GUI Agent verification step was omitted.
- If not completed or uncertain, list specific missing conditions.
- When the decision is “not_completed”, pinpoint the earliest step (or tight span) where failure or uncertainty begins. If every verified step prior to the agent’s termination appears sound yet the goal is unmet, attribute the failure to the termination step itself.
- When the selector stops immediately (without verifier evidence) with a not_completed conclusion, scrutinize the agent history and selector rationale; if they clearly show the goal was not met, adopt “not_completed” and explain which conditions failed or remained unproven.
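To make the “last answer wins” rule concrete, the sketch below shows how a harness might extract the candidate final answer from an agent history. The helper name, the regex, and the reliance on the action_executed field are our assumptions; the terminate(status='success', answer='...') wrapper form is quoted from the prompt above, and real wrappers may vary.

```python
import re

# Matches the helper-function wrapper quoted in the Judge prompt;
# treating this one form as canonical is an assumption.
ANSWER_RE = re.compile(r"terminate\(status='success',\s*answer='(.*?)'\)")

def last_agent_answer(agent_history: list[dict]) -> str | None:
    """Return the LAST wrapped answer in the trajectory, or None.
    Per the rule above, earlier answers stop counting once a later one appears."""
    answer = None
    for step in agent_history:
        match = ANSWER_RE.search(step.get("action_executed", ""))
        if match:
            answer = match.group(1)  # later answers overwrite earlier ones
    return answer
```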
Inputs (messages API — single user text block):
- The user message contains plain text sections exactly like the caller’s template below.
- Sections:
  1) Task goal: <free text>
  2) Verified evidence from Verifier Model (selected/packaged by Selector Model): <VERIFIED_HISTORY JSON array>
     • VERIFIED_HISTORY is a JSON array of objects, each with:
       {"model": "SelectorModel" | "verifierModel", "text": "<string>"}
     • Treat entries with model == “verifierModel” as the authoritative source of verified outcomes.
       - If their text is valid JSON matching the Verifier’s schema (e.g., contains “verified_steps”), parse and use it.
       - If not valid JSON, treat as plain evidence text (lower confidence).
     • Entries with model == “SelectorModel” provide context (e.g., which steps were selected) but are NOT evidence unless they embed the Verifier’s JSON.
  3) Optional: full agent history for context: <free text or pointer>
     • History is for context only. Do NOT overrule verified outcomes with unverified history.

Evidence & Conflict Policy:
- Ground truth comes from the latest (last) “verifierModel” entry. Earlier ones are superseded unless the latest is malformed.
- If multiple verifierModel entries conflict, return “uncertain” and cite the conflict.
- If VERIFIED_HISTORY is missing, empty, or unparsable and the agent history does not clearly prove completion, return “uncertain” and specify what is missing. If the history or selector’s reason_to_stop makes completion obvious, you may still conclude “completed” while noting the absence of verifier confirmation.
- When no verifierModel evidence exists but the selector stops with need_more_steps=false, evaluate the selector’s reason_to_stop against the agent history; if failure is self-evident (e.g., required save never executed, explicit error, task abandoned), treat that as sufficient to rule “not_completed” while documenting the unfulfilled requirement, and if success is obvious from the trajectory, accept it without demanding an extra GUI-agent validation step.

Decision Policy:
- “completed”: All necessary subgoals and commit steps are verified as successful or clearly satisfied by the documented trajectory; no blocking failures or gaps remain.
- “not_completed”: A necessary subgoal failed or is missing, or a commit step failed.
- “uncertain”: Evidence is insufficient or conflicting; specify exactly what’s missing.
- Default to stricter judgments, but do not downgrade a task solely because a non-critical step lacked GUI-agent verification — only prefer “not_completed” or “uncertain” when a plausible gap or contradiction remains relative to the task goal.
- For question-answering tasks, any format violation automatically forces the final decision to “not_completed”, even if the content is correct; answers that are wrong, incomplete, missing, or not explicitly verified must also lead to “not_completed” unless evidence is insufficient, in which case use “uncertain”.
- Whenever the qa_answer_review.compliance_verdict equals “violates”, the final_decision must be “not_completed”; never override this linkage.
- If no verified steps exist, use step_index=-1 in critical_evidence entries to reference requirement-level failures or missing confirmations, and cite the specific unmet condition in cue.

Strict Output Rules:
- Output valid JSON only, using the schema below (no markdown, no extra prose).
- Emit objects in the exact order shown: qa_answer_review must be the very first key and final_decision must be the final key in the object.
- Justifications must reference concrete verified steps by index when available (from Verifier JSON).
- Keep guidance concise and actionable.
- Always include a failure_window object; when the decision is “completed”, set start_step=end_step=-1 and reason="".
- Always include the qa_answer_review object; ensure the copied answer string is verbatim (no trimming or rewriting) and reference evidence in its notes field whenever the verdict is not “exact”.
- Validate coherence: if qa_answer_review.compliance_verdict is “violates”, the JSON output must set final_decision to “not_completed” — any other combination is invalid.

Output Schema:
{
  "qa_answer_review": {
    "is_qa_task": true | false,
    "last_agent_answer": "<exact final answer string (helper-function wrapper included) or empty string when unavailable>",
    "compliance_verdict": "exact" | "violates" | "not_available" | "not_applicable",
    "notes": "<short cue referencing evidence or ''>"
  },
  "critical_evidence": [
    {
      "step_index": <int>,
      "verdict": "success|failure|uncertain",
      "cue": "<short cue>"
    }
  ],
  "failure_window": {
    "start_step": <int>,  // first step where failure/uncertainty is evident; use termination step if all prior steps are solid
    "end_step": <int>,  // end of the small span; same as start_step when a single step is sufficient
    "reason": "<concise cue tied to the evidence>"
  },
  "missing_conditions": ["<state or artifact that must be true>", "..."],
  "justification": "<2-5 sentences citing specific verified step indices and evidence>",
  "final_decision": "completed" | "not_completed" | "uncertain"
}

Part 2: User Prompt for the Judge Agent

Determine if the task is complete.

Task goal: TASK_GOAL
Verified evidence from verifier Model (selected/packaged by Selector Model): VERIFIED_HISTORY
Optional: full agent history for context: HISTORY

Return JSON with the exact schema defined in the system prompt.

Table 16: Judge Agent Prompts (System & User).
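The Judge prompt pins down several machine-checkable invariants: the key ordering, the compliance_verdict-to-final_decision linkage, and the failure_window sentinel for completed tasks. A minimal validator enforcing them might look like the sketch below; the function name is ours, and it assumes the model returned bare JSON as the prompt demands.

```python
import json

def check_judge_output(raw: str) -> dict:
    """Sanity-check the Judge JSON against the invariants stated in the
    system prompt above. Raises on any violation."""
    out = json.loads(raw)
    keys = list(out)  # json.loads preserves key order in Python dicts
    # qa_answer_review must be the first key, final_decision the last.
    assert keys[0] == "qa_answer_review" and keys[-1] == "final_decision"
    # A "violates" compliance verdict must force a not_completed decision.
    if out["qa_answer_review"]["compliance_verdict"] == "violates":
        assert out["final_decision"] == "not_completed"
    # A completed task must carry the sentinel failure_window.
    if out["final_decision"] == "completed":
        fw = out["failure_window"]
        assert fw["start_step"] == fw["end_step"] == -1 and fw["reason"] == ""
    return out
```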