Paper deep dive
AdaZoom-GUI: Adaptive Zoom-based GUI Grounding with Instruction Refinement
Siqi Pei, Liang Tang, Tiaonan Duan, Long Chen, Shuxian Li, Kaer Huang, Yanzhe Jing, Yiqiang Yan, Bo Zhang, Chenghao Jiang, Borui Zhang, Jiwen Lu
Abstract
GUI grounding is a critical capability for vision-language models (VLMs) that enables automated interaction with graphical user interfaces by locating target elements from natural language instructions. However, grounding on GUI screenshots remains challenging due to high-resolution images, small UI elements, and ambiguous user instructions. In this work, we propose AdaZoom-GUI, an adaptive zoom-based GUI grounding framework that improves both localization accuracy and instruction understanding. Our approach introduces an instruction refinement module that rewrites natural language commands into explicit and detailed descriptions, allowing the grounding model to focus on precise element localization. In addition, we design a conditional zoom-in strategy that selectively performs a second-stage inference on predicted small elements, improving localization accuracy while avoiding unnecessary computation and context loss on simpler cases. To support this framework, we construct a high-quality GUI grounding dataset and train the grounding model using Group Relative Policy Optimization (GRPO), enabling the model to predict both click coordinates and element bounding boxes. Experiments on public benchmarks demonstrate that our method achieves state-of-the-art performance among models with comparable or even larger parameter sizes, highlighting its effectiveness for high-resolution GUI understanding and practical GUI agent deployment.
Links
- Source: https://arxiv.org/abs/2603.17441v1
- Canonical: https://arxiv.org/abs/2603.17441v1
Full Text
AdaZoom-GUI: Adaptive Zoom-based GUI Grounding with Instruction Refinement

Siqi Pei, Liang Tang, Tiaonan Duan, Long Chen, Shuxian Li, Kaer Huang, Yanzhe Jing, Yiqiang Yan (Lenovo Research, Beijing, China); Bo Zhang, Chenghao Jiang, Borui Zhang, Jiwen Lu (Tsinghua University, Beijing, China)

Abstract

GUI grounding is a critical capability for vision-language models (VLMs) that enables automated interaction with graphical user interfaces by locating target elements from natural language instructions. However, grounding on GUI screenshots remains challenging due to high-resolution images, small UI elements, and ambiguous user instructions. In this work, we propose AdaZoom-GUI, an adaptive zoom-based GUI grounding framework that improves both localization accuracy and instruction understanding. Our approach introduces an instruction refinement module that rewrites natural language commands into explicit and detailed descriptions, allowing the grounding model to focus on precise element localization. In addition, we design a conditional zoom-in strategy that selectively performs a second-stage inference on predicted small elements, improving localization accuracy while avoiding unnecessary computation and context loss on simpler cases.
To support this framework, we construct a high-quality GUI grounding dataset and train the grounding model using Group Relative Policy Optimization (GRPO), enabling the model to predict both click coordinates and element bounding boxes. Experiments on public benchmarks demonstrate that our method achieves state-of-the-art performance among models with comparable or even larger parameter sizes, highlighting its effectiveness for high-resolution GUI understanding and practical GUI agent deployment.

arXiv:2603.17441v1 [cs.CV] 18 Mar 2026

1. Introduction

Graphical User Interfaces (GUIs) are designed to enhance user experience in human-device interaction, but they often overlook the need for automated processes. In many scenarios, GUIs provide the only interface through which users can perceive information and interact with devices. This makes it difficult for language-model-based agents that rely on structured representations such as HTML or accessibility trees (a11y trees) to interact with such systems effectively [9]. To address this limitation, extensive research has explored GUI agents that aim to automate interactions with graphical interfaces based on natural language instructions [18].

GUI grounding refers to the capability of vision-language models (VLMs) to understand natural language instructions and locate the corresponding target elements within GUI screenshots. As a fundamental capability for GUI agents, GUI grounding has the potential to transform the way humans interact with GUI-based devices. By converting vague instructions into precise coordinates, it enables automated execution of tasks that are otherwise accessible only through GUI interactions, where element locations and shapes may vary across interfaces. With the rapid development of vision-language models, general-purpose VLMs [2, 5] have demonstrated promising abilities on GUI grounding tasks.
Many existing approaches [3, 17, 22] build upon these models and further train them using supervised fine-tuning (SFT) and/or reinforcement learning. Although these methods achieve strong performance in certain scenarios, they still struggle with complex cases, particularly when dealing with high-resolution GUIs or instructions that require stronger semantic understanding.

Compared with images from other domains, GUI screenshots typically have much higher resolutions, while UI elements such as icons and text occupy relatively small regions. This leads to significant information loss during downsampling, making it difficult for grounding models to accurately localize small elements. To address this issue, some prior works [13, 25] adopt a two-stage strategy that zooms in on the screenshot around a predicted point using a fixed ratio. However, this fixed two-stage design introduces unnecessary computation for simple or low-resolution screenshots, and may even degrade performance due to the loss of contextual information outside the zoomed-in region.

To alleviate this problem, we propose a conditional zoom-in strategy that selectively applies zoom-in operations only when necessary. Specifically, we construct a high-quality GUI grounding dataset and train a model using Group Relative Policy Optimization (GRPO) [16] to predict both the location and size of target elements, the two pieces of information typically provided in grounding annotations. The zoom-in strategy is triggered only when the predicted target element is relatively small. In this way, the model can perform efficient single-stage inference for simple cases while applying more accurate two-stage inference for complex scenarios.

Another limitation of existing datasets [6, 8, 22] is that their natural language instructions are usually simple and straightforward, as high-quality instruction annotations are expensive to obtain.
In real-world applications, however, user instructions are often complex or ambiguous, requiring stronger semantic understanding. We argue that instruction comprehension and GUI grounding are two distinct capabilities. Training a model solely on grounding datasets with simple instructions may improve localization ability but does not necessarily enhance instruction understanding.

To address this issue, we introduce an instruction refinement model that rewrites the original instruction into a clearer and more detailed description of the target element before passing it to the grounding model. This design allows the grounding model to focus on answering "which exact point to click" rather than interpreting "which element should be clicked".

The contributions of this work are summarized as follows:

- We propose AdaZoom-GUI, a novel GUI grounding framework that integrates an instruction refinement module with a conditional zoom-in strategy. This design enables the model to apply efficient single-stage inference for simple cases and accurate two-stage inference for complex scenarios.
- We construct a high-quality GUI grounding dataset and train the grounding model using GRPO, enabling the model to predict both the location and size of target elements, which further supports the conditional zoom-in strategy.
- We evaluate our method on public benchmarks and demonstrate that it achieves state-of-the-art performance among models with comparable or even larger parameter sizes.

2. Method

The overall grounding pipeline of AdaZoom-GUI is illustrated in Figure 1. It consists of two components: an instruction refinement model and a grounding model. The details of each component are described in the following sections.

2.1. Instruction Refinement

The instruction refinement model is designed to rewrite the original natural language instruction into a more explicit and detailed command that can be more easily understood by the grounding model.
This is particularly important for complex instructions that may contain indirect references or ambiguous expressions. The refinement model takes the original instruction and the GUI screenshot as input and generates a refined instruction that explicitly describes the target element and the desired action. Specifically, the model refines the instruction along two dimensions:

1. Instruction interpretation. The model first interprets the instruction and identifies the exact target element to be grounded. For example, if the instruction is "launch a new file explorer", the refinement model may interpret it as "click on the 'New Window' option", converting a high-level instruction into a concrete GUI element.
2. Instruction enrichment. The model further enriches the instruction with additional visual or spatial details. For example, if the instruction is "click on the download icon", the refinement model may rewrite it as "click on the download icon (a downward-pointing arrow) located in the top-right corner of the Firefox browser toolbar, just to the right of the address bar". These additional descriptions, such as visual appearance or approximate location, help the grounding model identify the target element more accurately.

Figure 1. Overall grounding inference pipeline. Given a GUI screenshot and a natural language instruction, a refinement model first rewrites the instruction into a direct and detailed command. The grounding model then takes the rewritten instruction and the screenshot as input and outputs both the click-point and the bounding box of the target element. If the predicted bounding box is larger than a certain threshold, the click-point is directly used as the final output. Otherwise, the screenshot is zoomed in around the click-point and fed into the grounding model again for a second round of inference.

To achieve this, we employ a pre-trained vision-language model with the following prompt template:

Refinement prompt
System prompt: You are a helpful GUI assistant.
User prompt: You are given a task description and a screenshot of a GUI. The task can be completed with only one click. You need to find out the target to click, and then refine the task description to let the user easily locate the target on the screen. Possible refinements include adding location information, describing visual features (color, size, text, icon shape, ...), clarifying ambiguous terms, etc. Only reply with the refined description. Do not add explanations.
Task: instruction

2.2. Conditional Zoom-in Grounding

The grounding model is responsible for locating the target element on the GUI based on the refined instruction. It takes the refined instruction and the GUI screenshot as input and outputs both the click-point and the bounding box of the target element. The click-point represents the exact coordinate where the user should click, while the bounding box indicates the spatial extent of the target element. The prompt template for the grounding model is as follows:

Grounding prompt
System prompt: You are a GUI agent. You are given a task and a screenshot of the screen. You need to perform a pyautogui click/moveTo action to complete the task, and then provide the bounding box of the target object.
The answer format is `pyautogui.click(x=?, y=?), <|box_start|>(x1,y1),(x2,y2)<|box_end|>`. If the task is infeasible (e.g., the task is already completed, the target does not exist in the image, or the instruction is unrelated to the screenshot), output a null action exactly as follows: `pyautogui.click(x=0, y=0), <|box_start|>(0,0),(0,0)<|box_end|>`.
User prompt: Please complete the following tasks by clicking using `pyautogui.click` and returning the bounding box:
Task: refined_instruction

2.2.1. Dataset Construction

To train the grounding model, we construct a dataset consisting of GUI screenshots paired with natural language instructions and their corresponding bounding boxes. We collect a diverse set of screenshots from various applications and annotate them with instructions describing the intended actions. Each instruction is associated with a bounding box indicating the location of the target element.

To further increase instruction diversity and improve model robustness, we use a large language model (LLM) to augment the instructions by generating variations with or without location descriptions, intention cues, or other contextual information. In addition, the screenshots are augmented with padding and resizing so that the model can handle different screen resolutions.

2.2.2. Training Details

We train the grounding model using Group Relative Policy Optimization (GRPO) [16]. The reward function consists of two components: a point reward R_point and a bounding box reward R_bbox:

R = λ · R_point + (1 − λ) · R_bbox,  0 ≤ λ ≤ 1.  (1)

The point reward R_point includes a format reward R_f,point and a point-in-box reward R_point-in-box. The format reward verifies whether the output format of the click-point is correct, ensuring that the prediction follows the required format `pyautogui.click(x=?, y=?), ...` and that the coordinates are valid.
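To make the output contract and the reward terms concrete, here is a minimal Python sketch of a parser for the required answer format and of the rewards in Eqs. (1)-(3). The helper names (`parse_prediction`, `total_reward`) and the default λ = 0.5 are illustrative assumptions, not the authors' implementation; the format rewards are folded in simply as a zero reward on parse failure.

```python
import re

# Matches the required format:
# pyautogui.click(x=?, y=?), <|box_start|>(x1,y1),(x2,y2)<|box_end|>
PRED_RE = re.compile(
    r"pyautogui\.click\(x=(\d+),\s*y=(\d+)\),\s*"
    r"<\|box_start\|>\((\d+),(\d+)\),\((\d+),(\d+)\)<\|box_end\|>"
)

def parse_prediction(text):
    """Return ((x, y), (x1, y1, x2, y2)), or None if the format is invalid."""
    m = PRED_RE.fullmatch(text.strip())
    if m is None:
        return None
    x, y, x1, y1, x2, y2 = map(int, m.groups())
    return (x, y), (x1, y1, x2, y2)

def point_in_box_reward(point, gt_box):
    """Eq. (2): 1 if the predicted click-point lies inside the ground-truth box."""
    x, y = point
    x1, y1, x2, y2 = gt_box
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0

def iou_reward(box, gt_box):
    """Eq. (3): intersection-over-union of predicted and ground-truth boxes."""
    ax1, ay1, ax2, ay2 = box
    bx1, by1, bx2, by2 = gt_box
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def total_reward(text, gt_box, lam=0.5):
    """Eq. (1): R = lam * R_point + (1 - lam) * R_bbox."""
    parsed = parse_prediction(text)
    if parsed is None:
        return 0.0  # format reward fails for both terms
    point, box = parsed
    return lam * point_in_box_reward(point, gt_box) + (1 - lam) * iou_reward(box, gt_box)
```

A prediction that exactly matches the ground truth scores 1.0, a malformed string scores 0.0, and partial credit comes from the IoU term; how the paper weights the format sub-rewards inside R_point and R_bbox is not specified, so this sketch treats format failure as zeroing the whole reward.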
The point-in-box reward checks whether the predicted click-point lies within the ground-truth bounding box:

R_point-in-box = 1 if x1 ≤ x̂ ≤ x2 and y1 ≤ ŷ ≤ y2, and 0 otherwise,  (2)

where (x̂, ŷ) denotes the predicted click-point, and (x1, y1) and (x2, y2) are the coordinates of the ground-truth bounding box.

The bounding box reward R_bbox similarly consists of a format reward R_f,bbox and an IoU reward R_IoU. The format reward checks whether the bounding box prediction follows the required format `..., <|box_start|>(x1,y1),(x2,y2)<|box_end|>`. The IoU reward measures the overlap between the predicted bounding box and the ground-truth bounding box:

R_IoU = Area of Intersection / Area of Union.  (3)

2.2.3. Conditional Zoom-In Strategy

During inference, we employ a conditional zoom-in strategy to improve localization accuracy for small target elements while avoiding unnecessary computation for large elements. The decision of whether to apply zoom-in is based on the size of the predicted bounding box (x̂1, ŷ1), (x̂2, ŷ2). If the width or height of the predicted bounding box is smaller than a predefined threshold, and the other side of the bounding box is smaller than another, larger threshold, the element is considered small and may be difficult to localize accurately in the original screenshot. In this case, zoom-in is applied while ensuring that sufficient surrounding context is preserved. The zoom-in condition is formulated as:

(ŵ ≤ α ∧ ĥ ≤ β) ∨ (ĥ ≤ α ∧ ŵ ≤ β),  α < β,  (4)

where ŵ = x̂2 − x̂1 and ĥ = ŷ2 − ŷ1 denote the predicted bounding box width and height, respectively, and α and β are the smaller and larger thresholds.

If the condition is satisfied, the screenshot is zoomed in around the predicted click-point with a predefined ratio and resized back to the original input resolution. The zoomed-in image is then fed into the grounding model for a second round of inference. By providing a closer view of the relevant region, this strategy improves the model's ability to localize small UI elements such as icons or text, while maintaining efficiency for simpler cases.

3. Experiments

We conduct extensive experiments to evaluate the effectiveness of the proposed AdaZoom-GUI. Our approach is compared with several state-of-the-art methods on benchmark datasets to demonstrate its advantages in grounding accuracy, particularly for high-resolution screenshots and complex GUI tasks.

3.1. Experimental Setup

Our grounding model is trained based on Qwen3-VL-4B-Instruct [2] using GRPO implemented in the ms-swift framework [24]. The instruction refinement module uses Qwen3.5-397B-A17B [14].

3.2. Main Results

We evaluate AdaZoom-GUI on the ScreenSpot-Pro [10] benchmark, which contains high-resolution GUI screenshots from multiple professional domains. The results are summarized in Table 1.

Table 1. Performance comparison on the ScreenSpot-Pro benchmark (Text/Icon accuracy per domain).

| Model | Development Text | Development Icon | Creative Text | Creative Icon | CAD Text | CAD Icon | Scientific Text | Scientific Icon | Office Text | Office Icon | OS Text | OS Icon | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **General Models** | | | | | | | | | | | | | |
| GPT-4o [7] | 2.0 | 0.0 | 1.3 | 0.0 | 1.0 | 0.0 | 2.1 | 0.0 | 1.1 | 0.0 | 0.0 | 0.0 | 0.8 |
| Claude 3.7 Sonnet [1] | - | - | - | - | - | - | - | - | - | - | - | - | 27.7 |
| Operator [12] | 50.0 | 19.3 | 51.5 | 23.1 | 16.8 | 14.1 | 58.3 | 24.5 | 60.5 | 28.3 | 34.6 | 30.3 | 36.6 |
| Gemini-3-Pro [5] | - | - | - | - | - | - | - | - | - | - | - | - | 72.7 |
| Seed1.8 [15] | - | - | - | - | - | - | - | - | - | - | - | - | 73.1 |
| Qwen3-VL-4B-Instruct [2] | - | - | - | - | - | - | - | - | - | - | - | - | 59.5 |
| Qwen3-VL-32B-Instruct [2] | - | - | - | - | - | - | - | - | - | - | - | - | 57.9 |
| Qwen3-VL-235B-A22B-Instruct [2] | - | - | - | - | - | - | - | - | - | - | - | - | 62.0 |
| Qwen3.5-397B-A17B [14] | - | - | - | - | - | - | - | - | - | - | - | - | 65.6 |
| **GUI Models** | | | | | | | | | | | | | |
| InfiGUI-R1-3B [11] | 51.3 | 12.4 | 44.9 | 7.0 | 33.0 | 14.1 | 58.3 | 20.0 | 65.5 | 28.3 | 43.9 | 12.4 | 35.7 |
| GUI-Actor-7B [20] | 59.1 | 15.9 | 59.6 | 16.1 | 47.7 | 9.4 | 70.1 | 25.5 | 69.5 | 41.5 | 55.1 | 19.1 | 44.6 |
| GUI-G2-7B [17] | 68.8 | 17.2 | 57.1 | 15.4 | 55.8 | 12.5 | 77.1 | 24.5 | 74.0 | 32.7 | 57.9 | 21.3 | 47.5 |
| OpenCUA-7B [19] | - | - | - | - | - | - | - | - | - | - | - | - | 50.0 |
| GTA1-7B [22] | 66.9 | 20.7 | 62.6 | 18.2 | 53.3 | 17.2 | 76.4 | 31.8 | 82.5 | 50.9 | 48.6 | 25.9 | 50.1 |
| GUI-Owl-7B [23] | 76.6 | 31.0 | 59.6 | 27.3 | 64.5 | 21.9 | 79.1 | 37.3 | 77.4 | 39.6 | 59.8 | 33.7 | 54.9 |
| Holo2-4B [4] | - | - | - | - | - | - | - | - | - | - | - | - | 57.2 |
| MAI-UI-8B [25] | 83.8 | 52.4 | 76.3 | 33.6 | 72.6 | 35.9 | 79.9 | 37.3 | 88.7 | 60.4 | 76.6 | 49.4 | 65.8 |
| MAI-UI-8B (Zoom-in) [25] | 78.6 | 58.6 | 78.8 | 46.9 | 80.7 | 43.8 | 86.1 | 49.1 | 88.1 | 81.1 | 76.6 | 51.7 | 70.9 |
| **Ours** | | | | | | | | | | | | | |
| AdaZoom-GUI-4B | 81.2 | 40.0 | 74.2 | 23.8 | 61.4 | 21.9 | 87.5 | 39.1 | 88.7 | 20.9 | 83.2 | 37.1 | 61.6 |
| + Conditional Zoom-In | 87.7 | 47.6 | 83.3 | 38.5 | 77.7 | 32.8 | 88.9 | 49.1 | 94.4 | 66.0 | 80.4 | 53.9 | 70.6 |
| + Qwen3.5-397B-A17B [14] Refinement | 91.6 | 57.9 | 88.9 | 49.7 | 83.2 | 39.1 | 91.0 | 51.8 | 97.2 | 81.1 | 89.7 | 60.7 | 76.8 |

Our base model achieves an average score of 70.6 when equipped with the conditional zoom-in strategy. When combined with the instruction refinement module based on Qwen3.5-397B-A17B, the performance further improves to 76.8. This result surpasses several state-of-the-art GUI grounding models with comparable parameter sizes and is competitive with much larger models. These results demonstrate that the proposed framework effectively improves grounding performance on complex and high-resolution GUI scenarios.

The benefit of the conditional zoom-in strategy becomes more evident on the ScreenSpot-v2 benchmark [21], where screenshots typically have much lower resolutions than those in ScreenSpot-Pro. As shown in Table 2, our base model achieves an average score of 0.943. However, when an unconditioned zoom-in strategy is applied, the performance drops to 0.916. This degradation likely occurs because the tasks in ScreenSpot-v2 are relatively simple and can be accurately grounded using the original screenshots. In such cases, zooming into already low-resolution images introduces additional information loss.

Table 2. Performance comparison on the ScreenSpot-v2 benchmark.

| Model | Avg. Score |
|---|---|
| Ours-4B | 0.943 |
| + Unconditioned Zoom-In | 0.916 |
| + Conditional Zoom-In | 0.945 |

In contrast, the proposed conditional zoom-in mechanism selectively applies zoom-in only when necessary, avoiding unnecessary cropping and preserving global context. As a result, it achieves a higher average score of 0.945. This observation highlights the importance of adaptive zoom-in decisions, which effectively balance the trade-off between contextual information and fine-grained visual details.
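The zoom-in decision of Eq. (4) and the coordinate bookkeeping of the second pass can be sketched in a few lines. The threshold defaults, the crop ratio, and the clamped center-on-click policy below are assumptions for illustration; the paper states only that the crop is taken around the predicted click-point with a predefined ratio and resized back to the input resolution.

```python
def should_zoom(box, alpha=40, beta=120):
    """Eq. (4): zoom only when one side is below alpha and the other below beta (alpha < beta)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return (w <= alpha and h <= beta) or (h <= alpha and w <= beta)

def crop_window(click, image_size, ratio=3.0):
    """Crop region of size (W/ratio, H/ratio) centered on the click-point, clamped to the image."""
    (cx, cy), (W, H) = click, image_size
    cw, ch = W / ratio, H / ratio
    x1 = min(max(cx - cw / 2, 0), W - cw)
    y1 = min(max(cy - ch / 2, 0), H - ch)
    return x1, y1, x1 + cw, y1 + ch

def to_full_coords(point_in_crop, window, image_size):
    """Map a second-stage prediction (in resized-crop coordinates) back to the full screenshot."""
    px, py = point_in_crop
    x1, y1, x2, y2 = window
    W, H = image_size
    # The crop is resized back to (W, H) before the second inference pass,
    # so a predicted point must be scaled by the crop size and offset by its origin.
    return x1 + px * (x2 - x1) / W, y1 + py * (y2 - y1) / H
```

In a full pipeline, the first-stage bounding box feeds `should_zoom`; when it fires, the screenshot is cropped via `crop_window`, resized, re-grounded, and the second-stage click-point is mapped back with `to_full_coords`.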
We also evaluate the contribution of the instruction refinement module independently. Using only the Qwen3.5-397B-A17B refinement model, our system achieves an average score of 69.7, which already exceeds the performance of the original Qwen3.5-397B-A17B model (65.6) on the same benchmark. When combined with the conditional zoom-in mechanism, the performance further increases to 76.8. Notably, this improvement is achieved by introducing only an additional 4B grounding model alongside the 397B refinement model. This result suggests that the proposed framework can effectively leverage the reasoning capability of large models while delegating precise localization to a smaller specialized grounding model. Consequently, our approach provides a cost-efficient way to enhance the GUI grounding capability of existing large models without requiring extensive fine-tuning or additional large-scale data collection.

4. Conclusion

In this work, we present AdaZoom-GUI, a novel framework for GUI grounding that addresses the challenges of high-resolution screenshots, small UI elements, and ambiguous user instructions. Our approach decomposes the task into two complementary components: an instruction refinement module that converts natural language tasks into explicit, visually grounded descriptions, and a grounding model that predicts both click coordinates and bounding boxes of target elements. To further improve localization accuracy while maintaining efficiency, we introduced a conditional zoom-in strategy that selectively performs a second-stage inference when the predicted element size indicates that a closer inspection is necessary.
To support this framework, we constructed a high-quality GUI grounding dataset and trained the grounding model using Group Relative Policy Optimization (GRPO), enabling the model to jointly optimize click-point accuracy and bounding box prediction. Extensive experiments on public benchmarks demonstrate that the proposed method significantly improves grounding performance compared with existing approaches. In particular, our method achieves competitive or superior results to models with comparable or even larger parameter sizes, while maintaining an efficient inference pipeline.

Overall, our results highlight the importance of separating instruction comprehension from grounding and adapting inference strategies based on element scale. These findings suggest a practical and scalable direction for improving GUI agents and enabling more reliable interaction with real-world graphical interfaces. Future work may explore extending this framework to multi-step GUI tasks and broader interactive agent settings.

References

[1] Anthropic. Claude 3.7 Sonnet, 2025.
[2] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan Liu, Dunjie Lu, Ruilin Luo, Chenxu Lv, Rui Men, Lingchen Meng, Xuancheng Ren, Xingzhang Ren, Sibo Song, Yuchong Sun, Jun Tang, Jianhong Tu, Jianqiang Wan, Peng Wang, Pengfei Wang, Qiuyue Wang, Yuxuan Wang, Tianbao Xie, Yiheng Xu, Haiyang Xu, Jin Xu, Zhibo Yang, Mingkun Yang, Jianxin Yang, An Yang, Bowen Yu, Fei Zhang, Hang Zhang, Xi Zhang, Bo Zheng, Humen Zhong, Jingren Zhou, Fan Zhou, Jing Zhou, Yuanzhi Zhu, and Ke Zhu. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.
[3] Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Li YanTao, Jianbing Zhang, and Zhiyong Wu. SeeClick: Harnessing GUI grounding for advanced visual GUI agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9313-9332, 2024.
[4] H Company. Holo2: Foundational models for navigation and computer use agents, 2025.
[5] Google DeepMind. Gemini 3 Pro model card, 2025.
[6] Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for GUI agents. arXiv preprint arXiv:2410.05243, 2024.
[7] Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
[8] Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem AlShikh, and Ruslan Salakhutdinov. OmniACT: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. In European Conference on Computer Vision, pages 161-178. Springer, 2024.
[9] Geunwoo Kim, Pierre Baldi, and Stephen McAleer. Language models can solve computer tasks. Advances in Neural Information Processing Systems, 36:39648-39677, 2023.
[10] Kaixin Li, Meng Ziyang, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. ScreenSpot-Pro: GUI grounding for professional high-resolution computer use. In Workshop on Reasoning and Planning for Large Language Models, 2025.
[11] Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, and Fei Wu. InfiGUI-R1: Advancing multimodal GUI agents from reactive actors to deliberative reasoners. arXiv preprint arXiv:2504.14239, 2025.
[12] OpenAI. Operator, 2025.
[13] Joonhyung Park, Peng Tang, Sagnik Das, Srikar Appalaraju, Kunwar Yashraj Singh, R. Manmatha, and Shabnam Ghadar. R-VLM: Region-aware vision language model for precise GUI grounding. In Findings of the Association for Computational Linguistics: ACL 2025, pages 9669-9685, 2025.
[14] Qwen Team. Qwen3.5: Towards native multimodal agents, 2026.
[15] ByteDance Seed. Seed1.8 model card: Towards generalized real-world agency. Technical report, 2025.
[16] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
[17] Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, et al. GUI-G2: Gaussian reward modeling for GUI grounding. arXiv preprint arXiv:2507.15846, 2025.
[18] Shuai Wang, Weiwen Liu, Jingxuan Chen, Yuqi Zhou, Weinan Gan, Xingshan Zeng, Yuhan Che, Shuai Yu, Xinlong Hao, Kun Shao, et al. GUI agents with foundation models: A comprehensive survey. arXiv preprint arXiv:2411.04890, 2024.
[19] Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, et al. OpenCUA: Open foundations for computer-use agents. arXiv preprint arXiv:2508.09123, 2025.
[20] Qianhui Wu, Kanzhi Cheng, Rui Yang, Chaoyun Zhang, Jianwei Yang, Huiqiang Jiang, Jian Mu, Baolin Peng, Bo Qiao, Reuben Tan, et al. GUI-Actor: Coordinate-free visual grounding for GUI agents. arXiv preprint arXiv:2506.03143, 2025.
[21] Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. OS-Atlas: A foundation action model for generalist GUI agents. arXiv preprint arXiv:2410.23218, 2024.
[22] Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, et al. GTA1: GUI test-time scaling agent. arXiv preprint arXiv:2507.05791, 2025.
[23] Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, et al. Mobile-Agent-v3: Fundamental agents for GUI automation. arXiv preprint arXiv:2508.15144, 2025.
[24] Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, Wenmeng Zhou, and Yingda Chen. SWIFT: A scalable lightweight infrastructure for fine-tuning, 2024.
[25] Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, et al. MAI-UI technical report: Real-world centric foundation GUI agents. arXiv preprint arXiv:2512.22047, 2025.