InfiGUIAgent is a Multimodal Large Language Model (MLLM)-based GUI agent designed for robust and efficient task automation on computing devices. Trained with a two-stage supervised fine-tuning pipeline, InfiGUIAgent excels at understanding and interacting with GUIs. The first stage focuses on foundational skills such as GUI comprehension and instruction grounding, while the second stage cultivates advanced reasoning capabilities, including hierarchical reasoning and expectation-reflection, using synthesized data. This enables InfiGUIAgent to perform complex multi-step GUI interactions, overcoming the limitations of existing agents that struggle with multi-step reasoning and rely on textual annotations.
InfiGUIAgent is trained in two stages. Stage 1 cultivates fundamental abilities using diverse datasets covering GUI understanding (element recognition and layout comprehension), question answering, instruction grounding, general knowledge, and tool usage. Stage 2 introduces native advanced reasoning, employed during both training and inference. This stage follows a cyclical process at each step, consisting of Reflection, Hierarchical Reasoning (strategic and tactical layers), Action, and Expectation. Each step receives the overall task, the history of previous screenshots and reasoning, and the current environment as input. Reflection assesses the previous action’s outcome against its expectation, while Expectation predicts the outcome of the current action for subsequent reflection.
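To make the cycle concrete, here is a minimal Python sketch of one Stage-2 reasoning step, assuming a generic MLLM interface. All names (`StepRecord`, `run_step`, `mllm.generate`, `mllm.generate_function_call`) are hypothetical illustrations, not the actual InfiGUIAgent implementation.

```python
# A minimal sketch of one reasoning step in the Stage-2 cycle described above.
# All class and method names here are hypothetical, not the actual InfiGUIAgent code.
from dataclasses import dataclass


@dataclass
class StepRecord:
    """What the agent keeps from one step for use in later reflections."""
    screenshot: bytes   # screenshot observed at this step
    reasoning: str      # strategic + tactical reasoning text
    action: dict        # the structured function call that was issued
    expectation: str    # predicted outcome, checked at the next step


def run_step(task: str, history: list[StepRecord], screenshot: bytes, mllm) -> StepRecord:
    """One Reflection -> Hierarchical Reasoning -> Action -> Expectation cycle."""
    # 1. Reflection: compare the previous action's expectation with the new observation.
    reflection = ""
    if history:
        prev = history[-1]
        reflection = mllm.generate(
            prompt=f"Task: {task}\nPrevious expectation: {prev.expectation}\n"
                   f"Did the new screen match it? Explain briefly.",
            images=[prev.screenshot, screenshot],
        )

    # 2. Hierarchical reasoning: a strategic sub-goal, then a tactical plan for this screen.
    strategic = mllm.generate(
        prompt=f"Task: {task}\nReflection: {reflection}\nWhat sub-goal should be pursued next?",
        images=[screenshot],
    )
    tactical = mllm.generate(
        prompt=f"Sub-goal: {strategic}\nWhich concrete GUI action achieves it on this screen?",
        images=[screenshot],
    )

    # 3. Action: emit a structured function call (e.g. tap/type/scroll with coordinates).
    action = mllm.generate_function_call(prompt=tactical, images=[screenshot])

    # 4. Expectation: predict the outcome so the next step can reflect on it.
    expectation = mllm.generate(
        prompt=f"Action to execute: {action}\nWhat should the screen look like afterwards?",
        images=[screenshot],
    )

    return StepRecord(screenshot=screenshot,
                      reasoning=f"{strategic}\n{tactical}",
                      action=action,
                      expectation=expectation)
```

The key point of the cycle is that the expectation produced at one step becomes the reference against which the reflection at the next step is evaluated.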
We gathered data covering several GUI tasks from multiple sources to ensure a comprehensive improvement of capabilities. The datasets can be categorized into five parts:
GUI Agents are trained to master advanced reasoning skills: (1) Hierarchical reasoning, which involves task decomposition into strategic and tactical layers for efficient execution, and (2) Expectation-reflection reasoning, enabling self-correction and consistent decision-making through iterative reflection and learning from past actions. These skills are integrated into training datasets for native reasoning. The interaction follows a standard protocol using function calls and responses:
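As a hypothetical illustration of such a function-call exchange (the exact schema used by InfiGUIAgent may differ), an action and its response could look like this:

```python
# A hypothetical illustration of a function-call exchange; field names and the
# coordinate convention are assumptions, not InfiGUIAgent's actual schema.
import json

# Agent -> environment: a structured action expressed as a function call.
function_call = {
    "name": "tap",                       # action type, e.g. tap / type / scroll
    "arguments": {"x": 540, "y": 1280},  # assumed absolute screen coordinates
}

# Environment -> agent: the execution result plus the next observation.
function_response = {
    "name": "tap",
    "status": "success",
    "observation": "screenshot_after_tap.png",  # placeholder for the post-action screenshot
}

print(json.dumps(function_call, indent=2))
print(json.dumps(function_response, indent=2))
```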
Below are the results of different models across three platforms (Mobile, Desktop, and Web) and two element types (Text and Icon) on ScreenSpot:
Below, we compare the success rates of InfiGUIAgent with those of open-source models on AndroidWorld:
We demonstrate the fundamental abilities trained in Stage 1 through three cases:
Below we provide two representative cases to demonstrate the reasoning and interaction process of InfiGUIAgent:
Please let us know if you find a mistake or are interested in contributing by e-mail: liuyuhang@zju.edu.cn.
If you find our work valuable for your research or applications, we would greatly appreciate a star ⭐ and a citation using the BibTeX entry provided below.
@article{liu2025infiguiagent,
title={InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection},
author={Liu, Yuhang and Li, Pengxiang and Wei, Zishu and Xie, Congkai and Hu, Xueyu and Xu, Xinchen and Zhang, Shengyu and Han, Xiaotian and Yang, Hongxia and Wu, Fei},
journal={arXiv preprint arXiv:2501.04575},
year={2025}
}