UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding
arXiv:2507.22025v4
Abstract
The emergence of Multimodal Large Language Models (MLLMs) has driven significant advances in Graphical User Interface (GUI) agent capabilities. Nevertheless, existing GUI agent training and inference techniques still suffer from three problems: a dilemma in reasoning design, ineffective rewards, and visual noise. To address these issues, we introduce UI-AGILE, which enhances GUI agents at both the training and inference stages.
Key Features of UI-AGILE
On the training side, UI-AGILE introduces three enhancements:
- Continuous Reward Function: This function incentivizes high-precision grounding, allowing for more accurate interactions with GUI elements.
- Simple Thinking Reward: A novel reward system that balances the need for planning with the necessity for speed and grounding accuracy, facilitating smoother agent performance.
- Cropping-Based Resampling Strategy: This technique mitigates the sparse reward problem and improves learning on complex tasks by cropping the input so that training focuses on a smaller, more manageable region.
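The continuous reward and the cropping-based resampling above can be illustrated with a short sketch. This is not the authors' implementation: the function names, the exponential decay, the `sharpness` parameter, and the `scale` parameter are all assumptions chosen for illustration. The first function gives zero reward for a click outside the target box and a reward that grows as the click approaches the box center. The second computes a smaller crop window that still contains the target, which is one plausible way to make a hard sample easier before resampling.

```python
import math


def continuous_grounding_reward(point, bbox, sharpness=4.0):
    """Hypothetical continuous reward (assumed form, not from the paper).

    Returns 0.0 when the predicted point is outside the target box,
    otherwise a value in (0, 1] that peaks at the box center.
    """
    x, y = point
    left, top, right, bottom = bbox
    if not (left <= x <= right and top <= y <= bottom):
        return 0.0
    cx, cy = (left + right) / 2, (top + bottom) / 2
    half_w, half_h = (right - left) / 2, (bottom - top) / 2
    # Offsets normalized to [0, 1]; degenerate boxes count as distance 0.
    dx = abs(x - cx) / half_w if half_w > 0 else 0.0
    dy = abs(y - cy) / half_h if half_h > 0 else 0.0
    d = max(dx, dy)  # Chebyshev distance from the center
    return math.exp(-sharpness * d * d)


def crop_window_containing(image_size, bbox, scale=0.5):
    """Hypothetical crop helper (assumed, not from the paper).

    Returns a crop window of roughly `scale` times the image size that
    contains `bbox`, plus the box re-expressed in crop coordinates.
    Assumes integer pixel coordinates, 0 < scale <= 1, and a box that
    lies inside the image.
    """
    width, height = image_size
    left, top, right, bottom = bbox
    # Never crop smaller than the target itself.
    crop_w = max(int(width * scale), right - left)
    crop_h = max(int(height * scale), bottom - top)
    cx, cy = (left + right) // 2, (top + bottom) // 2
    # Center the window on the target, then clamp it to the image bounds.
    x0 = min(max(cx - crop_w // 2, 0), width - crop_w)
    y0 = min(max(cy - crop_h // 2, 0), height - crop_h)
    shifted = (left - x0, top - y0, right - x0, bottom - y0)
    return (x0, y0, x0 + crop_w, y0 + crop_h), shifted
```

In a training loop, a sample whose rollouts all receive zero reward could be cropped with `crop_window_containing` and tried again, so that at least some rollouts produce a learning signal. Whether the paper triggers resampling in exactly this way is an assumption here.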
Inference Enhancements
In addition to the training improvements, UI-AGILE adds an inference-time method:
- Decomposed Grounding with Selection: This method significantly boosts grounding accuracy on high-resolution displays by breaking down images into smaller parts, allowing for more precise identification and interaction with GUI elements.
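A minimal sketch of the decompose-then-select idea follows, under stated assumptions. The tiling scheme, the `overlap` parameter, and the two callables `ground_fn` and `score_fn` are hypothetical stand-ins: `ground_fn` represents a grounding model run on one tile, and `score_fn` represents whatever selection step ranks the candidates. The source does not specify how either is implemented.

```python
def split_into_tiles(width, height, tile_w, tile_h, overlap=0):
    """Return (left, top, right, bottom) tiles that cover the image.

    Assumes overlap is smaller than both tile_w and tile_h.
    """
    step_x, step_y = tile_w - overlap, tile_h - overlap
    tiles = []
    y = 0
    while True:
        bottom = min(y + tile_h, height)
        x = 0
        while True:
            right = min(x + tile_w, width)
            tiles.append((x, y, right, bottom))
            if right >= width:
                break
            x += step_x
        if bottom >= height:
            break
        y += step_y
    return tiles


def decomposed_grounding(width, height, ground_fn, score_fn,
                         tile_w, tile_h, overlap=0):
    """Ground on each tile, map candidates to full-image coordinates,
    and return the highest-scoring point (or None if no tile matched).

    ground_fn(tile) -> (x, y) in tile-local coordinates, or None.
    score_fn(tile, local_point) -> float, higher is better.
    Both callables are hypothetical placeholders for model calls.
    """
    candidates = []
    for tile in split_into_tiles(width, height, tile_w, tile_h, overlap):
        local = ground_fn(tile)
        if local is None:
            continue
        left, top = tile[0], tile[1]
        # Translate the tile-local prediction back to the full image.
        point = (left + local[0], top + local[1])
        candidates.append((score_fn(tile, local), point))
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[0])[1]
```

Running the model on smaller tiles keeps each input close to the resolution the model handles well, at the cost of one grounding call per tile plus the selection step.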
Performance Results
Experiments show that UI-AGILE achieves state-of-the-art grounding performance on two benchmarks: ScreenSpot-Pro and ScreenSpot-v2. Notably, combining the training and inference enhancements yields a 23% improvement in grounding accuracy over the best baseline on ScreenSpot-Pro.
General Agent Capabilities
Beyond grounding performance, UI-AGILE exhibits robust general agent capabilities. Its design is intended to carry over to a wide range of tasks, which suggests potential for broader use.
Conclusion
The introduction of UI-AGILE marks a significant step forward in the development of GUI agents, addressing key challenges in training and inference processes. By leveraging continuous rewards, innovative resampling strategies, and refined grounding techniques, UI-AGILE stands out as a leader in advancing the capabilities of GUI agents.
Availability
The code for UI-AGILE is available in the project's GitHub repository.
