UI-AGILE: Boost GUI Agents with RL & Precise Grounding

UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

Source: arXiv:2507.22025v4 (announce type: replace)

Abstract

The emergence of Multimodal Large Language Models (MLLMs) has driven significant advances in Graphical User Interface (GUI) agent capabilities. Nevertheless, existing training and inference techniques for GUI agents still suffer from three problems: a dilemma in reasoning design, ineffective rewards, and visual noise. To address these issues, we introduce UI-AGILE, which enhances GUI agents at both the training and inference stages.

Key Features of UI-AGILE

On the training side, UI-AGILE introduces three enhancements:

Continuous Reward Function: This function incentivizes high-precision grounding, allowing for more accurate interactions with GUI elements.
Simple Thinking Reward: A reward that balances the need for planning against speed and grounding accuracy.
Cropping-Based Resampling Strategy: This technique mitigates the sparse reward problem and improves learning on complex tasks by resampling from cropped, smaller views of the input.

Inference Enhancements

In addition to the training improvements, UI-AGILE adds a strategy for inference:

Decomposed Grounding with Selection: This method significantly boosts grounding accuracy on high-resolution displays by breaking down images into smaller parts, allowing for more precise identification and interaction with GUI elements.
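
The decompose-then-select flow described above can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: `ground_fn` and `score_fn` are hypothetical callables standing in for the grounding model and the selection step, and the tiling scheme is an assumption made for this example.

```python
from typing import Callable, Optional

Point = tuple[float, float]
Tile = tuple[int, int, int, int]  # (left, top, right, bottom)


def split_into_tiles(width: int, height: int, tile_w: int, tile_h: int,
                     overlap: int = 0) -> list[Tile]:
    """Cover a width x height image with tiles, optionally overlapping."""
    step_x, step_y = tile_w - overlap, tile_h - overlap
    if step_x <= 0 or step_y <= 0:
        raise ValueError("overlap must be smaller than the tile size")
    tiles: list[Tile] = []
    y = 0
    while True:
        bottom = min(y + tile_h, height)
        x = 0
        while True:
            right = min(x + tile_w, width)
            tiles.append((x, y, right, bottom))
            if right >= width:
                break
            x += step_x
        if bottom >= height:
            break
        y += step_y
    return tiles


def decomposed_grounding(width: int, height: int,
                         ground_fn: Callable[[Tile], Optional[Point]],
                         score_fn: Callable[[Point], float],
                         tile_w: int, tile_h: int,
                         overlap: int = 0) -> Optional[Point]:
    """Ground on each tile, map hits to full-image coordinates, keep the best.

    ground_fn (hypothetical) receives a tile rectangle and returns a point in
    tile-local coordinates, or None if it finds nothing in that tile.
    score_fn (hypothetical) rates a full-image candidate; higher is better.
    """
    candidates: list[Point] = []
    for tile in split_into_tiles(width, height, tile_w, tile_h, overlap):
        local = ground_fn(tile)
        if local is not None:
            # Shift from tile-local back to full-image coordinates.
            candidates.append((tile[0] + local[0], tile[1] + local[1]))
    if not candidates:
        return None
    return max(candidates, key=score_fn)
```

In a real pipeline, `ground_fn` would crop the screenshot to the tile and query the agent model, and `score_fn` would come from whatever selection mechanism is used. Passing rectangles instead of pixel data keeps the sketch free of image-library dependencies.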

Performance Results

In the reported experiments, UI-AGILE achieves state-of-the-art grounding performance on two benchmarks: ScreenSpot-Pro and ScreenSpot-v2. Notably, combining the training and inference enhancements yields a 23% improvement in grounding accuracy over the best baseline on ScreenSpot-Pro.

General Agent Capabilities

Beyond grounding performance, UI-AGILE is also reported to exhibit strong general agent capabilities, which suggests its design applies to a wider range of agent tasks than grounding alone.

Conclusion

UI-AGILE addresses key challenges in both the training and inference of GUI agents. It combines a continuous reward, a Simple Thinking reward, and cropping-based resampling during training with decomposed grounding at inference time to improve grounding accuracy.

Availability

For those interested in exploring UI-AGILE further, the code is available in the project's GitHub repository.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
