GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding
arXiv:2511.00810v3
Graphical user interface (GUI) grounding has emerged as a critical capability for computer-use agents, enabling them to map natural-language instructions to actionable regions on the screen. Traditional approaches, especially those built on Multimodal Large Language Models (MLLMs), typically treat GUI grounding as a text-based coordinate generation task. However, generating precise coordinates directly from visual inputs is difficult and tends to require large amounts of training data, which makes this formulation inefficient.
In light of these challenges, a more intuitive approach is to first identify the instruction-relevant visual patches and then determine the exact click location within them. This approach is motivated by recent findings suggesting that general MLLMs possess an inherent grounding ability embedded in their attention maps. To harness this capability, we introduce GUI-AIMA, an attention-based, coordinate-free supervised fine-tuning framework for efficient GUI grounding.
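The two-step idea above (select the relevant patch, then click inside it) can be illustrated with a minimal sketch. This is not the GUI-AIMA implementation: the function name `patch_scores_to_click`, the toy score grid, and the patch size of 28 pixels are all assumptions made for illustration, and the click is simply placed at the centre of the highest-scoring patch.

```python
from typing import List, Tuple


def patch_scores_to_click(
    scores: List[List[float]],
    patch_size: int,
) -> Tuple[float, float]:
    """Return an (x, y) pixel click point from a grid of patch relevance scores."""
    # Step 1: find the patch with the highest relevance score.
    best_row, best_col = max(
        ((r, c) for r in range(len(scores)) for c in range(len(scores[0]))),
        key=lambda rc: scores[rc[0]][rc[1]],
    )
    # Step 2: click the centre of that patch, in pixel coordinates.
    x = (best_col + 0.5) * patch_size
    y = (best_row + 0.5) * patch_size
    return x, y


# Hypothetical 3x4 grid of per-patch scores (rows x columns); values are made up.
scores = [
    [0.01, 0.02, 0.01, 0.00],
    [0.03, 0.10, 0.70, 0.05],
    [0.01, 0.02, 0.04, 0.01],
]
click = patch_scores_to_click(scores, patch_size=28)  # patch size is an assumption
print(click)  # (70.0, 42.0)
```

No coordinate text is generated at any point: the click location follows directly from which patch scores highest, which is what "coordinate-free" refers to here.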
Key Features of GUI-AIMA
- Intrinsic Multimodal Attention Alignment: GUI-AIMA aligns the intrinsic multimodal attention of MLLMs with patch-wise grounding signals. These signals are adaptively computed for a variety of user instructions.
- Multi-Head Aggregation: The framework applies multi-head aggregation to simplified query-visual attention matrices, combining the attention of individual heads into a single patch-level signal for each instruction.
- Coordinate-Free Integration: Because GUI-AIMA is coordinate-free, a plug-and-play zoom-in stage can be integrated seamlessly.
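The three features above can be sketched together in a few lines of Python. This is a rough illustration under stated assumptions, not the method from the paper: the per-head weights are hypothetical inputs (this summary does not say how GUI-AIMA derives them), "simplified" is approximated here by averaging attention over the instruction tokens, cross-entropy stands in for whatever alignment objective is actually used, and `zoom_box` is an invented helper that crops around the selected patch with a one-patch margin.

```python
import math
from typing import List, Tuple

# One attention head: rows are instruction (query) tokens, columns are visual patches.
Head = List[List[float]]


def aggregate_heads(heads: List[Head], head_weights: List[float]) -> List[float]:
    """Combine per-head query-to-patch attention into one distribution over patches."""
    num_patches = len(heads[0][0])
    total_weight = sum(head_weights)
    combined = [0.0] * num_patches
    for head, weight in zip(heads, head_weights):
        for p in range(num_patches):
            # Average over query tokens, then add this head's weighted share.
            mean_attn = sum(row[p] for row in head) / len(head)
            combined[p] += (weight / total_weight) * mean_attn
    norm = sum(combined)
    return [value / norm for value in combined]


def alignment_loss(attn: List[float], target: List[float], eps: float = 1e-12) -> float:
    """Cross-entropy between a patch-wise target distribution and the attention."""
    return -sum(t * math.log(a + eps) for t, a in zip(target, attn))


def zoom_box(
    patch_index: int, grid_cols: int, patch_size: int,
    img_w: int, img_h: int, margin: int = 1,
) -> Tuple[int, int, int, int]:
    """Pixel crop (left, top, right, bottom) around a patch, clipped to the image."""
    row, col = divmod(patch_index, grid_cols)
    left = max(0, (col - margin) * patch_size)
    top = max(0, (row - margin) * patch_size)
    right = min(img_w, (col + 1 + margin) * patch_size)
    bottom = min(img_h, (row + 1 + margin) * patch_size)
    return left, top, right, bottom


# Toy data: 2 heads, 2 query tokens, 8 patches laid out as 2 rows x 4 columns.
peaked = [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.65, 0.05]
uniform = [0.125] * 8
heads = [[peaked, peaked], [uniform, uniform]]
attn = aggregate_heads(heads, head_weights=[3.0, 1.0])  # weights are hypothetical

best = max(range(len(attn)), key=lambda p: attn[p])
target = [1.0 if p == 6 else 0.0 for p in range(8)]  # made-up patch-wise label
loss = alignment_loss(attn, target)
box = zoom_box(best, grid_cols=4, patch_size=28, img_w=112, img_h=56)
print(best, box)  # 6 (28, 0, 112, 56)
```

In a zoom-in pipeline of this kind, the crop given by `box` would be resized and passed through the same model a second time, so the second pass sees the target region at higher effective resolution. Since the model outputs patch scores rather than coordinate text, the second pass needs no change to the output format.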
GUI-AIMA-3B, a variant of the proposed framework, was trained using only 509,000 samples, covering approximately 101,000 screenshots. This data efficiency indicates that light training can effectively activate the native grounding capabilities of MLLMs.
Performance Metrics
The performance of GUI-AIMA-3B has been evaluated across several benchmarks, achieving state-of-the-art results among 3B models. The accuracy metrics are as follows:
- ScreenSpot-Pro: 61.5%
- ScreenSpot-v2: 92.1%
- OSWorld-G: 68.1%
- MMBench-GUI-L2: 79.1%
- UI-Vision: 60.0%
These results underscore the effectiveness of GUI-AIMA in translating natural language into precise GUI interactions, thereby enhancing the capabilities of computer-use agents.
Conclusion
In summary, GUI-AIMA represents a significant advancement in the field of GUI grounding. By leveraging intrinsic multimodal attention and a coordinate-free approach, this framework not only simplifies the training process but also enhances the overall performance of MLLMs in practical applications. For further insights and to access the project, visit the GUI-AIMA project page.
