GUI-AIMA: Efficient Multimodal Attention for GUI Grounding

GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding

arXiv:2511.00810v3

Graphical user interface (GUI) grounding is a core capability of computer-use agents: it maps natural-language instructions to actionable regions on the screen. Existing approaches built on Multimodal Large Language Models (MLLMs) typically treat GUI grounding as a text-based coordinate generation task. Generating precise coordinates directly from visual inputs is difficult, however, and learning to do so tends to require large amounts of training data, which makes the approach inefficient.

A more intuitive approach is to first identify the visual patches relevant to the instruction and then determine the exact click location within them. It is motivated by recent findings that general MLLMs have an inherent grounding ability embedded in their attention maps. To exploit this ability, the authors introduce GUI-AIMA, an attention-based, coordinate-free supervised fine-tuning framework for efficient GUI grounding.
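The two-step idea (pick instruction-relevant patches, then locate the click inside them) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `click_from_patch_scores`, the 14-pixel patch size, the top-k selection, and the score-weighted centroid are all assumptions made for the example.

```python
def click_from_patch_scores(scores, patch_size=14, top_k=3):
    """Turn a grid of patch relevance scores into one (x, y) click point.

    scores: list of rows, each a list of non-negative floats (one per patch).
    patch_size and top_k are illustrative assumptions, not values from the paper.
    """
    # Flatten to (score, row, col) and keep the top_k highest-scoring patches.
    flat = [(s, r, c) for r, row in enumerate(scores) for c, s in enumerate(row)]
    best = sorted(flat, key=lambda t: t[0], reverse=True)[:top_k]

    total = sum(s for s, _, _ in best)
    if total <= 0:
        raise ValueError("no patch has positive relevance")

    # Score-weighted centroid of the selected patch centers, in pixels.
    x = sum(s * (c + 0.5) * patch_size for s, _, c in best) / total
    y = sum(s * (r + 0.5) * patch_size for s, r, _ in best) / total
    return x, y


# Toy 3x4 patch grid where relevance concentrates on two neighboring patches.
scores = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
point = click_from_patch_scores(scores, patch_size=14, top_k=2)
```

With this toy grid the click lands on the shared edge of the two highlighted patches. No coordinate is ever generated as text; the point is read off the patch scores.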

Key Features of GUI-AIMA

  • Intrinsic Multimodal Attention Alignment: GUI-AIMA aligns the intrinsic multimodal attention of MLLMs with patch-wise grounding signals, which are computed adaptively for each user instruction.
  • Multi-Head Aggregation: The framework applies multi-head aggregation to simplified query-visual attention matrices, combining the attention of multiple heads into a single patch-level relevance map.
  • Coordinate-Free Integration: Because GUI-AIMA is coordinate-free, a plug-and-play zoom-in stage can be integrated easily.
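The three features above can be made concrete with a small sketch. Everything here is a hypothetical illustration under stated assumptions, since the summary does not give the paper's formulas: `patch_labels` assumes the grounding signal is the normalized overlap between each patch and the target box, `aggregate_heads` assumes a simple weighted sum over heads with externally supplied weights, `alignment_loss` assumes a KL divergence, and `zoom_box` assumes a fixed-ratio crop centered on the first-pass prediction.

```python
import math


def patch_labels(bbox, grid_w, grid_h, patch_size=14):
    """Soft target per patch: its overlap area with the box, normalized to sum to 1."""
    x0, y0, x1, y1 = bbox
    raw = []
    for r in range(grid_h):
        for c in range(grid_w):
            px0, py0 = c * patch_size, r * patch_size
            # Overlap of this patch with the target box along each axis.
            w = max(0, min(x1, px0 + patch_size) - max(x0, px0))
            h = max(0, min(y1, py0 + patch_size) - max(y0, py0))
            raw.append(w * h)
    total = sum(raw)
    if total == 0:
        raise ValueError("box does not overlap the patch grid")
    return [v / total for v in raw]


def aggregate_heads(head_attn, head_weights):
    """Weighted sum of per-head patch attention, renormalized to a distribution."""
    n_patches = len(head_attn[0])
    agg = [
        sum(w * head[p] for w, head in zip(head_weights, head_attn))
        for p in range(n_patches)
    ]
    total = sum(agg)
    return [a / total for a in agg]


def alignment_loss(labels, attn, eps=1e-12):
    """KL(labels || attn): zero when attention matches the patch labels exactly."""
    return sum(l * math.log(l / max(a, eps)) for l, a in zip(labels, attn) if l > 0)


def zoom_box(click, image_size, zoom=2.0):
    """Crop window of size image/zoom centered on the click, clamped to the image."""
    (x, y), (img_w, img_h) = click, image_size
    w, h = img_w / zoom, img_h / zoom
    left = min(max(x - w / 2, 0), img_w - w)
    top = min(max(y - h / 2, 0), img_h - h)
    return left, top, left + w, top + h


# Toy 2x2 patch grid; the target box straddles the two top patches equally.
labels = patch_labels((7, 0, 21, 14), grid_w=2, grid_h=2)
# Two hypothetical heads, each favoring one of the top patches, equally weighted.
attn = aggregate_heads([[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1]], [1.0, 1.0])
loss = alignment_loss(labels, attn)
crop = zoom_box((28.0, 21.0), image_size=(200, 100), zoom=2.0)
```

In a zoom-in pass, the crop returned by `zoom_box` would be fed back to the model at higher effective resolution and the click predicted again inside it. Because the prediction is read from attention over patches and not emitted as text coordinates, the same procedure applies unchanged to the cropped image.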

GUI-AIMA-3B, a variant of the proposed framework, was trained on only 509,000 samples (approximately 101,000 screenshots). This data efficiency indicates that light training can activate the native grounding capabilities of MLLMs.

Performance Metrics

GUI-AIMA-3B was evaluated on five benchmarks and achieves state-of-the-art results among 3B models. Its reported accuracy on each benchmark is:

  • ScreenSpot-Pro: 61.5%
  • ScreenSpot-v2: 92.1%
  • OSWorld-G: 68.1%
  • MMBench-GUI-L2: 79.1%
  • UI-Vision: 60.0%
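The summary does not spell out how accuracy is scored. A minimal sketch, assuming the point-in-box criterion commonly used for GUI grounding benchmarks (a prediction counts as correct when the predicted click falls inside the target element's bounding box), looks like this. The function name `grounding_accuracy` and the sample data are made up for illustration and are not benchmark data.

```python
def grounding_accuracy(predictions, targets):
    """Fraction of predicted click points that land inside their target box.

    predictions: list of (x, y) points.
    targets: list of (x0, y0, x1, y1) boxes, one per prediction.
    Assumes the point-in-box criterion; the scoring rule is not given in the source.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("at least one prediction is required")
    hits = sum(
        1
        for (x, y), (x0, y0, x1, y1) in zip(predictions, targets)
        if x0 <= x <= x1 and y0 <= y <= y1
    )
    return hits / len(predictions)


# Toy data: the first and third clicks fall inside their boxes, the second misses.
preds = [(28.0, 21.0), (5.0, 5.0), (90.0, 40.0)]
boxes = [(20, 10, 40, 30), (50, 50, 60, 60), (80, 30, 100, 50)]
acc = grounding_accuracy(preds, boxes)
```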

These results indicate that GUI-AIMA translates natural-language instructions into precise GUI interactions effectively, which strengthens computer-use agents.

Conclusion

In summary, GUI-AIMA advances GUI grounding by aligning intrinsic multimodal attention within a coordinate-free framework, which reduces the training data needed while delivering strong benchmark accuracy for a 3B model. For further details and access to the project, visit the GUI-AIMA project page.



Lazarus Omolua
https://richlyai.com/blog
