GUIDE: Resolving Domain Bias in GUI Agents through Real-Time Web Video Retrieval and Plug-and-Play Annotation
Summary: arXiv:2603.26266v1 Announce Type: new
Abstract: Large vision-language models have endowed GUI agents with strong general capabilities for interface understanding and interaction. However, due to insufficient exposure to domain-specific software operation data during training, these agents exhibit significant domain bias – they lack familiarity with the specific operation workflows (planning) and UI element layouts (grounding) of particular applications, limiting their real-world task performance.
In this paper, we present GUIDE (GUI Unbiasing via Instructional-Video Driven Expertise), a training-free, plug-and-play framework that resolves GUI agent domain bias by autonomously acquiring domain-specific expertise from web tutorial videos through a retrieval-augmented automated annotation pipeline. GUIDE introduces two key innovations:
- Subtitle-driven Video-RAG Pipeline: This component unlocks video semantics through subtitle analysis, enabling a progressive three-stage retrieval process that includes:
  - Domain classification
  - Topic extraction
  - Relevance matching

  This process effectively identifies task-relevant tutorial videos for GUI agents.
- Automated Annotation Pipeline: Built on an inverse dynamics paradigm, this pipeline feeds consecutive keyframes enhanced with UI element detection into vision-language models (VLMs). This enables the inference of the required planning and grounding knowledge, which is then injected into the agent’s corresponding modules.
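The progressive three-stage retrieval described above can be illustrated with a minimal sketch. It is not GUIDE's implementation: the abstract does not specify how each stage is computed, so simple keyword overlap stands in for the model calls, and the `Video` class, the `DOMAIN_KEYWORDS` table, and all function names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Video:
    title: str
    subtitles: str


# Hypothetical domain vocabulary; a real system would likely use a language model.
DOMAIN_KEYWORDS = {
    "spreadsheet": {"cell", "column", "row", "formula", "sheet"},
    "image_editing": {"layer", "brush", "canvas", "crop", "filter"},
}


def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for subtitle analysis."""
    return {w.strip(".,!?:;()").lower() for w in text.split()} - {""}


def classify_domain(text: str) -> str | None:
    """Stage 1: pick the domain whose keywords overlap the text most."""
    words = tokens(text)
    scores = {d: len(words & kw) for d, kw in DOMAIN_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def extract_topics(text: str, domain: str) -> set[str]:
    """Stage 2: keep the domain terms that the text actually mentions."""
    return tokens(text) & DOMAIN_KEYWORDS[domain]


def retrieve(task: str, videos: list[Video], top_k: int = 1) -> list[Video]:
    """Run the stages in order, narrowing the candidate set at each one."""
    domain = classify_domain(task)
    if domain is None:
        return []
    task_topics = extract_topics(task, domain)
    scored = []
    for video in videos:
        if classify_domain(video.subtitles) != domain:  # stage 1 filter
            continue
        topics = extract_topics(video.subtitles, domain)  # stage 2
        overlap = len(topics & task_topics)  # stage 3: relevance matching
        if overlap:
            scored.append((overlap, video))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [video for _, video in scored[:top_k]]
```

Each stage is cheaper than the next and discards candidates early, which is the point of a progressive design: only videos in the right domain reach topic extraction, and only those sharing a topic with the task are ranked.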
These innovations address both manifestations of domain bias, significantly enhancing the performance of GUI agents in real-world scenarios. Extensive experiments conducted on OSWorld demonstrate GUIDE’s generality as a plug-and-play component for both multi-agent systems and single-model agents.
Results show that GUIDE consistently improves task performance by more than 5% while also reducing the number of execution steps. Importantly, these gains are achieved without modifying any model parameters or architecture, validating GUIDE as an architecture-agnostic remedy for domain bias in GUI agents.
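One way to picture how annotation and injection could work without touching model parameters is the sketch below. It rests on assumptions: frames are represented as plain strings, `infer_action` is a hypothetical callable standing in for the vision-language model, and knowledge is injected by appending text to a module's prompt, which the abstract does not confirm as GUIDE's exact mechanism.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Knowledge:
    planning: List[str] = field(default_factory=list)   # workflow steps
    grounding: List[str] = field(default_factory=list)  # UI element hints


def annotate(
    keyframes: List[str],
    infer_action: Callable[[str, str], Dict[str, str]],
) -> Knowledge:
    """Inverse dynamics: infer what happened between consecutive keyframes.

    `infer_action` is a hypothetical stand-in for a VLM call that returns
    a dict with a "step" (planning) and an "element" (grounding) entry.
    """
    knowledge = Knowledge()
    for before, after in zip(keyframes, keyframes[1:]):
        result = infer_action(before, after)
        knowledge.planning.append(result["step"])
        knowledge.grounding.append(result["element"])
    return knowledge


def inject(base_prompt: str, heading: str, hints: List[str]) -> str:
    """Append knowledge to a module's prompt; the model itself is unchanged."""
    if not hints:
        return base_prompt
    bullet_list = "\n".join(f"- {hint}" for hint in hints)
    return f"{base_prompt}\n\n{heading}\n{bullet_list}"
```

Under this assumption, planning knowledge would be passed to `inject` for the planner's prompt and grounding knowledge for the grounding module's prompt, so the same two calls apply whether the agent is a multi-agent system or a single model.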
As GUI agents are integrated into more software applications, their effectiveness depends on how well they adapt to specific domains. Because GUIDE draws on publicly available web tutorial videos, it offers a scalable and efficient way to keep an agent's domain knowledge current across diverse use cases.
In conclusion, by resolving domain bias without any training, GUIDE makes GUI agents more practical to deploy in real-world applications and more useful to end users.
