UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding
arXiv:2604.14113v1
Abstract
GUI grounding, which localizes interface elements from screenshots given natural language queries, remains challenging for small icons and dense layouts. Test-time zoom-in methods improve localization by cropping and re-running inference at higher resolution, but apply cropping uniformly across all instances with fixed crop sizes, ignoring whether the model is actually uncertain on each case. We propose UI-Zoomer, a training-free adaptive zoom-in framework that treats both the trigger and scale of zoom-in as a prediction uncertainty quantification problem.
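The crop-and-re-run step that test-time zoom-in methods share can be sketched as follows. This is a minimal illustration, not the paper's code: the function names `crop_window` and `to_full_image`, the `(x1, y1, x2, y2)` pixel box convention, and the single upscaling factor `scale` are all assumptions made for the example.

```python
def crop_window(cx, cy, rx, ry, width, height):
    """Return a crop rectangle centred on (cx, cy) with half-sizes (rx, ry),
    clamped to the screenshot bounds. All values are in full-image pixels."""
    x1 = max(0.0, cx - rx)
    y1 = max(0.0, cy - ry)
    x2 = min(float(width), cx + rx)
    y2 = min(float(height), cy + ry)
    return (x1, y1, x2, y2)


def to_full_image(box_in_crop, window, scale):
    """Map a box predicted on the upscaled crop back to full-image pixels.

    `scale` is the (assumed uniform) factor by which the crop was enlarged
    before the second inference pass.
    """
    ox, oy = window[0], window[1]  # top-left corner of the crop
    x1, y1, x2, y2 = box_in_crop
    return (ox + x1 / scale, oy + y1 / scale,
            ox + x2 / scale, oy + y2 / scale)


# A crop near the top-left corner is clamped to the image.
window = crop_window(50, 50, 100, 100, width=1920, height=1080)
# A box found in a 2x-upscaled crop, mapped back to screenshot coordinates.
full_box = to_full_image((20, 40, 60, 80), (100, 200, 300, 400), scale=2.0)
```

A fixed-size method would call `crop_window` with the same `rx` and `ry` for every query; the adaptive approach described here would instead choose them per instance.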
Key Features of UI-Zoomer
- Confidence-Aware Gate: A novel mechanism that fuses spatial consensus among stochastic candidates with token-level generation confidence to selectively trigger zoom-in only when localization is uncertain.
- Uncertainty-Driven Crop Sizing: This module decomposes prediction variance into inter-sample positional spread and intra-sample box extent, deriving a per-instance crop radius via the law of total variance.
- Training-Free Framework: UI-Zoomer requires no additional training. It operates purely at test time, so it can be applied on top of an existing grounding model without fine-tuning.
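The gate and the crop-sizing rule above can be illustrated with a short sketch. The summary does not give the paper's formulas, so everything concrete here is an assumption: consensus is taken as mean pairwise IoU, token confidence as the mean per-token probability, the two are fused by a weighted sum with hypothetical parameters `alpha` and `threshold`, and each sampled box is modelled as a uniform distribution over its extent (variance `w**2 / 12` per axis). Under that model, the law of total variance gives total variance = variance of box centres (inter-sample spread) + mean within-box variance (intra-sample extent), and the crop radius is a hypothetical multiple `k` of the resulting standard deviation.

```python
import math
from itertools import combinations


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def should_zoom(boxes, token_logprobs, alpha=0.5, threshold=0.7):
    """Assumed gate: zoom in when the fused confidence score is low.

    boxes: non-empty list of sampled boxes; token_logprobs: one non-empty
    list of per-token log-probabilities per sample.
    """
    pairs = list(combinations(boxes, 2))
    # Spatial consensus: mean pairwise IoU (1.0 if only one candidate).
    consensus = sum(iou(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0
    # Generation confidence: geometric-mean token probability, averaged over samples.
    confidence = sum(math.exp(sum(lp) / len(lp)) for lp in token_logprobs)
    confidence /= len(token_logprobs)
    score = alpha * consensus + (1 - alpha) * confidence
    return score < threshold


def crop_radius(boxes, k=3.0):
    """Assumed sizing rule: per-axis radius = k * sqrt(total variance)."""
    n = len(boxes)
    radii = []
    for lo, hi in ((0, 2), (1, 3)):  # x axis, then y axis
        centers = [(b[lo] + b[hi]) / 2 for b in boxes]
        mu = sum(centers) / n
        between = sum((c - mu) ** 2 for c in centers) / n      # inter-sample spread
        within = sum((b[hi] - b[lo]) ** 2 / 12 for b in boxes) / n  # intra-sample extent
        radii.append(k * math.sqrt(between + within))
    return tuple(radii)


tight = [(100, 100, 120, 120)] * 3
scattered = [(0, 0, 10, 10), (100, 100, 110, 110), (200, 200, 210, 210)]
zoom_tight = should_zoom(tight, [[0.0, 0.0]] * 3)
zoom_scattered = should_zoom(scattered, [[-2.0]] * 3)
```

With these assumptions, candidates that agree and were generated confidently skip the zoom step, while scattered candidates both trigger it and receive a larger crop radius.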
Performance Evaluation
Experiments on the ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 benchmarks show consistent improvements over strong baselines across multiple model architectures, with gains of up to:
- +13.4% on ScreenSpot-Pro
- +10.3% on UI-Vision
- +4.2% on ScreenSpot-v2
These improvements indicate that the uncertainty-driven approach is effective at refining localization in GUI grounding, particularly for the small icons and dense layouts where uniform, fixed-size zoom-in falls short.
Conclusion
UI-Zoomer decides both whether to zoom in and how large a crop to use from the model's own prediction uncertainty. This improves localization accuracy without any additional training, which makes it a promising option for developers and researchers aiming to improve the robustness of GUI analysis systems.
Future Directions
Going forward, the ideas behind UI-Zoomer could extend beyond GUI grounding to other areas such as image segmentation and object detection, where prediction uncertainty also plays a critical role. Continued research in this direction could further improve how AI systems understand and interact with complex visual information.
