Coding with Eyes: Visual Feedback Unlocks Reliable GUI Code Generating and Debugging
Summary: arXiv:2604.19750v1
Code generation has advanced significantly with the advent of Large Language Model (LLM)-based agents. These agents, however, rely predominantly on text-output-based feedback for debugging, particularly in multi-round scenarios. That approach runs into trouble with graphical user interfaces (GUIs), which inherently involve visual information.
The Challenges in GUI Code Generation
Current agent methods encounter two primary limitations when dealing with GUI applications:
- Event-Driven Nature: GUI programs are event-driven, meaning that they react to user interactions. Existing methods often cannot simulate these interactions, so the logic behind GUI elements is never triggered and never tested.
- Visual Attributes: GUI applications possess a variety of visual attributes that are difficult to assess using text-based approaches. This limitation hampers the ability to determine if the rendered interface meets user needs and expectations.
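To make the first limitation concrete, here is a minimal sketch of an event-driven widget in plain Python. The `Button` and `CounterApp` classes are hypothetical stand-ins for a real GUI toolkit; they are not taken from the paper or from any library. Starting the app runs no application logic and produces no text output, so the deliberate bug in the handler only becomes observable once a click is simulated.

```python
from typing import Callable, List


class Button:
    """Hypothetical stand-in for a toolkit button widget."""

    def __init__(self, label: str) -> None:
        self.label = label
        self._handlers: List[Callable[[], None]] = []

    def on_click(self, handler: Callable[[], None]) -> None:
        # Register a callback; nothing runs until an event arrives.
        self._handlers.append(handler)

    def click(self) -> None:
        # Simulate the user interaction that dispatches the event.
        for handler in self._handlers:
            handler()


class CounterApp:
    """Toy app whose logic lives entirely inside an event handler."""

    def __init__(self) -> None:
        self.count = 0
        self.button = Button("Increment")
        self.button.on_click(self._increment)

    def _increment(self) -> None:
        # Deliberate bug: the intended behavior is +1, the handler adds 2.
        self.count += 2


app = CounterApp()
count_before_click = app.count  # handler has not run; the bug is invisible
app.button.click()              # simulated interaction triggers the logic
count_after_click = app.count   # the wrong increment is now observable
```

The second limitation is the mirror image: even with the click simulated, nothing in `app.count` says whether the button label is clipped or overlaps another widget. That information exists only in the rendered interface.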
Introducing InteractGUI Bench
To tackle these challenges systematically, researchers have introduced InteractGUI Bench, a benchmark comprising 984 commonly used real-world desktop GUI application tasks. It is designed for fine-grained evaluation of both interaction logic and visual structure in GUI applications.
VF-Coder: A Vision-Feedback-Based Multi-Agent System
In conjunction with the InteractGUI Bench, researchers have developed VF-Coder, a vision-feedback-based multi-agent system specifically aimed at debugging GUI code. VF-Coder leverages visual information and interacts directly with program interfaces, allowing it to identify potential logic and layout issues in a manner akin to human users.
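The summary does not describe VF-Coder's internal architecture, so the following is only a generic sketch of what a vision-feedback debugging loop could look like, not the authors' implementation. Every name here (`Observation`, `debug_with_visual_feedback`, `observe`, `revise`) is an assumption made for illustration. The observing and revising agents are replaced by stubs so the loop runs without a model or a display.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Observation:
    """What a debugger 'sees' after running and interacting with the app."""
    screenshot: bytes   # rendered pixels (empty placeholder in this sketch)
    issues: List[str]   # logic or layout problems spotted in the interface


def debug_with_visual_feedback(
    code: str,
    observe: Callable[[str], Observation],
    revise: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> str:
    """Alternate between observing the running GUI and revising its code."""
    for _ in range(max_rounds):
        observation = observe(code)
        if not observation.issues:
            break  # nothing left to fix
        code = revise(code, observation.issues)
    return code


# Stub agents (hypothetical): a real system would render the GUI,
# interact with it, and ask a vision-capable model for issues and fixes.
def fake_observe(code: str) -> Observation:
    issues = [] if "width=200" in code else ["button label is clipped"]
    return Observation(screenshot=b"", issues=issues)


def fake_revise(code: str, issues: List[str]) -> str:
    return code.replace("width=20", "width=200")


fixed_code = debug_with_visual_feedback(
    "Button(width=20)", fake_observe, fake_revise
)
```

The design point is that the stopping condition depends on what is observed in the interface rather than on the program's text output.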
Results and Effectiveness
VF-Coder's effect shows in its results on InteractGUI Bench. With VF-Coder, the success rate of Gemini-3-Flash improved from 21.68% to 28.29%, a gain of 6.61 percentage points. The visual score for the same model rose from 0.4284 to 0.5584, a gain of 0.13, underscoring the impact of integrating visual feedback into GUI debugging.
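The reported figures can also be read as relative improvements. The short calculation below uses only the four numbers quoted above; the helper names are illustrative. Both metrics improve by roughly 30% relative to their baselines.

```python
def absolute_gain(before: float, after: float) -> float:
    """Difference between the new and the old value."""
    return after - before


def relative_gain(before: float, after: float) -> float:
    """Improvement as a fraction of the baseline value."""
    return (after - before) / before


# Figures quoted in the text for Gemini-3-Flash on InteractGUI Bench.
success_before, success_after = 21.68, 28.29  # success rate, percent
visual_before, visual_after = 0.4284, 0.5584  # visual score

success_pp = absolute_gain(success_before, success_after)
success_rel = relative_gain(success_before, success_after)
visual_delta = absolute_gain(visual_before, visual_after)
visual_rel = relative_gain(visual_before, visual_after)

print(f"Success rate: +{success_pp:.2f} points ({success_rel:.1%} relative)")
print(f"Visual score: +{visual_delta:.4f} ({visual_rel:.1%} relative)")
```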
Conclusion
Together, InteractGUI Bench and VF-Coder address two weaknesses of text-based debugging for GUI code: logic that goes untested because no interaction is simulated, and visual attributes that text output cannot capture. The reported gains suggest that reliable GUI code generation depends on AI systems that can “see” and interact with user interfaces, much as human developers do.
