Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification
arXiv:2603.26648v1
Recent advances in large language models have significantly improved the capabilities of coding agents. However, a systematic evaluation of complex, end-to-end website development remains limited. To address this gap, researchers have introduced Vision2Web, a hierarchical benchmark specifically designed for visual website development.
Introduction
Vision2Web covers three levels of task difficulty: static UI-to-code generation, interactive multi-page frontend reproduction, and long-horizon full-stack website development. The benchmark is constructed from real-world websites, so it evaluates visual language models on tasks that resemble practical web development.
Benchmark Overview
The Vision2Web benchmark comprises 193 tasks across 16 categories. It includes 918 prototype images and 1,255 test cases for testing coding agents' capabilities in real-world scenarios.
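To give a feel for the scale of these numbers, the short sketch below derives a few averages from the reported totals. Only the four totals come from the article; the `BenchmarkStats` class and its method are hypothetical names chosen for illustration, and the averages are simple means that say nothing about how tasks are actually distributed across levels or categories.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkStats:
    """Totals reported for Vision2Web (class name is illustrative only)."""
    tasks: int
    categories: int
    prototype_images: int
    test_cases: int

    def per_task(self, total: int) -> float:
        # Simple mean per task, rounded to two decimals.
        return round(total / self.tasks, 2)


STATS = BenchmarkStats(tasks=193, categories=16,
                       prototype_images=918, test_cases=1255)

images_per_task = STATS.per_task(STATS.prototype_images)
tests_per_task = STATS.per_task(STATS.test_cases)
tasks_per_category = round(STATS.tasks / STATS.categories, 2)

print(f"prototype images per task: {images_per_task}")
print(f"test cases per task:       {tests_per_task}")
print(f"tasks per category:        {tasks_per_category}")
```

On average, then, a task comes with roughly five prototype images and six to seven test cases, and a category holds about a dozen tasks.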
Evaluation Methodology
To ensure a flexible, thorough, and reliable evaluation, the researchers propose a workflow-based agent verification paradigm. This paradigm consists of two complementary components:
- GUI Agent Verifier: This component assesses the graphical user interface generated by the coding agents to ensure it meets specified design criteria.
- VLM-based Judge: This component uses a visual language model to judge the coding agents' outputs against predetermined standards.
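The two components can be pictured as independent checks whose results are reported side by side. The following is a minimal sketch under stated assumptions, not the benchmark's actual implementation: the function signatures, the pairing of test cases with the GUI agent verifier and of prototype images with the VLM-based judge, the 0-to-1 score range, and the stub verifiers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical interfaces; the article does not specify the real ones.
GuiVerifier = Callable[[str, str], bool]   # (site_url, test_case) -> passed?
VlmJudge = Callable[[str, str], float]     # (site_url, prototype) -> score in [0, 1]


@dataclass
class TaskReport:
    functional_pass_rate: float  # share of test cases the GUI verifier passed
    mean_visual_score: float     # mean judge score over prototype images


def evaluate_task(site_url: str,
                  test_cases: Sequence[str],
                  prototypes: Sequence[str],
                  gui_verifier: GuiVerifier,
                  vlm_judge: VlmJudge) -> TaskReport:
    """Run both checks and report them separately (assumed design)."""
    passed = [gui_verifier(site_url, case) for case in test_cases]
    scores = [vlm_judge(site_url, proto) for proto in prototypes]
    return TaskReport(
        functional_pass_rate=sum(passed) / len(passed) if passed else 0.0,
        mean_visual_score=sum(scores) / len(scores) if scores else 0.0,
    )


# Stub verifiers standing in for a real GUI agent and a real VLM.
def stub_gui_verifier(site_url: str, test_case: str) -> bool:
    return "login" not in test_case  # pretend the login flow is broken


def stub_vlm_judge(site_url: str, prototype: str) -> float:
    return 0.5  # constant placeholder score


report = evaluate_task(
    "http://localhost:3000",
    ["open home page", "submit login form"],
    ["home.png"],
    stub_gui_verifier,
    stub_vlm_judge,
)
print(report)
```

Keeping the two results separate, rather than folding them into a single number, makes it visible when a generated site looks right but does not behave correctly, or the reverse.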
Findings
Evaluating multiple visual language models under different coding-agent frameworks reveals substantial performance gaps at all task levels. State-of-the-art models still struggle with full-stack development in particular, which points to a need for better methods and tools in this domain.
Conclusion
Vision2Web is a step toward systematic evaluation of coding agents' visual website development capabilities. By pairing a benchmark built from real-world websites with an agent-based verification workflow, it gives future research a common basis for measuring and improving visual language models on complex web development tasks.
Future Directions
As coding agents continue to improve, benchmarks like Vision2Web will remain important for tracking progress. Future work may focus on refining the agent verification paradigm, expanding the task categories, and improving the performance of coding agents in real-world scenarios.
