WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark
Summary: arXiv:2604.10988v1
Abstract: Existing browser agent benchmarks face a fundamental trilemma: real-website benchmarks lack reproducibility due to content drift, controlled environments sacrifice realism by omitting real-web noise, and both require costly manual curation that limits scalability. We present WebForge, the first fully automated framework that resolves this trilemma through a four-agent pipeline — Plan, Generate, Refine, and Validate — that produces interactive, self-contained web environments end-to-end without human annotation. A seven-dimensional difficulty control framework structures task design along navigation depth, visual complexity, reasoning difficulty, and more, enabling systematic capability profiling beyond single aggregate scores. Using WebForge, we construct WebForge-Bench, a benchmark of 934 tasks spanning 7 domains and 3 difficulty levels. Multi-model experiments show that difficulty stratification effectively differentiates model capabilities, while cross-domain analysis exposes capability biases invisible to aggregate metrics. Together, these results confirm that multi-dimensional evaluation reveals distinct capability profiles that a single aggregate score cannot capture. Code and benchmark are publicly available at https://github.com/yuandaxia2001/WebForge.
Introduction
Browser agent benchmarks are needed to evaluate how well AI agents navigate and interact with web content. Existing benchmarks, however, face a trilemma:
- Real-website benchmarks suffer from reproducibility issues due to content drift over time.
- Controlled environments that aim for reproducibility sacrifice realism by not including the unpredictable noise of real web interactions.
- Both types of benchmarks require costly manual curation, which limits their scalability.
WebForge: An Innovative Solution
WebForge addresses this trilemma with a fully automated four-agent pipeline that produces interactive, self-contained web environments end-to-end, without human annotation:
- Plan: Designs the tasks to be generated.
- Generate: Creates interactive web environments autonomously.
- Refine: Enhances the generated environments for better realism.
- Validate: Ensures the functionality and reliability of the tasks.
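The four stages above can be read as a sequential pipeline in which each stage receives the previous stage's output. The following is a minimal sketch of that control flow only. It is not the WebForge implementation: the `Environment` class, the stage function bodies, and the sample task and HTML are all hypothetical placeholders invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Environment:
    """Hypothetical container passed between stages (not WebForge's real type)."""
    task: str = ""
    html: str = ""
    valid: bool = False
    log: List[str] = field(default_factory=list)


def plan(env: Environment) -> Environment:
    # Placeholder: decide what task the environment should support.
    env.task = "Find the cheapest item in the catalog"
    env.log.append("plan")
    return env


def generate(env: Environment) -> Environment:
    # Placeholder: emit a self-contained page for the planned task.
    env.html = "<html><body><ul><li>Pen $2</li><li>Ink $5</li></ul></body></html>"
    env.log.append("generate")
    return env


def refine(env: Environment) -> Environment:
    # Placeholder: add distracting content so the page is less sterile.
    env.html = env.html.replace("<body>", "<body><div class='banner'>Sale!</div>")
    env.log.append("refine")
    return env


def validate(env: Environment) -> Environment:
    # Placeholder check: a task must be set and the page must contain list items.
    env.valid = bool(env.task) and "<li>" in env.html
    env.log.append("validate")
    return env


# Stage order mirrors the list above: Plan, Generate, Refine, Validate.
STAGES: List[Callable[[Environment], Environment]] = [plan, generate, refine, validate]


def run_pipeline() -> Environment:
    env = Environment()
    for stage in STAGES:
        env = stage(env)
    return env


result = run_pipeline()
print(result.log, result.valid)
```

In a real system each stage would presumably be backed by a model call, and a failed validation would likely trigger regeneration; the abstract does not describe either detail, so neither is modeled here.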
Difficulty Control Framework
WebForge incorporates a seven-dimensional difficulty control framework that structures task design across various parameters:
- Navigation depth
- Visual complexity
- Reasoning difficulty
- And more (the abstract names only these three of the seven dimensions)
Structuring tasks along these dimensions enables systematic capability profiling: agent performance can be examined dimension by dimension instead of being collapsed into a single aggregate score.
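One way to picture such a framework is as a per-task difficulty profile that can be queried by dimension. The sketch below is an assumption-laden illustration, not the paper's schema: the class names, the 1 to 3 rating scale per dimension, and the `other` field standing in for the dimensions the abstract does not name are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

LEVELS = (1, 2, 3)  # assumption: every dimension is rated 1 (low) to 3 (high)


@dataclass(frozen=True)
class DifficultyProfile:
    navigation_depth: int
    visual_complexity: int
    reasoning_difficulty: int
    # Placeholder for the dimensions the abstract does not name.
    other: Dict[str, int] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject ratings outside the assumed scale.
        for name, level in self.as_dict().items():
            if level not in LEVELS:
                raise ValueError(f"{name} must be one of {LEVELS}, got {level}")

    def as_dict(self) -> Dict[str, int]:
        named = {
            "navigation_depth": self.navigation_depth,
            "visual_complexity": self.visual_complexity,
            "reasoning_difficulty": self.reasoning_difficulty,
        }
        return {**named, **self.other}


@dataclass(frozen=True)
class Task:
    task_id: str
    profile: DifficultyProfile


def select(tasks: List[Task], dimension: str, min_level: int) -> List[str]:
    """Return ids of tasks whose rating on `dimension` is at least `min_level`."""
    return [
        t.task_id
        for t in tasks
        if t.profile.as_dict().get(dimension, 0) >= min_level
    ]


tasks = [
    Task("t1", DifficultyProfile(3, 1, 2)),
    Task("t2", DifficultyProfile(1, 3, 2)),
]
print(select(tasks, "navigation_depth", 3))   # ['t1']
print(select(tasks, "visual_complexity", 3))  # ['t2']
```

Tagging tasks this way makes it possible to build test subsets that stress one capability at a time, which is the kind of profiling a single overall difficulty label cannot support.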
WebForge-Bench
Using the WebForge framework, the authors constructed WebForge-Bench, a benchmark of 934 tasks spanning 7 domains and 3 difficulty levels. They report two main findings from experiments on it:
- Multi-model experiments demonstrate that difficulty stratification can effectively differentiate model capabilities.
- Cross-domain analyses reveal capability biases that are often hidden when only aggregate metrics are considered.
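The contrast between aggregate and stratified metrics can be shown with a small calculation. The sketch below uses invented toy records, not results from the paper; the model names, domain names, and the "easy"/"hard" level labels are hypothetical. Two models share the same overall success rate but differ sharply once results are grouped by domain or by difficulty level.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Record(NamedTuple):
    model: str
    domain: str
    level: str
    success: bool


# Invented toy data for illustration only; not WebForge-Bench results.
RECORDS: List[Record] = [
    Record("model_x", "domain_a", "easy", True),
    Record("model_x", "domain_a", "hard", True),
    Record("model_x", "domain_b", "easy", False),
    Record("model_x", "domain_b", "hard", False),
    Record("model_y", "domain_a", "easy", True),
    Record("model_y", "domain_a", "hard", False),
    Record("model_y", "domain_b", "easy", True),
    Record("model_y", "domain_b", "hard", False),
]


def aggregate(records: List[Record], model: str) -> float:
    """Overall success rate for one model."""
    outcomes = [r.success for r in records if r.model == model]
    return sum(outcomes) / len(outcomes)


def success_rates(records: List[Record], model: str, group_by: str) -> Dict[str, float]:
    """Success rate per group, where `group_by` is 'domain' or 'level'."""
    totals: Dict[str, int] = defaultdict(int)
    wins: Dict[str, int] = defaultdict(int)
    for r in records:
        if r.model != model:
            continue
        key = getattr(r, group_by)
        totals[key] += 1
        wins[key] += int(r.success)
    return {k: wins[k] / totals[k] for k in totals}


for model in ("model_x", "model_y"):
    print(model, aggregate(RECORDS, model))
    print("  by domain:", success_rates(RECORDS, model, "domain"))
    print("  by level: ", success_rates(RECORDS, model, "level"))
```

In this toy data both models score 0.5 overall, yet `model_x` succeeds only in `domain_a` while `model_y` succeeds only on easy tasks. That is the kind of difference an aggregate score hides.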
Conclusion
WebForge and WebForge-Bench address the realism-reproducibility-scalability trilemma by generating self-contained web environments automatically rather than relying on live websites or manual curation. According to the authors, the experiments confirm that multi-dimensional evaluation reveals distinct capability profiles that a single aggregate score cannot capture.
