WebForge: Solving Browser Agent Benchmark Trilemma

WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark

Source: arXiv:2604.10988v1

Abstract: Existing browser agent benchmarks face a fundamental trilemma: real-website benchmarks lack reproducibility due to content drift, controlled environments sacrifice realism by omitting real-web noise, and both require costly manual curation that limits scalability. We present WebForge, the first fully automated framework that resolves this trilemma through a four-agent pipeline — Plan, Generate, Refine, and Validate — that produces interactive, self-contained web environments end-to-end without human annotation. A seven-dimensional difficulty control framework structures task design along navigation depth, visual complexity, reasoning difficulty, and more, enabling systematic capability profiling beyond single aggregate scores. Using WebForge, we construct WebForge-Bench, a benchmark of 934 tasks spanning 7 domains and 3 difficulty levels. Multi-model experiments show that difficulty stratification effectively differentiates model capabilities, while cross-domain analysis exposes capability biases invisible to aggregate metrics. Together, these results confirm that multi-dimensional evaluation reveals distinct capability profiles that a single aggregate score cannot capture. Code and benchmark are publicly available at https://github.com/yuandaxia2001/WebForge.

Introduction

Browser agent benchmarks measure how well AI agents navigate and interact with web content. According to the paper, existing benchmarks face a three-way trade-off among realism, reproducibility, and scalability:

Real-website benchmarks suffer from reproducibility issues due to content drift over time.
Controlled environments that aim for reproducibility sacrifice realism by not including the unpredictable noise of real web interactions.
Both types of benchmarks require costly manual curation, which limits their scalability.

WebForge: An Innovative Solution

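WebForge is described as the first fully automated framework to address this trilemma. Its four-agent pipeline produces interactive, self-contained web environments end-to-end, without human annotation: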

Plan: Strategizes the tasks to be generated.
Generate: Creates interactive web environments autonomously.
Refine: Enhances the generated environments for better realism.
Validate: Ensures the functionality and reliability of the tasks.
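
The four stages above can be pictured as functions applied in sequence to one environment object. The following is only a minimal sketch of that control flow, not WebForge's actual code: the `Environment` class, the stub stage functions, and the placeholder HTML are all assumptions made for illustration, and the real agents are far more involved than these stubs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Environment:
    """Hypothetical container for one generated task environment."""
    task: str
    html: str = ""
    notes: List[str] = field(default_factory=list)
    valid: bool = False


def plan(topic: str) -> Environment:
    # Plan: decide what task the environment should support.
    return Environment(task=f"Find the price of a {topic}")


def generate(env: Environment) -> Environment:
    # Generate: produce a self-contained page (a stub, not a real generator).
    env.html = (
        f"<html><body><h1>{env.task}</h1>"
        "<p id='price'>42</p></body></html>"
    )
    return env


def refine(env: Environment) -> Environment:
    # Refine: adjust the generated page; here we only record that a pass ran.
    env.notes.append("refined")
    return env


def validate(env: Environment) -> Environment:
    # Validate: check that the page contains what the task needs.
    env.valid = "id='price'" in env.html
    return env


def run_pipeline(topic: str) -> Environment:
    # Run the stages in the order named in the post.
    env = plan(topic)
    for stage in (generate, refine, validate):
        env = stage(env)
    return env


if __name__ == "__main__":
    result = run_pipeline("laptop")
    print(result.task, result.valid, result.notes)
```

Because every stage takes and returns the same object, a failed validation could in principle be routed back to an earlier stage. Whether WebForge does this is not stated in the abstract.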

Difficulty Control Framework

WebForge incorporates a seven-dimensional difficulty control framework that structures task design. The abstract names three of the seven dimensions:

Navigation depth
Visual complexity
Reasoning difficulty
And more…

Scoring along several dimensions allows systematic capability profiling: instead of one aggregate score, an agent receives a separate result for each dimension, which shows where it is strong and where it fails.
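
One way to picture per-dimension profiling is to tag each task with a level on every dimension and then compute success rates per level. The sketch below is illustrative only. It covers just the three dimensions named in the abstract, the 1 to 3 per-dimension scale is an assumption, and the results are synthetic rather than taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Only the three dimensions named in the abstract; the other four are not listed.
NAMED_DIMENSIONS = ("navigation_depth", "visual_complexity", "reasoning_difficulty")
ASSUMED_LEVELS = (1, 2, 3)  # assumption: a 1-3 scale per dimension


@dataclass
class DifficultyProfile:
    """Hypothetical per-task difficulty tags, one level per dimension."""
    levels: Dict[str, int]

    def __post_init__(self) -> None:
        for name, level in self.levels.items():
            if name not in NAMED_DIMENSIONS:
                raise ValueError(f"unknown dimension: {name}")
            if level not in ASSUMED_LEVELS:
                raise ValueError(f"level out of range for {name}: {level}")


def success_by_level(
    results: List[Tuple[DifficultyProfile, bool]], dimension: str
) -> Dict[int, float]:
    # Group outcomes by the task's level on one dimension, then average.
    totals: Dict[int, int] = defaultdict(int)
    wins: Dict[int, int] = defaultdict(int)
    for profile, solved in results:
        level = profile.levels[dimension]
        totals[level] += 1
        wins[level] += int(solved)
    return {level: wins[level] / totals[level] for level in sorted(totals)}


def make_profile(nav: int, vis: int, reas: int) -> DifficultyProfile:
    return DifficultyProfile(
        {"navigation_depth": nav, "visual_complexity": vis, "reasoning_difficulty": reas}
    )


# Synthetic outcomes for one imaginary agent (not data from the paper).
RESULTS = [
    (make_profile(1, 1, 1), True),
    (make_profile(1, 3, 1), True),
    (make_profile(3, 1, 1), False),
    (make_profile(3, 3, 2), False),
]

if __name__ == "__main__":
    for dim in ("navigation_depth", "visual_complexity"):
        print(dim, success_by_level(RESULTS, dim))
```

In this toy data the aggregate success rate is 0.5, yet the agent solves every shallow-navigation task and no deep-navigation task, while visual complexity makes no difference. That is the kind of pattern a single score hides.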

WebForge-Bench

Using the WebForge framework, the authors constructed WebForge-Bench, a benchmark of 934 tasks spanning 7 domains and 3 difficulty levels. They report two main findings:

Multi-model experiments demonstrate that difficulty stratification can effectively differentiate model capabilities.
Cross-domain analyses reveal capability biases that are often hidden when only aggregate metrics are considered.
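
The two findings above come down to grouping task outcomes by difficulty level or by domain instead of averaging everything together. The sketch below shows that grouping on synthetic records. The model names, the domain labels `domain_a` and `domain_b`, the numeric levels, and all outcomes are placeholders, not values from WebForge-Bench.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Record(NamedTuple):
    """One synthetic task outcome; field values are placeholders."""
    domain: str
    level: int
    solved: bool


def aggregate(records: List[Record]) -> float:
    # Single overall success rate across all tasks.
    return sum(int(r.solved) for r in records) / len(records)


def success_rates(records: List[Record], key: str) -> Dict[object, float]:
    # Success rate per group, where key is "domain" or "level".
    totals: Dict[object, int] = defaultdict(int)
    wins: Dict[object, int] = defaultdict(int)
    for r in records:
        group = getattr(r, key)
        totals[group] += 1
        wins[group] += int(r.solved)
    return {g: wins[g] / totals[g] for g in sorted(totals)}


# Two imaginary models with the same aggregate score.
MODEL_X = [
    Record("domain_a", 1, True),
    Record("domain_a", 2, True),
    Record("domain_b", 1, False),
    Record("domain_b", 2, False),
]
MODEL_Y = [
    Record("domain_a", 1, True),
    Record("domain_a", 2, False),
    Record("domain_b", 1, True),
    Record("domain_b", 2, False),
]

if __name__ == "__main__":
    for name, records in (("model_x", MODEL_X), ("model_y", MODEL_Y)):
        print(name, "aggregate:", aggregate(records))
        print(name, "by domain:", success_rates(records, "domain"))
        print(name, "by level:", success_rates(records, "level"))
```

Both imaginary models score 0.5 overall. Grouping by domain shows that the first one fails an entire domain, and grouping by level shows that the second one fails every harder task, so the two breakdowns separate models that the aggregate treats as equal.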

Conclusion

WebForge and WebForge-Bench target the realism-reproducibility-scalability trilemma with automatically generated, self-contained web environments. The reported results support the paper's central claim: multi-dimensional evaluation reveals distinct capability profiles of AI agents that a single aggregate score cannot capture.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
