WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks
Summary: arXiv:2604.06367v1
Web agents increasingly automate browser tasks, from simple actions like form completion to more complex workflows such as ordering groceries online. Existing benchmarks evaluate general-purpose performance (WebArena) or safety against malicious actions (SafeArena), which leaves a gap: how well web agents handle security and privacy tasks on user-facing websites has not been evaluated.
To address this gap, researchers have introduced WebSP-Eval, an evaluation framework designed to measure web agent performance on website security and privacy tasks. These are sensitive actions that users frequently encounter online.
Key Components of WebSP-Eval
The WebSP-Eval framework comprises three primary components:
- Task Dataset: The framework includes a crafted dataset of 200 task instances across 28 websites, used to test web agents in real-world scenarios.
- Robust Agentic System: WebSP-Eval supports account and initial state management across multiple runs through a custom Google Chrome extension, ensuring that web agents can be evaluated consistently and fairly.
- Automated Evaluator: An automated evaluation system is integrated to assess the performance of web agents in real-time, providing objective and quantifiable results.
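To make the three components concrete, here is a minimal Python sketch of how a task instance, a state reset, and an automated check might fit together. Everything in it is an assumption for illustration: the `TaskInstance` fields, the `reset_state` and `evaluate` helpers, and the `example.com` task are hypothetical and are not taken from WebSP-Eval, whose actual state management runs through a Chrome extension.

```python
from dataclasses import dataclass


@dataclass
class TaskInstance:
    """One security/privacy task (hypothetical schema, not the paper's)."""
    website: str
    category: str
    instruction: str
    initial_state: dict   # settings restored before every run
    expected_state: dict  # settings that must hold after the run


def reset_state(task: TaskInstance) -> dict:
    """Return a fresh copy of the initial state so runs do not interfere."""
    return dict(task.initial_state)


def evaluate(task: TaskInstance, final_state: dict) -> bool:
    """Pass only if every expected setting is present with the expected value."""
    return all(
        key in final_state and final_state[key] == value
        for key, value in task.expected_state.items()
    )


# Hypothetical task: the site and the setting name are placeholders.
task = TaskInstance(
    website="example.com",
    category="privacy",
    instruction="Turn off ad personalization.",
    initial_state={"ad_personalization": True},
    expected_state={"ad_personalization": False},
)

# An agent that never touches the toggle leaves the initial state in place.
untouched = reset_state(task)
# An agent that clicks the toggle exactly once flips it.
flipped = reset_state(task)
flipped["ad_personalization"] = not flipped["ad_personalization"]

print(evaluate(task, untouched))
print(evaluate(task, flipped))
```

Resetting to a known initial state before each run is what makes results comparable across agents: the same click sequence on a toggle gives a different outcome depending on where the toggle started.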
Evaluation Results and Findings
The researchers evaluated eight web agent instantiations built on state-of-the-art multimodal large language models. The analysis was fine-grained, covering dimensions including:
- Different websites
- Task categories
- User interface (UI) elements
The evaluation produced several findings:
- Current web agents exhibited limited autonomous exploration capabilities, making it challenging for them to reliably solve website security and privacy tasks.
- Performance varied significantly across specific task categories and websites, indicating that some agents are better suited for particular environments.
- Stateful UI elements, such as toggles and checkboxes, were a primary cause of agent failures, with failure rates exceeding 45% on tasks containing these elements across many evaluated models.
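A per-dimension breakdown like the one described above can be computed by grouping run outcomes and dividing failures by attempts. The sketch below assumes a hypothetical record format (one dictionary per run with `website`, `category`, `ui_element`, and `success` keys); the records are made up for illustration and are not results from the paper.

```python
from collections import defaultdict


def failure_rate_by(records, key):
    """Group run records by `key` and return the failure rate of each group."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        if not record["success"]:
            failures[group] += 1
    return {group: failures[group] / totals[group] for group in totals}


# Made-up records for illustration only; these are not results from the paper.
records = [
    {"website": "site-a", "category": "privacy", "ui_element": "toggle", "success": False},
    {"website": "site-a", "category": "security", "ui_element": "button", "success": True},
    {"website": "site-b", "category": "privacy", "ui_element": "toggle", "success": True},
    {"website": "site-b", "category": "privacy", "ui_element": "checkbox", "success": False},
]

for dimension in ("website", "category", "ui_element"):
    print(dimension, failure_rate_by(records, dimension))
```

Slicing the same set of runs along several keys is what lets a benchmark separate "this website is hard" from "this kind of UI element is hard".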
Conclusion
WebSP-Eval advances the evaluation of web agents on security and privacy tasks. By identifying where current models fall short, the framework lays the groundwork for improvements in web agent design. The findings point to a specific need: agents must handle stateful user interface elements such as toggles and checkboxes more reliably before they can be trusted to manage users' privacy and security settings.
