RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management
arXiv:2604.13531v1
Abstract: Graphical User Interface (GUI) agents show strong capabilities for automating web tasks, but existing interactive benchmarks primarily target benign, predictable consumer environments. Their effectiveness in high-stakes, investigative domains such as authentic e-commerce risk management remains underexplored. To bridge this gap, we present RiskWebWorld, the first highly realistic interactive benchmark for evaluating GUI agents in e-commerce risk management.
RiskWebWorld features 1,513 tasks sourced from production risk-control pipelines across 8 core domains, and captures the authentic challenges of risk operations on uncooperative websites, including partial environmental hijackments. To support scalable evaluation and agentic reinforcement learning (RL), we further build a Gymnasium-compliant infrastructure that decouples policy planning from environment mechanics.
Key Features of RiskWebWorld
- Diverse Task Set: The benchmark includes 1,513 tasks sourced from production risk-control pipelines, so they reflect the real-world complexity of e-commerce risk management.
- Multiple Domains: Tasks are categorized across 8 core domains, ensuring a comprehensive assessment of GUI agents.
- Realistic Challenges: RiskWebWorld simulates scenarios involving uncooperative websites and environmental hijackments, providing a challenging testbed.
- Gymnasium-compliant Infrastructure: The infrastructure decouples policy planning from environment mechanics, which supports scalable evaluation and agentic reinforcement learning (RL).
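To make the decoupling of policy planning from environment mechanics concrete, here is a minimal sketch. It is not the RiskWebWorld implementation: `ToyRiskTaskEnv`, its page names, and `scripted_policy` are hypothetical, and the class only mirrors the Gymnasium `reset()`/`step()` contract (without importing the library) so that it runs on the standard library alone.

```python
from __future__ import annotations

from typing import Any, Callable, Optional


class ToyRiskTaskEnv:
    """Hypothetical stand-in for a web task: find the page holding the target."""

    def __init__(self, pages: list[str], target: str, max_steps: int = 5):
        self.pages = pages
        self.target = target
        self.max_steps = max_steps
        self._cursor = 0
        self._steps = 0

    def reset(self, *, seed: Optional[int] = None, options: Optional[dict] = None):
        # Gymnasium-style reset: returns (observation, info).
        self._cursor = 0
        self._steps = 0
        return self._observe(), {}

    def step(self, action: str):
        # Gymnasium-style step: returns
        # (observation, reward, terminated, truncated, info).
        self._steps += 1
        terminated = False
        reward = 0.0
        if action == "next":
            self._cursor = min(self._cursor + 1, len(self.pages) - 1)
        elif action == "submit":
            terminated = True
            reward = 1.0 if self.pages[self._cursor] == self.target else 0.0
        # The episode is cut off if the step budget runs out first.
        truncated = not terminated and self._steps >= self.max_steps
        return self._observe(), reward, terminated, truncated, {}

    def _observe(self) -> dict[str, Any]:
        return {"page": self.pages[self._cursor]}


def run_episode(env: ToyRiskTaskEnv, policy: Callable[[dict], str]) -> float:
    """Generic rollout loop: it knows nothing about how the policy decides."""
    obs, _info = env.reset()
    total = 0.0
    while True:
        action = policy(obs)
        obs, reward, terminated, truncated, _info = env.step(action)
        total += reward
        if terminated or truncated:
            return total


def scripted_policy(obs: dict) -> str:
    # Hypothetical policy: submit once the licence page is reached.
    return "submit" if obs["page"] == "merchant_license" else "next"


env = ToyRiskTaskEnv(
    pages=["home", "about", "merchant_license"], target="merchant_license"
)
episode_return = run_episode(env, scripted_policy)
print(episode_return)
```

Because `run_episode` depends only on the `reset()`/`step()` interface, the scripted policy could be swapped for a model-backed one without touching the environment, which is the kind of separation the benchmark's infrastructure is described as providing.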
Evaluation Insights
Our evaluation across diverse models reveals a dramatic capability gap: top-tier generalist models achieve a success rate of 49.1%, while specialized open-weights GUI models fail almost entirely. This gap points to a key insight: in long-horizon professional tasks, foundation model scale currently matters more than zero-shot interface grounding.
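As an illustration of how a success rate like the one above might be aggregated from per-task outcomes, here is a small sketch. The records, the domain labels (`domain_a`, `domain_b`), and the `success_rates` helper are all hypothetical; they are not benchmark data or the authors' scoring code.

```python
from collections import defaultdict

# Hypothetical per-task outcomes as (domain, succeeded) pairs.
# Illustrative values only, not results from the benchmark.
outcomes = [
    ("domain_a", True),
    ("domain_a", False),
    ("domain_b", True),
    ("domain_b", True),
]


def success_rates(records):
    """Return (per-domain success rate, overall success rate)."""
    if not records:
        raise ValueError("no task outcomes to aggregate")
    totals = defaultdict(int)
    wins = defaultdict(int)
    for domain, ok in records:
        totals[domain] += 1
        wins[domain] += int(ok)
    per_domain = {d: wins[d] / totals[d] for d in totals}
    # Overall rate weights every task equally, not every domain.
    overall = sum(wins.values()) / sum(totals.values())
    return per_domain, overall


per_domain, overall = success_rates(outcomes)
print(per_domain, overall)
```

Reporting both views is a design choice: a per-domain breakdown can expose weaknesses that a single task-weighted average hides.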
Future Implications
We also demonstrate the viability of our infrastructure through agentic RL, which improves open-source models by 16.2%. These results position RiskWebWorld as a practical testbed for developing robust digital workers capable of navigating complex e-commerce environments.
Conclusion
As e-commerce continues to evolve, the demand for effective risk management solutions will only grow. RiskWebWorld stands at the forefront of this evolution, offering a realistic platform for the development and assessment of GUI agents. By addressing the unique challenges of e-commerce risk management, we hope to pave the way for advancements in automated decision-making and operational efficiency.
