WebXSkill: Skill Learning for Autonomous Web Agents
A recent arXiv preprint (arXiv:2604.13318v1) introduces WebXSkill, a framework for improving autonomous web agents powered by large language models (LLMs). It targets a specific weakness of these agents: executing long-horizon workflows reliably.
Understanding the Grounding Gap
Autonomous web agents can perform complex tasks inside a web browser, but they often struggle with long-horizon workflows. The preprint attributes this to a grounding gap in how skills are currently formulated:
- Textual Workflow Skills: These provide natural language instructions, but the agent cannot execute them directly.
- Code-based Skills: These are executable but opaque to the agent, which makes it hard for the agent to recover from errors or adapt to new situations.
Introducing WebXSkill
WebXSkill bridges the grounding gap with executable skills. Each skill pairs a parameterized action program with step-level natural language guidance. The program allows direct execution, while the guidance keeps each step readable to the agent so that it can adapt the skill when conditions change.
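To make the idea of a skill concrete, here is a minimal sketch of how a parameterized action program with step-level guidance might be represented. The names `Step`, `Skill`, `bind`, and the `$placeholder` convention are assumptions made for illustration; they are not taken from the WebXSkill codebase.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class Step:
    action: str    # hypothetical action name, e.g. "type" or "click"
    target: str    # element description; may contain $placeholders
    value: str     # text to enter; may contain $placeholders, "" if unused
    guidance: str  # natural language note describing this step


@dataclass(frozen=True)
class Skill:
    name: str
    params: tuple[str, ...]
    steps: tuple[Step, ...]

    def bind(self, **args: str) -> list[Step]:
        """Return concrete steps with every $placeholder filled in."""
        missing = [p for p in self.params if p not in args]
        if missing:
            raise ValueError(f"missing parameters: {missing}")
        return [
            Step(
                action=s.action,
                target=Template(s.target).substitute(args),
                value=Template(s.value).substitute(args),
                guidance=Template(s.guidance).substitute(args),
            )
            for s in self.steps
        ]


# Illustrative skill; the site and element names are invented.
search_skill = Skill(
    name="search_product",
    params=("query",),
    steps=(
        Step("type", "search box", "$query",
             "Enter '$query' into the site search box."),
        Step("click", "search button", "",
             "Submit the search and wait for results."),
    ),
)
```

Calling `search_skill.bind(query="red shoes")` yields steps that carry both the executable fields (`action`, `target`, `value`) and the readable `guidance` text, which is the pairing the paragraph above describes.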
Three Stages of Operation
The WebXSkill framework operates in three distinct stages:
- Skill Extraction: This initial stage involves mining reusable action subsequences from readily available synthetic agent trajectories, which are then abstracted into parameterized skills.
- Skill Organization: In this stage, skills are indexed into a URL-based graph, enabling context-aware retrieval for efficient access.
- Skill Deployment: The final stage exposes two complementary modes:
  - Grounded Mode: This mode allows for fully automated multi-step execution.
  - Guided Mode: Here, skills function as step-by-step instructions that the agent follows using its native planning capabilities.
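The three stages can be sketched roughly as follows. This is an illustrative simplification, not the authors' implementation: extraction is reduced to counting repeated fixed-length action subsequences (the abstraction into parameters is omitted), the URL-based graph is assumed to be a hierarchy of URL path prefixes, and all names (`mine_repeated`, `SkillIndex`, `run_grounded`, `as_guidance`) are hypothetical.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Callable
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Skill:
    name: str
    actions: tuple[str, ...]   # executable steps (hypothetical string encoding)
    guidance: tuple[str, ...]  # one natural language hint per action


# Stage 1 (extraction): keep action n-grams that recur across trajectories.
def mine_repeated(trajectories: list[list[str]], length: int = 2,
                  min_count: int = 2) -> set[tuple[str, ...]]:
    counts: Counter = Counter()
    for traj in trajectories:
        # Count each n-gram at most once per trajectory.
        counts.update({tuple(traj[i:i + length])
                       for i in range(len(traj) - length + 1)})
    return {seq for seq, n in counts.items() if n >= min_count}


# Stage 2 (organization): index skills by URL node (host plus path).
class SkillIndex:
    def __init__(self) -> None:
        self._by_node: dict[str, list[Skill]] = defaultdict(list)

    @staticmethod
    def _node(url: str) -> str:
        parts = urlsplit(url)
        return parts.netloc + parts.path.rstrip("/")

    def add(self, url: str, skill: Skill) -> None:
        self._by_node[self._node(url)].append(skill)

    def retrieve(self, url: str) -> list[Skill]:
        """Collect skills for this page and for each ancestor path."""
        node = self._node(url)
        found: list[Skill] = []
        while True:
            found.extend(self._by_node.get(node, []))
            if "/" not in node:
                break
            node = node.rsplit("/", 1)[0]
        return found


# Stage 3a (grounded mode): run every action, stopping at the first failure.
def run_grounded(skill: Skill, execute: Callable[[str], bool]) -> bool:
    return all(execute(action) for action in skill.actions)


# Stage 3b (guided mode): render the skill as numbered instructions.
def as_guidance(skill: Skill) -> str:
    return "\n".join(f"{i}. {hint}"
                     for i, hint in enumerate(skill.guidance, start=1))
```

In this sketch, a skill registered under a parent path is also retrieved for pages beneath it, which is one simple way to make retrieval depend on where the agent currently is. The `execute` callback stands in for whatever browser driver the agent uses.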
Performance Improvements
In evaluations on the WebArena and WebVoyager environments, WebXSkill improved task success rate over the baseline by up to 9.8 points and 12.9 points, respectively. These results support the case for executable skills as a way to improve web agents.
Accessing WebXSkill
The code is publicly available at https://github.com/aiming-lab/WebXSkill for researchers and developers who want to explore or build on the framework.
