CAPTCHA Solving for Native GUI Agents: Automated Reasoning-Action Data Generation and Self-Corrective Training
Summary: arXiv:2603.23559v1 (announce type: cross)
GUI agents are shifting from multi-module pipelines to end-to-end, native vision-language models (VLMs) that interpret raw screenshots and act directly on digital devices. Despite rapid progress on general GUI tasks, CAPTCHA solving remains a major obstacle for these agents. Specialized CAPTCHA-solving pipelines exist, but they cannot perform general GUI tasks.
Introducing ReCAP: A CAPTCHA-Capable Native GUI Agent
To bridge the gap between specialized CAPTCHA solvers and general GUI agents, we introduce ReCAP, a CAPTCHA-capable native GUI agent. ReCAP is built to solve modern, interactive CAPTCHA challenges robustly while remaining effective as a general GUI agent. We first develop a dynamic CAPTCHA system covering seven representative CAPTCHA types, designed to stress both the primary and the complementary capabilities that CAPTCHA solving requires.
Key Features of ReCAP
- Dynamic CAPTCHA System: The seven CAPTCHA types stress the capabilities an agent needs to solve them: robust Optical Character Recognition (OCR) under heavy noise and text stylization, fine-grained visual understanding, and precise control.
- Automated Data Collection: A data collection and curation pipeline generates large-scale CAPTCHA interaction trajectories, each paired with reasoning traces. These trajectories serve as training data for the agent.
- Self-Corrective Training: CAPTCHA solving often takes several interaction steps, and any step can go wrong. We therefore turn failed trajectories into self-correction data, which trains the agent to reflect on its errors and correct its actions during the interaction.
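The self-corrective idea above can be illustrated with a minimal Python sketch. The source does not describe how failed trajectories are converted into training samples, so everything here is an assumption made for illustration: the `Step`, `Trajectory`, and `CorrectionSample` types, the action strings, and in particular the choice to locate the mistake by comparing a failed trajectory step by step against a successful reference trajectory for the same challenge.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    observation: str  # placeholder for a screenshot reference (hypothetical)
    reasoning: str    # reasoning trace recorded for this step
    action: str       # action string, e.g. "click(40, 200)" (hypothetical format)


@dataclass
class Trajectory:
    captcha_type: str
    steps: List[Step]
    solved: bool


@dataclass
class CorrectionSample:
    context: List[Step]    # history up to and including the mistaken step
    target_reasoning: str  # reflection the agent should learn to produce
    target_action: str     # corrected action taken from the reference


def first_divergence(failed: Trajectory, reference: Trajectory) -> Optional[int]:
    """Return the index of the first step whose action differs from the reference."""
    for i, (f, r) in enumerate(zip(failed.steps, reference.steps)):
        if f.action != r.action:
            return i
    return None


def build_correction_sample(
    failed: Trajectory, reference: Trajectory
) -> Optional[CorrectionSample]:
    """Turn one failed trajectory into a self-correction training sample.

    Assumption: a solved reference trajectory for the same challenge exists
    and is aligned step by step with the failed one.
    """
    if failed.solved or not reference.solved:
        return None
    i = first_divergence(failed, reference)
    if i is None:
        return None
    wrong, right = failed.steps[i], reference.steps[i]
    reflection = (
        f"The previous action {wrong.action} did not make progress. "
        f"{right.reasoning}"
    )
    return CorrectionSample(
        context=failed.steps[: i + 1],
        target_reasoning=reflection,
        target_action=right.action,
    )


if __name__ == "__main__":
    reference = Trajectory(
        "slider",
        [
            Step("shot_0", "Locate the slider handle.", "click(40, 200)"),
            Step("shot_1", "Drag the handle to the gap.", "drag(40, 200, 180, 200)"),
        ],
        solved=True,
    )
    failed = Trajectory(
        "slider",
        [
            Step("shot_0", "Locate the slider handle.", "click(40, 200)"),
            Step("shot_1", "Drag to the right edge.", "drag(40, 200, 300, 200)"),
        ],
        solved=False,
    )
    sample = build_correction_sample(failed, reference)
    print(sample.target_reasoning)
    print(sample.target_action)
```

The sample keeps the mistaken step in its context so that the training target is a reflection followed by a corrected action, rather than a clean trajectory that never shows an error. A real pipeline would likely need a more tolerant alignment than exact string comparison of actions.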
Performance Improvements
Our experiments show that ReCAP substantially improves CAPTCHA-solving success rates: on the CAPTCHA system described above, the success rate rises from approximately 30% to 80%. This gain does not come at the expense of general ability, as ReCAP maintains strong performance on standard general GUI-agent benchmarks.
Conclusion
ReCAP shows that a native GUI agent can learn to solve CAPTCHAs without giving up its proficiency on general GUI tasks. The result rests on two ingredients: automated generation of reasoning-action data, and self-corrective training built from failed trajectories.
