Tuning Qwen2.5-VL to Improve Its Web Interaction Skills
Summary: arXiv:2604.09571v1
Recent advances in vision-language models (VLMs) have sparked growing interest in using them to automate web tasks, yet their feasibility as independent agents that reason and act purely from visual input remains underexplored. This article investigates this setting using Qwen2.5-VL-32B, one of the strongest open-source VLMs available, focusing on improving its reliability in web-based control.
Identifying Key Challenges
Through initial experimentation, we observed three key challenges that hinder the performance of Qwen2.5-VL-32B:
- Inaccurate Localization: The model often struggles with accurately localizing target elements, the cursor, and their relative positions on the web page.
- Sensitivity to Instruction Phrasing: The model’s performance varies significantly with how an instruction is phrased, which indicates that its instruction comprehension is not yet robust.
- Overoptimistic Action Bias: The model tends to assume its actions succeed without adequately analyzing their actual outcomes, leading to potential errors in task execution.
Fine-Tuning Methodology
To address these issues, we developed a fine-tuning approach for Qwen2.5-VL-32B focused on a basic web interaction task: moving the mouse and clicking on a page element described in natural language. Our training pipeline consists of two key stages:
- Stage One: Teaching the model to determine whether the cursor already hovers over the target element or whether movement is required. This step is crucial for reducing unnecessary actions and improving efficiency.
- Stage Two: Training the model to execute a single command (either a mouse move or a mouse click) at a time. After executing the command, the model verifies the resulting state of the environment before planning the next action. This sequential approach enhances the model’s understanding of cause and effect in web interactions.
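The two stages above can be pictured as an observe, act, verify loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: `Box`, `Page`, `hovers_over_target`, `make_noisy_policy`, and `run_episode` are hypothetical names, the toy `Page` stands in for a real browser, and the hover check uses ground-truth geometry where the fine-tuned model would judge from a screenshot.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of the target element (hypothetical)."""
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x: int, y: int) -> bool:
        return self.left <= x < self.right and self.top <= y < self.bottom

    def center(self) -> Tuple[int, int]:
        return ((self.left + self.right) // 2, (self.top + self.bottom) // 2)


@dataclass
class Page:
    """Toy environment standing in for a browser page."""
    target: Box
    cursor: Tuple[int, int] = (0, 0)
    clicked_target: bool = False

    def move(self, x: int, y: int) -> None:
        self.cursor = (x, y)

    def click(self) -> None:
        # A click only counts if the cursor is over the target.
        if self.target.contains(*self.cursor):
            self.clicked_target = True


def hovers_over_target(page: Page) -> bool:
    # Stage one: decide whether the cursor already hovers over the target.
    # Assumption: ground-truth geometry replaces the model's visual judgment.
    return page.target.contains(*page.cursor)


def make_noisy_policy(offsets: List[Tuple[int, int]]) -> Callable[[Page], Tuple[int, int]]:
    """Hypothetical move policy: aims at the target center, but each entry
    in `offsets` perturbs one proposal to imitate a localization error."""
    pending = iter(offsets)

    def propose(page: Page) -> Tuple[int, int]:
        cx, cy = page.target.center()
        dx, dy = next(pending, (0, 0))
        return cx + dx, cy + dy

    return propose


def run_episode(page: Page, propose_move: Callable[[Page], Tuple[int, int]],
                max_steps: int = 5) -> List[str]:
    """Stage two: issue one command per step, then re-check the state."""
    actions: List[str] = []
    for _ in range(max_steps):
        if hovers_over_target(page):
            page.click()
            actions.append("click")
            if page.clicked_target:  # verify the click's outcome
                break
        else:
            x, y = propose_move(page)
            page.move(x, y)
            actions.append(f"move({x},{y})")
            # No assumption that the move succeeded: the next iteration
            # re-observes the cursor before deciding what to do.
    return actions


page = Page(target=Box(100, 100, 200, 140))
actions = run_episode(page, make_noisy_policy([(150, 0)]))
print(actions, page.clicked_target)
```

In this toy run the first move overshoots the target. Because the loop checks the hover state before clicking rather than assuming the move worked, it issues a corrective move first, which is the behavior that the overoptimistic action bias would otherwise skip.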
Evaluation and Results
We evaluated our approach on a custom benchmark of single-click web tasks designed to test the model under a variety of challenging conditions. Fine-tuning raised the success rate from 86% to 94%, a gain of 8 percentage points that reduces the failure rate from 14% to 6%.
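To put the reported success rates in perspective, the short sketch below computes the absolute gain and the relative reduction in failures. The helper names are hypothetical, and only the two percentages come from the text; the number of benchmark tasks is not stated, so no task counts are assumed.

```python
def percentage_point_gain(before_pct: float, after_pct: float) -> float:
    """Absolute change in success rate, in percentage points."""
    return after_pct - before_pct


def relative_failure_reduction(before_pct: float, after_pct: float) -> float:
    """Fraction of the original failures that no longer occur."""
    fail_before = 100.0 - before_pct
    fail_after = 100.0 - after_pct
    return (fail_before - fail_after) / fail_before


gain = percentage_point_gain(86.0, 94.0)
reduction = relative_failure_reduction(86.0, 94.0)
print(f"gain: {gain:.0f} points, failures reduced by {reduction:.0%}")
```

Read this way, the fine-tuned model fails on 6% of tasks instead of 14%, so somewhat more than half of the original failures are eliminated.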
Conclusion
Fine-tuning Qwen2.5-VL-32B on a basic move-and-click task shows that targeted training can make a vision-language model more reliable at web control from visual input alone. The approach targets the three challenges identified above: inaccurate localization, sensitivity to instruction phrasing, and overoptimistic action bias. Future work will focus on expanding the range of tasks and further improving the model’s adaptability in dynamic web environments.
