InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning
Researchers have introduced InquireMobile, a model designed to teach Vision-Language Model (VLM)-based mobile agents to ask users for help. It targets the safety risks of fully autonomous agents, which do not always understand or reason correctly in complex real-world scenarios.
The paper, available on arXiv (arXiv:2508.19679v2), describes how to improve mobile agents’ ability to seek human assistance at critical decision points. The researchers emphasize the importance of human oversight in mobile agent interactions, especially for ambiguous or complex tasks.
The Challenge of Autonomous Decision Making
As VLMs continue to evolve, their integration into mobile agents has enabled these systems to perceive and interact with dynamic environments based on human instructions. However, reliance on fully autonomous decision-making can lead to safety risks, particularly when agents encounter scenarios beyond their training data or reasoning capabilities. To mitigate these risks, the researchers propose a new approach that encourages proactive inquiry from mobile agents.
Introducing InquireBench
At the core of this research is InquireBench, a benchmark that assesses how well mobile agents interact safely and proactively ask users for input. InquireBench is divided into five categories and 22 sub-categories, covering a range of situations that VLM-based agents currently handle poorly. Many existing models score near zero on it, which points to the need for better training methods.
- Evaluation Categories:
  - Understanding Ambiguity
  - Contextual Awareness
  - User Intent Recognition
  - Safety Protocols
  - Proactive Communication
- Sub-Categories (five of the 22 are listed here):
  - Real-Time Decision Making
  - Complex Query Handling
  - Feedback Integration
  - Task Prioritization
  - Safety Compliance Checks
Development of InquireMobile
To build a mobile agent that can request human assistance when it is needed, the researchers developed InquireMobile, trained with a two-stage strategy inspired by reinforcement learning. The model incorporates an interactive pre-action reasoning mechanism that prompts the agent to seek confirmation from the user before executing critical tasks. This check informs the agent’s decision before it acts and keeps the user involved in the decisions that matter most.
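The idea of confirming with the user before a critical action can be illustrated with a small sketch. This is not the authors' implementation: in InquireMobile the decision to ask comes from the trained model, whereas here a hand-written rule stands in for it. The names `Action`, `CRITICAL_ACTIONS`, and `execute_with_inquiry` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical set of action kinds treated as critical.
# The article does not specify the criteria the model uses.
CRITICAL_ACTIONS = {"pay", "delete", "send_message"}


@dataclass
class Action:
    kind: str    # e.g. "pay", "scroll"
    target: str  # what the action applies to


def execute_with_inquiry(
    action: Action,
    ask_user: Callable[[str], bool],
    perform: Callable[[Action], None],
) -> str:
    """Ask the user before a critical action; run other actions directly."""
    if action.kind in CRITICAL_ACTIONS:
        question = f"About to {action.kind} '{action.target}'. Proceed?"
        if not ask_user(question):
            return "cancelled"  # user declined, nothing is executed
    perform(action)
    return "executed"


if __name__ == "__main__":
    performed = []
    # Stub user who declines every request.
    decline = lambda question: False
    print(execute_with_inquiry(Action("pay", "order 1"), decline, performed.append))
    print(execute_with_inquiry(Action("scroll", "feed"), decline, performed.append))
```

Passing `ask_user` and `perform` as callables keeps the gate independent of any particular UI or device controller, so the same check could sit in front of a real executor or a test stub.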
Performance and Future Directions
The study reports that InquireMobile achieved a 46.8% improvement in inquiry success rate over existing baseline models on InquireBench. It also achieved the highest overall success rate among the models compared.
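The article does not say whether the 46.8% figure is an absolute gain (percentage points) or a relative gain over the baseline. The sketch below shows how the two readings differ. The counts are invented for illustration and are not results from the paper, and the definition of success rate as successes divided by total tasks is an assumption.

```python
def success_rate(successes: int, total: int) -> float:
    """Fraction of tasks counted as successful (assumed definition)."""
    return successes / total


# Hypothetical counts, NOT figures from the paper.
baseline = success_rate(10, 40)
model = success_rate(20, 40)

# Absolute gain: difference between the two rates.
absolute_gain = model - baseline
# Relative gain: the same difference as a fraction of the baseline rate.
relative_gain = (model - baseline) / baseline

print(f"absolute: {absolute_gain * 100:.1f} percentage points")
print(f"relative: {relative_gain * 100:.1f}%")
```

With these made-up counts the same change is a 25-point absolute gain and a 100% relative gain, which is why a reported improvement should be read alongside the baseline rate.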
To support further research, the authors have committed to open-sourcing all datasets, models, and evaluation code. The aim is to help both academia and industry improve the safety and effectiveness of VLM-based mobile agents in real-world applications.
InquireMobile is a step toward more reliable and safer AI systems that bring human judgment into their decision-making when it is needed.
