Android Coach: Enhance Online Agentic Training Efficiency
Online reinforcement learning (RL) has significantly improved the capabilities of Android agents. However, guiding these agents through online interactions remains prohibitively expensive, because of both emulator latency and the sample inefficiency of existing RL algorithms. A critical limitation of current methods is the Single State Single Action paradigm, which updates the policy with one-to-one state-action pairs taken from online one-way rollouts. Each emulator state is costly to reach, yet this approach learns from only the single action executed there and leaves the rest of the state unexplored.
In response to these challenges, we introduce Android Coach, a framework that shifts the training paradigm from Single State Single Action to Single State Multiple Actions. Under this paradigm, the agent samples and learns from multiple actions for a single online state, without incurring additional emulator overhead.
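The difference between the two paradigms can be illustrated with a toy rollout loop. This is a minimal sketch, not the Android Coach implementation: `ToyEmulator`, `sample_action`, `collect`, and the action names are hypothetical stand-ins, and executing the first sampled candidate is an arbitrary choice made for the illustration. The point it shows is that sampling several candidate actions per state multiplies the number of state-action pairs collected while the number of emulator steps stays the same.

```python
import random

ACTIONS = ["tap", "swipe", "type", "back"]  # hypothetical action names


class ToyEmulator:
    """Hypothetical stand-in for an Android emulator; counts costly steps."""

    def __init__(self):
        self.steps = 0
        self.state = 0

    def step(self, action):
        # Each call models one expensive emulator interaction.
        self.steps += 1
        self.state += 1
        return self.state


def sample_action(state, rng):
    # Hypothetical policy: uniform random over the toy action set.
    return rng.choice(ACTIONS)


def collect(num_steps, actions_per_state, seed=0):
    """Roll out for num_steps, sampling actions_per_state candidates per state.

    actions_per_state=1 corresponds to Single State Single Action;
    actions_per_state>1 corresponds to Single State Multiple Actions.
    """
    rng = random.Random(seed)
    emulator = ToyEmulator()
    samples = []
    state = emulator.state
    for _ in range(num_steps):
        candidates = [sample_action(state, rng) for _ in range(actions_per_state)]
        # Every candidate becomes a (state, action) pair available for learning.
        samples.extend((state, action) for action in candidates)
        # Only one action is executed, so the emulator cost does not grow.
        state = emulator.step(candidates[0])
    return emulator.steps, samples


if __name__ == "__main__":
    for k in (1, 4):
        steps, samples = collect(num_steps=5, actions_per_state=k)
        print(f"actions_per_state={k}: emulator steps={steps}, pairs={len(samples)}")
```

The candidates that are never executed have no observed outcome, which is why a learned critic is needed to assign them a value.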
Key Features of Android Coach
- Critic Learning: Android Coach learns a critic that estimates action values, so the extra actions sampled at a state can be evaluated and used for training without executing each one in the emulator.
- Process Reward Model: To ensure that the critic serves as a reliable coach, we integrate a process reward model, which supplies step-level feedback rather than only an end-of-episode outcome signal.
- Group-Wise Advantage Estimator: We introduce a group-wise advantage estimator based on averaged critic outputs over the group of actions sampled for a state.
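One plausible reading of the group-wise advantage estimator is sketched below. The exact formula is not given in this summary, so this is an assumption: each sampled action's advantage is taken as its critic value minus the mean critic value of the group sampled at the same state. The function name `group_advantages` and the example values are hypothetical.

```python
from statistics import fmean


def group_advantages(q_values):
    """Advantage of each sampled action relative to its group.

    q_values: critic estimates for the actions sampled at ONE state.
    Assumption: the baseline is the plain average of those estimates;
    the actual estimator may normalize or weight them differently.
    """
    if not q_values:
        raise ValueError("need at least one critic value")
    baseline = fmean(q_values)
    return [q - baseline for q in q_values]


if __name__ == "__main__":
    # Hypothetical critic outputs for three actions sampled at one state.
    print(group_advantages([1.0, 2.0, 3.0]))  # → [-1.0, 0.0, 1.0]
```

Under this reading, actions the critic scores above the group average receive positive advantages and are reinforced, and actions below it are discouraged, with no additional emulator step needed for any of them.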
Experimental Results
Experiments demonstrate both the effectiveness and the efficiency of Android Coach. Compared with UI-TARS-1.5-7B, it improves success rates by 7.5% on AndroidLab and 8.3% on AndroidWorld. It also attains 1.4 times higher training efficiency than the Single State Single Action methods Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) at matched success rates.
Conclusion
Android Coach advances online reinforcement learning for Android agents by redefining the training paradigm to use multiple actions for a single state. The result is an alternative to existing Single State Single Action methods that reaches higher success rates and trains more efficiently. Beyond efficiency, this work points toward more capable agents that can operate in complex environments. As we continue to refine the framework, we see considerable potential for improved agentic learning and for application in real-world scenarios.
