Towards Scalable Lightweight GUI Agents via Multi-role Orchestration
arXiv:2604.13488v1
Abstract: Autonomous Graphical User Interface (GUI) agents powered by Multimodal Large Language Models (MLLMs) enable digital automation on end-user devices. While scaling both parameters and data has yielded substantial gains, advanced methods still suffer from prohibitive deployment costs on resource-constrained devices. When facing complex in-the-wild scenarios, lightweight GUI agents are bottlenecked by limited capacity and poor task scalability under end-to-end episodic learning, impeding adaptation to multi-agent systems (MAS), while training multiple skill-specific experts remains costly. Can we strike an effective trade-off in this cost-scalability dilemma, enabling lightweight MLLMs to participate in realistic GUI workflows?
To address these challenges, we propose the LAMO framework, which endows a lightweight MLLM with GUI-specific knowledge and task scalability, allowing multi-role orchestration to expand its capability boundary for GUI automation.
Key Features of the LAMO Framework
The LAMO framework combines role-oriented data synthesis with a two-stage training recipe:
- Supervised Fine-tuning: Perplexity-Weighted Cross-Entropy optimization, used for knowledge distillation and visual perception enhancement.
- Reinforcement Learning: Role-oriented cooperative exploration, used to improve the agent’s adaptability and performance.
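The text above names Perplexity-Weighted Cross-Entropy but does not give its formula. As a rough illustration only, the sketch below assumes one plausible reading: each training sequence is weighted by its own perplexity, normalised over the batch, so that sequences the model finds harder contribute more to the loss. The function names, the per-sequence weighting, and the normalisation are all assumptions for this example, not the authors' definition.

```python
import math


def sequence_nll(token_probs):
    """Mean negative log-likelihood of the target tokens in one sequence."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity_weighted_ce(batch_token_probs):
    """Perplexity-weighted average of per-sequence cross-entropy.

    ASSUMPTION (not from the source): each sequence is weighted by its
    perplexity, normalised over the batch, so harder sequences count more.
    """
    nlls = [sequence_nll(probs) for probs in batch_token_probs]
    ppls = [math.exp(nll) for nll in nlls]      # perplexity = exp(mean NLL)
    total = sum(ppls)
    weights = [ppl / total for ppl in ppls]     # weights sum to 1
    return sum(w * nll for w, nll in zip(weights, nlls))


# Probabilities the model assigns to the correct tokens of two toy sequences.
batch = [[0.5, 0.5], [0.25, 0.25]]
loss = perplexity_weighted_ce(batch)
plain_mean = sum(sequence_nll(p) for p in batch) / len(batch)
```

In this toy batch the second sequence has the higher perplexity, so the weighted loss sits above the plain mean. In an actual training loop the weights would normally be treated as constants (no gradient flowing through them); that detail is also an assumption here.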
Development of LAMO-3B
With LAMO, we have developed a task-scalable native GUI agent known as LAMO-3B. This agent supports both monolithic execution and MAS-style orchestration, allowing for a flexible approach to GUI automation.
When paired with an advanced planner as a plug-and-play policy executor, LAMO-3B can keep benefiting from planner advances, which raises its performance ceiling.
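The two execution modes can be pictured with a minimal sketch. Everything in it is hypothetical: the `Action` schema, the `Planner` and `Executor` call signatures, and the toy stand-ins are invented for illustration and do not describe LAMO-3B's actual interface. The point is only that the executor stays the same while the planner can be swapped.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """One low-level GUI action (hypothetical schema for this sketch)."""
    kind: str    # e.g. "tap" or "type"
    target: str  # description of the UI element or text


# Hypothetical interfaces: a planner maps a task to sub-goals,
# an executor maps one instruction to one GUI action.
Planner = Callable[[str], List[str]]
Executor = Callable[[str], Action]


def run_monolithic(task: str, executor: Executor) -> List[Action]:
    """The executor receives the whole task directly, with no planner."""
    return [executor(task)]


def run_orchestrated(task: str, planner: Planner, executor: Executor) -> List[Action]:
    """An external planner decomposes the task; the executor handles each step."""
    return [executor(step) for step in planner(task)]


# Toy stand-ins so the sketch runs without any model.
def toy_planner(task: str) -> List[str]:
    return [part.strip() for part in task.split(" then ")]


def toy_executor(instruction: str) -> Action:
    return Action(kind="tap", target=instruction)


task = "open settings then enable wifi"
mono_actions = run_monolithic(task, toy_executor)
mas_actions = run_orchestrated(task, toy_planner, toy_executor)
```

Because `run_orchestrated` only depends on the `Planner` signature, replacing `toy_planner` with a stronger planner changes the sub-goals without touching the executor, which is the plug-and-play property described above.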
Evaluation and Results
Extensive static and online evaluations validate the effectiveness of our design. In these evaluations, LAMO-3B performs tasks efficiently and adapts across scenarios, which matters for complex GUI workflows where agents with a rigid design often fall short.
Conclusion
The LAMO framework targets the cost-scalability dilemma facing lightweight GUI agents. By improving task scalability without the cost of training multiple skill-specific experts, LAMO-3B offers a path toward more efficient digital automation on end-user devices. As demand for intelligent automation grows, the methods presented here may inform the design of future GUI agents.
