UI-Oceanus: Scaling GUI Agents with Synthetic Environmental Dynamics
arXiv:2604.02345v1 (cross-listed)
Abstract: Scaling generalist GUI agents is hindered by the data scalability bottleneck of expensive human demonstrations and the “distillation ceiling” of synthetic teacher supervision. To transcend these limitations, we propose UI-Oceanus, a framework that shifts the learning focus from mimicking high-level trajectories to mastering interaction physics via ground-truth environmental feedback.
Through a systematic investigation of self-supervised objectives, we identify that forward dynamics, defined as the generative prediction of future interface states, acts as the primary driver for scalability and significantly outweighs inverse inference. UI-Oceanus leverages this insight by converting low-cost autonomous exploration, which is verified directly by system execution, into high-density generative supervision to construct a robust internal world model.
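The difference between the forward and inverse objectives can be made concrete with a small sketch. It assumes each exploration step is logged as a (state, action, next state) triple and that interface states are serialized as text. The abstract does not specify the state representation or the training format, so the `Transition` class, the prompt layout, and all function names here are hypothetical illustrations rather than the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Transition:
    """One executed exploration step (hypothetical log format).

    next_state is whatever the system actually showed after the action ran,
    so it serves as ground truth without any human or teacher-model label.
    """
    state: str       # serialized interface state before the action
    action: str      # action the explorer executed
    next_state: str  # interface state observed after execution


def forward_dynamics_sample(t: Transition) -> Tuple[str, str]:
    """Forward dynamics: given state and action, generate the next state."""
    prompt = f"STATE: {t.state}\nACTION: {t.action}\nNEXT STATE:"
    return prompt, t.next_state


def inverse_dynamics_sample(t: Transition) -> Tuple[str, str]:
    """Inverse inference: given state and next state, recover the action."""
    prompt = f"STATE: {t.state}\nNEXT STATE: {t.next_state}\nACTION:"
    return prompt, t.action


def build_forward_corpus(transitions: List[Transition]) -> List[Tuple[str, str]]:
    """Turn logged exploration steps into (prompt, target) training pairs."""
    return [forward_dynamics_sample(t) for t in transitions]


# Example with made-up interface text:
step = Transition(
    state="[Button 'Settings']",
    action="click('Settings')",
    next_state="[Header 'Settings'] [Toggle 'Wi-Fi']",
)
prompt, target = forward_dynamics_sample(step)
```

In this sketch the forward target is an entire interface state, while the inverse target is a single short action string. That is one way to read the abstract's phrase "high-density generative supervision": each logged step yields far more target tokens under the forward objective.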
Key Findings of UI-Oceanus
Experimental evaluations across a series of models show consistent gains for the approach:
- Continual Pre-Training (CPT): Models utilizing CPT on synthetic dynamics outperform non-CPT baselines with an average success rate improvement of 7% on offline benchmarks.
- Real-World Navigation: The success rate gain amplifies to 16.8% in real-world online navigation tasks.
- Data Volume Impact: Navigation performance scales positively with the volume of synthetic data used during training.
Advantages of Forward Predictive Modeling
The results confirm that grounding agents in forward predictive modeling offers a superior pathway to scalable GUI automation with:
- Robust Cross-Domain Adaptability: The ability to adapt across different environments and tasks without extensive retraining.
- Compositional Generalization: The capacity to generalize learned skills to new and unseen combinations of tasks or interfaces.
Conclusion
UI-Oceanus targets two obstacles to scaling GUI agents: the cost of human demonstrations and the distillation ceiling of synthetic teacher supervision. Instead of imitating high-level trajectories, it learns interaction physics from autonomous exploration that is verified by system execution, which makes supervision cheap to collect. The reported gains on offline benchmarks and in real-world online navigation suggest that forward predictive modeling is a promising direction for scaling GUI agents.
As demand for intelligent automation grows, frameworks like UI-Oceanus could contribute to more capable and adaptable GUI agents.
