Mobile GUI Agents under Real-world Threats: Are We There Yet?
Summary: arXiv:2507.04227v2 Announce Type: replace-cross
Recent years have witnessed a rapid development of mobile GUI agents powered by large language models (LLMs),
which can autonomously execute diverse device-control tasks based on natural language instructions. The increasing
accuracy of these agents on standard benchmarks has raised expectations for large-scale real-world deployment,
and there are already several commercial agents released and used by early adopters. However, are we really
ready for GUI agents integrated into our daily devices as system building blocks?
We argue that an important pre-deployment validation is missing to examine whether the agents can maintain
their performance under real-world threats. Specifically, unlike existing common benchmarks that are based on
simple static app contents (they have to do so to ensure environment consistency between different tests),
real-world apps are filled with contents from untrustworthy third parties, such as advertisement emails,
user-generated posts, and media, etc.
Introducing a New Framework
To address this gap, we introduce a scalable app content instrumentation framework to enable flexible and
targeted content modifications within existing applications. Leveraging this framework, we create a test
suite comprising both a dynamic task execution environment and a static dataset of challenging GUI states.
-
Dynamic Environment: This encompasses 122 reproducible tasks designed to simulate real-world
interactions. -
Static Dataset: This consists of over 3,000 scenarios constructed from commercial apps,
creating a robust testing ground for various conditions.
Experimental Findings
We performed experiments on both open-source and commercial GUI agents to evaluate their performance under
the influence of third-party content. The results were revealing:
-
All examined agents experienced significant performance degradation due to the presence of third-party
contents. - The average misleading rate was found to be 42.0% in dynamic environments and 36.1% in static environments.
These findings underscore the importance of validating GUI agents against real-world threats before their
widespread adoption. The current reliance on controlled benchmarks fails to account for the unpredictable
nature of user-generated content and advertisements that can compromise the performance of these agents.
Conclusion
As we move forward with the development and deployment of mobile GUI agents, it is crucial to integrate
rigorous testing frameworks that assess their resilience in real-world scenarios. The framework and benchmark
developed in this study have been released and can be accessed at https://agenthazard.github.io. Only through such comprehensive evaluations can we ensure
that these innovative tools are ready for everyday use, securing both reliability and user trust in the
technology.
