Can LLMs Perceive Time? An Empirical Investigation
arXiv:2604.00010v1 (cross-listed)
Abstract
Large language models (LLMs) are adept at a variety of language-related tasks, but they have a critical limitation: they cannot reliably perceive time. We examine this limitation through an empirical investigation comprising four experiments across 68 tasks and four distinct model families. Pre-task estimates consistently overshoot actual durations by a factor of 4 to 7 (p < 0.001), with models often predicting human-scale minutes for tasks that complete in mere seconds. The relative ordering of task durations is similarly inaccurate, particularly on task pairs designed to expose the models' reliance on heuristics. For instance, GPT-5 scored only 18% on counter-intuitive pairs (p = 0.033), indicating a systematic failure when confronted with misleading complexity labels. Post-hoc recall of task durations is also disconnected from reality, diverging from actual durations by an order of magnitude in either direction. These failures persist in multi-step agentic settings, where errors range from 5x to 10x. While the models possess propositional knowledge about duration from their training data, they lack experiential grounding in their own inference time. This shortcoming has practical implications for agent scheduling, planning, and time-critical scenarios.
Introduction
The ability of AI systems, particularly large language models, to understand and estimate time remains a topic of significant interest and concern. As these models are increasingly integrated into various applications, from automated customer service to sophisticated planning systems, their inability to accurately perceive time could lead to inefficiencies and errors. This article aims to explore the depth of this limitation through a series of empirical experiments.
Methodology
To assess the temporal perception of LLMs, we designed four experiments involving 68 tasks across four different model families. Each experiment was structured to evaluate both pre-task estimates and post-task recalls, allowing us to measure the discrepancies between predicted and actual durations.
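The discrepancy between predicted and actual durations can be expressed as a simple ratio. The following is a minimal sketch of that bookkeeping, not the study's actual harness: the function names (`measure_seconds`, `error_ratio`, `geometric_mean`) are hypothetical, and the choice of a geometric mean to aggregate ratios is an assumption made here because it treats a 2x overestimate and a 2x underestimate symmetrically.

```python
import math
import time
from typing import Callable, Sequence


def measure_seconds(task: Callable[[], object]) -> float:
    """Run `task` once and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    task()
    return time.perf_counter() - start


def error_ratio(estimated_s: float, actual_s: float) -> float:
    """Estimated duration divided by actual duration (> 1 is an overestimate)."""
    if estimated_s <= 0 or actual_s <= 0:
        raise ValueError("durations must be positive")
    return estimated_s / actual_s


def geometric_mean(ratios: Sequence[float]) -> float:
    """Aggregate ratios on a log scale (an assumption of this sketch)."""
    if not ratios:
        raise ValueError("need at least one ratio")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Illustrative, made-up numbers: a model predicts 120 s for a task that
# is then measured at 20 s, giving a 6x overestimate.
print(error_ratio(120.0, 20.0))  # 6.0
```

In practice, `actual_s` would come from `measure_seconds` wrapped around the model call, and `estimated_s` from parsing the model's own pre-task prediction.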
Key Findings
- Pre-task Estimates: The models consistently overestimated task durations by a factor of 4 to 7 (p < 0.001), often predicting minutes for tasks that completed in seconds.
- Relative Ordering: In task pairs designed to challenge heuristic reliance, GPT-5 scored at or below chance, reaching only 18% on counter-intuitive pairs (p = 0.033).
- Post-hoc Recall: The models' recall of task durations diverged from the actual durations by an order of magnitude in either direction.
- Multi-step Settings: Errors persisted in multi-step agentic tasks, where estimates were off by a factor of 5 to 10.
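The relative-ordering result can be scored as the fraction of pairs where the model picks the task that actually took longer, then compared against chance. The sketch below is one plausible way to do this, under stated assumptions: the source does not say which statistical test was used, so the exact one-sided binomial tail and the 0.5 chance level for a two-way choice are assumptions here, and the function names are hypothetical.

```python
from math import comb
from typing import Sequence


def ordering_accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of pairs where the predicted longer task matches the measured one."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def binomial_lower_tail(correct: int, total: int, chance: float = 0.5) -> float:
    """P(X <= correct) for X ~ Binomial(total, chance).

    A small value suggests the score is below chance rather than merely
    noisy. Assumes independent pairs and a two-way choice per pair.
    """
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total")
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct + 1)
    )


# Illustrative labels only: "A" or "B" names the task judged to be longer.
predicted = ["A", "B", "A", "A"]
actual = ["A", "A", "B", "A"]
print(ordering_accuracy(predicted, actual))  # 0.5
```

A score well below 0.5 on pairs built to mislead is more informative than a score near 0.5, since it points to a systematic bias rather than random guessing.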
Implications
These findings underscore the importance of developing LLMs that possess not only propositional knowledge about duration but also experiential grounding in their own inference time. As AI applications become more complex and time-sensitive, addressing this limitation is crucial for the reliability and efficiency of systems that schedule and plan.
Conclusion
The empirical investigation into LLMs’ perception of time reveals a critical gap: the models know about durations from their training data but cannot reliably estimate, order, or recall the durations of their own tasks. As these models are integrated into real-world applications, understanding and improving their temporal awareness will be essential for reliable scheduling, planning, and time-critical behavior.
