Theory of Mind in Action: The Instruction Inference Task in Dynamic Human-Agent Collaboration
arXiv:2507.02935v3
Introduction
Successful human-agent collaboration hinges on an agent’s ability to comprehend instructions from a human principal. However, instructions are often incomplete or ambiguous, so the agent must deduce unspoken intentions from the shared context. Doing so requires Theory of Mind (ToM), the capacity to infer the mental states of its human counterpart. This article discusses the implications of this concept and introduces a novel task called Instruction Inference.
The Instruction Inference Task
The Instruction Inference task is designed to evaluate ToM within a dynamic and collaborative environment. In this task, an agent aids a principal in achieving a specific goal by interpreting incomplete or ambiguous instructions. This capability is critical for effective human-agent collaboration, especially in scenarios where clear communication is not possible.
Introducing Tomcat
We present Tomcat, a large language model (LLM)-based agent crafted to demonstrate ToM reasoning in interpreting and responding to instructions from a principal. Tomcat is implemented in two distinct variants:
- Fs-CoT: Short for few-shot chain-of-thought, this variant uses a small number of examples that demonstrate the structured reasoning the task requires.
- CP: The commonsense prompt variant relies on commonsense knowledge and contextual information related to the problem at hand.
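As a rough illustration of how the two variants differ, the sketch below assembles one prompt of each kind. It is not the prompt format used for Tomcat: the `Example` class, the field labels ("Context", "Instruction", "Reasoning", "Intent"), and both function names are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """One worked demonstration (hypothetical structure, not from the paper)."""
    context: str
    instruction: str
    reasoning: str
    intent: str


def build_fs_cot_prompt(examples: List[Example], context: str, instruction: str) -> str:
    """Few-shot chain-of-thought: show worked examples, then ask for reasoning."""
    parts = [
        f"Context: {ex.context}\n"
        f"Instruction: {ex.instruction}\n"
        f"Reasoning: {ex.reasoning}\n"
        f"Intent: {ex.intent}"
        for ex in examples
    ]
    # The final block is left open so the model continues with its own reasoning.
    parts.append(f"Context: {context}\nInstruction: {instruction}\nReasoning:")
    return "\n\n".join(parts)


def build_cp_prompt(commonsense_notes: List[str], context: str, instruction: str) -> str:
    """Commonsense prompt: supply background knowledge instead of worked examples."""
    notes = "\n".join(f"- {note}" for note in commonsense_notes)
    return (
        f"Background knowledge:\n{notes}\n\n"
        f"Context: {context}\n"
        f"Instruction: {instruction}\n"
        "Intent:"
    )
```

Under these assumptions, the Fs-CoT prompt teaches by demonstration, while the CP prompt states the relevant knowledge directly and asks for the inferred intent.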
Implementation of Tomcat
Both variants of Tomcat have been implemented using three leading LLMs: GPT-4o, DeepSeek-R1, and Gemma-3-27B. This diversity in implementation allows us to assess the effectiveness of each model in performing the Instruction Inference task.
Research Methodology
To evaluate Tomcat’s capabilities, we conducted a study involving 52 human participants. Participants were provided with the same information as the CP variant of Tomcat. We measured the performance of Tomcat and the human participants using three key metrics:
- Intent Accuracy: This metric assesses how accurately Tomcat and the human participants could identify the intended actions based on the given instructions.
- Action Optimality: This evaluates the efficiency of the actions taken by both the agent and the human participants in reaching the goal.
- Planning Optimality: This metric measures the effectiveness of the plans developed by Tomcat and the human participants in achieving the desired outcome.
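The exact formulas behind these metrics are not given here, so the following is only a sketch under stated assumptions: intent accuracy is taken to be the fraction of trials in which the inferred intent matches the intended one, and both optimality metrics are taken to be the ratio of the optimal cost to the cost actually incurred, capped at 1. The function names and the cost-ratio definition are hypothetical.

```python
from typing import Sequence


def intent_accuracy(predicted: Sequence[str], intended: Sequence[str]) -> float:
    """Fraction of trials where the inferred intent equals the intended one (assumed definition)."""
    if len(predicted) != len(intended):
        raise ValueError("predicted and intended must have the same length")
    if not intended:
        return 0.0
    matches = sum(p == g for p, g in zip(predicted, intended))
    return matches / len(intended)


def optimality_ratio(optimal_cost: float, actual_cost: float) -> float:
    """Optimal cost divided by actual cost, capped at 1.0 (assumed definition).

    Could stand in for either action or planning optimality, depending on
    whether the costs count executed actions or planned steps.
    """
    if optimal_cost < 0 or actual_cost <= 0:
        raise ValueError("costs must be non-negative, and actual_cost must be positive")
    return min(1.0, optimal_cost / actual_cost)
```

For instance, with these definitions, inferring one of two intents correctly gives an intent accuracy of 0.5, and reaching the goal in 5 steps when 4 would suffice gives an optimality ratio of 0.8.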
Results and Discussion
The findings show that Tomcat, particularly the Fs-CoT variant with GPT-4o and DeepSeek-R1, achieved performance comparable to that of human participants. This suggests that LLM-based agents can apply ToM reasoning well enough to support human-agent collaboration. Tomcat’s ability to interpret ambiguous instructions accurately underscores the importance of developing agents capable of understanding human intentions.
Conclusion
The exploration of ToM through the Instruction Inference task highlights the growing capabilities of LLM-based agents in collaborative environments. As these technologies are developed and refined, the potential for effective human-agent teamwork is likely to grow, paving the way for more intuitive and adaptive interactions.
