StoryTR: Video Retrieval with Theory of Mind Reasoning

StoryTR: Narrative-Centric Video Temporal Retrieval with Theory of Mind Reasoning

Recent work on video moment retrieval has focused mainly on action-centric tasks and has largely overlooked narrative content. The underlying problem is a semantic gap: current models can identify what is happening in a video, but they struggle to discern why it matters. The limitation stems from a weak grasp of Theory of Mind (ToM), the cognitive ability to infer implicit intentions, mental states, and narrative causality from surface-level observations. To address this gap, researchers have introduced StoryTR, a video moment retrieval benchmark that requires ToM reasoning.

StoryTR consists of 8.1k samples derived from narrative short-form videos such as shorts and reels. These videos suit the task because they are information-dense and convey meaning through subtle multimodal cues. For example, a character’s glance may seem benign on its own, but paired with a sigh it may suggest concealed hostility. Interpreting such cues correctly requires ToM reasoning rather than surface-level recognition.

The Significance of Theory of Mind in Video Retrieval

StoryTR puts ToM at the center of video moment retrieval. Decoding a narrative and inferring intentions is necessary for understanding character motivations and the underlying themes of a video. To teach this reasoning to models, the researchers propose an Agentic Data Pipeline that generates training data with explicit three-tier ToM chains. The three tiers are:

Intent Decoding: Understanding the character’s goals and motivations.
Narrative Reasoning: Making inferences about the relationships and events within the narrative.
Boundary Localization: Identifying the key moments that define the narrative structure.

Training on these chains is intended to give models a deeper understanding of narrative dynamics, which in turn improves their performance on video retrieval tasks.
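To make the three tiers concrete, here is a minimal sketch of how one training sample with a ToM chain might be represented. The article does not describe the pipeline's actual data format, so the class name, the field names, and the serialized layout below are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToMChainSample:
    """Hypothetical training sample; not the format used by the StoryTR authors."""

    video_id: str
    query: str
    intent_decoding: str      # tier 1: the character's goals and motivations
    narrative_reasoning: str  # tier 2: inferred relationships and events
    start_sec: float          # tier 3: boundary localization, moment start
    end_sec: float            # tier 3: boundary localization, moment end

    def __post_init__(self) -> None:
        # A moment must be a non-empty span that starts at or after time zero.
        if not 0 <= self.start_sec < self.end_sec:
            raise ValueError("expected 0 <= start_sec < end_sec")

    def to_training_text(self) -> str:
        """Serialize the tiers in order, so the boundary follows the reasoning."""
        return (
            f"Query: {self.query}\n"
            f"Intent: {self.intent_decoding}\n"
            f"Narrative: {self.narrative_reasoning}\n"
            f"Moment: [{self.start_sec:.1f}, {self.end_sec:.1f}]"
        )


# Made-up example values, loosely based on the glance-and-sigh illustration.
sample = ToMChainSample(
    video_id="clip_0001",
    query="When does she first show that she resents her colleague?",
    intent_decoding="She wants to hide her irritation while staying polite.",
    narrative_reasoning="The glance plus the sigh follows his interruption.",
    start_sec=12.0,
    end_sec=17.5,
)
print(sample.to_training_text())
```

Placing the boundary last mirrors the order of the tiers listed above: the model is shown the intent and the narrative inference before it is asked to commit to timestamps.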

Experimental Findings

Experiments on the StoryTR benchmark reveal a pronounced reasoning gap. For example, the Gemini-3.0-Pro model achieved an average Intersection over Union (IoU) of only 0.53 on the StoryTR dataset, which shows how difficult narrative understanding remains for contemporary models. In contrast, the 7B Shorts-Moment model, which was trained specifically on ToM-guided data, achieved a 15.1% relative increase in IoU over baseline models.
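For readers unfamiliar with the metric, the sketch below shows the standard way temporal IoU is computed for a predicted and a ground-truth moment, along with how a relative increase is derived from two average scores. The function names and the example numbers are illustrative assumptions; they are not taken from the StoryTR evaluation code or its results.

```python
from typing import Sequence, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Overlap duration divided by the combined duration of both spans."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Span], gts: Sequence[Span]) -> float:
    """Average IoU over paired predictions and ground-truth moments."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and equal in length")
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


def relative_gain(new_score: float, baseline_score: float) -> float:
    """Relative increase, e.g. 0.25 means 25% above the baseline."""
    return (new_score - baseline_score) / baseline_score


# Made-up spans: one partial overlap and one complete miss.
preds = [(2.0, 8.0), (0.0, 2.0)]
gts = [(4.0, 10.0), (5.0, 7.0)]
print(mean_iou(preds, gts))

# Made-up scores, only to show the arithmetic behind a "relative increase".
print(relative_gain(0.5, 0.4))
```

Note that a relative increase is measured against the baseline score, so it is not the same as a gain in absolute IoU points.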

This finding suggests that narrative reasoning capability matters more than parameter scale alone for model performance on this task. As the field progresses, integrating ToM reasoning into video moment retrieval systems may pave the way for more sophisticated and context-aware AI applications.

Conclusion

The StoryTR benchmark is a step toward bridging the semantic gap in video moment retrieval, and it makes the case that Theory of Mind reasoning is necessary for understanding complex narratives. As AI continues to evolve, models that grasp human intention and narrative will be essential for richer and more meaningful interactions between technology and users.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
