StoryTR: Video Retrieval with Theory of Mind Reasoning

StoryTR: Narrative-Centric Video Temporal Retrieval with Theory of Mind Reasoning

Recent work on video moment retrieval has focused mainly on action-centric tasks and has largely overlooked narrative content. The underlying problem is a semantic gap: current models can identify what is happening in a video, but they struggle to discern why it matters. This limitation stems from a weak grasp of Theory of Mind (ToM), the cognitive ability to infer implicit intentions, mental states, and narrative causality from surface-level observations. To address this gap, researchers have introduced StoryTR, a video moment retrieval benchmark built to require ToM reasoning.

StoryTR consists of 8.1k samples derived from narrative short-form videos such as shorts and reels. These videos suit the task because their high information density conveys meaning through subtle multimodal cues. For example, a character’s glance may seem benign on its own, but paired with a sigh it can suggest concealed hostility. Cases like this show why ToM reasoning is needed to interpret video narratives.

The Significance of Theory of Mind in Video Retrieval

StoryTR puts ToM at the center of video moment retrieval. Decoding narratives and inferring intentions is critical for understanding character motivations and the underlying themes of a video. To teach models this reasoning, the researchers propose an Agentic Data Pipeline that generates training data with explicit three-tier ToM chains. The three tiers are:

  • Intent Decoding: Understanding the character’s goals and motivations.
  • Narrative Reasoning: Making inferences about the relationships and events within the narrative.
  • Boundary Localization: Identifying the key moments that define the narrative structure.

Training on these chains is intended to give models a deeper understanding of narrative dynamics, which the reported experiments link to better performance on video moment retrieval.
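To make the three-tier chain concrete, here is a minimal Python sketch of what a single training record could look like. This is not the researchers' data format: the class names, field names, and example values are all hypothetical, chosen only to mirror the three tiers listed above.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class ToMChain:
    """Hypothetical container for one three-tier ToM chain (illustrative only)."""
    intent_decoding: str        # tier 1: the character's goals and motivations
    narrative_reasoning: str    # tier 2: inferred relationships and events
    boundary_localization: str  # tier 3: the key moment that answers the query


@dataclass
class TrainingSample:
    """Hypothetical training record; every field name here is an assumption."""
    video_id: str
    query: str
    chain: ToMChain
    start_s: float  # moment start, in seconds
    end_s: float    # moment end, in seconds

    def __post_init__(self) -> None:
        # Reject empty or inverted time spans early.
        if not 0 <= self.start_s < self.end_s:
            raise ValueError("expected 0 <= start_s < end_s")


# Made-up example values, loosely based on the glance-and-sigh case.
sample = TrainingSample(
    video_id="example_clip",
    query="When does she first show that she resents her colleague?",
    chain=ToMChain(
        intent_decoding="She wants to hide her resentment from the group.",
        narrative_reasoning="The glance followed by a sigh signals concealed hostility.",
        boundary_localization="The moment spans the glance through the end of the sigh.",
    ),
    start_s=12.0,
    end_s=18.5,
)

print(json.dumps(asdict(sample), indent=2))
```

Keeping the reasoning tiers next to the timestamps means a model can be supervised on both the explanation and the predicted boundaries, which is the general idea the pipeline description suggests.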

Experimental Findings

Experiments on StoryTR reveal a pronounced reasoning gap. Gemini-3.0-Pro achieved an average Intersection over Union (IoU) of only 0.53 on the dataset, which shows how much contemporary models struggle with narrative understanding. The 7B Shorts-Moment model, trained specifically on ToM-guided data, achieved a 15.1% relative increase in IoU over baseline models.
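For readers unfamiliar with the metric, the sketch below shows the standard way to compute temporal IoU between a predicted and a ground-truth time span, along with a relative-improvement calculation. The function names are my own, the exact evaluation protocol used for StoryTR is not described here, and the numbers in the usage lines are made up for illustration; they are not results from the benchmark.

```python
from typing import Sequence, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over Union of two time spans."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the spans do not intersect.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def average_iou(preds: Sequence[Span], gts: Sequence[Span]) -> float:
    """Mean temporal IoU over paired predictions and ground truths."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("need equally sized, non-empty lists")
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


def relative_gain(new_score: float, baseline_score: float) -> float:
    """Relative improvement of new_score over baseline_score, as a fraction."""
    return (new_score - baseline_score) / baseline_score


# Illustrative values only.
print(temporal_iou((10.0, 20.0), (15.0, 25.0)))  # 5 s overlap over a 15 s union
print(average_iou([(10.0, 20.0), (0.0, 4.0)], [(15.0, 25.0), (0.0, 4.0)]))
print(relative_gain(0.4604, 0.40))  # a hypothetical baseline of 0.40 rising to 0.4604
```

Note that a relative increase is measured against the baseline's own score, so a 15.1% relative gain is not the same as adding 0.151 to the IoU.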

This finding suggests that narrative reasoning capability matters more than parameter scale for this task. As the field progresses, integrating ToM reasoning into video moment retrieval systems may lead to more context-aware AI applications.

Conclusion

The StoryTR benchmark is a step toward bridging the semantic gap in video moment retrieval, and it highlights the role of Theory of Mind reasoning in understanding complex narratives. As AI continues to evolve, models that grasp human intention and narrative will be important for richer and more meaningful interactions between technology and users.

Lazarus Omolua
https://richlyai.com/blog