WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning
arXiv:2512.02425v2
Abstract: Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or days-long videos remains highly challenging due to limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they heavily rely on text and fail to utilize visual evidence when reasoning over complex scenes. Moreover, retrieving from fixed temporal scales further limits their flexibility in capturing events that span variable durations.
To address these challenges, we introduce WorldMM, a multimodal memory agent that constructs and retrieves from multiple complementary memories, encompassing both textual and visual representations. The goal is to support reasoning over hours- or days-long videos without discarding the visual context that text-only summaries lose.
Key Components of WorldMM
WorldMM comprises three distinct types of memory:
- Episodic Memory: Indexes factual events across multiple temporal scales, so that events of varying duration can each be retrieved at a suitable granularity.
- Semantic Memory: Continuously updates high-level conceptual knowledge, which helps place individual events in the context of broader themes and narratives.
- Visual Memory: Preserves detailed information about scenes, so that critical visual details are not lost during reasoning.
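To make the division of labor concrete, the three memories can be pictured as three separate stores behind one container. The following is a minimal sketch under stated assumptions, not WorldMM's actual implementation: the names `EpisodicEntry`, `WorldMemory`, and every field and method are hypothetical, captions stand in for whatever the system extracts from the video, and visual content is reduced to a string reference.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EpisodicEntry:
    """One factual event description (hypothetical schema)."""
    start_s: float  # segment start, in seconds
    end_s: float    # segment end, in seconds
    caption: str    # textual description of what happened


@dataclass
class WorldMemory:
    """Illustrative container for the three memory types (all names assumed)."""
    # Episodic: one list of entries per temporal scale (window length in seconds).
    episodic: Dict[float, List[EpisodicEntry]] = field(default_factory=dict)
    # Semantic: concept -> latest high-level statement, overwritten on update.
    semantic: Dict[str, str] = field(default_factory=dict)
    # Visual: timestamp -> reference to a stored frame (here just a string id).
    visual: Dict[float, str] = field(default_factory=dict)

    def add_event(self, scale_s: float, entry: EpisodicEntry) -> None:
        self.episodic.setdefault(scale_s, []).append(entry)

    def update_concept(self, concept: str, statement: str) -> None:
        self.semantic[concept] = statement

    def add_frame(self, t: float, frame_ref: str) -> None:
        self.visual[t] = frame_ref

    def events_overlapping(self, scale_s: float, start_s: float, end_s: float) -> List[EpisodicEntry]:
        """Return entries at one scale whose time span overlaps [start_s, end_s)."""
        return [
            e for e in self.episodic.get(scale_s, [])
            if e.start_s < end_s and e.end_s > start_s
        ]


memory = WorldMemory()
# The same stretch of video indexed at a 30-second and a 10-minute scale.
memory.add_event(30.0, EpisodicEntry(0.0, 30.0, "A person enters the kitchen."))
memory.add_event(30.0, EpisodicEntry(30.0, 60.0, "The person fills a pot with water."))
memory.add_event(600.0, EpisodicEntry(0.0, 600.0, "A person cooks and eats dinner."))
memory.update_concept("person", "Cooks at home.")
memory.update_concept("person", "Usually cooks at home in the evening.")
memory.add_frame(12.0, "frame_000012")

fine = memory.events_overlapping(30.0, 25.0, 40.0)
coarse = memory.events_overlapping(600.0, 25.0, 40.0)
```

Querying the same time window at two scales returns both fine-grained entries or the single coarse one, which is the kind of choice a multi-scale episodic index makes available to a retriever.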
Adaptive Retrieval Mechanism
During inference, WorldMM employs an adaptive retrieval agent that iteratively selects the most relevant memory source and, depending on the query, draws on multiple temporal granularities. Retrieval continues until the agent judges that it has gathered enough information to answer the query.
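The control flow of such an iterative loop can be sketched as follows. This is only an illustration under assumptions: `adaptive_retrieve`, `toy_policy`, and the `stores` dictionary are hypothetical names, the keyword-based policy is a stand-in for the model-based agent the paper describes, and the `max_steps` cap is an added safeguard that the source does not mention.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A retriever maps a query string to a list of evidence strings.
Retriever = Callable[[str], List[str]]
# A policy returns (source_name, sub_query), or None once evidence is sufficient.
Policy = Callable[[str, List[str]], Optional[Tuple[str, str]]]


def adaptive_retrieve(
    query: str,
    retrievers: Dict[str, Retriever],
    policy: Policy,
    max_steps: int = 5,
) -> List[str]:
    """Repeatedly ask the policy which memory to query until it says stop."""
    evidence: List[str] = []
    for _ in range(max_steps):  # hard cap so the loop always terminates
        decision = policy(query, evidence)
        if decision is None:  # policy judges the evidence sufficient
            break
        source, sub_query = decision
        evidence.extend(retrievers[source](sub_query))
    return evidence


# Toy memory sources returning canned evidence (illustrative data only).
stores: Dict[str, Retriever] = {
    "episodic": lambda q: ["[episodic] a person cooks pasta in the kitchen"],
    "semantic": lambda q: ["[semantic] the person usually cooks in the evening"],
    "visual": lambda q: ["[visual] a frame shows a red pot on the stove"],
}


def toy_policy(query: str, evidence: List[str]) -> Optional[Tuple[str, str]]:
    """Keyword stand-in for the agent: episodic first, visual for appearance questions."""
    used = {e.split("]")[0].lstrip("[") for e in evidence}
    if "episodic" not in used:
        return ("episodic", query)
    if "color" in query and "visual" not in used:
        return ("visual", query)
    return None


result = adaptive_retrieve("What color was the pot?", stores, toy_policy)
for line in result:
    print(line)
```

For the appearance question the toy policy consults episodic memory, then visual memory, then stops; for a question without the keyword it stops after episodic memory alone. In a real system the policy would be a model call that also chooses the temporal scale.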
Performance and Impact
WorldMM outperforms existing baselines on long video question-answering benchmarks, achieving an average performance gain of 8.4% over previous state-of-the-art methods. This result indicates that combining textual and visual memories with adaptive retrieval is effective for reasoning over lengthy video content.
Conclusion
WorldMM combines three complementary memory types with an adaptive retrieval agent to address two limitations of existing memory-augmented models on long videos: reliance on text-only summaries and retrieval at fixed temporal scales. Potential applications range from education and entertainment to security and surveillance, wherever large volumes of video must be understood and queried.
