Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism
Summary: arXiv:2603.29252v1
Abstract
Long video understanding remains a key challenge for Multimodal Large Language Models (MLLMs). In this paper, we study this problem from the perspective of the visual memory mechanism and propose a novel, training-free approach termed Flexible Memory (FlexMem). In principle, FlexMem aims to mimic how humans watch video: continually watching the content and recalling the most relevant memory fragments to answer a question. In this way, FlexMem lets MLLMs handle videos of effectively unbounded length, unlike previous methods that process all video information at once and are therefore capped by an input length limit.
Introduction
The growing volume of video content online has created a pressing need for methods that can understand long videos effectively. Conventional MLLMs often struggle with this task because they process all video information at once and so run into an upper limit on input length. FlexMem addresses this with a memory mechanism modeled on the way humans process visual information: watch continually, then recall only what is relevant.
Methodology
FlexMem builds on two main components:
- Visual KV Caches: These serve as the primary memory sources, allowing the model to store and retrieve important visual information efficiently.
- Dual-Pathway Compression Design: This approach enables effective memory transfer and writing, ensuring that relevant information is accessible during the video understanding process.
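As a rough illustration of writing compressed visual KV entries into memory, the sketch below stores one reduced (keys, values) pair per video chunk. It is a toy under stated assumptions, not FlexMem's dual-pathway design: the `VisualMemory` class, the `keep_ratio` parameter, and the distance-from-mean saliency score are all hypothetical stand-ins chosen for brevity.

```python
import numpy as np


class VisualMemory:
    """Toy visual memory: one compressed (keys, values) entry per video chunk.

    Hypothetical illustration only; the compression rule is a stand-in and
    is not the dual-pathway design described in the paper.
    """

    def __init__(self, keep_ratio=0.25):
        self.keep_ratio = keep_ratio
        self.entries = []

    def write(self, keys, values):
        """Compress a chunk's KV cache, both of shape (num_tokens, dim), and store it."""
        num_tokens = keys.shape[0]
        keep = max(1, int(num_tokens * self.keep_ratio))
        # Stand-in saliency: tokens whose keys lie farthest from the chunk mean.
        saliency = np.linalg.norm(keys - keys.mean(axis=0), axis=1)
        # Keep the most salient tokens, restored to their original order.
        idx = np.sort(np.argsort(saliency)[-keep:])
        self.entries.append({
            "keys": keys[idx],
            "values": values[idx],
            "index": keys.mean(axis=0),  # one summary vector used for later lookup
        })

    def num_tokens(self):
        """Total number of tokens currently held in memory."""
        return sum(entry["keys"].shape[0] for entry in self.entries)


# Simulate watching 8 chunks of 64 visual tokens each (random stand-in data).
rng = np.random.default_rng(0)
memory = VisualMemory(keep_ratio=0.25)
for _ in range(8):
    chunk_keys = rng.standard_normal((64, 16))
    chunk_values = rng.standard_normal((64, 16))
    memory.write(chunk_keys, chunk_values)
```

With these settings the memory holds a quarter of the tokens it has seen, so its footprint grows more slowly than the raw cache while each chunk remains addressable through its summary vector.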
Memory Reading Strategies
To cater to diverse video understanding tasks, FlexMem explores various memory reading strategies, including:
- Streaming Video Processing: A strategy tailored for real-time video understanding, enhancing responsiveness and accuracy.
- Task-Specific Adaptation: Memory retrieval methods that adjust based on the specific requirements of different video analysis tasks.
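The idea of recalling only the most relevant memory fragments can be sketched as a top-k similarity lookup. This is a minimal example under assumptions, not the paper's reading strategy: `recall_fragments`, its cosine-similarity scoring, and the `num_seen` cutoff (which mimics a streaming setting where only chunks watched so far are available) are hypothetical names and choices.

```python
import numpy as np


def recall_fragments(chunk_summaries, query, top_k=2, num_seen=None):
    """Return indices of the top_k chunks most similar to the query.

    chunk_summaries: one summary vector per stored chunk, in temporal order.
    num_seen: if given, only the first num_seen chunks are searchable,
              as in a streaming setting where later chunks do not exist yet.
    """
    summaries = np.asarray(chunk_summaries, dtype=float)
    if num_seen is not None:
        summaries = summaries[:num_seen]
    q = np.asarray(query, dtype=float)
    # Cosine similarity between the query and every chunk summary.
    norms = np.linalg.norm(summaries, axis=1) * np.linalg.norm(q) + 1e-8
    sims = summaries @ q / norms
    top = np.argsort(sims)[-top_k:]
    # Return the selected chunks in temporal order.
    return sorted(top.tolist())


summaries = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
offline_hits = recall_fragments(summaries, [1.0, 0.0], top_k=2)
streaming_hits = recall_fragments(summaries, [1.0, 0.0], top_k=2, num_seen=2)
```

Here `offline_hits` is `[0, 2]`, the two chunks pointing in the query's direction, while `streaming_hits` is `[0, 1]` because only the first two chunks have been watched. Returning indices in temporal order keeps the recalled fragments in the sequence they appeared in the video.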
Experimental Validation
To validate the effectiveness of FlexMem, extensive experiments were conducted on two popular video-MLLMs across five long video tasks and one streaming video task. The results indicate significant improvements:
- FlexMem can process over 1,000 frames on a single NVIDIA RTX 3090 GPU.
- On certain benchmarks, it matches or exceeds state-of-the-art MLLMs such as GPT-4o and Gemini-1.5 Pro.
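To give a sense of why memory compression matters at this scale, the following back-of-the-envelope estimate computes the size of an uncompressed KV cache for 1,000 frames. Every model setting here is a hypothetical assumption (196 visual tokens per frame, 28 layers, 4 KV heads, head dimension 128, 16-bit values) and is not taken from the paper's experiments; only the cache-size formula itself is standard.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of an uncompressed KV cache in bytes.

    The factor 2 accounts for storing one key and one value vector
    per token, per layer, per KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


frames = 1000
tokens_per_frame = 196  # hypothetical; depends on the vision encoder and resolution
total_bytes = kv_cache_bytes(
    frames * tokens_per_frame,
    num_layers=28,     # hypothetical model configuration
    num_kv_heads=4,    # hypothetical
    head_dim=128,      # hypothetical
)
total_gib = total_bytes / 2**30
print(f"{total_gib:.2f} GiB")
```

Under these assumed settings the raw cache alone comes to roughly 10.5 GiB, a large share of the 24 GB on an RTX 3090 before model weights and activations are counted, which is the kind of pressure a compressed memory is meant to relieve.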
Conclusion
FlexMem advances long video understanding for MLLMs without any additional training. By mimicking how humans watch and recall video, it lets a model analyze long videos that would otherwise exceed its input limit. Future work will focus on refining the memory strategies and extending FlexMem to other multimodal tasks.
