Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism
Long video understanding remains a significant challenge for Multimodal Large Language Models (MLLMs). The paper "Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism" (arXiv:2603.29252v1) addresses it with a visual memory mechanism called Flexible Memory (FlexMem).
Introduction to FlexMem
Long video understanding is critical to the advancement of MLLMs, but traditional methods are constrained by input limits: they cannot take in a very large number of video frames at once. FlexMem instead emulates how people watch video, viewing continuously and recalling the relevant segments when formulating an answer. This lets an MLLM handle video understanding tasks of virtually unlimited length.
Key Features of FlexMem
- Visual KV Caches: FlexMem utilizes visual key-value caches as memory sources, allowing for effective memory transfer and writing.
- Dual-Pathway Compression: The model employs a dual-pathway compression design to optimize memory management.
- Diverse Memory Reading Strategies: It explores various memory reading techniques tailored for different video understanding tasks, including prevalent streaming video scenarios.
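The three features above can be pictured with a small, self-contained sketch. It is an illustration under stated assumptions, not the paper's implementation: the summary does not say what the two compression pathways are, so the code assumes one pathway keeps the most salient tokens verbatim and the other averages the remaining tokens into a single token. The names `VisualMemory`, `MemoryEntry`, `write`, `read`, and the `saliency` scores are all hypothetical, and the "KV cache" here is just a list of small vectors per video chunk.

```python
from dataclasses import dataclass, field
from math import sqrt


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    denom = sqrt(dot(a, a)) * sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0


def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


@dataclass
class MemoryEntry:
    chunk_id: int
    keys: list    # compressed key vectors for this chunk
    values: list  # value vectors aligned with keys


@dataclass
class VisualMemory:
    keep: int = 2  # tokens kept verbatim per chunk (hypothetical budget)
    entries: list = field(default_factory=list)

    def write(self, chunk_id, keys, values, saliency):
        # Pathway A (assumed): keep the `keep` most salient tokens unchanged.
        order = sorted(range(len(keys)), key=lambda i: saliency[i], reverse=True)
        kept, rest = sorted(order[:self.keep]), order[self.keep:]
        new_keys = [keys[i] for i in kept]
        new_values = [values[i] for i in kept]
        # Pathway B (assumed): merge the remaining tokens into one averaged token.
        if rest:
            new_keys.append(mean_vector([keys[i] for i in rest]))
            new_values.append(mean_vector([values[i] for i in rest]))
        self.entries.append(MemoryEntry(chunk_id, new_keys, new_values))

    def read(self, query, top_k=1):
        # Score each chunk by its best-matching key, then return the
        # top_k chunks in temporal (chunk_id) order.
        scored = [(max(cosine(query, k) for k in e.keys), e.chunk_id)
                  for e in self.entries]
        chosen = {cid for _, cid in sorted(scored, reverse=True)[:top_k]}
        return [e for e in self.entries if e.chunk_id in chosen]


memory = VisualMemory(keep=1)
memory.write(0, [[1.0, 0.0], [0.8, 0.2], [0.6, 0.4]], [[1.0], [2.0], [3.0]], [0.9, 0.1, 0.2])
memory.write(1, [[0.0, 1.0], [0.2, 0.8]], [[4.0], [5.0]], [0.5, 0.7])
recalled = memory.read([0.0, 1.0], top_k=1)
print([entry.chunk_id for entry in recalled])
```

A different reading strategy, for example always including the most recent chunk in a streaming setting, would only change `read`; the written memory stays the same.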
Experimental Validation
To assess FlexMem, the authors ran extensive experiments with two widely used video-MLLMs across five long video tasks and one streaming video task. FlexMem showed significant improvements over existing efficient video understanding methods.
Performance Insights
On a single NVIDIA RTX 3090 GPU, FlexMem can process more than 1,000 frames. With FlexMem, the base MLLMs achieve results comparable or even superior to those of state-of-the-art (SOTA) MLLMs such as GPT-4o and Gemini-1.5 Pro on some benchmarks.
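A back-of-the-envelope estimate helps show why memory compression matters at this scale. The sketch below computes the size of an uncompressed visual KV cache; every number in it (tokens per frame, layer count, KV heads, head dimension, 16-bit values) is a hypothetical configuration chosen for illustration, not a figure from the paper.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    # The factor of 2 covers the separate key and value tensors in each layer.
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical configuration, not taken from the paper.
frames = 1000
tokens_per_frame = 196
total = kv_cache_bytes(
    frames * tokens_per_frame,
    num_layers=28,
    num_kv_heads=4,
    head_dim=128,
)
gib = total / 2**30
print(f"{gib:.2f} GiB of visual KV cache before compression")
```

Under these assumed settings the cache alone comes to roughly 10 GiB. An RTX 3090 has 24 GB of memory, which must also hold the model weights and activations, so storing every visual token uncompressed leaves little headroom as videos grow longer.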
Conclusion
The Flexible Memory mechanism marks a significant step forward in long video understanding for MLLMs. By mimicking human-like memory recall and adapting how memory is written and read, FlexMem addresses the input limitations of earlier methods and lets MLLMs work effectively with extended video content. It improves the performance of existing models and opens the way for further work on memory mechanisms in multimodal learning.
