Query-Conditioned Evidential Keyframe Sampling for MLLM-Based Long-Form Video Understanding
arXiv:2604.01002v1
Abstract
Multimodal Large Language Models (MLLMs) have shown strong performance on video question answering, but their application to long-form videos is constrained by limited context length and computational cost, making keyframe sampling essential. Existing approaches typically rely on semantic relevance or reinforcement learning, which either fail to capture evidential clues or suffer from inefficient combinatorial optimization. In this work, we propose an evidence-driven keyframe sampling framework grounded in information bottleneck theory.
Introduction
Multimodal Large Language Models (MLLMs) have advanced video question answering by reasoning jointly over textual and visual content. Long-form videos remain difficult, however: limited context length and high computational cost mean that only a small subset of frames can be passed to the model. Keyframe sampling, which decides which frames make up that subset, is therefore essential to MLLM performance in this domain.
Challenges in Existing Approaches
Current methodologies for keyframe sampling often rely on:
- Semantic relevance: This approach selects frames by how semantically relevant they are, but it frequently misses the evidential clues that are actually needed to answer the question.
- Reinforcement learning: Although this technique can optimize frame selection against a reward, it must search over combinations of frames, and this combinatorial optimization makes frame selection inefficient.
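To give a sense of why searching over frame combinations is expensive, the short sketch below counts the candidate subsets for a given frame budget. The function name and the example sizes (64 candidate frames, a budget of 8) are illustrative assumptions, not values from the paper.

```python
import math

def num_candidate_subsets(num_frames: int, budget: int) -> int:
    """Number of distinct ways to choose `budget` frames out of `num_frames`."""
    return math.comb(num_frames, budget)

if __name__ == "__main__":
    # Illustrative sizes only: 64 candidate frames, keep 8 of them.
    print(num_candidate_subsets(64, 8))
```

By contrast, scoring each frame independently needs only one evaluation per frame, so the cost grows linearly with the number of frames.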
Proposed Framework
To address these limitations, we introduce an evidence-driven keyframe sampling framework grounded in information bottleneck theory. Our approach formulates keyframe selection as maximizing the conditional mutual information between the selected frames and the query. This provides a principled objective that reflects each frame’s contribution to answering the posed question.
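One way to write this objective down is sketched here under assumed notation; none of the symbols are taken from the paper. Let $f_1, \dots, f_T$ be the frames of the video, $q$ the query, $S$ a set of at most $K$ frame indices, and $C$ the conditioning variable, which this summary does not specify.

```latex
S^{*} \;=\; \arg\max_{S \subseteq \{1,\dots,T\},\; |S| \le K} \; I\!\left(F_S ;\, q \mid C\right),
\qquad F_S = \{\, f_t : t \in S \,\}
```

Here the cardinality limit $K$ stands in for the token budget, on the assumption that every frame costs roughly the same number of tokens.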
Optimization Strategy
To make the objective of keyframe selection tractable, we exploit its structural properties to derive a decomposed optimization approach. This effectively reduces the problem of subset selection to independent frame-level scoring, facilitating a more efficient selection process. Additionally, we present a query-conditioned evidence scoring network that is trained with a contrastive objective, enabling the efficient estimation of evidential importance for each frame.
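The reduction to frame-level scoring can be illustrated with a minimal sketch. If the objective decomposes into a sum of independent per-frame scores, then taking the highest-scoring frames within the budget maximizes that sum exactly, so no subset search is needed. Everything named below is an assumption for illustration: the bilinear scorer `score_frames` is a stand-in for the query-conditioned evidence scoring network, and `info_nce_loss` is one common contrastive objective (InfoNCE), which may differ from the loss the paper actually uses.

```python
import numpy as np

def score_frames(frame_emb, query_emb, W):
    """Hypothetical evidence scorer: bilinear score s_t = f_t^T W q per frame.

    frame_emb: (T, D) frame embeddings; query_emb: (D,) query embedding;
    W: (D, D) weights, a stand-in for the trained scoring network.
    """
    return frame_emb @ W @ query_emb  # shape (T,)

def select_keyframes(scores, k):
    """Pick the k highest-scoring frames; return their indices in temporal order."""
    k = min(k, len(scores))
    top = np.argsort(-scores, kind="stable")[:k]
    return np.sort(top)

def info_nce_loss(scores, positive_idx, temperature=0.1):
    """InfoNCE-style loss: one positive frame against all other frames of the video."""
    logits = scores / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[positive_idx]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(16, 8))  # 16 frames, 8-dim embeddings (made-up sizes)
    query = rng.normal(size=8)
    W = np.eye(8)                      # untrained placeholder weights
    scores = score_frames(frames, query, W)
    print(select_keyframes(scores, k=4))
    print(info_nce_loss(scores, positive_idx=int(np.argmax(scores))))
```

Because each frame is scored once and selection is a single sort, the cost of choosing keyframes grows with the number of frames rather than with the number of possible subsets.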
Experimental Results
We evaluated our method on long-form video understanding benchmarks. It consistently outperforms prior sampling strategies under strict token budgets, and it also significantly improves training efficiency.
Conclusion
In conclusion, we presented an evidence-driven keyframe sampling framework for long-form video understanding. By grounding keyframe selection in information bottleneck theory and maximizing the conditional mutual information between the selected frames and the query, the framework improves the performance of MLLMs on video question answering tasks. Future work will focus on refining this approach and exploring its application to other domains.
