Efficient Keyframe Sampling for MLLM Long-Form Video QA

Query-Conditioned Evidential Keyframe Sampling for MLLM-Based Long-Form Video Understanding

Source: arXiv:2604.01002v1 (announce type: cross)

Abstract

Multimodal Large Language Models (MLLMs) have shown strong performance on video question answering, but their application to long-form videos is constrained by limited context length and computational cost, making keyframe sampling essential. Existing approaches typically rely on semantic relevance or reinforcement learning, which either fail to capture evidential clues or suffer from inefficient combinatorial optimization. In this work, we propose an evidence-driven keyframe sampling framework grounded in information bottleneck theory.

Introduction

Multimodal Large Language Models (MLLMs) have advanced video question answering by reasoning jointly over textual and visual content. Long-form videos remain difficult, however: the model's context window holds only a limited number of frames, and computational cost grows with every frame added. Keyframe sampling, which passes only a small subset of frames to the model, is therefore essential, and the choice of subset largely determines what evidence the model sees.

Challenges in Existing Approaches

Existing keyframe sampling methods typically rely on one of two strategies:

Semantic relevance: Frames are selected by how closely they match the query semantically. A frame can match the topic of a question without containing the evidence needed to answer it, so this approach can miss crucial evidential clues.
Reinforcement learning: Frame selection is optimized for downstream performance. Because the search runs over combinations of frames, the optimization is combinatorial and inefficient.
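To give a sense of why subset-level optimization is expensive, the sketch below counts the candidate subsets for a fixed frame budget and compares that with the number of evaluations needed when each frame is scored once. The video length, sampling rate, and budget are hypothetical values chosen for illustration, not figures from the paper, and a learned policy would not enumerate every subset; the count only describes the size of the space it has to navigate.

```python
import math


def num_subsets(num_frames: int, k: int) -> int:
    """Number of distinct k-frame subsets of a video with num_frames frames."""
    return math.comb(num_frames, k)


def num_frame_scores(num_frames: int) -> int:
    """Evaluations needed when every frame is scored independently."""
    return num_frames


# Hypothetical setting: one hour sampled at 1 frame per second, keep 32 frames.
frames, budget = 3600, 32
subset_count = num_subsets(frames, budget)

# Report the order of magnitude rather than the full integer.
print(f"candidate subsets: roughly 10^{len(str(subset_count)) - 1}")
print(f"independent frame scores: {num_frame_scores(frames)}")
```

The gap between the two counts is the motivation for reducing subset selection to per-frame scoring.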

Proposed Framework

To address these limitations, we introduce an evidence-driven keyframe sampling framework grounded in information bottleneck theory. Our approach formulates keyframe selection as maximizing the conditional mutual information between the selected frames and the query. This provides a principled objective that reflects each frame’s contribution to answering the posed question.
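For readers unfamiliar with the quantity, the textbook definition of conditional mutual information can be written as below. The symbols are assumptions made for illustration: X_S denotes the selected frames, Q the query, C a generic conditioning variable, N the number of frames, and k a frame budget. This summary does not state what the paper conditions on or how it writes the constraint, so this is a generic sketch rather than the paper's exact formulation.

```latex
% Textbook identity for conditional mutual information.
I(X_S; Q \mid C) = H(Q \mid C) - H(Q \mid X_S, C)

% Generic budgeted selection problem (k is an assumed frame budget).
S^{*} = \arg\max_{S \subseteq \{1, \dots, N\},\ |S| \le k} I(X_S; Q \mid C)
```

Read this way, the objective rewards a subset for how much it reduces uncertainty beyond what the conditioning variable already provides.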

Optimization Strategy

To make the keyframe selection objective tractable, we exploit its structural properties to derive a decomposed optimization. This reduces subset selection to independent frame-level scoring, so each frame can be scored once and ranked instead of being evaluated as part of many candidate subsets. We also present a query-conditioned evidence scoring network, trained with a contrastive objective, that efficiently estimates the evidential importance of each frame.
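The two ideas in this section can be sketched in a few lines. The code below assumes that the decomposed objective is additive over frames, in which case picking the k highest-scoring frames gives the same subset as an exhaustive search. It also shows an InfoNCE-style loss as one common contrastive objective; the text only says "contrastive", so InfoNCE, the temperature value, and all function names here are assumptions rather than the paper's implementation. The scores stand in for outputs of a scoring network.

```python
import math
from itertools import combinations


def select_top_k(scores, k):
    """Return indices of the k highest-scoring frames, in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def best_subset_brute_force(scores, k):
    """Exhaustively find the k-subset with the largest summed score."""
    best = max(
        combinations(range(len(scores)), k),
        key=lambda subset: sum(scores[i] for i in subset),
    )
    return list(best)


def info_nce_loss(pos_score, neg_scores, temperature=0.1):
    """InfoNCE-style loss: negative log-softmax of the positive frame's score.

    pos_score is the score of a frame that holds evidence for the query;
    neg_scores are scores of frames that do not (an assumed labeling scheme).
    """
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[0]


# Hypothetical per-frame evidence scores for a six-frame clip.
scores = [0.10, 0.85, 0.30, 0.92, 0.05, 0.60]
print(select_top_k(scores, 3))             # [1, 3, 5]
print(best_subset_brute_force(scores, 3))  # [1, 3, 5]
```

Under the additivity assumption, ranking needs one score per frame, while the brute-force search is feasible only for very short clips. The loss is small when the evidential frame scores well above the others and large when it does not.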

Experimental Results

We evaluated our method on several long-form video understanding benchmarks. Our approach consistently outperforms prior sampling strategies under strict token budgets, while also significantly improving training efficiency.
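A token budget translates directly into a cap on how many frames can be kept. The sketch below shows that arithmetic. The budget, the tokens consumed per frame, and the prompt length are hypothetical values, since real figures depend on the specific MLLM and its visual encoder and are not given here.

```python
def frames_within_budget(token_budget, tokens_per_frame, prompt_tokens=0):
    """Largest number of frames whose visual tokens fit in the token budget."""
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    # Tokens left for frames after reserving space for the text prompt.
    available = max(token_budget - prompt_tokens, 0)
    return available // tokens_per_frame


# Hypothetical numbers: 8192-token budget, 196 tokens per frame, 256-token prompt.
print(frames_within_budget(8192, 196, prompt_tokens=256))  # 40
```

With these assumed numbers, only 40 frames survive from a video that may contain thousands, which is why the quality of the sampling strategy matters most when the budget is tight.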

Conclusion

In conclusion, our evidence-driven keyframe sampling framework grounds frame selection in information bottleneck theory and optimizes the conditional mutual information between the selected frames and the query. Decomposing that objective into frame-level scores keeps selection efficient, and the results show gains over prior sampling strategies for MLLM-based video question answering. Future work will focus on refining this approach and exploring its applications across different domains.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
