Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding
arXiv:2512.04000v2
Abstract
The application of Large Multimodal Models (LMMs) to long-form video understanding is constrained by limited context lengths and the computationally prohibitive cost of processing dense video tokens. Consequently, recent research has focused on query-aware frame selection, although these methods often incur significant computational overhead. This paper challenges the assumption that such complex search mechanisms are universally necessary.
Key Findings
- Identification and validation of a query typology distinguishing between global and localized queries.
- Uniform sampling is effective for global queries.
- Localized queries require query-aware selection for optimal performance.
Introduction
As demand for long-form video understanding grows, the limitations of existing Large Multimodal Models (LMMs) become more apparent. Processing dense video tokens is computationally expensive, and limited context lengths cap how many frames a model can take in. Recent methods often rely on query-aware frame selection, which can be effective but adds significant computational cost.
Research Approach
This study re-examines whether complex search mechanisms are needed for every query. We identify and validate a typology that divides queries into two types:
- Global Queries: These queries require an overview of the entire video, so uniform sampling is effective for them.
- Localized Queries: These queries focus on specific segments of the video and therefore require query-aware selection for optimal performance.
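The contrast between the two query types can be illustrated with a small sketch. The function names, the video length, and the frame budget below are hypothetical and are not taken from the paper; the sketch only shows that evenly spaced samples cover the whole video while a short segment can fall between them.

```python
def uniform_frame_indices(total_frames: int, budget: int) -> list[int]:
    """Pick `budget` evenly spaced frame indices (the midpoint of each equal segment)."""
    if total_frames <= 0 or budget <= 0:
        return []
    budget = min(budget, total_frames)
    step = total_frames / budget
    return [int(step * (i + 0.5)) for i in range(budget)]


def frames_inside(indices: list[int], start: int, end: int) -> int:
    """Count sampled indices that fall in the half-open range [start, end)."""
    return sum(start <= i < end for i in indices)


# Hypothetical numbers: 3600 frames (e.g. one hour at 1 frame per second), 8-frame budget.
indices = uniform_frame_indices(total_frames=3600, budget=8)
print(indices)                             # [225, 675, 1125, 1575, 2025, 2475, 2925, 3375]
print(frames_inside(indices, 1000, 1100))  # 0: a 100-frame event is missed entirely
```

With this budget the samples are 450 frames apart, so any event shorter than that can go unsampled. This gap is what motivates a different strategy for localized queries.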
The DIG Framework
Building upon the insights gained from query typology, we introduce DIG, a training-free frame selection framework that dynamically adapts its strategy based on the nature of the query posed. The DIG framework operates on two key principles:
- For global queries, DIG uses efficient uniform sampling, which provides a comprehensive overview while minimizing computational cost.
- For localized queries, DIG activates a specialized pipeline that extracts the frames relevant to the specific query.
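The routing idea behind these two principles might be sketched as follows. This is an illustration under stated assumptions, not DIG's implementation: the text above does not say how the query type is determined or how the localized pipeline scores frames, so the query type is passed in as a label, and a simple top-k over hypothetical per-frame relevance scores stands in for the specialized pipeline. All names (`select_frames`, `score_frames`, and so on) are invented for this example.

```python
from typing import Callable, Sequence


def uniform_indices(total_frames: int, budget: int) -> list[int]:
    """Evenly spaced frame indices; needs no information about the query."""
    budget = min(budget, total_frames)
    if budget <= 0:
        return []
    step = total_frames / budget
    return [int(step * (i + 0.5)) for i in range(budget)]


def top_k_indices(scores: Sequence[float], budget: int) -> list[int]:
    """Assumed stand-in for query-aware selection: keep the highest-scoring
    frames, returned in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:max(budget, 0)])


def select_frames(
    query_type: str,
    total_frames: int,
    budget: int,
    score_frames: Callable[[], Sequence[float]],
) -> list[int]:
    """Route by query type. `score_frames` is a hypothetical, potentially
    expensive relevance scorer that is only invoked for localized queries."""
    if query_type == "global":
        return uniform_indices(total_frames, budget)
    if query_type == "localized":
        return top_k_indices(score_frames(), budget)
    raise ValueError(f"unknown query type: {query_type!r}")


# Hypothetical relevance scores for a 10-frame clip.
scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.3, 0.7, 0.0, 0.4, 0.6]
print(select_frames("global", 10, 5, lambda: scores))     # [1, 3, 5, 7, 9]
print(select_frames("localized", 10, 3, lambda: scores))  # [1, 3, 6]
```

Passing the scorer as a callable makes the cost asymmetry explicit: the global branch never calls it, so relevance scoring is paid for only when the query is localized.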
Experimental Results
To validate the effectiveness of the DIG framework, we conducted extensive experiments across three long-form video understanding benchmarks. The results indicate that DIG consistently outperforms existing baseline methods. Notably, even when scaling the input frame count to 256, DIG demonstrates robust improvements in the performance of LMMs.
Conclusion
The findings of this research underscore the importance of tailoring frame selection methods to the type of query being posed in long-form video understanding. By distinguishing between global and localized queries, and by implementing the DIG framework, we can significantly enhance the efficiency and effectiveness of LMMs in processing video data. Future work will focus on further refining these methods and exploring their applicability across various multimedia contexts.
