InstrAct: Towards Action-Centric Understanding in Instructional Videos
Understanding instructional videos requires recognizing fine-grained actions and modeling their temporal relations, which remains challenging for current Video Foundation Models (VFMs). This difficulty stems from noisy web supervision and a pervasive “static bias,” where models rely on objects rather than motion cues. To address this, we propose InstrAct, a pretraining framework that learns action-centric representations from instructional videos.
Key Features of InstrAct
InstrAct has four components, each aimed at shifting the model's focus from static objects to actions:
- Data-Driven Strategy: We filter noisy captions and generate action-centric hard negatives to disentangle actions from objects during contrastive learning, thereby improving the model’s focus on motion cues.
- Action Perceiver: An Action Perceiver extracts motion-relevant tokens from redundant video encodings, enhancing the model’s ability to understand dynamic movements.
- Dynamic Time Warping Alignment (DTW-Align): This auxiliary objective models sequential temporal structure, enabling the model to better align actions over time.
- Masked Action Modeling (MAM): MAM strengthens cross-modal grounding, ensuring that the understanding of actions is consistent across different modalities.
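To make the idea behind DTW-Align concrete, the sketch below computes a classic dynamic time warping cost between a sequence of clip features and a sequence of step (caption) features. It is an illustration only, not InstrAct's objective: the function names `cosine_distance` and `dtw_cost`, the cosine distance, and the toy vectors are all assumptions, and a training loss would presumably need a differentiable relaxation that the text above does not specify.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dtw_cost(clip_feats, step_feats):
    """Classic (non-differentiable) DTW cost between two feature sequences.

    Hypothetical helper for illustration; not taken from InstrAct.
    """
    n, m = len(clip_feats), len(step_feats)
    inf = float("inf")
    # acc[i][j] = cheapest cost of aligning the first i clips with the first j steps
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(clip_feats[i - 1], step_feats[j - 1])
            # Monotonic moves only: advance the clip, the step, or both
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


# Toy example: three clips, two steps, using one-hot "features"
clips = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
steps_in_order = [[1.0, 0.0], [0.0, 1.0]]
steps_reversed = [[0.0, 1.0], [1.0, 0.0]]

cost_in_order = dtw_cost(clips, steps_in_order)
cost_reversed = dtw_cost(clips, steps_reversed)
```

Because the alignment path must move forward in both sequences, steps listed in the wrong order cannot be matched cheaply. This is the property that lets a DTW-style cost reward correct temporal ordering.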
Evaluation and Performance
To assess the effectiveness of the InstrAct framework, we introduce the InstrAct Bench, a comprehensive evaluation suite designed to measure action-centric understanding. Our method consistently outperforms state-of-the-art VFMs across a variety of tasks, including:
- Semantic Reasoning: The ability to comprehend the meaning of actions within context.
- Procedural Logic: Understanding the logical sequence of actions in instructional content.
- Fine-Grained Retrieval: The capability to accurately retrieve specific actions from a pool of instructional content.
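The text does not say which metric InstrAct Bench uses for retrieval. As a point of reference, retrieval tasks are commonly scored with Recall@K, sketched below under the assumption that query i's correct item sits at index i of a similarity matrix. The function name `recall_at_k` and the toy scores are hypothetical.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose ground-truth item ranks in the top k.

    Assumes similarity[i][j] scores query i against candidate j and that
    candidate i is the correct match for query i (an illustrative convention).
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from most to least similar
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores: queries 0 and 2 rank their match first; query 1 ranks it last
toy_similarity = [
    [0.9, 0.1, 0.3],
    [0.8, 0.2, 0.5],
    [0.1, 0.2, 0.7],
]
r_at_1 = recall_at_k(toy_similarity, 1)
r_at_3 = recall_at_k(toy_similarity, 3)
```

In a fine-grained setting, the candidate pool would typically contain near-duplicates that share objects but differ in the action, so a model that relies on static cues scores poorly even at larger K.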
Conclusion
InstrAct targets two obstacles to action-centric understanding of instructional videos: noisy web supervision and static bias. It combines caption filtering and action-centric hard negatives with an Action Perceiver and the DTW-Align and MAM objectives, which helps Video Foundation Models recognize fine-grained actions and model their temporal relations. On InstrAct Bench, the method consistently outperforms state-of-the-art VFMs in semantic reasoning, procedural logic, and fine-grained retrieval.
