InstrAct: Enhancing Action Understanding in Instructional Videos

InstrAct: Towards Action-Centric Understanding in Instructional Videos

Understanding instructional videos requires recognizing fine-grained actions and modeling their temporal relations, which remains challenging for current Video Foundation Models (VFMs). The difficulty stems from noisy web supervision and a pervasive “static bias,” where models rely on objects rather than motion cues. To address this, we propose InstrAct, a pretraining framework that learns action-centric representations of instructional videos.

Key Features of InstrAct

InstrAct combines a data-driven strategy, an architectural component, and two auxiliary objectives so that the model attends to actions rather than static objects:

Data-Driven Strategy:

We filter noisy captions and generate action-centric hard negatives to disentangle actions from objects during contrastive learning, thereby improving the model’s focus on motion cues.

Action Perceiver:

An Action Perceiver extracts motion-relevant tokens from redundant video encodings, enhancing the model’s ability to understand dynamic movements.

Dynamic Time Warping Alignment (DTW-Align):

This auxiliary objective models sequential temporal structure, enabling the model to better align actions over time.

Masked Action Modeling (MAM):

MAM strengthens cross-modal grounding, ensuring that the understanding of actions is consistent across different modalities.
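
To make the hard-negative idea from the data-driven strategy more concrete, here is a minimal sketch of an InfoNCE-style contrastive loss for a single video clip. It is not the InstrAct implementation: the function name, the temperature value, and the toy embeddings are all assumptions for illustration. The point is only that a caption with the same object but a different action sits close to the video embedding, so it contributes more to the loss than an unrelated caption does.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def contrastive_loss_with_hard_negatives(video, positive, hard_negatives,
                                         temperature=0.07):
    """InfoNCE-style loss for one video embedding (illustrative only).

    video:          embedding of the clip
    positive:       embedding of the matching caption
    hard_negatives: embeddings of captions that keep the object but
                    change the action (hypothetical inputs)
    temperature:    assumed value, not taken from the source
    """
    v = _normalize(video)
    candidates = [positive] + list(hard_negatives)
    # Cosine similarity scaled by temperature; index 0 is the positive.
    logits = [_dot(v, _normalize(c)) / temperature for c in candidates]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability assigned to the positive caption.
    return log_denominator - logits[0]


# Toy 2-D embeddings, made up for this example.
video = [1.0, 0.0]
matching_caption = [0.9, 0.1]
same_object_other_action = [0.8, 0.3]   # hard negative: close to the video
unrelated_caption = [-1.0, 0.0]         # easy negative: far from the video

hard_loss = contrastive_loss_with_hard_negatives(
    video, matching_caption, [same_object_other_action])
easy_loss = contrastive_loss_with_hard_negatives(
    video, matching_caption, [unrelated_caption])
print(hard_loss > easy_loss)
```

Because the hard negative yields the larger loss, it produces the stronger training signal, which is the usual motivation for mining such negatives.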
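
For readers unfamiliar with the algorithm behind DTW-Align, the sketch below shows classic dynamic time warping between two sequences. This is the textbook recurrence, not the paper's training objective, which presumably needs a formulation suited to gradient-based training. The function name `dtw_cost` and the scalar "step" values are hypothetical stand-ins for video-segment and text-step features.

```python
import math


def dtw_cost(video_steps, text_steps, distance):
    """Classic dynamic time warping cost between two sequences.

    Each element of one sequence may match one or more consecutive
    elements of the other, but the matching must preserve order.
    """
    n, m = len(video_steps), len(text_steps)
    # acc[i][j] = cheapest cost of aligning the first i video steps
    # with the first j text steps.
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = distance(video_steps[i - 1], text_steps[j - 1])
            acc[i][j] = d + min(
                acc[i - 1][j],      # video advances, text step repeats
                acc[i][j - 1],      # text advances, video step repeats
                acc[i - 1][j - 1],  # both advance
            )
    return acc[n][m]


def absolute_difference(a, b):
    return abs(a - b)


# Toy scalar "features": the same procedure with one step lasting longer,
# versus the same steps in reverse order.
in_order = dtw_cost([1, 2, 3], [1, 2, 2, 3], absolute_difference)
reversed_order = dtw_cost([1, 2, 3], [3, 2, 1], absolute_difference)
print(in_order < reversed_order)
```

The alignment cost stays low when the steps occur in the same order at different speeds and rises when the order is violated, which is the property that makes DTW a natural fit for modeling sequential structure.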

Evaluation and Performance

To assess the effectiveness of the InstrAct framework, we introduce the InstrAct Bench, a comprehensive evaluation suite designed to measure action-centric understanding. Our method consistently outperforms state-of-the-art VFMs across a variety of tasks, including:

Semantic Reasoning: The ability to comprehend the meaning of actions within context.
Procedural Logic: Understanding the logical sequence of actions in instructional content.
Fine-Grained Retrieval: The capability to accurately retrieve specific actions from a pool of instructional content.

Conclusion

InstrAct targets the two obstacles identified above, noisy web supervision and static bias, by combining caption filtering and action-centric hard negatives with the Action Perceiver, DTW-Align, and MAM. On the InstrAct Bench, the resulting model outperforms state-of-the-art VFMs on semantic reasoning, procedural logic, and fine-grained retrieval, which suggests that action-centric pretraining is a useful direction for future work on instructional video analysis.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
