OSCBench: Benchmarking Object State Change in Text-to-Video Generation
Recent text-to-video (T2V) generation models have made significant progress in producing visually appealing and temporally coherent videos. However, current benchmarks primarily evaluate perceptual quality, text-video alignment, and physical plausibility. This focus overlooks a crucial element of action understanding: object state change (OSC) that is explicitly specified in the text prompt.
Object state change refers to the transformation of an object's state caused by an action. For example, a prompt that describes peeling a potato or slicing a lemon implies a specific end state: the potato is peeled, and the lemon is in slices. Depicting that resulting state is integral to depicting the action itself.
Introducing OSCBench
To address this gap, the authors introduce OSCBench, a benchmark designed specifically to assess OSC performance in T2V models. OSCBench is derived from instructional cooking data, which provides a rich set of action-object interactions. These interactions are organized into three scenario types:
- Regular Scenarios: Standard action-object interactions that are commonly found in cooking instructions.
- Novel Scenarios: Unique interactions that may not be frequently encountered, testing the model’s generalization capabilities.
- Compositional Scenarios: Complex interactions that involve multiple actions and objects, requiring a nuanced understanding of context and state change.
These structured scenarios allow for a robust evaluation of both in-distribution performance and the model’s ability to generalize to new situations.
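One way to picture this organization is as a small data structure. The sketch below is only an illustration under stated assumptions: the names `Scenario`, `OSCPrompt`, and `group_by_scenario`, and the fields on each record, are hypothetical and are not taken from the OSCBench release.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Scenario(Enum):
    """The three scenario types described above."""
    REGULAR = "regular"
    NOVEL = "novel"
    COMPOSITIONAL = "compositional"


@dataclass(frozen=True)
class OSCPrompt:
    """Hypothetical record for one benchmark prompt (not the official schema)."""
    text: str                 # prompt given to the T2V model
    actions: tuple[str, ...]  # e.g. ("peel",)
    objects: tuple[str, ...]  # e.g. ("potato",)
    scenario: Scenario


def group_by_scenario(prompts):
    """Bucket prompts by scenario type so each group can be scored separately."""
    groups = defaultdict(list)
    for prompt in prompts:
        groups[prompt.scenario].append(prompt)
    return dict(groups)


# Usage with the two examples mentioned earlier in the text.
examples = [
    OSCPrompt("peel a potato", ("peel",), ("potato",), Scenario.REGULAR),
    OSCPrompt("slice a lemon", ("slice",), ("lemon",), Scenario.REGULAR),
]
by_scenario = group_by_scenario(examples)
```

Keeping the scenario type on each record makes it straightforward to report in-distribution results (regular) separately from generalization results (novel and compositional).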
Evaluation of T2V Models
The study evaluates six representative open-source and proprietary T2V models, using both human user studies and automatic evaluation based on multimodal large language models (MLLMs). The results show a consistent trend: although the models perform strongly on semantic alignment and scene coherence, they struggle to produce object state changes that are accurate and temporally consistent.
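A comparison of this kind can be sketched as a per-scenario, per-dimension average over ratings. The code below is a minimal illustration, not the paper's evaluation pipeline: the `(scenario, dimension, score)` tuple layout and the dimension names `semantic_alignment` and `osc_accuracy` are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def mean_scores(ratings):
    """Average ratings per (scenario, dimension) pair.

    `ratings` is an iterable of (scenario, dimension, score) tuples.
    This layout is hypothetical, not the paper's actual schema.
    """
    buckets = defaultdict(list)
    for scenario, dimension, score in ratings:
        buckets[(scenario, dimension)].append(score)
    return {key: mean(values) for key, values in buckets.items()}


def gap(scores, scenario, strong="semantic_alignment", weak="osc_accuracy"):
    """Difference between two dimensions' mean scores within one scenario.

    The default dimension names are placeholders chosen for this sketch.
    """
    return scores[(scenario, strong)] - scores[(scenario, weak)]
```

A large positive `gap` for a scenario would correspond to the pattern described above, where a model aligns well with the prompt semantically but does not depict the state change. The same functions apply whether the ratings come from human annotators or from an MLLM judge.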
Key Findings
The analysis reveals several critical insights:
- Current T2V models often struggle with accurately representing object state changes, particularly in novel and compositional scenarios.
- High performance in semantic and scene alignment does not necessarily translate to effective action understanding in the context of OSC.
- OSCBench serves as a diagnostic tool to identify weaknesses in state-aware video generation models.
Conclusion
OSCBench fills a gap in the evaluation of text-to-video generation models. By focusing on object state change, it exposes a key bottleneck in current T2V systems. The findings underscore the need for further research on state-aware video generation.
The establishment of OSCBench not only provides a framework for future evaluations but also encourages the development of more sophisticated models that can effectively understand and represent the dynamic nature of actions and their corresponding effects on objects.
