PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation
Generating audio-visual content from textual descriptions (Text-to-Audio-Video, or T2AV, generation) has become an important research focus, with applications in filmmaking, gaming, and immersive world modeling. Despite recent progress, many existing models share a significant limitation: they often fail to produce physically plausible sounds. This shortcoming calls for an evaluation framework that tests how well generated audio is grounded in physics.
To fill this gap, researchers have introduced PhyAVBench, a benchmark that systematically assesses whether T2AV models generate audio consistent with physical reality. Whereas previous benchmarks focused primarily on audio-video synchronization, PhyAVBench evaluates audio-physics grounding, with the aim of advancing physically plausible audio-visual generation.
Overview of PhyAVBench
PhyAVBench comprises three main components:
- Dataset Creation: The benchmark includes PhyAV-Sound-11K, a dataset of 11,605 audible videos (25.5 hours in total). The videos were collected from 184 participants to ensure diversity and prevent data leakage.
- Controlled Physical Variations: The dataset is organized into 337 paired-prompt groups, each built around a controlled physical variation that drives a difference in sound. Each group is grounded with an average of 17 videos, and the groups together span six audio-physics dimensions and 41 fine-grained test points.
- Annotation of Physical Factors: Each prompt pair is annotated with the underlying physical factors that explain the acoustic difference between its two prompts, which gives every comparison an explicit physical rationale.
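The structure described in the list above can be pictured as one record per paired-prompt group. The following is a minimal sketch under stated assumptions, not the released data format: the class name `PromptGroup`, all of its field names, the split of videos into one list per prompt, and the example prompts and labels are hypothetical. The only figures taken from the text are the 25.5 hours and 11,605 videos used to compute the mean clip length.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PromptGroup:
    """One paired-prompt group (hypothetical schema, not the official format)."""
    dimension: str        # one of the six audio-physics dimensions
    test_point: str       # one of the 41 fine-grained test points
    prompt_a: str         # first prompt of the pair
    prompt_b: str         # second prompt, differing in one physical condition
    physical_factor: str  # annotated factor behind the acoustic difference
    # Assumption: real videos are stored separately for each prompt.
    videos_a: List[str] = field(default_factory=list)
    videos_b: List[str] = field(default_factory=list)

    def num_videos(self) -> int:
        """Total number of real videos grounding this group."""
        return len(self.videos_a) + len(self.videos_b)


def mean_clip_seconds(total_hours: float, num_videos: int) -> float:
    """Average clip duration in seconds implied by the dataset totals."""
    return total_hours * 3600.0 / num_videos


# Illustrative values only; the labels and prompts are placeholders.
group = PromptGroup(
    dimension="placeholder-dimension",
    test_point="placeholder-test-point",
    prompt_a="A spoon taps an empty glass.",
    prompt_b="A spoon taps a glass full of water.",
    physical_factor="liquid fill level changes the resonant pitch",
    videos_a=["a_001.mp4", "a_002.mp4"],
    videos_b=["b_001.mp4"],
)

if __name__ == "__main__":
    print(group.num_videos())
    print(round(mean_clip_seconds(25.5, 11605), 2))
```

Applied to the reported totals, `mean_clip_seconds` gives an average clip length of roughly 7.9 seconds.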
Evaluation Paradigm and Metrics
PhyAVBench introduces an evaluation paradigm called the Audio-Physics Sensitivity Test (APST), which assesses how well a model's audio is grounded in physics. Its central metric is the Contrastive Physical Response Score (CPRS), which quantifies the acoustic consistency between generated videos and their real-world counterparts. Researchers can use this score to gauge how plausibly their models ground audio in the physical conditions described by the prompt.
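The text above does not give the formula for CPRS, so the sketch below is not the benchmark's definition. It illustrates one plausible way a contrastive score of this kind could be computed, assuming each audio clip has already been reduced to a fixed-length feature vector: measure how the features shift between prompt A and prompt B in the real recordings, measure the same shift in the generated clips, and compare the two shifts by cosine similarity. The function names and the choice of cosine similarity are assumptions made for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise mean of a non-empty collection of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def response(feats_a: Sequence[Vector], feats_b: Sequence[Vector]) -> List[float]:
    """Shift in mean audio features when moving from prompt A to prompt B."""
    mean_a = mean_vector(feats_a)
    mean_b = mean_vector(feats_b)
    return [b - a for a, b in zip(mean_a, mean_b)]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def contrastive_score(gen_a, gen_b, real_a, real_b) -> float:
    """Illustrative stand-in for a contrastive physical response score.

    Near +1: the generated audio changes in the same direction as real audio.
    Near 0: the model ignores the physical change. Near -1: opposite change.
    """
    return cosine(response(gen_a, gen_b), response(real_a, real_b))


# Toy two-dimensional features (made-up numbers, not benchmark data).
real_a = [[1.0, 0.0], [1.0, 0.2]]
real_b = [[0.0, 1.0], [0.2, 1.0]]
sensitive = contrastive_score([[2.0, 0.0]], [[0.0, 2.0]], real_a, real_b)
insensitive = contrastive_score([[1.0, 1.0]], [[1.0, 1.0]], real_a, real_b)
reversed_ = contrastive_score([[0.0, 2.0]], [[2.0, 0.0]], real_a, real_b)

if __name__ == "__main__":
    print(sensitive, insensitive, reversed_)
```

Comparing shifts rather than raw features means a model is scored on whether it responds to the controlled physical change, not on whether its audio matches the real recordings in absolute terms.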
Results and Future Directions
An evaluation of 17 state-of-the-art models on PhyAVBench shows that even some leading commercial models struggle to reproduce fundamental audio-physical phenomena. This finding exposes a critical gap in current audio-visual generation capabilities and points to audio-physics grounding as a direction for future research.
PhyAVBench gives the research community a common basis for measuring progress in physically grounded audio-visual generation. The prompts, ground-truth data, and generated video samples are publicly available on the PhyAVBench official site.
