PhysVid: Physics Aware Local Conditioning for Generative Video Models
arXiv:2603.26285v1 (cross-listed)
Abstract: Generative video models achieve high visual fidelity but often violate basic physical principles, limiting reliability in real-world settings. Prior attempts to inject physics rely on conditioning: frame-level signals are domain-specific and short-horizon, while global text prompts are coarse and noisy, missing fine-grained dynamics. We present PhysVid, a physics-aware local conditioning scheme that operates over temporally contiguous chunks of frames. Each chunk is annotated with physics-grounded descriptions of states, interactions, and constraints, which are fused with the global prompt via chunk-aware cross-attention during training. At inference, we introduce negative physics prompts (descriptions of locally relevant law violations) to steer generation away from implausible trajectories. On VideoPhy, PhysVid improves physical commonsense scores by approximately 33% over baseline video generators, and by up to approximately 8% on VideoPhy2. These results show that local, physics-aware guidance substantially increases physical plausibility in generative video and marks a step toward physics-grounded video models.
Introduction
Generative video models have reached high visual quality in recent years, but they often violate basic physical principles, which limits their reliability in real-world settings. Prior attempts to inject physics rely on conditioning signals that are either too narrow (frame-level signals) or too coarse (global text prompts).
The Challenge of Conditioning
Prior conditioning methods fall into two groups:
- Frame-level signals: These are often domain-specific and short-horizon, making them less effective for capturing long-term dynamics.
- Global text prompts: While they provide a broader context, these prompts tend to be coarse and noisy, lacking the granularity necessary to guide fine-grained dynamics.
Introducing PhysVid
To address these limitations, we introduce PhysVid, a physics-aware local conditioning scheme. It operates over temporally contiguous chunks of frames rather than single frames or the whole clip. Each chunk is annotated with physics-grounded descriptions of:
- States
- Interactions
- Constraints
During training, these chunk-level annotations are fused with the global prompt via chunk-aware cross-attention.
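One way to picture chunk-aware conditioning is as an attention mask: each frame attends to the global prompt tokens and to the tokens of its own chunk's physics description, but not to the descriptions of other chunks. The sketch below is a minimal illustration under that assumption. The source does not specify the mask layout, the chunk size, or the token counts, and the names `chunk_frames` and `chunk_aware_mask` are hypothetical.

```python
from typing import List


def chunk_frames(num_frames: int, chunk_size: int) -> List[range]:
    """Split frame indices 0..num_frames-1 into temporally contiguous chunks."""
    if num_frames <= 0 or chunk_size <= 0:
        raise ValueError("num_frames and chunk_size must be positive")
    return [
        range(start, min(start + chunk_size, num_frames))
        for start in range(0, num_frames, chunk_size)
    ]


def chunk_aware_mask(
    num_frames: int, chunk_size: int, global_len: int, local_len: int
) -> List[List[bool]]:
    """Build a [num_frames x num_text_tokens] boolean attention mask.

    Assumed text layout (not from the source): global prompt tokens first,
    then local_len tokens for chunk 0, chunk 1, and so on.
    mask[f][t] is True when frame f may attend to text token t.
    """
    chunks = chunk_frames(num_frames, chunk_size)
    total_text = global_len + local_len * len(chunks)
    mask = [[False] * total_text for _ in range(num_frames)]
    for chunk_idx, frames in enumerate(chunks):
        local_start = global_len + chunk_idx * local_len
        for f in frames:
            # Every frame sees the global prompt.
            for t in range(global_len):
                mask[f][t] = True
            # Each frame also sees only its own chunk's physics description.
            for t in range(local_start, local_start + local_len):
                mask[f][t] = True
    return mask


# Six frames in two chunks of three; two global tokens, two local tokens per chunk.
mask = chunk_aware_mask(num_frames=6, chunk_size=3, global_len=2, local_len=2)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

In a real model this mask would be applied inside the cross-attention layers over video latents and text embeddings; the frame-level granularity here is a simplification.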
Inference and Negative Physics Prompts
At inference, PhysVid introduces negative physics prompts: descriptions of locally relevant violations of physical laws. These prompts steer generation away from physically implausible trajectories.
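The source does not say how negative physics prompts enter the sampler. A common way to use negative prompts in diffusion models is classifier-free guidance with the negative-prompt prediction in place of the unconditional one. The sketch below assumes that formulation, applied chunk by chunk. The function names, the flat-list representation of model predictions, and the guidance scale are illustrative assumptions, not PhysVid's actual implementation.

```python
from typing import List, Sequence


def guided_prediction(
    pos: Sequence[float], neg: Sequence[float], scale: float
) -> List[float]:
    """Extrapolate from the negative-prompt prediction toward the positive one.

    With scale == 1 this returns the positive prediction unchanged;
    scale > 1 pushes the result further away from the negative prediction.
    """
    if len(pos) != len(neg):
        raise ValueError("pos and neg must have the same length")
    return [n + scale * (p - n) for p, n in zip(pos, neg)]


def guide_chunks(
    pos_chunks: Sequence[Sequence[float]],
    neg_chunks: Sequence[Sequence[float]],
    scale: float,
) -> List[List[float]]:
    """Apply guidance per chunk, pairing each chunk with its own negative prompt."""
    if len(pos_chunks) != len(neg_chunks):
        raise ValueError("need one negative prediction per chunk")
    return [guided_prediction(p, n, scale) for p, n in zip(pos_chunks, neg_chunks)]


# Toy predictions for two chunks (made-up numbers, for illustration only).
pos = [[1.0, 2.0], [0.0, 4.0]]
neg = [[0.0, 1.0], [1.0, 2.0]]
print(guide_chunks(pos, neg, scale=2.0))  # [[2.0, 3.0], [-1.0, 6.0]]
```

Because each chunk is paired with its own negative description, the push away from a violation acts only on the frames where that violation is relevant.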
Results and Impact
On VideoPhy, PhysVid improves physical commonsense scores by approximately 33% over baseline video generators. On VideoPhy2, the improvement is up to approximately 8%. These results indicate that local, physics-aware guidance can substantially increase the physical plausibility of generated video.
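The text above reports improvements as percentages but does not state whether they are relative gains or absolute percentage-point differences. The helper below shows how the two readings differ. It is a generic sketch, and the scores in the usage lines are made-up placeholders, not values from the paper.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement over the baseline, as a fraction of the baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline


def absolute_gain(baseline: float, improved: float) -> float:
    """Absolute difference between the two scores (same units as the scores)."""
    return improved - baseline


# Placeholder scores on a 0..1 scale (hypothetical, for illustration only).
baseline_score, improved_score = 0.50, 0.60
print(f"relative: {relative_gain(baseline_score, improved_score):.1%}")
print(f"absolute: {absolute_gain(baseline_score, improved_score) * 100:.1f} points")
```

The same pair of scores gives a 20% relative gain but only a 10-point absolute gain, so the reading matters when comparing reported numbers across benchmarks.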
Conclusion
PhysVid integrates physics into generative video modeling through local conditioning: chunk-level physics annotations during training and negative physics prompts at inference. The reported gains on VideoPhy and VideoPhy2 suggest that this local guidance makes generated video more physically plausible, marking a step toward physics-grounded video generation.
