DiReCT: Physics-Based Video Generation with Contrastive Learning

DiReCT: Disentangled Regularization of Contrastive Trajectories for Physics-Refined Video Generation

Summary: arXiv:2603.25931v1 | Announce Type: cross

Introduction

Generating high-fidelity, temporally coherent video is a major focus of current AI research. Recent flow-matching video generators produce impressive results, but they often violate basic physical principles. The cause lies in their reconstruction objectives, which penalize per-frame deviations without distinguishing physically plausible dynamics from implausible ones.

Contrastive Flow Matching

Contrastive flow matching offers a more principled alternative: it pushes apart the velocity-field trajectories of differing conditions. In text-conditioned video generation, however, it runs into a significant obstacle, semantic-physics entanglement. Natural-language prompts couple scene content with physical behavior, so naively sampled negatives have velocity fields that overlap with those of the positive sample. The resulting contrastive gradient works against the flow-matching objective and impedes training.
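The conflict can be illustrated with a toy calculation. The sketch below assumes the commonly used form of the contrastive flow-matching loss, a squared error to the positive target velocity minus a weighted squared error to a negative target velocity. The weight `lam`, the two-dimensional vectors, and all function names are illustrative assumptions, not values or code from the paper.

```python
import math


def flow_matching_grad(v_pred, u_pos):
    """Gradient of ||v_pred - u_pos||^2 with respect to v_pred."""
    return [2.0 * (v - p) for v, p in zip(v_pred, u_pos)]


def contrastive_grad(v_pred, u_neg, lam=0.05):
    """Gradient of -lam * ||v_pred - u_neg||^2 with respect to v_pred."""
    return [-2.0 * lam * (v - n) for v, n in zip(v_pred, u_neg)]


def contrastive_fm_loss(v_pred, u_pos, u_neg, lam=0.05):
    """Assumed loss form: pull toward the positive target, push from the negative."""
    pos = sum((v - p) ** 2 for v, p in zip(v_pred, u_pos))
    neg = sum((v - n) ** 2 for v, n in zip(v_pred, u_neg))
    return pos - lam * neg


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


# Toy setup: the model currently predicts zero velocity.
v_pred = [0.0, 0.0]
u_pos = [1.0, 0.0]               # target velocity of the positive sample
u_neg_overlapping = [1.0, 0.0]   # negative whose velocity overlaps the positive
u_neg_distant = [-1.0, 0.0]      # negative from a clearly different trajectory

g_fm = flow_matching_grad(v_pred, u_pos)
conflict = cosine(g_fm, contrastive_grad(v_pred, u_neg_overlapping))
agreement = cosine(g_fm, contrastive_grad(v_pred, u_neg_distant))
print("overlapping negative:", conflict)
print("distant negative:", agreement)
```

In this toy case the cosine between the two gradient terms is approximately -1 for the overlapping negative, meaning the contrastive term cancels part of the flow-matching update, and approximately +1 for the distant negative, where the two terms cooperate.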

DiReCT Framework

DiReCT (Disentangled Regularization of Contrastive Trajectories) formalizes the gradient conflict described above and addresses it. The framework is a lightweight post-training enhancement that decomposes the contrastive signal into two complementary scales:

Macro-Contrastive Term: Draws partition-exclusive negatives from semantically distant regions, giving interference-free global trajectory separation.
Micro-Contrastive Term: Constructs hard negatives that share the positive sample's full scene semantics but differ along a single axis of physical behavior. A large language model (LLM) perturbs that axis, which can be kinematics, forces, materials, interactions, or magnitudes.
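A minimal sketch of how the two kinds of negatives might be constructed follows. It makes several assumptions that are not in the source: each prompt carries a hypothetical semantic `partition` id, "semantically distant" is simplified to "a different partition id", and `stub_perturb` is a lookup-table stand-in for the LLM call that rewrites one physics axis.

```python
import random
from dataclasses import dataclass

# The five physics axes named in the article.
PHYSICS_AXES = ("kinematics", "forces", "materials", "interactions", "magnitudes")


@dataclass(frozen=True)
class Sample:
    prompt: str
    partition: int  # hypothetical semantic-cluster id (assumption)


def sample_macro_negative(anchor, pool, rng):
    """Pick a negative from any partition other than the anchor's."""
    candidates = [s for s in pool if s.partition != anchor.partition]
    if not candidates:
        raise ValueError("pool has no sample outside the anchor's partition")
    return rng.choice(candidates)


def make_micro_negative(anchor, axis, perturb):
    """Keep the scene, change one physics axis via the supplied perturb function."""
    if axis not in PHYSICS_AXES:
        raise ValueError(f"unknown physics axis: {axis}")
    return Sample(prompt=perturb(anchor.prompt, axis), partition=anchor.partition)


def stub_perturb(prompt, axis):
    """Stand-in for an LLM rewrite; a real system would query a model here."""
    rewrites = {
        "materials": ("rubber ball", "glass ball"),
        "kinematics": ("bounces on", "hovers above"),
    }
    old, new = rewrites[axis]
    return prompt.replace(old, new)


anchor = Sample("a rubber ball bounces on a wooden floor", partition=0)
pool = [
    anchor,
    Sample("a rubber ball rolls down a wooden ramp", partition=0),
    Sample("smoke rises from a campfire at night", partition=1),
    Sample("a river flows around a large rock", partition=2),
]

macro_neg = sample_macro_negative(anchor, pool, random.Random(0))
micro_neg = make_micro_negative(anchor, "materials", stub_perturb)
print(macro_neg.prompt)
print(micro_neg.prompt)
```

The macro negative always comes from a different partition, while the micro negative keeps the anchor's partition and scene and changes only the material of the ball.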

Preventing Catastrophic Forgetting

DiReCT also employs a velocity-space distributional regularizer. It prevents catastrophic forgetting of the pre-trained model's visual quality during post-training.
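The article does not give the regularizer's exact form. As one possible reading, the sketch below assumes a simple moment-matching penalty that compares the per-dimension mean and variance of velocities predicted by the fine-tuned model against those of the frozen pre-trained model. The function name and the choice of moments are assumptions made for illustration only.

```python
from statistics import fmean, pvariance


def moment_matching_penalty(v_new, v_ref):
    """Penalize drift in per-dimension mean and variance of velocity batches.

    v_new: velocities from the model being fine-tuned (list of float lists).
    v_ref: velocities from the frozen pre-trained model, same shape.
    """
    dims = len(v_ref[0])
    penalty = 0.0
    for d in range(dims):
        new_d = [v[d] for v in v_new]
        ref_d = [v[d] for v in v_ref]
        penalty += (fmean(new_d) - fmean(ref_d)) ** 2
        penalty += (pvariance(new_d) - pvariance(ref_d)) ** 2
    return penalty / dims


v_ref = [[0.0, 1.0], [2.0, 3.0]]
v_same = [[0.0, 1.0], [2.0, 3.0]]
v_shifted = [[1.0, 2.0], [3.0, 4.0]]  # every velocity shifted by +1

print(moment_matching_penalty(v_same, v_ref))
print(moment_matching_penalty(v_shifted, v_ref))
```

A penalty of zero means the fine-tuned model's velocity statistics match the pre-trained ones. The shifted batch is penalized for its mean drift even though its variance is unchanged.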

Results and Conclusion

Applied to the Wan 2.1-1.3B model, DiReCT improves the physical commonsense score on the VideoPhy benchmark by 16.7% over the baseline and by 11.3% over supervised fine-tuning (SFT). These gains come without any increase in training time.

As the field of AI progresses, solutions like DiReCT represent crucial steps toward creating video generators that not only produce visually impressive outputs but also adhere to the fundamental laws of physics, enhancing their applicability in real-world scenarios.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
