DiReCT: Disentangled Regularization of Contrastive Trajectories for Physics-Refined Video Generation
Summary: arXiv:2603.25931v1
Introduction
Generating high-fidelity, temporally coherent video is a major focus of current AI research. Flow-matching video generators produce impressive results, but they often violate basic physical principles. The cause lies in their reconstruction objectives, which penalize per-frame deviations without distinguishing physically plausible dynamics from implausible ones.
Contrastive Flow Matching
Contrastive flow matching offers a more principled approach: it separates the velocity-field trajectories of different conditions. In text-conditioned video generation, however, it runs into a significant obstacle, semantic-physics entanglement. Natural-language prompts couple scene content with physical behavior, so naive negative sampling yields negatives whose velocity fields overlap with those of the positive sample. The resulting contrastive gradient works against the flow-matching objective and impedes training.
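The conflict can be made concrete with a small numerical sketch. It assumes a linear interpolation path, a squared-error flow-matching term, and a contrastive term that subtracts a weighted squared distance to the negative's target velocity. The function names, the weight `lam`, and the toy vectors are illustrative assumptions, not the paper's formulation.

```python
# Sketch of a contrastive flow-matching loss on plain Python vectors.
# Every name and number here is an illustrative assumption.

def fm_target(x0, x1):
    """Target velocity for the linear path x_t = (1 - t) * x0 + t * x1."""
    return [b - a for a, b in zip(x0, x1)]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def contrastive_fm_loss(v_pred, u_pos, u_neg, lam=0.05):
    """Pull v_pred toward the positive target, push it away from the negative."""
    return sq_dist(v_pred, u_pos) - lam * sq_dist(v_pred, u_neg)

def loss_gradients(v_pred, u_pos, u_neg, lam=0.05):
    """Gradients of the two loss terms with respect to v_pred."""
    g_fm = [2 * (v - p) for v, p in zip(v_pred, u_pos)]
    g_con = [-2 * lam * (v - n) for v, n in zip(v_pred, u_neg)]
    return g_fm, g_con

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

noise = [0.0, 0.0]
v_pred = [0.0, 0.0]                      # current model prediction
u_pos = fm_target(noise, [1.0, 0.0])     # positive target velocity
u_far = fm_target(noise, [-1.0, 0.5])    # negative far from the positive
u_near = fm_target(noise, [1.0, 0.05])   # negative that overlaps the positive

g_fm, g_far = loss_gradients(v_pred, u_pos, u_far)
_, g_near = loss_gradients(v_pred, u_pos, u_near)

print("far negative, gradient cosine:", cosine(g_fm, g_far))
print("near negative, gradient cosine:", cosine(g_fm, g_near))
```

With the distant negative the two gradients point in roughly the same direction, so the contrastive term reinforces the flow-matching update. With the overlapping negative they are almost exactly opposed, which is the situation that entangled prompts tend to produce under naive sampling.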
DiReCT Framework
To formalize and resolve this gradient conflict, the authors introduce DiReCT (Disentangled Regularization of Contrastive Trajectories). The framework is a lightweight post-training enhancement that decomposes the contrastive signal into two complementary scales:
- Macro-Contrastive Term: This component draws partition-exclusive negatives from semantically distant regions, ensuring interference-free global trajectory separation.
- Micro-Contrastive Term: This component constructs hard negatives that share the full scene semantics of the positive sample but differ along a single axis of physical behavior, perturbed by a large language model (LLM). The axes span kinematics, forces, materials, interactions, and magnitudes.
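The two sampling scales can be sketched as follows. This is a minimal illustration under stated assumptions: the caption records, the partition labels, and the `PHYSICS_EDITS` lookup table (a stand-in for the LLM perturbation) are all hypothetical, and "partition-exclusive" is read here simply as "drawn from a different semantic partition than the anchor".

```python
import random

# Hypothetical caption records: a semantic partition label, a scene clause,
# and a physics clause. None of this data comes from the paper.
CAPTIONS = [
    {"partition": "kitchen", "scene": "a glass slides off a table",
     "physics": "and shatters on the floor"},
    {"partition": "kitchen", "scene": "water pours into a cup",
     "physics": "and fills it from the bottom up"},
    {"partition": "sports", "scene": "a ball rolls down a ramp",
     "physics": "and speeds up as it descends"},
    {"partition": "street", "scene": "a car brakes at a crossing",
     "physics": "and slows to a stop"},
]

# Hypothetical stand-in for the LLM: one single-axis edit per physics clause.
PHYSICS_EDITS = {
    "and shatters on the floor": ("materials", "and bounces off the floor intact"),
    "and fills it from the bottom up": ("forces", "and fills it from the top down"),
    "and speeds up as it descends": ("kinematics", "and slows down as it descends"),
    "and slows to a stop": ("kinematics", "and speeds up through the crossing"),
}

def macro_negative(anchor, pool, rng):
    """Draw a negative from a different semantic partition than the anchor."""
    candidates = [c for c in pool if c["partition"] != anchor["partition"]]
    return rng.choice(candidates)

def micro_negative(anchor):
    """Keep the scene clause unchanged and alter only the physics clause."""
    axis, new_physics = PHYSICS_EDITS[anchor["physics"]]
    return {**anchor, "physics": new_physics, "axis": axis}

rng = random.Random(0)
anchor = CAPTIONS[0]
macro = macro_negative(anchor, CAPTIONS, rng)
micro = micro_negative(anchor)

print("anchor:", anchor["scene"], anchor["physics"])
print("macro :", macro["scene"], macro["physics"])
print("micro :", micro["scene"], micro["physics"], "| axis:", micro["axis"])
```

The macro negative differs from the anchor in scene content, while the micro negative keeps the scene and changes only the physical outcome. In a real pipeline the lookup table would be replaced by an LLM call that rewrites one physical axis of the prompt.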
Preventing Catastrophic Forgetting
DiReCT also employs a velocity-space distributional regularizer. This regularizer prevents catastrophic forgetting of the pre-trained model's visual quality during post-training.
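The summary does not give the form of this regularizer. As a generic stand-in only, the sketch below penalizes differences in the per-dimension mean and variance of velocity predictions between a fine-tuned model and a frozen pre-trained reference. The moment-matching choice and all function names are assumptions, not the paper's method.

```python
# Generic stand-in for a velocity-space regularizer (an assumption, not the
# paper's definition): match batch statistics of predicted velocities between
# the fine-tuned model and a frozen pre-trained reference.

def mean_and_var(batch):
    """Per-dimension mean and population variance of a batch of vectors."""
    n = len(batch)
    dims = range(len(batch[0]))
    mean = [sum(v[d] for v in batch) / n for d in dims]
    var = [sum((v[d] - mean[d]) ** 2 for v in batch) / n for d in dims]
    return mean, var

def velocity_moment_penalty(v_tuned, v_ref):
    """Squared gap between the first two moments of the two velocity batches."""
    mean_t, var_t = mean_and_var(v_tuned)
    mean_r, var_r = mean_and_var(v_ref)
    mean_gap = sum((a - b) ** 2 for a, b in zip(mean_t, mean_r))
    var_gap = sum((a - b) ** 2 for a, b in zip(var_t, var_r))
    return mean_gap + var_gap

v_ref = [[0.0, 1.0], [2.0, 3.0]]       # velocities from the frozen reference
v_same = [[0.0, 1.0], [2.0, 3.0]]      # fine-tuned model that has not drifted
v_drift = [[1.0, 1.0], [3.0, 3.0]]     # fine-tuned model shifted in one dimension

print("no drift:", velocity_moment_penalty(v_same, v_ref))
print("drift   :", velocity_moment_penalty(v_drift, v_ref))
```

A penalty of this kind is zero when the fine-tuned velocities have the same batch statistics as the reference and grows as they drift, which is the general behavior one would want from a term meant to preserve pre-trained quality.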
Results and Conclusion
When applied to the Wan 2.1-1.3B model, DiReCT improved the physical commonsense score on the VideoPhy benchmark by 16.7% over the baseline and by 11.3% over supervised fine-tuning (SFT). These gains came without any increase in training time.
DiReCT is a step toward video generators whose outputs are not only visually impressive but also consistent with basic physics, which makes them more useful in real-world scenarios.
