DiReCT: Physics-Based Video Generation with Contrastive Learning

DiReCT: Disentangled Regularization of Contrastive Trajectories for Physics-Refined Video Generation

arXiv:2603.25931v1

Introduction

Generating high-fidelity, temporally coherent video is a major focus of current AI research. Recent flow-matching video generators produce impressive results, but they often violate basic physical principles. The problem stems from their reconstruction objectives, which penalize deviations frame by frame and do not distinguish physically plausible dynamics from implausible ones.
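The limitation can be illustrated with a toy version of a flow-matching reconstruction loss. The sketch below assumes the common straight-path formulation, where the regression target is the velocity `x1 - x0` between a noise sample and a data sample; the array shapes, the helper name `flow_matching_loss`, and the two synthetic error patterns are illustrative assumptions, not details from the paper.

```python
import numpy as np

def flow_matching_loss(v_pred, x0, x1):
    """Mean squared error between a predicted velocity and the
    straight-path target x1 - x0 (assumed formulation)."""
    target = x1 - x0
    return float(np.mean((v_pred - target) ** 2))

rng = np.random.default_rng(0)
frames, dim = 4, 8                       # hypothetical tiny latent "video"
x0 = rng.standard_normal((frames, dim))  # noise sample
x1 = rng.standard_normal((frames, dim))  # data sample
target = x1 - x0

# Two prediction errors of identical size:
# one is the same smooth offset in every frame,
# the other flips sign from frame to frame (temporal jitter).
offset = 0.1 * np.ones((frames, dim))
signs = np.array([1.0, -1.0, 1.0, -1.0])[:, None]

loss_smooth = flow_matching_loss(target + offset, x0, x1)
loss_jitter = flow_matching_loss(target + signs * offset, x0, x1)
print(loss_smooth, loss_jitter)
```

Both errors receive the same loss, because the objective only measures per-element distance to the target. Nothing in it rewards the temporally consistent prediction over the jittery one.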

Contrastive Flow Matching

Contrastive flow matching offers a more principled approach: it separates the velocity-field trajectories of different conditions. In text-conditioned video generation, however, it runs into semantic-physics entanglement. Natural-language prompts couple scene content with physical behavior, so naive negative sampling draws negatives whose velocity fields overlap with those of the positive sample. Where the fields overlap, the contrastive gradient works against the flow-matching objective and impedes training.
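A small numerical sketch makes the conflict concrete. It assumes the usual contrastive flow-matching form, a squared-error term toward the positive target minus a weighted squared-error term toward the negative target; the weight `lam`, the vector size, and the synthetic targets are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def contrastive_fm_grads(v_pred, v_pos, v_neg, lam):
    """Gradients w.r.t. v_pred of the two terms in
    ||v_pred - v_pos||^2 - lam * ||v_pred - v_neg||^2."""
    g_fm = 2.0 * (v_pred - v_pos)          # pulls v_pred toward the positive target
    g_con = -2.0 * lam * (v_pred - v_neg)  # pushes v_pred away from the negative target
    return g_fm, g_con

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
dim, lam = 16, 0.5
v_pos = rng.standard_normal(dim)
v_pred = v_pos + 0.5 * rng.standard_normal(dim)

# Entangled negative: its target velocity almost coincides with the positive's.
v_neg_entangled = v_pos + 0.01 * rng.standard_normal(dim)
# Distant negative: its target velocity lies far from the positive's.
v_neg_distant = v_pos + 5.0 * rng.standard_normal(dim)

g_fm, g_con_entangled = contrastive_fm_grads(v_pred, v_pos, v_neg_entangled, lam)
_, g_con_distant = contrastive_fm_grads(v_pred, v_pos, v_neg_distant, lam)

cos_entangled = cosine(g_fm, g_con_entangled)
cos_distant = cosine(g_fm, g_con_distant)
print(cos_entangled, cos_distant)

# Limiting case: a negative identical to the positive simply rescales
# the flow-matching gradient by (1 - lam).
g_fm_same, g_con_same = contrastive_fm_grads(v_pred, v_pos, v_pos, lam)
net_same = g_fm_same + g_con_same
```

With the entangled negative, the contrastive gradient is almost exactly antiparallel to the flow-matching gradient, so it cancels part of the useful update. With the distant negative, the two gradients are no longer locked in opposition.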

DiReCT Framework

To formalize and resolve this gradient conflict, the authors introduce DiReCT (Disentangled Regularization of Contrastive Trajectories), a lightweight post-training framework that decomposes the contrastive signal into two complementary scales:

  • Macro-Contrastive Term: Draws partition-exclusive negatives from semantically distant regions, giving interference-free global trajectory separation.
  • Micro-Contrastive Term: Constructs hard negatives that share the full scene semantics of the positive sample but differ along a single axis of physical behavior. A large language model (LLM) perturbs that axis, which can be kinematics, forces, materials, interactions, or magnitudes.
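The two kinds of negatives can be sketched as simple data manipulations. Everything below is a hypothetical illustration: the `Prompt` structure, the integer `partition` id standing in for a semantic region, and `toy_perturb`, a placeholder for the LLM rewrite, are assumptions and not the paper's implementation.

```python
import random
from dataclasses import dataclass

AXES = ("kinematics", "forces", "materials", "interactions", "magnitudes")

@dataclass
class Prompt:
    scene: str       # scene content shared by the positive and its micro negatives
    physics: dict    # one description per physical axis
    partition: int   # hypothetical id of the prompt's semantic region

def toy_perturb(axis, value):
    """Placeholder for the LLM rewrite of one physical axis (assumption)."""
    return f"implausible variant of: {value}"

def micro_negative(anchor, axis, perturb=toy_perturb):
    """Hard negative: same scene, exactly one physical axis changed."""
    physics = dict(anchor.physics)
    physics[axis] = perturb(axis, anchor.physics[axis])
    return Prompt(anchor.scene, physics, anchor.partition)

def macro_negative(anchor, pool, rng):
    """Global negative: sampled only from partitions other than the anchor's."""
    candidates = [p for p in pool if p.partition != anchor.partition]
    return rng.choice(candidates)

anchor = Prompt(
    scene="a rubber ball dropped onto a wooden floor",
    physics={
        "kinematics": "falls and bounces with decreasing height",
        "forces": "gravity pulls the ball downward",
        "materials": "the ball is elastic rubber",
        "interactions": "the ball rebounds off the floor",
        "magnitudes": "each bounce is lower than the last",
    },
    partition=0,
)
pool = [
    anchor,
    Prompt("water poured into a glass", {a: "unspecified" for a in AXES}, 1),
    Prompt("a car braking on a wet road", {a: "unspecified" for a in AXES}, 2),
]

micro = micro_negative(anchor, "materials")
macro = macro_negative(anchor, pool, random.Random(0))
changed_axes = [a for a in AXES if micro.physics[a] != anchor.physics[a]]
print(changed_axes, macro.scene)
```

The micro negative keeps the scene fixed so that only the physical difference separates it from the positive, while the macro negative is guaranteed to come from a different partition.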

Preventing Catastrophic Forgetting

A velocity-space distributional regularizer complements the two contrastive terms. It prevents catastrophic forgetting of the pre-trained model's visual quality during post-training.
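The source does not give the regularizer's form. As one possible reading, the sketch below penalizes drift between the velocities predicted by the post-trained model and those of a frozen pre-trained copy, using a KL divergence between diagonal Gaussians fitted over a batch. The choice of KL, the function name `gaussian_kl_regularizer`, and the synthetic batches are all assumptions made for illustration.

```python
import numpy as np

def gaussian_kl_regularizer(v_tuned, v_ref, eps=1e-6):
    """KL(N_tuned || N_ref) between per-dimension Gaussians fitted to two
    batches of velocities with shape (batch, dim). Hypothetical choice."""
    mu_t, var_t = v_tuned.mean(axis=0), v_tuned.var(axis=0) + eps
    mu_r, var_r = v_ref.mean(axis=0), v_ref.var(axis=0) + eps
    kl = 0.5 * (np.log(var_r / var_t) + (var_t + (mu_t - mu_r) ** 2) / var_r - 1.0)
    return float(kl.sum())

rng = np.random.default_rng(0)
v_ref = rng.standard_normal((64, 8))   # velocities from the frozen pre-trained model
v_close = v_ref + 0.01 * rng.standard_normal((64, 8))  # mild post-training drift
v_far = v_ref + 2.0                    # large shift in the velocity distribution

reg_same = gaussian_kl_regularizer(v_ref, v_ref)
reg_close = gaussian_kl_regularizer(v_close, v_ref)
reg_far = gaussian_kl_regularizer(v_far, v_ref)
print(reg_same, reg_close, reg_far)
```

A penalty of this kind stays near zero while the post-trained velocities remain distributed like the pre-trained ones and grows as they drift, which is the behavior a forgetting-prevention term needs.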

Results and Conclusion

Applied to the Wan 2.1-1.3B model, DiReCT improves the physical commonsense score on the VideoPhy benchmark by 16.7% over the baseline and by 11.3% over supervised fine-tuning (SFT), without increasing training time.

Methods like DiReCT are a step toward video generators whose outputs are both visually convincing and consistent with basic physics, which makes them more useful in real-world applications.


Lazarus Omolua, https://richlyai.com/blog