OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model
arXiv:2602.12304v3
Abstract: Existing mainstream video customization methods focus on generating identity-consistent videos based on given reference images and textual prompts. Benefiting from the rapid advancement of joint audio-video generation, this paper proposes a more compelling new task: sync audio-video customization, which aims to synchronously customize both video identity and audio timbre.
Introduction
Mainstream video customization methods generate identity-consistent videos from reference images and textual prompts. Building on recent progress in joint audio-video generation, OmniCustom extends this setting to a new task, sync audio-video customization, in which the user specifies both the visual identity and the audio timbre of the generated content at the same time.
Key Features of OmniCustom
OmniCustom builds on a base joint audio-video generation model and adds three main components:
- Dual Control Mechanism: The framework utilizes separate reference identity and audio Low-Rank Adaptation (LoRA) modules. These modules function through self-attention layers within the base audio-video generation model, allowing for fine-tuned control over both visual and auditory aspects.
- Contrastive Learning Objective: To strengthen the model’s ability to maintain identity and timbre, a contrastive learning objective is introduced. It contrasts the flows predicted with the reference inputs as conditions against the flows predicted without reference conditions, which is intended to improve identity and timbre fidelity in the generated video and audio.
- Large-Scale Training Dataset: OmniCustom is trained on a carefully constructed large-scale, high-quality audio-visual human dataset. This extensive dataset enables the model to learn diverse audio-visual correlations, improving its overall generation capabilities.
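The first two components can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the shapes, the weight `lam`, and in particular the exact form of the contrastive loss are assumptions chosen for illustration. `lora_linear` shows the standard LoRA update applied to a single frozen projection (a separate identity or audio module would each be an adapter of this kind), and `contrastive_flow_loss` shows one plausible way to pull the reference-conditioned flow toward the target while pushing it away from the reference-free flow.

```python
import numpy as np

def lora_linear(x, W, A, B, alpha=1.0):
    """Frozen projection W plus a low-rank LoRA update (alpha / r) * B @ A."""
    r = A.shape[0]  # LoRA rank
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

def contrastive_flow_loss(v_ref, v_noref, v_target, lam=0.1):
    """Hypothetical contrastive objective (an assumption, not the paper's formula).

    v_ref    : flow predicted with the reference image/audio as conditions
    v_noref  : flow predicted without reference conditions
    v_target : ground-truth flow
    """
    pos = np.mean((v_ref - v_target) ** 2)  # pull toward the target flow
    neg = np.mean((v_ref - v_noref) ** 2)   # push away from the reference-free flow
    return pos - lam * neg

rng = np.random.default_rng(0)
d, r = 8, 2                       # hidden size and LoRA rank (illustrative values)
W = rng.normal(size=(d, d))       # frozen base weight, e.g. an attention projection
A = rng.normal(size=(r, d))       # LoRA down-projection
B = np.zeros((d, r))              # LoRA up-projection, zero-initialised
x = rng.normal(size=(4, d))       # four tokens

base_out = x @ W.T
lora_out = lora_linear(x, W, A, B)

v_target = np.ones((4, d))
v_noref = np.zeros((4, d))
loss_good = contrastive_flow_loss(v_target, v_noref, v_target)  # prediction matches target
loss_bad = contrastive_flow_loss(v_noref, v_noref, v_target)    # prediction ignores reference
```

Because `B` starts at zero, the adapted layer initially reproduces the base model's output, so training begins from the pretrained behaviour. Under this assumed loss, a prediction that matches the target and differs from the reference-free flow scores lower than one that collapses onto the reference-free flow.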
Methodology
The process begins with the input of a reference image I_r and a reference audio A_r. Users can then specify the spoken content through textual prompts, allowing for a flexible and user-driven experience. The model synthesizes a video that not only preserves the identity of the reference image but also imitates the timbre of the reference audio.
Results and Performance
The authors report that, across extensive experiments, OmniCustom significantly outperforms existing video generation methods, with consistent identity preservation and accurate replication of audio timbre. They attribute these gains to the architecture and training methodology of OmniCustom.
Conclusion
OmniCustom merges identity and audio timbre customization into a single framework for audio-video generation. It operates in a zero-shot manner and is trained with a contrastive learning objective, a combination that makes it a promising step towards more versatile multimedia generation tools.
For more information, see the official OmniCustom project page.
