Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels
arXiv:2604.10367v1
Abstract: Audio-driven human video generation has achieved remarkable success in monologue scenarios, largely driven by advancements in powerful video generation foundation models. Moving beyond monologues, authentic human communication is inherently a full-duplex interactive process, requiring virtual agents not only to articulate their own speech but also to react naturally to incoming conversational audio. Most existing methods simply extend conventional audio-driven paradigms to listening scenarios. However, relying on strict frame-to-frame alignment renders the model’s response to long-range conversational dynamics rigid, whereas directly introducing global attention catastrophically degrades lip synchronization. Recognizing the unique temporal Scale Discrepancy between talking and listening behaviors, we introduce a multi-head Gaussian kernel to explicitly inject this physical intuition into the model as a progressive temporal inductive bias.
Introduction
Virtual avatars are becoming a common interface for digital communication. This study addresses audio-driven human video generation for conversation, where an avatar must both articulate its own speech and react naturally to the audio of the person it is talking with.
Challenges in Current Methods
While significant progress has been made in monologue scenarios, several challenges remain in achieving realistic interactive communication. Key issues include:
- Rigid Response Mechanisms: Most existing methods extend conventional audio-driven paradigms to listening and rely on strict frame-to-frame alignment, which makes their response to long-range conversational dynamics rigid.
- Poor Lip Synchronization: Directly introducing global attention to capture that longer context catastrophically degrades lip synchronization, undermining the realism of the avatars.
- Temporal Scale Discrepancy: Talking and listening behaviors operate on different temporal scales, so a single fixed attention range does not serve both well.
Innovative Solutions
In response to these challenges, we introduce a multi-head Gaussian kernel that injects the temporal scale discrepancy into the model as a progressive temporal inductive bias. On this basis we build a full-duplex interactive virtual agent that processes dual-stream audio inputs:
- Handling Dual-Stream Audio: The agent can engage in both talking and listening simultaneously, mimicking natural human interaction.
- VoxHear Dataset: We have created a rigorously cleaned dataset, VoxHear, featuring perfectly decoupled speech and background audio tracks to enhance training efficiency and effectiveness.
- Temporal Alignment and Contextual Semantics: Our method successfully merges strong temporal alignment with deep contextual semantics, creating a more responsive and natural avatar.
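To make the idea of a multi-head Gaussian kernel concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes the kernel acts as an additive log-bias on cross-attention logits between video frames and audio features, that the audio features have been resampled to the video frame rate, and that each head gets its own width. The function names and the sigma values are hypothetical. A small sigma keeps a head close to frame-to-frame alignment, while a large sigma lets it approach global attention.

```python
import numpy as np

def gaussian_attention_bias(num_video, num_audio, sigmas):
    """Additive log-bias of shape (heads, video, audio).

    Assumption: one Gaussian width (in frames) per attention head.
    """
    t_video = np.arange(num_video, dtype=np.float64)[:, None]
    t_audio = np.arange(num_audio, dtype=np.float64)[None, :]
    dist_sq = (t_video - t_audio) ** 2                      # (video, audio)
    sig = np.asarray(sigmas, dtype=np.float64)[:, None, None]
    # log of exp(-d^2 / (2 sigma^2)); wider sigma means a flatter penalty
    return -dist_sq[None, :, :] / (2.0 * sig ** 2)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)                 # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def biased_cross_attention(q, k, v, sigmas):
    """q: (heads, Tv, d) video queries; k, v: (heads, Ta, d) audio keys/values."""
    heads, tv, d = q.shape
    ta = k.shape[1]
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(d)          # (heads, Tv, Ta)
    logits = logits + gaussian_attention_bias(tv, ta, sigmas)
    weights = softmax(logits, axis=-1)
    return weights @ v, weights

# Hypothetical widths: narrow heads for lip sync, wide heads for listening context.
example_sigmas = [0.5, 2.0, 8.0, 32.0]
```

With content-free (all-zero) queries and keys, the attention weights reduce to the normalized Gaussian itself, so the narrowest head concentrates most of its mass on the aligned audio frame and the widest head spreads it over a long window. How the paper schedules the widths "progressively" is not stated in the abstract, so the per-head assignment here is only an assumption.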
Results and Future Directions
Extensive experiments demonstrate that our approach sets a new state-of-the-art for generating highly natural and responsive full-duplex interactive digital humans. The potential applications of this technology span various fields, including virtual reality, gaming, and online education.
Additional resources are available on the Beyond Monologue project page.
Conclusion
Full-duplex interactive avatars are a step toward more engaging and realistic digital interactions. The multi-head Gaussian kernel and the VoxHear dataset provide a basis for further work on avatars that talk and listen within the same conversation.
