Gelina: Unified Speech and Gesture Synthesis via Interleaved Token Prediction
arXiv:2510.12834v3
Introduction
Human communication is inherently multimodal, integrating both verbal and non-verbal elements such as speech and gestures. Despite the natural coupling of these modalities, most existing computational methods for generating speech and gestures tend to synthesize them sequentially. This approach often results in reduced synchrony and misalignment in prosody, which can hinder the overall effectiveness of communication. To address this challenge, we introduce Gelina, a novel framework designed to jointly synthesize speech and co-speech gestures from text using interleaved token sequences.
Overview of Gelina
Gelina is built on a discrete autoregressive backbone that predicts a single sequence in which speech tokens and gesture tokens are interleaved. Modality-specific decoders work in tandem on this sequence to produce the speech and gesture outputs. Because both modalities are generated within the same sequence rather than one after the other, the timing and expression of gestures can align closely with the spoken words, which is the synchrony that sequential pipelines tend to lose.
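To make the idea of an interleaved token sequence concrete, here is a minimal sketch in Python. It is not Gelina's actual scheme: the source does not state the interleaving pattern, so the fixed chunk sizes (`speech_chunk`, `gesture_chunk`), the `(modality, token)` tagging, and the function name `interleave` are all assumptions made for illustration.

```python
from typing import List, Tuple

SPEECH, GESTURE = "speech", "gesture"  # hypothetical modality tags


def interleave(
    speech: List[int],
    gesture: List[int],
    speech_chunk: int = 2,   # assumed: speech tokens emitted per step
    gesture_chunk: int = 1,  # assumed: gesture tokens emitted per step
) -> List[Tuple[str, int]]:
    """Merge two token streams into one sequence of (modality, token) pairs.

    Each step emits up to `speech_chunk` speech tokens followed by up to
    `gesture_chunk` gesture tokens, until both streams are exhausted.
    """
    out: List[Tuple[str, int]] = []
    s = g = 0
    while s < len(speech) or g < len(gesture):
        for tok in speech[s:s + speech_chunk]:
            out.append((SPEECH, tok))
        s += speech_chunk
        for tok in gesture[g:g + gesture_chunk]:
            out.append((GESTURE, tok))
        g += gesture_chunk
    return out


if __name__ == "__main__":
    for item in interleave([1, 2, 3, 4], [10, 20]):
        print(item)
```

An autoregressive model trained on such a sequence predicts each token conditioned on all earlier tokens of both modalities, which is what lets gesture tokens depend on the speech generated so far, and vice versa.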
Key Features
- Joint Synthesis: Gelina synthesizes speech and gestures together from text, promoting better synchrony and prosody alignment.
- Interleaved Token Sequences: Speech and gesture tokens are interleaved in a single sequence rather than generated in separate passes.
- Modality-Specific Decoders: Dedicated decoders for speech and for gestures handle the output of each modality.
- Multi-Speaker Cloning: Gelina supports cloning of multiple speakers.
- Gesture-Only Synthesis: The framework can also synthesize gestures alone, taking speech as input.
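The relationship between a single interleaved sequence and modality-specific decoders can be sketched as follows. This is an illustrative assumption, not the paper's implementation: the `(modality, token)` representation, the helper names `split_by_modality` and `decode_all`, and the stand-in decoder functions are all hypothetical. In a real system the decoders would be learned models that map tokens to audio and motion.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Token = Tuple[str, int]  # assumed representation: (modality tag, token id)


def split_by_modality(sequence: Sequence[Token]) -> Dict[str, List[int]]:
    """Separate an interleaved sequence into one ordered stream per modality."""
    streams: Dict[str, List[int]] = {}
    for modality, token in sequence:
        streams.setdefault(modality, []).append(token)
    return streams


def decode_all(
    sequence: Sequence[Token],
    decoders: Dict[str, Callable[[List[int]], object]],
) -> Dict[str, object]:
    """Route each modality's token stream to its own decoder."""
    streams = split_by_modality(sequence)
    return {name: decoders[name](tokens) for name, tokens in streams.items()}


if __name__ == "__main__":
    mixed = [("speech", 1), ("speech", 2), ("gesture", 10), ("speech", 3)]
    # Stand-in decoders: placeholders for a vocoder and a motion decoder.
    outputs = decode_all(
        mixed,
        {
            "speech": lambda toks: f"<audio from {len(toks)} tokens>",
            "gesture": lambda toks: f"<motion from {len(toks)} tokens>",
        },
    )
    print(outputs)
```

Keeping the decoders separate means each one only has to handle its own token vocabulary, while the shared sequence is what carries the timing relationship between the two streams.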
Evaluation and Results
To validate the effectiveness of Gelina, we conducted both subjective and objective evaluations. In the subjective evaluation, participants assessed the quality of the synthesized speech and gestures. The results indicate that Gelina achieves competitive speech quality and improved gesture generation compared to unimodal baselines, which supports joint synthesis as a viable approach to multimodal communication synthesis.
Conclusion
Gelina is a step towards more natural and effective multimodal communication systems. By integrating speech and gesture synthesis into a unified framework, it improves the quality of the generated outputs and provides a basis for further work on AI-driven communication technologies. The implications extend beyond speech synthesis to more immersive and interactive experiences in domains such as virtual reality, animation, and human-computer interaction.
Future Directions
As we look to the future, further research will focus on refining the underlying algorithms and expanding the framework’s capabilities. This includes exploring additional modalities, increasing the diversity of synthesized gestures, and enhancing the real-time performance of the system. The ultimate goal is to create a comprehensive and adaptable synthesis solution that truly mirrors the richness of human communication.
