AudioRole: An Audio Dataset for Character Role-Playing in Large Language Models
High-quality multimodal datasets are fundamental to advancing the role-playing capabilities of large language models (LLMs). AudioRole is a curated collection of audio and text data designed specifically for Audio Role-Playing (ARP).
Research in this field has focused predominantly on text-based persona simulation. ARP adds a further challenge: the semantic content of a response and its vocal characteristics must be aligned with each other. To address this, the creators of AudioRole assembled a dataset of over 1,000 hours of audio from 13 popular TV series, containing more than 1 million character-grounded dialogues.
Key Features of the AudioRole Dataset
AudioRole offers several features that make it useful to researchers and developers:
- Synchronized Audio-Text Pairs: Each dialogue is paired with corresponding audio, ensuring that users can study the nuances of character interactions effectively.
- Speaker Identity Annotations: The dataset includes detailed annotations of speaker identities, allowing for precise role-playing simulations.
- Contextual Metadata: Contextual information accompanying dialogues enhances the understanding of character dynamics and situational contexts.
- Diverse Character Representation: With dialogues from over 115 main characters, the dataset captures a wide array of personalities and voices.
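The features above suggest a natural shape for a single dataset entry. The following is a minimal Python sketch of such a record. The class name `DialogueRecord`, the helper `by_speaker`, every field name, and the placeholder values are hypothetical illustrations, not the dataset's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class DialogueRecord:
    """Hypothetical shape of one AudioRole-style entry (all names are assumptions)."""
    audio_path: str   # path to the audio clip for this utterance
    transcript: str   # text paired with the audio
    speaker: str      # annotated speaker identity
    series: str       # source TV series
    context: list[str] = field(default_factory=list)  # preceding turns, as contextual metadata


def by_speaker(records: list[DialogueRecord]) -> dict[str, list[DialogueRecord]]:
    """Group records by annotated speaker, e.g. to build per-character subsets."""
    grouped: dict[str, list[DialogueRecord]] = {}
    for record in records:
        grouped.setdefault(record.speaker, []).append(record)
    return grouped


# Placeholder values for illustration only.
records = [
    DialogueRecord("clips/0001.wav", "Hello there.", "Character A", "Series X"),
    DialogueRecord("clips/0002.wav", "Hi. Long day?", "Character B", "Series X",
                   context=["Hello there."]),
    DialogueRecord("clips/0003.wav", "You could say that.", "Character A", "Series X",
                   context=["Hello there.", "Hi. Long day?"]),
]
grouped = by_speaker(records)
```

Grouping by speaker is one plausible first step for role-playing work, since each character's audio and text can then be studied or sampled together.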
Introducing ARP-Eval: A Dual-Aspect Evaluation Framework
To measure how well models perform on ARP, the creators also introduced ARP-Eval, a dual-aspect evaluation framework. It assesses:
- Response Quality: Evaluating how well the generated responses align with the character’s persona.
- Role Fidelity: Measuring the accuracy of the character portrayal in role-playing scenarios.
Performance Validation of ARP-Model
The dataset was validated empirically with ARP-Model, a GLM-4-Voice model trained on AudioRole. ARP-Model achieved an average Acoustic Personalization score of 0.31, significantly outperforming both the original GLM-4-Voice and the more powerful MiniCPM-O-2.6, which is tailored for one-shot role-playing scenarios.
ARP-Model also attained a Content Personalization score of 0.36, approximately 38% higher than the untrained original model and on par with MiniCPM-O-2.6. These results suggest that AudioRole can advance audio-grounded role-playing research.
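As a quick check on the Content Personalization figures, the sketch below computes a relative improvement and back-solves the baseline score that the reported gain would imply. It assumes that "approximately 38%" denotes a relative gain over the untrained model. The function names are illustrative, and the implied baseline is a derived estimate, not a figure reported in the source.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Fractional gain of `new` over `baseline` (0.38 means 38% higher)."""
    return (new - baseline) / baseline


def implied_baseline(new: float, rel_gain: float) -> float:
    """Baseline score that would yield `new` after a relative gain of `rel_gain`."""
    return new / (1.0 + rel_gain)


arp_content_score = 0.36   # Content Personalization score reported for ARP-Model
reported_gain = 0.38       # "approximately 38%", assumed here to be a relative gain

baseline_estimate = implied_baseline(arp_content_score, reported_gain)
print(f"{baseline_estimate:.2f}")  # 0.26
```

Under that assumption, the untrained model's Content Personalization score would be roughly 0.26.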
Conclusion
In summary, AudioRole provides synchronized audio-text dialogue data, annotated with speaker identities and contextual metadata, for audio-grounded role-playing and character simulation in large language models. Together with the ARP-Eval framework, it gives researchers and developers a basis for building and evaluating AI-driven role-playing systems.
