Bangla-WhisperDiar: Enhanced ASR & Speaker Diarization

Bangla-WhisperDiar: Fine-Tuning Whisper and PyAnnote for Bangla Long-Form Speech Recognition and Speaker Diarization

Researchers have introduced Bangla-WhisperDiar, a framework for automatic speech recognition (ASR) and speaker diarization in the Bangla language. It targets three problems: long-form recordings, diverse acoustic conditions, and substantial variability between speakers. Rather than training new models from scratch, the work fine-tunes existing ones and reports improved performance on Bangla speech processing.

The Challenges of Bangla ASR and Diarization

ASR and diarization in Bangla remain difficult for three main reasons:

Long-form Recordings: Traditional ASR systems often struggle with extended audio segments, which can contain multiple speakers and varying contexts.
Diverse Acoustic Conditions: Variability in recording environments can significantly impact the accuracy of speech recognition systems.
Speaker Variability: The differences in accents, speech patterns, and voice characteristics among speakers further complicate the recognition process.

Methodology

The researchers approached the ASR challenge (Problem 1) by fine-tuning the tugstugi bengaliai regional ASR Whisper medium model on a custom-curated dataset of approximately 15,000 chunked and aligned Bangla audio segments. The methodology included:

Full Weight Training: All of the model's weights were updated during fine-tuning, rather than only a subset of layers, to adapt it more fully to Bangla.
Data Augmentation: Techniques such as noise injection, reverb simulation, echo, clipping distortion, and pitch/time perturbation were employed to improve robustness.
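
To make the augmentation step more concrete, here is a minimal sketch of four of the listed techniques (noise injection, echo, clipping distortion, and a simple speed perturbation) using NumPy. It is an illustration only, not the authors' implementation: the function names, the parameter values, and the choice of linear-interpolation resampling (which changes pitch and duration together) are all assumptions. Reverb simulation is left out because it is normally done by convolving the signal with a room impulse response, which the article does not describe.

```python
import numpy as np


def add_noise(wave, snr_db, rng):
    """Add white Gaussian noise at a target signal-to-noise ratio in dB."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise


def add_echo(wave, sample_rate, delay_s, decay):
    """Mix in one delayed, attenuated copy of the signal (a single echo)."""
    delay = int(delay_s * sample_rate)
    out = wave.astype(np.float64)  # astype returns a copy, so the input is untouched
    if 0 < delay < len(wave):
        out[delay:] += decay * wave[:-delay]
    return out


def clip_distortion(wave, threshold):
    """Hard-clip samples to [-threshold, threshold], as an overdriven mic would."""
    return np.clip(wave, -threshold, threshold)


def speed_perturb(wave, rate):
    """Resample by linear interpolation; rate > 1 shortens the clip and raises pitch."""
    new_len = int(len(wave) / rate)
    positions = np.arange(new_len) * rate
    return np.interp(positions, np.arange(len(wave)), wave)


# Hypothetical usage on one second of a synthetic 220 Hz tone at 16 kHz.
rng = np.random.default_rng(0)
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
tone = 0.8 * np.sin(2 * np.pi * 220.0 * t)

augmented = add_noise(tone, snr_db=15.0, rng=rng)
augmented = add_echo(augmented, sample_rate, delay_s=0.25, decay=0.4)
augmented = clip_distortion(augmented, threshold=0.9)
augmented = speed_perturb(augmented, rate=1.1)
```

In practice, each augmentation is usually applied with some probability and with randomly drawn parameters, so the model sees a different corrupted version of a segment in every epoch.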

For the speaker diarization challenge (Problem 2), the researchers utilized the pyannote/segmentation-3.0 model, fine-tuning it with PyTorch Lightning on a competition-annotated diarization dataset. This involved:

Integration of Fine-Tuned Segmentation Backbone: The fine-tuned segmentation model was incorporated into the pyannote/speaker-diarization-community-1 pipeline.
Retention of Pretrained Components: The pretrained speaker embedding and clustering components were preserved to maintain the effectiveness of the diarization process.

Results and Achievements

The reported results for the two systems are:

Word Error Rate (WER): The ASR system achieved a WER of 0.2441 (24.41%), which the authors report as a significant improvement over the baseline models.
Diarization Error Rate (DER): The diarization system recorded a DER of 0.2392 (23.92%), which the authors report as a substantial improvement in speaker segmentation accuracy.
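
For readers unfamiliar with the two metrics, the sketch below shows how they are commonly defined. WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. DER is the sum of false-alarm, missed-speech, and speaker-confusion time divided by the total reference speech time. The function names and example values are illustrative assumptions, and the article does not say which scoring tool or text normalization the authors used.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def der(false_alarm: float, missed: float, confusion: float, total_speech: float) -> float:
    """Diarization error rate from its three error durations (all in seconds)."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed + confusion) / total_speech


# One substituted word out of four reference words.
print(wer("আমি আজ বাজারে যাব", "আমি কাল বাজারে যাব"))  # 0.25
```

Read this way, a WER of 0.2441 corresponds to roughly 24 word errors per 100 reference words. Both metrics can exceed 1.0, because inserted words and false-alarm speech are counted against a denominator that only covers the reference.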

Conclusion

The Bangla-WhisperDiar project addresses limitations of existing Bangla ASR and speaker diarization systems by fine-tuning established models instead of building new ones. The researchers document the complete pipeline, including data preprocessing, text normalization, audio augmentation, training strategies, inference optimization, and post-processing, which makes the work a practical reference for future research and for applied Bangla speech technology.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
