Bangla-WhisperDiar: Fine-Tuning Whisper and PyAnnote for Bangla Long-Form Speech Recognition and Speaker Diarization
Bangla-WhisperDiar is a framework for automatic speech recognition (ASR) and speaker diarization in Bangla. It targets three problems that make Bangla audio hard to process: long-form recordings, diverse acoustic conditions, and high speaker variability. Instead of proposing a new architecture, the work fine-tunes existing pretrained models (Whisper for ASR and pyannote for diarization) and reports the resulting performance on both tasks.
The Challenges of Bangla ASR and Diarization
Bangla ASR and diarization remain difficult for several reasons:
- Long-form Recordings: Traditional ASR systems often struggle with extended audio segments, which can contain multiple speakers and varying contexts.
- Diverse Acoustic Conditions: Variability in recording environments can significantly impact the accuracy of speech recognition systems.
- Speaker Variability: The differences in accents, speech patterns, and voice characteristics among speakers further complicate the recognition process.
Methodology
The researchers approached the ASR challenge (Problem 1) by fine-tuning the tugstugi bengaliai regional ASR Whisper medium model. This was accomplished using a custom-curated dataset comprising approximately 15,000 chunked and aligned Bangla audio segments. The methodology included:
- Full-Weight Training: All of the model's weights were updated during fine-tuning to adapt it to Bangla speech.
- Data Augmentation: Techniques such as noise injection, reverb simulation, echo, clipping distortion, and pitch/time perturbation were employed to improve robustness.
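Three of the augmentations listed above (noise injection, echo, and clipping distortion) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the SNR, delay, decay, and threshold values, and the use of plain Python lists in place of audio arrays are all assumptions made for illustration. A real pipeline would more likely operate on NumPy arrays or tensors.

```python
import math
import random


def add_noise(samples, snr_db, rng):
    """Add white Gaussian noise at roughly the requested signal-to-noise ratio (dB)."""
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


def add_echo(samples, delay, decay):
    """Mix in one delayed, attenuated copy of the signal (delay is in samples)."""
    out = list(samples)
    for i in range(delay, len(samples)):
        out[i] += decay * samples[i - delay]
    return out


def clip_distortion(samples, threshold):
    """Hard-clip every sample to the range [-threshold, threshold]."""
    return [max(-threshold, min(threshold, s)) for s in samples]


# Hypothetical input: one second of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]

rng = random.Random(0)  # fixed seed so the example is reproducible
noisy = add_noise(tone, snr_db=20, rng=rng)
echoed = add_echo(tone, delay=sample_rate // 10, decay=0.4)
clipped = clip_distortion(tone, threshold=0.3)
```

During training, transforms like these are typically applied at random to each segment, so the model sees a differently degraded version of the same audio across epochs.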
For the speaker diarization challenge (Problem 2), the researchers utilized the pyannote/segmentation-3.0 model, fine-tuning it with PyTorch Lightning on a competition-annotated diarization dataset. This involved:
- Integration of Fine-Tuned Segmentation Backbone: The fine-tuned segmentation model was incorporated into the pyannote/speaker-diarization-community-1 pipeline.
- Retention of Pretrained Components: The pretrained speaker embedding and clustering components were preserved to maintain the effectiveness of the diarization process.
Results and Achievements
The reported results for the two systems are:
- Word Error Rate (WER): The ASR system achieved a WER of 0.2441 (24.41%), which is reported as an improvement over the baseline models.
- Diarization Error Rate (DER): The diarization system recorded a DER of 0.2392 (23.92%), reported as an improvement in speaker segmentation accuracy.
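For readers unfamiliar with the two metrics, the sketch below shows how they are conventionally defined. WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words; DER is the sum of missed speech, false-alarm speech, and speaker-confusion time divided by total reference speech time. The function names and the example inputs are illustrative assumptions, not the evaluation code used in the work, and a full DER evaluation also has to align hypothesis speakers to reference speakers before measuring confusion.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER from error durations (seconds) over total reference speech (seconds)."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Hypothetical example: one substitution and one deletion against 4 reference words.
wer = word_error_rate("a b c d", "a x c")

# Hypothetical example: 6 seconds of errors over 25 seconds of reference speech.
der = diarization_error_rate(missed=2.0, false_alarm=1.0, confusion=3.0, total_speech=25.0)
```

Both metrics are error rates, so lower is better, and both can exceed 1.0 when the hypothesis contains many insertions or false alarms.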
Conclusion
Bangla-WhisperDiar addresses limitations of existing Bangla ASR and speaker diarization systems by fine-tuning pretrained models for both tasks. The work documents the complete pipeline, including data preprocessing, text normalization, audio augmentation, training strategies, inference optimization, and post-processing, which makes it a practical reference for future work on Bangla speech technology.
