Bangla-WhisperDiar: Enhanced ASR & Speaker Diarization

Bangla-WhisperDiar: Fine-Tuning Whisper and PyAnnote for Bangla Long-Form Speech Recognition and Speaker Diarization

Researchers have introduced Bangla-WhisperDiar, a framework for automatic speech recognition (ASR) and speaker diarization in Bangla. It targets three difficulties: long-form recordings, diverse acoustic conditions, and high speaker variability. The work fine-tunes existing models (Whisper for ASR and a pyannote segmentation model for diarization) and reports improved performance on both tasks.

The Challenges of Bangla ASR and Diarization

ASR and speaker diarization in Bangla face persistent difficulties from three factors:

  • Long-form Recordings: Traditional ASR systems often struggle with extended audio segments, which can contain multiple speakers and varying contexts.
  • Diverse Acoustic Conditions: Variability in recording environments can significantly impact the accuracy of speech recognition systems.
  • Speaker Variability: The differences in accents, speech patterns, and voice characteristics among speakers further complicate the recognition process.

Methodology

The researchers approached the ASR challenge (Problem 1) by fine-tuning the tugstugi bengaliai regional ASR Whisper medium model. This was accomplished using a custom-curated dataset comprising approximately 15,000 chunked and aligned Bangla audio segments. The methodology included:

  • Full Weight Training: All model weights were updated during fine-tuning, rather than only a subset of layers, to adapt the model more thoroughly to Bangla speech.
  • Data Augmentation: Techniques such as noise injection, reverb simulation, echo, clipping distortion, and pitch/time perturbation were employed to improve robustness.
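To make the augmentation step more concrete, here is a minimal sketch of three of the listed techniques (noise injection, clipping distortion, and echo) on a raw waveform. It is an illustration only, not the researchers' code: the function names, the SNR, threshold, delay, and decay values are all assumptions, and a real pipeline would more likely operate on arrays with an audio library than on plain Python lists.

```python
import math
import random


def add_noise(samples, snr_db, rng):
    """Add white Gaussian noise at roughly the requested signal-to-noise ratio (dB)."""
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


def clip(samples, threshold):
    """Hard-clip samples to [-threshold, threshold], mimicking an overdriven input."""
    return [max(-threshold, min(threshold, s)) for s in samples]


def add_echo(samples, delay, decay):
    """Mix in one delayed, attenuated copy of the signal (a single-tap echo)."""
    out = list(samples)
    for i in range(delay, len(samples)):
        out[i] += decay * samples[i - delay]
    return out


# Hypothetical usage on one second of a 440 Hz tone sampled at 16 kHz.
SAMPLE_RATE = 16000
rng = random.Random(0)  # fixed seed so the example is reproducible
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]

noisy = add_noise(tone, snr_db=10, rng=rng)                  # assumed SNR of 10 dB
clipped = clip(tone, threshold=0.3)                          # assumed clipping level
echoed = add_echo(tone, delay=SAMPLE_RATE // 10, decay=0.4)  # assumed 100 ms echo
```

Each function returns a new list of the same length, so the transcript and alignment for a segment stay valid after augmentation. Pitch and time perturbation change the signal's duration or spectrum and usually need a resampling library, so they are left out of this sketch.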

For the speaker diarization challenge (Problem 2), the researchers fine-tuned the pyannote/segmentation-3.0 model with PyTorch Lightning on a competition-annotated diarization dataset. This involved:

  • Integration of Fine-Tuned Segmentation Backbone: The fine-tuned segmentation model was incorporated into the pyannote/speaker-diarization-community-1 pipeline.
  • Retention of Pretrained Components: The pretrained speaker embedding and clustering components were preserved to maintain the effectiveness of the diarization process.

Results and Achievements

The reported results for the two systems are:

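  • Word Error Rate (WER): The ASR system achieved a WER of 0.2441 (roughly 24 word errors per 100 reference words), reported as an improvement over the baseline models.
  • Diarization Error Rate (DER): The diarization system recorded a DER of 0.2392, meaning that missed speech, false alarms, and speaker confusion together amount to about 24% of the reference speech duration. This is likewise reported as an improvement in speaker segmentation accuracy.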
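For readers unfamiliar with the two metrics, the sketch below shows how they are conventionally defined. WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, and DER is the sum of missed speech, false-alarm speech, and speaker-confusion time divided by the total reference speech time. This is not the competition's scoring script: the function names are hypothetical, and the text normalization applied before scoring is not specified here, so plain whitespace tokenization is assumed.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance over the number of reference words."""
    ref = reference.split()  # assumption: whitespace tokenization, no normalization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def der(missed, false_alarm, confusion, total_speech):
    """Diarization error rate from error durations (all in the same time unit)."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# One substitution and one deletion against a four-word reference.
example_wer = wer("a b c d", "a x c")
# 10 s missed + 5 s false alarm + 5 s confusion over 100 s of reference speech.
example_der = der(missed=10, false_alarm=5, confusion=5, total_speech=100)
```

Because both metrics divide by a reference quantity, either can exceed 1.0 when the system output contains many insertions or false alarms.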

Conclusion

The Bangla-WhisperDiar project addresses limitations of existing Bangla ASR and speaker diarization systems by fine-tuning established models rather than building new ones. By detailing the complete pipeline, including data preprocessing, text normalization, audio augmentation, training strategies, inference optimization, and post-processing, the researchers provide a reference that others can build on. The work contributes to research on Bangla speech and could also improve practical Bangla speech applications.

Lazarus Omolua
https://richlyai.com/blog
My mission is to make sure that people in Africa are not left behind in the global AI revolution. RichlyAI exists to give everyone — students, founders, creators, and businesses — the tools to compete globally.