Chitrakshara: Massive Multilingual Dataset for Indian Languages

Chitrakshara: A Large Multilingual Multimodal Dataset for Indian Languages

Multimodal research has gained significant traction in recent years, but most of it has focused on single-image reasoning, and multi-image scenarios remain comparatively unexplored. Recent Vision-Language Models (VLMs) have begun to address multi-image understanding, often through large-scale pretraining on interleaved image-text datasets. Most of these VLMs, however, are trained predominantly on English data, which leaves many languages underrepresented, particularly those of India.

To bridge this gap, we introduce the Chitrakshara dataset series, which covers a wide array of Indian languages. The series has two components: Chitrakshara-IL and Chitrakshara-Cap.

Overview of the Chitrakshara Dataset Series

  • Chitrakshara-IL: A large-scale interleaved pretraining dataset consisting of:
    • 193 million images
    • 30 billion text tokens
    • 50 million multilingual documents
  • Chitrakshara-Cap: An image-caption dataset consisting of:
    • 44 million image-text pairs
    • 733 million tokens
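
To put these totals in perspective, the short sketch below derives rough per-document and per-caption averages from the figures listed above. The inputs are the rounded headline numbers, so the results are approximations, and the helper name per_unit is purely illustrative.

```python
# Headline figures quoted in the overview above (rounded, so results are approximate).
IL_IMAGES = 193_000_000
IL_TOKENS = 30_000_000_000
IL_DOCS = 50_000_000
CAP_PAIRS = 44_000_000
CAP_TOKENS = 733_000_000


def per_unit(total: int, units: int) -> float:
    """Average amount of `total` per unit, rounded to one decimal place."""
    return round(total / units, 1)


# Chitrakshara-IL: how dense is a typical interleaved document?
images_per_doc = per_unit(IL_IMAGES, IL_DOCS)   # 193 / 50 = 3.86
tokens_per_doc = per_unit(IL_TOKENS, IL_DOCS)   # 30,000 / 50 = 600

# Chitrakshara-Cap: how long is a typical caption?
tokens_per_caption = per_unit(CAP_TOKENS, CAP_PAIRS)  # 733 / 44 = 16.66

print(f"Images per document: {images_per_doc}")
print(f"Tokens per document: {tokens_per_doc}")
print(f"Tokens per caption:  {tokens_per_caption}")
```

On these numbers, an average Chitrakshara-IL document interleaves roughly four images with about 600 tokens of text, while an average Chitrakshara-Cap caption runs to about 17 tokens.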

Data Collection Methodologies

The Chitrakshara dataset series was built with a three-stage data collection pipeline:

  • Curation: Sourcing data from Common Crawl to ensure a diverse and extensive dataset.
  • Filtering: Implementing strict criteria to maintain high-quality data that is representative of various Indian languages.
  • Processing: Utilizing advanced techniques to prepare the dataset for effective model training.
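
The article does not spell out the filtering criteria, so the following is only a minimal sketch of one common approach to language filtering: keeping a document when most of its letters fall inside the Unicode blocks of Indic scripts. The function names and the 0.5 threshold are assumptions made for illustration, not the pipeline actually used to build Chitrakshara.

```python
from collections import Counter
from typing import Optional

# Unicode block ranges for nine major Indic scripts (per the Unicode Standard).
INDIC_BLOCKS = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Gurmukhi": (0x0A00, 0x0A7F),
    "Gujarati": (0x0A80, 0x0AFF),
    "Oriya": (0x0B00, 0x0B7F),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
    "Kannada": (0x0C80, 0x0CFF),
    "Malayalam": (0x0D00, 0x0D7F),
}


def script_of(ch: str) -> Optional[str]:
    """Return the Indic script name for a character, or None if it is not Indic."""
    code = ord(ch)
    for name, (lo, hi) in INDIC_BLOCKS.items():
        if lo <= code <= hi:
            return name
    return None


def script_counts(text: str) -> Counter:
    """Count letters per script; non-Indic letters are grouped under 'Other'."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():  # skips digits, spaces, punctuation, and combining marks
            counts[script_of(ch) or "Other"] += 1
    return counts


def indic_ratio(text: str) -> float:
    """Fraction of letters in the text that belong to an Indic script."""
    counts = script_counts(text)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (total - counts["Other"]) / total


def keep_document(text: str, min_ratio: float = 0.5) -> bool:
    """Hypothetical filter rule: keep documents that are mostly Indic-script text."""
    return indic_ratio(text) >= min_ratio


if __name__ == "__main__":
    for sample in ["नमस्ते दुनिया", "தமிழ் மொழி", "plain English text"]:
        print(sample, "->", keep_document(sample))
```

A script-based check like this is cheap enough to run over web-scale text, but it cannot tell apart languages that share a script (Hindi and Marathi both use Devanagari, for example), so a real pipeline would typically pair it with a trained language identifier.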

Quality and Diversity Analysis

A significant aspect of the Chitrakshara dataset series is a comprehensive quality and diversity analysis that assesses its representativeness across Indic languages. This evaluation is crucial for judging the dataset’s potential for developing more culturally inclusive VLMs. By ensuring that a wide range of Indian languages is represented, we aim to foster AI models that are not only robust but also sensitive to the cultural nuances of these languages.

Conclusion

The Chitrakshara dataset series represents a significant step forward in addressing the underrepresentation of Indian languages in multimodal AI research. By providing a rich, multilingual, and multimodal dataset, we hope to empower researchers and developers to create more inclusive and effective Vision-Language Models that cater to the diverse linguistic landscape of India.


Lazarus Omolua