Chitrakshara: A Large Multilingual Multimodal Dataset for Indian Languages
Multimodal research has gained significant traction in recent years, but it has focused primarily on single-image reasoning, leaving multi-image scenarios comparatively unexplored. Recent Vision-Language Models (VLMs) have pushed towards better multi-image understanding, often through large-scale pretraining on interleaved image-text datasets. However, most of these VLMs have been trained predominantly on English data, so many languages, particularly those of India, remain underrepresented.
To bridge this gap, we introduce the Chitrakshara dataset series, which covers a wide array of Indian languages. The series has two main components: Chitrakshara-IL and Chitrakshara-Cap.
Overview of the Chitrakshara Dataset Series
- Chitrakshara-IL: A large-scale interleaved pretraining dataset consisting of:
  - 193 million images
  - 30 billion text tokens
  - 50 million multilingual documents
- Chitrakshara-Cap: An image-text pair component featuring:
  - 44 million image-text pairs
  - 733 million tokens
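To put these headline figures in perspective, the short sketch below derives per-document and per-caption averages from them. The counts are the rounded numbers quoted above, so the results are approximate; the variable and function names are illustrative only.

```python
# Headline figures as quoted in the overview (rounded counts).
IL_IMAGES = 193_000_000
IL_TOKENS = 30_000_000_000
IL_DOCUMENTS = 50_000_000
CAP_PAIRS = 44_000_000
CAP_TOKENS = 733_000_000


def per_unit(total: int, units: int) -> float:
    """Return the average amount of `total` per unit."""
    return total / units


# Chitrakshara-IL: averages per interleaved document.
images_per_doc = per_unit(IL_IMAGES, IL_DOCUMENTS)
tokens_per_doc = per_unit(IL_TOKENS, IL_DOCUMENTS)

# Chitrakshara-Cap: average caption length in tokens.
tokens_per_caption = per_unit(CAP_TOKENS, CAP_PAIRS)

print(f"Images per IL document:  {images_per_doc:.2f}")
print(f"Tokens per IL document:  {tokens_per_doc:.0f}")
print(f"Tokens per Cap caption:  {tokens_per_caption:.1f}")
```

On these rounded figures, an interleaved document carries roughly four images and about 600 text tokens on average, while a caption in Chitrakshara-Cap averages under 20 tokens.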
Data Collection Methodologies
The Chitrakshara dataset series was built with a multi-stage data collection pipeline consisting of the following steps:
- Curation: Sourcing data from Common Crawl to ensure a diverse and extensive dataset.
- Filtering: Implementing strict criteria to maintain high-quality data that is representative of various Indian languages.
- Processing: Preparing the filtered data so that it can be used effectively for model training.
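The three stages above can be pictured with a minimal sketch. This is not the authors' pipeline: the record layout, the `Document` class, the thresholds, and the script-based filter are all assumptions made for illustration. The filter uses the fraction of letters that fall in the Unicode range U+0900 to U+0D7F, which spans the Devanagari through Malayalam blocks; it would miss Indian languages written in other scripts, such as Urdu.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List

# Unicode range spanning the Devanagari through Malayalam blocks.
INDIC_START, INDIC_END = 0x0900, 0x0D7F


@dataclass
class Document:
    """Hypothetical container for one crawled web page."""
    url: str
    text: str
    image_urls: List[str]


def curate(raw_records: Iterable[Dict]) -> Iterator[Document]:
    """Curation: turn raw crawl records (assumed to be dicts) into documents."""
    for rec in raw_records:
        yield Document(rec["url"], rec.get("text", ""), list(rec.get("image_urls", [])))


def indic_ratio(text: str) -> float:
    """Fraction of alphabetic characters that lie in the Indic range above."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    indic = sum(1 for ch in letters if INDIC_START <= ord(ch) <= INDIC_END)
    return indic / len(letters)


def keep(doc: Document, min_ratio: float = 0.5, min_chars: int = 10) -> bool:
    """Filtering: illustrative thresholds, not values from the source."""
    return (
        len(doc.text) >= min_chars
        and bool(doc.image_urls)  # assumption: a document needs at least one image
        and indic_ratio(doc.text) >= min_ratio
    )


def process(doc: Document) -> Document:
    """Processing: collapse whitespace and drop duplicate image URLs, keeping order."""
    text = " ".join(doc.text.split())
    images = list(dict.fromkeys(doc.image_urls))
    return Document(doc.url, text, images)


def run_pipeline(raw_records: Iterable[Dict]) -> List[Document]:
    """Run curation, filtering, and processing in sequence."""
    return [process(doc) for doc in curate(raw_records) if keep(doc)]
```

Calling `run_pipeline` on a list of record dicts returns only the documents that pass the filter, each with normalized text and de-duplicated image URLs. A production pipeline would likely rely on a trained language identifier and quality classifiers rather than a simple script heuristic.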
Quality and Diversity Analysis
The Chitrakshara dataset series is accompanied by a comprehensive quality and diversity analysis that assesses its representativeness across Indic languages. This evaluation is crucial for determining the dataset’s potential for developing more culturally inclusive VLMs. By ensuring that a wide range of Indian languages is represented, we aim to foster AI models that are not only robust but also sensitive to the cultural nuances of these languages.
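One simple way to quantify representativeness across languages is to compute each language's share of documents together with a normalized entropy score, where 1.0 means a perfectly even spread and 0.0 means a single language dominates completely. The sketch below is an illustrative assumption, not the analysis used for Chitrakshara; the function names are hypothetical and the input is any list of per-document language labels.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def language_shares(labels: Iterable[str]) -> Dict[str, float]:
    """Return each language's share of the labeled documents."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


def normalized_entropy(shares: Dict[str, float]) -> float:
    """Shannon entropy of the shares, scaled to the range [0, 1]."""
    if len(shares) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in shares.values() if p > 0)
    return entropy / math.log(len(shares))
```

For example, passing the labels of a sample of documents to `language_shares` and then to `normalized_entropy` gives a single number that can be tracked as filtering thresholds change, making it easier to notice when a filter disproportionately removes low-resource languages.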
Conclusion
The Chitrakshara dataset series represents a significant step forward in addressing the underrepresentation of Indian languages in multimodal AI research. By providing a rich, multilingual, and multimodal dataset, we hope to empower researchers and developers to create more inclusive and effective Vision-Language Models that cater to the diverse linguistic landscape of India.
