DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset
Vision-Language Pre-training (VLP) models have made significant strides by learning from vast collections of image-text pairs. English-centric work around models such as CLIP and SigLIP can draw on extensive open datasets like LAION-400M, but Chinese VLP has lagged behind because high-quality, large-scale open-source data is scarce. To address this gap, researchers have introduced DanQing, a large-scale dataset designed for Chinese cross-modal applications.
DanQing contains 100 million high-quality image-text pairs curated from Common Crawl. Because the data is drawn from 2024 to 2025, the dataset is intended both to improve the quality of data available for VLP models and to help those models keep pace with contemporary semantic trends and emerging concepts.
Key Features of DanQing
DanQing was built with a systematic pipeline whose steps each target data quality:
- Data Source Selection: Careful curation of sources to maximize data relevance and reliability.
- Text Refinement: Improving the quality of text data to enhance contextual understanding.
- Visual Diversification: Ensuring a broad range of visual content to support diverse learning scenarios.
- Cross-modal Cross-batch Filtering: Mitigating intrinsic noise typically found in web-sourced data.
Together, these steps are designed to reduce the noise inherent in web-crawled data and to yield a dataset notable for its quality and relevance.
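The article does not say how the filtering step is implemented, so the following is only a minimal sketch of one common approach to cleaning web-sourced pairs: keep a pair only if the cosine similarity between its image embedding and its text embedding clears a threshold. It is a generic similarity filter, not DanQing's actual method. The function names, the toy two-dimensional embeddings, and the threshold value are all hypothetical; in practice the embeddings would come from a pretrained vision-language encoder.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pairs(
    image_embs: Sequence[Sequence[float]],
    text_embs: Sequence[Sequence[float]],
    threshold: float = 0.3,  # hypothetical value, not taken from the source
) -> Tuple[List[int], List[float]]:
    """Return indices of pairs whose image-text similarity is >= threshold,
    together with the similarity score of every pair."""
    if len(image_embs) != len(text_embs):
        raise ValueError("image_embs and text_embs must have the same length")
    sims = [cosine_similarity(i, t) for i, t in zip(image_embs, text_embs)]
    kept = [idx for idx, s in enumerate(sims) if s >= threshold]
    return kept, sims


# Toy embeddings (assumed): pair 1 is a mismatch, since its vectors are orthogonal.
image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embs = [[2.0, 0.0], [1.0, 0.0], [1.0, 0.9]]

kept, sims = filter_pairs(image_embs, text_embs, threshold=0.5)
print(kept)  # indices of the pairs that survive the filter
```

A fixed threshold is the simplest choice; a real pipeline would likely tune it against a manually inspected sample, since too high a value also discards valid but loosely captioned images.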
Performance and Impact
Experiments that continue pretraining SigLIP2 models show that models trained on DanQing consistently outperform those trained on existing Chinese datasets across a variety of downstream tasks, including:
- Zero-shot classification
- Cross-modal retrieval
- Chinese-centric large multimodal model tasks
Furthermore, an analysis of DanQing indicates that it has a more balanced semantic distribution and better scaling behavior than earlier Chinese datasets. This positions DanQing as a valuable resource for researchers and developers working on Chinese vision-language pre-training.
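For readers unfamiliar with the first two evaluation tasks listed above, here is a rough sketch of how they are commonly scored once a model has produced image and text embeddings. It illustrates the general technique only, not the evaluation code used for DanQing. The function names and toy numbers are assumptions. Zero-shot classification picks the class whose text embedding scores highest against the image, and retrieval is summarized by Recall@K, assuming query i's correct match is candidate i.

```python
from typing import Sequence


def zero_shot_predict(
    image_emb: Sequence[float],
    class_text_embs: Sequence[Sequence[float]],
) -> int:
    """Return the index of the class prompt whose embedding has the highest
    dot product with the image embedding."""
    scores = [
        sum(x * y for x, y in zip(image_emb, text_emb))
        for text_emb in class_text_embs
    ]
    return max(range(len(scores)), key=lambda j: scores[j])


def recall_at_k(sim_matrix: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of queries whose ground-truth candidate (same index as the
    query) appears among the k highest-scoring candidates."""
    hits = 0
    for i, row in enumerate(sim_matrix):
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if i in top_k:
            hits += 1
    return hits / len(sim_matrix)


# Toy example (assumed values): one image scored against two class prompts.
predicted_class = zero_shot_predict([0.6, 0.8], [[1.0, 0.0], [0.0, 1.0]])

# Toy query-by-candidate similarity matrix; query 1 ranks its match second.
sim_matrix = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.1, 0.7],
]
r1 = recall_at_k(sim_matrix, k=1)
r2 = recall_at_k(sim_matrix, k=2)
print(predicted_class, r1, r2)
```

Reporting Recall@K at several values of K shows whether a model places the correct match near the top even when it misses rank one.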
Open Access and Future Research
To advance research in the field, the creators of DanQing have announced plans to open-source the dataset under the Creative Commons CC-BY-NC 4.0 license. The release is intended to encourage further exploration and development in Chinese VLP and to foster collaboration among researchers worldwide.
As demand for advanced AI models continues to grow, datasets like DanQing are crucial for developing more effective VLP models and also serve as a foundation for future research on the interplay between vision and language in the Chinese context.
