DanQing: Large-Scale Chinese Vision-Language Dataset 2024

DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset

Vision-Language Pre-training (VLP) models have advanced rapidly by learning from vast collections of image-text pairs. English-centric models such as CLIP and SigLIP have benefited from extensive datasets like LAION-400M, but progress in Chinese VLP has been held back by a scarcity of high-quality, large-scale open-source data. To address this gap, researchers have introduced DanQing, a large-scale dataset designed specifically for Chinese cross-modal applications.

DanQing contains 100 million high-quality image-text pairs curated from Common Crawl. The dataset draws on web data from 2024 to 2025, so that models trained on it can keep pace with contemporary semantic trends and emerging concepts as well as benefit from cleaner data.

Key Features of DanQing

DanQing was built with a systematic pipeline made up of four steps, each aimed at data quality:

  • Data Source Selection: Careful curation of sources to maximize data relevance and reliability.
  • Text Refinement: Improving the quality of text data to enhance contextual understanding.
  • Visual Diversification: Ensuring a broad range of visual content to support diverse learning scenarios.
  • Cross-modal Cross-batch Filtering: Mitigating intrinsic noise typically found in web-sourced data.

Together, these steps are intended to reduce the noise inherent in web-crawled image-text pairs, yielding a cleaner and more relevant dataset.
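The article does not describe how the cross-modal filtering step works internally. As a rough illustration of the general idea only, the sketch below keeps an image-text pair when the cosine similarity of its two embeddings clears a threshold. The function names, the toy embeddings, and the 0.3 threshold are all hypothetical and are not taken from DanQing's pipeline.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_u * norm_v)

def filter_pairs(pairs, threshold=0.3):
    """Return the ids of pairs whose image/text similarity is >= threshold.

    pairs: list of (pair_id, image_embedding, text_embedding).
    The threshold is an assumed value chosen for illustration.
    """
    return [pid for pid, img, txt in pairs if cosine(img, txt) >= threshold]

# Toy embeddings standing in for the output of a real image/text encoder.
toy_pairs = [
    ("aligned", [1.0, 0.0, 0.0], [0.9, 0.1, 0.0]),  # caption matches image
    ("noisy", [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),    # caption unrelated
]
kept = filter_pairs(toy_pairs)
print(kept)
```

In practice the embeddings would come from a pretrained vision-language encoder, and the threshold would be tuned on a held-out sample rather than fixed in advance.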

Performance and Impact

In experiments that continue pretraining SigLIP2 models, DanQing consistently outperforms existing Chinese datasets across a range of downstream tasks, including:

  • Zero-shot classification
  • Cross-modal retrieval
  • Chinese-centric large multimodal model tasks

Further analysis shows that DanQing has a more balanced semantic distribution and scales better than earlier Chinese datasets. This makes it a valuable resource for researchers and developers working on Chinese vision-language pre-training.
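Two of the tasks listed above, zero-shot classification and cross-modal retrieval, both come down to ranking candidates by embedding similarity. The sketch below shows that shared mechanic on toy vectors. It is a generic illustration, not the evaluation code used for DanQing, and the function names, labels, and embeddings are hypothetical.

```python
def dot(u, v):
    """Dot product; equals cosine similarity when inputs are unit-normalized."""
    return sum(a * b for a, b in zip(u, v))

def zero_shot_classify(image_emb, label_embs):
    """Pick the label whose text embedding scores highest against the image."""
    return max(label_embs, key=lambda label: dot(image_emb, label_embs[label]))

def recall_at_k(image_embs, text_embs, k):
    """Image-to-text Recall@K, assuming text i is the true match for image i."""
    if not image_embs:
        raise ValueError("image_embs must not be empty")
    hits = 0
    for i, img in enumerate(image_embs):
        scores = [dot(img, txt) for txt in text_embs]
        # Indices of candidate texts, best score first.
        ranked = sorted(range(len(text_embs)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(image_embs)

# Zero-shot classification: label prompts are embedded as text ("cat", "dog").
labels = {"猫": [1.0, 0.0], "狗": [0.0, 1.0]}
predicted = zero_shot_classify([0.8, 0.6], labels)

# Retrieval: three images and three captions with deliberately imperfect alignment.
images = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
texts = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
r1 = recall_at_k(images, texts, k=1)
r2 = recall_at_k(images, texts, k=2)
print(predicted, r1, r2)
```

With a real model, the vectors would be the normalized outputs of the image and text encoders, and Recall@K would be reported for both image-to-text and text-to-image directions.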

Open Access and Future Research

The creators of DanQing have announced plans to open-source the dataset under the Creative Commons CC BY-NC 4.0 license. The release is intended to encourage further work on Chinese VLP and to foster collaboration among researchers worldwide.

As demand for capable AI models grows, datasets like DanQing support the development of more effective VLP models and provide a foundation for future research on the interplay between vision and language in the Chinese context.

Lazarus Omolua