How to Synthesize High-Quality Pretraining Data Effectively

How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data

Source: arXiv:2604.13977v1 (cross-listed)

Abstract: Synthetic data is a standard component in training large language models, yet systematic comparisons across design dimensions, including rephrasing strategy, generator model, and source data, remain absent. We conduct extensive controlled experiments, generating over one trillion tokens, to identify critical factors in rephrasing web text into synthetic pretraining data. Our results reveal that structured output formats, such as tables, math problems, FAQs, and tutorials, consistently outperform both curated web baselines and prior synthetic methods. Notably, increasing the size of the generator model beyond 1B parameters provides no additional benefit. Our analysis also demonstrates that the selection of the original data used for mixing substantially influences performance. By applying our findings, we develop FinePhrase, a 486-billion-token open dataset of rephrased web text. We show that FinePhrase outperforms all existing synthetic data baselines while reducing generation costs by up to 30 times. We provide the dataset, all prompts, and the generation framework to the research community.

The Importance of High-Quality Pretraining Data

Pretraining data largely determines what a language model can learn, so its quality matters as much as its quantity. Synthetic data is now a standard component of that training mix, yet producing it at scale while keeping it accurate and relevant remains difficult, and systematic comparisons of the design choices involved have been missing.

Key Findings from Our Research

Our systematic study focused on three primary dimensions of synthetic data generation:

  • Rephrasing Strategy: The method used to transform existing web text into synthetic data.
  • Generator Model: The architecture and size of the model used to generate new text.
  • Source Data: The original data that serves as the basis for rephrasing.
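
As a rough illustration of what a controlled comparison across these three dimensions could look like, the sketch below enumerates experimental conditions and varies one dimension at a time while holding the others fixed. The class name, function names, and every label in the lists are hypothetical placeholders, not the configurations used in the study.

```python
from dataclasses import dataclass, replace
from itertools import product


@dataclass(frozen=True)
class RephraseConfig:
    """One experimental condition: a point along the three study dimensions."""
    strategy: str   # rephrasing strategy, e.g. the target output format
    generator: str  # label for the generator model
    source: str     # label for the source web dataset


# Hypothetical labels for illustration only; none are taken from the paper.
STRATEGIES = ["table", "math", "faq", "tutorial"]
GENERATORS = ["generator-small", "generator-large"]
SOURCES = ["web-corpus-a", "web-corpus-b"]


def full_grid(strategies, generators, sources):
    """Return every combination of the three dimensions."""
    return [
        RephraseConfig(s, g, d)
        for s, g, d in product(strategies, generators, sources)
    ]


def one_factor_sweep(base, field, values):
    """Vary a single dimension while holding the other two fixed."""
    return [replace(base, **{field: v}) for v in values]


baseline = RephraseConfig("faq", "generator-small", "web-corpus-a")
sweep = one_factor_sweep(baseline, "generator", GENERATORS)
```

Sweeping one field at a time keeps any difference in downstream results attributable to that field alone, which is the point of a controlled experiment.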

Structured Output Formats Yield Better Results

One of the most significant findings from our research is the effectiveness of structured output formats. Rephrasing web text into formats that organize information, such as:

  • Tables
  • Math Problems
  • FAQs
  • Tutorials

consistently outperformed both curated web baselines and prior synthetic methods. How the rephrased data is structured therefore matters for downstream model performance.
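
To make the idea concrete, here is a minimal sketch of how a rephrasing prompt for each of the four formats might be assembled. The instruction wording, the `build_prompt` helper, and the `max_chars` truncation are all assumptions made for illustration; the prompts actually used in the study are released separately and are not reproduced here.

```python
# Hypothetical instruction text per output format; not the study's prompts.
FORMAT_INSTRUCTIONS = {
    "table": "Rewrite the key facts of the document as a table, then explain it briefly.",
    "math": "Write a math word problem grounded in the document, with a worked solution.",
    "faq": "Rewrite the document as a list of frequently asked questions with answers.",
    "tutorial": "Rewrite the document as a step-by-step tutorial.",
}


def build_prompt(document: str, fmt: str, max_chars: int = 4000) -> str:
    """Combine a format instruction with a (possibly truncated) source document."""
    if fmt not in FORMAT_INSTRUCTIONS:
        raise ValueError(f"unknown format: {fmt!r}")
    excerpt = document[:max_chars]  # crude length cap, assumed for this sketch
    return (
        f"{FORMAT_INSTRUCTIONS[fmt]}\n\n"
        f"Document:\n{excerpt}\n\n"
        "Rewritten text:"
    )


prompt = build_prompt("Water boils at 100 degrees Celsius at sea level.", "faq")
```

The resulting string would then be sent to whichever generator model is in use; that call is left out because it depends on the serving stack.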

Generator Model Size and Performance

Our analysis revealed a surprising conclusion regarding the size of the generator model. While larger generators are often assumed to produce better data, we found that increasing the generator size beyond 1 billion parameters provided no additional benefit. This suggests that, past that point, factors other than generator size matter more for the quality of the generated data.
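
A back-of-the-envelope calculation shows why this matters for cost. The sketch below uses the common approximation that a dense transformer spends about 2 FLOPs per parameter per generated token in the forward pass. That rule of thumb, the function names, and the 8B comparison size are assumptions for illustration: the estimate ignores prompt processing, attention overhead, and hardware efficiency, and it is not the cost accounting used in the study.

```python
def generation_flops(n_params: int, n_tokens: int) -> int:
    """Approximate forward-pass FLOPs to generate n_tokens (about 2 * N per token)."""
    return 2 * n_params * n_tokens


def relative_cost(n_params_a: int, n_params_b: int, n_tokens: int) -> float:
    """How many times more compute generator A needs than generator B."""
    return generation_flops(n_params_a, n_tokens) / generation_flops(n_params_b, n_tokens)


TOKENS = 486_000_000_000          # a 486-billion-token dataset
SMALL = 1_000_000_000             # 1B-parameter generator
LARGE = 8_000_000_000             # hypothetical larger generator for comparison

small_flops = generation_flops(SMALL, TOKENS)
ratio = relative_cost(LARGE, SMALL, TOKENS)
```

Under this approximation compute grows linearly with generator size, so if quality stops improving at 1B parameters, every additional parameter is pure cost.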

The Impact of Source Data Selection

The selection of the original data used for mixing is another factor that substantially influences performance. In other words, results depend not only on how text is rephrased but also on which original data is combined with it, so the source data deserves the same care as the prompt.
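
One simple way to picture mixing is a training stream that draws from original and rephrased documents at a chosen ratio. The sketch below assumes that reading of "mixing"; the `mix_documents` helper, its `synthetic_fraction` parameter, and the sampling-with-replacement scheme are illustrative choices, not the procedure described in the study.

```python
import random
from typing import List, Sequence


def mix_documents(
    original: Sequence[str],
    synthetic: Sequence[str],
    synthetic_fraction: float,
    n_samples: int,
    seed: int = 0,
) -> List[str]:
    """Sample a stream where each item is synthetic with the given probability."""
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be between 0 and 1")
    if not original or not synthetic:
        raise ValueError("both pools must be non-empty")
    rng = random.Random(seed)  # seeded so the mixture is reproducible
    mixed = []
    for _ in range(n_samples):
        pool = synthetic if rng.random() < synthetic_fraction else original
        mixed.append(rng.choice(pool))  # sampling with replacement
    return mixed


original_docs = ["orig-1", "orig-2", "orig-3"]
synthetic_docs = ["syn-1", "syn-2"]
stream = mix_documents(original_docs, synthetic_docs, synthetic_fraction=0.5, n_samples=10)
```

Swapping `original_docs` for a different corpus while keeping everything else fixed is the kind of comparison that would expose how much the choice of original data matters.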

Introducing FinePhrase

Based on our findings, we developed FinePhrase, an open dataset comprising 486 billion tokens of rephrased web text. FinePhrase outperforms all existing synthetic data baselines while reducing generation costs by up to 30 times. We provide the dataset, all prompts, and the generation framework to the research community.

Conclusion

The quality of synthetic pretraining data depends less on generator scale than on design. Our results point to three practical choices: structured output formats, a generator that need not exceed 1B parameters, and carefully selected original data for mixing. FinePhrase applies these findings in an open dataset that others can build on.


Lazarus Omolua
https://richlyai.com/blog