How to Synthesize High-Quality Pretraining Data Effectively

How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data

Source: arXiv:2604.13977v1 (cross-listed)

Abstract: Synthetic data is a standard component in training large language models, yet systematic comparisons across design dimensions, including rephrasing strategy, generator model, and source data, remain absent. We conduct extensive controlled experiments, generating over one trillion tokens, to identify critical factors in rephrasing web text into synthetic pretraining data. Our results reveal that structured output formats, such as tables, math problems, FAQs, and tutorials, consistently outperform both curated web baselines and prior synthetic methods. Notably, increasing the size of the generator model beyond 1B parameters provides no additional benefit. Our analysis also demonstrates that the selection of the original data used for mixing substantially influences performance. By applying our findings, we develop FinePhrase, a 486-billion-token open dataset of rephrased web text. We show that FinePhrase outperforms all existing synthetic data baselines while reducing generation costs by up to 30 times. We provide the dataset, all prompts, and the generation framework to the research community.

The Importance of High-Quality Pretraining Data

Synthetic data is now a standard component in training large language models, and pretraining data quality plays a pivotal role in how robust those models become. However, systematic comparisons of how to generate this data, across rephrasing strategy, generator model, and source data, have been missing.

Key Findings from Our Research

Our systematic study focused on three primary dimensions of synthetic data generation:

Rephrasing Strategy: The method used to transform existing web text into synthetic data.
Generator Model: The architecture and size of the model used to generate new text.
Source Data: The original data that serves as the basis for rephrasing.

Structured Output Formats Yield Better Results

One of the most significant findings from our research is the effectiveness of structured output formats. Rephrasing web text into formats that organize information consistently outperformed both curated web baselines and prior synthetic methods. These formats include:

Tables
Math Problems
FAQs
Tutorials

This result suggests that the format of the rephrased output, and not only its content, affects how useful the data is for pretraining.
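As a rough illustration of what format-specific rephrasing can look like in practice, the sketch below builds one prompt per target format. The template wording, the FORMAT_INSTRUCTIONS dictionary, and the build_rephrase_prompt function are all hypothetical and are not taken from the released FinePhrase prompts. They only show the general shape of the idea.

```python
# Hypothetical prompt templates. The wording is illustrative only and is
# NOT the wording of the prompts released with FinePhrase.
FORMAT_INSTRUCTIONS = {
    "table": "Rewrite the document as one or more tables with clear column headers.",
    "math": "Rewrite the document as math word problems with worked solutions.",
    "faq": "Rewrite the document as a list of questions and answers.",
    "tutorial": "Rewrite the document as a step-by-step tutorial.",
}


def build_rephrase_prompt(document: str, output_format: str) -> str:
    """Return a rephrasing prompt for one web document and one target format."""
    if output_format not in FORMAT_INSTRUCTIONS:
        raise ValueError(f"unknown format: {output_format!r}")
    instruction = FORMAT_INSTRUCTIONS[output_format]
    # Keep the source text clearly separated from the instruction so the
    # generator model can tell what to rewrite.
    return (
        f"{instruction}\n"
        "Use only information present in the document.\n\n"
        f"Document:\n{document.strip()}\n\n"
        "Rewritten text:"
    )
```

Calling build_rephrase_prompt(text, "faq") returns a string that can be sent to whichever generator model is in use. Keeping one instruction per format makes it straightforward to compare formats on the same source documents.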

Generator Model Size and Performance

Our analysis revealed a surprising conclusion regarding the size of the generator model. While larger models are often assumed to produce better data, we found that increasing the generator size beyond 1B parameters provided no additional benefit. This suggests that, past this size, the other dimensions we studied, namely the rephrasing strategy and the source data, matter more for the quality of the generated data than generator scale does.

The Impact of Source Data Selection

The selection of the original data used for mixing is another factor that substantially influences the performance of synthetic pretraining data. Results depend not only on how the text is rephrased but also on which original data is used alongside it, so this choice deserves the same care as the prompt design.
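To make the idea of mixing concrete, here is a minimal sketch that samples a training stream from two pools of documents. It assumes that mixing means drawing original and rephrased documents into one stream at a fixed ratio. The function name mix_documents, the synthetic_fraction parameter, and document-level sampling are assumptions made for illustration, since the study's actual mixing procedure and ratios are not described here.

```python
import random
from typing import Iterator, Sequence


def mix_documents(
    original: Sequence[str],
    synthetic: Sequence[str],
    synthetic_fraction: float,
    n_samples: int,
    seed: int = 0,
) -> Iterator[str]:
    """Yield n_samples documents, each drawn from the synthetic pool with
    probability synthetic_fraction and from the original pool otherwise.

    Hypothetical helper: sampling is per document (with replacement),
    not per token.
    """
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be between 0 and 1")
    if not original or not synthetic:
        raise ValueError("both document pools must be non-empty")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    for _ in range(n_samples):
        pool = synthetic if rng.random() < synthetic_fraction else original
        yield rng.choice(pool)
```

Swapping the original pool for a different source while holding synthetic_fraction fixed is one simple way to isolate the effect of the original data on downstream performance.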

Introducing FinePhrase

Based on our findings, we developed FinePhrase, an open dataset comprising 486 billion tokens of rephrased web text. FinePhrase outperforms all existing synthetic data baselines while reducing generation costs by up to 30 times. We provide the dataset, all prompts, and the generation framework to the research community to support further work in the field.

Conclusion

As the AI landscape evolves, high-quality synthetic pretraining data becomes increasingly important. Our research identifies the factors that matter most when rephrasing web text: structured output formats help consistently, generator models larger than 1B parameters add no further benefit, and the choice of original data used for mixing substantially affects results.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
