How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data
arXiv:2604.13977v1
Abstract: Synthetic data is a standard component in training large language models, yet systematic comparisons across design dimensions, including rephrasing strategy, generator model, and source data, remain absent. We conduct extensive controlled experiments, generating over one trillion tokens, to identify critical factors in rephrasing web text into synthetic pretraining data. Our results reveal that structured output formats, such as tables, math problems, FAQs, and tutorials, consistently outperform both curated web baselines and prior synthetic methods. Notably, increasing the size of the generator model beyond 1B parameters provides no additional benefit. Our analysis also demonstrates that the selection of the original data used for mixing substantially influences performance. By applying our findings, we develop FinePhrase, a 486-billion-token open dataset of rephrased web text. We show that FinePhrase outperforms all existing synthetic data baselines while reducing generation costs by up to 30 times. We provide the dataset, all prompts, and the generation framework to the research community.
The Importance of High-Quality Pretraining Data
Pretraining data plays a pivotal role in developing robust language models, and synthetic data is now a standard component of that training mix. Yet systematic comparisons across the main design dimensions of synthetic data generation have been missing, which makes it hard to know which choices actually improve quality.
Key Findings from Our Research
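Our systematic study, based on controlled experiments that generated over one trillion tokens, focused on three primary dimensions of rephrasing web text into synthetic pretraining data: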
- Rephrasing Strategy: The method used to transform existing web text into synthetic data.
- Generator Model: The architecture and size of the model used to generate new text.
- Source Data: The original data that serves as the basis for rephrasing.
Structured Output Formats Yield Better Results
One of the most significant findings from our research is the effectiveness of structured output formats, which organize information into a fixed shape, such as:
- Tables
- Math Problems
- FAQs
- Tutorials
These formats consistently outperformed both curated web baselines and prior synthetic methods. This underscores that how the rephrased data is structured matters for downstream performance.
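To make the idea of format-specific rephrasing concrete, here is a minimal sketch of how a rephrasing prompt could be assembled for each structured format. The template wording, the `FORMAT_INSTRUCTIONS` dictionary, and the `build_rephrase_prompt` function are hypothetical illustrations and are not the prompts released with the study.

```python
# Hypothetical instructions for each structured output format.
# These are illustrative only, NOT the prompts released with FinePhrase.
FORMAT_INSTRUCTIONS = {
    "table": "Rewrite the source text as one or more tables with clear column headers.",
    "math_problem": "Rewrite the source text as math problems with worked solutions.",
    "faq": "Rewrite the source text as a list of frequently asked questions with answers.",
    "tutorial": "Rewrite the source text as a step-by-step tutorial.",
}


def build_rephrase_prompt(source_text: str, output_format: str) -> str:
    """Combine a format instruction with a web document into one prompt string."""
    if output_format not in FORMAT_INSTRUCTIONS:
        raise ValueError(f"unknown output format: {output_format!r}")
    instruction = FORMAT_INSTRUCTIONS[output_format]
    # The source document is stripped of surrounding whitespace before insertion.
    return f"{instruction}\n\nSource text:\n{source_text.strip()}\n\nRewritten text:"


if __name__ == "__main__":
    print(build_rephrase_prompt("Water boils at 100 degrees Celsius at sea level.", "faq"))
```

The resulting string would be sent to whichever generator model is being evaluated; keeping the instruction separate from the source text makes it easy to compare formats on the same documents.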
Generator Model Size and Performance
Our analysis produced a surprising result about generator model size. Larger models are often assumed to produce better data, but we found that increasing the generator beyond 1 billion parameters provided no additional benefit. This suggests that, above that size, other design choices matter more than generator scale for the quality of the generated data.
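A back-of-the-envelope sketch shows why this finding matters for cost. It assumes the common rule of thumb that a forward pass costs roughly 2 FLOPs per parameter per generated token, and it ignores prompt processing and attention overhead. The 8B comparison size is a hypothetical example, not a configuration reported in the study, and this estimate is not how the study's own cost figures were derived.

```python
def forward_flops(num_params: float, num_tokens: float) -> float:
    """Rough generation compute: about 2 FLOPs per parameter per token (rule of thumb)."""
    return 2.0 * num_params * num_tokens


def relative_cost(large_params: float, small_params: float, num_tokens: float) -> float:
    """How many times more compute the larger generator needs for the same token count."""
    return forward_flops(large_params, num_tokens) / forward_flops(small_params, num_tokens)


if __name__ == "__main__":
    tokens = 486e9  # size of the FinePhrase dataset in tokens
    # Hypothetical comparison: an 8B generator versus a 1B generator.
    print(relative_cost(8e9, 1e9, tokens))
```

Under this approximation the compute ratio simply tracks the parameter ratio, so a smaller generator that matches quality translates directly into cheaper generation.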
The Impact of Source Data Selection
The selection of the original data used for mixing is another factor that substantially influences the performance of synthetic pretraining data. Choosing that data carefully can therefore improve the language models trained on the resulting mix.
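Mixing can be pictured as interleaving two document streams at a chosen ratio. The sketch below is one simple way to do that; the `mix_streams` function, its `synthetic_fraction` parameter, and the example value of 0.5 are assumptions for illustration, since the mixing procedure and ratios used in the study are not described here.

```python
import random
from typing import Iterable, Iterator


def mix_streams(
    original: Iterable[str],
    synthetic: Iterable[str],
    synthetic_fraction: float,
    seed: int = 0,
) -> Iterator[str]:
    """Yield documents, drawing from `synthetic` with probability `synthetic_fraction`.

    Iteration stops as soon as the stream selected for the next draw is exhausted.
    """
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be between 0 and 1")
    rng = random.Random(seed)  # seeded for reproducible mixes
    original_it, synthetic_it = iter(original), iter(synthetic)
    while True:
        chosen = synthetic_it if rng.random() < synthetic_fraction else original_it
        try:
            yield next(chosen)
        except StopIteration:
            return


if __name__ == "__main__":
    web_docs = ["web doc 1", "web doc 2", "web doc 3"]
    rephrased_docs = ["rephrased doc 1", "rephrased doc 2", "rephrased doc 3"]
    for doc in mix_streams(web_docs, rephrased_docs, synthetic_fraction=0.5):
        print(doc)
```

Swapping the `original` stream for a different corpus while holding the synthetic stream fixed is the kind of comparison that isolates the effect of the original data.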
Introducing FinePhrase
Based on our findings, we developed FinePhrase, an open dataset comprising 486 billion tokens of rephrased web text. FinePhrase outperforms all existing synthetic data baselines while reducing generation costs by up to 30 times. We provide the dataset, all prompts, and the generation framework to the research community to facilitate further work in the field.
Conclusion
The quality of synthetic pretraining data depends on concrete design choices. Our research identifies three that matter: structured output formats, the finding that generators larger than 1B parameters add no benefit, and the selection of the original data used for mixing. FinePhrase shows how these findings combine in practice.
