Synthetic Mixed Training: Scaling Parametric Knowledge Acquisition Beyond RAG
Summary: arXiv:2603.23562v1
Type: Cross
Abstract: Synthetic data augmentation helps language models learn new knowledge in data-constrained domains. However, naively scaling existing synthetic data methods by training on more synthetic tokens or using stronger generators yields diminishing returns below the performance of RAG. To break the RAG ceiling, we introduce Synthetic Mixed Training, which combines synthetic QAs and synthetic documents. This leverages their complementary training signals, and enables log-linear improvements as both synthetic data volume and generator strength increase. This allows the model to outperform RAG by a 2.6% relative gain on QuaLITY, a long-document reading comprehension benchmark.
Introduction
Synthetic data augmentation helps language models learn new knowledge in data-constrained domains, but existing methods are hard to scale. This article covers Synthetic Mixed Training, an approach designed to get past the limits of those methods.
Challenges with Current Synthetic Data Methods
Existing synthetic data methods are typically scaled in one of two ways:
- Training on more synthetic tokens.
- Using stronger generators to produce the synthetic data.
However, both strategies show diminishing returns, and the resulting models plateau below the performance of RAG (Retrieval-Augmented Generation). The paper calls this plateau the RAG ceiling.
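One rough way to tell diminishing returns apart from log-linear scaling is to look at the accuracy gained per tenfold increase in synthetic tokens. The sketch below is illustrative only: the function names are hypothetical and the data points are made up, not results from the paper.

```python
import math

def fit_log_linear(tokens, accuracy):
    """Least-squares fit of accuracy = a + b * log10(tokens); returns (a, b)."""
    xs = [math.log10(t) for t in tokens]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracy) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracy))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def gain_per_decade(tokens, accuracy):
    """Accuracy gained per 10x increase in tokens, for each consecutive pair."""
    pairs = list(zip(tokens, accuracy))
    return [
        (y1 - y0) / (math.log10(t1) - math.log10(t0))
        for (t0, y0), (t1, y1) in zip(pairs, pairs[1:])
    ]

# Made-up numbers for illustration only (NOT results from the paper).
tokens = [1e6, 1e7, 1e8, 1e9]
log_linear_run = [40.0, 45.0, 50.0, 55.0]  # steady gain per decade
plateau_run = [40.0, 46.0, 49.0, 50.0]     # gain shrinks each decade

intercept, slope = fit_log_linear(tokens, log_linear_run)
print(f"log-linear run: slope per decade = {slope:.2f}")
print("plateauing run gains per decade:", gain_per_decade(tokens, plateau_run))
```

In a log-linear run the per-decade gain stays roughly constant, while in a plateauing run it shrinks with every tenfold increase in tokens.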
Introducing Synthetic Mixed Training
Synthetic Mixed Training addresses these challenges by training on two distinct types of synthetic data together: synthetic question-answer pairs (QAs) and synthetic documents. The two data types provide complementary training signals. Reported benefits include:
- Log-linear improvements as both the volume of synthetic data and the strength of the generator increase.
- A 2.6% relative gain over RAG on QuaLITY, a long-document reading comprehension benchmark.
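The mixing idea can be sketched as a small data-preparation step. This is a minimal sketch, not the authors' pipeline: `build_mixed_corpus`, the `qa_fraction` knob, the text template for QAs, and the toy data are all assumptions made for illustration, and the source does not state what mixing ratio is used.

```python
import random

def build_mixed_corpus(qa_pairs, documents, qa_fraction=0.5, size=None, seed=0):
    """Return a shuffled list of (kind, text) training examples.

    qa_fraction is a hypothetical knob: the share of examples drawn from
    synthetic QAs. The source does not state what ratio the authors use.
    """
    if not qa_pairs or not documents:
        raise ValueError("need at least one QA pair and one document")
    if not 0.0 <= qa_fraction <= 1.0:
        raise ValueError("qa_fraction must be in [0, 1]")
    rng = random.Random(seed)
    # Assumed serialization of a QA pair into a single training string.
    qa_texts = [f"Question: {q}\nAnswer: {a}" for q, a in qa_pairs]
    if size is None:
        size = len(qa_texts) + len(documents)
    n_qa = round(size * qa_fraction)
    # Sample with replacement so either source can be up- or down-weighted.
    mixed = [("qa", rng.choice(qa_texts)) for _ in range(n_qa)]
    mixed += [("doc", rng.choice(documents)) for _ in range(size - n_qa)]
    rng.shuffle(mixed)
    return mixed

# Toy placeholder data, standing in for generator output.
qa_pairs = [("Who wrote the report?", "The audit team."),
            ("When was it filed?", "In March.")]
documents = ["The audit team filed its report in March.",
             "A March filing by the audit team summarized the findings."]

corpus = build_mixed_corpus(qa_pairs, documents, qa_fraction=0.5, size=8)
for kind, text in corpus:
    print(kind, "|", text.replace("\n", " "))
```

Shuffling both example types into one stream means every stretch of training sees QAs and documents together, rather than one type followed by the other.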
Innovative Techniques: Focal Rewriting
In addition to Synthetic Mixed Training, the paper introduces Focal Rewriting, a technique for synthetic document generation that explicitly conditions the generation process on specific questions. This results in:
- Increased diversity in synthetic documents.
- A steeper log-linear scaling curve, leading to improved performance metrics.
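Conditioning document generation on a question can be sketched as a prompt-construction step. The prompt wording, the function names, and the `generate` callable below are all hypothetical; the source only says that generation is conditioned on specific questions, not how the prompt is phrased.

```python
def focal_rewrite_prompt(source_text, question):
    """Build a rewrite prompt that is conditioned on one specific question."""
    return (
        "Rewrite the passage below as a new, self-contained document.\n"
        f"Focus on the information needed to answer this question: {question}\n\n"
        f"Passage:\n{source_text}"
    )

def focal_rewrites(source_text, questions, generate):
    """Produce one rewrite per question.

    `generate` is any callable mapping a prompt string to generated text,
    for example a thin wrapper around whatever LLM client is in use.
    """
    return [generate(focal_rewrite_prompt(source_text, q)) for q in questions]

def echo_generator(prompt):
    """Stand-in generator so the sketch runs without a model."""
    return f"[rewrite for prompt of {len(prompt)} chars]"

passage = "The clinic changed its intake procedure after the 2019 audit."
questions = ["What triggered the procedure change?",
             "When did the audit take place?"]

for rewrite in focal_rewrites(passage, questions, echo_generator):
    print(rewrite)
```

Because each question steers the rewrite toward a different part of the source, the same passage yields several distinct synthetic documents, which is one plausible reading of the diversity gain described above.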
Results and Performance Metrics
Combining these strategies yields a Llama 8B model that surpasses RAG by a 4.4% relative gain on the QuaLITY benchmark. The method was also evaluated across several models and benchmarks, including:
- QuaLITY
- LongHealth
- FinanceBench
Across these, models trained with this methodology outperform RAG in five out of six settings and achieve a 9.1% gain when combined with RAG.
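Since the headline numbers are stated as relative gains, it helps to keep relative and absolute differences apart. The helper below is a minimal sketch; the function name is hypothetical and the scores are made up for illustration, not taken from the paper.

```python
def relative_gain(new_score, baseline_score):
    """Relative gain of new_score over baseline_score, in percent."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return (new_score - baseline_score) / baseline_score * 100.0

# Made-up scores for illustration only (NOT results from the paper).
baseline_accuracy = 60.0   # hypothetical RAG accuracy, in percent
model_accuracy = 63.0      # hypothetical trained-model accuracy, in percent

absolute = model_accuracy - baseline_accuracy
relative = relative_gain(model_accuracy, baseline_accuracy)
print(f"absolute gain: {absolute:.1f} points, relative gain: {relative:.1f}%")
```

With these made-up scores, a 3-point absolute difference corresponds to a relative gain of about 5%, so the two figures should not be compared directly.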
Conclusion
Synthetic Mixed Training and Focal Rewriting indicate that learning knowledge from synthetic data can keep scaling where earlier methods plateau. By combining synthetic QAs and synthetic documents, researchers and practitioners can train models that outperform RAG in most of the reported settings, with further gains when the two are combined.
