Synthetic Mixed Training: Boosting Language Models Beyond RAG

Synthetic Mixed Training: Scaling Parametric Knowledge Acquisition Beyond RAG

Summary: arXiv:2603.23562v1

Type: Cross

Abstract: Synthetic data augmentation helps language models learn new knowledge in data-constrained domains. However, naively scaling existing synthetic data methods by training on more synthetic tokens or using stronger generators yields diminishing returns below the performance of RAG. To break the RAG ceiling, we introduce Synthetic Mixed Training, which combines synthetic QAs and synthetic documents. This leverages their complementary training signals, and enables log-linear improvements as both synthetic data volume and generator strength increase. This allows the model to outperform RAG by a 2.6% relative gain on QuaLITY, a long-document reading comprehension benchmark.

Introduction

Synthetic data augmentation helps language models learn new knowledge in data-constrained environments. Scaling the existing methods, however, runs into limits. This article summarizes Synthetic Mixed Training, an approach designed to get past those limits.

Challenges with Current Synthetic Data Methods

Current synthetic data methods often rely on two primary strategies:

Increasing the volume of synthetic tokens used for training.
Employing stronger generators to produce high-quality synthetic data.

However, both strategies show diminishing returns, and the resulting models still perform below retrieval-augmented generation (RAG). RAG performance therefore acts as a ceiling for these methods.

Introducing Synthetic Mixed Training

Synthetic Mixed Training addresses these challenges by training on two types of synthetic data together: synthetic question-answer pairs (QAs) and synthetic documents. The two data types provide complementary training signals. Key benefits of this method include:

Log-linear improvements as both the volume of synthetic data and the strength of the generator increase.
A 2.6% relative gain over RAG on QuaLITY, a long-document reading comprehension benchmark.
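
As a rough illustration of what combining the two data types can look like in practice, the sketch below builds one shuffled training set from synthetic QAs and synthetic documents. It is a minimal sketch, not the paper's recipe: the function name `mix_synthetic_data`, the `qa_fraction` knob, and the "Question/Answer" text template are all assumptions made for this example, and the source does not state a mixing ratio.

```python
import random
from typing import Dict, List


def mix_synthetic_data(
    qa_pairs: List[Dict[str, str]],
    documents: List[str],
    n_examples: int,
    qa_fraction: float = 0.5,  # hypothetical knob; the source gives no ratio
    seed: int = 0,
) -> List[str]:
    """Return a shuffled list of training texts drawn from both sources."""
    if not 0.0 <= qa_fraction <= 1.0:
        raise ValueError("qa_fraction must be in [0, 1]")
    n_qa = round(n_examples * qa_fraction)
    n_doc = n_examples - n_qa
    if n_qa > len(qa_pairs) or n_doc > len(documents):
        raise ValueError("not enough synthetic examples for the requested mix")

    rng = random.Random(seed)
    # Render each QA pair as plain text so both sources share one format.
    qa_texts = [
        f"Question: {qa['question']}\nAnswer: {qa['answer']}"
        for qa in rng.sample(qa_pairs, n_qa)
    ]
    doc_texts = rng.sample(documents, n_doc)

    mixed = qa_texts + doc_texts
    rng.shuffle(mixed)  # interleave QAs and documents within the training set
    return mixed
```

Sampling without replacement and raising on a shortfall keeps the realized QA share equal to the requested one. Other choices, such as mixing by token count rather than example count, are equally plausible.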

Innovative Techniques: Focal Rewriting

In addition to Synthetic Mixed Training, the work introduces Focal Rewriting, a technique for generating synthetic documents. It explicitly conditions the generation process on specific questions, resulting in:

Increased diversity in synthetic documents.
A steeper log-linear scaling curve, meaning a larger improvement for each increase in synthetic data.
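
To make the idea of conditioning generation on a question concrete, here is a minimal sketch of how such a prompt could be assembled. The prompt wording, the function names `build_focal_prompt` and `focal_rewrite`, and the `generate` callable (a stand-in for whatever generator model is used) are assumptions for illustration and do not come from the source.

```python
from typing import Callable, List


def build_focal_prompt(source_text: str, focus_question: str) -> str:
    """Build a rewrite prompt conditioned on one question (wording is hypothetical)."""
    return (
        "Rewrite the passage below as a new document that focuses on the "
        "information needed to answer the question.\n\n"
        f"Question: {focus_question}\n\n"
        f"Passage:\n{source_text}\n"
    )


def focal_rewrite(
    source_text: str,
    questions: List[str],
    generate: Callable[[str], str],  # any text-in, text-out generator
) -> List[str]:
    """Produce one synthetic document per focus question."""
    return [generate(build_focal_prompt(source_text, q)) for q in questions]
```

Because each call sees a different question, the same source passage yields differently focused rewrites, which is one plausible way the technique could increase document diversity.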

Results and Performance Metrics

Combining these strategies yields a Llama 8B model that surpasses RAG by a 4.4% relative gain on the QuaLITY benchmark. The evaluation spans several models and three benchmarks:

QuaLITY
LongHealth
FinanceBench

Across these models and benchmarks, the training method outperforms RAG in five out of six settings, and it achieves a 9.1% gain when combined with RAG.
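
The 2.6% and 4.4% figures are relative gains, which measure improvement as a fraction of the baseline score rather than in absolute accuracy points. The helper below shows the arithmetic. The scores in the usage lines are made-up numbers chosen only to illustrate the calculation; they are not results from the paper.

```python
def relative_gain(model_score: float, baseline_score: float) -> float:
    """Relative improvement of a model over a baseline, in percent."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return 100.0 * (model_score - baseline_score) / baseline_score


# Hypothetical scores, not taken from the paper:
rag_accuracy = 50.0    # baseline accuracy in percent
model_accuracy = 52.2  # trained model accuracy in percent
gain = relative_gain(model_accuracy, rag_accuracy)
print(f"relative gain: {gain:.1f}%")
```

With these illustrative numbers, an absolute difference of 2.2 accuracy points corresponds to a relative gain of about 4.4%, so a relative gain should not be read as a difference in accuracy points.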

Conclusion

Synthetic Mixed Training and Focal Rewriting show that training on synthetic data can move past the RAG ceiling that limits existing methods. By combining synthetic QAs and synthetic documents, researchers and practitioners can keep improving performance as synthetic data volume and generator strength increase.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
