WRAP++: Web discoveRy Amplified Pretraining
arXiv:2604.06829v1 (cross-listed)
Abstract
Synthetic data rephrasing has emerged as a powerful technique for enhancing knowledge acquisition during large language model (LLM) pretraining. However, existing approaches operate at the single-document level, rewriting individual web pages in isolation. This confines synthesized examples to intra-document knowledge, missing cross-document relationships and leaving facts with limited associative context.
We propose WRAP++ (Web discoveRy Amplified Pretraining), which amplifies the associative context of factual knowledge by discovering cross-document relationships from web hyperlinks and synthesizing joint QA over each discovered document pair. Concretely, WRAP++ discovers high-confidence relational motifs including dual-links and co-mentions, and synthesizes QA that requires reasoning across both documents. This produces relational knowledge absent from either source document alone, creating diverse entry points to the same facts.
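The discovery step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define the motifs precisely, so the sketch assumes a "dual-link" is a pair of pages that hyperlink to each other and a "co-mention" is a pair of pages that are both linked from the same third page. The function names, the `min_shared` threshold, and the toy link graph are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def find_dual_links(links):
    """Return unordered page pairs that hyperlink to each other (assumed definition)."""
    pairs = set()
    for src, targets in links.items():
        for dst in targets:
            # Keep the pair only if the link is reciprocated.
            if src != dst and src in links.get(dst, ()):
                pairs.add(tuple(sorted((src, dst))))
    return pairs


def find_co_mentions(links, min_shared=2):
    """Return pairs linked together from at least `min_shared` pages (assumed definition)."""
    counts = defaultdict(int)
    for src, targets in links.items():
        others = sorted(set(targets) - {src})
        for a, b in combinations(others, 2):
            counts[(a, b)] += 1
    return {pair for pair, n in counts.items() if n >= min_shared}


# Toy link graph (hypothetical data): page title -> set of outgoing link targets.
links = {
    "Marie Curie": {"Pierre Curie", "Radium", "Polonium"},
    "Pierre Curie": {"Marie Curie", "Radium", "Polonium"},
    "Radium": {"Marie Curie", "Polonium"},
    "Polonium": {"Marie Curie"},
}

dual = find_dual_links(links)
co = find_co_mentions(links, min_shared=2)
print(sorted(dual))
print(sorted(co))
```

Raising `min_shared` is one simple way to trade coverage for confidence. The abstract does not describe the filtering the paper actually uses.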
Because the number of valid entity pairs grows combinatorially, this discovery-driven synthesis also amplifies data scale far beyond single-document rewriting. Instantiating WRAP++ on Wikipedia, we amplify ~8.4B tokens of raw text into 80B tokens of cross-document QA data.
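As a rough illustration of the scale argument: for a corpus of n documents, the number of unordered document pairs is bounded by the binomial coefficient below, which grows quadratically in n. Only pairs that pass the motif filters are used, so this is an upper bound and not the number of pairs WRAP++ synthesizes. The second expression is the amplification ratio implied by the reported token counts.

```latex
\[
\binom{n}{2} = \frac{n(n-1)}{2},
\qquad
\frac{80\,\text{B tokens}}{8.4\,\text{B tokens}} \approx 9.5
\]
```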
On SimpleQA, OLMo-based models at both 7B and 32B scales trained with WRAP++ substantially outperform single-document approaches and exhibit sustained scaling gains, underscoring the advantage of cross-document knowledge discovery and amplification.
Key Features of WRAP++
- Cross-Document Relationships: WRAP++ uses web hyperlinks to discover relationships between documents, giving each fact more associative context than single-document rewriting provides.
- Joint QA Synthesis: WRAP++ generates question-answer pairs that require reasoning across both documents in a discovered pair, producing relational knowledge absent from either document alone.
- Scalability: Because the number of valid document pairs grows combinatorially, the method expands ~8.4 billion tokens of raw Wikipedia text into 80 billion tokens of cross-document QA data, roughly a 9.5x increase.
- Performance Improvements: On SimpleQA, OLMo-based models at both 7B and 32B scales trained with WRAP++ substantially outperform single-document approaches and show sustained scaling gains.
- High-Confidence Relational Motifs: Discovery is restricted to high-confidence motifs such as dual-links and co-mentions, which is intended to keep paired documents genuinely related and the synthesized QA relevant.
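The joint QA synthesis feature can be sketched as a prompt-assembly step. The example below is a hypothetical illustration under stated assumptions: the `Document` class, the `build_joint_qa_prompt` function, the `max_chars` truncation, and the prompt wording are all invented here, and the abstract does not specify the prompt or the generator model. The sketch stops before any model call.

```python
from dataclasses import dataclass


@dataclass
class Document:
    title: str
    text: str


def build_joint_qa_prompt(doc_a: Document, doc_b: Document, motif: str,
                          max_chars: int = 2000) -> str:
    """Assemble a prompt asking for QA pairs that need both documents.

    The wording is an illustrative assumption, not the paper's prompt.
    """
    return (
        f"The two documents below are related by a {motif} relation.\n\n"
        f"Document 1: {doc_a.title}\n{doc_a.text[:max_chars]}\n\n"
        f"Document 2: {doc_b.title}\n{doc_b.text[:max_chars]}\n\n"
        "Write question-answer pairs that can only be answered by "
        "combining information from both documents."
    )


# Hypothetical example pair.
doc_a = Document("Marie Curie", "Marie Curie discovered polonium and radium.")
doc_b = Document("Polonium", "Polonium is a chemical element named after Poland.")
prompt = build_joint_qa_prompt(doc_a, doc_b, motif="dual-link")
print(prompt)
```

Placing both documents in a single prompt is what would let a generator ask questions that neither page answers alone, such as which discoverer's home country an element is named after.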
Implications for Future Research
WRAP++ shifts the focus of synthetic pretraining data from rewriting documents in isolation to exploiting the links between them, which suggests that the relational structure of the web is an underused source of training signal.
This approach expands both the volume of training data and the associative context around each fact. If models become better at recalling and reasoning over facts that span multiple documents, applications such as education, information retrieval, and conversational AI could benefit.
Conclusion
WRAP++ is a step forward in using web data to pretrain large language models. By amplifying the associative context of factual knowledge through cross-document QA synthesis, it improves factual knowledge acquisition, as measured on SimpleQA, over single-document rephrasing.
