WRAP++: Enhancing LLM Training with Cross-Document QA

WRAP++: Web discoveRy Amplified Pretraining

Source: arXiv:2604.06829v1 (announce type: cross)

Abstract

Synthetic data rephrasing has emerged as a powerful technique for enhancing knowledge acquisition during large language model (LLM) pretraining. However, existing approaches operate at the single-document level, rewriting individual web pages in isolation. This confines synthesized examples to intra-document knowledge, missing cross-document relationships and leaving facts with limited associative context.

We propose WRAP++ (Web discoveRy Amplified Pretraining), which amplifies the associative context of factual knowledge by discovering cross-document relationships from web hyperlinks and synthesizing joint QA over each discovered document pair. Concretely, WRAP++ discovers high-confidence relational motifs including dual-links and co-mentions, and synthesizes QA that requires reasoning across both documents. This produces relational knowledge absent from either source document alone, creating diverse entry points to the same facts.
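To make the discovery step concrete, here is a minimal sketch of mining document pairs from a hyperlink graph. It rests on assumptions that the abstract does not confirm: "dual-link" is read as two pages that link to each other, "co-mention" as two pages that are both linked from a third page, and the `links` dictionary format and the function name `discover_pairs` are hypothetical. The paper's actual definitions and confidence filters may differ.

```python
from collections import defaultdict
from itertools import combinations


def discover_pairs(links):
    """Find candidate document pairs in a hyperlink graph.

    links: dict mapping a page title to the set of titles it links to
           (a hypothetical input format chosen for this sketch).
    Returns (dual_links, co_mentions).
    """
    dual_links = set()
    co_mentions = defaultdict(set)  # pair -> pages that link to both

    for src, targets in links.items():
        for dst in targets:
            # Assumed "dual-link": src links to dst AND dst links back.
            if dst != src and src in links.get(dst, set()):
                dual_links.add(tuple(sorted((src, dst))))
        # Assumed "co-mention": two distinct pages linked from the same page.
        for a, b in combinations(sorted(targets - {src}), 2):
            co_mentions[(a, b)].add(src)

    return dual_links, dict(co_mentions)


# Toy link graph for illustration only.
links = {
    "Marie Curie": {"Pierre Curie", "Radium", "Polonium"},
    "Pierre Curie": {"Marie Curie", "Radium"},
    "Radium": {"Marie Curie"},
    "Polonium": set(),
}

dual, co = discover_pairs(links)
print(sorted(dual))
# [('Marie Curie', 'Pierre Curie'), ('Marie Curie', 'Radium')]
print(co[("Pierre Curie", "Radium")])
# {'Marie Curie'}
```

Each returned pair would then be a candidate for joint QA synthesis. A real pipeline would presumably add a confidence threshold, for example on how many pages co-mention a pair, which this sketch leaves out.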

Because the number of valid entity pairs grows combinatorially, this discovery-driven synthesis also amplifies data scale far beyond single-document rewriting. Instantiating WRAP++ on Wikipedia, we amplify ~8.4B tokens of raw text into 80B tokens of cross-document QA data.
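The scale argument can be checked with a few lines of arithmetic. The sketch below computes the upper bound on unordered document pairs, n(n-1)/2, for a few corpus sizes, and the amplification factor implied by the token counts quoted above. The function name `max_pairs` is hypothetical, the bound counts all possible pairs rather than the valid ones WRAP++ keeps, and the 8.4B figure is approximate in the source.

```python
from math import comb


def max_pairs(n_docs):
    """Upper bound on unordered document pairs: n * (n - 1) / 2."""
    return comb(n_docs, 2)


for n in (10, 1_000, 1_000_000):
    print(n, max_pairs(n))
# 10 45
# 1000 499500
# 1000000 499999500000

# Token counts as quoted in the abstract (raw figure is approximate).
raw_tokens = 8.4e9
synth_tokens = 80e9
amplification = synth_tokens / raw_tokens
print(f"{amplification:.1f}x")
# 9.5x
```

Only a small fraction of these candidate pairs will match a high-confidence motif, but even a small fraction of a quadratic count leaves far more synthesis targets than there are documents.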

On SimpleQA, OLMo-based models at both 7B and 32B scales trained with WRAP++ substantially outperform single-document approaches and exhibit sustained scaling gains, underscoring the advantage of cross-document knowledge discovery and amplification.

Key Features of WRAP++

  • Cross-Document Relationships: WRAP++ uses web hyperlinks to discover relationships between documents, so each fact is synthesized with more associative context than a single page provides.
  • Joint QA Synthesis: For each discovered document pair, WRAP++ synthesizes question-answer pairs that require reasoning across both documents, yielding relational knowledge that neither source states on its own.
  • Scalability: Because the number of valid entity pairs grows combinatorially, the method expands data volume. On Wikipedia, ~8.4B tokens of raw text become 80B tokens of cross-document QA data, roughly a 9.5x increase.
  • Performance Improvements: On SimpleQA, OLMo-based models at both 7B and 32B scales trained with WRAP++ substantially outperform single-document approaches and show sustained scaling gains.
  • High-Confidence Relational Motifs: Restricting discovery to high-confidence motifs such as dual-links and co-mentions favors document pairs that are genuinely related.
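As an illustration of the joint QA synthesis step listed above, the following sketch assembles a prompt for one discovered document pair. Everything here is an assumption made for illustration: the `Doc` class, the `build_joint_qa_prompt` function, the `max_chars` truncation, and the template wording are hypothetical, and the paper's actual prompt and generation model are not described in this summary. No model is called; the code only builds the text a generator would receive.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    title: str
    text: str


# Hypothetical template; not the prompt used by the WRAP++ authors.
PROMPT_TEMPLATE = """You are given two related documents.
Relation between them: {motif}

[Document A: {title_a}]
{text_a}

[Document B: {title_b}]
{text_b}

Write question-answer pairs that can only be answered by combining
facts from BOTH documents. Do not ask questions answerable from one alone."""


def build_joint_qa_prompt(doc_a, doc_b, motif, max_chars=2000):
    """Fill the template, truncating each document body to max_chars."""
    return PROMPT_TEMPLATE.format(
        motif=motif,
        title_a=doc_a.title,
        text_a=doc_a.text[:max_chars],
        title_b=doc_b.title,
        text_b=doc_b.text[:max_chars],
    )


doc_a = Doc(
    "Marie Curie",
    "Marie Curie was a physicist and chemist who conducted "
    "pioneering research on radioactivity.",
)
doc_b = Doc(
    "Pierre Curie",
    "Pierre Curie was a French physicist and a pioneer in "
    "crystallography and magnetism.",
)

prompt = build_joint_qa_prompt(doc_a, doc_b, motif="dual-link")
print(prompt)
```

Passing the motif type into the prompt is a design choice of this sketch: it gives the generator a hint about why the two pages were paired, which may help it write questions that actually span both.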

Implications for Future Research

WRAP++ suggests a shift in how pretraining data is synthesized: from rewriting documents in isolation to exploiting the relationships between them. The hyperlink structure of the web, which single-document rephrasing ignores, becomes a source of training signal in its own right.

The method targets both the quality and the volume of training data, and it points to further work on cross-document synthesis for LLM pretraining. Models that reason more reliably across multiple documents could be useful in areas such as education, information retrieval, and conversational AI.

Conclusion

WRAP++ is a step forward in using web data to train large language models. By amplifying the associative context of factual knowledge through cross-document synthesis, it offers a way to improve knowledge acquisition during pretraining beyond what single-document rewriting achieves.


Lazarus Omolua