Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs
arXiv:2604.10480v1
Abstract
Post-training data plays a pivotal role in shaping the capabilities of Large Language Models (LLMs), yet datasets are often treated as isolated artifacts, overlooking the systemic connections that underlie their evolution. To disentangle these complex relationships, we introduce the concept of data lineage to the LLM ecosystem and propose an automated multi-agent framework to reconstruct the evolutionary graph of dataset development.
Key Findings
Through large-scale lineage analysis, we characterize domain-specific structural patterns, such as:
- Vertical Refinement: Observed in math-oriented datasets.
- Horizontal Aggregation: Identified in general-domain corpora.
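One way to make the two patterns concrete is to compare derivation depth with fan-in on a lineage graph. The sketch below is illustrative only: the dataset names, the `PARENTS` mapping, and the `depth` and `fan_in` helpers are assumptions made for this example, not the paper's data or metrics.

```python
# Hypothetical toy lineage: each dataset maps to the datasets it was derived from.
PARENTS = {
    "math_seed": [],
    "math_rewritten": ["math_seed"],
    "math_verified": ["math_rewritten"],
    "chat_a": [],
    "chat_b": [],
    "chat_c": [],
    "general_mix": ["chat_a", "chat_b", "chat_c"],
}


def depth(parents, node):
    """Length of the longest derivation chain from any root to `node`."""
    direct = parents[node]
    if not direct:
        return 0
    return 1 + max(depth(parents, p) for p in direct)


def fan_in(parents, node):
    """Number of direct upstream sources merged into `node`."""
    return len(parents[node])


for name in ("math_verified", "general_mix"):
    print(name, "depth:", depth(PARENTS, name), "fan-in:", fan_in(PARENTS, name))
```

In this toy graph, the math chain is deep and narrow (depth 2, fan-in 1), which resembles vertical refinement, while the general-domain mixture is shallow and wide (depth 1, fan-in 3), which resembles horizontal aggregation.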
Systemic Issues Uncovered
Our analysis reveals several pervasive systemic issues, including:
- Structural Redundancy: Induced by implicit dataset intersections.
- Propagation of Benchmark Contamination: Occurring along lineage paths.
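Both issues can be framed as reachability questions on the lineage graph. The following is a minimal sketch under assumed names (`PARENTS`, `descendants`, `ancestors`, `shared_ancestors` are all hypothetical), not the framework's actual detection logic: contamination found in one dataset potentially reaches every descendant, and a mixture whose parents share an ancestor carries an implicit intersection.

```python
from collections import deque

# Hypothetical toy lineage: child dataset -> list of direct parent datasets.
PARENTS = {
    "seed": [],
    "seed_filtered": ["seed"],
    "seed_augmented": ["seed"],
    "mix": ["seed_filtered", "seed_augmented"],
    "other": [],
}


def descendants(parents, source):
    """All datasets derived, directly or transitively, from `source`."""
    children = {n: [] for n in parents}
    for child, direct in parents.items():
        for p in direct:
            children[p].append(child)
    seen, queue = set(), deque([source])
    while queue:
        node = queue.popleft()
        for c in children[node]:
            if c not in seen:
                seen.add(c)
                queue.append(c)
    return seen


def ancestors(parents, node):
    """All datasets that `node` derives from, directly or transitively."""
    seen, stack = set(), list(parents[node])
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents[p])
    return seen


def shared_ancestors(parents, node):
    """Upstream datasets reachable through more than one direct parent of `node`."""
    counts = {}
    for p in parents[node]:
        for a in ancestors(parents, p) | {p}:
            counts[a] = counts.get(a, 0) + 1
    return {a for a, c in counts.items() if c > 1}


# If "seed" were found to overlap with a benchmark, these datasets would need checking.
print(sorted(descendants(PARENTS, "seed")))
# "mix" merges two derivatives of the same source, so "seed" is counted twice.
print(shared_ancestors(PARENTS, "mix"))
```

Reachability only flags candidates: whether contaminated samples actually survive a given filtering or rewriting step would still need a sample-level check.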
Practical Applications of Lineage Analysis
To demonstrate the practical value of lineage analysis for data construction, we use the reconstructed lineage graph to build a lineage-aware, diversity-oriented dataset. By anchoring instruction sampling at upstream root sources, this approach:
- Mitigates downstream homogenization.
- Reduces hidden redundancy.
- Yields a more diverse post-training corpus.
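The idea of anchoring sampling at root sources can be sketched as follows. The abstract does not describe the actual sampling procedure, so everything here is an assumption for illustration: the `PARENTS` graph, the `ROOT_POOLS` instruction pools, the equal per-root quota, and the helper names.

```python
import random

# Hypothetical lineage: child dataset -> direct parents. Roots have no parents.
PARENTS = {
    "root_a": [],
    "root_b": [],
    "derived_1": ["root_a"],
    "derived_2": ["root_a", "root_b"],
    "derived_3": ["derived_1", "derived_2"],
}

# Hypothetical instruction pools held only at the root sources.
ROOT_POOLS = {
    "root_a": ["a1", "a2", "a3", "a4"],
    "root_b": ["b1", "b2", "b3"],
}


def roots_of(parents, node):
    """Root sources (datasets with no parents) that `node` ultimately derives from."""
    if not parents[node]:
        return {node}
    found = set()
    for p in parents[node]:
        found |= roots_of(parents, p)
    return found


def sample_from_roots(parents, pools, targets, per_root, seed=0):
    """Draw up to `per_root` instructions from each distinct root behind `targets`."""
    rng = random.Random(seed)
    roots = set()
    for t in targets:
        roots |= roots_of(parents, t)
    sample = []
    for r in sorted(roots):  # sorted for a reproducible draw order
        pool = pools[r]
        sample.extend(rng.sample(pool, min(per_root, len(pool))))
    return sample


# Three downstream datasets collapse to two roots, so each root is sampled once.
print(sample_from_roots(PARENTS, ROOT_POOLS, ["derived_1", "derived_2", "derived_3"], per_root=2))
```

Sampling once per distinct root, rather than once per downstream dataset, is what keeps a heavily reused source from being over-represented in this sketch.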
Advancements in Data Curation
We further highlight lineage-centric analysis as an efficient and robust topological alternative to sample-level dataset comparison for large-scale data ecosystems. By grounding data construction in explicit lineage structures, our work makes dataset evolution easier to understand and advances post-training data curation toward a more systematic and controllable paradigm.
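As a rough illustration of what a topological comparison could look like, the sketch below scores two datasets by the Jaccard overlap of their root sources, without reading any samples. This particular measure, the `PARENTS` graph, and the function names are assumptions chosen for the example; the abstract does not specify which comparison the authors use.

```python
# Hypothetical lineage: child dataset -> direct parents. Roots have no parents.
PARENTS = {
    "web_qa": [],
    "forum_qa": [],
    "code_qa": [],
    "mix_1": ["web_qa", "forum_qa"],
    "mix_2": ["forum_qa", "code_qa"],
    "mix_3": ["mix_1"],
}


def root_set(parents, node):
    """Root sources that `node` ultimately derives from (a root maps to itself)."""
    if not parents[node]:
        return {node}
    found = set()
    for p in parents[node]:
        found |= root_set(parents, p)
    return found


def lineage_overlap(parents, a, b):
    """Jaccard similarity of the two datasets' root sets, in [0, 1]."""
    ra, rb = root_set(parents, a), root_set(parents, b)
    return len(ra & rb) / len(ra | rb)


print(lineage_overlap(PARENTS, "mix_1", "mix_2"))  # one shared root out of three
print(lineage_overlap(PARENTS, "mix_1", "mix_3"))  # identical root sets
```

The cost depends on the size of the graph rather than the number of samples, which is the sense in which a graph-level comparison can be cheaper than deduplicating two corpora against each other.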
Conclusion
The multi-agent framework reconstructs data lineage for post-training LLM datasets, giving curators an explicit graph to work from when refining their data curation methods. As the field evolves, understanding how datasets derive from one another will become increasingly important for developing more capable and reliable models.
