daVinci-LLM: Towards the Science of Pretraining
Summary: arXiv:2603.27164v1
The pretraining phase sets a model’s capability ceiling: post-training struggles to overcome the foundations established during pretraining. Yet pretraining itself remains critically under-explored. The cause is a structural paradox. Organizations with substantial computational resources operate under commercial pressures that inhibit transparent disclosure of their methods, while academic institutions have research freedom but typically lack pretraining-scale compute.
daVinci-LLM occupies this unexplored intersection, combining industrial-scale resources with full research freedom to advance the science of pretraining. The project adopts a fully open paradigm that treats openness as a scientific methodology rather than only as a principle. To this end, it commits to releasing complete data processing pipelines, full training processes, and systematic exploration results.
Because the field lacks systematic methodologies for data processing, the daVinci-LLM team employs the Data Darwinism framework, a principled L0-L9 taxonomy that spans data filtering through synthesis. The team trains a 3B-parameter model from random initialization on 8 trillion tokens, using a two-stage adaptive curriculum that progressively shifts from foundational capabilities to reasoning-intensive enhancement.
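To make the scale and the two-stage idea concrete, here is a minimal sketch. Only the 8 trillion token budget and the 3B parameter count come from the text. The stage boundary, the domain names, and the mixture weights are hypothetical placeholders, and a fixed switch point stands in for whatever adaptive criterion the project actually uses.

```python
TOTAL_TOKENS = 8_000_000_000_000  # 8T tokens, from the text
PARAMS = 3_000_000_000            # 3B parameters, from the text

# Hypothetical values: the stage boundary and the domain weights below are
# illustrative placeholders, not figures reported by the project.
STAGE1_FRACTION = 0.75
STAGE1_MIX = {"web": 0.6, "code": 0.2, "math": 0.1, "science": 0.1}
STAGE2_MIX = {"web": 0.3, "code": 0.3, "math": 0.25, "science": 0.15}


def mixture_at(tokens_seen: int) -> dict[str, float]:
    """Return normalized domain sampling weights for a training position."""
    boundary = int(STAGE1_FRACTION * TOTAL_TOKENS)
    mix = STAGE1_MIX if tokens_seen < boundary else STAGE2_MIX
    total = sum(mix.values())
    # Renormalize so the weights always sum to 1, even if edited by hand.
    return {domain: weight / total for domain, weight in mix.items()}


# Data-to-model ratio implied by the text: roughly 2,667 tokens per parameter.
tokens_per_param = TOTAL_TOKENS / PARAMS

if __name__ == "__main__":
    print(f"tokens per parameter: {tokens_per_param:.0f}")
    print("stage 1 mix:", mixture_at(0))
    print("stage 2 mix:", mixture_at(TOTAL_TOKENS - 1))
```

In this sketch the second stage raises the share of code and math at the expense of web text, which is one plausible reading of a shift toward reasoning-intensive data.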
Across more than 200 controlled ablations, the team establishes four findings:
- Processing Depth: Processing depth systematically enhances capabilities, making it a critical dimension alongside volume scaling.
- Diverse Domain Dynamics: Different domains show distinct saturation dynamics, which call for adaptive strategies ranging from proportion adjustments to format shifts.
- Compositional Balance: Compositional balance enables targeted intensification of capabilities while preventing performance collapse.
- Evaluation Protocols: Evaluation protocol choices shape how pretraining progress is understood.
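The last finding can be illustrated with a generic example. The summary does not say which protocols the team compared, so the sketch below uses a familiar contrast that is assumed here purely for illustration: scoring multiple-choice options by total log-likelihood versus per-token (length-normalized) log-likelihood. The function name and the log-probabilities are made up.

```python
def pick_answer(option_logprobs: dict[str, list[float]], length_normalize: bool) -> str:
    """Choose the option with the highest total or per-token log-likelihood."""

    def score(token_logprobs: list[float]) -> float:
        total = sum(token_logprobs)
        return total / len(token_logprobs) if length_normalize else total

    return max(option_logprobs, key=lambda opt: score(option_logprobs[opt]))


# Made-up per-token log-probabilities for two answer options of different length.
options = {
    "short": [-2.0, -2.0],                    # sum -4.0, mean -2.0
    "long": [-1.0, -1.0, -1.0, -1.0, -1.0],   # sum -5.0, mean -1.0
}

raw_choice = pick_answer(options, length_normalize=False)
normalized_choice = pick_answer(options, length_normalize=True)

if __name__ == "__main__":
    print(raw_choice)         # short
    print(normalized_choice)  # long
```

The same model outputs yield different selected answers under the two scoring rules, so a benchmark curve can look different depending only on the protocol.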
By releasing the complete exploration process and results, daVinci-LLM enables the research community to build on these findings and on its systematic methodologies, so that scientific knowledge about pretraining can accumulate.
In conclusion, daVinci-LLM is a step toward a science of pretraining: it addresses the gap between compute and openness and offers an open basis for studying how pretraining shapes model capabilities.
