Suiren-1.0 Technical Report: A Family of Molecular Foundation Models
arXiv:2603.21942v3
Abstract: We introduce Suiren-1.0, a family of molecular foundation models for the accurate modeling of diverse organic systems. Suiren-1.0 comprises three specialized variants (Suiren-Base, Suiren-Dimer, and Suiren-ConfAvg), integrated within an algorithmic framework that bridges the gap between 3D conformational geometry and 2D statistical ensemble spaces.
We first pre-train Suiren-Base, a model with 1.8 billion parameters, on a dataset of 70 million samples derived from Density Functional Theory (DFT). The model uses spatial self-supervision and an SE(3)-equivariant architecture, which enable it to predict the quantum properties of molecules robustly.
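SE(3) is the group of 3D rotations and translations. A scalar property such as an energy should not change when a molecule is rigidly moved, and SE(3)-equivariant architectures build that symmetry into the network. The sketch below is not the Suiren-Base architecture; it only illustrates the underlying symmetry using interatomic distances, a common invariant input. The geometry, the function names, and the chosen rotation and shift are all illustrative assumptions.

```python
import math

def rotate_z(point, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def translate(point, shift):
    """Shift a 3D point by the vector `shift`."""
    return tuple(p + d for p, d in zip(point, shift))

def pairwise_distances(coords):
    """Return all interatomic distances for atom pairs i < j."""
    n = len(coords)
    return [math.dist(coords[i], coords[j])
            for i in range(n) for j in range(i + 1, n)]

# Hypothetical three-atom geometry in angstroms (illustrative values only).
coords = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]

# Apply one rigid motion: a rotation about z followed by a translation.
moved = [translate(rotate_z(p, 0.7), (1.5, -2.0, 0.3)) for p in coords]

before = pairwise_distances(coords)
after = pairwise_distances(moved)

# Distances agree up to floating-point error, so any model that reads
# only these distances gives the same prediction for both poses.
max_diff = max(abs(a - b) for a, b in zip(before, after))
print(f"largest change in any distance: {max_diff:.2e}")
```

Equivariant models go further than this invariant featurization: their internal vector and tensor features rotate along with the input, which lets them also predict directional quantities such as forces.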
Key Features of Suiren-1.0
- Suiren-Base:
This is the foundational model pre-trained on a large-scale DFT dataset, providing a strong baseline for quantum property predictions.
- Suiren-Dimer:
Building upon Suiren-Base, this variant undergoes additional pre-training on a dataset of 13.5 million intermolecular interaction samples, which targets its predictions of interactions between molecules.
- Suiren-ConfAvg:
This lightweight model is designed for efficient downstream applications, utilizing a novel technique called Conformation Compression Distillation (CCD). This method distills complex 3D structural data into simplified 2D conformation-averaged representations.
Conformation Compression Distillation (CCD)
The CCD framework enables Suiren-ConfAvg to generate high-fidelity representations directly from SMILES (Simplified Molecular Input Line Entry System) strings or molecular graphs. This diffusion-based approach reduces the complexity of working with 3D structural representations, which makes the model easier to use in downstream molecular modeling applications.
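To make the idea of a conformation-averaged target concrete, here is a minimal sketch of how per-conformer embeddings from a 3D teacher could be collapsed into a single vector that a 2D student is trained to reproduce. It rests on several assumptions not stated in the source: Boltzmann weighting of conformers by relative energy, a plain mean-squared-error objective swapped in for the diffusion-based objective the source describes, and hypothetical function names, embeddings, and energies throughout.

```python
import math

K_B_KCAL = 0.0019872041  # Boltzmann constant in kcal/(mol*K)

def boltzmann_weights(energies_kcal, temperature_k=298.15):
    """Normalized Boltzmann weights from relative conformer energies.

    Assumption: the source does not say how conformers are weighted.
    """
    e_min = min(energies_kcal)
    factors = [math.exp(-(e - e_min) / (K_B_KCAL * temperature_k))
               for e in energies_kcal]
    total = sum(factors)
    return [f / total for f in factors]

def conformer_average(embeddings, weights):
    """Weighted average of per-conformer embedding vectors."""
    dim = len(embeddings[0])
    return [sum(w * emb[d] for w, emb in zip(weights, embeddings))
            for d in range(dim)]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Hypothetical teacher embeddings for three conformers of one molecule.
teacher_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
energies = [0.0, 0.5, 1.2]  # relative energies in kcal/mol (made up)

weights = boltzmann_weights(energies)
target = conformer_average(teacher_embeddings, weights)

# Hypothetical student output computed from the 2D graph alone.
student_embedding = [0.6, 0.4]
loss = mse(student_embedding, target)
print(f"weights={weights}, target={target}, loss={loss:.4f}")
```

In this sketch the student never sees 3D coordinates at inference time; the conformer ensemble is used only to build the training target.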
Results and Performance
Our extensive evaluations show that Suiren-1.0 achieves state-of-the-art results across a variety of tasks, outperforming existing models in several key areas. These models give researchers and practitioners tools for exploring organic systems with greater accuracy.
Open Source Availability
All models and benchmarks associated with Suiren-1.0 have been open-sourced, promoting transparency and collaboration within the scientific community. This initiative not only fosters innovation but also encourages further research and development in the area of molecular foundation models.
Conclusion
Suiren-1.0 signifies a breakthrough in molecular modeling, combining advanced machine learning techniques with rich datasets to deliver robust predictive capabilities. As the field continues to evolve, we anticipate that these models will play a crucial role in advancing our understanding of molecular interactions and properties.
