Ontology-based Knowledge Graph Infrastructure for Interoperable Atomistic Simulation Data
arXiv:2604.06230v1
Abstract
The reuse of atomistic simulation data is often limited by heterogeneous formats, incomplete metadata, and a lack of standardized representations of workflows and provenance. Here we present an ontology-based infrastructure for representing and integrating atomistic simulation data as a knowledge graph. The approach combines domain ontologies with a software framework that enables data capture both from existing datasets and directly from simulation workflows at the point of generation.
Introduction
Atomistic simulations are essential across materials science, chemistry, and physics. However, reuse of the resulting data is frequently hindered by several challenges:
- Heterogeneous data formats that complicate integration.
- Incomplete or inconsistent metadata that limits data discoverability.
- A lack of standardized representations of workflows and provenance, which makes it difficult to trace how data was produced.
The Ontology-based Framework
The proposed infrastructure addresses these issues by representing and integrating atomistic simulation data as a knowledge graph. It combines two components:
- Domain Ontologies: These form the backbone of the knowledge graph, enabling a structured representation of different concepts and relationships relevant to atomistic simulations.
- Software Framework: This component captures data both from existing datasets and directly from simulation workflows, so that data is recorded at the point of generation.
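To make the idea of a knowledge graph concrete, the following is a minimal sketch in plain Python that stores facts as subject-predicate-object triples and answers simple pattern queries. It is not the framework described here: the `ex:` terms (`ex:ComputationalSample`, `ex:hasElement`, and so on), the helper names, and the sample values are all hypothetical and stand in for whatever the actual domain ontologies define.

```python
from typing import Iterator, Optional

# A triple is (subject, predicate, object); all "ex:" terms are hypothetical
# placeholders, not terms taken from the ontologies used in the paper.
Triple = tuple[str, str, str]


def sample_to_triples(sample_id: str, element: str,
                      structure: str, n_atoms: int) -> set[Triple]:
    """Describe one computational sample as a set of triples."""
    subject = f"ex:{sample_id}"
    return {
        (subject, "rdf:type", "ex:ComputationalSample"),
        (subject, "ex:hasElement", element),
        (subject, "ex:hasCrystalStructure", structure),
        (subject, "ex:hasNumberOfAtoms", str(n_atoms)),
    }


def match(graph: set[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None) -> Iterator[Triple]:
    """Yield triples matching the pattern; None acts as a wildcard."""
    for triple in graph:
        if ((s is None or triple[0] == s)
                and (p is None or triple[1] == p)
                and (o is None or triple[2] == o)):
            yield triple


# Build a tiny graph from two illustrative samples.
graph: set[Triple] = set()
graph |= sample_to_triples("sample_1", "Fe", "bcc", 2)
graph |= sample_to_triples("sample_2", "Cu", "fcc", 4)

# Query: which samples have a bcc crystal structure?
bcc_samples = sorted(t[0] for t in match(graph, p="ex:hasCrystalStructure", o="bcc"))
print(bcc_samples)
```

A production system would more likely use an RDF library and SPARQL rather than a hand-rolled set of tuples; the sketch only illustrates how ontology terms give every sample the same queryable shape.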
Data Normalization and Integration
One of the key advantages of this approach is the normalization of heterogeneous data from multiple sources into a common, ontology-aligned representation. This normalization allows for:
- Consistent querying across diverse datasets.
- Comprehensive analysis of material properties.
- Extraction of derived thermodynamic quantities from existing simulations.
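The normalization step can be illustrated with a small sketch. It assumes two hypothetical source formats that differ in field names and energy units, and maps both onto one common record type so the same query works on either. The field names, record values, and the `NormalizedSample` type are invented for illustration and are not taken from the paper or its datasets.

```python
from dataclasses import dataclass

# Rydberg energy in electronvolts (CODATA value), used for unit conversion.
RYDBERG_IN_EV = 13.605693122994


@dataclass(frozen=True)
class NormalizedSample:
    """Common representation shared by all sources (hypothetical schema)."""
    sample_id: str
    element: str
    n_atoms: int
    energy_ev: float

    @property
    def energy_per_atom_ev(self) -> float:
        return self.energy_ev / self.n_atoms


def from_source_a(record: dict) -> NormalizedSample:
    """Hypothetical source A: total energy already in eV."""
    return NormalizedSample(record["id"], record["species"],
                            record["natoms"], record["E_total_eV"])


def from_source_b(record: dict) -> NormalizedSample:
    """Hypothetical source B: total energy in Rydberg, different key names."""
    return NormalizedSample(record["name"], record["element"],
                            record["num_atoms"],
                            record["energy_Ry"] * RYDBERG_IN_EV)


# Illustrative records only; the numbers are made up.
source_a = [{"id": "a1", "species": "Al", "natoms": 4, "E_total_eV": -14.96}]
source_b = [{"name": "b7", "element": "Al", "num_atoms": 2, "energy_Ry": -0.55}]

samples = ([from_source_a(r) for r in source_a]
           + [from_source_b(r) for r in source_b])

# One query now covers both datasets, in consistent units.
al_samples = [s for s in samples if s.element == "Al"]
for s in al_samples:
    print(s.sample_id, round(s.energy_per_atom_ev, 4))
```

Keeping each source's quirks inside its own adapter function means a new dataset only requires one more adapter, while all downstream queries stay unchanged.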
Demonstration of Capabilities
The capabilities of this knowledge graph infrastructure are demonstrated through several case studies, which include:
- Integration of grain boundary data.
- Cross-dataset analysis of material properties.
- Extraction of thermodynamic quantities from previously conducted simulations.
Provenance and Workflow Tracking
Another significant aspect of the infrastructure is the representation of workflows in a machine-readable format. This enables:
- Forward provenance tracking to understand the lineage of data.
- Partial reconstruction of computational procedures to enhance reproducibility.
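As a rough illustration of machine-readable workflows, the sketch below records each workflow step as triples and walks them in both directions. The predicates `prov:used` and `prov:wasGeneratedBy` are borrowed from the W3C PROV-O vocabulary; whether the infrastructure uses these exact terms is an assumption, and the step names and the four-step workflow are hypothetical.

```python
from collections import deque

# Hypothetical workflow: (activity, inputs, outputs) for each step.
steps = [
    ("create_structure", [], ["bulk_structure"]),
    ("relax", ["bulk_structure"], ["relaxed_structure"]),
    ("md_run", ["relaxed_structure"], ["trajectory"]),
    ("compute_energy", ["relaxed_structure"], ["energy"]),
]


def to_triples(workflow):
    """Encode steps as (activity, prov:used, input) and
    (output, prov:wasGeneratedBy, activity) triples."""
    triples = set()
    for activity, inputs, outputs in workflow:
        for item in inputs:
            triples.add((activity, "prov:used", item))
        for item in outputs:
            triples.add((item, "prov:wasGeneratedBy", activity))
    return triples


def derived_from(triples, entity):
    """Forward direction: every entity that depends on `entity`."""
    found, queue = set(), deque([entity])
    while queue:
        current = queue.popleft()
        consumers = {s for s, p, o in triples if p == "prov:used" and o == current}
        for s, p, o in triples:
            if p == "prov:wasGeneratedBy" and o in consumers and s not in found:
                found.add(s)
                queue.append(s)
    return found


def upstream_activities(triples, entity):
    """Backward direction: every activity that contributed to `entity`."""
    activities, queue = set(), deque([entity])
    while queue:
        current = queue.popleft()
        for s, p, o in triples:
            if p == "prov:wasGeneratedBy" and s == current and o not in activities:
                activities.add(o)
                queue.extend(used for act, p2, used in triples
                             if p2 == "prov:used" and act == o)
    return activities


triples = to_triples(steps)
print(sorted(derived_from(triples, "bulk_structure")))
print(sorted(upstream_activities(triples, "energy")))
```

The backward walk yields the set of steps needed to regenerate a result, which is the sense in which a recorded workflow supports partial reconstruction of a computational procedure.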
Conclusion
The resulting knowledge graph contains over 750,000 triples describing nearly 8,000 computational samples. This work provides a practical framework for improving the findability, interoperability, and reuse of atomistic simulation data.
