cuNNQS-SCI: A Fully GPU-Accelerated Framework for High-Performance Configuration Interaction Selection with Neural Network Quantum States
Summary: arXiv:2604.15768v1
Artificial intelligence (AI) has driven significant progress in accurately solving the Schrödinger equation for complex many-body systems. Among the methods proposed, the Neural Network Quantum State Selected Configuration Interaction (NNQS-SCI) method has emerged as a leading technique, known for its high accuracy and scalability. Its application to larger systems, however, has been hindered by its reliance on a hybrid CPU-GPU architecture. In that design, global de-duplication is centralized on the CPU and coupled-configuration generation runs on the host, which leads to significant computational overhead and communication bottlenecks.
To address these limitations, researchers have introduced cuNNQS-SCI, a fully GPU-accelerated SCI framework. It moves the host-side stages onto the GPUs to improve the scalability and efficiency of the NNQS-SCI method, making it applicable to larger quantum systems.
Key Features of cuNNQS-SCI
- Distributed Global De-Duplication: cuNNQS-SCI integrates a distributed, load-balanced global de-duplication algorithm. This minimizes redundancy and reduces communication overhead, allowing for better scalability across multiple GPUs.
- Fine-Grained CUDA Kernels: The framework employs specialized CUDA kernels for exact coupled configuration generation. This addresses compute limitations that previously constrained performance in larger systems.
- GPU Memory-Centric Runtime: To overcome the single-GPU memory barrier, cuNNQS-SCI incorporates a GPU memory-centric runtime. This includes features such as GPU-side pooling, streaming mini-batches, and overlapped offloading, which collectively allow for the handling of much larger configuration spaces.
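The first two features above can be illustrated with a small, CPU-only Python sketch. It is not the paper's implementation: it assumes configurations are encoded as integer occupation bitstrings, generates the single and double excitations of a configuration (ignoring spin and symmetry constraints), and assigns each generated configuration to an owner rank through a hash so that every rank de-duplicates only its own share. The names `coupled_configurations`, `mix64`, `owner_rank`, and `distributed_dedup` are hypothetical, and the list-of-lists exchange stands in for what would be an all-to-all transfer between GPUs.

```python
from collections import defaultdict
from itertools import combinations

MASK64 = (1 << 64) - 1


def coupled_configurations(config: int, n_orbitals: int) -> list:
    """Return all single and double excitations of an occupation bitstring.

    Assumption: bit i set means spin orbital i is occupied. Spin and
    point-group constraints are ignored in this sketch.
    """
    occ = [i for i in range(n_orbitals) if (config >> i) & 1]
    virt = [i for i in range(n_orbitals) if ((config >> i) & 1) == 0]
    out = []
    # Single excitations: move one electron from orbital i to orbital a.
    for i in occ:
        for a in virt:
            out.append(config ^ (1 << i) ^ (1 << a))
    # Double excitations: move two electrons (i, j) to (a, b).
    for i, j in combinations(occ, 2):
        for a, b in combinations(virt, 2):
            out.append(config ^ (1 << i) ^ (1 << j) ^ (1 << a) ^ (1 << b))
    return out


def mix64(x: int) -> int:
    """SplitMix64-style bit mixer so that owners are spread evenly."""
    x &= MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)


def owner_rank(config: int, n_ranks: int) -> int:
    """Every rank computes the same owner for the same configuration."""
    return mix64(config) % n_ranks


def distributed_dedup(per_rank_configs: list, n_ranks: int) -> list:
    """Simulate hash-partitioned de-duplication across n_ranks workers.

    per_rank_configs[r] holds the configurations generated on rank r.
    Returns one sorted, duplicate-free list per owner rank.
    """
    inbox = defaultdict(list)
    # Phase 1: each rank removes its local duplicates, then buckets the
    # survivors by owner (these buckets model the send buffers).
    for rank_configs in per_rank_configs:
        for cfg in set(rank_configs):
            inbox[owner_rank(cfg, n_ranks)].append(cfg)
    # Phase 2: each owner de-duplicates only what it received.
    return [sorted(set(inbox[r])) for r in range(n_ranks)]


if __name__ == "__main__":
    seeds_per_rank = [[0b0011], [0b0101]]  # illustrative 4-orbital seeds
    generated = [
        [c for s in seeds for c in coupled_configurations(s, 4)]
        for seeds in seeds_per_rank
    ]
    for rank, owned in enumerate(distributed_dedup(generated, n_ranks=2)):
        print(rank, [format(c, "04b") for c in owned])
```

Because ownership is a pure function of the configuration, duplicates produced on different ranks always meet at the same owner, so no central coordinator is needed. The hash mixing is what this sketch relies on for load balance; how the framework itself balances load is not described in this summary.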
The design of cuNNQS-SCI effectively shifts the computational bottleneck from host-side limitations back to on-device inference, thereby expanding the scale of solvable problems in quantum chemistry.
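The streaming mini-batch idea behind the memory-centric runtime can be sketched in a few lines of Python. This is only a control-flow illustration under stated assumptions: `toy_amplitude_model` is a hypothetical stand-in for on-device network inference, and the sketch omits the GPU-side memory pool and the overlap of transfers with compute that a real runtime would need. The point it shows is that only one fixed-size batch has to be resident at a time, regardless of how large the configuration space is.

```python
from typing import Callable, Iterable, Iterator, List


def mini_batches(configs: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive fixed-size batches; the last one may be shorter."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[int] = []
    for cfg in configs:
        batch.append(cfg)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def toy_amplitude_model(batch: List[int]) -> List[int]:
    """Hypothetical stand-in for network inference: counts occupied orbitals."""
    return [bin(cfg).count("1") for cfg in batch]


def streamed_evaluate(
    configs: Iterable[int],
    model: Callable[[List[int]], List[int]],
    batch_size: int,
) -> List[int]:
    """Evaluate the model batch by batch, preserving input order."""
    results: List[int] = []
    for batch in mini_batches(configs, batch_size):
        # Only this batch needs to be resident while the model runs.
        results.extend(model(batch))
    return results
```

Since `mini_batches` accepts any iterable, the configurations can come from a generator and never need to be materialized as one large array.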
Performance Evaluation
Evaluations on an NVIDIA A100 cluster with 64 GPUs show that cuNNQS-SCI achieves up to a 2.32× end-to-end speedup over the highly optimized NNQS-SCI baseline while maintaining the same level of chemical accuracy. It also scales well across GPUs, maintaining over 90% parallel efficiency in strong scaling tests.
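For readers unfamiliar with the metric, strong-scaling parallel efficiency compares the measured speedup with the ideal speedup when the problem size is fixed and only the GPU count grows. The helper below is a minimal sketch of that standard definition. The function name and the timings in the example are hypothetical, chosen purely for illustration, and are not measurements from the paper, which this summary does not break down by GPU count.

```python
def parallel_efficiency(t_base: float, n_base: int, t_n: float, n: int) -> float:
    """Strong-scaling efficiency relative to a baseline run.

    t_base: wall time on n_base GPUs; t_n: wall time on n GPUs,
    both for the same fixed problem size.
    """
    if min(t_base, t_n) <= 0 or min(n_base, n) <= 0:
        raise ValueError("times and GPU counts must be positive")
    speedup = t_base / t_n      # measured speedup
    ideal_speedup = n / n_base  # perfect linear scaling
    return speedup / ideal_speedup


if __name__ == "__main__":
    # Hypothetical timings, NOT from the paper: 100 s on 8 GPUs, 13.5 s on 64.
    eff = parallel_efficiency(t_base=100.0, n_base=8, t_n=13.5, n=64)
    print(f"parallel efficiency: {eff:.1%}")
```

An efficiency above 90% therefore means that the run on the larger GPU count lost less than a tenth of the ideal linear speedup to communication, load imbalance, and other overheads.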
In conclusion, cuNNQS-SCI represents a significant advancement in computational quantum chemistry, offering a robust approach to simulating large-scale quantum systems on GPU clusters. Its fully GPU-accelerated design improves performance and extends NNQS-SCI calculations to larger configuration spaces than the hybrid CPU-GPU design could handle.
