Communication-free Sampling and 4D Hybrid Parallelism for Scalable Mini-batch GNN Training
arXiv:2604.02651v1
Abstract: Graph neural networks (GNNs) are widely used for learning on graph datasets derived from various real-world scenarios. Learning from extremely large graphs requires distributed training, and mini-batching with sampling is a popular approach for parallelizing GNN training. Existing distributed mini-batch approaches have significant performance bottlenecks due to expensive sampling methods and limited scaling when using data parallelism.
In this work, we present ScaleGNN, a 4D parallel framework for scalable mini-batch GNN training that combines communication-free distributed sampling, 3D parallel matrix multiplication (PMM), and data parallelism. These components target the two bottlenecks named above: expensive sampling and the limited scaling of data parallelism alone.
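One way to picture the 4D layout is as a process grid with one data-parallel axis and three PMM axes. The sketch below maps a flat process rank to such a coordinate and back. It assumes the four dimensions are exactly these axes; the axis order and the names `GridCoord`, `rank_to_coord`, and `coord_to_rank` are hypothetical and not taken from ScaleGNN.

```python
from typing import NamedTuple


class GridCoord(NamedTuple):
    """Position of one process in a hypothetical 4D grid."""
    dp: int  # data-parallel replica index
    x: int   # first axis of the 3D PMM grid
    y: int   # second axis of the 3D PMM grid
    z: int   # third axis of the 3D PMM grid


def rank_to_coord(rank: int, dp: int, gx: int, gy: int, gz: int) -> GridCoord:
    """Map a flat rank to (dp, x, y, z), with z varying fastest (assumed order)."""
    world = dp * gx * gy * gz
    if not 0 <= rank < world:
        raise ValueError(f"rank {rank} outside world of size {world}")
    rest, z = divmod(rank, gz)
    rest, y = divmod(rest, gy)
    d, x = divmod(rest, gx)
    return GridCoord(d, x, y, z)


def coord_to_rank(c: GridCoord, gx: int, gy: int, gz: int) -> int:
    """Inverse of rank_to_coord for the same grid sizes."""
    return ((c.dp * gx + c.x) * gy + c.y) * gz + c.z
```

Under this hypothetical layout, a 2048-process job could be arranged as 4 data-parallel replicas of an 8 x 8 x 8 PMM grid. The source does not state which grid configurations were actually used.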
Key Innovations of ScaleGNN
- Communication-free Distributed Sampling: ScaleGNN introduces a uniform vertex sampling algorithm that lets each process (GPU device) construct its local mini-batch, i.e., its subgraph partitions, without any inter-process communication. This eliminates communication during sampling and targets the expensive sampling that the abstract identifies as a bottleneck in existing approaches.
- 3D Parallel Matrix Multiplication (PMM): The framework employs a 3D PMM strategy, which enables the scaling of mini-batch training to larger GPU counts than conventional data parallelism, while also minimizing communication overheads.
- Overlapping Sampling with Training: ScaleGNN overlaps the sampling process with training phases, hiding sampling time behind training work and improving resource utilization.
- Lower Precision Data Transmission: ScaleGNN reduces communication overhead by sending data in lower precision, reportedly without degrading training outcomes.
- Kernel Fusion: The framework fuses kernels. Kernel fusion combines several GPU operations into a single kernel, which generally cuts kernel-launch overhead and intermediate memory traffic.
- Communication-Computation Overlap: By overlapping communication with computation, ScaleGNN keeps GPUs busy during data transfers, which improves throughput.
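The summary does not spell out how the sampler avoids communication. One plausible mechanism, shown here purely as an assumption, is for every process to seed an identical pseudo-random generator at each training step, draw the same uniform vertex sample, and then keep only the locally stored edges whose endpoints were both sampled. The function names, the seeding scheme, and the edge-list representation below are hypothetical and are not ScaleGNN's implementation.

```python
import random
from typing import List, Sequence, Tuple

Edge = Tuple[int, int]


def sample_vertices(num_vertices: int, batch_size: int, seed: int, step: int) -> List[int]:
    """Uniformly sample vertex IDs without replacement.

    Every process calling this with the same (seed, step) gets the same
    result, so no process has to send its sample to any other.
    """
    rng = random.Random(seed * 1_000_003 + step)  # hypothetical seeding scheme
    return sorted(rng.sample(range(num_vertices), batch_size))


def local_minibatch(local_edges: Sequence[Edge], sampled: Sequence[int]) -> List[Edge]:
    """Keep the locally stored edges whose endpoints were both sampled."""
    keep = set(sampled)
    return [(u, v) for (u, v) in local_edges if u in keep and v in keep]


if __name__ == "__main__":
    # Two simulated processes, each holding half of a small edge list.
    edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3), (1, 4)]
    partitions = [edges[:4], edges[4:]]
    for rank, part in enumerate(partitions):
        sampled = sample_vertices(num_vertices=6, batch_size=4, seed=7, step=0)
        print(rank, sampled, local_minibatch(part, sampled))
```

Because each simulated process derives the sample from the shared `(seed, step)` pair, the union of the local pieces equals the subgraph induced by the sampled vertices, even though the processes never exchange data.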
Performance Evaluation
ScaleGNN was evaluated on five graph datasets and is reported to scale up to the following device counts on three high-performance computing systems:
- 2048 GPUs on the Perlmutter system
- 2048 GCDs (Graphics Compute Dies, the per-die GPU units of Frontier's AMD MI250X accelerators) on the Frontier system
- 1024 GPUs on the Tuolumne system
In these tests, ScaleGNN achieved a 3.5x end-to-end training speedup over the state-of-the-art (SOTA) baseline on the ogbn-products dataset; that is, the baseline's total training time was 3.5 times that of ScaleGNN.
Conclusion
In summary, ScaleGNN targets two bottlenecks of distributed mini-batch GNN training: expensive sampling and the limited scaling of data parallelism. It combines communication-free sampling, 3D PMM, and data parallelism with overlap, lower-precision communication, and kernel fusion optimizations, and it reports scaling to 2048 GPUs along with a 3.5x end-to-end training speedup on ogbn-products.
