ScaleGNN: Scalable Mini-batch GNN Training with 4D Parallelism

Communication-free Sampling and 4D Hybrid Parallelism for Scalable Mini-batch GNN Training

Source: arXiv:2604.02651v1 (cross-listed)

Abstract: Graph neural networks (GNNs) are widely used for learning on graph datasets derived from various real-world scenarios. Learning from extremely large graphs requires distributed training, and mini-batching with sampling is a popular approach for parallelizing GNN training. Existing distributed mini-batch approaches have significant performance bottlenecks due to expensive sampling methods and limited scaling when using data parallelism.

In this work, we present ScaleGNN, a 4D parallel framework for scalable mini-batch GNN training that combines communication-free distributed sampling, 3D parallel matrix multiplication (PMM), and data parallelism. These components target the two bottlenecks named above: expensive sampling and the limited scaling of pure data parallelism.
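To make the "4D" layout concrete, here is a minimal sketch of how a flat process rank could be mapped onto a grid with one data-parallel dimension and three PMM dimensions. The names GridCoord, rank_to_coord, and coord_to_rank are hypothetical, and the ordering of the dimensions (z varying fastest) is an assumption made for illustration; the summary does not describe how ScaleGNN actually arranges its processes.

```python
from typing import NamedTuple


class GridCoord(NamedTuple):
    """Position of one process in a hypothetical 4D process grid."""
    dp: int  # index of the data-parallel replica group
    x: int   # coordinates inside that group's 3D PMM grid
    y: int
    z: int


def rank_to_coord(rank: int, dp: int, gx: int, gy: int, gz: int) -> GridCoord:
    """Map a flat rank to (dp, x, y, z), with z varying fastest (assumed order)."""
    world_size = dp * gx * gy * gz
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside a grid of {world_size} processes")
    rest, z = divmod(rank, gz)
    rest, y = divmod(rest, gy)
    d, x = divmod(rest, gx)
    return GridCoord(d, x, y, z)


def coord_to_rank(c: GridCoord, gx: int, gy: int, gz: int) -> int:
    """Inverse of rank_to_coord for the same grid shape."""
    return ((c.dp * gx + c.x) * gy + c.y) * gz + c.z


if __name__ == "__main__":
    # 2 data-parallel groups, each a 2 x 2 x 2 PMM grid: 16 processes in total.
    for rank in range(16):
        print(rank, rank_to_coord(rank, dp=2, gx=2, gy=2, gz=2))
```

In this arrangement, processes that share a dp index would cooperate on one mini-batch through PMM, while processes with different dp indices would work on different mini-batches, as in ordinary data parallelism.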

Key Innovations of ScaleGNN

Communication-free Distributed Sampling: ScaleGNN introduces a uniform vertex sampling algorithm that lets each process (GPU device) construct its local mini-batch, i.e., its subgraph partition, without any inter-process communication. This removes the communication overhead that distributed sampling otherwise incurs.
3D Parallel Matrix Multiplication (PMM): A 3D PMM strategy lets mini-batch training scale to larger GPU counts than conventional data parallelism while keeping communication overhead low.
Overlapping Sampling with Training: Sampling is overlapped with training, so that mini-batch construction does not stall the training phases.
Lower Precision Data Transmission: ScaleGNN sends data in lower precision to reduce communication overhead, which shortens transfers while preserving training outcomes.
Kernel Fusion: The framework fuses kernels to reduce per-operation overhead.
Communication-Computation Overlap: Communication is overlapped with computation so that GPUs stay busy during data transfers, which improves throughput.
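One way to see how sampling can avoid communication is a shared-seed scheme: every process derives the same random generator state from a common seed and the training step, draws the same uniform vertex sample, and then keeps only the part that falls in its own partition. The sketch below illustrates that general idea; it is not the algorithm from the paper. The function names, the seed-mixing constant, and the contiguous block partition of vertex IDs are all assumptions.

```python
import random


def sample_batch_vertices(num_vertices, batch_size, seed, step):
    """Draw the global mini-batch vertex set.

    Every rank calls this with the same (seed, step), so every rank gets
    the same sample without exchanging any messages.
    """
    rng = random.Random(seed * 1_000_003 + step)  # arbitrary mixing constant
    return sorted(rng.sample(range(num_vertices), batch_size))


def local_slice(sampled, rank, world_size, num_vertices):
    """Keep the sampled vertices owned by `rank`.

    Assumes vertex IDs are split into contiguous, equally sized blocks.
    """
    block = -(-num_vertices // world_size)  # ceiling division
    lo, hi = rank * block, min((rank + 1) * block, num_vertices)
    return [v for v in sampled if lo <= v < hi]


def local_induced_edges(local_edges, sampled, local):
    """Filter this rank's edge list down to the sampled subgraph.

    Keeps (u, v) when u is a locally owned sampled vertex and v is any
    sampled vertex.
    """
    sampled_set, local_set = set(sampled), set(local)
    return [(u, v) for (u, v) in local_edges if u in local_set and v in sampled_set]


if __name__ == "__main__":
    batch = sample_batch_vertices(num_vertices=100, batch_size=10, seed=7, step=3)
    for rank in range(4):
        print(rank, local_slice(batch, rank, world_size=4, num_vertices=100))
```

Because the sample depends only on values that every process already knows, the per-rank slices are disjoint and together cover the whole mini-batch.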

Performance Evaluation

ScaleGNN was evaluated on five graph datasets and scaled up to the following device counts on three high-performance computing systems:

2048 GPUs on the Perlmutter system
2048 GCDs on the Frontier system
1024 GPUs on the Tuolumne system

On the ogbn-products dataset, ScaleGNN achieved a 3.5x end-to-end training speedup over the state-of-the-art (SOTA) baseline.

Conclusion

In summary, ScaleGNN targets the main bottlenecks of existing distributed mini-batch GNN training: expensive sampling and the limited scaling of data parallelism. By combining communication-free sampling, 3D PMM, and data parallelism, it enables more scalable and efficient training of graph neural networks on very large graphs.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
