Plexus: Taming Billion-edge Graphs with 3D Parallel Full-graph GNN Training
/ Authors
/ Abstract
Graph neural networks (GNNs) leverage the connectivity and structure of real-world graphs to learn intricate properties and relationships between nodes. Many real-world graphs exceed the memory capacity of a single GPU due to their sheer size, and training GNNs on such graphs requires techniques such as mini-batch sampling to scale. The alternative approach, distributed full-graph training, suffers from high communication overheads and load imbalance caused by the irregular structure of graphs. We propose a three-dimensional (3D) parallel approach for full-graph training that tackles these issues and scales to billion-edge graphs. In addition, we introduce optimizations such as a double permutation scheme for load balancing and a performance model to predict the optimal 3D configuration of our parallel implementation, Plexus. We evaluate Plexus on six graph datasets and show scaling results on up to 2048 GPUs of Perlmutter and up to 1024 GPUs of Frontier. Plexus achieves unprecedented speedups of 2.3–12.5× over the prior state of the art, and reduces time-to-solution by 5.2–8.7× on Perlmutter and 7.0–54.2× on Frontier.
Venue: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis