DISTRIBUTED GRAPH NEURAL NETWORK TRAINING WITH PERIODIC STALE REPRESENTATION SYNCHRONIZATION

Abstract

Despite the recent success of Graph Neural Networks (GNNs), it remains challenging to train a GNN on large graphs with millions of nodes and billions of edges, which are prevalent in many graph-based applications such as social networks, recommender systems, and knowledge graphs. Traditional sampling-based methods accelerate GNN training by dropping edges and nodes, which impairs graph integrity and model performance. In contrast, distributed GNN algorithms accelerate GNN training by utilizing multiple computing devices and can be classified into two types: "partition-based" methods enjoy low communication cost but suffer from information loss due to dropped edges, while "propagation-based" methods avoid information loss but suffer from prohibitive communication overhead caused by neighbor explosion. To jointly address these problems, this paper proposes DIGEST (DIstributed Graph reprEsentation SynchronizaTion), a novel distributed GNN training framework that synergizes the complementary strengths of both categories of existing methods. We propose to allow each device to utilize the stale representations of its neighbors in other subgraphs during subgraph-parallel training. This way, our method preserves global graph information from neighbors to avoid information loss while reducing communication cost. DIGEST is therefore both computation-efficient and communication-efficient, as it does not need to frequently (re-)compute and transfer the massive representation data across devices caused by neighbor explosion. DIGEST provides synchronous and asynchronous training modes for homogeneous and heterogeneous training environments, respectively. We prove that the approximation error induced by the staleness of the representations can be upper-bounded. More importantly, our convergence analysis demonstrates that DIGEST enjoys a state-of-the-art convergence rate. Extensive experimental evaluation on large, real-world graph datasets shows that DIGEST achieves up to 21.82× speedup without compromising performance, compared to state-of-the-art distributed GNN training frameworks.

1. INTRODUCTION

Graph Neural Networks (GNNs) have shown impressive success in analyzing non-Euclidean graph data and have achieved promising results in various applications, including social networks, recommender systems, and knowledge graphs (Dai et al., 2016; Ying et al., 2018; Eksombatchai et al., 2018; Lei et al., 2019; Zhu et al., 2019). Despite their great promise, GNNs face significant challenges when applied to large graphs, which are common in the real world: a large graph can contain millions or even billions of nodes. For instance, the Facebook social network graph contains over 2.9 billion users and over 400 billion friendship relations among them 1 . Amazon provides recommendations over 350 million items to 300 million users 2 . Further, natural language processing (NLP) tasks take advantage of knowledge graphs such as Freebase (Chah, 2017), which contains over 1.9 billion triples. Training GNNs on large graphs is jointly challenged by the lack of inherent parallelism in backpropagation-based optimization and the heavy inter-dependencies among graph nodes, rendering existing parallel techniques inefficient. To tackle these unique challenges, distributed GNN training is a promising open domain that has attracted fast-increasing attention in recent years. A classic and intuitive approach is sampling. To date, a good number of graph-sampling-based GNN methods have been proposed, including neighbor-sampling-based methods (e.g., GraphSAGE (Hamilton et al., 2017), VR-GCN (Chen et al., 2018)) and subgraph-sampling-based methods (e.g., Cluster-GCN (Chiang et al., 2019), GraphSAINT (Zeng et al., 2019)). These methods enable a GNN model to be trained over large graphs on a single machine by sampling a subset of data during forward or backward propagation. While sampling operations reduce the size of the data needed for computation, these methods suffer from degraded performance due to unnecessary information loss.
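To make the information loss of neighbor sampling concrete, the following minimal sketch (not taken from any of the cited systems; the toy graph and function names are illustrative) shows a GraphSAGE-style fanout cap: neighbors beyond the fanout are simply dropped, so their edges never contribute to aggregation.

```python
import random

# Hypothetical toy graph as an adjacency list: node -> list of neighbors.
graph = {
    0: [1, 2, 3, 4],
    1: [0, 2],
    2: [0, 1, 3],
    3: [0, 2, 4],
    4: [0, 3],
}

def sample_neighbors(graph, node, fanout, seed=None):
    """GraphSAGE-style neighbor sampling: keep at most `fanout`
    randomly chosen neighbors; the rest (and the information their
    edges carry) are ignored for this aggregation step."""
    rng = random.Random(seed)
    neighbors = graph[node]
    if len(neighbors) <= fanout:
        return list(neighbors)
    return rng.sample(neighbors, fanout)

sampled = sample_neighbors(graph, 0, fanout=2, seed=42)
print(len(sampled))  # 2 -- edges to node 0's other two neighbors are dropped
```

The dropped edges are exactly the "unnecessary information loss" the text refers to: the aggregated representation of node 0 is computed from a strict subset of its true neighborhood.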
To work around this drawback, and to leverage the increasingly powerful computing capability of modern hardware accelerators, recent solutions propose to train GNNs on a large number of CPU and GPU devices (Thorpe et al., 2021; Ramezani et al., 2021; Wan et al., 2022); such solutions have become the de facto standard for fast and accurate training over large graphs. Existing distributed GNN training methods can be classified into two categories, namely "partition-based" and "propagation-based", by how they tackle the trade-off between computation/communication cost and information loss. "Partition-based" methods (Angerd et al., 2020; Jia et al., 2020; Ramezani et al., 2021) partition the graph into different subgraphs by dropping the edges across subgraphs. This way, GNN training on a large graph is decomposed into many smaller training tasks, each trained on a siloed subgraph in parallel; edge dropping eliminates communication among subgraphs, and thus among tasks. However, ignoring the dependencies among nodes across subgraphs results in severe information loss and causes performance degradation. To alleviate information loss, "propagation-based" methods (Ma et al., 2019; Zhu et al., 2019; Zheng et al., 2020; Tripathy et al., 2020; Wan et al., 2022) do not ignore the edges across different subgraphs; they perform neighbor communication among subgraphs to satisfy GNNs' neighbor aggregation. However, the number of neighbors involved in neighbor aggregation grows exponentially as the GNN goes deeper (i.e., neighborhood explosion (Hamilton et al., 2017)), so these methods inevitably suffer huge communication overhead and plagued training efficiency. Therefore, although "partition-based" methods can parallelize a training job among partitioned subgraphs, they suffer from information loss and low accuracy.
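The neighborhood explosion can be seen directly by counting the receptive field of an L-layer GNN. The sketch below (a self-contained illustration on a synthetic balanced tree; the helper names are ours, not from any cited work) computes how many nodes' representations must be touched, and thus potentially communicated across devices, as depth grows.

```python
def balanced_tree(branching, depth):
    """Synthetic toy graph: a balanced tree as an adjacency list."""
    graph, next_id, frontier = {0: []}, 1, [0]
    for _ in range(depth):
        new_frontier = []
        for parent in frontier:
            for _ in range(branching):
                child = next_id
                next_id += 1
                graph[parent].append(child)
                graph[child] = [parent]
                new_frontier.append(child)
        frontier = new_frontier
    return graph

def receptive_field(graph, seeds, num_layers):
    """All nodes a `num_layers`-layer GNN must read to compute the
    seeds' outputs: the union of their <= num_layers-hop neighborhoods."""
    frontier, seen = set(seeds), set(seeds)
    for _ in range(num_layers):
        frontier = {nbr for v in frontier for nbr in graph[v]} - seen
        seen |= frontier
    return seen

g = balanced_tree(branching=4, depth=4)
for layers in range(1, 5):
    print(layers, len(receptive_field(g, {0}, layers)))
# Sizes 5, 21, 85, 341: geometric growth in depth, i.e. the
# "neighborhood explosion" that inflates cross-device communication.
```

In a partitioned setting, every node in this receptive field that lives on another device would require a message, which is why propagation-based methods pay an exponentially growing communication cost as GNNs deepen.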
"Propagation-based" methods, on the other hand, use the entire graph for training without information loss but suffer from huge communication overhead and poor efficiency. Hence, it is highly imperative to develop a method that can jointly address the problems of high communication cost and severe information loss. Moreover, theoretical guarantees (e.g., on convergence, approximation error) are not well explored for distributed GNN training due to the joint sophistication of graph structure and neural network optimization. To address the aforementioned challenges, we propose a novel distributed GNN training framework that synergizes the complementary strengths of both partitioning-based and propagating-based methods, named DIstributed Graph reprEsentation SynchronizaTion, or DIGEST. DIGEST does not completely discard node information from other subgraphs in order to avoid unnecessary information loss; DIGEST does not frequently update all the node information in order to minimize communication costs. Instead, DIGEST extends the idea of single-GPU-based GCN training with stale representations (Chen et al., 2018; Fey et al., 2021) 2021) while the asynchronous version can handle the straggler issue (Chen et al., 2016; Zheng et al., 2017) in synchronous version and enjoys even better performance. From the system aspect, DIGEST (1) enables efficient, cross-device representation exchanging by using a shared-memory key-value storage (KVS) system, (2) supports both synchronous and asynchronous parameter updating, and (3) overlaps the computation (layer training) with I/Os (pushing/pulling representations to/from the KVS. Furthermore, we proved that the approximation error induced by the staleness of representation can be bounded. More importantly, global convergence guarantee is provided, which demonstrates that DIGEST has the state-of-the-art convergence rate. Our main contributions can be summarized as: 



https://backlinko.com/facebook-users https://amzscout.net/blog/amazon-statistics



to a distributed setting, by enabling each device to efficiently exchange a relatively stale version of the neighbor representations from other subgraphs, to achieve scalable and high-performance GNN training. This effectively avoids neighbor updating explosion and reduces communication costs across training devices. Considering naive synchronous distributed training that inherently lacks the capability of handling stragglers caused by training environment heterogeneity (e.g. GPU resource heterogeneity), we further design an asynchronous version of DIGEST (DIGEST-A), where each subgraph follows a non-blocking training manner. The synchronous version is a natural generalization of Fey et al. (

• Proposing a novel distributed GNN training framework that synergizes the benefits of partition-based and propagation-based methods. Existing work in distributed GNN training
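As an illustration of the stale-representation exchange described above, the following minimal single-process sketch uses a plain dict to stand in for the shared-memory key-value store; all names here are illustrative assumptions, not DIGEST's actual API. Each device pushes its boundary-node representations after a local layer computation and pulls whatever (possibly stale) versions the store currently holds, instead of triggering recursive cross-device neighbor propagation.

```python
import numpy as np

kvs = {}  # node id -> last pushed (possibly stale) representation

def push_representations(kvs, node_ids, reps):
    """After a local layer computation, publish boundary-node
    representations so other partitions can read them later."""
    for nid, rep in zip(node_ids, reps):
        kvs[nid] = rep

def pull_stale(kvs, node_ids, dim):
    """Read whatever version the KVS currently holds; a neighbor that
    was never pushed falls back to zeros. No recompute, no recursive
    cross-device propagation."""
    return np.stack([kvs.get(nid, np.zeros(dim)) for nid in node_ids])

# Device A publishes representations of its boundary nodes ...
push_representations(kvs, [7, 8], [np.ones(4), 2 * np.ones(4)])
# ... Device B aggregates them, plus a not-yet-seen neighbor (id 9),
# at the cost of one KVS read instead of a neighbor expansion.
h_neighbors = pull_stale(kvs, [7, 8, 9], dim=4)
print(h_neighbors.shape)  # (3, 4)
```

The staleness comes from the fact that the pulled vectors reflect the pushing device's state at its last synchronization point, which is exactly the quantity whose approximation error the paper bounds.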

