DISTRIBUTED GRAPH NEURAL NETWORK TRAINING WITH PERIODIC STALE REPRESENTATION SYNCHRONIZATION

Abstract

Despite the recent success of Graph Neural Networks (GNNs), it remains challenging to train a GNN on large graphs with millions of nodes and billions of edges, which are prevalent in many graph-based applications such as social networks, recommender systems, and knowledge graphs. Traditional sampling-based methods accelerate GNN training by dropping edges and nodes, which impairs graph integrity and model performance. In contrast, distributed GNN algorithms accelerate GNN training by utilizing multiple computing devices and can be classified into two types: "partition-based" methods enjoy low communication cost but suffer from information loss due to dropped edges, while "propagation-based" methods avoid information loss but suffer from prohibitive communication overhead caused by neighbor explosion. To jointly address these problems, this paper proposes DIGEST (DIstributed Graph reprEsentation SynchronizaTion), a novel distributed GNN training framework that synergizes the complementary strengths of both categories of existing methods. We propose to allow each device to utilize the stale representations of its neighbors in other subgraphs during subgraph-parallel training. This way, our method preserves global graph information from neighbors to avoid information loss and reduces the communication cost. DIGEST is therefore both computation-efficient and communication-efficient, as it does not need to frequently (re-)compute and transfer the massive representation data across devices that neighbor explosion would otherwise require. DIGEST provides synchronous and asynchronous training modes for homogeneous and heterogeneous training environments, respectively. We prove that the approximation error induced by the staleness of the representations can be upper-bounded. More importantly, our convergence analysis demonstrates that DIGEST enjoys a state-of-the-art convergence rate. Extensive experimental evaluation on large, real-world graph datasets shows that DIGEST achieves up to 21.82× speedup without compromising performance compared to state-of-the-art distributed GNN training frameworks.

1. INTRODUCTION

Graph Neural Networks (GNNs) have shown impressive success in analyzing non-Euclidean graph data and have achieved promising results in various applications, including social networks, recommender systems, and knowledge graphs (Dai et al., 2016; Ying et al., 2018; Eksombatchai et al., 2018; Lei et al., 2019; Zhu et al., 2019). Despite the great promise of GNNs, they face significant challenges when applied to large graphs, which are common in the real world: the number of nodes in a large graph can reach millions or even billions. For instance, the Facebook social network graph contains over 2.9 billion users and over 400 billion friendship relations among users. Amazon provides recommendations over 350 million items to 300 million users. Further, natural language processing (NLP) tasks take advantage of knowledge graphs, such as Freebase (Chah, 2017) with over 1.9 billion triples. Training GNNs on large graphs is jointly challenged by the lack of inherent parallelism in backpropagation-based optimization and the heavy inter-dependencies among graph nodes, rendering existing parallel techniques inefficient. To tackle these unique challenges, distributed GNN training is a promising open domain that has attracted fast-increasing attention in recent years. A classic and intuitive approach is sampling. A good number of graph-sampling-based GNN methods have been proposed so far, including neighbor-sampling-based methods (e.g., GraphSAGE (Hamilton et al., 2017), VR-GCN (Chen et al., 2018)) and subgraph-sampling-based methods (e.g., Cluster-GCN (Chiang et al., 2019), GraphSAINT (Zeng et al., 2019)). These methods enable a GNN model to be trained over large graphs on a single machine by sampling a subset of data during forward or backward propagation. While sampling reduces the size of the data needed for computation, these methods suffer from degraded performance due to unnecessary information loss.
To work around this drawback and to leverage the increasingly powerful computing capability of modern hardware accelerators, recent solutions propose to train GNNs on a large number of CPU and GPU devices (Thorpe et al., 2021; Ramezani et al., 2021; Wan et al., 2022), and these solutions have become the de facto standard for fast and accurate training over large graphs. Existing methods for distributed GNN training can be classified into two categories, namely "partition-based" and "propagation-based", by how they tackle the trade-off between computation/communication cost and information loss. "Partition-based" methods (Angerd et al., 2020; Jia et al., 2020; Ramezani et al., 2021) partition the graph into different subgraphs by dropping the edges across subgraphs. This way, GNN training on a large graph is decomposed into many smaller training tasks, each trained on a siloed subgraph in parallel, and edge dropping reduces the communication among subgraphs and hence among tasks. However, this results in severe information loss, because the dependencies among nodes across subgraphs are ignored, and causes performance degradation. To alleviate information loss, "propagation-based" methods (Ma et al., 2019; Zhu et al., 2019; Zheng et al., 2020; Tripathy et al., 2020; Wan et al., 2022) keep the edges across different subgraphs and rely on neighbor communication among subgraphs to satisfy the GNN's neighbor aggregation. However, the number of neighbors involved in neighbor aggregation grows exponentially as the GNN goes deeper (i.e., neighborhood explosion (Hamilton et al., 2017)), so these methods inevitably suffer from huge communication overhead and poor training efficiency. In short, although "partition-based" methods can parallelize a training job among partitioned subgraphs, they suffer from information loss and low accuracy.
"Propagation-based" methods, on the other hand, use the entire graph for training without information loss but suffer from huge communication overhead and poor efficiency. Hence, it is imperative to develop a method that can jointly address the problems of high communication cost and severe information loss. Moreover, theoretical guarantees (e.g., on convergence and approximation error) are not well explored for distributed GNN training due to the joint sophistication of graph structure and neural network optimization. To address the aforementioned challenges, we propose a novel distributed GNN training framework that synergizes the complementary strengths of both partition-based and propagation-based methods, named DIstributed Graph reprEsentation SynchronizaTion, or DIGEST. DIGEST does not completely discard node information from other subgraphs, which avoids unnecessary information loss; nor does it frequently update all the node information, which minimizes communication costs. Instead, DIGEST extends the idea of single-GPU-based GCN training with stale representations (Chen et al., 2018; Fey et al., 2021) to a distributed setting, by enabling each device to efficiently exchange a relatively stale version of the neighbor representations from other subgraphs, to achieve scalable and high-performance GNN training. This effectively avoids the explosion of neighbor updates and reduces communication costs across training devices. Because naive synchronous distributed training inherently lacks the capability of handling stragglers caused by training environment heterogeneity (e.g., GPU resource heterogeneity), we further design an asynchronous version of DIGEST (DIGEST-A), where each subgraph is trained in a non-blocking manner. The synchronous version is a natural generalization of Fey et al. (2021), while the asynchronous version can handle the straggler issue (Chen et al., 2016; Zheng et al., 2017) of the synchronous version and enjoys even better performance. From the system aspect, DIGEST (1) enables efficient, cross-device representation exchange by using a shared-memory key-value storage (KVS) system, (2) supports both synchronous and asynchronous parameter updating, and (3) overlaps the computation (layer training) with I/O (pushing/pulling representations to/from the KVS). Furthermore, we prove that the approximation error induced by the staleness of representations can be bounded. More importantly, a global convergence guarantee is provided, which demonstrates that DIGEST has a state-of-the-art convergence rate. Our main contributions can be summarized as: We also showed the upper bound on the approximation error of gradients due to the staleness. • Conducting comprehensive empirical evaluation of both performance and speedup. We perform extensive evaluation on four benchmarks with classic GNNs (e.g., GCN (Hamilton et al., 2017) and GAT (Veličković et al., 2017)). The experimental results show that, in the best case, DIGEST improves the performance by 33.14% and achieves 21.82× speedup in training time compared to two state-of-the-art distributed GNN training frameworks.

2. BACKGROUND AND PROBLEM FORMULATION

In this section, we first introduce the Graph Neural Network (GNN) and its training on a single machine, and then formulate the problem of distributed GNN training.

Graph Neural Networks. GNNs aim to learn a function of signals/features on a graph $G(V, E)$ with node representations $X \in \mathbb{R}^{|V| \times d}$, where $d$ denotes the node feature dimension. For typical semi-supervised node classification tasks (Kipf & Welling, 2016), where each node $v \in V$ is associated with a label $y_v$, an $L$-layer GNN $f$ is trained to learn the node representation $h_v$ such that $y_v$ can be predicted accurately. The training process of a GNN can be described as node representation learning based on the message passing mechanism (Gilmer et al., 2017). Analytically, given a graph $G(V, E)$ and a node $v \in V$, the $(\ell+1)$-th layer of the GNN is defined as

$$h_v^{(\ell+1)} = f^{(\ell+1)}\Big(h_v^{(\ell)}, \big\{h_u^{(\ell)} : u \in N(v)\big\}\Big) = \Psi^{(\ell+1)}\Big(h_v^{(\ell)}, \Phi^{(\ell+1)}\big(\big\{h_u^{(\ell)} : u \in N(v)\big\}\big)\Big), \tag{1}$$

where $h_v^{(\ell)}$ denotes the representation of node $v$ in the $\ell$-th layer, $h_v^{(0)}$ is initialized to $x_v$ (the $v$-th row of $X$), and $N(v)$ represents the set of direct 1-hop neighbors of node $v$. Each layer of the GNN, i.e., $f^{(\ell)}$, can be further decomposed into two components: 1) an aggregation function $\Phi^{(\ell)}$, which takes the representations of node $v$'s neighbors as input and outputs the aggregated neighborhood representation; 2) an updating function $\Psi^{(\ell)}$, which combines the representation of $v$ and the aggregated neighborhood representation to update the representation of node $v$ for the next layer. Different types of GNNs use different functions for $\Phi^{(\ell)}$ and $\Psi^{(\ell)}$. To train a GNN on a single machine, one can minimize the empirical loss $\mathcal{L}(W)$ over the entire graph in the training data, i.e.,

$$\mathcal{L}(W) = \frac{1}{|V|} \sum_{v \in V} \mathrm{Loss}\big(h_v^{(L)}, y_v\big),$$

where $\mathrm{Loss}(\cdot, \cdot)$ denotes a loss function (e.g., cross-entropy loss), and $h_v^{(L)}$ denotes the representation of node $v$ from the last layer of the GNN, which can be calculated by following Eq. 1 recursively.
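The aggregate-then-update structure of Eq. 1 can be illustrated with a minimal NumPy sketch. The choices below are assumptions made for illustration only and are not taken from the paper: $\Phi$ is a mean over neighbors, $\Psi$ is a linear map followed by ReLU, and the names `gnn_layer`, `w_self`, and `w_neigh` are hypothetical.

```python
import numpy as np

def gnn_layer(h, neighbors, w_self, w_neigh):
    """One message-passing layer: mean aggregation (Phi), then a ReLU update (Psi)."""
    out = np.zeros((h.shape[0], w_self.shape[1]))
    for v, nbrs in neighbors.items():
        # Phi: aggregate the representations of the 1-hop neighbors N(v).
        agg = h[nbrs].mean(axis=0) if nbrs else np.zeros(h.shape[1])
        # Psi: combine v's own representation with the aggregated neighborhood.
        out[v] = np.maximum(h[v] @ w_self + agg @ w_neigh, 0.0)
    return out

# Toy path graph 0 - 1 - 2 with 2-dimensional input features.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
h1 = gnn_layer(x, neighbors, np.eye(2), np.eye(2))
```

Stacking $L$ such calls yields $h_v^{(L)}$; each extra layer pulls in neighbors one hop further away, which is the source of the neighborhood growth discussed in the introduction.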
Distributed Training for GNNs. Distributed GNN training first partitions the original graph into multiple non-overlapping subgraphs, which can also be considered as mini-batches. Different mini-batches are then trained on different devices in parallel. Here, Eq. 1 can be further reformulated as

$$h_v^{(\ell+1)} = \Psi^{(\ell+1)}\Big(h_v^{(\ell)}, \Phi^{(\ell+1)}\big(\underbrace{\big\{h_u^{(\ell)} : u \in N(v) \cap S(v)\big\}}_{\text{In-subgraph nodes}} \cup \underbrace{\big\{h_u^{(\ell)} : u \in N(v) \setminus S(v)\big\}}_{\text{Out-of-subgraph nodes}}\big)\Big), \tag{2}$$

where $S(v)$ denotes the subgraph that node $v$ belongs to. In this paper, we consider the distributed training of GNNs with multiple local machines and a global server. The original input graph $G$ is first partitioned into $M$ subgraphs, where each $G_m(V_m, E_m)$ represents subgraph $m$. Our goal is to find the optimal set of parameters $W$ in a distributed manner by minimizing each local loss, i.e.,

$$\min_{W} \mathcal{L}_m^{\mathrm{Local}}(W_m) = \frac{1}{|V_m|} \sum_{v \in V_m} \mathrm{Loss}\big(h_v^{(L)}, y_v\big), \quad m = 1, 2, \cdots, M, \tag{3}$$

in parallel, where $W_m = \{W_m^{(\ell)}\}_{\ell=1}^{L}$ are the local parameters and $h_v^{(L)}$ follows Eq. 2 recursively.

Challenges. The main challenges for distributed training of GNNs lie in the trade-off between communication cost and information loss. "Partition-based" methods generalize the existing data parallelism techniques of classical distributed training on i.i.d. data to graph data and enjoy minimal communication cost. However, directly partitioning a large graph into multiple subgraphs can result in severe information loss, because a huge number of cross-subgraph edges are ignored, and can cause performance degradation (Angerd et al., 2020; Jia et al., 2020; Ramezani et al., 2021). For these methods, the representations of neighbors outside the current subgraph (the second representation set in Eq. 2) are dropped, and the connections between subgraphs are thus ignored.
Hence, another line of work (Wang et al., 2019), namely "propagation-based" methods, uses communication of neighbor nodes for each subgraph to satisfy the GNN's neighbor aggregation, which minimizes the information loss. As shown in Eq. 2, the representations of neighbor nodes outside the current subgraph are exchanged between different subgraphs. However, the number of neighbors involved in the neighbor aggregation process expands exponentially as the GNN model goes deeper, which is known as the neighborhood explosion problem. Hence, although no edges are dropped in this case, unavoidable communication overhead is incurred and limits the achievable training efficiency (Ma et al., 2019; Zhu et al., 2019; Zheng et al., 2020; Tripathy et al., 2020; Wan et al., 2022). Moreover, theoretical guarantees (e.g., on convergence and approximation error) are not well explored for distributed GNN training due to the joint sophistication of graph structure and neural network optimization.
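The two neighbor sets that appear in Eq. 2 can be made concrete with a small sketch. It assumes a node-to-partition mapping is already available (the partitioning algorithm itself is outside the scope of this sketch), and the names `split_neighbors` and `part_of` are hypothetical.

```python
def split_neighbors(neighbors, part_of):
    """Split each node's 1-hop neighbors into in-subgraph and out-of-subgraph sets."""
    in_sub, out_sub = {}, {}
    for v, nbrs in neighbors.items():
        # Neighbors in the same partition as v: N(v) ∩ S(v).
        in_sub[v] = [u for u in nbrs if part_of[u] == part_of[v]]
        # Neighbors in another partition: N(v) \ S(v).
        out_sub[v] = [u for u in nbrs if part_of[u] != part_of[v]]
    return in_sub, out_sub

# Toy 4-cycle 0-1-3-2-0 split into two partitions {0, 1} and {2, 3}.
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
part_of = {0: 0, 1: 0, 2: 1, 3: 1}
in_sub, out_sub = split_neighbors(neighbors, part_of)

# Each undirected cut edge is seen from both endpoints, hence the division by 2.
cut_edges = sum(len(nbrs) for nbrs in out_sub.values()) // 2
```

A partition-based method would discard everything in `out_sub` (here, two of the four edges), whereas a propagation-based method would communicate fresh representations for every entry of `out_sub` at every layer.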

3. PROPOSED METHOD

In this section, we introduce the proposed GNN training framework DIGEST. DIGEST leverages both types of representations in Eq. 2 to address the information loss issue. In addition, instead of exchanging real-time representations between the subgraphs during training, DIGEST only pulls and pushes stale representations periodically, before or after a training step. With this strategy, communication becomes more efficient, as illustrated in Figure 1 and analyzed in more detail in Section 3.3. Moreover, we prove that the error introduced by the staleness of the representations is upper-bounded and that convergence is guaranteed.

3.1. DISTRIBUTED GNN TRAINING WITH FULL-GRAPH AWARENESS

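In DIGEST, each copy of the GNN trained on a local machine makes use of all available graph information, i.e., no edges are dropped in either forward or backward propagation. Analytically, calculating each local gradient $\nabla \mathcal{L}_m^{\mathrm{Local}}$ as defined in Eq. 3 involves out-of-subgraph neighbor information. For out-of-subgraph neighbor nodes, we approximate their representations via stale representations acquired in previous training, denoted by $\tilde{h}_v^{(\ell)}$. Formally, given a node $v \in G_m(V_m, E_m)$, the forward propagation for the $(\ell+1)$-th layer of DIGEST is obtained by modifying Eq. 2 as

$$h_v^{(\ell+1)} = \Psi^{(\ell+1)}\Big(h_v^{(\ell)}, \Phi^{(\ell+1)}\big(\big\{h_u^{(\ell)} : u \in N(v) \cap V_m\big\} \cup \underbrace{\big\{\tilde{h}_u^{(\ell)} : u \in N(v) \setminus V_m\big\}}_{\text{Stale representation}}\big)\Big). \tag{4}$$

As can be seen, DIGEST considers the information of ALL neighbor nodes during forward propagation. Moreover, leveraging the entire graph in forward propagation in turn improves the gradient estimate in backpropagation. To see this, we reformulate Eq. 4 into matrix form:

$$H_{\mathrm{in}}^{(\ell+1,m)} = F\big(H_{\mathrm{in}}^{(\ell,m)}, \tilde{H}_{\mathrm{out}}^{(\ell,m)}\big) := \sigma\Big(P_{\mathrm{in}}^{(m)} H_{\mathrm{in}}^{(\ell,m)} W_m^{(\ell+1)} + P_{\mathrm{out}}^{(m)} \tilde{H}_{\mathrm{out}}^{(\ell,m)} W_m^{(\ell+1)}\Big),$$

where $H_{\mathrm{in}}^{(\ell,m)}$ and $\tilde{H}_{\mathrm{out}}^{(\ell,m)}$ denote the matrix of in-subgraph node representations and the matrix of out-of-subgraph stale representations at the $\ell$-th layer on subgraph $G_m$, respectively, and $F$ denotes the forward propagation function of one GNN layer, introduced for compactness. We use the GCN model as an example for illustration, but our analyses apply to general GNN models. Here $P_{\mathrm{in}}^{(m)}$ and $P_{\mathrm{out}}^{(m)}$ are the propagation matrices applied to the in-subgraph and out-of-subgraph nodes, respectively. The gradient with respect to the layer weights is

$$\frac{\partial}{\partial W_m^{(\ell+1)}} F\big(H_{\mathrm{in}}^{(\ell,m)}, \tilde{H}_{\mathrm{out}}^{(\ell,m)}\big) = \frac{\partial}{\partial W_m^{(\ell+1)}} \sigma\Big(P_{\mathrm{in}}^{(m)} H_{\mathrm{in}}^{(\ell,m)} W_m^{(\ell+1)} + P_{\mathrm{out}}^{(m)} \tilde{H}_{\mathrm{out}}^{(\ell,m)} W_m^{(\ell+1)}\Big) = \Big(P_{\mathrm{in}}^{(m)} H_{\mathrm{in}}^{(\ell,m)} + P_{\mathrm{out}}^{(m)} \tilde{H}_{\mathrm{out}}^{(\ell,m)}\Big)^{\top} \sigma'\Big(P_{\mathrm{in}}^{(m)} H_{\mathrm{in}}^{(\ell,m)} W_m^{(\ell+1)} + P_{\mathrm{out}}^{(m)} \tilde{H}_{\mathrm{out}}^{(\ell,m)} W_m^{(\ell+1)}\Big).$$

The key observation here is that ALL neighbor nodes are involved in backpropagation, since the gradient above depends on $\tilde{H}_{\mathrm{out}}^{(\ell,m)}$.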
The separation of in-subgraph nodes and out-of-subgraph nodes, and their approximation via stale representation form the very foundation of DIGEST.
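The matrix form of the DIGEST forward pass can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: $\sigma$ is taken to be ReLU, the toy propagation matrix is made up, and `digest_layer` is a hypothetical name rather than a function from the authors' code.

```python
import numpy as np

def digest_layer(p_in, h_in, p_out, h_out_stale, w):
    """sigma(P_in H_in W + P_out H~_out W) with sigma = ReLU (an assumption)."""
    z = p_in @ h_in @ w + p_out @ h_out_stale @ w
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
# Toy full-graph propagation matrix over 3 nodes; subgraph m owns nodes 0 and 1.
p_full = np.array([[0.50, 0.50, 0.00],
                   [0.25, 0.50, 0.25],
                   [0.00, 0.50, 0.50]])
h_full = rng.normal(size=(3, 4))
w = rng.normal(size=(4, 2))

p_in, p_out = p_full[:2, :2], p_full[:2, 2:]   # in-subgraph / out-of-subgraph blocks
h_in, h_out = h_full[:2], h_full[2:]

exact = np.maximum(p_full @ h_full @ w, 0.0)[:2]                     # full-graph layer
fresh = digest_layer(p_in, h_in, p_out, h_out, w)                    # zero staleness
stale = digest_layer(p_in, h_in, p_out, h_out + 0.1, w)              # perturbed copy
dropped = digest_layer(p_in, h_in, p_out, np.zeros_like(h_out), w)   # partition-only
```

With zero staleness the block decomposition reproduces the full-graph layer exactly, and because ReLU is 1-Lipschitz the deviation of `stale` from `fresh` is controlled by the size of the perturbation propagated through `p_out` and `w`. The `dropped` variant corresponds to what a partition-based method would compute.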

3.2. SYSTEM DESIGN

This section presents the overall system design of DIGEST as depicted in Figure 1. DIGEST maintains a shared-memory-based KVS for storing and retrieving representations. The KVS can be extended to a truly distributed storage system to support large-scale distributed training spanning multiple servers. We first introduce two operations used by DIGEST to store and retrieve representations. The stale representations of layer $\ell$ for all nodes in $V$ can be formulated as $\tilde{H}^{(\ell)} = \{\tilde{h}_v^{(\ell)} : v \in V\}$. For any subgraph $G_m$ to start the forward process of layer $\ell$, the necessary stale representations $\tilde{H}_{\mathrm{out}}^{(\ell,m)} = \{\tilde{h}_u^{(\ell)} : u \in N(v) \setminus V_m, \forall v \in V_m\}$ are pulled from the KVS that stores representations; this is called a "pull" operation, denoted as $H_{\mathrm{out}}^{(\ell,m)} \leftarrow \tilde{H}_{\mathrm{out}}^{(\ell,m)}$ (see Figure 1(c)). At the end of an epoch, the newly computed representations $H_{\mathrm{in}}^{(\ell,m)} = \{h_v^{(\ell)} : \forall v \in V_m\}$ are pushed to the KVS, and these newly stored representations will be fetched as stale representations in future epochs; this is called a "push" operation, denoted as $H_{\mathrm{in}}^{(\ell,m)} \rightarrow \tilde{H}_{\mathrm{in}}^{(\ell,m)}$. DIGEST features two training modes: (1) DIGEST, a synchronous mode designed for homogeneous training environments; (2) DIGEST-A, an asynchronous mode that better fits heterogeneous training environments. DIGEST and DIGEST-A follow different parameter and representation updating strategies. In DIGEST, for each global round, before fetching the aggregated parameters and pulling the stale representations, each subgraph has to wait for the other subgraphs to finish uploading their latest parameters to the parameter server (PS) and their local representations to the KVS. However, some subgraphs, which we call stragglers, may have fewer computing resources than others. This may lead to imbalanced local training times. In this case, with the synchronous mode, the overall training process can be bottlenecked by the slowest subgraph and therefore suffers from prolonged training time.
To address this issue, DIGEST-A applies an asynchronous, non-blocking strategy, where each subgraph directly pulls/pushes stale representations of other subgraphs from/to the shared KVS and downloads/uploads parameters from/to the PS without waiting for the slowest subgraph to finish. For better scalability, we will explore disaggregated storage techniques (Klimovic et al., 2016; Nanavati et al., 2017; Amaro et al., 2020) as part of our future work, where DIGEST can utilize a network-attached, high-performance far-memory storage system for representation storage and retrieval. Due to limited space, we summarize our algorithm in Algorithm 1 in the appendix. In addition, DIGEST and DIGEST-A use several optimizations to minimize the I/O overhead introduced by pulls and pushes. First, we observe that a large number of node representations are involved in both pull and push operations and, more importantly, that nodes are independent of each other in these two operations. The pull operation for $\tilde{H}_{\mathrm{out}}^{(\ell,m)}$ can be overlapped with the forward process of layer $\ell - 1$; similarly, the push operation for $H_{\mathrm{in}}^{(\ell,m)}$ can be overlapped with the forward process of layer $\ell + 1$. The training process on each subgraph is depicted in Figure 2. The cost of pull/push operations is hidden behind the layer forward process and is therefore effectively eliminated.
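The pull/push operations and their overlap with layer computation can be sketched as follows. This is a single-process illustration under several assumptions: `ToyKVS` is a hypothetical dictionary-backed stand-in for the shared-memory KVS (it does not use Plasma), overlap is approximated with a thread pool, pushes are issued per layer rather than per epoch, and `forward_with_overlap` is an illustrative name.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

class ToyKVS:
    """Hypothetical in-process stand-in for the shared-memory KVS."""

    def __init__(self):
        self._store = {}

    def push(self, layer, node_ids, reps):
        # Store a copy so later in-place changes do not alter the stale snapshot.
        for v, rep in zip(node_ids, reps):
            self._store[(layer, v)] = np.array(rep, copy=True)

    def pull(self, layer, node_ids):
        return np.stack([self._store[(layer, v)] for v in node_ids])

def forward_with_overlap(kvs, layer_fns, h_in, in_ids, out_ids):
    """Run layers in order; prefetch the next pull and push results in the background."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        pending_pull = pool.submit(kvs.pull, 0, out_ids)
        pending_pushes = []
        for ell, layer_fn in enumerate(layer_fns):
            h_out_stale = pending_pull.result()
            if ell + 1 < len(layer_fns):
                # The pull for layer ell + 1 overlaps with the forward pass of layer ell.
                pending_pull = pool.submit(kvs.pull, ell + 1, out_ids)
            h_in = layer_fn(h_in, h_out_stale)
            # The push of the new representations overlaps with the next layer.
            pending_pushes.append(pool.submit(kvs.push, ell + 1, in_ids, h_in))
        for fut in pending_pushes:
            fut.result()
    return h_in

# Toy setup: the subgraph owns nodes 0 and 1; node 2 lives in another subgraph.
kvs = ToyKVS()
kvs.push(0, [2], np.array([[1.0, 1.0]]))
kvs.push(1, [2], np.array([[2.0, 2.0]]))
toy_layer = lambda h_in, h_out: h_in + h_out.sum(axis=0)
out = forward_with_overlap(kvs, [toy_layer, toy_layer], np.zeros((2, 2)), [0, 1], [2])
```

Pulls and pushes touch disjoint keys here (out-of-subgraph versus in-subgraph nodes), which is why they can run concurrently without coordination in this sketch; a real shared store would still need to decide how readers see partially written updates.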

4. THEORETICAL ANALYSES

In this section, we provide theoretical analyses of the proposed distributed strategy DIGEST, including a bound on the error induced by the staleness of node representations and convergence guarantees for DIGEST under both synchronous and asynchronous settings. All proofs can be found in the appendix.

4.1. ERROR BOUND ON GLOBAL APPROXIMATED GRADIENTS

Our first theorem shows that, under the distributed setting, the approximation error of the global model's gradients can be upper-bounded in terms of the staleness of node representations.

Theorem 1. Given an $L$-layer GNN $f_W$ with $r_1$-Lipschitz smooth $\Phi$ and $r_2$-Lipschitz smooth $\Psi$. Denote $\Delta(G)$ as the maximal node degree of graph $G$. Assume that $\forall v \in V$ and $\forall \ell \in \{1, 2, \cdots, L-1\}$ we have $\|h_v^{(\ell)} - \tilde{h}_v^{(\ell)}\| \le \epsilon^{(\ell)}$. Then

$$\big\|\nabla_W \mathcal{L} - \nabla_W \mathcal{L}^{*}\big\|_2 \le \frac{\tau}{M} \sum_{\ell=1}^{L-1} \epsilon^{(\ell)} r_1^{L-\ell} r_2^{L-\ell} \sum_{m=1}^{M} |\Delta(G_m)|^{L-\ell},$$

where $\nabla_W \mathcal{L}$ and $\nabla_W \mathcal{L}^{*}$ denote the global gradient computed by DIGEST and the exact global gradient without any staleness, respectively.
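To get a feel for how the bound in Theorem 1 scales, its right-hand side can be evaluated numerically. The sketch below reads the bound as $(\tau/M) \sum_{\ell=1}^{L-1} \epsilon^{(\ell)} r_1^{L-\ell} r_2^{L-\ell} \sum_{m=1}^{M} |\Delta(G_m)|^{L-\ell}$; the constant $\tau$ is simply taken as an input because this section does not define it, and all numeric values are made up for illustration.

```python
def staleness_gradient_bound(eps, r1, r2, max_degrees, tau):
    """Right-hand side of the Theorem 1 bound.

    eps[l - 1] is the staleness bound eps^(l) for layers l = 1, ..., L - 1;
    max_degrees[m] is the maximal node degree of subgraph G_m.
    """
    num_layers = len(eps) + 1          # L
    num_subgraphs = len(max_degrees)   # M
    total = 0.0
    for ell, eps_ell in enumerate(eps, start=1):
        power = num_layers - ell
        degree_term = sum(d ** power for d in max_degrees)
        total += eps_ell * (r1 ** power) * (r2 ** power) * degree_term
    return tau / num_subgraphs * total

# Hypothetical 3-layer model on two subgraphs with maximal degrees 2 and 3.
bound = staleness_gradient_bound(eps=[0.1, 0.01], r1=1.0, r2=1.0,
                                 max_degrees=[2, 3], tau=1.0)
```

Staleness in early layers is weighted by the highest powers of the degree and Lipschitz constants, so in this reading the bound is most sensitive to stale representations near the input on dense subgraphs.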

4.2. CONVERGENCE OF SYNCHRONOUS DIGEST

As both fresh in-subgraph and stale out-of-subgraph representations are adopted in our algorithm, its convergence rate is still unknown. We have proved the convergence of DIGEST and present the convergence property in the theorem below. First, we introduce some assumptions.

Assumption 1. The loss function $\mathrm{Loss}(\cdot, \cdot)$ is $C_{\mathrm{Loss}}$-Lipschitz continuous and $L_{\mathrm{Loss}}$-Lipschitz smooth with respect to the last layer's node representation, i.e., $|\mathrm{Loss}(h_v^{(L)}, y_v) - \mathrm{Loss}(h_w^{(L)}, y_v)| \le C_{\mathrm{Loss}} \|h_v^{(L)} - h_w^{(L)}\|_2$ and $\|\nabla \mathrm{Loss}(h_v^{(L)}, y_v) - \nabla \mathrm{Loss}(h_w^{(L)}, y_v)\|_2 \le L_{\mathrm{Loss}} \|h_v^{(L)} - h_w^{(L)}\|_2$.

Assumption 2. The activation function $\sigma(\cdot)$ is $C_\sigma$-Lipschitz continuous and $L_\sigma$-Lipschitz smooth, i.e., $\|\sigma(Z_1^{(\ell)}) - \sigma(Z_2^{(\ell)})\|_2 \le C_\sigma \|Z_1^{(\ell)} - Z_2^{(\ell)}\|_2$ and $\|\sigma'(Z_1^{(\ell)}) - \sigma'(Z_2^{(\ell)})\|_2 \le L_\sigma \|Z_1^{(\ell)} - Z_2^{(\ell)}\|_2$.

Assumption 3. $\forall \ell = 1, 2, \cdots, L$, we have $\|W^{(\ell)}\|_F \le K_W$, $\|P^{(\ell)}\|_P \le K_W$, $\|X^{(\ell)}\|_F \le K_X$.

Theorem 2. Consider a GCN with $L$ layers that is $L_f$-Lipschitz smooth. $\forall \epsilon > 0$, $\exists$ a constant $E > 0$ such that we can choose a learning rate $\eta = \frac{\sqrt{M \epsilon}}{E}$ and a number of training iterations $T = \big(\mathcal{L}(W^{(1)}) - \mathcal{L}(W^{*})\big) \frac{E}{\sqrt{M}} \epsilon^{-\frac{3}{2}}$ such that

$$T^{-1} \sum_{t=1}^{T} \big\|\nabla \mathcal{L}(W^{(t)})\big\|_2 \le O\big(T^{-2/3} M^{-1/3}\big),$$

where $W^{*}$ denotes the optimal parameter.

The convergence rate of DIGEST is $O(T^{-2/3} M^{-1/3})$, which is better than that of the pipeline-parallelism method, $O(T^{-2/3})$ (Wan et al., 2022), and of sampling-based methods, $O(T^{-1/2})$ (Chen et al., 2018; Cong et al., 2021), and very close to that of full-graph training, $O(T^{-1})$.
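The hyperparameter choices in Theorem 2 can be turned into a small calculator. The sketch assumes the reading $\eta = \sqrt{M\epsilon}/E$ and $T = (\mathcal{L}(W^{(1)}) - \mathcal{L}(W^{*}))\,E\,M^{-1/2}\epsilon^{-3/2}$ of the flattened formula; the constant $E$ and the loss gap are not known in practice, so the numbers below are purely illustrative and `theorem2_schedule` is a hypothetical name.

```python
import math

def theorem2_schedule(eps, num_subgraphs, const_e, loss_gap):
    """Learning rate and iteration count under the assumed reading of Theorem 2."""
    eta = math.sqrt(num_subgraphs * eps) / const_e
    num_iters = loss_gap * const_e / (math.sqrt(num_subgraphs) * eps ** 1.5)
    return eta, num_iters

# Illustrative values only: target accuracy 0.01, M = 4 subgraphs, E = 10.
eta, num_iters = theorem2_schedule(eps=0.01, num_subgraphs=4,
                                   const_e=10.0, loss_gap=1.0)

# Under this reading, eta * T equals loss_gap / eps regardless of M and E.
product = eta * num_iters
```

Under this reading, increasing the number of subgraphs $M$ allows a larger learning rate and fewer iterations for the same target $\epsilon$, which matches the $M^{-1/3}$ factor in the stated rate.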

4.3. CONVERGENCE OF ASYNCHRONOUS DIGEST

Convergence of asynchronous distributed algorithms can be even harder to establish due to the delay in parameter updates (the global model's parameters may have been updated several times by the time the slowest local machine finishes its computation). Our main result is shown below.

Assumption 4. $\forall m \in [M]$, $\|\nabla \mathcal{L}_m(W)\|_2 \le V \cdot \|\nabla \mathcal{L}(W)\|_2$ and $\langle \nabla \mathcal{L}(W), \nabla \mathcal{L}_m(W) \rangle \ge \beta \cdot \|\nabla \mathcal{L}(W)\|_2^2$, where $V$ and $\beta$ are positive real numbers, i.e., $V, \beta \in \mathbb{R}^{+}$.

Theorem 3. Assume the global model $\mathcal{L}(W)$ is $C_f$-Lipschitz continuous and the delay is bounded, i.e., $\tau < K$. Further, assume $\beta - \frac{V^2}{2} > 0$. There exist a constant $B$ and a second-order polynomial of the learning rate $\eta$, i.e., $P(\eta)$, such that after $T$ global iterations on the server, asynchronous DIGEST converges to the optimal parameter $W^{*}$ with

$$T^{-1} \sum_{t=1}^{T} \big\|\nabla \mathcal{L}(W^{(t)})\big\|_2^2 \le (\eta T B)^{-1} \big(\mathcal{L}(W^{(1)}) - \mathcal{L}(W^{*})\big) + P(\eta)/B,$$

where $B = \beta - \frac{V^2}{2}$ and $P(\eta) = \frac{1}{2} \eta^2 K^2 C_f^2 L_f^2 + (1 + V)\, \eta K C_f^2 L_f$.
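The two terms of the Theorem 3 bound pull in opposite directions as the learning rate changes, which a short numerical sketch makes visible. It reads the bound as $(\eta T B)^{-1}(\mathcal{L}(W^{(1)}) - \mathcal{L}(W^{*})) + P(\eta)/B$ with $B = \beta - V^2/2$; all constants below are made-up values, and `async_digest_bound` is a hypothetical name.

```python
def async_digest_bound(eta, num_iters, loss_gap, beta, v, delay_k, c_f, l_f):
    """Evaluate the Theorem 3 bound; requires beta - v**2 / 2 > 0."""
    b = beta - v ** 2 / 2
    if b <= 0:
        raise ValueError("Theorem 3 assumes beta - V^2 / 2 > 0")
    # P(eta): second-order polynomial in the learning rate.
    p_eta = (0.5 * eta ** 2 * delay_k ** 2 * c_f ** 2 * l_f ** 2
             + (1 + v) * eta * delay_k * c_f ** 2 * l_f)
    optimization_term = loss_gap / (eta * num_iters * b)
    delay_term = p_eta / b
    return optimization_term + delay_term

# Illustrative constants only.
bound_large_eta = async_digest_bound(eta=0.1, num_iters=100, loss_gap=1.0,
                                     beta=1.0, v=1.0, delay_k=2, c_f=1.0, l_f=1.0)
bound_small_eta = async_digest_bound(eta=0.01, num_iters=100, loss_gap=1.0,
                                     beta=1.0, v=1.0, delay_k=2, c_f=1.0, l_f=1.0)
```

A smaller learning rate shrinks the delay term $P(\eta)/B$ but inflates the optimization term unless $T$ grows accordingly, and a larger delay bound $K$ makes the delay term grow at any fixed $\eta$.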

5. EXPERIMENTS

In (Paszke et al., 2019). For all the experiments, we simulate a distributed training environment using an EC2 g4dn.metal virtual machine (VM) instance on AWS, which has 8 NVIDIA T4 GPUs, 96 vCPUs, and 384 GB of main memory. We implemented the shared-memory KVS using the Plasma in-memory object store for representation storage and retrieval.

5.2. EXPERIMENTAL RESULTS

In this section, we evaluate both the synchronous and asynchronous versions of DIGEST, together with LLCG and DGL, on the four datasets. Due to the page limit, we move parts of our evaluation results to the Appendix. Efficiency of DIGEST. We first evaluate the training performance of DIGEST and DIGEST-A. As shown in Figure 3, DIGEST outperforms both LLCG and DGL on all the datasets when performing distributed training on a pure GCN. LLCG performs worst, particularly on the Reddit dataset, because in the global server correction of LLCG only a mini-batch is trained, which is not sufficient to correct the plain GCN. This is also the reason why the authors of LLCG report the performance of a complex model that mixes GCN layers and GraphSAGE layers (Ramezani et al., 2021). DGL achieves good performance on some datasets (e.g., OGB-Products) with a uniform node sampling strategy and real-time representation exchange. However, frequent communication also leads to a slow performance increase on the Flickr dataset (Figure 3). Table 1 presents the detailed numbers for the comparison of the three frameworks on the four datasets. In all cases except GAT on OGB-Arxiv, DIGEST achieves the leading F1 scores on the validation dataset, demonstrating the efficacy of DIGEST's design. Scalability of DIGEST. We evaluate the scalability of the three frameworks by training a GCN on OGB-Products with a varying number of GPUs. We compute the speedup as the ratio of DGL's average training time per epoch on a single GPU to the average training time per epoch of each configuration. As shown in Figure 5, DIGEST shows the best scalability of the three frameworks. The speedup rises with the number of GPUs used during training. We observe a similar trend for DGL, but the relative speedup for DGL is significantly smaller than that for DIGEST, because DGL uses real-time representations instead of stale representations. Synchronization frequency.
We next perform a sensitivity analysis by varying the synchronization interval on OGB-Products to study how the synchronization frequency affects training performance. As shown in Figure 6, DIGEST achieves the highest F1 score over training time when configured to synchronize stale representations every 10 epochs. A large interval (20) or a small interval (1) results in performance degradation, due to the long-term loss of graph information or the additional communication cost, respectively. Training in heterogeneous environment. Finally, we test DIGEST's asynchronous training mode. As stated in Section 3.2, asynchronous training is better suited for GNN training in heterogeneous environments. In this test, we randomly select one subgraph as the straggler before training starts. To simulate the straggler lagging caused by limited computing capability, a random delay ranging from 8 to 10 seconds is added to the chosen straggler during the whole training process. We can see from Figure 7 that DIGEST-A performs much better than the other three synchronous methods and converges to a high F1 score at an early stage of training. This is because the asynchronous mode effectively eliminates the GPU blocking caused by waiting and thus significantly improves GPU utilization.
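The effect of a single straggler on synchronous versus asynchronous training can be illustrated with back-of-the-envelope arithmetic. The sketch below is a simplified model, not a reproduction of the experiment: it assumes a hypothetical 1-second local step for normal workers, adds an 8-second delay to one straggler (the lower end of the 8 to 10 second range above), and ignores communication time.

```python
def sync_wall_clock(step_times, rounds):
    """Synchronous mode: every global round waits for the slowest worker."""
    return rounds * max(step_times)

def async_local_steps(step_times, wall_clock):
    """Asynchronous mode: each worker completes steps at its own pace."""
    return sum(int(wall_clock // t) for t in step_times)

# Three normal workers (1 s per step, assumed) and one straggler (1 s + 8 s delay).
step_times = [1.0, 1.0, 1.0, 9.0]
rounds = 10

sync_seconds = sync_wall_clock(step_times, rounds)
sync_steps = rounds * len(step_times)
async_steps = async_local_steps(step_times, sync_seconds)
```

In the same wall-clock budget the non-blocking workers complete several times more local steps than in the synchronous schedule. The price is that those steps use parameters and representations of varying staleness, which is the situation Theorem 3 addresses through the bounded-delay assumption.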

6. CONCLUSION

There are two general categories of distributed GNN training methods. Partition-based methods suffer from graph information loss, while propagation-based methods suffer from high communication cost. In this work, we present DIGEST, a novel distributed GNN training framework that synergizes the complementary strengths of both categories by leveraging stale representations intelligently. We provide rigorous theoretical analysis to prove that DIGEST has a competitive convergence rate and a bounded error due to staleness. Extensive experiments on four benchmark datasets validate our analysis and demonstrate the efficiency and scalability of DIGEST.

A APPENDIX

In this section, we clarify the contributions made in this work and describe the detailed experimental setup, additional experimental results, and complete proofs. We reuse part of the code from GNNAutoScale (Fey et al., 2021); our code is available at: https://anonymous.4open.science/r/DIGESTA-78F2/. Please note that the code is subject to reorganization to improve readability.

A.1 CONTRIBUTIONS AND NOVELTY

The contributions and novelty of this work are multi-fold. In this paper, we propose a new, highly parallel, and full-graph-aware distributed GNN training method; on top of this new method, we design a novel, compute-and-storage-disaggregated training system to enable better scalability and to allow distributed GNN training to potentially benefit from emerging computing paradigms and hardware; finally, we derive new theoretical guarantees and analyses for the co-designed algorithms and systems. (2) System Architecture Novelty: The disaggregated architecture of DIGEST is the result of an algorithm-system co-design, as mentioned in the Methodology Novelty, and enables desirable properties including high scalability and low training time, as demonstrated in our paper. More importantly, this disaggregated architecture could open fundamental opportunities for GNN training systems to take advantage of emerging computing paradigms such as elastic serverless computing, as well as emerging hardware such as Zoned Namespace SSDs (ZNS) and smart programmable network hardware (SmartNICs); in this work, we have shown the promising scalability and speedup that DIGEST offers, which establishes a solid system foundation for further system-level optimizations and innovations. This calls for future research along this line, which we plan to pursue as part of our future work. (3) Theoretical Novelty: All of our theoretical analyses are tailored to a distributed training setup, while GNNAutoScale only considers single-GPU training.

A.2 EXPERIMENTAL SETUP DETAILS

As mentioned in Section 5.1, all the experiments are run on an EC2 g4dn.metal virtual machine (VM) instance on AWS, which has 8 NVIDIA T4 GPUs, 96 vCPUs, and 384 GB of main memory. Other important information, including the operating system version, Linux kernel version, and CUDA version, is summarized in Table 2. For a fair comparison, we use the same optimizer (Adam), learning rate, and graph partition algorithm for all three frameworks: DIGEST, LLCG, and DGL. For parameters that are shared only by LLCG and DGL, such as the number of neighbors sampled from each layer for each node, we choose the default value for both LLCG and DGL. Each of the three frameworks also has a set of parameters that are exclusive to that framework; we tune these exclusive parameters to achieve the best performance. Please refer to the configuration files under run/conf/model for the detailed configuration of all the models and datasets. (2019), and OGB-Products (Hu et al., 2020) for evaluation. The detailed information of these datasets is summarized in Table 3. For the OGB-Arxiv dataset, the performance of DIGEST is slightly worse than that of DGL but still better than that of LLCG. LLCG's training curves, in particular, are not stable and fluctuate dramatically for both GCN and GAT on Reddit. This is because Reddit is much denser than the other datasets, and in this case the sampling process of the global server correction in LLCG has difficulty compensating for all the information lost to the cut edges. Unlike LLCG, DIGEST's training curves are much smoother for both GCN and GAT training.

A.3.2 MEMORY OVERHEAD

In this section, we quantify the memory overhead introduced by DIGEST. Ratio of out-of-subgraph nodes to in-subgraph nodes. Figure 9 shows the ratio of the number of out-of-subgraph nodes to the number of in-subgraph nodes across the four datasets. This ratio quantifies the additional memory consumption compared to methods that do not use any information from neighboring nodes during training. Denser graphs such as Flickr and Reddit require more memory to store the representations of out-of-subgraph nodes than OGB-Arxiv and OGB-Products. This introduces an interesting tradeoff between the extra memory and the benefits gained from reduced communication and the preservation of global graph information. We argue that modern GPU servers are equipped with ample GPU memory to buffer the out-of-subgraph representations (Gandhi & Iyer, 2021; Wang et al., 2021); if more memory is indeed required, our approach can benefit from recent advances in unified memory (Choi et al., 2022; Huang et al., 2020; Peng et al., 2020), where DIGEST can use both the host and GPU memory more efficiently. In the worst case, if GPU memory is limited, DIGEST can implement a multi-tier storage system that uses the limited GPU memory as a level-one cache and the host memory as a backing store. For large graphs that are sparse (OGB-Products), the extra memory cost is bounded by a relatively low ratio (58.43%). Host memory cost of stale representations. The KVS is responsible for storing the representations of all the nodes in a graph. The representations are stored in the memory of the host server rather than in GPU memory, which is much more limited. We implemented the in-memory KVS with Apache Plasma, a shared-memory store that supports efficient, shared-memory-based inter-process communication (IPC) among multiple training processes located on the same server. Extending our current KVS implementation to a fully distributed storage system is straightforward.
Using an off-the-shelf, high-performance distributed in-memory KVS such as Redis is one option. Alternatively, we could implement a simple client library that the training processes use for key-value item mapping (e.g., using the commonly used consistent hashing algorithm) and for remote representation retrieval and storage. With this client library, we could deploy a cluster of Plasma storage processes, either on a dedicated storage cluster or on the same training server cluster, to support distributed representation storage.

The overall memory consumption required to store the representation data can be calculated as

$$\text{KVS memory usage} = (L - 1) \times dim \times |V| \times s, \qquad (7)$$

where $L$ is the total number of layers of the model, $dim$ denotes the hidden dimension, $|V|$ is the number of nodes in the graph, and $s$ is the size in bytes of the data type (as in Python NumPy). For the float32 data type, each value takes 4 bytes. With this formula, for a 3-layer GNN model trained on a large graph dataset such as OGB-Products (2,449,029 nodes, hidden dimension 128), the extra host memory consumption for the representations is around $2 \times 128 \times 2{,}449{,}029 \times 4 / 1024^3 \approx 2.336$ GB. DIGEST thus exhibits a tradeoff: it spends a small amount of extra memory on storing stale representations in order to disaggregate compute and storage, which yields higher scalability and more flexibility. We also argue that a host memory cost of several GBs is negligible considering that today's multi-GPU servers are equipped with hundreds of GBs, if not a few TBs, of host memory (footnote 3).

In this section, we empirically evaluate the gradient approximation error caused by the use of stale representations. We conduct this experiment to show that the actual approximation error of the gradients of DIGEST, compared with the ground-truth gradients (i.e., gradients calculated without any stale representation), can be negligible in practice.
As can be seen in Figure 10, during the training phase the gradients calculated by DIGEST quickly converge to the ground-truth gradients, typically within 10 to 20 epochs. Hence, for the majority of training epochs the gradient error is very small and its impact is negligible, which in turn validates our theoretical analysis in Theorem 1.
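Returning to the host-memory estimate of Eq. (7) in Section A.3.2, the arithmetic can be checked with a few lines of Python. This is only a minimal sketch: the function name `kvs_memory_bytes` and its parameter names are illustrative choices, not part of DIGEST's code, and "GB" is interpreted as $1024^3$ bytes, as in the worked example in the text.

```python
def kvs_memory_bytes(num_layers, hidden_dim, num_nodes, bytes_per_value=4):
    """Eq. (7): (L - 1) * dim * |V| * s, in bytes.

    bytes_per_value defaults to 4, the size of a float32 value.
    """
    return (num_layers - 1) * hidden_dim * num_nodes * bytes_per_value


# OGB-Products numbers quoted in the text: 3 layers, 128 hidden dims,
# 2,449,029 nodes, float32 representations.
usage_bytes = kvs_memory_bytes(num_layers=3, hidden_dim=128, num_nodes=2_449_029)
usage_gb = usage_bytes / 1024**3  # "GB" here means 1024^3 bytes, as in the text

print(f"{usage_bytes} bytes = {usage_gb:.3f} GB")
```

Because the cost is linear in every factor, doubling the hidden dimension or switching to an 8-byte data type doubles the estimate.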

A.3.4 OTHER COMPARISONS

In this section, we compare DIGEST/DIGEST-A with PipeGCN and GNNAutoScale.

DIGEST uses the widely used METIS algorithm (Karypis & Kumar, 1998) to partition the graph. The resulting mini-batches are then distributed to distinct workers, each of which handles training on one GPU device. Depending on the size of the mini-batch, a single worker can handle one or multiple subgraphs. DIGEST has two types of I/O: storing and retrieving the model weights, and storing and retrieving the stale representations, as illustrated in Figure 1(c). The former is performed every epoch by aggregating the model weights of the other subgraphs in parallel (Line 13). For the latter, we synchronize every N epochs using the defined pull and push operations (Lines 5, 6, 9, 10), where N is tuned empirically to obtain the best performance over training time. To support the asynchronous mode (DIGEST-A), we can simply remove the training-epoch loop and move the parameter aggregation (Line 13) into the subgraph loop.
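The periodic schedule described above can be illustrated by enumerating when pull and push operations fire. The sketch below only mirrors the two conditions stated in Algorithm 1 (pull when `r % N == 0`, push when `(r - 1) % N == 0`, both skipped for the last layer); the function `sync_schedule` and its argument names are hypothetical, and no actual KVS traffic is modelled.

```python
def sync_schedule(num_epochs, num_layers, interval):
    """List the (epoch, layer, operation) triples triggered by the schedule.

    Epochs and layers are 1-indexed, as in Algorithm 1. The last layer is
    never synchronized (the "l != L" guard in the algorithm).
    """
    ops = []
    for r in range(1, num_epochs + 1):
        for layer in range(1, num_layers):  # layers 1 .. L-1 only
            if r % interval == 0:
                ops.append((r, layer, "PULL"))
            if (r - 1) % interval == 0:
                ops.append((r, layer, "PUSH"))
    return ops


ops = sync_schedule(num_epochs=10, num_layers=3, interval=4)
pull_epochs = sorted({r for r, _, kind in ops if kind == "PULL"})
push_epochs = sorted({r for r, _, kind in ops if kind == "PUSH"})
per_epoch_ops = sync_schedule(num_epochs=10, num_layers=3, interval=1)

print("pull epochs:", pull_epochs)
print("push epochs:", push_epochs)
print("KVS operations with N=4:", len(ops), "vs N=1:", len(per_epoch_ops))
```

Under these toy settings, a larger interval N reduces the number of KVS operations, which is the I/O side of the trade-off that N is tuned for.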

A.5 THEORETICAL PROOF

In this section, we provide the formal proofs of all the theorems presented in the main paper.

A.5.1 PROOF OF THEOREM 1

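Algorithm 1 Distributed GNN training with periodic stale representation synchronization

Input: Graph $G(V, E)$; GNN depth $L$; training epoch $R$; global parameters $W^{(r)} = \{W^{(r,\ell)}\}_{\ell=1}^{L}$; local parameters $W_m^{(r)} = \{W_m^{(r,\ell)}\}_{\ell=1}^{L}$, $\forall m \in M$, $r \in R$; non-linearity activation function $\sigma$; neighborhood function $\mathcal{N}: v \to 2^V$; synchronization interval $N$; learning rate $\eta$.
Output: The trained model weights $W^{(R+1)}$.

    Initialize $W^{(1)}$
    $\{G_m(V_m, E_m),\ m = 1, 2, \dots, M\} \leftarrow \mathrm{METIS}(G)$   ▷ graph partition
    for $r = 1 \dots R$ do
 3:   for $m = 1, \dots, M$ in parallel do
 4:     $W_m^{(r)} = W^{(r)}$
        for $\ell = 1 \dots L$ do
 5:       if $r \,\%\, N == 0$ and $\ell \neq L$ then
 6:         $H_{out}^{(\ell,m)} \leftarrow \tilde{H}_{out}^{(\ell,m)}$   ▷ PULL
 7:       for $v \in V_m$ do
 8:         $h_{out}^{(\ell)} = \{h_u^{(\ell)} : u \in \mathcal{N}(v) \setminus V_m\}$
            $h_{in}^{(\ell)} = \{h_u^{(\ell)} : u \in \mathcal{N}(v) \cap V_m\}$
            $h_v^{(\ell)} = \sigma\big(W_m^{(r,\ell)} \cdot \mathrm{CONCAT}(h_v^{(\ell)}, h_{in}^{(\ell)}, h_{out}^{(\ell)})\big)$
 9:       if $(r - 1) \,\%\, N == 0$ and $\ell \neq L$ then
10:         $H_{in}^{(\ell,m)} \rightarrow \tilde{H}_{in}^{(\ell,m)}$   ▷ PUSH
11:       $h_v^{(\ell)} \leftarrow h_v^{(\ell)} / \|h_v^{(\ell)}\|_2,\ \forall v \in V_m$   ▷ representation normalization
12:     $W_m^{(r,\ell+1)} = W_m^{(r,\ell)} - \eta \cdot \nabla W_m^{(r,\ell)}$   ▷ update local parameters
13:   $W^{(r+1)} \leftarrow \mathrm{AGG}(W$ …

Theorem 4 (Formal version of Theorem 1). Given an $L$-layer GNN $f_W$ with $r_1$-Lipschitz smooth $\Phi$ and $r_2$-Lipschitz smooth $\Psi$. Denote $\Delta(G)$ as the maximal node degree for graph $G$. Assume $\forall v \in V$ and $\forall \ell \in \{1, 2, \dots, L-1\}$ we have $\|h_v^{(\ell)} - \tilde{h}_v^{(\ell)}\| \le \epsilon^{(\ell)}$, where $h_v^{(\ell)}$ and $\tilde{h}_v^{(\ell)}$ denote the node representation computed by DIGEST and the stale one, respectively. Further assume each local loss function $\mathcal{L}_m^{Local}$ is $\tau$-Lipschitz smooth w.r.t. the node representation. Then the global gradient computed by DIGEST has the following error bound:

$$\left\|\nabla_W \mathcal{L} - \nabla_W \mathcal{L}^*\right\|_2 \le \frac{\tau}{M} \sum_{\ell=1}^{L-1} \epsilon^{(\ell)} r_1^{L-\ell} r_2^{L-\ell} \sum_{m=1}^{M} |\Delta(G_m)|^{L-\ell},$$

where

$$\nabla_W \mathcal{L} := \frac{1}{M} \sum_{m=1}^{M} \frac{1}{|V_m|} \sum_{v \in V_m} \nabla_W \mathcal{L}_m^{Local}(h_v^{(L)}), \qquad \nabla_W \mathcal{L}^* := \frac{1}{M} \sum_{m=1}^{M} \frac{1}{|V_m|} \sum_{v \in V_m} \nabla_W \mathcal{L}_m^{Local}(h_v^{*(L)}),$$

and $h_v^{*(\ell)}$ denotes the exact output from the $\ell$-th layer of the GNN without any staleness.

Proof. As stated in Theorem 2 in Fey et al. (2021), under the single-GPU training setup, with Lipschitz smooth $\Phi$ and $\Psi$ as well as not too stale node representations, the GNN last layer's output can be bounded by

$$\left\|h_v^{(L)} - h_v^{*(L)}\right\|_2 \le \sum_{\ell=1}^{L-1} \epsilon^{(\ell)} r_1^{L-\ell} r_2^{L-\ell} |\mathcal{N}(v)|^{L-\ell}.$$

Now consider the distributed GNN training setting. First, notice that in our distributed setting, the stale node representation $\tilde{h}_v^{(\ell)}$ is shared by all subgraphs.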
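In other words, for $m = 1, 2, \dots, M$ we can apply the conclusion above together with the Lipschitz smoothness assumption and obtain

$$\left\|\nabla_W \mathcal{L}_m^{Local}(h_v^{(L)}) - \nabla_W \mathcal{L}_m^{Local}(h_v^{*(L)})\right\|_2 \le \tau \left\|h_v^{(L)} - h_v^{*(L)}\right\|_2 \le \tau \cdot \sum_{\ell=1}^{L-1} \epsilon^{(\ell)} r_1^{L-\ell} r_2^{L-\ell} |\mathcal{N}(v)|^{L-\ell}. \qquad (12)$$

Notice that $|\mathcal{N}(v)| \le \Delta(G_m)$, $\forall v \in V_m$, where $\Delta(G_m)$ is defined as the maximal node degree for subgraph $G_m$. We can sum over all nodes $v \in V_m$ and take the average on both sides of Eq. 12 to get

$$\frac{1}{|V_m|} \sum_{v \in V_m} \left\|\nabla_W \mathcal{L}_m^{Local}(h_v^{(L)}) - \nabla_W \mathcal{L}_m^{Local}(h_v^{*(L)})\right\|_2 \le \tau \cdot \sum_{\ell=1}^{L-1} \epsilon^{(\ell)} r_1^{L-\ell} r_2^{L-\ell} |\Delta(G_m)|^{L-\ell}. \qquad (13)$$

Averaging Eq. 13 over the $M$ subgraphs and applying the triangle inequality to the definitions of $\nabla_W \mathcal{L}$ and $\nabla_W \mathcal{L}^*$ gives

$$\left\|\nabla_W \mathcal{L} - \nabla_W \mathcal{L}^*\right\|_2 \le \frac{1}{M} \sum_{m=1}^{M} \frac{1}{|V_m|} \sum_{v \in V_m} \left\|\nabla_W \mathcal{L}_m^{Local}(h_v^{(L)}) - \nabla_W \mathcal{L}_m^{Local}(h_v^{*(L)})\right\|_2 \le \frac{\tau}{M} \sum_{\ell=1}^{L-1} \epsilon^{(\ell)} r_1^{L-\ell} r_2^{L-\ell} \sum_{m=1}^{M} |\Delta(G_m)|^{L-\ell},$$

which finishes the proof.

A.5.2 PROOF OF THEOREM 2

In this section, we prove the convergence of DIGEST under the synchronous setting. First, we introduce some notation, definitions, and necessary assumptions.

Preliminaries. We consider GCN in our proof without loss of generality. We denote the input graph as $G = (V, E)$, the $L$-layer GNN as $f$, the feature matrix as $X$, and the weight matrix as $W$. The forward propagation of one layer of GCN is

$$Z^{(\ell+1)} = P H^{(\ell)} W^{(\ell)}, \qquad H^{(\ell+1)} = \sigma(Z^{(\ell+1)}),$$

where $\ell$ is the layer index, $\sigma$ is the activation function, and $P$ is the propagation matrix following the definition of GCN (Kipf & Welling, 2016). Notice that $H^{(0)} = X$. We can further define the $(\ell+1)$-th layer of GCN as

$$f^{(\ell+1)}(H^{(\ell)}, W^{(\ell)}) := \sigma(P H^{(\ell)} W^{(\ell)}).$$

The backward propagation of GCN can be expressed as follows:

$$G_H^{(\ell)} = \nabla_H f^{(\ell+1)}(H^{(\ell)}, W^{(\ell)}, G_H^{(\ell+1)}) := P^{\top} D^{(\ell+1)} (W^{(\ell)})^{\top}, \qquad (17)$$

$$G_W^{(\ell)} = \nabla_W f^{(\ell+1)}(H^{(\ell)}, W^{(\ell)}, G_H^{(\ell+1)}) := (P H^{(\ell)})^{\top} D^{(\ell+1)},$$

where $D^{(\ell+1)} = G_H^{(\ell+1)} \circ \sigma'(P H^{(\ell)} W^{(\ell)})$ and $\circ$ represents the Hadamard product.

Under a distributed training setting, for each subgraph $G_m = (V_m, E_m)$, $m = 1, 2, \dots, M$, the propagation matrix can be decomposed into two independent matrices, i.e., $P = P_{m,in} + P_{m,out}$, where $P_{m,in}$ denotes the propagation matrix for nodes inside the subgraph $G_m$ while $P_{m,out}$ denotes that for neighbor nodes outside $G_m$.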
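The decomposition $P = P_{m,in} + P_{m,out}$ can be made concrete on a toy graph. The sketch below is an illustration under stated assumptions, not the paper's implementation: it reads the decomposition as a column-wise split by whether the source node belongs to $V_m$, uses an unnormalized adjacency matrix with self-loops as a stand-in for $P$, and all helper names (`split_propagation`, `matmul`, `add`) are hypothetical.

```python
def split_propagation(P, members):
    """Split P column-wise: columns of nodes in `members` go to P_in, the rest to P_out."""
    n = len(P)
    P_in = [[P[i][j] if j in members else 0 for j in range(n)] for i in range(n)]
    P_out = [[P[i][j] if j not in members else 0 for j in range(n)] for i in range(n)]
    return P_in, P_out


def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def add(A, B):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


# Toy 4-node path graph 0-1-2-3 with self-loops (stand-in for P).
P = [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 1, 1, 1],
     [0, 0, 1, 1]]
members = {0, 1}                      # nodes of subgraph G_m
P_in, P_out = split_propagation(P, members)

H = [[1], [2], [3], [4]]              # fresh representations
H_stale = [[1], [2], [30], [40]]      # outdated copies of nodes 2 and 3

exact = matmul(P, H)
recombined = add(matmul(P_in, H), matmul(P_out, H))
with_stale = add(matmul(P_in, H), matmul(P_out, H_stale))

print("exact:     ", exact)
print("recombined:", recombined)
print("with stale:", with_stale)
```

With fresh representations the two terms recombine to the exact product. When the out-of-subgraph term uses outdated copies, only rows that have out-of-subgraph neighbors change; node 0, whose neighbors all lie inside the subgraph, is unaffected.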
Where this does not cause confusion, we will use $P_{in}$ and $P_{out}$ in the remainder of the proof for simpler notation. For DIGEST, the forward propagation of a single layer of GCN can be expressed as

$$\tilde{Z}_m^{(t,\ell+1)} = P_{in} \tilde{H}_m^{(t,\ell)} W_m^{(t,\ell)} + P_{out} \tilde{H}_m^{(t-1,\ell)} W_m^{(t,\ell)}, \qquad \tilde{H}_m^{(t,\ell+1)} = \sigma(\tilde{Z}_m^{(t,\ell+1)}).$$

Assumptions. Here we introduce some assumptions about the GCN model and the original input graph. These assumptions are standard ones that are also used in (Chen et al., 2018; Cong et al., 2021; Wan et al., 2022).

$$\|W^{(\ell)}\|_F \le K_W, \qquad \|P^{(\ell)}\|_P \le K_W, \qquad \|X^{(\ell)}\|_F \le K_X.$$

Now we can introduce the proof of our Theorem 2. We consider a GCN with $L$ layers that is $L_f$-Lipschitz smooth, i.e., $\|\nabla \mathcal{L}(W_1) - \nabla \mathcal{L}(W_2)\|_2 \le L_f \|W_1 - W_2\|_2$.

Theorem 5 (Formal version of Theorem 2). There exists a constant $E$ such that for any arbitrarily small constant $\epsilon > 0$, we can choose a learning rate $\eta$ = […] where $W^{(t)}$ and $W^*$ denote the parameters at iteration $t$ and the optimal ones, respectively.

Proof. Beginning from the assumption of smoothness of the loss function,

$$\mathcal{L}(W^{(t+1)}) \le \mathcal{L}(W^{(t)}) + \left\langle \nabla \mathcal{L}(W^{(t)}),\, W^{(t+1)} - W^{(t)} \right\rangle + \frac{L_f}{2} \|W^{(t+1)} - W^{(t)}\|_2^2. \qquad (30)$$

Recall that the update rule of DIGEST is

$$W^{(t+1)} = W^{(t)} - \frac{\eta}{M} \sum_{m=1}^{M} \nabla \mathcal{L}_m(W_m^{(t)}).$$

We naturally assume that each local copy of the global GCN model is also $L_f$-Lipschitz smooth, i.e., $\|\nabla \mathcal{L}_m(W_1) - \nabla \mathcal{L}_m(W_2)\|_2 \le L_f \|W_1 - W_2\|_2$, $\forall m = 1, 2, \dots, M$. Now we can give the proof of Theorem 3.

Theorem 6 (Formal version of Theorem 3). Assume the global model $\mathcal{L}(W)$ is $C_f$-Lipschitz continuous and the delay is bounded, i.e., $\tau < K$. Further, assume the constants defined in Assumption 8 satisfy $\beta - \frac{V^2}{2} > 0$. Then, after $T$ global iterations on the server, asynchronous DIGEST converges to the optimal parameter $W^*$ by

$$\frac{1}{T} \sum_{t=1}^{T} \|\nabla \mathcal{L}(W^{(t)})\|_2^2 \le \frac{1}{\eta T B} \left( \mathcal{L}(W^{(1)}) - \mathcal{L}(W^{(*)}) \right) + \frac{P(\eta)}{B},$$

where $B = \beta - \frac{V^2}{2}$ and $P(\eta) = \frac{1}{2} \eta^2 K^2 C_f^2 L_f^2 + (1 + V) \eta K C_f^2 L_f$.

Proof.
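By the smoothness assumption of the global model,

$$\mathcal{L}(W^{(t+1)}) \le \mathcal{L}(W^{(t)}) + \left\langle \nabla \mathcal{L}(W^{(t)}),\, W^{(t+1)} - W^{(t)} \right\rangle + \frac{L_f}{2} \|W^{(t+1)} - W^{(t)}\|_2^2.$$

Suppose at global iteration $t + 1$ the server receives an update from subgraph $m$, where $m$ could be any value from 1 up to $M$. Then,

$$\mathcal{L}(W^{(t+1)}) \le \mathcal{L}(W^{(t)}) - \eta \left\langle \nabla \mathcal{L}(W^{(t)}),\, \nabla \mathcal{L}_m(W^{(t-\tau)}) \right\rangle + \frac{\eta^2 L_f}{2} \|\nabla \mathcal{L}_m(W^{(t-\tau)})\|_2^2,$$

where $\tau$ is the delay of subgraph $m$ when sending its update to the server. Denote $r_m := \nabla \mathcal{L}_m(W^{(t-\tau)}) - \nabla \mathcal{L}_m(W^{(t)})$. Since $\tau \le K$, by the Lipschitz smoothness of $\mathcal{L}_m$,

$$\|r_m\|_2 = \|\nabla \mathcal{L}_m(W^{(t-\tau)}) - \nabla \mathcal{L}_m(W^{(t)})\|_2 \le L_f \cdot \|W^{(t-\tau)} - W^{(t)}\|_2 = L_f \cdot \Big\| \sum_{i=t-\tau}^{t} \eta g_i \Big\|_2 \le \eta K L_f C_f,$$

where $g_i$ is the gradient or update the global server receives at iteration $i$. Therefore,

$$\mathcal{L}(W^{(t+1)}) - \mathcal{L}(W^{(t)}) \le -\eta \left\langle \nabla \mathcal{L}(W^{(t)}),\, \nabla \mathcal{L}_m(W^{(t)}) + r_m \right\rangle + \frac{\eta^2 L_f}{2} \|\nabla \mathcal{L}_m(W^{(t)}) + r_m\|_2^2$$

$$= \underbrace{-\eta \left\langle \nabla \mathcal{L}(W^{(t)}),\, \nabla \mathcal{L}_m(W^{(t)}) \right\rangle}_{\mathrm{I}} + \underbrace{\frac{\eta^2 L_f}{2} \|\nabla \mathcal{L}_m(W^{(t)})\|_2^2}_{\mathrm{II}} + \underbrace{\frac{\eta^2 L_f}{2} \|r_m\|_2^2}_{\mathrm{III}} \underbrace{-\, \eta \left\langle \nabla \mathcal{L}(W^{(t)}),\, r_m \right\rangle}_{\mathrm{IV}} + \underbrace{\eta^2 L_f \left\langle \nabla \mathcal{L}_m(W^{(t)}),\, r_m \right\rangle}_{\mathrm{V}}.$$

Now we want to find bounds for (I)-(V) above. By Assumption 8, we have

$$(\mathrm{I}) \le -\eta \beta \|\nabla \mathcal{L}(W^{(t)})\|_2^2, \qquad (44)$$

and

$$(\mathrm{II}) \le \frac{1}{2} \eta^2 L_f V^2 \|\nabla \mathcal{L}(W^{(t)})\|_2^2.$$

By our previous result on $\|r_m\|_2$, we have

$$(\mathrm{III}) \le \frac{1}{2} \eta^3 K^2 L_f^2 C_f^2.$$

Taking $\eta \le 1/L_f$, we have

$$(\mathrm{IV}) + (\mathrm{V}) \le \eta \left\langle \nabla \mathcal{L}_m(W^{(t)}) - \nabla \mathcal{L}(W^{(t)}),\, r_m \right\rangle. \qquad (47)$$

By the Cauchy-Schwarz inequality, the triangle inequality, and Assumption 8,

$$(\mathrm{IV}) + (\mathrm{V}) \le \eta \|\nabla \mathcal{L}_m(W^{(t)}) - \nabla \mathcal{L}(W^{(t)})\|_2 \cdot \|r_m\|_2$$
$$\le \eta^2 K C_f L_f \cdot \|\nabla \mathcal{L}_m(W^{(t)}) - \nabla \mathcal{L}(W^{(t)})\|_2$$
$$\le \eta^2 K C_f L_f \cdot \left( \|\nabla \mathcal{L}_m(W^{(t)})\|_2 + \|\nabla \mathcal{L}(W^{(t)})\|_2 \right)$$
$$\le (1 + V) \eta^2 K C_f L_f \cdot \|\nabla \mathcal{L}(W^{(t)})\|_2$$
$$\le (1 + V) \eta^2 K C_f^2 L_f.$$

Putting everything together, we have

$$\mathcal{L}(W^{(t+1)}) - \mathcal{L}(W^{(t)}) \le \left( \frac{1}{2} \eta V^2 - \eta \beta \right) \cdot \|\nabla \mathcal{L}(W^{(t)})\|_2^2 + \frac{1}{2} \eta^3 K^2 L_f^2 C_f^2 + (1 + V) \eta^2 K C_f^2 L_f.$$

Hence,

$$\|\nabla \mathcal{L}(W^{(t)})\|_2^2 \le \left( \eta \beta - \frac{1}{2} \eta V^2 \right)^{-1} \cdot \left( \mathcal{L}(W^{(t)}) - \mathcal{L}(W^{(t+1)}) \right) + \left( \eta \beta - \frac{1}{2} \eta V^2 \right)^{-1} \cdot \left( \frac{1}{2} \eta^3 K^2 L_f^2 C_f^2 + (1 + V) \eta^2 K C_f^2 L_f \right).$$

Summing up from $t = 1$ to $T$ and taking the average,

$$\frac{1}{T} \sum_{t=1}^{T} \|\nabla \mathcal{L}(W^{(t)})\|_2^2 \le \frac{1}{\left( \eta \beta - \frac{1}{2} \eta V^2 \right) T} \left( \mathcal{L}(W^{(1)}) - \mathcal{L}(W^{(*)}) \right) + \left( \eta \beta - \frac{1}{2} \eta V^2 \right)^{-1} \cdot \left( \frac{1}{2} \eta^3 K^2 L_f^2 C_f^2 + (1 + V) \eta^2 K C_f^2 L_f \right)$$

$$\le \frac{1}{\eta T B} \left( \mathcal{L}(W^{(1)}) - \mathcal{L}(W^{(*)}) \right) + \frac{P(\eta)}{B},$$

where $B = \beta - \frac{V^2}{2}$ and $P(\eta) = \frac{1}{2} \eta^2 K^2 C_f^2 L_f^2 + (1 + V) \eta K C_f^2 L_f$.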
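To get a feel for how the final bound of Theorem 6 behaves, its right-hand side can be evaluated numerically. The following is a minimal sketch: the function name `async_bound` and all constant values in the example are hypothetical placeholders chosen for illustration, not measurements from the paper. The checks on $B > 0$ and $\eta \le 1/L_f$ mirror the conditions used in the proof.

```python
def async_bound(eta, T, beta, V, K, C_f, L_f, init_gap):
    """Right-hand side of Theorem 6: init_gap / (eta * T * B) + P(eta) / B.

    init_gap stands for L(W^(1)) - L(W^(*)).
    """
    B = beta - V**2 / 2
    if B <= 0:
        raise ValueError("Theorem 6 requires beta - V^2/2 > 0")
    if eta > 1 / L_f:
        raise ValueError("the proof takes eta <= 1/L_f")
    P = 0.5 * eta**2 * K**2 * C_f**2 * L_f**2 + (1 + V) * eta * K * C_f**2 * L_f
    return init_gap / (eta * T * B) + P / B


# Hypothetical constants, chosen only so that the arithmetic is easy to follow.
consts = dict(beta=1.0, V=1.0, K=1, C_f=1.0, L_f=1.0, init_gap=1.0)
bound_100 = async_bound(eta=0.1, T=100, **consts)
bound_1000 = async_bound(eta=0.1, T=1000, **consts)

print(f"T=100:  {bound_100:.4f}")
print(f"T=1000: {bound_1000:.4f}")
```

For a fixed learning rate the first term shrinks as $1/T$ while the $P(\eta)/B$ term does not, so in this form the bound cannot drop below $P(\eta)/B$ by increasing $T$ alone; a smaller $\eta$ is also needed.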



https://backlinko.com/facebook-users

https://amzscout.net/blog/amazon-statistics

https://arrow.apache.org/docs/python/plasma.html

For example, AWS EC2's p3.16xlarge is equipped with 8 Nvidia Tesla V100 GPUs with 488 GBs of host memory: https://aws.amazon.com/blogs/aws/new-amazon-ec2-instances-with-up-to-8-nvidia-tesla-v100-gpus-p3/.

, where $V$ and $\beta$ are positive real numbers, i.e., $V, \beta \in \mathbb{R}^+$.



Figure 1: Distributed GNN training methods. (a): Propagation-based methods rely on communication of out-of-subgraph neighbor nodes for exact message passing even in a distributed setup. (b): Partition-based methods decompose the original problem into multiple smaller ones and directly apply data parallelism to the partitioned subgraph data. (c): In DIGEST, each device utilizes the stale representations of all its neighbors from other subgraphs. Propagation-based methods suffer from high communication cost (red vertical double arrows in (a)) due to neighbor explosion, while partition-based methods suffer from severe information loss due to dropped edges (red crosses in (b)). DIGEST combines the best of both worlds. All nodes are utilized in DIGEST to achieve full-graph awareness, while periodic stale representation synchronization keeps the communication cost low.

propagation matrix for in-subgraph nodes and out-of-subgraph nodes of $G_m$, respectively, and we have $P_m = P^{(m)}_{in} + P^{(m)}_{out}$, where $P_m$ is the original propagation matrix for subgraph $G_m$. $\sigma(\cdot)$ is the activation function following GCN's definition. Hence, the gradient over model parameters is

Figure 2: Illustration of DIGEST's concurrent pull/push and forward propagation operations on a 3-layer GNN.

Second, to further reduce the I/O overhead, DIGEST uses a periodic representation synchronization strategy, which pushes updated representations to the KVS once every N epochs. This introduces a trade-off between I/O overhead and training performance. Increasing the frequency of the periodic synchronization benefits performance but introduces more I/O overhead. We analyze this trade-off in Section 5.2.

v denotes the node representation computed by DIGEST and the stale one, respectively. Further assume each local loss function L Local m is τ -Lipschitz smooth w.r.t the node representation. Then, we have that ∇

In this section, we evaluate DIGEST and compare it against two state-of-the-art distributed GNN training frameworks as baselines in terms of training efficiency and scalability. Because the training time per epoch differs between DIGEST and the baselines, we report the F1 scores on the validation dataset and the training loss over training time, instead of over communication rounds. This makes for a fairer comparison of training performance and efficiency.

5.1 EXPERIMENT SETTING

Implementation and Setup. We have implemented DIGEST and the other GNN training methods used for comparison all in PyTorch

Figure 3: Performance comparison of the GCN training frameworks on four benchmark datasets. The top four subfigures show the training loss over training time, and the bottom four subfigures show the global validation F1 scores during the whole training process. (Best viewed in color.)

Figure 4: Training time/epoch.

Figure 5: Scalability.

(b)) and poor performance for all four datasets. DIGEST and DIGEST-A avoid these issues and therefore achieve satisfactory performance over the training time. DIGEST-A catches up with DIGEST only slowly because of the diverse model parameters used by subgraphs in the early training period. We measure the training time per epoch as shown in Figure 4. Since the representation synchronization is only performed before the start or after the end of local training, DIGEST takes significantly less training time per epoch than LLCG and DGL. Furthermore, DIGEST performs periodic synchronization instead of per-epoch synchronization, which further shortens the training time.

Figure 7: Performance comparison of OGB-Products when trained in a heterogeneous environment.

Figure 8: Performance comparison of different distributed GAT training methods on four benchmark datasets. The top three subfigures show the training loss during the whole training process; the bottom three subfigures show the global validation F1 scores during the whole training process.

Figure 10: Error between the gradients calculated by DIGEST and full-graph baseline (i.e., without any staleness). Zoom in for detail.

Figure 11: Time comparisons with PipeGCN and GNNAutoScale.

v denotes the node representation computed by DIGEST and the stale one, respectively. Further assume each local loss function L Local m is τ -Lipschitz smooth w.r.t node representation. Then the global gradient computed by DIGEST has the following error bound

Assumption 5. The loss function $Loss(\cdot, \cdot)$ is $C_{Loss}$-Lipschitz continuous and $L_{Loss}$-Lipschitz smooth with respect to the last layer's node representation, i.e.,

$$|Loss(h_v^{(L)}, y_v) - Loss(h_w^{(L)}, y_v)| \le C_{Loss} \|h_v^{(L)} - h_w^{(L)}\|_2 \qquad (25)$$

and

$$\|\nabla Loss(h_v^{(L)}, y_v) - \nabla Loss(h_w^{(L)}, y_v)\|_2 \le L_{Loss} \|h_v^{(L)} - h_w^{(L)}\|_2. \qquad (26)$$

Assumption 6. The activation function $\sigma(\cdot)$ is $C_\sigma$-Lipschitz continuous and $L_\sigma$-Lipschitz smooth, i.e., $\forall \ell \in \{1, 2, \dots, L\}$, we have

• Proposing a novel distributed GNN training framework that synergizes the benefits of partition-based and propagation-based methods. Existing work in distributed GNN training focuses on two contradictory objectives: partition-based methods target minimizing the communication cost, while propagation-based methods aim to minimize information loss. DIGEST drops no edges while avoiding communication overhead by integrating the strengths of both categories.

• Developing a periodic stale representation synchronization technique for distributed GNN training. DIGEST utilizes the entire graph for training by separating in- and out-of-subgraph neighbor nodes and approximating the latter with stale representations. Instead of making strictly synchronous pull/push operations for the representations of all layers before/after training, DIGEST overlaps pull/push operations with layer training to minimize the overall training time. Furthermore, a shared-memory-based KVS is used among subgraphs for efficiently exchanging representations.

• Providing extensive theoretical guarantees on both the performance and the convergence of the proposed algorithm. We proved that DIGEST's convergence rate is $O(T^{-2/3} M^{-1/3})$ with $T$ iterations and $M$ subgraphs, which is close to that of vanilla distributed GNN training without staleness. Convergence guarantees are provided for both the synchronous and asynchronous versions of DIGEST.

Hence, it is inherently suitable for parallel I/O at the granularity of node level. For subgraph G m , the total number of stale representations needed to be pulled from the KVS is | H(ℓ,m)

Performance comparison of distributed GNN frameworks. F1 score on the validation dataset is reported. Speedup is calculated by normalizing per-epoch training time against that of DGL.

Recall that in Section 2 we categorize existing distributed GNN training methods into two general types. In the evaluation, we choose two state-of-the-art distributed training frameworks, one from each category, as baselines. For the first category, we choose LLCG (Ramezani et al., 2021), which partitions a graph into subgraphs and trains each subgraph strictly independently without incurring any communication among subgraphs. LLCG uses a central server to aggregate local models from each device and performs global training using mini-batches with full neighbor information to ensure that the model learns the global structure of the graph. LLCG uses this additional step to reduce the information loss caused by graph partitioning. For the second category, we choose DGL (Wang et al., 2019), a commonly used distributed GNN training framework. In contrast to LLCG, DGL requires exchanging node representations among partitioned subgraphs. DGL performs frequent swap operations with other subgraphs for representations during each subgraph's local training in every epoch and therefore incurs high communication cost.

). Methodology Novelty in Algorithm-System Co-design: Our paper is mainly motivated from a distributed training perspective, where the proposed framework synergizes the best of both partition-based and propagation-based distributed training; GNNAutoScale provides a theoretical foundation, which exposes potential opportunities that can be harnessed by and co-designed with new distributed training system infrastructures to enable highly-parallel GNN training. DIGEST goes beyond GNNAutoScale in that we built a novel distributed training framework that effectively decouples the management of state (i.e., representations) and compute (i.e., GNN training).

Summary of environmental setup of our testbed.

Summary of dataset statistics.

GPU memory consumption. To address the concern of GPU memory consumption, we compare DIGEST with GraphSAGE, LLCG, and DGL by including all the information inside a GNN's receptive field in a single optimization step. The comparison results in Table 4 show that DIGEST has the lowest GPU memory consumption across all four systems.

We train OGB-Products with DIGEST/DIGEST-A, PipeGCN, and GNNAutoScale in the heterogeneous environment mentioned in Section 5.2, and report the training time taken to reach the target validation F1 score and the time per epoch in Figure 11. Since there is no "epoch" in an asynchronous setting, the time per epoch for DIGEST-A is omitted. We can see that DIGEST has a slightly higher time per epoch than GNNAutoScale but reduces the time per epoch by 24.13% compared with PipeGCN. Meanwhile, DIGEST-A takes the least training time to reach the target F1 score, saving 48.98% and 19.12% of the training time compared with PipeGCN and GNNAutoScale, respectively. We further evaluate DIGEST, PipeGCN, and GNNAutoScale on the large graph OGB-papers100m, which consists of 111 million nodes and 1.6 billion edges, to show the efficiency of DIGEST. The experiments are done in a homogeneous environment with 32 GPUs. Since mini-batches in GNNAutoScale are trained serially rather than in a parallel distributed setting, only one GPU is used for it. DIGEST reduces the time per epoch by 21.13% compared with the distributed GNN training algorithm PipeGCN.

$r \in R$; non-linearity activation function $\sigma$; neighborhood function $\mathcal{N}: v \to 2^V$; synchronization interval $N$; learning rate $\eta$. Output: The trained model weights $W^{(R+1)}$.

