COMMUNICATION-EFFICIENT SAMPLING FOR DISTRIBUTED TRAINING OF GRAPH CONVOLUTIONAL NETWORKS

Anonymous

Abstract

Training Graph Convolutional Networks (GCNs) is expensive because it requires recursively aggregating data from neighboring nodes. To reduce the computation overhead, previous works have proposed various neighbor sampling methods that estimate the aggregation result from a small number of sampled neighbors. Although these methods have successfully accelerated training, they mainly focus on the single-machine setting. As real-world graphs are large, training GCNs in distributed systems is desirable. However, we found that the existing neighbor sampling methods do not work well in a distributed setting. Specifically, a naive implementation may incur a huge amount of communication of feature vectors among different machines. To address this problem, we propose a communication-efficient neighbor sampling method in this work. Our main idea is to assign higher sampling probabilities to local nodes so that remote nodes are accessed less frequently. We present an algorithm that determines the local sampling probabilities and ensures that our skewed neighbor sampling does not significantly affect the convergence of the training. Our experiments with node classification benchmarks show that our method significantly reduces the communication overhead of distributed GCN training with little accuracy loss.

1. INTRODUCTION

Graph Convolutional Networks (GCNs) are powerful models for learning representations of attributed graphs. They have achieved great success in graph-based learning tasks such as node classification (Kipf & Welling, 2017; Duran & Niepert, 2017), link prediction (Zhang & Chen, 2017; 2018), and graph classification (Ying et al., 2018b; Gilmer et al., 2017). Despite this success, training a deep GCN on large-scale graphs is challenging. To compute the embedding of a node, a GCN needs to recursively aggregate the embeddings of the neighboring nodes, so the number of nodes needed for computing a single sample can grow exponentially with the number of layers. This makes plain mini-batch sampling ineffective for efficient training of GCNs. To alleviate the computational burden, various neighbor sampling methods have been proposed (Hamilton et al., 2017; Ying et al., 2018a; Chen et al., 2018b; Zou et al., 2019; Li et al., 2018; Chiang et al., 2019; Zeng et al., 2020). The idea is that, instead of aggregating the embeddings of all neighbors, they compute an unbiased estimate of the result based on a sampled subset of neighbors.

Although the existing neighbor sampling methods can effectively reduce the computation overhead of training GCNs, most of them assume a single-machine setting. The existing distributed GCN systems either perform neighbor sampling on each machine/GPU independently (e.g., PinSage (Ying et al., 2018a), AliGraph (Zhu et al., 2019), DGL (Wang et al., 2019)) or perform distributed neighbor sampling across all machines/GPUs (e.g., AGL (Zhang et al., 2020)). If the sampled neighbors on a machine include nodes stored on other machines, the system needs to transfer the feature vectors of those nodes across machines, which incurs a huge communication overhead. None of the existing sampling methods or distributed GCN systems take this communication overhead into consideration.
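The exponential growth of the receptive field is easy to see with a back-of-the-envelope calculation: with a uniform fan-out of f neighbors per node, an L-layer GCN may touch on the order of f^L nodes to produce one output embedding. A small sketch (the fan-out and depth values below are illustrative, not taken from the paper):

```python
def receptive_field_upper_bound(fanout: int, layers: int) -> int:
    """Upper bound on the number of nodes touched by recursive
    L-hop aggregation when every node has `fanout` neighbors."""
    total = 1      # the target node itself
    frontier = 1   # nodes added at the current hop
    for _ in range(layers):
        frontier *= fanout  # each frontier node pulls in `fanout` neighbors
        total += frontier
    return total

# With average degree 15 and a 3-layer GCN, computing a single
# embedding may require thousands of neighbor feature vectors:
print(receptive_field_upper_bound(15, 3))  # → 3616
```

This is why naive mini-batch training does not help: each sample in the batch drags in its own exponentially large neighborhood.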
In this work, we propose a communication-efficient neighbor sampling method for distributed training of GCNs. Our main idea is to assign higher sampling probabilities to local nodes so that remote nodes are accessed less frequently. By discounting the embeddings with the sampling probabilities, we make sure that the estimation is unbiased. We present an algorithm that generates sampling probabilities which ensure the convergence of training. To validate our sampling method, we conduct experiments with node classification benchmarks on different graphs. The experimental results show that our method significantly reduces the communication overhead with little accuracy loss.
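The idea of skewing sampling toward local nodes while keeping the estimate unbiased can be sketched with standard importance weighting: each sampled neighbor's embedding is divided by its (scaled) sampling probability. The NumPy sketch below uses placeholder probabilities (0.4 for local, 0.1 for remote neighbors); the paper's actual algorithm for choosing these probabilities is described later and is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

num_neighbors = 8
feats = rng.normal(size=(num_neighbors, 4))      # neighbor embeddings
is_local = np.array([1, 1, 1, 1, 0, 0, 0, 0])    # first 4 stored locally

# Skewed proposal: local neighbors are sampled more often than remote ones.
# The values 0.4 / 0.1 are illustrative, not the paper's computed probabilities.
probs = np.where(is_local == 1, 0.4, 0.1)
probs = probs / probs.sum()

def sampled_mean(k: int) -> np.ndarray:
    """Importance-weighted estimate of the mean neighbor embedding
    from k neighbors drawn with the skewed probabilities."""
    idx = rng.choice(num_neighbors, size=k, p=probs)
    # Discount each embedding by 1 / (n * p_i) so the estimator stays unbiased:
    # E[f_i / (n * p_i)] = sum_i p_i * f_i / (n * p_i) = mean(f).
    weights = 1.0 / (num_neighbors * probs[idx])
    return (weights[:, None] * feats[idx]).mean(axis=0)

exact = feats.mean(axis=0)
est = np.mean([sampled_mean(4) for _ in range(20000)], axis=0)
print(np.abs(est - exact).max())  # small: the skewed sampler is unbiased
```

Remote neighbors are drawn far less often (cutting communication), and the larger weight they carry when drawn compensates in expectation.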

2. RELATED WORK

The idea of applying convolution operations to the graph domain was first proposed by Bruna et al. (2013). Later, Kipf & Welling (2017) and Defferrard et al. (2016) simplified the convolution computation with localized filters. Most recent GCN models (e.g., GAT (Velickovic et al., 2018), GraphSAGE (Hamilton et al., 2017), GIN (Xu et al., 2019)) are based on the GCN of Kipf & Welling (2017), in which each layer aggregates information only from 1-hop neighbors. Kipf & Welling (2017) apply their GCN only to small graphs and use full batches for training. This has been the major limitation of the original GCN model, as full-batch training is expensive and infeasible for large graphs. Mini-batch training does not help much, since the number of nodes needed for computing a single sample can grow exponentially as the GCN goes deeper. To overcome this limitation, various neighbor sampling methods have been proposed to reduce the computational complexity of GCN training.

Node-wise Neighbor Sampling: GraphSAGE (Hamilton et al., 2017) reduces the receptive field size of each node by sampling a fixed number of its neighbors in the previous layer. PinSAGE (Ying et al., 2018a) adopts this node-wise sampling technique and enhances it by assigning an importance score to each neighbor, which leads to less information loss thanks to weighted aggregation. VR-GCN (Chen et al., 2018a) further restricts the neighbor sampling size to two and uses the historical activations of the previous layer to reduce variance. Although it achieves convergence comparable to GraphSAGE, VR-GCN incurs additional computation for the convolution operations on historical activations, which can outweigh the benefit of the reduced number of sampled neighbors. The problem with node-wise sampling is that, due to the recursive aggregation, it may still need to gather the information of a large number of nodes to compute the embeddings of a mini-batch.

Layer-wise Importance Sampling: To further reduce the sample complexity, FastGCN (Chen et al., 2018b) proposes layer-wise importance sampling. Instead of fixing the number of sampled neighbors for each node, it fixes the number of sampled nodes in each layer. Since the sampling is conducted independently in each layer, it requires a large sample size to guarantee the connectivity between layers. To improve the sample density and reduce the sample size, Huang et al. (2018) and Zou et al. (2019) propose to restrict the sampling space to the neighbors of the nodes sampled in the previous layer.

Subgraph Sampling: Layer-wise sampling needs to maintain a list of neighbors and calculate a new sampling distribution for each layer. This incurs an overhead that can sometimes negate the benefit of sampling, especially on small graphs. GraphSAINT (Zeng et al., 2020) simplifies the sampling procedure by sampling a subgraph and performing full convolution on the subgraph. Similarly, ClusterGCN (Chiang et al., 2019) pre-partitions a graph into small clusters and constructs mini-batches by randomly selecting subsets of clusters during training.

All of the existing neighbor sampling methods assume a single-machine setting. As we will show in the next section, a straightforward adoption of these methods in a distributed setting can lead to a large communication overhead.

3. BACKGROUND AND MOTIVATION

In an M-layer GCN, the l-th convolution layer is defined as H^(l) = P σ(H^(l-1)) W^(l), where H^(l) represents the embeddings of all nodes at layer l before activation, H^(0) = X represents the input feature vectors, σ is the activation function, P is the normalized Laplacian matrix of the graph, and W^(l) is the weight matrix of layer l.
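The layer equation H^(l) = P σ(H^(l-1)) W^(l) can be sketched directly in NumPy. The construction of P below follows Kipf & Welling (2017), P = D^{-1/2}(A + I)D^{-1/2}; this particular normalization is an assumption for illustration, since the text only names P as the normalized Laplacian matrix.

```python
import numpy as np

def normalized_prop_matrix(A: np.ndarray) -> np.ndarray:
    """P = D^{-1/2} (A + I) D^{-1/2}: the symmetric-normalized propagation
    matrix of Kipf & Welling (2017), assumed here as the form of P."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)                   # node degrees (with self-loops)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_layer(P: np.ndarray, H_prev: np.ndarray, W: np.ndarray) -> np.ndarray:
    """One convolution layer, H^(l) = P * sigma(H^(l-1)) * W^(l),
    matching the definition above (sigma applied before mixing)."""
    return P @ np.maximum(H_prev, 0.0) @ W  # sigma = ReLU

# Toy 4-node graph, 3-dimensional input features, a 2-unit layer.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
X = np.random.default_rng(0).normal(size=(4, 3))
W1 = np.random.default_rng(1).normal(size=(3, 2))

P = normalized_prop_matrix(A)
H1 = gcn_layer(P, X, W1)
print(H1.shape)  # (4, 2): one 2-dimensional embedding per node
```

The full aggregation P σ(H) touches every neighbor of every node, which is exactly the cost that neighbor sampling methods approximate away.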

