DECOUPLED GREEDY LEARNING OF GRAPH NEURAL NETWORKS

Abstract

Graph Neural Networks (GNNs) have become very popular for graph-related applications due to their superior performance. However, they have been shown to be computationally expensive in large-scale settings, because their node embeddings must be computed recursively, at a cost that scales exponentially with the number of layers. To address this issue, several sampling-based methods have recently been proposed to perform training on a subset of nodes while maintaining the fidelity of the trained model. In this work, we introduce a decoupled greedy learning method for GNNs (DGL-GNN) that, instead of sampling the input graph, decouples the GNN into smaller modules and associates each module with a greedy auxiliary objective. Our approach allows GNN layers to be updated during training without waiting for feedback from successor layers, thus making parallel GNN training possible. Our method achieves improved efficiency without significantly compromising model performance, which is important for time- or memory-limited applications. Further, we propose a lazy-update scheme during training to improve efficiency further. We empirically analyze our proposed DGL-GNN model, and demonstrate its effectiveness and superior efficiency through a range of experiments. Compared to sampling-based acceleration, our model is more stable, and we do not have to trade off between efficiency and accuracy. Finally, we note that while we focus here on the decoupled approach as an alternative to other methods, it can also be regarded as complementary, for example, to sampling and other scalability-enhancing improvements of GNN training.

1. INTRODUCTION

Graph Neural Networks (GNNs) have been shown to be highly effective in graph-related tasks, such as node classification (Kipf & Welling, 2016), graph classification (Ying et al., 2018b), graph matching (Bai et al., 2019), and recommender systems (Ying et al., 2018a). Given a graph of arbitrary size and attributes, GNNs obtain informative node embeddings by first conducting a graph convolution operation that aggregates information from the neighbors of each node, and then transforming the aggregated information. As a result, GNNs can fuse together the topological structure and node features of a graph, and have thus become dominant models for graph-based applications. Despite their superior representation power, the graph convolution operation has been shown to be expensive when GNNs become deep and wide (Chen et al., 2017). Therefore, training a deep GNN model is challenging for large and dense graphs. Since deep and wide GNNs are becoming increasingly important with the emergence of classification tasks on large graphs, such as the newly proposed OGB datasets (Hu et al., 2020), and semantic segmentation tasks as introduced in Li et al. (2019), we focus here on methods for alleviating the computational burden of large-scale GNN training. Several strategies have been proposed in past years to alleviate this computational issue. GraphSAGE (Hamilton et al., 2017) took the first step, leveraging a neighborhood sampling strategy that aggregates only a sampled subset of each node's neighbors in the graph convolution operation.
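The neighborhood sampling idea, and the receptive-field growth it tries to tame, can be sketched as a toy in Python (an illustration only; `sample_neighbors` and `receptive_field` are our names for this sketch, not GraphSAGE's API):

```python
import random

def sample_neighbors(adj, node, k):
    """Uniformly sample at most k neighbors of `node` (GraphSAGE-style)."""
    nbrs = adj[node]
    return list(nbrs) if len(nbrs) <= k else random.sample(nbrs, k)

def receptive_field(adj, node, fanouts):
    """All nodes needed to embed `node`, with one sampled fanout per layer.
    Even with sampling, the frontier can grow multiplicatively with depth,
    which is why recursive aggregation is costly for deep GNNs."""
    frontier, needed = {node}, {node}
    for k in fanouts:                         # one fanout per GNN layer
        nxt = set()
        for u in frontier:
            nxt.update(sample_neighbors(adj, u, k))
        needed |= nxt
        frontier = nxt
    return needed

# toy example: complete graph on 5 nodes, 2-layer GNN with fanout 2
adj = {i: [j for j in range(5) if j != i] for i in range(5)}
rf = receptive_field(adj, 0, fanouts=[2, 2])
```

With fanout k per layer, the sampled receptive field of a single node can still contain up to 1 + k + k² + … nodes, which is the exponential growth discussed next.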
However, though this sampling method helps reduce the memory and time cost for shallow GNNs, it computes the representation of a node recursively, and the node's receptive field grows exponentially with the number of GNN layers, which may make the memory and time cost even larger for deeper GNNs when the sample number is big. Recently, You et al. (2020) proposed a layer-wise sequential training algorithm for GNNs, which decouples the aggregation and transformation operations in the per-layer feed-forward process and reduces the time and memory cost during training while not sacrificing too much model capability; this indicates that the GNN layers do not have to be learned jointly. However, sequential training brings its own inefficiency. Indeed, beyond the cost of the graph convolution operation itself, as discussed in Belilovsky et al. (2019a), the sequential nature of standard backpropagation also leads to inefficiency. As pointed out in Jaderberg et al. (2017), backpropagation for deep neural networks suffers from an update-locking problem: each layer relies on feedback from its successor layers to update itself, and thus must wait for information to propagate through the whole network before updating. This is a major obstacle to training GNN layers in parallel under time and memory constraints, and it prevents GNNs from being trained in an asynchronous setting. In this work, using semi-supervised node classification as an example, we show that greedy learning decouples the optimization of each layer in a GNN and achieves update-unlocking, i.e., it allows GNN layers to update without any feedback from later layers. This decoupled greedy learning enables parallelization across network layers, making model training much more efficient, which is important for time- or memory-limited applications.
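A minimal NumPy sketch of the decoupled greedy idea: one GCN-style layer paired with its own auxiliary softmax classifier, updated from a purely local loss. This is an illustration of the concept under our own simplifying assumptions (class name, initialization, and hyperparameters are ours), not the authors' implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

class GreedyModule:
    """One GCN-style layer plus its own auxiliary classifier. It takes a
    gradient step on a purely local loss, so it never waits for feedback
    from later layers (update-unlocking)."""
    def __init__(self, F, d_in, d_hid, n_cls, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.F = F                                    # normalized adjacency
        self.W = rng.normal(0, 0.1, (d_in, d_hid))    # layer weights
        self.V = rng.normal(0, 0.1, (d_hid, n_cls))   # auxiliary classifier
        self.lr = lr

    def forward(self, H):
        return np.maximum(self.F @ H @ self.W, 0.0)

    def local_update(self, H_in, y):
        """One gradient step on the greedy cross-entropy objective;
        H_in is treated as a constant (no gradient to earlier modules)."""
        Z = self.F @ H_in @ self.W
        H = np.maximum(Z, 0.0)                        # ReLU
        P = softmax(H @ self.V)
        n = len(y)
        G = P.copy(); G[np.arange(n), y] -= 1.0; G /= n   # dL/dlogits
        dV = H.T @ G
        dZ = (G @ self.V.T) * (Z > 0)                 # backprop through ReLU
        dW = (self.F @ H_in).T @ dZ
        self.V -= self.lr * dV
        self.W -= self.lr * dW
        return -np.log(P[np.arange(n), y] + 1e-12).mean()

# toy usage: 4-node path graph, 2 classes, one greedy module
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
A_t = A + np.eye(4)
d = A_t.sum(axis=1)
F = np.diag(d ** -0.5) @ A_t @ np.diag(d ** -0.5)
X, y = np.eye(4), np.array([0, 0, 1, 1])
mod = GreedyModule(F, d_in=4, d_hid=8, n_cls=2)
losses = [mod.local_update(X, y) for _ in range(100)]
```

Because each module's update reads only its own input and labels, modules at different depths can in principle run their updates concurrently, which is the source of the parallelism discussed above.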
Moreover, we propose a lazy-update scheme during training, which exchanges information between layers only after a certain number of epochs instead of at every epoch; this further improves efficiency while not sacrificing much performance. We theoretically analyze the computational complexity of our proposed method, and draw an analogy between our method and classic block coordinate descent optimization to enable further analysis. We run a set of experiments to justify our model, and show its efficiency on all benchmark datasets. On the newly proposed large OGBN-arxiv dataset, when training a 7-layer model, our proposed method saves 85% of the training time and 66% of the per-GPU memory cost of a conventionally trained GCN. Our main contributions can be summarized as follows. First, we introduce a decoupled greedy learning algorithm for GNNs that achieves update-unlocking and enables GNN layers to be trained in parallel. Next, we propose a lazy-update scheme to improve training efficiency. We evaluate our proposed training strategy thoroughly on benchmark datasets, and demonstrate that it has superior efficiency while not sacrificing much performance. Finally, our method is not limited to the GCN and the node classification task: it can be combined with other scalability-enhancing GNNs and applied to other graph-related tasks.
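The lazy-update schedule is pure control flow, so it can be sketched independently of the layer internals. In this hypothetical sketch, `StubModule` merely counts calls to stand in for a greedy module, and `K` is the exchange period (both names are ours):

```python
class StubModule:
    """Stand-in for a greedy GNN module; counts how often its output is
    recomputed for the next module versus how often it updates locally."""
    def __init__(self):
        self.forward_calls = 0
        self.update_calls = 0

    def forward(self, H):
        self.forward_calls += 1
        return H                          # identity, for illustration only

    def local_update(self, H, y):
        self.update_calls += 1            # one step on the greedy objective

def train_lazy(modules, X, y, epochs, K):
    """Lazy updates: module i refreshes its cached input from module i-1
    only every K epochs instead of every epoch, cutting inter-module
    communication by roughly a factor of K."""
    cached = [None] * len(modules)
    for epoch in range(epochs):
        for i, m in enumerate(modules):
            if cached[i] is None or epoch % K == 0:
                cached[i] = X if i == 0 else modules[i - 1].forward(cached[i - 1])
            m.local_update(cached[i], y)
    return modules

mods = train_lazy([StubModule() for _ in range(3)], X=[[0.0]], y=[0],
                  epochs=12, K=4)
```

Each module still takes one local step per epoch, but the expensive forward pass that hands activations to the next module runs only on refresh epochs.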

2. RELATED WORK

Before discussing our proposed approach, we review related work on efficient training strategies for GNNs. The computational complexities of the discussed methods are summarized in Table 1, and we refer the reader to Appendix A for the detailed computations.

2.1. DEEP GRAPH CONVOLUTIONAL NETWORK (DEEPGCN)

Graph convolutional network (GCN, Kipf & Welling, 2016) is one of the most popular models for graph-related tasks. Given an undirected graph G with node feature matrix X ∈ R^{N×D} and adjacency matrix A ∈ R^{N×N}, where N is the number of nodes and D is the feature dimension, let Ã = A + I, let D̃ be the diagonal matrix with D̃_ii = Σ_{j=1}^{N} Ã_ij, and let F = D̃^{-1/2} Ã D̃^{-1/2} be the normalized Ã. Then the l-th GCN layer produces the output H^(l) = σ(F H^(l-1) W^(l)), where σ is a non-linear activation and W^(l) is the trainable weight matrix at layer l. As pointed out in Li et al. (2018), when a GCN becomes deep, it suffers from a severe over-smoothing problem, which means the node representations become indistinguishable after stacking too many network layers. However, for applications such as semantic segmentation (Li et al., 2019) or classification on large graphs (Hu et al., 2020), deep GCNs are needed, which makes the computational cost of training them a central concern.
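The propagation rule above can be written in a few lines of dense NumPy (a toy illustration; real implementations use sparse matrices):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: H' = ReLU(F H W), with F the symmetrically
    normalized adjacency including self-loops (Kipf & Welling, 2016)."""
    N = A.shape[0]
    A_tilde = A + np.eye(N)                   # Ã = A + I (add self-loops)
    d = A_tilde.sum(axis=1)                   # degrees of Ã
    D_inv_sqrt = np.diag(d ** -0.5)           # D̃^{-1/2}
    F = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # F = D̃^{-1/2} Ã D̃^{-1/2}
    return np.maximum(F @ H @ W, 0.0)         # σ = ReLU

# toy graph: a 3-node path, 2 input features, 2 hidden units
rng = np.random.default_rng(0)
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = rng.standard_normal((3, 2))
W = rng.standard_normal((2, 2))
H1 = gcn_layer(A, X, W)
```

Stacking L such layers multiplies by F a total of L times, which is exactly the repeated aggregation whose cost the methods below try to reduce.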



2.2. SAMPLING-BASED TRAINING METHODS

Neighborhood sampling, as in GraphSAGE (Hamilton et al., 2017), reduces the per-layer aggregation cost, but the memory and time cost can grow even larger for deeper GNNs when the sample number is big. The work of Chen et al. (2017; 2018) and Zou et al. (2019) developed sampling-based stochastic training methods to train GNNs more efficiently and avoid this exponential growth problem. Chiang et al. (2019) proposed a batch learning algorithm that exploits the graph clustering structure. Beyond the aforementioned methods, You et al. (2020) recently proposed a layer-wise sequential training algorithm for GNNs that decouples the aggregation and transformation operations in the per-layer feed-forward process.

