ADAGCN: ADABOOSTING GRAPH CONVOLUTIONAL NETWORKS INTO DEEP MODELS

Abstract

The design of deep graph models remains to be investigated, and a crucial question is how to explore and exploit the knowledge from different hops of neighbors in an efficient way. In this paper, we propose a novel RNN-like deep graph neural network architecture by incorporating AdaBoost into the computation of the network. The proposed graph convolutional network, called AdaGCN (Adaboosting Graph Convolutional Network), has the ability to efficiently extract knowledge from high-order neighbors of the current nodes and then integrates the knowledge from different hops of neighbors into the network in an AdaBoost way. Different from other graph neural networks that directly stack many graph convolution layers, AdaGCN shares the same base neural network architecture among all "layers" and is recursively optimized, which is similar to an RNN. Besides, we also theoretically establish the connection between AdaGCN and existing graph convolutional methods, presenting the benefits of our proposal. Finally, extensive experiments demonstrate the consistent state-of-the-art prediction performance on graphs across different label rates and the computational advantage of our approach AdaGCN 1 .

1. INTRODUCTION

Recently, research related to learning on graph-structured data has gained considerable attention in the machine learning community. Graph neural networks (Gori et al., 2005; Hamilton et al., 2017; Veličković et al., 2018), particularly graph convolutional networks (Kipf & Welling, 2017; Defferrard et al., 2016; Bruna et al., 2014), have demonstrated their remarkable ability on node classification (Kipf & Welling, 2017), link prediction (Zhu et al., 2016) and clustering tasks (Fortunato, 2010). Despite their enormous success, almost all of these models have shallow architectures with only two or three layers. The shallow design of GCN appears counterintuitive, as deep versions of these models, in principle, have access to more information, but perform worse. Oversmoothing (Li et al., 2018) has been proposed to explain why deep GCN fails: by repeatedly applying Laplacian smoothing, GCN may mix the node features from different clusters and make them indistinguishable. This also indicates that by stacking too many graph convolutional layers, the embedding of each node in GCN is inclined to converge to a certain value (Li et al., 2018), making classification harder. These shallow architectures, restricted by the oversmoothing issue, are limited in their ability to extract knowledge from high-order neighbors, i.e., features from remote hops of neighbors of the current nodes. Therefore, it is crucial to design deep graph models such that high-order information can be aggregated in an effective way for better predictions. Some works (Xu et al., 2018b; Liao et al., 2019; Klicpera et al., 2018; Li et al., 2019; Liu et al., 2020) have tried to address this issue partially; we defer the discussion to Appendix A.1. By contrast, we argue that a key direction for constructing deep graph models lies in the efficient exploration and effective combination of information from different orders of neighbors.
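The oversmoothing effect described above can be reproduced on a toy graph. The sketch below (an assumed setup, not from the paper) repeatedly applies the GCN propagation matrix Â to cluster-indicator features on two triangles joined by a single bridge edge; after many steps the two feature columns become nearly identical on every node, so the clusters are no longer distinguishable.

```python
import numpy as np

def normalized_adj(A):
    """Â = D̃^{-1/2} (A + I) D̃^{-1/2}, the propagation matrix used by GCN."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * np.outer(d_inv_sqrt, d_inv_sqrt)

# Two triangles {0,1,2} and {3,4,5} joined by the bridge edge (2,3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
X = np.array([[1., 0.]] * 3 + [[0., 1.]] * 3)  # cluster-indicator features

A_hat = normalized_adj(A)
H = X.copy()
for _ in range(1000):          # repeated Laplacian smoothing
    H = A_hat @ H
# Initially the two columns differ by 1 on every node; after smoothing
# the gap between them has (almost) vanished.
gap = np.abs(H[:, 0] - H[:, 1]).max()
```

The node embeddings converge (up to a degree-dependent scaling) to a common stationary vector, which is exactly the failure mode Li et al. (2018) analyze.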
Due to the apparent sequential relationship between different orders of neighbors, it is a natural choice to incorporate a boosting algorithm into the design of deep graph models. As an important realization of boosting theory, AdaBoost (Freund et al., 1999) is extremely easy to implement and remains competitive in terms of both practical performance and computational cost (Hastie et al., 2009). Moreover, boosting theory has been used to analyze the success of ResNets in computer vision (Huang et al., 2018), and AdaGAN (Tolstikhin et al., 2017) has already successfully incorporated boosting into the training of GANs (Goodfellow et al., 2014). In this work, we focus on incorporating AdaBoost into the design of deep graph convolutional networks in a non-trivial way. Firstly, in pursuit of the AdaBoost framework, we refine the type of graph convolution and thus obtain a novel RNN-like GCN architecture called AdaGCN. Our approach can efficiently extract knowledge from different orders of neighbors and then combine this information in an AdaBoost manner with iterative updating of the node weights. Also, we compare our AdaGCN with existing methods from the perspective of both architectural difference and feature representation power to show the benefits of our method. Finally, we conduct extensive experiments to demonstrate the consistent state-of-the-art performance of our approach across different label rates and its computational advantage over other alternatives.
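To make the combination scheme concrete, the sketch below is a standard multi-class AdaBoost (SAMME) loop over base classifiers, one per neighborhood hop. The function names and the base-learner interface (`fit_base`) are hypothetical, and the paper's exact AdaBoost variant may differ; this only illustrates the iterative node-weight update and the weighted vote.

```python
import numpy as np

def adaboost_combine(hop_features, labels, train_mask, fit_base, n_classes):
    """SAMME-style combination of per-hop base classifiers (sketch).

    hop_features : list of feature matrices [X, ÂX, Â²X, ...]
    fit_base(H, y, w) -> a predict function mapping features to class labels
    """
    y = labels[train_mask]
    w = np.ones(y.shape[0]) / y.shape[0]          # node weights, uniform init
    combined = np.zeros((hop_features[0].shape[0], n_classes))
    for H in hop_features:
        predict = fit_base(H[train_mask], y, w)    # train on weighted nodes
        miss = (predict(H[train_mask]) != y)       # misclassified train nodes
        err = np.clip(np.sum(w * miss) / np.sum(w), 1e-10, 1 - 1e-10)
        alpha = np.log((1 - err) / err) + np.log(n_classes - 1)  # SAMME weight
        w = w * np.exp(alpha * miss)               # upweight hard nodes
        w /= w.sum()
        combined += alpha * np.eye(n_classes)[predict(H)]  # weighted vote
    return combined.argmax(axis=1)
```

Each hop contributes a weighted one-hot vote, and nodes misclassified at hop l receive more attention at hop l + 1, mirroring the sequential structure of neighborhoods.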

2. OUR APPROACH: ADAGCN

2.1 ESTABLISHMENT OF ADAGCN

Consider an undirected graph G = (V, E) with N nodes v_i ∈ V and edges (v_i, v_j) ∈ E. A ∈ R^{N×N} is the adjacency matrix with corresponding degree matrix D_ii = Σ_j A_ij. In the vanilla GCN model (Kipf & Welling, 2017) for semi-supervised node classification, the graph embedding of nodes with two convolutional layers is formulated as:

Z = Â ReLU(ÂXW^(0))W^(1)    (1)

where Z ∈ R^{N×K} is the final embedding matrix (output logits) of nodes before softmax and K is the number of classes. X ∈ R^{N×C} denotes the feature matrix, where C is the input dimension. Â = D̃^{-1/2} Ã D̃^{-1/2}, where Ã = A + I and D̃ is the degree matrix of Ã. In addition, W^(0) ∈ R^{C×H} is the input-to-hidden weight matrix for a hidden layer with H feature maps, and W^(1) ∈ R^{H×K} is the hidden-to-output weight matrix.

Our key motivation for constructing deep graph models is to efficiently explore the information of high-order neighbors and then combine the messages from different orders of neighbors in an AdaBoost way. Nevertheless, if we naively extract information from high-order neighbors based on GCN, we face stacking l layers' parameter matrices W^(i), i = 0, ..., l-1, which is definitely costly in computation. Besides, Multi-Scale Deep Graph Convolutional Networks (Luan et al., 2019) also theoretically demonstrated that if we simply deepen GCN, the output can only contain the stationary information of the graph structure and, due to smoothing, loses all the local information in nodes. Intuitively, a desirable representation of node features does not necessarily require many nonlinear transformations f applied to it. This is simply due to the fact that the feature of each node is normally a one-dimensional sparse vector rather than a multi-dimensional data structure, e.g., an image, which intuitively requires a deep convolutional network to extract high-level representations for vision tasks.
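For reference, the two-layer GCN forward pass of Eq. (1) can be written in a few lines of dense numpy (a minimal sketch; a real implementation would use sparse matrices and learned weights):

```python
import numpy as np

def normalize_adj(A):
    """Symmetrically normalized adjacency Â = D̃^{-1/2} (A + I) D̃^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])           # add self-loops: Ã = A + I
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * np.outer(d_inv_sqrt, d_inv_sqrt)

def gcn_forward(A, X, W0, W1):
    """Two-layer GCN logits: Z = Â ReLU(Â X W0) W1, as in Eq. (1)."""
    A_hat = normalize_adj(A)
    H = np.maximum(A_hat @ X @ W0, 0.0)        # hidden layer with ReLU
    return A_hat @ H @ W1                      # output logits before softmax
```

Note that every layer multiplies by Â once, so an l-layer variant would both aggregate l-hop neighborhoods and require jointly optimizing l weight matrices, which is the cost the text above points out.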
This insight has been empirically demonstrated in many recent works (Wu et al., 2019; Klicpera et al., 2018; Xu et al., 2018a), showing that a two-layer fully-connected neural network is a better choice in the implementation. Our AdaGCN also follows this direction by choosing an appropriate f in each layer rather than directly deepening the GCN layers. Thus, we propose to remove the ReLU to avoid the expensive joint optimization of multiple parameter matrices. Similarly, Simplified Graph Convolution (SGC) (Wu et al., 2019) also adopted this practice.
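With the ReLU between propagation steps removed, the l-hop inputs reduce to powers of Â applied to X, computed with one matrix product per hop and no joint optimization over l weight matrices; each resulting feature matrix can then be fed to its own shallow classifier f (e.g., a two-layer MLP). The helper below is an illustrative sketch of that precomputation (the function name and interface are assumed, not from the paper):

```python
import numpy as np

def hop_features(A_hat, X, L):
    """Return [X, ÂX, Â²X, ..., Â^L X], the per-hop inputs to the base
    classifiers f. Each step costs a single (sparse) matrix product, so no
    ReLU and no stacked weight matrices are involved in the propagation."""
    feats = [X]
    for _ in range(L):
        feats.append(A_hat @ feats[-1])
    return feats
```

In practice A_hat would be a sparse matrix, making each hop an O(|E| · C) operation.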

