DEEPER-GXX: DEEPENING ARBITRARY GNNS

Abstract

Recently, motivated by real applications, a major research direction in graph neural networks (GNNs) has been to explore deeper structures. For instance, graph connectivity is not always consistent with the label distribution (e.g., the closest neighbors of some nodes are not from the same category). In this case, GNNs need to stack more layers in order to find neighbors of the same category along a longer path and capture the class-discriminative information. However, two major problems hinder deeper GNNs from obtaining satisfactory performance, i.e., vanishing gradient and over-smoothing. On one hand, stacking layers makes the neural network hard to train, as the gradients of the first few layers vanish. Moreover, when simply addressing vanishing gradient in GNNs, we discover the shading neighbors effect (i.e., stacking layers inappropriately distorts the non-IID information of graphs and degrades the performance of GNNs). On the other hand, deeper GNNs aggregate much more information from common neighbors, such that individual node representations share more overlapping features, which makes the final output representations not discriminative (i.e., overly smoothed). In this paper, for the first time, we address both problems to enable deeper GNNs, and propose Deeper-GXX, which consists of the Weight-Decaying Graph Residual Connection module (WDG-ResNet) and the Topology-Guided Graph Contrastive Loss (TGCL). Extensive experiments on real-world data sets demonstrate that Deeper-GXX outperforms state-of-the-art deeper baselines.

1. INTRODUCTION

Graph neural networks (GNNs) have proven successful at modeling graph data by extracting node hidden representations that are effective for many downstream tasks. In general, they are realized by the message passing schema and aggregate neighbor features to obtain node hidden representations (Kipf & Welling, 2017; Hamilton et al., 2017a; Velickovic et al., 2018; Xu et al., 2019). Recently, the surge of big data has made graphs' structural and attribute information much more complex and uncertain, which urges researchers to make GNNs deeper (i.e., stacking more graph neural layers) in order to capture more meaningful information for better performance. For example, in social media, people from different categories (e.g., occupation, interests, etc.) are often connected (e.g., become friends), and users' immediate neighbor information may not reflect their categorical information. Thus, deepening GNNs is necessary to identify the neighbors from the same category along a longer path (e.g., k-hop neighbors), and to aggregate their features to obtain class-discriminative node representations. To demonstrate the benefit of deeper GNNs, we conduct a case study shown in Figure 1 (see the detailed experimental setup in Appendix A.1). In Figure 1a, we observe that the query node (the diamond in the black dashed circle) cannot rely on its closest labeled neighbor (the red star in the circle) to correctly predict its label (blue). Only by exploring longer paths consisting of more similar neighbors are we able to predict its label as blue. Figure 1b compares the classification accuracy of shallow GNNs and deeper GNNs. We can see that deeper GNNs significantly outperform shallow ones by more than 11%, due to their ability to explore longer paths on the graph. Similar observations of the benefits of deeper GNNs are also found in the missing feature scenario presented in Section 3.3.
However, simply stacking GNN layers can be problematic due to the vanishing gradient and over-smoothing issues. On one hand, increasing the number of neural layers can make the model hard to train, such that both the training error and the test error are higher than those of shallow models. This is mainly caused by the vanishing gradient issue (He et al., 2016), where the gradients of the first few layers vanish such that the training loss cannot be successfully propagated through deeper models. ResNet (He et al., 2016) has been proposed to address this issue. However, we discover that simply combining ResNet with GNNs still leads to a sub-optimal solution: as ResNet stacks layers, the importance of close neighbors' features gradually decreases during the GNN information aggregation process, and the faraway neighbor information becomes dominant. We call this effect shading neighbors. On the other hand, GNNs utilize the message passing schema to aggregate neighbor features in order to get class-discriminative node representations. However, by stacking more layers, each node begins to share more and more overlapping neighbor information during the aggregation process, and the node representations gradually become indistinguishable (Li et al., 2018; Oono & Suzuki, 2020). This has been referred to as the over-smoothing issue, and it significantly affects the performance of downstream tasks such as node classification and link prediction.
In this paper, we study how to effectively stack GNN layers by addressing shading neighbors and over-smoothing at the same time. First, to address the shading neighbors effect caused by the direct application of ResNet to GNNs, we propose the Weight-Decaying Graph Residual Connection (WDG-ResNet), which learns the weight of each residual connection layer (instead of setting it to 1 as in ResNet), and further introduces a decaying factor to refine the weight of each layer.
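The idea of a per-layer residual weight refined by a decaying factor can be illustrated with a minimal NumPy sketch. This is not the exact formulation of WDG-ResNet: the exponential decay form exp(-l/λ), the scalar `alphas[l]` standing in for a learned weight, the ReLU activation, and the function names are all assumptions made for illustration.

```python
import numpy as np

def normalized_adjacency(adj):
    """Symmetrically normalize A + I, as in a standard GCN layer."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def decay_factor(layer, lam):
    """Hypothetical decaying factor: shrinks as the layer index grows."""
    return np.exp(-layer / lam)

def wdg_residual_forward(adj, x, weights, alphas, lam):
    """Stack GCN-style layers with weighted, decaying residual updates.

    alphas[l] stands in for the learned weight of layer l; a vanilla
    residual connection would fix it to 1 and apply no decay.
    Each weight matrix must be square so the residual sum is defined.
    """
    a_norm = normalized_adjacency(adj)
    h = x
    for layer, (w, alpha) in enumerate(zip(weights, alphas)):
        aggregated = np.maximum(a_norm @ h @ w, 0.0)  # message passing + ReLU
        # Deeper layers (farther hops) contribute with a smaller weight,
        # so information from close neighbors is not overwhelmed.
        h = h + alpha * decay_factor(layer, lam) * aggregated
    return h
```

Under these assumptions, a small decay parameter makes the contribution of later layers negligible, so adding layers beyond a certain depth barely changes the output.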
Interestingly, we find that the hyperparameter λ of the weight-decaying factor actually controls the number of effective layers in deeper GNNs based on the inherent properties of the input graph, which is verified in Appendix A.5. Second, to address over-smoothing, we propose the Topology-Guided Graph Contrastive Loss (TGCL) in the contrastive learning manner (van den Oord et al., 2018), where the hidden representations of positive pairs should be pulled closer and those of negative pairs should be pushed apart. Through theoretical and empirical analysis, we find that TGCL can be effectively and efficiently realized by considering only 1-hop neighbors as positive pairs and all the rest as negative pairs. Combining the proposed WDG-ResNet and TGCL, we propose an end-to-end model called Deeper-GXX to help arbitrary GNNs go deeper. Our contributions can be summarized as follows.

2. PROPOSED METHOD

In this section, we begin with an overview of Deeper-GXX. Then, we provide the details of the proposed Weight-Decaying Graph Residual Connection (WDG-ResNet) and Topology-Guided Graph Contrastive Loss (TGCL), which address the shading neighbors and over-smoothing problems, respectively. We formalize the graph embedding problem in the context of an undirected graph G = (V, E, X), where V consists of n vertices, E consists of m edges, X ∈ R^{n×d} denotes the feature



Figure 1: A toy example to demonstrate the benefit of deeper GNN models. (a) Two groups of nodes in the semi-supervised setting. Stars are labeled, dots are unlabeled, and the diamond is the query node. Euclidean distance between two nodes indicates the edge connection. (b) Comparison of node classification accuracy between shallow and deeper GNN models using the data in (a). The deeper GNNs are realized by our Deeper-GXX with corresponding backbones.

• We propose Weight-Decaying Graph Residual Connection (WDG-ResNet) to address the shading neighbors effect caused by vanilla ResNet in dealing with the vanishing gradient of GNNs.
• We propose Topology-Guided Graph Contrastive Loss (TGCL) to address the over-smoothing problem by encoding the graph topological information into the discriminative node representations.
• We combine the proposed WDG-ResNet and TGCL into an end-to-end model called Deeper-GXX, which is model-agnostic and can help arbitrary GNNs go deeper.
• Extensive experiments show that Deeper-GXX outperforms state-of-the-art deeper baselines.
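
To make the contrastive idea behind TGCL concrete, the following is a minimal NumPy sketch of a loss that treats 1-hop neighbors as positive pairs and all remaining nodes as negatives. It is an illustration under stated assumptions rather than the exact TGCL objective: the InfoNCE-style form, the cosine similarity, the `temperature` parameter, and the function name are hypothetical choices. The adjacency matrix is assumed to have no self-loops, and every node representation is assumed to be non-zero.

```python
import numpy as np

def topology_contrastive_loss(h, adj, temperature=0.5):
    """InfoNCE-style loss: 1-hop neighbors are positives, other nodes negatives.

    h   : (n, d) node representations, rows assumed non-zero
    adj : (n, n) adjacency matrix, assumed to have no self-loops
    """
    # Exponentiated cosine similarity between every pair of nodes.
    h_norm = h / np.linalg.norm(h, axis=1, keepdims=True)
    sim = np.exp(h_norm @ h_norm.T / temperature)
    np.fill_diagonal(sim, 0.0)  # a node is never contrasted with itself

    pos_mask = adj > 0                          # 1-hop neighbors
    neg_sum = (sim * ~pos_mask).sum(axis=1)     # all remaining nodes

    # One term per (node, neighbor) pair: pull the pair together,
    # push the node away from its non-neighbors.
    rows, cols = np.nonzero(pos_mask)
    pos = sim[rows, cols]
    return float(np.mean(-np.log(pos / (pos + neg_sum[rows]))))
```

In this sketch, representations that are identical across all nodes (the over-smoothed case) yield a higher loss than representations that separate connected groups, which is the behavior a loss of this kind is meant to penalize.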

