REVISITING OVER-SMOOTHING IN GRAPH NEURAL NETWORKS

Abstract

Shallow graph neural networks (GNNs) are state-of-the-art models for relational data. However, deep GNNs are known to suffer from over-smoothing: as the number of layers increases, node representations become nearly indistinguishable and model performance on the downstream task degrades significantly. Although multiple approaches have been proposed to address this problem, it is unclear when any of these methods (or their combinations) works best and how they perform when evaluated under exactly the same experimental setting. In this paper we systematically and carefully evaluate different methods for addressing over-smoothing in GNNs. Furthermore, inspired by standard deeply supervised nets, we propose a general architecture that deals with over-smoothing based on the idea of layer-wise supervision. We term this architecture deeply supervised GNNs (or DSGNNs for short). Our experiments show that deeper GNNs can indeed provide better performance when trained with a combination of different approaches, and that DSGNNs are robust under various conditions and can provide the best performance in missing-feature scenarios.

1. INTRODUCTION

Graph Neural Networks, first introduced by Scarselli et al. (2009), have emerged as the de facto standard for representation learning on relational or graph-structured data. GNNs find many useful applications in building predictive models for traffic speed prediction, product recommendation, and drug discovery (Zhou et al., 2020a; Gaudelet et al., 2021). One of the most important applications of GNNs is node property prediction, as in semi-supervised classification of papers (nodes) in a citation network (see, e.g., Kipf & Welling, 2017). In this case, we are given labels for a subset of the nodes and aim to learn an algorithm that can accurately predict the labels of the remaining nodes using the network structure and (if available) node features. Even though GNNs, and especially those based on the Graph Convolutional Network (GCN) formulation (Kipf & Welling, 2017) and its extensions (Veličković et al., 2018; Hamilton et al., 2017; Wu et al., 2019; Klicpera et al., 2018), have been shown to be a powerful tool for graph representation learning, they are limited in depth, that is, in the number of graph convolutional layers. Indeed, deep GNNs suffer from the problem of over-smoothing: as the number of layers increases, node representations become nearly indistinguishable and model performance on the downstream task deteriorates significantly. Increasing model depth is necessary to allow information to travel between distant nodes in the graph, since each graph convolutional layer propagates information from a node's one-hop neighbourhood. Previous work has analyzed and quantified the over-smoothing problem (Liu et al., 2020; Zhao & Akoglu, 2020; Chen et al., 2020a) and proposed methodologies to address it explicitly (Li et al., 2018; Zhao & Akoglu, 2020; Xu et al., 2018).
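The collapse of node representations under repeated propagation can be seen directly in a toy experiment. The following sketch (not from the paper; graph and features are illustrative) applies the GCN propagation matrix many times to a small path graph and checks that almost nothing distinguishing the nodes survives:

```python
# Toy illustration of over-smoothing: repeatedly applying the GCN
# propagation matrix A_hat = D~^{-1/2} (A + I) D~^{-1/2} collapses node
# features onto a single direction, making nodes nearly indistinguishable.
import numpy as np

def gcn_propagation_matrix(A):
    """Symmetrically normalized adjacency with self-loops (Kipf & Welling, 2017)."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

# A small path graph 0 - 1 - 2 - 3 with random 3-dimensional node features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))

A_hat = gcn_propagation_matrix(A)

# For a connected graph, A_hat^k converges to a rank-one projector onto
# v = D~^{1/2} 1, so in the limit nodes differ only by their degree.
v = np.sqrt(A.sum(axis=1) + 1.0)
v /= np.linalg.norm(v)

H = X.copy()
for _ in range(100):              # ~100 linear propagation steps
    H = A_hat @ H
residual = H - np.outer(v, v @ H)  # the part of H that still separates nodes

# After many layers the node-separating component has all but vanished.
assert np.linalg.norm(residual) < 1e-6 * np.linalg.norm(H)
```

This linear view omits the per-layer weights and non-linearities of a real GCN, but it captures why stacking many propagation steps washes out node identity.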
Some of the most recent approaches have focused on forcing diversity in latent node representations via residual connections (see, e.g., Xu et al., 2018; Chen et al., 2020b), normalization (see, e.g., Zhou et al., 2020b; Zhao & Akoglu, 2020), and enforced sparsity (see, e.g., Rong et al., 2020). However, the proposed solutions have mostly been shown to alleviate over-smoothing rather than eliminate it, with shallow networks usually performing best. In this context, alleviating means that performance does not catastrophically deteriorate as a function of network depth. One notable exception is the GCNII architecture (Chen et al., 2020b), which was shown to improve the performance of standard GNNs on classification tasks when using deeper architectures. One interesting scenario for the analysis of deep GNNs is when only a subset of the nodes have features, which we will refer to as the missing-feature setting (Zhao & Akoglu, 2020). In this case, most of the solutions mentioned above (with the exception of GCNII) have been shown to be superior to the basic GCN architecture. Intuitively, as pointed out by Zhao & Akoglu (2020), a large number of propagation steps (i.e., deeper GNNs) may be required to obtain useful node feature representations.

While acknowledging the significant advances towards making GNNs more robust to over-smoothing, we have found important gaps in the literature that, we believe, need to be recognized and addressed by the community. Firstly, and naturally, most previous work has focused on developing new algorithms and showing that they outperform previous approaches. While there is nothing inherently wrong with this, it usually comes at the expense of empirical results across different algorithms not using exactly the same settings (such as hyper-parameter optimization). Secondly, we have also found that all proposed solutions are general enough that they can be combined.
However, these combinations and their performance have not been studied in detail. Finally, another crucial gap in the GNN over-smoothing literature is that previous approaches tackling (seemingly different but) related problems in standard neural networks have not been investigated. In particular, we refer to the work on deeply supervised nets (Lee et al., 2015) for learning discriminative and robust features and for dealing with vanishing/exploding gradients.

Contributions: In this paper, (i) we address the above gaps by systematically and carefully evaluating several proposed GNN over-smoothing solutions and their combinations. We analyze their performance in the transductive, semi-supervised node classification setting, in both the standard fully observed and missing-feature settings. Furthermore, inspired by the work of Lee et al. (2015), (ii) we propose a new general architecture for tackling over-smoothing. Our architecture trains predictors using node representations from all layers, each contributing equally to the loss function, thereby encouraging the GNN to learn discriminative features at all network depths. We name our approach deeply-supervised graph neural networks (DSGNNs). (iii) We show that DSGNNs are resilient to over-smoothing in deep networks and can outperform competing methods on challenging datasets. Finally, (iv) we provide recommendations for the selection of a GNN architecture for practical applications.
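The layer-wise supervision idea behind DSGNNs can be sketched as follows. This is a minimal illustration with hypothetical names, not the paper's implementation: a softmax classifier reads out every layer's node representations, and the per-layer cross-entropy losses contribute equally to the training objective.

```python
# Minimal sketch (names hypothetical) of deep supervision for GNNs:
# each layer's node representations feed their own classifier, and the
# per-layer losses are averaged so every depth contributes equally.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def deeply_supervised_loss(layer_reps, classifiers, labels):
    """Equal-weight average of per-layer classification losses.

    layer_reps:  list of (N, d_l) node representations, one per GNN layer.
    classifiers: list of (d_l, C) classifier weight matrices, one per layer.
    """
    losses = [cross_entropy(H @ W, labels)
              for H, W in zip(layer_reps, classifiers)]
    return sum(losses) / len(losses)   # each layer contributes equally

# Tiny example: 5 labelled nodes, 3 layers, 4-dim reps, 2 classes.
rng = np.random.default_rng(1)
reps = [rng.normal(size=(5, 4)) for _ in range(3)]
Ws = [rng.normal(size=(4, 2)) for _ in range(3)]
y = np.array([0, 1, 0, 1, 1])

loss = deeply_supervised_loss(reps, Ws, y)
assert loss > 0.0
```

Because even the shallow layers receive a direct training signal, the network is pushed to keep node representations discriminative at every depth rather than only at the final layer.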

2. GRAPH NEURAL NETWORKS

Let a graph be represented as the tuple G = (V, E), where V is the set of nodes and E the set of edges, with |V| = N. We assume that each node v ∈ V is associated with an attribute vector x_v ∈ R^d, and let X ∈ R^{N×d} collect the attribute vectors of all nodes in the graph. Let A ∈ R^{N×N} denote the graph adjacency matrix; here we assume that A is symmetric and binary, i.e., A_{ij} ∈ {0, 1}, where A_{ij} = 1 if there is an edge between nodes i and j, i.e., (v_i, v_j) ∈ E, and A_{ij} = 0 otherwise. Also, let D denote the diagonal degree matrix with D_{ii} = \sum_{j=0}^{N-1} A_{ij}. Typical GNNs learn node representations via a neighborhood aggregation function. For a GNN with K layers, we define the neighborhood aggregation centred on node v at layer l as

    h_v^{(l)} = h^{(l)}( f( g( h_v^{(l-1)}, { h_u^{(l-1)} : u ∈ N_v } ) ) ),    (1)

where N_v is the set of node v's neighbors in the graph, g is an aggregation function, f is a linear transformation that could be the identity function, and h^{(l)} is a non-linear function applied element-wise. Let H^{(l)} ∈ R^{N×d^{(l)}} denote the representations of all nodes at the l-th layer, with output dimension d^{(l)}; we set H^{(0)} := X and d^{(0)} := d. A common choice of aggregation function g, proposed by Kipf & Welling (2017), computes a weighted average of the node features, with weights given by a deterministic function of the node degrees: ÂH, where Â = D̃^{-1/2} (A + I) D̃^{-1/2} is the symmetrically normalized adjacency matrix with self-loops, D̃ is the degree matrix of A + I, and I ∈ R^{N×N} is the identity matrix. Substituting this aggregation function in Eq. (1) and specifying
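A node-centric sketch of the aggregation in Eq. (1) may help make the roles of g, f, and the non-linearity concrete. The specific choices below (g as a mean over the node and its neighbours, f a linear map, ReLU as the element-wise non-linearity) are illustrative, not prescribed by the formulation:

```python
# One neighbourhood-aggregation step, h_v = relu(f(g(h_v, {h_u : u in N_v}))),
# with g = mean over the node and its neighbours, f linear, ReLU non-linearity.
import numpy as np

def gnn_layer(H_prev, neighbors, W):
    """Apply one aggregation layer node by node.

    H_prev:    (N, d) representations from the previous layer.
    neighbors: list of neighbor index lists, one per node.
    W:         (d, d_out) weights of the linear transformation f.
    """
    N = H_prev.shape[0]
    H_next = np.empty((N, W.shape[1]))
    for v in range(N):
        # g: average the node's own representation with its neighbours'.
        members = [v] + list(neighbors[v])
        agg = H_prev[members].mean(axis=0)
        # f (linear map), then the element-wise non-linearity.
        H_next[v] = np.maximum(agg @ W, 0.0)
    return H_next

# Path graph 0 - 1 - 2 with 2-dimensional features and identity weights.
H0 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]])
nbrs = [[1], [0, 2], [1]]
W = np.eye(2)
H1 = gnn_layer(H0, nbrs, W)
assert H1.shape == (3, 2)
```

In matrix form, the degree-weighted variant of this step is exactly the ÂH aggregation of Kipf & Welling (2017) described above.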

