REVISITING OVER-SMOOTHING IN GRAPH NEURAL NETWORKS

Abstract

Shallow graph neural networks (GNNs) are state-of-the-art models for relational data. However, it is known that deep GNNs suffer from over-smoothing: as the number of layers increases, node representations become nearly indistinguishable and model performance on the downstream task degrades significantly. Although multiple approaches have been proposed to address this problem, it is unclear when any of these methods (or their combination) works best and how they perform when evaluated under exactly the same experimental setting. In this paper we systematically and carefully evaluate different methods for addressing over-smoothing in GNNs. Furthermore, inspired by standard deeply supervised nets, we propose a general architecture that deals with over-smoothing through layer-wise supervision. We term this architecture deeply supervised GNNs (or DSGNNs for short). Our experiments show that deeper GNNs can indeed provide better performance when trained with a combination of different approaches, and that DSGNNs are robust under various conditions and can provide the best performance in missing-feature scenarios.

1. INTRODUCTION

Graph neural networks (GNNs), first introduced by Scarselli et al. (2009), have emerged as the de facto standard for representation learning on relational or graph-structured data. GNNs have found many useful applications in building predictive models for traffic speed prediction, product recommendation, and drug discovery (Zhou et al., 2020a; Gaudelet et al., 2021). One of the most important applications of GNNs is node property prediction, as in semi-supervised classification of papers (nodes) in a citation network (see, e.g., Kipf & Welling, 2017). In this setting, we are given labels for a subset of the nodes and aim to learn a model that can accurately predict the labels of the remaining nodes using the network structure and (if available) node features. Even though GNNs, and especially those based on the Graph Convolutional Network (GCN) formulation (Kipf & Welling, 2017) and its extensions (Veličković et al., 2018; Hamilton et al., 2017; Wu et al., 2019; Klicpera et al., 2018), have been shown to be a powerful tool for graph representation learning, they are limited in depth, that is, in the number of graph convolutional layers. Indeed, deep GNNs suffer from the problem of over-smoothing: as the number of layers increases, node representations become nearly indistinguishable and model performance on the downstream task deteriorates significantly. Increasing model depth is nevertheless necessary to allow information to travel between distant nodes in the graph, since each graph convolutional layer propagates information from a node's one-hop neighbourhood. Previous work has analyzed and quantified the over-smoothing problem (Liu et al., 2020; Zhao & Akoglu, 2020; Chen et al., 2020a) and has proposed methodologies to address it explicitly (Li et al., 2018; Zhao & Akoglu, 2020; Xu et al., 2018).
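To make the over-smoothing effect concrete, the following minimal sketch (ours, not from the paper; the toy graph and all variable names are illustrative) repeatedly applies the weight-free GCN propagation operator to random node features and tracks how the average pairwise distance between node representations collapses with depth:

```python
import numpy as np

# Toy graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

# GCN-style symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}.
A_hat = A + np.eye(n)
deg = A_hat.sum(axis=1)
P = A_hat / np.sqrt(np.outer(deg, deg))

rng = np.random.default_rng(0)
H = rng.standard_normal((n, 8))  # random initial node features

def mean_pairwise_dist(X):
    """Average Euclidean distance between all pairs of node representations."""
    diffs = X[:, None, :] - X[None, :, :]
    return np.linalg.norm(diffs, axis=-1).mean()

# Each propagation step plays the role of one (weight-free) graph conv layer:
# as depth grows, representations converge and become nearly indistinguishable.
for k in [1, 2, 4, 8, 32]:
    Hk = np.linalg.matrix_power(P, k) @ H
    print(f"{k:2d} layers: mean pairwise distance = {mean_pairwise_dist(Hk):.4f}")
```

Stripping out the learned weights and nonlinearities isolates the aggregation step itself, which is sufficient to reproduce the collapse: the dominant eigenvector of the normalized operator comes to dominate the representations as depth increases.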
Some of the most recent approaches have focused on forcing diversity on latent node representations via residual connections (see, e.g., Xu et al., 2018; Chen et al., 2020b), normalization (see, e.g., Zhou et al., 2020b; Zhao & Akoglu, 2020), and enforced sparsity (see, e.g., Rong et al., 2020). However, the proposed solutions have mostly been shown to alleviate over-smoothing rather than eliminate it completely, with shallow networks usually performing best. In this context, alleviating means that performance does not catastrophically deteriorate as a function of network depth. One notable exception to this is the

