REDUCING OVERSMOOTHING IN GRAPH NEURAL NETWORKS BY CHANGING THE ACTIVATION FUNCTION

Anonymous authors
Paper under double-blind review

Abstract

The performance of Graph Neural Networks (GNNs) deteriorates as the depth of the network increases. This performance drop is mainly attributed to oversmoothing, in which repeated graph convolutions make node representations increasingly similar. We show that in deep GNNs the activation function plays a crucial role in oversmoothing. We explain theoretically why this is the case and propose a simple modification to the slope of ReLU that reduces oversmoothing. The proposed approach enables deep architectures without the need to change the network architecture or to add residual connections. We verify the theoretical results experimentally and further show that deep networks that do not suffer from oversmoothing are beneficial in the presence of the "cold start" problem, i.e., when there is no feature information about unlabeled nodes.

1. INTRODUCTION

Graph Neural Networks (GNNs) utilize message passing, or neighborhood aggregation, schemes to extract representations for nodes and their neighborhoods. GNNs have achieved good results on a variety of graph analytics tasks, such as node classification (Zhang et al., 2018; Hamilton et al., 2017), link prediction (Liben-Nowell & Kleinberg, 2003; Zhang & Chen, 2018) and graph classification (Klicpera et al., 2020). As a result, they play a key role in graph representation learning. One of the most prominent GNN models is the Graph Convolutional Network (GCN) (Kipf & Welling, 2017), which creates a node's representation by averaging the representations (embeddings) of its immediate neighbors.

Several studies have shown that the performance of GCNs deteriorates when their architecture becomes deeper (Li et al., 2018). The success of deep CNNs on many tasks, such as image classification, naturally led to several attempts to build deep GNNs for node classification (Kipf & Welling, 2017; Li et al., 2018; Xu et al., 2018). Increasing the model's depth (and hence its number of parameters) should, in principle, allow richer representations to be learned. However, most existing approaches have failed to develop a sufficiently deep architecture that achieves good performance. There is therefore a need to design new models that can efficiently scale to a large number of layers. The aim of this paper is to investigate the factors that compromise the performance of deep GNNs and to develop a method to address them.

The performance drop of deep GNNs is associated with several factors, including vanishing gradients, overfitting, and the phenomenon called oversmoothing (Li et al., 2018; Xu et al., 2018; Klicpera et al., 2018; Wu et al., 2019). Oversmoothing has been shown to be associated with graph convolution, a type of Laplacian operator. Li et al. (2018) proved that applying this operator repeatedly makes node representations converge to a stationary point. At that point, all of the initial information (i.e., the information in the node features) has been lost through the Laplacian smoothing. Consequently, oversmoothing hurts performance by making node features indistinguishable across different classes.

In this work, we address the oversmoothing problem in deep GNNs. We provide what is, to the best of our knowledge, the first study of the role that the activation function and the per-layer learning rate play in oversmoothing, and we propose a new method to address the problem. We confirm our hypotheses both theoretically and experimentally. The new method is shown to prevent node embeddings from converging to the same point, thus leading to better node representations. We summarize our main contributions as follows.

• Role of the Activation Function in Oversmoothing: We prove theoretically the connection between
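The oversmoothing effect described above can be reproduced numerically. The sketch below is illustrative only, not the paper's implementation: it applies parameter-free GCN layers (the symmetrically normalized adjacency with self-loops of Kipf & Welling (2017), followed by a ReLU whose positive-part slope is a parameter `s`) to random features on a small cycle graph. With the standard slope s = 1 the pairwise distances between node embeddings collapse toward zero with depth; a larger slope (here 1.6, an arbitrary value chosen purely to illustrate that the activation slope interacts with the smoothing contraction, not the specific modification this paper proposes) keeps the embeddings separated.

```python
import numpy as np

# A 6-node cycle graph (connected, so repeated smoothing mixes all nodes).
n = 6
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0

# Symmetrically normalized adjacency with self-loops:
# A_hat = D^{-1/2} (A + I) D^{-1/2}, as in Kipf & Welling (2017).
A_tilde = A + np.eye(n)
d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def max_pairwise_distance(X):
    """Largest Euclidean distance between any two node embeddings (rows of X)."""
    diffs = X[:, None, :] - X[None, :, :]
    return np.linalg.norm(diffs, axis=-1).max()

def propagate(X, depth, slope):
    """Apply `depth` parameter-free GCN layers: X <- slope * ReLU(A_hat @ X)."""
    for _ in range(depth):
        X = slope * np.maximum(A_hat @ X, 0.0)
    return X

rng = np.random.default_rng(0)
X0 = rng.standard_normal((n, 4))  # random initial node features

dist = {s: max_pairwise_distance(propagate(X0, 32, s)) for s in (1.0, 1.6)}
print(f"depth 32, slope 1.0: max pairwise distance = {dist[1.0]:.6f}")
print(f"depth 32, slope 1.6: max pairwise distance = {dist[1.6]:.6f}")
```

With slope 1.0 the distance at depth 32 is essentially zero (all embeddings have converged to the same point up to the stationary scaling), while the larger slope counteracts the contraction of the normalized adjacency and the embeddings remain distinguishable. A trained GCN has weight matrices between layers, but the same contraction argument applies layer by layer.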

