REDUCING OVERSMOOTHING IN GRAPH NEURAL NETWORKS BY CHANGING THE ACTIVATION FUNCTION

Anonymous authors
Paper under double-blind review

Abstract

The performance of Graph Neural Networks (GNNs) deteriorates as the depth of the network increases. This performance drop is mainly attributed to oversmoothing, in which repeated graph convolutions make node representations increasingly similar. We show that in deep GNNs the activation function plays a crucial role in oversmoothing. We explain theoretically why this is the case and propose a simple modification to the slope of ReLU that reduces oversmoothing. The proposed approach enables deep architectures without changing the network architecture or adding residual connections. We verify the theoretical results experimentally and further show that deep networks, which do not suffer from oversmoothing, are beneficial in the presence of the "cold start" problem, i.e. when no feature information is available for unlabeled nodes.

1. INTRODUCTION

Graph Neural Networks (GNNs) utilize message passing or neighborhood aggregation schemes to extract representations for nodes and their neighborhoods. GNNs have achieved good results on a variety of graph analytics tasks, such as node classification (Zhang et al., 2018; Hamilton et al., 2017), link prediction (Liben-Nowell & Kleinberg, 2003; Zhang & Chen, 2018) and graph classification (Klicpera et al., 2020). As a result, they play a key role in graph representation learning. One of the most prominent GNN models is the Graph Convolutional Network (GCN) (Kipf & Welling, 2017), which creates node representations by averaging the representations (embeddings) of each node's immediate neighbors. Several studies have shown that the performance of GCNs deteriorates as their architecture becomes deeper (Li et al., 2018).

The success of deep CNNs on many tasks, such as image classification, naturally led to several attempts to build deep GNNs for node classification (Kipf & Welling, 2017; Li et al., 2018; Xu et al., 2018). Increasing the model's depth (and its number of parameters) should, in principle, allow more expressive representations to be learned. However, most existing approaches have failed to produce a sufficiently deep architecture that achieves good performance. Therefore, there is a need to design new models that can efficiently scale to a large number of layers. The aim of this paper is to investigate the factors that compromise the performance of deep GNNs and to develop a method to address them.

The performance drop of deep GNNs is associated with several factors, including vanishing gradients, overfitting, and the phenomenon called oversmoothing (Li et al., 2018; Xu et al., 2018; Klicpera et al., 2018; Wu et al., 2019). Oversmoothing has been shown to be associated with graph convolution, a type of Laplacian operator. Li et al. (2018) proved that applying that operator repeatedly makes node representations converge to a stationary point. At that point, all of the initial information (i.e., node feature information) is lost through the Laplacian smoothing. Consequently, oversmoothing hurts performance by making node features indistinguishable across different classes.

In this work, we address the oversmoothing problem in deep GNNs. We provide what is, to the best of our knowledge, the first study of the role that the activation function and the per-layer learning rate play in oversmoothing, and we propose a new method to address the problem. We confirm our hypotheses both theoretically and experimentally. The new method is shown to prevent node embeddings from converging to the same point, thus leading to better node representations. We summarize our main contributions as follows.

• Role of Activation Function in Oversmoothing: We prove theoretically the connection between the activation function and oversmoothing. In particular, we show the relation between the slope of ReLU and the singular values of the weight matrices, which are known to be associated with oversmoothing (Oono & Suzuki, 2020; Cai & Wang, 2020). We also verify our theoretical results experimentally.

• Role of Learning Rate in Oversmoothing: Our analysis of the effect of the slope of ReLU on oversmoothing extends directly to the learning rates used per layer of the network. We conducted further experiments to study the effect of tuning the learning rates, showing that this approach can also reduce oversmoothing, but is less practical.

• The Power of Deep GNNs: We have performed extensive experiments using networks of up to 64 layers, tackling oversmoothing with the proposed method. We further show the benefits that such deep GNNs can provide in the presence of reduced information, such as in a "cold start" situation, where node features are available only for the labeled nodes in a node classification setting.
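The key observation behind the first contribution can be illustrated numerically: because ReLU with slope α satisfies max(α·x, 0) = α·max(x, 0) for α > 0, changing the slope is equivalent to rescaling the weight matrix, and hence its singular values, by α. The following is a toy sketch of that identity, not the paper's implementation; the names `relu` and the shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))   # a weight matrix
x = rng.standard_normal((5, 8))   # a batch of node representations
alpha = 0.5                       # ReLU slope

def relu(z, a=1.0):
    # Generic ReLU with slope a: max(a * z, 0)
    return np.maximum(a * z, 0.0)

# Applying ReLU with slope alpha to x @ W equals applying standard
# ReLU to x @ (alpha * W): the slope folds into the weight matrix.
assert np.allclose(relu(x @ W, alpha), relu(x @ (alpha * W)))

# Consequently every singular value of the effective weight matrix is
# scaled by alpha -- the quantity that existing oversmoothing bounds
# depend on (Oono & Suzuki, 2020; Cai & Wang, 2020).
s_scaled = np.linalg.svd(alpha * W, compute_uv=False)
s_orig = np.linalg.svd(W, compute_uv=False)
assert np.allclose(s_scaled, alpha * s_orig)
```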

2. NOTATIONS AND PRELIMINARIES

2.1. NOTATIONS

In order to illustrate the problem of oversmoothing, we consider the task of semi-supervised node classification on a graph. The graph to be analysed is G = (V, E, X), with |V| = N nodes u_i ∈ V, edges (u_i, u_j) ∈ E, and X = [x_1, ..., x_N]^T ∈ R^{N×C} denoting the initial node features. The edges form an adjacency matrix A ∈ R^{N×N}, where edge (u_i, u_j) is associated with element A_{i,j}. A_{i,j} can take arbitrary real values indicating the weight (strength) of edge (u_i, u_j). Node degrees are represented through a diagonal matrix D ∈ R^{N×N}, where each element d_i is the sum of the weights of the edges connected to node i. During training, only the labels of a subset V_l ⊆ V are available. The task is to learn a node classifier that predicts the label of each node using the graph topology and the given feature vectors.

GCN, originally proposed by Kipf & Welling (2017), utilizes the feed-forward propagation rule

    H^{(l+1)} = σ(Â H^{(l)} W^{(l)}),

where H^{(l)} = [h^{(l)}_1, ..., h^{(l)}_N] are the node representations (or hidden vectors, or embeddings) at the l-th layer, with h^{(l)}_i standing for the hidden representation of node i; Â = D̂^{-1/2}(A + I)D̂^{-1/2} denotes the augmented adjacency matrix after self-loop addition, where D̂ is the corresponding degree matrix; σ(·) is a nonlinear element-wise function, i.e. the activation function, typically ReLU; and W^{(l)} is the trainable weight matrix of the l-th layer. A generic ReLU function with slope α is defined as ReLU_α(x) = max(α · x, 0).
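The propagation rule above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's code: the names `relu_alpha` and `gcn_layer` are hypothetical, and the dense normalization is only suitable for small graphs.

```python
import numpy as np

def relu_alpha(x, alpha=1.0):
    """Generic ReLU with slope alpha: ReLU_alpha(x) = max(alpha * x, 0)."""
    return np.maximum(alpha * x, 0.0)

def gcn_layer(A, H, W, alpha=1.0):
    """One GCN layer H' = ReLU_alpha(A_hat @ H @ W), where A_hat is the
    augmented, symmetrically normalized adjacency D^{-1/2}(A + I)D^{-1/2}."""
    A_loop = A + np.eye(A.shape[0])          # add self-loops
    d = A_loop.sum(axis=1)                   # degrees of the augmented graph
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_loop @ D_inv_sqrt
    return relu_alpha(A_hat @ H @ W, alpha)

# Tiny example: a 3-node path graph with random features and weights.
rng = np.random.default_rng(0)
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = rng.standard_normal((3, 4))   # N = 3 nodes, C = 4 input features
W = rng.standard_normal((4, 2))   # project to 2 hidden dimensions
out = gcn_layer(A, H, W, alpha=0.5)   # shape (3, 2), entries >= 0
```

Note that with alpha=0.5 the layer output is exactly half of the standard-ReLU output, which is the slope/weight equivalence the paper's analysis builds on.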

2.2. UNDERSTANDING OVERSMOOTHING

GNNs achieve state-of-the-art performance in a variety of graph-based tasks. Despite their success, models like GCN (Kipf & Welling, 2017) and GAT (Velickovic et al., 2018) experience a performance drop when stacking multiple layers. To a large extent, this is attributed to oversmoothing caused by repeated graph convolutions. In their analysis, Li et al. (2018) showed that graph convolution is a special form of Laplacian smoothing. In fact, they proved that the new representation of each node is a weighted average of its own representation and those of its neighbors. This mechanism allows the node representations within each (graph) cluster, i.e. a highly connected group of nodes, to become more similar, which improves performance on semi-supervised tasks on graphs. When multiple layers are stacked, however, the smoothing operation is repeated many times, leading to oversmoothing of node representations: the hidden representations of all nodes become similar, resulting in information loss.
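The smoothing effect described above is easy to observe numerically. The sketch below repeatedly applies the normalized adjacency operator alone (omitting weight matrices and the nonlinearity, a simplification for clarity) to random node features on a small ring graph and measures how far apart the node representations remain; the helper names are illustrative.

```python
import numpy as np

def normalized_adj(A):
    # Augmented, symmetrically normalized adjacency D^{-1/2}(A + I)D^{-1/2}
    A_loop = A + np.eye(A.shape[0])
    d = A_loop.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_loop @ D_inv_sqrt

def spread(H):
    # Largest pairwise distance between node representations
    n = H.shape[0]
    return max(np.linalg.norm(H[i] - H[j]) for i in range(n) for j in range(n))

rng = np.random.default_rng(42)
n = 6
A = np.zeros((n, n))                  # 6-node ring graph
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0

A_hat = normalized_adj(A)
H = rng.standard_normal((n, 3))       # random initial features
s0 = spread(H)
for _ in range(50):                   # 50 rounds of graph convolution
    H = A_hat @ H
s50 = spread(H)                       # after smoothing: nearly zero
```

After 50 applications of the operator, the representations have collapsed almost entirely onto a single point, mirroring the convergence result of Li et al. (2018).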

2.3. DEEP GNN LIMITATIONS

As discussed above, oversmoothing causes node representations to converge to a fixed point as the network's depth increases (Li et al., 2018). At that point, node representations retain only information about the graph topology and disregard the input features. Oono & Suzuki (2020) generalized the idea of Li et al. (2018) by taking into consideration that the ReLU activation function maps to a positive cone. They explain oversmoothing as convergence to a subspace, instead of a fixed point.

