GRADIENT GATING FOR DEEP MULTI-RATE LEARNING ON GRAPHS

Abstract

We present Gradient Gating (G^2), a novel framework for improving the performance of Graph Neural Networks (GNNs). Our framework is based on gating the output of GNN layers with a mechanism for multi-rate flow of message-passing information across the nodes of the underlying graph. Local gradients are harnessed to further modulate the message-passing updates. Our framework flexibly allows one to use any basic GNN layer as a wrapper around which the multi-rate gradient gating mechanism is built. We rigorously prove that G^2 alleviates the oversmoothing problem and allows the design of deep GNNs. Empirical results are presented to demonstrate that the proposed framework achieves state-of-the-art performance on a variety of graph learning tasks, including on large-scale heterophilic graphs.

1. INTRODUCTION

Learning tasks involving graph-structured data arise in a wide variety of problems in science and engineering. Graph Neural Networks (GNNs) (Sperduti, 1994; Goller & Kuchler, 1996; Sperduti & Starita, 1997; Frasconi et al., 1998; Gori et al., 2005; Scarselli et al., 2008; Bruna et al., 2014; Defferrard et al., 2016; Kipf & Welling, 2017; Monti et al., 2017; Gilmer et al., 2017) are a popular deep learning architecture for graph-structured and relational data. GNNs have been successfully applied in domains including computer vision and graphics (Monti et al., 2017), recommender systems (Ying et al., 2018), transportation (Derrow-Pinion et al., 2021), computational chemistry (Gilmer et al., 2017), drug discovery (Gaudelet et al., 2021), particle physics (Shlomi et al., 2020) and social networks. See Zhou et al. (2019); Bronstein et al. (2021) for extensive reviews. Despite the widespread success of GNNs and a plethora of different architectures, several fundamental problems still impede their efficiency on realistic learning tasks. These include the bottleneck (Alon & Yahav, 2021), oversquashing (Topping et al., 2021), and oversmoothing (Nt & Maehara, 2019; Oono & Suzuki, 2020) phenomena. Oversmoothing refers to the observation that all node features in a deep (multi-layer) GNN converge to the same constant value as the number of layers is increased. Thus, and in contrast to standard machine learning frameworks, oversmoothing inhibits the use of very deep GNNs for learning tasks. These phenomena are likely responsible for the unsatisfactory empirical performance of traditional GNN architectures on heterophilic datasets, where the features or labels of a node tend to be different from those of its neighbors (Zhu et al., 2020).
Given this context, our main goal is to present a novel framework that alleviates the oversmoothing problem and allows one to implement very deep multi-layer GNNs that significantly improve performance in the setting of heterophilic graphs. Our starting point is the observation that in standard message-passing GNN architectures (MPNNs), such as GCN (Kipf & Welling, 2017) or GAT (Velickovic et al., 2018), each node is updated at exactly the same rate within every hidden layer. Yet, realistic learning tasks might benefit from having different rates of propagation (flow) of information on the underlying graph. This insight leads to a novel multi-rate message-passing scheme capable of learning these underlying rates. Moreover, we also propose a novel procedure that harnesses graph gradients to ameliorate the oversmoothing problem. Combining these elements leads to a new architecture described in this paper, which we term Gradient Gating (G^2).

Main Contributions. We will demonstrate the following advantages of the proposed approach:

• G^2 is a flexible framework wherein any standard message-passing layer (such as GAT, GCN, GIN, or GraphSAGE) can be used as the coupling function. Thus, it should be thought of as a framework into which one can plug existing GNN components. The use of multiple rates and gradient gating facilitates the implementation of deep GNNs and generally improves performance.

• G^2 can be interpreted as a discretization of a dynamical system governed by nonlinear differential equations. By investigating the stability of zero-Dirichlet-energy steady states of this system, we rigorously prove that our gradient gating mechanism prevents oversmoothing. To complement this, we also prove a partial converse: the lack of gradient gating can lead to oversmoothing.
• We provide extensive empirical evidence demonstrating that G^2 achieves state-of-the-art performance on a variety of graph learning tasks, including on large heterophilic graph datasets.

2. GRADIENT GATING

Let G = (V, E ⊆ V × V) be an undirected graph with |V| = v nodes and |E| = e edges (unordered pairs of nodes {i, j}, denoted i ∼ j). The 1-neighborhood of a node i is denoted N_i = {j ∈ V : i ∼ j}. Furthermore, each node i is endowed with an m-dimensional feature vector X_i; the node features are arranged into a v × m matrix X = (X_{ik}) with i = 1, . . . , v and k = 1, . . . , m. A typical residual message-passing GNN (MPNN) updates the node features by performing several iterations of the form

X^n = X^{n-1} + σ(F_θ(X^{n-1}, G)),   (1)

where F_θ is a learnable function with parameters θ, and σ is an element-wise non-linear activation function. Here n ≥ 1 denotes the n-th hidden layer, with n = 0 being the input. One can interpret (1) as a discrete dynamical system in which F plays the role of a coupling function determining the interaction between different nodes of the graph. In particular, we consider local (1-neighborhood) coupling of the form Y_i = (F(X, G))_i = F(X_i, {{X_j : j ∈ N_i}}), operating on the multiset of 1-neighbors of each node. Examples of such functions used in the graph machine learning literature (Bronstein et al., 2021) are graph convolutions Y_i = Σ_{j∈N_i} c_{ij} X_j (GCN, Kipf & Welling (2017)) and graph attention Y_i = Σ_{j∈N_i} a(X_i, X_j) X_j (GAT, Velickovic et al. (2018)).

We observe that in (1), at each hidden layer, every node and every feature channel is updated at exactly the same rate. However, it is reasonable to expect that in realistic graph learning tasks one can encounter multiple rates for the flow of information (node updates) on the graph. Based on this observation, we propose a multi-rate (MR) generalization of (1), allowing each node of the graph and each feature channel to be updated at its own rate,

X^n = (1 − τ^n) ⊙ X^{n-1} + τ^n ⊙ σ(F_θ(X^{n-1}, G)),   (2)

where τ^n denotes a v × m matrix of rates with elements τ^n_{ik} ∈ [0, 1], and ⊙ is the element-wise (Hadamard) product.
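The residual update (1) and its multi-rate generalization (2) can be sketched as follows. This is a minimal illustration using a dense adjacency matrix and a simple GCN-style coupling F_θ(X, G) = A_hat X W; all names (gcn_coupling, A_hat, W) are illustrative assumptions, not taken from the authors' implementation.

```python
import numpy as np

def relu(Z):
    return np.maximum(Z, 0.0)

def gcn_coupling(X, A_hat, W):
    """GCN-style coupling: Y_i = sum_{j in N_i} c_ij X_j, followed by a weight W."""
    return A_hat @ X @ W

def residual_update(X, A_hat, W):
    """Eq. (1): every node and every channel is updated at the same unit rate."""
    return X + relu(gcn_coupling(X, A_hat, W))

def multirate_update(X, A_hat, W, tau):
    """Eq. (2): tau is a v x m matrix of per-node, per-channel rates in [0, 1]."""
    return (1.0 - tau) * X + tau * relu(gcn_coupling(X, A_hat, W))

# Tiny example: a 3-node path graph with 2 feature channels.
rng = np.random.default_rng(0)
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
A_hat = A / A.sum(axis=1, keepdims=True)  # row-normalized c_ij weights
X = rng.standard_normal((3, 2))
W = rng.standard_normal((2, 2))
X_mr = multirate_update(X, A_hat, W, np.full((3, 2), 0.5))
```

Note that setting tau to all zeros freezes the node states entirely, while tau of all ones reduces (2) to a fully gated (non-residual) update; the residual form (1) corresponds to a unit rate added on top of the identity.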
Rather than fixing τ^n prior to training, we aim to learn the different update rates based on the node data X and the local structure of the underlying graph G, as

τ^n(X^{n-1}, G) = σ(F̂_θ(X^{n-1}, G)),   (3)

where F̂_θ is another learnable 1-neighborhood coupling function, and σ is here the sigmoidal logistic activation function, which constrains the rates to lie within [0, 1]. Since the multi-rate message-passing scheme (2) with rates (3) does not necessarily prevent oversmoothing (for an arbitrary choice of the coupling function), we need to further constrain the rate matrix τ^n. To this end, we note that the graph gradient of a scalar node feature y on the underlying graph G is defined as (∇y)_{ij} = y_j − y_i on the edge i ∼ j (Lim, 2015). Using graph gradients, we obtain the proposed Gradient Gating (G^2) framework:

τ̂^n = σ(F̂_θ(X^{n-1}, G)),
τ^n_{ik} = tanh( Σ_{j∈N_i} |τ̂^n_{jk} − τ̂^n_{ik}|^p ),   (4)
X^n = (1 − τ^n) ⊙ X^{n-1} + τ^n ⊙ σ(F_θ(X^{n-1}, G)),

where p ≥ 0 is a hyperparameter. Note that |τ̂^n_{jk} − τ̂^n_{ik}| = |(∇τ̂^n_k)_{ij}| is precisely the magnitude of the graph gradient of the k-th channel of τ̂^n on the edge i ∼ j, so the rate τ^n_{ik} is small exactly when the gating features are locally smooth around node i.
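One full G^2 layer (4) can be sketched as follows, again with a dense adjacency matrix and GCN-style couplings for both F_θ and the gating function F̂_θ; the weights W, W_gate and the choice p = 2 are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def g2_layer(X, A_hat, W, W_gate, p=2.0):
    """One Gradient Gating layer, Eq. (4), with GCN-style couplings."""
    # tau_hat^n = sigma(F_hat_theta(X^{n-1}, G)), sigma the logistic sigmoid
    tau_hat = 1.0 / (1.0 + np.exp(-(A_hat @ X @ W_gate)))
    # tau^n_{ik} = tanh( sum_{j in N_i} |tau_hat^n_{jk} - tau_hat^n_{ik}|^p ):
    # tanh of summed p-th powers of graph-gradient magnitudes of tau_hat.
    mask = (A_hat > 0).astype(X.dtype)                # mask[i, j] = 1 iff j ~ i
    diff = tau_hat[None, :, :] - tau_hat[:, None, :]  # diff[i, j, k] = tau_hat[j, k] - tau_hat[i, k]
    tau = np.tanh((np.abs(diff) ** p * mask[:, :, None]).sum(axis=1))
    # X^n = (1 - tau^n) * X^{n-1} + tau^n * sigma(F_theta(X^{n-1}, G))
    return (1.0 - tau) * X + tau * np.maximum(A_hat @ X @ W, 0.0)

# If every node carries identical features, all graph gradients of tau_hat
# vanish, so tau = 0 and the layer leaves X unchanged: the update rate
# shrinks as local feature differences shrink, which is the mechanism that
# keeps node states from collapsing further toward a constant.
rng = np.random.default_rng(1)
v, m = 4, 3
A = np.ones((v, v)) - np.eye(v)            # complete graph on 4 nodes
A_hat = A / A.sum(axis=1, keepdims=True)
X_const = np.tile(rng.standard_normal((1, m)), (v, 1))
W, W_gate = rng.standard_normal((m, m)), rng.standard_normal((m, m))
out = g2_layer(X_const, A_hat, W, W_gate)
```

Any other 1-neighborhood coupling (GAT, GIN, GraphSAGE) could be substituted for the two matrix products, in line with the plug-in design of the framework.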

