GRADIENT GATING FOR DEEP MULTI-RATE LEARNING ON GRAPHS

Abstract

We present Gradient Gating (G²), a novel framework for improving the performance of Graph Neural Networks (GNNs). Our framework is based on gating the output of GNN layers with a mechanism for multi-rate flow of message-passing information across the nodes of the underlying graph. Local gradients are harnessed to further modulate message-passing updates. Our framework flexibly allows one to use any basic GNN layer as a wrapper around which the multi-rate gradient gating mechanism is built. We rigorously prove that G² alleviates the oversmoothing problem and allows the design of deep GNNs. Empirical results are presented to demonstrate that the proposed framework achieves state-of-the-art performance on a variety of graph learning tasks, including on large-scale heterophilic graphs.

1. INTRODUCTION

Learning tasks involving graph-structured data arise in a wide variety of problems in science and engineering. Graph Neural Networks (GNNs) (Sperduti, 1994; Goller & Kuchler, 1996; Sperduti & Starita, 1997; Frasconi et al., 1998; Gori et al., 2005; Scarselli et al., 2008; Bruna et al., 2014; Defferrard et al., 2016; Kipf & Welling, 2017; Monti et al., 2017; Gilmer et al., 2017) are a popular deep learning architecture for graph-structured and relational data. GNNs have been successfully applied in domains including computer vision and graphics (Monti et al., 2017), recommender systems (Ying et al., 2018), transportation (Derrow-Pinion et al., 2021), computational chemistry (Gilmer et al., 2017), drug discovery (Gaudelet et al., 2021), particle physics (Shlomi et al., 2020) and social networks. See Zhou et al. (2019); Bronstein et al. (2021) for extensive reviews. Despite the widespread success of GNNs and a plethora of different architectures, several fundamental problems still impede their efficiency on realistic learning tasks. These include the bottleneck (Alon & Yahav, 2021), oversquashing (Topping et al., 2021), and oversmoothing (Nt & Maehara, 2019; Oono & Suzuki, 2020) phenomena. Oversmoothing refers to the observation that all node features in a deep (multi-layer) GNN converge to the same constant value as the number of layers is increased. Thus, and in contrast to standard machine learning frameworks, oversmoothing inhibits the use of very deep GNNs for learning tasks. These phenomena are likely responsible for the unsatisfactory empirical performance of traditional GNN architectures on heterophilic datasets, where the features or labels of a node tend to differ from those of its neighbors (Zhu et al., 2020).
Given this context, our main goal is to present a novel framework that alleviates the oversmoothing problem and allows one to implement very deep multi-layer GNNs that can significantly improve performance in the setting of heterophilic graphs. Our starting point is the observation that in standard Message-Passing GNN architectures (MPNNs), such as GCN (Kipf & Welling, 2017) or GAT (Velickovic et al., 2018), each node gets updated at exactly the same rate within every hidden layer. Yet, realistic learning tasks might benefit from different rates of propagation (flow) of information on the underlying graph. This insight leads to a novel multi-rate message-passing scheme capable of learning these underlying rates. Moreover, we also propose a novel procedure that harnesses graph gradients to ameliorate the oversmoothing problem. Combining these elements leads to the new architecture described in this paper, which we term Gradient Gating (G²).

Main contributions. We will demonstrate the following advantages of the proposed approach:
• G² is a flexible framework wherein any standard message-passing layer (such as GAT, GCN, GIN, or GraphSAGE) can be used as the coupling function. Thus, it should be thought of as a framework into which one can plug existing GNN components. The use of multiple rates and gradient gating facilitates the implementation of deep GNNs and generally improves performance.
• G² can be interpreted as a discretization of a dynamical system governed by nonlinear differential equations. By investigating the stability of zero-Dirichlet-energy steady states of this system, we rigorously prove that our gradient gating mechanism prevents oversmoothing. To complement this, we also prove a partial converse: the lack of gradient gating can lead to oversmoothing.
• We provide extensive empirical evidence demonstrating that G² achieves state-of-the-art performance on a variety of graph learning tasks, including on large heterophilic graph datasets.

2. GRADIENT GATING

Let G = (V, E ⊆ V × V) be an undirected graph with |V| = v nodes and |E| = e edges (unordered pairs of nodes {i, j}, denoted i ∼ j). The 1-neighborhood of a node i is denoted N_i = {j ∈ V : i ∼ j}. Furthermore, each node i is endowed with an m-dimensional feature vector X_i; the node features are arranged into a v × m matrix X = (X_{ik}) with i = 1, …, v and k = 1, …, m. A typical residual Message-Passing GNN (MPNN) updates the node features by performing several iterations of the form

X^n = X^{n-1} + σ(F_θ(X^{n-1}, G)),    (1)

where F_θ is a learnable function with parameters θ, and σ is an element-wise non-linear activation function. Here n ≥ 1 denotes the n-th hidden layer, with n = 0 being the input. One can interpret (1) as a discrete dynamical system in which F plays the role of a coupling function determining the interaction between different nodes of the graph. In particular, we consider local (1-neighborhood) coupling of the form Y_i = (F(X, G))_i = F(X_i, {{X_j : j ∈ N_i}}), operating on the multiset of 1-neighbors of each node. Examples of such functions used in the graph machine learning literature (Bronstein et al., 2021) are graph convolutions Y_i = Σ_{j∈N_i} c_{ij} X_j (GCN, Kipf & Welling (2017)) and graph attention Y_i = Σ_{j∈N_i} a(X_i, X_j) X_j (GAT, Velickovic et al. (2018)). We observe that in (1), at each hidden layer, every node and every feature channel gets updated at exactly the same rate. However, it is reasonable to expect that realistic graph learning tasks exhibit multiple rates for the flow of information (node updates) on the graph. Based on this observation, we propose a multi-rate (MR) generalization of (1), allowing each node of the graph and each feature channel to be updated at a different rate,

X^n = (1 − τ^n) ⊙ X^{n-1} + τ^n ⊙ σ(F_θ(X^{n-1}, G)),    (2)

where τ denotes a v × m matrix of rates with elements τ_{ik} ∈ [0, 1].
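To make the multi-rate update (2) concrete, here is a minimal pure-Python sketch of one layer. The neighborhood-mean coupling is a stand-in for a generic F_θ, and the graph, rates, and names are illustrative assumptions, not the trained architecture:

```python
import math

# Toy undirected path graph on 4 nodes, stored as neighbor lists.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

def coupling_mean(X, G):
    """Stand-in coupling F_theta: unweighted mean over 1-neighborhood features
    (a GCN-like aggregation with no learnable weights)."""
    return [[sum(X[j][k] for j in G[i]) / len(G[i]) for k in range(len(X[i]))]
            for i in range(len(X))]

def multirate_update(X, tau, G, sigma=math.tanh):
    """One layer of the multi-rate scheme (2):
    X^n = (1 - tau) * X^{n-1} + tau * sigma(F(X^{n-1}, G)),
    with a per-node, per-channel rate matrix tau in [0, 1]."""
    Y = coupling_mean(X, G)
    return [[(1 - tau[i][k]) * X[i][k] + tau[i][k] * sigma(Y[i][k])
             for k in range(len(X[i]))] for i in range(len(X))]

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]   # v=4 nodes, m=2 channels
tau = [[0.5, 0.1], [1.0, 0.0], [0.2, 0.9], [0.0, 1.0]]  # rates per node and channel
X1 = multirate_update(X, tau, neighbors)
```

A rate of 0 leaves the corresponding entry untouched, while a rate of 1 replaces it entirely by the activated coupling output; intermediate rates interpolate between the two.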
Rather than fixing τ prior to training, we aim to learn the different update rates from the node features X and the local structure of the underlying graph G, as

τ^n(X^{n-1}, G) = σ̂(F̂_θ(X^{n-1}, G)),    (3)

where F̂_θ is another learnable 1-neighborhood coupling function and σ̂ is a sigmoidal (logistic) activation function constraining the rates to lie within [0, 1]. Since the multi-rate message-passing scheme (2) with rates (3) does not necessarily prevent oversmoothing (for any choice of the coupling function), we need to further constrain the rate matrix τ^n. To this end, we note that the graph gradient of a scalar node feature y on the underlying graph G is defined as (∇y)_{ij} = y_j − y_i at the edge i ∼ j (Lim, 2015). We use graph gradients to obtain the proposed Gradient Gating (G²) framework, given by

τ̂^n = σ̂(F̂_θ(X^{n-1}, G)),
τ^n_{ik} = tanh( Σ_{j∈N_i} |τ̂^n_{jk} − τ̂^n_{ik}|^p ),    (4)
X^n = (1 − τ^n) ⊙ X^{n-1} + τ^n ⊙ σ(F_θ(X^{n-1}, G)),

where τ̂^n_{jk} − τ̂^n_{ik} = (∇τ̂^n_{*k})_{ij} is the graph gradient, τ̂^n_{*k} is the k-th column of the candidate rate matrix τ̂^n, and p ≥ 0. Since Σ_{j∈N_i} |τ̂^n_{jk} − τ̂^n_{ik}|^p ≥ 0 for all i ∈ V, it follows that τ^n ∈ [0, 1]^{v×m} for all n, retaining its interpretation as a matrix of rates. The sum over the neighborhood N_i in (4) can be replaced by any permutation-invariant aggregation function (e.g., mean or max). Moreover, any standard message-passing procedure can be used to define the coupling functions F and F̂ (in particular, one can set F̂ = F). As an illustration, Fig. 1 shows a schematic diagram of the layer-wise update of the proposed G² architecture.

Figure 1: Schematic diagram of G² (4), showing the layer-wise update of the latent node features X (at layer n). The norm of the graph gradient (i.e., the sum in the second equation of (4)) is denoted ∥∇∥_p^p.
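The full G² update (4) can be sketched in the same minimal style. This toy implementation assumes, for simplicity, that one unweighted neighborhood-mean coupling serves as both F_θ and F̂_θ (the framework allows any MPNN layer for either role); all names are illustrative:

```python
import math

neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

def coupling_mean(X, G):
    # Stand-in for both F_theta and the rate network F_hat_theta.
    return [[sum(X[j][k] for j in G[i]) / len(G[i]) for k in range(len(X[i]))]
            for i in range(len(X))]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def g2_layer(X, G, p=2.0, sigma=math.tanh):
    """One G^2 update, eq. (4):
      tau_hat  = sigmoid(F_hat(X, G))                      # candidate rates in [0, 1]
      tau_ik   = tanh(sum_j |tau_hat_jk - tau_hat_ik|^p)   # graph-gradient gating
      X_new    = (1 - tau) * X + tau * sigma(F(X, G))
    If a node's neighborhood has smoothed out (all graph gradients ~ 0),
    its rate vanishes and its feature is no longer updated."""
    v, m = len(X), len(X[0])
    tau_hat = [[sigmoid(y) for y in row] for row in coupling_mean(X, G)]
    tau = [[math.tanh(sum(abs(tau_hat[j][k] - tau_hat[i][k]) ** p for j in G[i]))
            for k in range(m)] for i in range(v)]
    Y = coupling_mean(X, G)
    X_new = [[(1 - tau[i][k]) * X[i][k] + tau[i][k] * sigma(Y[i][k])
              for k in range(m)] for i in range(v)]
    return X_new, tau
```

Note that stacking this layer amounts to iterating (4); plugging a trainable MPNN in place of `coupling_mean` recovers the general scheme.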
The intuitive idea behind gradient gating in (4) is the following: if local oversmoothing occurs at a node i ∈ V, i.e., lim_{n→∞} Σ_{j∈N_i} ∥X^n_i − X^n_j∥ = 0, then G² ensures that the corresponding rate τ^n_i goes to zero (at a faster rate), so that the underlying hidden node feature X_i is no longer updated. This prevents oversmoothing by early stopping of the message-passing procedure.

3. PROPERTIES OF G²-GNN

G² is a flexible framework. An important aspect of G² (4) is that it can be considered as a "wrapper" around any specific MPNN architecture. In particular, the hidden-layer update of any form of message passing (e.g., GCN (Kipf & Welling, 2017), GAT (Velickovic et al., 2018), GIN (Xu et al., 2018) or GraphSAGE (Hamilton et al., 2017)) can be used as the coupling functions F, F̂ in (4). By setting τ ≡ 1, (4) reduces to X^n = σ(F_θ(X^{n-1}, G)), a standard (non-residual) MPNN. As we will show in the following, the use of a non-trivial gradient-gated learnable rate matrix τ allows implementing very deep architectures that avoid oversmoothing.

Maximum principle for node features. Node features produced by G² satisfy the following maximum principle.

Proposition 3.1. Let X^n be the node feature matrix generated by the iteration (4). Then the features are bounded as min(−1, σ_min) ≤ X^n_{ik} ≤ max(1, σ_max) for all n ≥ 1, where the scalar activation function is bounded by σ_min ≤ σ(z) ≤ σ_max for all z ∈ R.

The proof follows readily from writing (4) component-wise and using the fact that 0 ≤ τ^n_{ik} ≤ 1 for all 1 ≤ i ≤ v, 1 ≤ k ≤ m and n ≥ 1. It has recently been shown that interesting properties of GNNs (with residual connections) can be understood by taking the continuous (infinite-depth) limit and analyzing the resulting differential equations.
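Before passing to the continuous limit, the maximum principle can be sanity-checked numerically. The sketch below uses σ = tanh (so σ_min = −1, σ_max = 1) and arbitrary random rates τ ∈ [0, 1], since the bound in Proposition 3.1 relies only on 0 ≤ τ^n_{ik} ≤ 1; the graph, depth, and dimensions are arbitrary illustrative choices:

```python
import math
import random

random.seed(0)
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
v, m = 4, 3

# Initial features in [-1, 1]; each layer takes a convex combination of the
# previous feature and tanh(mean of neighbors), so [-1, 1] is invariant.
X = [[random.uniform(-1.0, 1.0) for _ in range(m)] for _ in range(v)]

for _ in range(200):  # a deep stack of gated layers
    Y = [[math.tanh(sum(X[j][k] for j in neighbors[i]) / len(neighbors[i]))
          for k in range(m)] for i in range(v)]
    tau = [[random.uniform(0.0, 1.0) for _ in range(m)] for _ in range(v)]
    X = [[(1 - tau[i][k]) * X[i][k] + tau[i][k] * Y[i][k]
          for k in range(m)] for i in range(v)]
```

After 200 layers, every feature is still confined to [min(−1, σ_min), max(1, σ_max)] = [−1, 1], exactly as the proposition predicts.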

Continuous limit of G².

In this context, we can derive a continuous version of (4) by introducing a small parameter 0 < Δt < 1 and rescaling the rate matrix τ^n to Δt τ^n, leading to

X^n = (1 − Δt τ^n) ⊙ X^{n-1} + Δt τ^n ⊙ σ(F_θ(X^{n-1}, G)).    (7)

Rearranging the terms in (7), we obtain

(X^n − X^{n-1}) / Δt = τ^n ⊙ (σ(F_θ(X^{n-1}, G)) − X^{n-1}).    (8)

Interpreting X^n ≈ X(nΔt) = X(t_n), i.e., marching in time, corresponds to increasing the number of hidden layers. Letting Δt → 0, one obtains the following system of graph-coupled ordinary differential equations (ODEs):

dX(t)/dt = τ(t) ⊙ (σ(F_θ(X(t), G)) − X(t)), ∀t ≥ 0,
τ_{ik}(t) = tanh( Σ_{j∈N_i} |τ̂_{jk}(t) − τ̂_{ik}(t)|^p ),    (9)
τ̂(t) = σ̂(F̂_θ(X(t), G)).

We observe that the iteration (4) is a forward Euler discretization of the ODE system (9). Hence, one can follow Chamberlain et al. (2021a) and design more general (e.g., higher-order, adaptive, or implicit) discretizations of the ODE system (9). All of these can be considered design extensions of (4).

Oversmoothing. Using the interpretation of (4) as a discretization of the ODE system (9), we can adapt the mathematical framework recently proposed in Rusch et al. (2022a) to study the oversmoothing problem. In order to formally define oversmoothing, we introduce the Dirichlet energy defined on the node features X of an undirected graph G as

E(X) = (1/v) Σ_{i∈V} Σ_{j∈N_i} ∥X_i − X_j∥².    (10)

Following Rusch et al. (2022a), we say that the scheme (9) oversmoothes if the Dirichlet energy decays exponentially fast,

E(X(t)) ≤ C_1 e^{−C_2 t}, ∀t > 0,    (11)

for some C_1, C_2 > 0. In particular, the discrete version of (11) implies that oversmoothing happens when the Dirichlet energy decays exponentially fast as the number of hidden layers increases (Rusch et al. (2022a), Definition 3.2). Next, one can prove the following proposition, further characterizing oversmoothing in the standard terminology of dynamical systems (Wiggins, 2003).

Proposition 3.2.
The oversmoothing problem occurs for the ODEs (9) if and only if the hidden states X*_i = c, for all i ∈ V, are exponentially stable steady states (fixed points) of the ODE (9), for some c ∈ R^m. In other words, for the oversmoothing problem to occur for this system, all trajectories of the ODE (9) that start within the corresponding basin of attraction have to converge exponentially fast in time (according to (11)) to the corresponding steady state c. Note that the basins of attraction will be different for different values of c. The proof of this proposition is a straightforward adaptation of the proof of Proposition 3.3 of Rusch et al. (2022a).

Given this precise formulation of oversmoothing, we will investigate whether and how gradient gating in (9) can prevent oversmoothing. For simplicity, we set m = 1 and consider only scalar node features (the extension to vector node features is straightforward). Moreover, we assume coupling functions of the form F(X) = A(X)X, expressed element-wise as (see also Chamberlain et al. (2021a); Rusch et al. (2022a))

(F(X))_i = Σ_{j∈N_i} A(X_i, X_j) X_j.    (12)

Here, A(X) is a matrix-valued function whose form covers many commonly used coupling functions stemming from graph attention (GAT, where A_ij = A(X_i, X_j) is learnable) or convolution operators (GCN, where A_ij is fixed). Furthermore, the matrices are right-stochastic, i.e., the entries satisfy

0 ≤ A_ij ≤ 1,  Σ_{j∈N_i} A_ij = 1.    (13)

Finally, as the multi-rate feature of (9) has no direct bearing on the oversmoothing problem, we focus on the contribution of the gradient feedback term. To this end, we deactivate the multi-rate aspects and assume that τ̂_i = X_i for all i ∈ V, leading to the following form of (9):

dX_i(t)/dt = τ_i(t) ( σ( Σ_{j∈N_i} A_ij X_j(t) ) − X_i(t) ), ∀t ≥ 0,
τ_i(t) = tanh( Σ_{j∈N_i} |X_j(t) − X_i(t)|^p ).    (14)

Lack of G² can lead to oversmoothing. We first consider the case where gradient gating is switched off by setting p = 0 in (14).
This yields a standard GNN in which node features are evolved through message passing between neighboring nodes, without any explicit information about graph gradients. We further assume that the activation function is the ReLU, i.e., σ(x) = max(x, 0). In this setting, we have the following proposition on oversmoothing.

Proposition 3.3. Assume the underlying graph G is connected. For any c ≥ 0, let X*_i ≡ c, for all i ∈ V, be a (zero-Dirichlet-energy) steady state of the ODEs (14). Moreover, assume no gradient gating (p = 0 in (14)), that

A_ij(c, c) = A_ji(c, c),  A_ij(c, c) ≥ a,  1 ≤ i, j ≤ v,    (15)

with 0 < a ≤ 1, and that there exists at least one node, denoted w.l.o.g. with index 1, such that X_1(t) ≡ c for all t ≥ 0. Then the steady state X*_i = c, for all i ∈ V, of (14) is exponentially stable.

Together with Proposition 3.2, this implies that without gradient gating, (9) can lead to oversmoothing. The proof, presented in SM C.1, relies on analyzing the time evolution of small perturbations around the steady state c and showing that these perturbations decay exponentially fast in time (see (20)).

G² prevents oversmoothing. We next investigate the effect of gradient gating in the same setting as Proposition 3.3. The following proposition shows that gradient gating prevents oversmoothing.

Proposition 3.4. Assume the underlying graph G is connected. For any c ≥ 0 and for all i ∈ V, let X*_i ≡ c be a (zero-Dirichlet-energy) steady state of the ODEs (14). Moreover, assume gradient gating (p > 0), that the matrix A in (14) satisfies (15), and that there exists at least one node, denoted w.l.o.g. with index 1, such that X_1(t) ≡ c for all t ≥ 0. Then the steady state X*_i = c, for all i ∈ V, is not exponentially stable.

The proof, presented in SM C.2, clearly elucidates the role of gradient gating by showing that the energy associated with the quasi-linearized evolution equations (SM Eqn. (21)) is balanced by two terms (SM Eqn.
(23)), both resulting from the introduction of gradient gating by setting p > 0 in (14). One of them is of indefinite sign and can even cause growth of perturbations around a steady state c. The other decays initial perturbations; however, the rate of this decay is at most polynomial (SM Eqn. (28)). For instance, the decay is merely linear for p = 2 and slower for higher values of p. Thus, the steady state c cannot be exponentially stable, and oversmoothing is prevented. This justifies the intuition behind gradient gating: if oversmoothing occurs around a node i, i.e., lim_{n→∞} Σ_{j∈N_i} ∥X^n_i − X^n_j∥ = 0, then the corresponding rate τ^n_i goes to zero (at a faster rate), so that the underlying hidden node feature X_i stops being updated.

Effect of G² on the Dirichlet energy. Given that oversmoothing relates to the decay of the Dirichlet energy (11), we follow the experimental setup proposed by Rusch et al. (2022a) to probe the dynamics of the Dirichlet energy of gradient-gated GNNs, defined on a 2-dimensional 10 × 10 regular grid with 4-neighbor connectivity. The node features X are randomly sampled from U([0, 1]) and then propagated through a 1000-layer GNN with random weights. We compare GAT, GCN and their gradient-gated versions (G²-GAT and G²-GCN) in this experiment. Fig. 2 depicts, on a log-log scale, the Dirichlet energy of each layer's output with respect to the layer number. We clearly observe that GAT and GCN oversmooth, as the underlying Dirichlet energy converges exponentially fast to zero, resulting in the node features becoming indistinguishable. In practice, the Dirichlet energy for these architectures is ≈ 0 after just ten hidden layers. On the other hand, and as suggested by the theoretical results of the previous section, adding G² decisively prevents this behavior, and the Dirichlet energy remains (near) constant, even for very deep architectures (up to 1000 layers).

G² for very deep GNNs.
Oversmoothing inhibits the use of a large number of GNN layers. As G² is designed to alleviate oversmoothing, it should allow very deep architectures. To test this, we reproduce the experiment considered in Chamberlain et al. (2021a): a node-level classification task on the Cora dataset using increasingly deep GCN architectures. In addition to G², we also compare with two recently proposed mechanisms to alleviate oversmoothing, DropEdge (Rong et al., 2020) and GraphCON (Rusch et al., 2022a). The results are presented in Fig. 3, where we plot the test accuracy for all models with the number of layers ranging from 2 to 128. While a plain GCN suffers the most from oversmoothing (with performance rapidly deteriorating after 8 layers), GCN+DropEdge as well as GCN+GraphCON are able to mitigate this behavior to some extent, although performance eventually starts dropping (after 16 and 64 layers, respectively). In contrast, G²-GCN exhibits a small but noticeable increase in performance with increasing number of layers, reaching its peak performance at 128 layers. This experiment suggests that G² can indeed be used in conjunction with deep GNNs, potentially allowing performance gains due to depth.

Multi-rate node regression. We next consider node-level regression on the Wikipedia graphs Chameleon and Squirrel (Rozemberczki et al., 2021). While Chameleon and Squirrel are already widely used as heterophilic node-level classification tasks, the original datasets consist of continuous node targets (average monthly web-page traffic). We normalize the provided web-page traffic values for every node to [0, 1] and note that the resulting node values exhibit a wide range of scales, between 10^{-5} and 1 (see Fig. 4). Table 1 shows the test normalized mean-square error (mean and standard deviation over the ten pre-defined splits of Pei et al. (2020)) for two standard GNN architectures (GCN and GAT) with and without G².
We observe from Table 1 that adding G² to the baselines significantly reduces the error, demonstrating the advantage of using multiple update rates.

G² for varying homophily (synthetic Cora). We test G² on a node-level classification task with varying levels of homophily on the synthetic Cora dataset (Zhu et al., 2020). Standard GNN models are known to perform poorly in heterophilic settings. This can be seen in Fig. 5, where we present the classification accuracy of GCN and GAT on the synthetic Cora dataset with a level of homophily varying between 0 and 0.99. While these models succeed in the homophilic case (reaching nearly perfect accuracy), their performance drops to ≈ 20% as the level of homophily approaches 0. Adding G² to GCN or GAT mitigates this phenomenon: the resulting models reach a test accuracy of over 80%, even in the most heterophilic setting, a four-fold increase in accuracy over the underlying GCN or GAT models. Furthermore, we notice an increase in performance even in the homophilic setting. Moreover, we compare with a state-of-the-art model, GGCN (Yan et al., 2021), which has recently been proposed to explicitly deal with heterophilic graphs. From Fig. 5 we observe that G² performs on par with, and slightly better than, GGCN in strongly heterophilic settings.

Heterophilic datasets.
In Table 2, we test the proposed framework on several real-world heterophilic graphs (with a homophily level of ≤ 0.30) (Pei et al., 2020; Rozemberczki et al., 2021) and benchmark it against the baseline models GraphSAGE (Hamilton et al., 2017), GCN (Kipf & Welling, 2017), GAT (Velickovic et al., 2018) and MLP (Goodfellow et al., 2016), as well as recent state-of-the-art models on heterophilic graph datasets, i.e., GGCN (Yan et al., 2021), GPRGNN (Chien et al., 2020), H2GCN (Zhu et al., 2020), FAGCN (Bo et al., 2021), F2GAT (Wei et al., 2022), MixHop (Abu-El-Haija et al., 2019), GCNII (Chen et al., 2020b), Geom-GCN (Pei et al., 2020) and PairNorm (Zhao & Akoglu, 2019). We observe that G² added to GCN, GAT or GraphSAGE outperforms all other methods (in particular, recent methods such as GGCN, GPRGNN and H2GCN that were explicitly designed for heterophilic tasks). Moreover, adding G² to the underlying base GNN model improves the results on average by 45.75% for GAT, 45.4% for GCN and 18.6% for GraphSAGE. Table 3 shows the results of G²-GraphSAGE together with other standard GNNs, as well as recent state-of-the-art models, i.e., MLP (Goodfellow et al., 2016), GCN (Kipf & Welling, 2017), GAT (Velickovic et al., 2018), MixHop (Abu-El-Haija et al., 2019), LINK(X) (Lim et al., 2021), GCNII (Chen et al., 2020b), APPNP (Klicpera et al., 2018), GloGNN (Li et al., 2022), GPR-GNN (Chien et al., 2020) and ACM-GCN (Luan et al., 2021). We see that G²-GraphSAGE significantly outperforms the current state-of-the-art (by up to 13%) on the two heterophilic graphs (snap-patents and arXiv-year). Moreover, G²-GraphSAGE is on par with the current state-of-the-art on the homophilic graph dataset genius. We conclude that the proposed gradient gating method can successfully be scaled up to large graphs, reaching state-of-the-art performance, in particular on heterophilic graph datasets.

5. RELATED WORK

Gating. Gating is a key component of our proposed framework. The use of gating (i.e., modulation between 0 and 1) of hidden-layer outputs has a long pedigree in neural networks and sequence modeling. In particular, classical recurrent neural network (RNN) architectures such as LSTM (Hochreiter & Schmidhuber, 1997) and GRU (Cho et al., 2014) rely on gates to modulate information propagation in the RNN. Given the connections between RNNs and early versions of GNNs (Zhou et al., 2019), it is not surprising that the idea of gating has been used in designing GNNs (Bresson & Laurent, 2017; Li et al., 2016; Zhang et al., 2018). However, to the best of our knowledge, the use of local graph gradients to further modulate gating in order to alleviate the oversmoothing problem is novel, and so is its theoretical analysis.

Multi-scale methods. The multi-rate gating procedure used in G² is a particular example of a multi-scale mechanism. The use of multi-scale neural network architectures has a long history. An early example is Hinton & Plaut (1987), who proposed a neural network in which each connection has a fast-changing weight for temporary memory and a slow-changing weight for long-term learning. Classical convolutional neural networks (CNNs, LeCun et al. (1989)) can be viewed as multi-scale architectures for processing multiple spatial scales in images (Bai et al., 2020). Moreover, there is a close connection between our multi-rate mechanism (4) and the use of multiple time scales in recently proposed sequence models such as UnICORNN (Rusch & Mishra, 2021) and long expressive memory (LEM) (Rusch et al., 2022b). In line with these works, one contribution of our paper is a continuous version of G² (9), which we used for a rigorous analysis of the oversmoothing problem. Understanding whether this system of ODEs has an interpretation as a known physical model is a topic for future research.

6. DISCUSSION

We have proposed a novel framework, termed G², for efficient learning on graphs. G² builds on standard MPNNs but seeks to overcome their limitations. In particular, we focus on the fact that in standard MPNNs such as GCN or GAT, each node (in every hidden layer) is updated at the same rate. This might inhibit efficient learning of tasks where different node features need to be updated at different rates. Hence, we equip a standard MPNN with gates that amount to a multi-rate modulation of the hidden-layer output in (4). This enables multiple rates (or scales) of information flow across a graph. Moreover, we leverage local (graph) gradients to further constrain the gates. This is done to alleviate oversmoothing, where node features become indistinguishable as the number of layers is increased. By combining these ingredients, we obtain a very flexible framework (dubbed G²) for graph machine learning, wherein any existing MPNN hidden layer can be employed as the coupling function and the multi-rate gradient gating mechanism can be built on top of it. Moreover, we show that G² corresponds to a time discretization of a system of ODEs (9). By studying the (in)stability of the corresponding zero-Dirichlet-energy steady states, we rigorously prove that gradient gating can mitigate the oversmoothing problem, paving the way for the use of very deep GNNs within the G² framework. In contrast, the lack of gradient gating is shown to lead to oversmoothing. We also present an extensive empirical evaluation to illustrate different aspects of the proposed G² framework.
Starting with synthetic, small-scale experiments, we demonstrate that i) G² can prevent oversmoothing by keeping the Dirichlet energy constant, even for a very large number of hidden layers; ii) this feature allows us to deploy very deep architectures and to observe that the accuracy of classification tasks can increase with increasing number of hidden layers; iii) the multi-rate mechanism significantly improves performance on node regression tasks when the node features are distributed over a range of scales; and iv) G² is very accurate at classification on heterophilic datasets, with an increasing gain in performance with increasing heterophily. This last feature was investigated more extensively, and we observed that G² can significantly outperform baselines as well as recently proposed methods on both medium-scale and large-scale heterophilic benchmarks, achieving state-of-the-art performance. Thus, by a combination of theory and experiments, we demonstrate that the G² framework is a promising approach for learning on graphs.

Future work. As future work, we would like to better understand the continuous limit of G², i.e., the ODEs (9), especially in the zero spatial-resolution limit, and to investigate whether the resulting continuous equations have interesting geometric and analytical properties. Moreover, we would like to use G² for solving scientific problems, such as in computational chemistry or the numerical solution of PDEs. Finally, the promising results for G² on large-scale graphs encourage us to use it for even larger, industrial-scale applications.

On the sensitivity of G² to the hyperparameter p. It is natural to ask how sensitive the performance of G² is to different values of the hyperparameter p. To answer this question, we trained different G²-GraphSAGE models on a variety of graph datasets (Texas, Squirrel, Film, Wisconsin and Chameleon) for different values of p ∈ [1, 5]. Fig. 7 shows the resulting performance of G²-GraphSAGE.
We can see that different values of p do not significantly change the performance of the model. However, including the hyperparameter p in the hyperparameter fine-tuning procedure can further improve the overall performance of G².

On the sensitivity of G² to the number of parameters. All results for G² provided in the main paper are obtained using standard hyperparameter tuning (i.e., random search). These hyperparameters include the number of hidden channels for each hidden node of the graph, which directly correlates with the total number of parameters used in G². It is thus natural to ask how G² performs compared to its plain counterpart (e.g., G²-GCN versus GCN) for exactly the same total number of parameters of the underlying model. To this end, Fig. 8 shows the test accuracies of G²-GCN and GCN for an increasing number of total parameters in the corresponding model. We see, first, that using more parameters has only a slight effect on the overall performance of both models. Second, and most importantly, G²-GCN consistently reaches significantly higher test accuracies for exactly the same total number of parameters. We can thus rule out that the better performance of G² compared to its plain counterparts is explained by the use of more parameters.

Ablation of F̂_θ in G². In its most general form, G² (4) uses an additional GNN F̂_θ to construct the multiple rates τ^n. Is this additional GNN needed? To answer this question, Table 4 shows in which of the provided experiments (using G²-GraphSAGE) we actually used an additional GNN F̂_θ (as suggested by our hyperparameter tuning protocol). We can see that on small-scale experiments an additional GNN is not needed. However, on the considered medium- and large-scale graph datasets it is beneficial to use it.
Motivated by this, Table 5 shows the results for G²-GraphSAGE on the three medium-scale graph datasets (Film, Squirrel and Chameleon) without using an additional GNN in (4) (i.e., F_θ = F̂_θ) as well as with an additional GNN (i.e., the results provided in the main paper). We can see that while G²-GraphSAGE without an additional GNN (i.e., w/o F̂_θ) yields competitive results, an additional GNN is needed in order to obtain state-of-the-art results on these three datasets.

Ablation of multi-rate channels in G². The cornerstone of the proposed G² is the multi-rate matrix τ^n in (4), which mitigates the oversmoothing issue for any given GNN (Proposition 3.4). This multi-rate matrix learns different rates not only for every node, but also for every channel of every node. It is thus natural to ask whether the multi-rate property for the channels is necessary, or whether having multiple rates for the different nodes alone is sufficient, i.e., having a multi-rate vector τ^n ∈ R^v. A direct construction of such a multi-rate vector (derived from our proposed G²) is:

τ̂^n = σ̂(F̂_θ(X^{n-1}, G)),
τ^n_i = tanh( Σ_{j∈N_i} ∥τ̂^n_j − τ̂^n_i∥_p ),    (16)
X^n = (1 − τ^n) ⊙ X^{n-1} + τ^n ⊙ σ(F_θ(X^{n-1}, G)).

Note that the only difference to our proposed G² is in the second equation of (16), where we sum over the node-wise p-norms of the differences of adjacent nodes. This way, we compute a single scalar τ^n_i for every node i ∈ V. Table 6 shows the results of our proposed G²-GraphSAGE as well as the single-rate-per-node ablation of G² (eq. (16)) on the Film, Squirrel and Chameleon graph datasets. As a baseline, we also include the results of plain GraphSAGE. We can see that while G² with a single rate per node outperforms the base GraphSAGE model, our proposed G² with multi-rate channels vastly outperforms the single-rate version of G².
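The difference between the channel-wise gates of (4) and the node-wise gates of (16) can be illustrated directly. In the sketch below (toy candidate rates and hypothetical names), one feature channel is already constant across the graph: the channel-wise construction can close the gate for that channel alone, whereas the node-wise construction is driven by the other channel and therefore keeps updating both:

```python
import math

neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

def channel_rates(tau_hat, G, p=2.0):
    """Multi-rate matrix of (4): one gate per node AND channel,
    tau_ik = tanh(sum_j |tau_hat_jk - tau_hat_ik|^p)."""
    m = len(tau_hat[0])
    return [[math.tanh(sum(abs(tau_hat[j][k] - tau_hat[i][k]) ** p for j in G[i]))
             for k in range(m)] for i in range(len(tau_hat))]

def node_rates(tau_hat, G, p=2.0):
    """Single-rate-per-node ablation of eq. (16):
    tau_i = tanh(sum_j ||tau_hat_j - tau_hat_i||_p), one scalar per node."""
    m = len(tau_hat[0])
    return [math.tanh(sum(
                sum(abs(tau_hat[j][k] - tau_hat[i][k]) ** p for k in range(m)) ** (1.0 / p)
                for j in G[i]))
            for i in range(len(tau_hat))]

# Channel 0 varies across the graph; channel 1 is perfectly smooth.
tau_hat = [[0.2, 0.5], [0.8, 0.5], [0.3, 0.5], [0.9, 0.5]]
r_ch = channel_rates(tau_hat, neighbors)
r_nd = node_rates(tau_hat, neighbors)
```

Here `r_ch` freezes the smooth channel (gate exactly 0) while still updating the varying one; `r_nd` assigns every node a single positive rate, so the smooth channel keeps being updated as well.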
Table 6: Test accuracies of plain GraphSAGE, G 2 -GraphSAGE with multi-rate channels for each node (i.e., standard G 2 (4)), and G 2 -GraphSAGE with only a single rate per node on Film, Squirrel and Chameleon.

Alternative measures of oversmoothing. The proofs of Propositions 3.2 and 3.3, as well as the first experiment in the main paper, are based on the definition of oversmoothing through the Dirichlet energy, which was proposed in Rusch et al. (2022a). However, there exist alternative measures of the oversmoothing phenomenon in deep GNNs. One such measure is the mean average distance (MAD), which was proposed in Chen et al. (2020a). To check whether our proposed G 2 also mitigates oversmoothing as defined through the MAD measure, we repeat the first experiment of the main paper and plot the MAD instead of the Dirichlet energy for an increasing number of layers in Fig. 9. We can see that while the MAD of a plain GCN and GAT converges to zero exponentially fast with an increasing number of layers, it remains constant for G 2 -GCN and G 2 -GAT. We can thus conclude that G 2 also mitigates oversmoothing as defined through the MAD measure.
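Both measures are straightforward to compute. The following numpy sketch (our implementation; the normalizations may differ by constant factors from the definitions in Rusch et al. (2022a) and Chen et al. (2020a)) shows repeated neighbor averaging driving both the Dirichlet energy and the MAD of node features toward zero, i.e., oversmoothing:

```python
import numpy as np

def dirichlet_energy(X, A):
    """Sum of squared feature differences across edges, averaged over nodes."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return 0.5 * (A * d2).sum() / len(X)

def mad(X, A):
    """Mean average distance: mean cosine distance between adjacent features."""
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    return (A * (1.0 - Xn @ Xn.T)).sum() / A.sum()

# A small connected, non-bipartite graph (triangle with a pendant node).
A = np.array([[0., 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]])
X = np.random.RandomState(0).randn(4, 3)
deg = A.sum(axis=1, keepdims=True)
for _ in range(200):                # plain mean aggregation, no gating
    X = (A @ X) / deg
```

After a couple of hundred averaging layers, all node features are numerically identical, and both measures are (numerically) zero.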



EXPERIMENTAL RESULTS

In this section, we present an experimental study of G 2 on both synthetic and real datasets. We use G 2 with three different coupling functions: GCN (Kipf & Welling, 2017), GAT (Velickovic et al., 2018) and GraphSAGE (Hamilton et al., 2017).




Figure 2: Dirichlet energy E(X^n) of layer-wise node features X^n propagated through a GAT, a GCN, and their gradient-gated versions (G 2 -GAT, G 2 -GCN).

Figure 3: Test accuracies of GCN with gradient gating (G 2 -GCN) as well as plain GCN and GCN combined with other methods on the Cora dataset for increasing number of layers.

Figure 4: Histogram of the target node values of the Chameleon and Squirrel node-level regression tasks.

Neural differential equations. Ordinary and partial differential equations (ODEs and PDEs) are playing an increasingly important role in designing, interpreting, and analyzing novel graph machine learning architectures (Avelar et al., 2019; Poli et al., 2019; Zhuang et al., 2020; Xhonneux et al., 2020). Chamberlain et al. (2021a) designed attentional GNNs by discretizing parabolic diffusion-type PDEs. Di Giovanni et al. (2022) interpreted GCNs as gradient flows minimizing a generalized version of the Dirichlet energy. Chamberlain et al. (2021b) applied a non-Euclidean diffusion equation ("Beltrami flow"), yielding a scheme with adaptive spatial derivatives ("graph rewiring"), and Topping et al. (2021) studied a discrete geometric PDE similar to Ricci flow to improve information propagation in GNNs. Eliasof et al. (2021) proposed a GNN framework arising from a mixture of parabolic (diffusion) and hyperbolic (wave) PDEs on graphs with convolutional coupling operators, which describe dissipative wave propagation. Finally, Rusch et al. (2022a) used systems of nonlinear oscillators coupled through the associated graph structure to rigorously overcome the oversmoothing problem.

Figure 8: Test accuracies of plain GCN and G 2 -GCN on Texas for varying number of total parameters in the GNN.

Normalized test MSE on multiscale node-level regression tasks. We test the multi-rate nature of G 2 on node-level regression tasks where the target node values exhibit multiple scales. Due to a lack of widely available node-level regression tasks, we propose regression experiments based on the Wikipedia article networks Chameleon and Squirrel (Rozemberczki

Results on heterophilic graphs. The three best performing methods are highlighted in red (First), blue (Second), and violet (Third).

                 Texas          Wisconsin      Film           Squirrel       Chameleon      Cornell
G 2 -GraphSAGE   87.57 ± 3.86   87.84 ± 3.49   37.14 ± 1.01   64.26 ± 2.38   71.40 ± 2.38   86.22 ± 4.90

Results on large-scale datasets.

Table 4: Usage of F̂_θ in G 2 (4) for each result with G 2 -GraphSAGE presented in the main paper (YES indicates the usage of F̂_θ, while NO indicates that no additional GNN is used to construct the multiple rates, i.e., F̂_θ = F_θ).

Texas   Wisconsin   Film   Squirrel   Chameleon   Cornell   snap-patents   arXiv-year   genius

Table 5: Test accuracies of G 2 -GraphSAGE with and without the additional GNN (i.e., w/ F̂_θ and w/o F̂_θ in (4)) on the Film, Squirrel and Chameleon graph datasets.

                          Film           Squirrel       Chameleon
G 2 -GraphSAGE w/ F̂_θ    37.14 ± 1.01   64.26 ± 2.38   71.40 ± 2.38
G 2 -GraphSAGE w/o F̂_θ   36.83 ± 1.26   55.78 ± 1.61   65.04 ± 2.27

                                              Film           Squirrel       Chameleon
GraphSAGE                                     34.23 ± 0.99   41.61 ± 0.74   58.73 ± 1.68
G 2 -GraphSAGE w/ multi-rate channels (4)     37.14 ± 1.01   64.26 ± 2.38   71.40 ± 2.38
G 2 -GraphSAGE w/ single-rate channels (16)   36.67 ± 0.56   44.03 ± 1.01   60.29 ± 3.45

ACKNOWLEDGEMENTS

The research of TKR and SM was performed under a project that has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (Grant Agreement No. 770880). MWM would like to acknowledge the IARPA (contract W911NF20C0035), NSF, and ONR for providing partial support of this work. MB is supported in part by ERC Grant No. 724228 (LEMAN). The authors thank Emmanuel de Bézenac (ETH) for his constructive suggestions.

Supplementary Material for:

Gradient Gating for Deep Multi-Rate Learning on Graphs

A ADDITIONAL EXPERIMENTS

In this section, we describe additional empirical results to complement those in the main text.

On the multi-rate effect of G 2 . Here, we analyze the performance of G 2 on the multi-scale node-level regression task of the main text. As shown in the main text, G 2 applied to GCN or GAT outperforms the plain counterparts (GCN and GAT) on the multi-scale node-level regression task by more than 50% on Chameleon and more than 100% on Squirrel. The question therefore arises whether this better performance can be explained by the multi-rate nature of gradient gating. To analyze this empirically, we add a control parameter α to G 2 (4) by interpolating the rates toward one,

    τ^n_α = α τ^n + (1 − α) 1,
    X^n = (1 − τ^n_α) ⊙ X^{n-1} + τ^n_α ⊙ σ(F_θ(X^{n-1}, G)).

Clearly, setting α = 1 recovers the original gradient gating message-passing update, while setting α = 0 disables any explicit multi-rate behavior and a plain message-passing scheme X^n = σ(F_θ(X^{n-1}, G)) is recovered. Thus, continuously changing α from 0 to 1 controls the level of multi-rate behavior in the proposed gradient gating method. In Fig. 6 we plot the test NMSE of the best performing G 2 -GCN and G 2 -GAT on the Chameleon multi-scale node-level regression task for increasing values of α ∈ [10^{-3}, 1] in log-scale. We can see that the test NMSE monotonically decreases (lower error means better performance) for both G 2 -GCN and G 2 -GAT with increasing values of α, i.e., increasing level of multi-rate behavior. We can conclude that the multi-rate behavior of G 2 is instrumental in successfully learning multi-scale regression tasks.
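A minimal numpy sketch of this controlled update, under our reading of the interpolation (blending the rates toward one; tanh stands in for σ and for a simple coupling F_θ, and `g2_alpha_step` is a hypothetical helper name):

```python
import numpy as np

def g2_alpha_step(X, A, tau, alpha):
    """Gated update with multi-rate control alpha in [0, 1].

    alpha = 1 uses the learned rates tau unchanged (full gradient gating);
    alpha = 0 blends all rates to 1, recovering the plain message-passing
    update X^n = sigma(F_theta(X^{n-1}, G)).
    """
    deg = A.sum(axis=1, keepdims=True)
    msg = np.tanh((A @ X) / np.maximum(deg, 1.0))   # sigma(F_theta(X, G))
    tau_alpha = alpha * tau + (1.0 - alpha)         # interpolate rates toward 1
    return (1.0 - tau_alpha) * X + tau_alpha * msg
```

Sweeping `alpha` over a log-spaced grid in [10^{-3}, 1] reproduces the kind of ablation described above.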

Figure 6: Test NMSE of G 2 -GCN and G 2 -GAT on the Chameleon multi-scale node-level regression task for increasing values of α ∈ [10^{-3}, 1] (log-scale x-axis).

On the sensitivity of the performance of G 2 to the hyperparameter p. The proposed gradient gating model implicitly depends on the hyperparameter p, which defines the multiple rates τ, i.e.,

    τ^n_{ik} = tanh( Σ_{j∈N_i} |τ̂^n_{jk} − τ̂^n_{ik}|^p ).

While any value p > 0 can be used in practice, a standard hyperparameter tuning procedure (see Appendix B for the training details) on p has been applied in every experiment included in this paper.
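The effect of p on the rates can be seen directly: when the gradient differences are smaller than one, larger p shrinks the sums inside the tanh and therefore the rates, but the gating mechanism itself is unchanged. A small numpy sketch (our illustration; `gating_rates` is a hypothetical helper name):

```python
import numpy as np

def gating_rates(tau_hat, A, p):
    """tau_ik = tanh(sum_{j in N_i} |tau_hat_jk - tau_hat_ik|^p)."""
    diffs = np.abs(tau_hat[None, :, :] - tau_hat[:, None, :]) ** p
    return np.tanh((A[:, :, None] * diffs).sum(axis=1))

A = np.array([[0., 1, 1], [1, 0, 1], [1, 1, 0]])   # triangle graph
tau_hat = np.array([[0.1], [0.5], [0.3]])          # one channel, values in (0, 1)
rates = {p: gating_rates(tau_hat, A, p) for p in (1.0, 2.0, 4.0)}
```

All rates lie in [0, 1); here they decrease monotonically as p grows, consistent with p shifting, but not destroying, the gating behavior.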

B TRAINING DETAILS

All small- and medium-scale experiments were run on NVIDIA GeForce RTX 2080 Ti, GeForce RTX 3090, TITAN RTX and Quadro RTX 6000 GPUs. The large-scale experiments were run on NVIDIA Tesla A100 (40GiB) GPUs. All hyperparameters were tuned using random search. Table 7 shows the range of each hyperparameter as well as the random distribution used to sample from it. Moreover, Table 8 shows the rounded hyperparameter p in G 2 (4) of each best-performing network.
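As an illustration, a random-search loop of this kind can be sketched as follows. The search space below is entirely hypothetical, not the actual ranges of Table 7:

```python
import random

# Hypothetical search space: name -> sampler (NOT the paper's Table 7 values).
SPACE = {
    "lr":      lambda rng: 10 ** rng.uniform(-4, -2),   # log-uniform learning rate
    "hidden":  lambda rng: rng.choice([64, 128, 256]),  # hidden channels per node
    "p":       lambda rng: rng.uniform(1.0, 4.0),       # gating exponent p
    "dropout": lambda rng: rng.uniform(0.0, 0.6),
}

def sample_configs(n, seed=0):
    """Draw n independent hyperparameter configurations."""
    rng = random.Random(seed)
    return [{k: draw(rng) for k, draw in SPACE.items()} for _ in range(n)]
```

Each sampled configuration would then be trained and scored on the validation set, keeping the best one.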

C MATHEMATICAL DETAILS

In this section, we provide proofs for Propositions 3.3 and 3.4 of the main text. We start with the following technical result, which is needed in the subsequent proofs.

A Poincaré inequality on connected graphs. Poincaré inequalities for functions (Evans, 2010) bound function values in terms of their gradients. Similar bounds on node values in terms of graph gradients can be derived; a particular instance is given below.

Proposition C.1. Let G = (V, E) be a connected graph with (scalar) node features y_i ∈ R for all i ∈ V, and let y_1 = 0. Then the following bound holds,

    Σ_{i∈V} |y_i|^2 ≤ d Δ_1^2 Σ_{i∈V} Σ_{j∈N_i} |y_j − y_i|^2,      (17)

where d = max_{i∈V} deg(i) and Δ_1 is the eccentricity of the node 1.

Proof. Fix a node i ∈ V. By assumption, the graph G is connected. Hence, there exists a path connecting i and the node 1; denote the shortest such path by P(i, 1). This path can be expressed in terms of the nodes ℓ_{i,1} with 0 ≤ ℓ ≤ δ_{i,1}, where 0_{i,1} = 1 and δ_{i,1} = i, and ℓ_{i,1} ∼ (ℓ + 1)_{i,1} for all ℓ. Moreover, δ_{i,1} is the graph distance between the nodes i and 1, and Δ_1 = max_{i∈V} δ_{i,1} is the eccentricity of the node 1. Given the node feature y_i, we can rewrite it as a telescoping sum along the path,

    y_i = Σ_{ℓ=0}^{δ_{i,1}−1} ( y_{(ℓ+1)_{i,1}} − y_{ℓ_{i,1}} ),

as by assumption y_1 = 0. Applying the Cauchy-Schwarz inequality to this identity yields

    |y_i|^2 ≤ δ_{i,1} Σ_{ℓ=0}^{δ_{i,1}−1} | y_{(ℓ+1)_{i,1}} − y_{ℓ_{i,1}} |^2.

Summing this inequality over i ∈ V and using the fact that ℓ_{i,1} ∼ (ℓ + 1)_{i,1}, so that every difference on the right-hand side is a difference across an edge of G, we obtain the desired Poincaré inequality (17).
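A Poincaré-type bound of this form, Σ_i y_i² ≤ d·Δ₁² · Σ_i Σ_{j∈N_i} (y_j − y_i)², can be checked numerically. The constant d·Δ₁² is our reading of the garbled statement (the paper's exact constant may differ, but this choice holds on the graphs tested here); node 0 plays the role of node 1:

```python
import numpy as np
from collections import deque

def eccentricity(A, src=0):
    """Largest BFS distance from src in the graph with adjacency matrix A."""
    n = len(A)
    dist = [-1] * n
    dist[src] = 0
    q = deque([src])
    while q:
        u = q.popleft()
        for v in range(n):
            if A[u][v] and dist[v] < 0:
                dist[v] = dist[u] + 1
                q.append(v)
    return max(dist)

def poincare_sides(y, A):
    """Return (sum_i y_i^2, d * ecc^2 * sum_i sum_{j in N_i} (y_j - y_i)^2)."""
    y, A = np.asarray(y, float), np.asarray(A, float)
    lhs = (y ** 2).sum()
    grad = (A * (y[None, :] - y[:, None]) ** 2).sum()  # ordered sum over edges
    d = int(A.sum(axis=1).max())
    return lhs, d * eccentricity(A) ** 2 * grad

# Path graph on 6 nodes with y_0 = 0: lhs = 55, rhs = 2 * 5^2 * 10 = 500.
A = np.diag(np.ones(5), 1) + np.diag(np.ones(5), -1)
lhs, rhs = poincare_sides([0, 1, 2, 3, 4, 5], A)
```

The gap is typically large; the bound is not meant to be tight, only sufficient for the stability arguments that follow.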

C.1 PROOF OF PROPOSITION 3.3 OF MAIN TEXT

Proof. By the definition of exponential stability, we consider a small perturbation around the steady state c and study whether this perturbation grows or decays in time. To this end, we define the perturbations x̂_i as in (18). A tedious but straightforward calculation shows that these perturbations evolve according to the linearized system of ODEs (19). Multiplying both sides of (19) by x̂_i yields the identity (20). Summing this identity over all nodes i ∈ V, applying the Poincaré inequality (17) to the perturbations X̂, and using the fact that by assumption X̂_1 = 0, we obtain a differential inequality with a strictly negative right-hand side for the perturbation energy. Applying Grönwall's inequality then shows that the initial perturbations around the steady state c are damped exponentially fast. Thus the steady state c is exponentially stable, implying that this architecture will lead to oversmoothing.
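The exponential damping can be illustrated numerically. The sketch below integrates a heat-equation-style linearization dx/dt = −Lx on a small connected graph — a simple model for a linearized system of this type, not the exact ODE (19) — and the perturbation norm decays exponentially fast:

```python
import numpy as np

A = np.array([[0., 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]])
L = np.diag(A.sum(axis=1)) - A           # graph Laplacian of a connected graph

x = np.random.RandomState(1).randn(4)
x -= x.mean()                            # remove the constant steady-state mode
dt, norms = 0.01, []
for _ in range(2000):                    # explicit Euler for dx/dt = -L x
    x = x - dt * (L @ x)
    norms.append(float(np.linalg.norm(x)))
```

Since the graph is connected, the energy contracts by a fixed factor per unit time (the Grönwall argument), so log ‖x(t)‖ decreases essentially linearly in t.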

C.2 PROOF OF PROPOSITION 3.4 OF MAIN TEXT

Proof. As in the proof of Proposition 3.3, we consider small perturbations of the form (18) around the steady state c and investigate how these perturbations evolve in time. Assuming that the initial perturbations are small, i.e., that there exists 0 < ϵ ≪ 1 such that max_{i∈V} |x̂_i(0)| ≤ ϵ, a straightforward calculation shows that the perturbations (for a short time) evolve according to the quasi-linearized system of ODEs (21). Note that we have used the facts that σ′(x) = 1 and tanh′(0) = 1 in obtaining (21) from (14). Next, we multiply both sides of (21) by x̂_i to obtain the identity (22). Applying this estimate to the last line of (22) and summing over i ∈ V leads to the differential inequality (23), whose right-hand side we split into two terms, T_1 and T_2.

We analyze the differential inequality (23) by starting with the term T_1. We observe that this term does not have a definite sign and can be either positive or negative. However, we can bound it from above in the following manner. Given that the right-hand side of the ODE system (21) is Lipschitz continuous, the classical Cauchy-Lipschitz theorem states that the solutions depend continuously on the initial data. Given that max_{i∈V} |x̂_i(0)| ≤ ϵ ≪ 1 and the bounds on the hidden states (1), there exists a time t > 0 up to which the perturbations remain of size O(ϵ). Using the definition of τ and the right stochasticity of the matrix A, we then easily obtain the bound (24) on T_1, where d = max_{i∈V} deg(i).

On the other hand, the term T_2 in (23) is clearly positive. Hence, the solutions of the resulting ODE (25) will decay in time. The key question is whether or not the decay is exponentially fast. We answer this question below. Using Hölder's inequality, and observing that X̂_1 = 0 by assumption, we can apply the Poincaré inequality (17) to obtain the lower bound (26) on T_2. Therefore, the differential inequality (25) reduces to (27), which can be explicitly solved to obtain (28). From (28), we see that the initial perturbations decay, but only algebraically, at a rate of t^{−2/p} in time. For instance, the decay is only linear in time for p = 2, and even slower for higher values of p. Combining the analysis of the terms T_1 and T_2 in the differential inequality (23), we see that one of the terms can lead to growth of the initial perturbations, whereas the other only leads to polynomial decay. Even if the contribution of the term T_1 is identically zero, the decay of the initial perturbations is only polynomial. Thus, the steady state c is not exponentially stable.

Remark C.2. We note that Proposition 3.4 assumes a certain structure of the matrix A in (14). A careful perusal of the proof presented above reveals that these assumptions can be further relaxed. To start with, if the matrix A(c, c) is not symmetric, then there will be an additional term in the inequality (23), proportional to A_ij − A_ji. This term is of indefinite sign and can cause further growth of the perturbations around the steady state c; in any case, it can only further destabilize the quasi-linearized system. The assumption that the entries of A are uniformly positive amounts to assuming positivity of the weights of the underlying GNN layer. This can be replaced by requiring that the corresponding eigenvalues are uniformly positive. If some eigenvalues are negative, this causes further instability and only strengthens the conclusion of the lack of (exponential) stability. Finally, the assumption that one node is not perturbed during the quasi-linearization is required for the Poincaré inequality (17). If it does not hold, an additional term of indefinite sign is added to the inequality (23). This term can cause further growth of the perturbations and only adds instability to the system. Hence, all the assumptions in Proposition 3.4 can be relaxed, and the conclusion that the zero-Dirichlet-energy steady state lacks exponential stability still holds.
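The algebraic t^{−2/p} rate can be made concrete with a model ODE: absorbing all constants into one, a differential inequality of this type behaves like dE/dt = −E^{1+p/2}, whose closed-form solution exhibits exactly this decay (a sketch of the computation, with constants normalized to one):

```python
# Model ODE for the perturbation energy: dE/dt = -E^{1 + p/2}, E(0) = 1.
# Separation of variables gives the closed form
#   E(t) = (1 + (p/2) * t) ** (-2 / p),
# verified by differentiating: E'(t) = -(1 + (p/2) t)^(-2/p - 1) = -E^{1 + p/2}.
def energy(t, p):
    return (1.0 + 0.5 * p * t) ** (-2.0 / p)
```

For p = 2 this is E(t) = 1/(1 + t), i.e. only linear decay in time, and larger p is slower still — matching the statement that the decay is merely algebraic, never exponential.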

