ωGNNS: DEEP GRAPH NEURAL NETWORKS ENHANCED BY MULTIPLE PROPAGATION OPERATORS

Abstract

Graph Neural Networks (GNNs) are limited in their propagation operators. These operators often contain only non-negative elements and are shared across channels and layers, limiting the expressiveness of GNNs. Moreover, some GNNs suffer from over-smoothing, limiting their depth. On the other hand, Convolutional Neural Networks (CNNs) can learn diverse propagation filters, and phenomena like over-smoothing are typically not apparent in CNNs. In this paper, we bridge these gaps by incorporating trainable channel-wise weighting factors ω to learn and mix multiple smoothing and sharpening propagation operators at each layer. Our generic method is called ωGNN, and we study two variants: ωGCN and ωGAT. For ωGCN, we theoretically analyse its behaviour and the impact of ω on the obtained node features. Our experiments confirm these findings, demonstrating and explaining how both variants do not over-smooth. Additionally, we experiment with 15 real-world datasets on node- and graph-classification tasks, where our ωGCN and ωGAT perform better than or on par with state-of-the-art methods.

1. INTRODUCTION

Graph Neural Networks (GNNs) are useful for a wide array of fields, from computer vision and graphics (Monti et al., 2017; Wang et al., 2018; Eliasof & Treister, 2020) and social network analysis (Kipf & Welling, 2016; Defferrard et al., 2016) to bio-informatics (Hamilton et al., 2017; Jumper et al., 2021). Most GNNs are defined by applications of propagation and point-wise operators, where the former is often fixed and based on the graph Laplacian (e.g., GCN (Kipf & Welling, 2016)), or is defined by an attention mechanism (Veličković et al., 2018; Kim & Oh, 2021; Brody et al., 2022). Most recent GNNs follow a general structure that involves two main ingredients, the propagation operator, denoted by S^(l), and a 1 × 1 convolution, denoted by K^(l), as follows:

f^(l+1) = σ(S^(l) f^(l) K^(l)),    (1)

where f^(l) denotes the feature tensor at the l-th layer.

The main limitation of the above formulation is that the propagation operators in most common architectures are constrained to be non-negative. This leads to two drawbacks. First, it limits the expressiveness of GNNs. For example, the gradient of given graph node features cannot be expressed by a non-negative operator, while a mixed-sign operator as in our proposed method can (see demonstrations in Fig. 1 and Fig. 2). Moreover, the utilization of strictly non-negative propagation operators yields a smoothing process that may lead GNNs to suffer from over-smoothing, i.e., the phenomenon where node features become indistinguishable from one another as more GNN layers are stacked, causing severe performance degradation in deep GNNs (Li et al., 2018; Wu et al., 2019; Wang et al., 2019). Neither of these drawbacks is evident in Convolutional Neural Networks (CNNs), which can be interpreted as structured versions of GNNs (i.e., GNNs operating on a regular grid).
The structured convolutions in CNNs allow learning diverse propagation operators, and in particular it is known that mixed-sign kernels such as sharpening filters are useful feature extractors in CNNs (Krizhevsky et al., 2012); such operators cannot be obtained from non-negative (smoothing) kernels alone. In the context of GNNs, Eliasof et al. (2022) have shown the significance and benefit of employing mixed-sign propagation operators as well. In addition, the over-smoothing phenomenon is typically not evident in standard CNNs, where the propagation (spatial) filters are learnt, and adding more layers usually improves accuracy (He et al., 2016). The discussion above exposes two gaps between CNNs and GNNs that we seek to bridge in this work.

A third gap between GNNs and CNNs is the ability of the latter to learn and mix multiple propagation operators. In the scope of separable convolutions, CNNs typically learn a distinct kernel per channel, known as a depth-wise convolution (Sandler et al., 2018), a key element in modern CNNs (Tan & Le, 2019; Liu et al., 2022). On the contrary, the propagation operator S^(l) from equation 1 acts on all channels (Chen et al., 2020b; Veličković et al., 2018), and in some cases on all layers (Kipf & Welling, 2016; Wu et al., 2019). We note that one exception is the multi-head GAT (Veličković et al., 2018), where several attention heads are learnt per layer. However, this approach typically employs only a few heads due to its high computational cost, and is still limited to learning non-negative propagation operators.

In this paper we propose an effective modification to GNNs that directly addresses the three shortcomings discussed above, by introducing a parameter ω to control the contribution and type of the propagation operator. We call our general approach ωGNN, and utilize GCN (Kipf & Welling, 2016) and GAT (Veličković et al., 2018) to construct two variants, ωGCN and ωGAT.
First, we theoretically prove and empirically demonstrate that our ωGNN can prevent over-smoothing. Second, we show that by learning ω, our ωGNNs can yield propagation operators with mixed signs, ranging from smoothing to sharpening operators, which do not exist in current GNNs (see Fig. 1 for an illustration). This approach enhances the expressiveness of the network, as demonstrated in Fig. 2, and to the best of our knowledge was not considered in the GNNs mentioned above, which employ non-negative propagation operators only. Lastly, we propose and demonstrate that by learning a different ω per layer and channel, similarly to a depth-wise convolution in CNNs, our ωGNNs obtain state-of-the-art accuracy. Our contributions are summarized as follows:

• We propose ωGNN, an effective and computationally light modification to GNNs of a common and generic structure, which directly avoids over-smoothing and enhances the expressiveness of GNNs. Our method is demonstrated by ωGCN and ωGAT.
• We provide a theoretical analysis and experimental validation of the behaviour of ωGNN, exposing its improved expressiveness compared to standard propagation operators in GNNs.
• We propose to learn multiple propagation operators by learning ω per layer and per channel and mixing them using a 1 × 1 convolution, to enhance the performance of GNNs.
• Our experiments with 15 real-world datasets on numerous applications and settings, from semi- and fully-supervised node classification to graph classification, show that our ωGCN and ωGAT perform on par with or better than current state-of-the-art methods.

2. METHOD

We start by providing the notation used throughout this paper and presenting our general ωGNN in Sec. 2.1. Then we consider two popular GNNs that adhere to the structure presented in equation 1, namely GCN and GAT. We formulate and analyse the behaviour of their two counterparts, ωGCN and ωGAT, in Sec. 2.2 and 2.3, respectively.

Notation. Assume we are given an undirected graph defined by the set G = (V, E), where V is a set of n vertices and E is a set of m edges. Let us denote by f_i ∈ R^c the feature vector of the i-th node of G with c channels. We denote the adjacency matrix by A, where A_ij = 1 if there exists an edge (i, j) ∈ E and 0 otherwise. We also define the diagonal degree matrix D, where D_ii is the degree of the i-th node. The graph Laplacian is given by L = D − A. Let us also denote the adjacency and degree matrices with added self-loops by Ã and D̃, respectively. Lastly, we denote the symmetrically normalized graph Laplacian by L̂_sym = D̃^(−1/2) L̃ D̃^(−1/2), where L̃ = D̃ − Ã.
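As a concrete illustration of the notation above, the following sketch (ours, not the authors' code; the function name is our own) assembles these matrices with NumPy for a toy path graph:

```python
import numpy as np

def graph_matrices(A):
    """Build D, L = D - A, and the self-loop normalized Laplacian from adjacency A."""
    n = A.shape[0]
    D = np.diag(A.sum(axis=1))            # degree matrix D
    L = D - A                             # graph Laplacian L = D - A
    A_t = A + np.eye(n)                   # adjacency with self-loops (A tilde)
    d_t = A_t.sum(axis=1)                 # degrees with self-loops (diagonal of D tilde)
    L_t = np.diag(d_t) - A_t              # L tilde = D tilde - A tilde (equals L, see Appendix A)
    inv_sqrt = np.diag(1.0 / np.sqrt(d_t))
    L_sym = inv_sqrt @ L_t @ inv_sqrt     # symmetrically normalized Laplacian
    return D, L, L_sym

# Path graph 0-1-2-3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D, L, L_sym = graph_matrices(A)
```

Note that, as shown in Appendix A, adding self-loops leaves the (unnormalized) Laplacian unchanged, which the sketch makes easy to verify numerically.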

2.1. ωGNNS

The goal of ωGNNs is to utilize learnable mixed-sign propagation operators that control smoothing and sharpening to enrich GNN expressiveness. Below, we describe how the learnt ω influences the obtained operator and how to learn and mix multiple operators for enhanced expressiveness.

Learning the propagation weight ω. To address the expressiveness and over-smoothing issues, we suggest a general form given an arbitrary non-negative and normalized (e.g., such that its row sums equal 1) propagation operator S^(l). Our general ωGNN is then given by

f^(l+1) = σ((I − ω^(l)(I − S^(l))) f^(l) K^(l)),    (2)

where ω^(l) is a scalar learnt per layer; in the next paragraph we offer a more elaborate version with a parameter ω per layer and channel. The introduction of ω^(l) allows our ωGNN layer to behave in a three-fold manner. When 0 < ω^(l) ≤ 1, a smoothing process is obtained¹. For ω^(l) = 1, equation 2 reduces to the standard GNN dynamics from equation 1. In case ω^(l) = 0, equation 2 reduces to a 1 × 1 convolution followed by a non-linear activation function, and does not propagate neighbouring node features. On the other hand, if ω^(l) > 1, we obtain an operator with negative entries on the diagonal but positive entries off the diagonal, inducing a sharpening operator. An example of various ω^(l) values and their impulse responses is given in Fig. 1. Thus, a learnable ω^(l) allows learning a new family of operators, namely sharpening operators, that is not achieved by methods like GCN and GAT. To demonstrate the importance of sharpening operators, we consider a synthetic task of node gradient feature regression, given a graph and input node features (see Appendix B for more details). As depicted in Fig. 2, a non-negative operator as in GCN cannot accurately express the gradient operator output, while our ωGCN estimates the gradient output with machine-precision accuracy.
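The three regimes of ω can be made concrete with a minimal sketch (ours, not the paper's code) of the propagation part of equation 2, I − ω(I − S), for a row-normalized S with zero diagonal:

```python
import numpy as np

def omega_operator(S, omega):
    """P_omega = I - omega * (I - S), the propagation part of equation 2."""
    I = np.eye(S.shape[0])
    return I - omega * (I - S)

# Row-normalized adjacency of a triangle graph: zero diagonal, row sums of 1.
S = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])

P_id    = omega_operator(S, 0.0)  # omega = 0: identity, i.e. a pure 1x1-conv layer
P_std   = omega_operator(S, 1.0)  # omega = 1: standard propagation with S
P_sharp = omega_operator(S, 1.5)  # omega > 1: mixed-sign (sharpening) operator
```

For ω = 1.5 the diagonal entries become negative (−0.5) while the off-diagonal entries stay positive (0.75), i.e. a sharpening stencil that no non-negative operator can express.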
Also, the benefit of employing both smoothing and sharpening operators is reflected in the accuracy obtained by our method on real-world datasets in Sec. 4.

Multiple propagation operators. To learn multiple propagation operators, we extend equation 2 from a channel-shared weight to channel-wise weights by learning a vector ω⃗^(l) ∈ R^c as follows:

f^(l+1) = σ((I − Ω_ω⃗^(l)(I − S^(l))) f^(l) K^(l)),    (3)

where Ω_ω⃗^(l) is an operator that scales each channel j with a different ω^(l)_j. As discussed in Sec. 1, this procedure yields a propagation operator per channel, similar to depth-wise convolutions in CNNs (Howard et al., 2017; Sandler et al., 2018). Thus, the extension to a vector ω⃗^(l) helps to further bridge the gap between GNNs and CNNs. We note that using this approach, our ωGNN is applicable to many existing GNNs, in particular those which act as a separable convolution, as described in equation 1. In what follows, we present and analyse two variants based on GCN and GAT, called ωGCN and ωGAT, respectively.
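A full layer with channel-wise weights, as in equation 3, can be sketched as follows (our own minimal NumPy version; the function name and the tanh activation are our choices, not the authors'):

```python
import numpy as np

def omega_gnn_layer(f, S, omega_vec, K, act=np.tanh):
    """f <- act((I - Omega(omega_vec)(I - S)) f K): per-channel omega, then 1x1 conv K."""
    lap = f - S @ f                        # (I - S) f, shape (n, c)
    prop = f - lap * omega_vec[None, :]    # channel j is scaled by omega_vec[j]
    return act(prop @ K)

rng = np.random.default_rng(0)
S = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
f = rng.normal(size=(3, 4))                # 3 nodes, 4 channels
K = rng.normal(size=(4, 4))                # 1x1 convolution weights
out = omega_gnn_layer(f, S, rng.uniform(0, 2, size=4), K)
```

Setting all ω_j = 0 recovers a plain 1 × 1 convolution, and all ω_j = 1 recovers standard propagation σ(S f K), mirroring the regimes discussed above.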

2.2. ωGCN

GCNs are a class of GNNs that employ a pre-determined propagation operator P̂ = D̃^(−1/2) Ã D̃^(−1/2), which stems from the graph Laplacian. For instance, GCN (Kipf & Welling, 2016) is given by

f^(l+1) = σ(P̂ f^(l) K^(l)),    (4)

that is, by setting S^(l) = P̂ in equation 1. Other methods like SGC (Wu et al., 2019), GCNII (Chen et al., 2020b) and EGNN (Zhou et al., 2021) also rely on P̂ as a propagation operator.

[Figure 3a: the Dirichlet energy ratio E(f^(l))/E(f^(0)) per layer of ωGCN, for fixed ω ∈ {1, 0.1, 0.001, 0.0001} and for a learnt ω.]

The operator P̂ is a fixed non-negative smoothing operator; hence, repeated applications of equation 4 lead to the over-smoothing phenomenon, where the feature maps converge to a single eigenvector, as shown by Wu et al. (2019); Wang et al. (2019). Moreover, P̂ is pre-determined and depends solely on the graph connectivity, disregarding the node features, which may harm performance. By baking our proposed ωGNN with a learnable weight ω^(l) ∈ R into GCN, we obtain the following propagation scheme, named ωGCN:

f^(l+1) = σ((I − ω^(l)(I − P̂)) f^(l) K^(l)).    (5)

We now present theoretical analyses of our ωGCN and reason about its non-over-smoothing property. We first define the node features Dirichlet energy at the l-th layer, as in Zhou et al. (2021):

E(f^(l)) = Σ_{i∈V} Σ_{j∈N_i} ½ ‖ f^(l)_i / √(1+d_i) − f^(l)_j / √(1+d_j) ‖²₂.    (6)

Fig. 3a demonstrates how the Dirichlet energy E(f^(l)) decays to zero when ω is a constant, and to a fixed positive value when ω is learnt. Next, we provide a theorem that characterizes the behaviour of ω and how it prevents over-smoothing.

¹ The use of the value 1 in this discussion corresponds to a non-negative operator S^(l) with zeros on its diagonal, normalized to have row sums of 1. Other normalizations may yield other constants. Also, if 0 < S^(l)_ii < 1, then setting ω^(l) > 1/(1 − S^(l)_ii) flips the sign of the i-th diagonal entry.
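The decay of the Dirichlet energy under repeated applications of P̂ can be reproduced in a few lines; the sketch below (ours, not the paper's code) implements equation 6 directly and applies P̂ to random features on a triangle graph:

```python
import numpy as np

def dirichlet_energy(f, A):
    """Equation 6: sum over i and j in N_i of 0.5*||f_i/sqrt(1+d_i) - f_j/sqrt(1+d_j)||^2."""
    d = A.sum(axis=1)
    g = f / np.sqrt(1.0 + d)[:, None]
    i_idx, j_idx = np.nonzero(A)       # ordered node pairs: each edge appears twice
    return 0.5 * np.sum((g[i_idx] - g[j_idx]) ** 2)

# Repeated application of P = D~^{-1/2} A~ D~^{-1/2} over-smooths the features.
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)   # triangle graph
A_t = A + np.eye(3)
inv_sqrt = np.diag(1.0 / np.sqrt(A_t.sum(axis=1)))
P = inv_sqrt @ A_t @ inv_sqrt
f = np.random.default_rng(1).normal(size=(3, 2))
energies = [dirichlet_energy(np.linalg.matrix_power(P, k) @ f, A) for k in range(6)]
```

On this toy graph the energy collapses essentially to zero after a single application, a miniature version of the over-smoothing curves in Fig. 3a.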
To this end, we denote the propagation operator of ωGCN from equation 5 by

P̂_ω = I − ω(I − P̂) = I − ω(I − D̃^(−1/2) Ã D̃^(−1/2)) = I − ω D̃^(−1/2) L̃ D̃^(−1/2),    (7)

where the latter equality is shown in Appendix A. In essence, we show that repeatedly applying the operator P̂_ω is equivalent to applying gradient descent steps for minimizing equation 6 with a learning rate ω. We build on the observation that smoothing is beneficial (Gasteiger et al., 2019; Chamberlain et al., 2021) and assume that there exists an optimal energy at the last layer that satisfies 0 < E_opt(f^(L)) < E(f^(0)). Then, we show that if we learn a single ω^(l) = ω > 0, shared across all layers, then taking L to infinity will lead ω to zero. Thus, our ωGCN will not over-smooth, as the energy at the last layer E(f^(L)) can reach E_opt(f^(L)). Later, in Corollary 1.1, we generalize this result for a per-layer ω^(l), and empirically validate both results in Sec. 4.5. The proofs for the Theorem and Corollary below are given in Appendix A.

Theorem 1. Consider L applications of equation 7, i.e., f^(L) = (P̂_ω)^L f^(0), with a shared parameter ω^(l) = ω that is used in all the layers. Also assume that there is some optimal Dirichlet energy of the final feature map that satisfies 0 < E_opt(f^(L)) < E(f^(0)). Then, at the limit, as more layers are added, ω converges to ω̄/L up to first-order accuracy, where L is the number of layers and ω̄ is a value that is independent of L and leads to E_opt.

Corollary 1.1. Allowing a variable ω^(l) > 0 at each layer in Theorem 1 yields Σ_{l=0}^{L−1} ω^(l) = ω̄ up to first-order accuracy.

Next, we dwell on the second mechanism by which ωGCN prevents over-smoothing. We analyse the eigenvectors of P̂_ω, showing that different choices of ω yield different leading eigenvectors that alter the behaviour of the propagation operator (i.e., smoothing and sharpening processes).
This result is useful because changing the leading eigenvector prevents the gravitation towards a specific eigenvector, which is what causes over-smoothing to occur (Wu et al., 2019; Oono & Suzuki, 2020).

Theorem 2. Assume that the graph is connected. Then, there exists some ω_0 ≥ 1 such that for all 0 < ω < ω_0, the operator P̂_ω in equation 7 is smoothing and the leading eigenvector is D̃^(1/2) 1. For ω > ω_0 or ω < 0, the leading eigenvector changes.

The proof of the theorem is given in Appendix A.

ωGCN with multiple propagation operators. To further increase the expressiveness of our ωGCN, we extend ω^(l) ∈ R to ω⃗^(l) ∈ R^c and learn a propagation operator per channel, at each layer. To this end, we modify equation 5 to the following formulation:

f^(l+1) = σ((I − Ω_ω⃗^(l)(I − P̂)) f^(l) K^(l)).    (8)

As we show in Sec. 4.5, learning a propagation operator per channel is beneficial for improving accuracy.
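Theorem 2 is easy to check numerically. In the sketch below (ours), the leading eigenvector of P̂_ω = I − ω L̂ aligns with D̃^(1/2) 1 for a small ω, and switches to a different eigenvector once ω exceeds ω_0:

```python
import numpy as np

def leading_eigvec(M):
    """Eigenvector of the symmetric matrix M with the largest-magnitude eigenvalue."""
    w, V = np.linalg.eigh(M)
    return V[:, np.argmax(np.abs(w))]

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)   # connected path graph
A_t = A + np.eye(3)
d_t = A_t.sum(axis=1)
inv_sqrt = np.diag(1.0 / np.sqrt(d_t))
L_hat = np.eye(3) - inv_sqrt @ A_t @ inv_sqrt                  # normalized Laplacian

smooth_vec = np.sqrt(d_t) / np.linalg.norm(np.sqrt(d_t))       # D~^{1/2} 1, normalized

def P_omega(omega):
    return np.eye(3) - omega * L_hat

cos_small = abs(leading_eigvec(P_omega(0.5)) @ smooth_vec)     # 0 < omega < omega_0
cos_large = abs(leading_eigvec(P_omega(5.0)) @ smooth_vec)     # omega > omega_0
```

Here `cos_small` is 1 (the smoothing eigenvector leads), while `cos_large` is 0, since for large ω the leading eigenvector becomes the top eigenvector of L̂, which is orthogonal to D̃^(1/2) 1.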

2.3. ωGAT

The seminal GAT (Veličković et al., 2018) learns a non-negative edge weight as follows:

α^(l)_ij = exp(LeakyReLU(a^(l)⊤ [W^(l) f^(l)_i ‖ W^(l) f^(l)_j])) / Σ_{p∈N_i} exp(LeakyReLU(a^(l)⊤ [W^(l) f^(l)_i ‖ W^(l) f^(l)_p])),    (9)

where a^(l) ∈ R^(2c) and W^(l) ∈ R^(c×c) are trainable parameters and ‖ denotes channel-wise concatenation. Here, GAT is obtained by defining the propagation operator S^(l) in equation 1 as Ŝ^(l)_ij = α^(l)_ij. To avoid repeating equations, we skip the per-layer ω formulation (as in equation 2) and directly define the per-channel ωGAT as follows:

f^(l+1) = σ((I − Ω_ω⃗^(l)(I − Ŝ^(l))) f^(l) K^(l)).    (10)

The introduction of Ω_ω⃗^(l) yields a learnable propagation operator per layer and channel. We note that it is also possible to obtain multiple propagation operators from GAT by using multi-head attention. However, we distinguish our proposition from GAT in a two-fold fashion. First, our propagation operators belong to a broader family that includes smoothing and sharpening operators, as opposed to smoothing-only operators due to the SoftMax normalization in GAT. Second, our method requires less computational overhead when adding more propagation operators, as our ωGAT requires a scalar per operator, while GAT doubles the number of channels to obtain more attention heads. Also, utilizing a multi-head GAT can still lead to over-smoothing, as all the heads induce a non-negative operator.

To study the behaviour of our ωGAT, we inspect its node features energy compared to GAT. To this end, we define the GAT energy as

E_GAT(f^(l)) = Σ_{i∈V} Σ_{j∈N_i} ½ ‖ f^(l)_i − f^(l)_j ‖²₂.    (11)

This modification of the Dirichlet energy from equation 6 is required because in GAT (Veličković et al., 2018) the leading eigenvector of the propagation operator Ŝ^(l) is the constant vector 1, as shown by Chen et al. (2020a), unlike the vector D̃^(1/2) 1 in the symmetrically normalized P̂ from GCN (Kipf & Welling, 2016), for which the Dirichlet energy is natural to consider (Pei et al., 2020).
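A short sketch (ours, with a fixed toy attention matrix rather than trained GAT weights) of the GAT energy in equation 11, illustrating that a softmax-normalized, row-stochastic operator pulls features toward a constant vector and so drives E_GAT to zero:

```python
import numpy as np

def gat_energy(f, A):
    """Equation 11: E_GAT(f) = sum over i and j in N_i of 0.5 * ||f_i - f_j||^2."""
    i_idx, j_idx = np.nonzero(A)
    return 0.5 * np.sum((f[i_idx] - f[j_idx]) ** 2)

# A softmax-normalized (row-stochastic, strictly positive) operator, standing in
# for a GAT attention matrix on a triangle graph.
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
logits = np.array([[0.1, 0.4, -0.2], [0.0, 0.3, 0.5], [-0.1, 0.2, 0.0]])
S = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # softmax per row
f = np.array([[1.0, -1.0], [0.5, 2.0], [-2.0, 0.0]])
e0 = gat_energy(f, A)
e_deep = gat_energy(np.linalg.matrix_power(S, 50) @ f, A)        # after 50 hops
```

After 50 applications the features are numerically constant and `e_deep` vanishes, a toy version of the decaying-energy behaviour of deep GAT reported below.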
We present the energy of a 64-layer GAT trained on the Cora dataset in Fig. 3b. It is evident that the accuracy degradation of a deep GAT reported by Zhao & Akoglu (2020) is congruent with the decaying energy in equation 11, while our ωGAT experiences neither decaying energy nor accuracy degradation as more layers are added, as can be seen in Tab. 2. To further validate our findings, we repeat this experiment on additional datasets in Appendix D and reach the same conclusion.

2.4. COMPUTATIONAL COSTS

Our ωGNN approach is general and can be applied to any GNN that conforms to the structure of equation 1 and can be modified into equation 3. The additional parameters compared to the baseline GNN are the ω⃗^(l) ∈ R^c weights added at each layer, yielding a relatively low computational overhead. For example, in GCN (Kipf & Welling, 2016) there are c × c trainable parameters per layer, requiring c × c × n multiplications due to the 1 × 1 convolution K^(l). In our ωGCN, we have c × c + c parameters and (c + 1) × c × n multiplications. That is in addition to applying the propagation operators S^(l), which are identical for both methods. A similar analysis holds for GAT. To validate the actual complexity of our method, we present the training and inference times for ωGCN and ωGAT in Appendix G. We see a negligible addition to the runtimes compared to the baselines, in return for better performance.
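The parameter-count argument above can be spelled out in a few lines (a sketch with our own function name, not code from the paper):

```python
def layer_params(c, with_omega):
    """Per-layer trainable parameters: c*c for the 1x1 conv K, plus c per-channel omegas."""
    return c * c + (c if with_omega else 0)

c = 64
gcn_params = layer_params(c, with_omega=False)         # c*c = 4096
omega_gcn_params = layer_params(c, with_omega=True)    # c*c + c = 4160
relative_overhead = (omega_gcn_params - gcn_params) / gcn_params   # = 1/c
```

For c = 64 channels the relative overhead is 1/c ≈ 1.6% of the layer's parameters, consistent with the "computationally light" claim.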

3. OTHER RELATED WORK

Over-smoothing in GNNs. The over-smoothing phenomenon was identified by Li et al. (2018) and has been studied profoundly in recent years, with various methods stemming from different approaches. For example, DropEdge (Rong et al., 2020), PairNorm (Zhao & Akoglu, 2020) and EGNN (Zhou et al., 2021) propose augmentation, normalization and energy-based penalty methods to alleviate over-smoothing, respectively. Min et al. (2020) propose to augment GCN with geometric scattering transforms and residual convolutions, and GCNII (Chen et al., 2020b) presents a spectral analysis of the smoothing property of GCN (Kipf & Welling, 2016) and proposes adding an initial identity residual connection and a decay of the weights of deeper layers, which are also used in EGNN (Zhou et al., 2021).

Graph Neural Diffusion. The view of GNNs as a diffusion process has gained popularity in recent years. Methods like APPNP (Klicpera et al., 2019) propose to use a personalized PageRank (Page et al., 1999) algorithm to determine the diffusion of features, and GDC (Gasteiger et al., 2019) imposes constraints on the ChebNet (Defferrard et al., 2016) architecture to obtain diffusion kernels, showing accuracy improvements. Other works like GRAND (Chamberlain et al., 2021), CFD-GCN (Belbute-Peres et al., 2020), PDE-GCN (Eliasof et al., 2021) and GRAND++ (Thorpe et al., 2022) propose to view GNN layers as time steps in the integration of ODEs and PDEs that arise from a non-linear heat equation, allowing control of the diffusion (smoothing) in the network to prevent over-smoothing. In addition, some GNNs (Eliasof et al., 2021; Rusch et al., 2022) propose a mixture of diffusion and oscillatory processes to avoid over-smoothing by preserving the frequencies of the features.

Mixed-sign operators in GNNs. The importance of mixed-sign operators in GNNs was discussed in Eliasof et al. (2022), where k-hop filters and stochastic path sampling mechanisms are utilized. However, such a method requires significantly more computational resources than a standard GNN like Kipf & Welling (2016); Veličković et al. (2018), due to the path sampling strategy and the larger, 5-hop filters required for optimal accuracy. In contrast, our ωGNNs perform 1-hop convolutions and, as we show in Appendix G, obtain state-of-the-art results without significant added computational cost.

4. EXPERIMENTS

We demonstrate our ωGCN and ωGAT on node classification, inductive learning and graph classification tasks. Additionally, we conduct an ablation study of the different configurations of our method and experimentally verify the theorems from Sec. 2. A description of the network architectures is given in Appendix E. We use the Adam (Kingma & Ba, 2014) optimizer in all experiments, and perform a grid search to determine the hyper-parameters, reported in Appendix F. The objective function in all experiments is the cross-entropy loss, except for inductive learning on PPI (Hamilton et al., 2017), where we use the binary cross-entropy loss. Our code is implemented with PyTorch (Paszke et al., 2019) and PyTorch-Geometric (Fey & Lenssen, 2019) and trained on an Nvidia Titan RTX GPU. We show that for all the considered tasks and datasets, whose statistics are provided in Appendix C, our ωGCN and ωGAT are either better than or on par with other state-of-the-art models.

4.1. SEMI-SUPERVISED NODE CLASSIFICATION

We employ the Cora, Citeseer and Pubmed (Sen et al., 2008) datasets using the standard training/validation/testing split by Yang et al. (2016), with 20 nodes per class for training, 500 validation nodes and 1,000 testing nodes. We follow the training and evaluation scheme of Chen et al. (2020b) and compare with models like GCN, GAT, SuperGAT (Kim & Oh, 2021), Inception (Szegedy et al., 2017), APPNP (Klicpera et al., 2019), JKNet (Xu et al., 2018), DropEdge (Rong et al., 2020), GCNII (Chen et al., 2020b), GRAND (Chamberlain et al., 2021), PDE-GCN (Eliasof et al., 2021) and EGNN (Zhou et al., 2021). We summarize the results in Tab. 1, where we see better or on-par performance with other state-of-the-art methods. Additionally, we report the accuracy per number of layers, from 2 to 64, in Tab. 2, where it is evident that our ωGCN and ωGAT do not over-smooth. To ensure the robustness of our method, we also experiment with 100 random splits in Appendix I, where our ωGCN and ωGAT continue to perform better than or on par with state-of-the-art methods.

4.2. FULLY-SUPERVISED NODE CLASSIFICATION

We further validate the efficacy of our method on fully-supervised node classification, on both homophilic and heterophilic datasets as defined in Pei et al. (2020). Specifically, we examine our ωGCN and ωGAT on Cora, Citeseer, Pubmed, Chameleon (Rozemberczki et al., 2021), Cornell, Texas and Wisconsin, using the identical train/validation/test splits of 48%/32%/20%, respectively, and report the average performance over the 10 random splits from Pei et al. (2020). We compare our performance with GCN, GAT, Geom-GCN, APPNP, JKNet, Inception, GCNII, PDE-GCN and others, as presented in Tab. 3. Additionally, we evaluate our ωGCN and ωGAT on the Actor (Rozemberczki et al., 2021) and Ogbn-arxiv (Hu et al., 2020) datasets, as reported in Tab. 4. We see an accuracy improvement across all benchmarks compared to the considered methods. In Appendix J we present and discuss the learnt ω⃗ for homophilic and heterophilic datasets.

4.3. INDUCTIVE LEARNING

We employ the PPI dataset (Hamilton et al., 2017) for the inductive learning task. We use 8-layer ωGCN and ωGAT, with a learning rate of 0.001, dropout of 0.2 and no weight decay. As a comparison, we consider several methods and report the micro-averaged F1 score in Tab. 5. Our ωGCN achieves a score of 99.60, which is significantly superior to its baseline GCN, and also performs better than methods like GAT, JKNet, GeniePath, Cluster-GCN and PDE-GCN.

4.4. GRAPH CLASSIFICATION

Previous experiments considered the node-classification task. To further demonstrate the efficacy of our ωGNNs we experiment with graph classification on TUDatasets (Morris et al., 2020) . Here, we follow the same experimental settings from Xu et al. (2019) , and report the 10 fold cross-validation performance on MUTAG, PTC, PROTEINS, NCI1 and NCI109 datasets. The hyper-parameters are determined by a grid search, as in Xu et al. (2019) and are reported in Appendix E. We compare our ωGCN and ωGAT with recent and popular methods like GIN (Xu et al., 2019) , DGCNN (Zhang et al., 2018b) , IGN (Maron et al., 2018) , GSN (Bouritsas et al., 2022) , SIN (Bodnar et al., 2021b) , CIN (Bodnar et al., 2021a) and others. We also compare with methods that stem from 'classical' graph algorithms like RWK (Gärtner et al., 2003) and WL Kernel (Shervashidze et al., 2011) . All the results are summarized in Tab. 6, with an evident improvement or similar results to current deep learning as well as classical methods, highlighting the efficacy of our approach.

4.5. ABLATION STUDY

In this section we study the different components and configurations of our ωGNN. We start by allowing a single global ω to be learnt and shared across all layers; this architecture is dubbed ωGCN_G. We validate that this simple variant does not over-smooth, as depicted in Tab. 7. The table also shows ωGCN_PL, which includes a single parameter ω^(l) per layer, and ωGCN, shown in the results earlier, which has Ω^(l), i.e., a parameter per layer and channel, yielding further accuracy improvements. In addition, we empirically verify our theoretical results from Sec. 2 in Fig. 4, where we show that ω̄ = Lω and ω̄ = Σ_{l=0}^{L−1} ω^(l) remain similar for varying numbers of layers L, as Theorem 1 and Corollary 1.1 suggest. For completeness, we also perform the ablation study on ωGAT in Appendix H.

5. SUMMARY

In this work we proposed an effective and computationally efficient modification that applies to a large family of GNNs that carry the form of separable propagation and 1 × 1 convolutions, and in particular we demonstrated its efficacy on the popular GCN and GAT architectures. We provided theorems that reason about the smoothing nature of GCN, and through the lens of operator analysis suggested learning weighting factors ω⃗ to learn and mix smoothing and sharpening propagation operators. Through an extensive set of experiments on numerous datasets, ranging from node classification to graph classification, as well as an ablation study that validates our theoretical findings, we demonstrated the contribution of our ωGNN, performing on par with or achieving new state-of-the-art performance.

A PROOFS OF THEOREMS

Here we repeat the theorems, observations and corollaries from the main paper for convenience, and provide their proofs or derivations.

P̂ is a scaled diffusion operator. Assume that A is the adjacency matrix and D is the degree matrix. Denote the adjacency matrix with added self-loops by Ã = A + I. Then, the convolution operator from GCN (Kipf & Welling, 2016) is

P̂ = D̃^(−1/2) Ã D̃^(−1/2).    (12)

We first note that the Laplacian including self-loops is the same as the regular Laplacian:

L̃ = D̃ − Ã = D + I − A − I = D − A = L.    (13)

Therefore, it holds that

P̂ = I − I + D̃^(−1/2) Ã D̃^(−1/2) = I − D̃^(−1/2) D̃ D̃^(−1/2) + D̃^(−1/2) Ã D̃^(−1/2) = I − D̃^(−1/2) (D̃ − Ã) D̃^(−1/2)    (14)
  = I − D̃^(−1/2) (D − A) D̃^(−1/2) = I − D̃^(−1/2) L D̃^(−1/2).

Theorem 1. Consider the L-fold application of equation 7 from the main paper, i.e., f^(L) = (P̂_ω)^L f^(0), with a shared parameter ω^(l) = ω that is used in all the layers. Also assume that there is some optimal Dirichlet energy of the final feature map that satisfies 0 < E_opt(f^(L)) < E(f^(0)). Then, at the limit, as more layers are added, ω converges to ω̄/L up to first-order accuracy, where L is the number of layers and ω̄ is a value that is independent of L and leads to E_opt.

Proof. First, note that equation 6 from the main paper can be written as

E(f^(l)) = Σ_{i∈V} Σ_{j∈N_i} ½ ‖ f^(l)_i / √(1+d_i) − f^(l)_j / √(1+d_j) ‖²₂ = ½ ‖G D̃^(−1/2) f^(l)‖²₂,    (15)

where G is the graph gradient operator, also known as the incidence matrix, that for each edge subtracts the features of the two connected nodes, i.e., (G f^(l))_(i,j) = f^(l)_i − f^(l)_j for (i, j) ∈ E. Let us assume that the initial feature map f^(0) has some Dirichlet energy E_0 > E_opt as defined in equation 15. Since ∇E = D̃^(−1/2) G⊤ G D̃^(−1/2) f^(l), we see that the forward propagation through a GCN approximates the gradient flow of the Dirichlet energy. That is, for given L and ω we have that

f^(l+1) = f^(l) − ω∇E = f^(l) − ω D̃^(−1/2) G⊤ G D̃^(−1/2) f^(l) = (I − ω D̃^(−1/2) L D̃^(−1/2)) f^(l),    (16)

where we used that G⊤ G = L.
Equation 16 can be seen both as a gradient descent step to reduce E, and also as a forward Euler approximation with step size ω of the solution of

∂f(t)/∂t = −D̃^(−1/2) L̃ D̃^(−1/2) f(t),  f(0) = f^(0).    (17)

It is known that the solution to equation 17 is given by

f(t) = exp(−t D̃^(−1/2) L̃ D̃^(−1/2)) f(0).    (18)

Since the Dirichlet energy of f(t) is continuous in t and decays monotonically from E_0 to zero, there exists a T such that E(f(T)) = E_opt. Now, considering discrete time intervals 0 = t_0, ..., t_L = T, then, similarly to equation 18, for any two subsequent time steps t_(l+1) and t_l we have that

f(t_(l+1)) = exp(−(t_(l+1) − t_l) D̃^(−1/2) L̃ D̃^(−1/2)) f(t_l).    (19)

Taking fixed-interval time steps such that t_(l+1) − t_l = ω = T/L for l = 0, ..., L − 1, we get

f(t_(l+1)) = exp(−ω D̃^(−1/2) L̃ D̃^(−1/2)) f(t_l) = (I − ω D̃^(−1/2) L̃ D̃^(−1/2)) f(t_l) + O(ω²),    (20)

where the rightmost approximation holds due to the Taylor expansion up to first order. Denoting f^(l) = f(t_l) and ω̄ = T, we complete the proof.

Corollary 1.1. Allowing a variable ω^(l) > 0 at each layer in Theorem 1 yields Σ_{l=0}^{L−1} ω^(l) = ω̄ up to first-order accuracy.

Proof. The proof follows immediately by setting a variable step t_(l+1) − t_l = ω^(l) in equation 20.

Remark 1 (The non-negativity of P̂_ω). By definition, for 0 < ω ≤ 1 all the spatial weights of P̂_ω defined in equation 7 are non-negative, and the operator is smoothing as it is a low-pass filter. For ω > 1 or ω < 0, by definition we have an operator with mixed signs.

Theorem 2. Assume that the graph is connected. Then, there exists some ω_0 ≥ 1 such that for all 0 < ω < ω_0, the operator P̂_ω in equation 7 from the main paper is smoothing and the leading eigenvector is D̃^(1/2) 1. For ω > ω_0 or ω < 0, the leading eigenvector changes.

Proof. Assuming that the graph is connected, it is known that the graph Laplacian matrix has the eigenvector 1 with eigenvalue 0, i.e., L̃ 1 = 0.
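The first-order (forward Euler) approximation in equation 20 can be checked numerically. The sketch below (ours) compares (I − ω L̂)^L f(0) against the exact heat-kernel solution exp(−T L̂) f(0) with ω = T/L; the error should shrink roughly like 1/L:

```python
import numpy as np

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)   # path graph
A_t = A + np.eye(3)
inv_sqrt = np.diag(1.0 / np.sqrt(A_t.sum(axis=1)))
L_hat = np.eye(3) - inv_sqrt @ A_t @ inv_sqrt
w, V = np.linalg.eigh(L_hat)                                   # L_hat is symmetric

def heat_exact(f0, T):
    """f(T) = exp(-T * L_hat) f(0), via the eigendecomposition of L_hat."""
    return V @ (np.exp(-T * w) * (V.T @ f0))

def forward_euler(f0, T, L):
    """L steps of f <- (I - omega * L_hat) f with omega = T / L."""
    P = np.eye(3) - (T / L) * L_hat
    return np.linalg.matrix_power(P, L) @ f0

f0 = np.array([1.0, -2.0, 0.5])
errors = [np.linalg.norm(forward_euler(f0, 1.0, L) - heat_exact(f0, 1.0))
          for L in (4, 8, 16, 32)]
```

Doubling the number of layers L (i.e., halving ω) roughly halves the error, the expected first-order behaviour.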
Hence, we get that $\tilde{D}^{-1/2} L \tilde{D}^{-1/2} \, \tilde{D}^{1/2} \mathbf{1} = \tilde{D}^{-1/2} L \mathbf{1} = 0$, so $\tilde{D}^{1/2} \mathbf{1}$ is an eigenvector of the normalized Laplacian with eigenvalue $0$.

Furthermore, denote the normalized Laplacian by $\hat{L} = \tilde{D}^{-1/2} L \tilde{D}^{-1/2}$, and consider the range $0 < \omega < 2/\rho(\hat{L}) = \omega_0$, where $\rho(\hat{L})$ denotes the spectral radius of $\hat{L}$. It is easy to verify that for this range of values of $\omega$, the largest eigenvalue in magnitude of $\hat{P}_\omega$ is $1$, and it corresponds to the null eigenvector of $\hat{L}$, i.e., $\tilde{D}^{1/2} \mathbf{1}$. Hence, for this range, $\hat{P}_\omega$ is smoothing. For $\omega > \omega_0$ or $\omega < 0$, the leading eigenvector of $\hat{P}_\omega$ becomes the leading eigenvector of $\hat{L}$. Furthermore, it can be shown that $\rho(\hat{L}) \le 2$ (see Williamson (2016) for the proof), hence $\omega_0 \ge 1$.
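Both the first-order Euler relation behind Theorem 1 (equations 17-20) and the spectral claims of Theorem 2 can be checked numerically on a small graph. The sketch below (our own illustrative code; the graph, the time horizon $T$, and all names are assumptions) verifies that depth-wise application of $I - \omega \hat{L}$ approaches the heat kernel, that $\rho(\hat{L}) \le 2$, and that the leading eigenvector of $\hat{P}_\omega$ switches at $\omega_0$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Small connected graph: a path plus one extra edge (our illustrative choice).
n = 7
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0   # path edges guarantee connectivity
A[0, 3] = A[3, 0] = 1.0

d = A.sum(axis=1)
L = np.diag(d) - A                     # graph Laplacian (self-loops cancel out)
Dm = np.diag((d + 1.0) ** -0.5)        # D~^{-1/2}, with self-loop degrees d_i + 1
L_hat = Dm @ L @ Dm                    # normalized Laplacian

w, V = np.linalg.eigh(L_hat)           # eigenpairs of L_hat, ascending order

# --- Theorem 1 / eqs. 17-20: forward Euler approximates the heat kernel.
T = 1.5                                # total diffusion time (arbitrary choice)
f0 = rng.standard_normal(n)
f_exact = V @ (np.exp(-T * w) * (V.T @ f0))   # f(T) = exp(-T L_hat) f0

def euler(L_layers):
    """L_layers steps of f <- (I - omega L_hat) f with omega = T / L_layers."""
    omega, f = T / L_layers, f0.copy()
    for _ in range(L_layers):
        f = f - omega * (L_hat @ f)
    return f

# The first-order error shrinks as the step omega = T / L_layers shrinks.
assert np.linalg.norm(euler(100) - f_exact) < np.linalg.norm(euler(10) - f_exact)

# --- Theorem 2: smoothing regime of P_omega = I - omega L_hat.
rho = w[-1]                            # spectral radius of L_hat
assert rho <= 2.0                      # known bound, hence omega_0 = 2 / rho >= 1
omega0 = 2.0 / rho

def leading_eigvec(omega):
    """Eigenvector of P_omega with the largest-magnitude eigenvalue."""
    return V[:, np.argmax(np.abs(1.0 - omega * w))]

v_null = np.sqrt(d + 1.0)              # D~^{1/2} 1, the null vector of L_hat
v_null /= np.linalg.norm(v_null)

# Inside (0, omega_0) the leading eigenvector is D~^{1/2} 1 (smoothing) ...
assert np.isclose(abs(leading_eigvec(0.9 * omega0) @ v_null), 1.0)
# ... beyond omega_0 it switches to the leading eigenvector of L_hat.
assert abs(leading_eigvec(1.5 * omega0) @ v_null) < 0.5
```

Because $\hat{P}_\omega$ shares eigenvectors with $\hat{L}$ (eigenvalues $1 - \omega \lambda_i$), the switch of the dominant eigenvector at $\omega_0$ falls out of `argmax` over $|1 - \omega \lambda_i|$ directly.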

B SYNTHETIC EXPRESSIVENESS TASK

To demonstrate the importance and benefit of learning sharpening propagation operators in addition to smoothing ones, we propose the following synthetic node gradient regression task. Given a graph $G = (V, E)$ with input node features $f_{in} \in \mathbb{R}^{n \times c_{in}}$, we wish a GNN to regress the node feature gradient $\nabla f_{in}$, where the gradient at the $i$-th node is defined by an upwind gradient operator:
$$(\nabla f_{in})_i = \max_{j \in N_i} (f_i - f_j),$$
and the goal of the considered GNN is to minimize the following objective:
$$\| \mathrm{GNN}(f_{in}, G) - \nabla f_{in} \|_2^2. \quad (21)$$
As a comparison, we consider two GNNs: GCN (Kipf & Welling, 2016) and our ωGCN, both with 64 channels and 2 layers. In both cases we use a learning rate of 1e-4 without weight decay and train the network for 5000 iterations (no further benefit was obtained with any of the considered methods). The input graph is a random Erdős-Rényi graph with 8 nodes and an edge rate of 30%, with input node features sampled from a uniform distribution in the range of 0 to 1. The obtained loss of GCN is of order 1e-1, while our ωGCN obtains a loss of order 1e-12, as can also be seen in Fig. 2. We therefore conclude that introducing the ability to learn mixed-sign operators via ω enhances the expressiveness of GNNs.
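As a concrete reference for the regression target in equation 21, the following sketch (our own code, not the paper's; the seed, graph, and the choice to leave isolated nodes at zero are assumptions) computes the upwind node gradient on a random Erdős-Rényi graph of the kind described above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Erdos-Renyi graph as in the synthetic task: 8 nodes, 30% edge rate.
n, p = 8, 0.3
A = (rng.random((n, n)) < p).astype(float)
A = np.triu(A, k=1)
A = A + A.T
A[0, 1] = A[1, 0] = 1.0                # ensure at least one edge exists

f = rng.random(n)                      # node features sampled uniformly in [0, 1)

# Upwind gradient target: (grad f)_i = max_{j in N_i} (f_i - f_j).
grad = np.zeros(n)
for i in range(n):
    nbrs = np.nonzero(A[i])[0]
    if nbrs.size:                      # isolated nodes are left at 0 (our choice)
        grad[i] = np.max(f[i] - f[nbrs])

# Equivalent closed form: f_i minus the smallest neighboring feature.
for i in range(n):
    nbrs = np.nonzero(A[i])[0]
    if nbrs.size:
        assert np.isclose(grad[i], f[i] - f[nbrs].min())
```

The target subtracts a neighbor's feature from the node's own, so its exact representation requires filter weights of both signs, which is precisely what a strictly non-negative propagation operator cannot provide.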

C DATASETS

In this section we provide the statistics of the datasets used throughout our experiments. Tab. 8 presents information regarding the node-classification datasets, and Tab. 9 summarizes the graph-classification datasets. For each dataset, we also provide the homophily score as defined by Pei et al. (2020).

In addition to the observation presented in Sec. 2.1, and specifically in Fig. 3b, where we see that recurrent applications of GAT reduce the node feature energy from equation 11, which causes over-smoothing as shown by Wu et al. (2019); Wang et al. (2019) (as discussed in the main paper), here we also show that the same behaviour is evident on the Citeseer and Pubmed datasets in Fig. 5.

E ARCHITECTURES IN DETAIL

We now elaborate on the specific architectures used in our experiments in Sec. 4. As noted in the main paper, all our network architectures consist of an opening (embedding) layer (a 1 × 1 convolution), a sequence of ωGNN (i.e., ωGCN or ωGAT) layers, and a closing (classifier) layer (a 1 × 1 convolution). In total, we have two types of architectures: one that is based on GCN, for the node classification tasks, reported in Tab. 10, and the other for the graph classification task, based on Xu et al. (2019), reported in Tab. 11.

H ABLATION STUDY USING ωGAT

To complement our ablation study on ωGCN in Sec. 4.5 in the main paper, we perform a similar study on ωGAT. Here, we show in Tab. 16 that indeed the single-ω variant, dubbed ωGAT_G, does not over-smooth, and that allowing the greater flexibility of a per-layer ω and a per-layer-and-channel ω, in our ωGAT_PL and ωGAT variants respectively, obtains better performance.



Figure 1: The impulse response of ωGCN's propagation operator for different ω values. For ω = 0.5, 1.0, non-negative values are obtained, while for ω = 1.5 we see mixed-sign values. The dashed node starts with a feature value of 1 and the rest with 0.

Figure 3: Node feature energy at the l-th layer relative to the initial node embedding energy on Cora. Both ωGCN and ωGAT control the respective energies from equation 6 and equation 11 to avoid over-smoothing, while the baselines with ω = 1 reduce the energies to 0 and over-smooth.

Figure 4: The summation of the weighting factors ω vs. the number of layers for ωGCN_G and ωGCN_PL.

Figure 5: Node feature energy at the l-th layer relative to the initial node embedding energy on Citeseer (5a) and Pubmed (5b). ωGAT controls the energy from equation 11 to avoid over-smoothing, while the baseline GAT with ω = 1 reduces the energy to 0 and over-smooths.

Figure 6: The learnt $\vec{\omega} \in \mathbb{R}^{64 \times 64}$ of ωGCN with 64 layers (x-axis) and 64 channels (y-axis) for the Cora (homophilic) and Texas (heterophilic) datasets. Smoothing operators appear in blue, while sharpening operators appear in red. White entries correspond to ω = 1.

Summary of semi-supervised node classification accuracy (%). Compared methods: GCN, GAT, APPNP, GCNII, GRAND, SuperGAT, EGNN, ωGCN (Ours), ωGAT (Ours).

Semi-supervised node classification accuracy (%). '-' indicates results that are not available.

Fully-supervised node classification accuracy (%). (L) denotes the number of layers.

Fully-supervised node classification accuracy (%).

Inductive learning on PPI dataset. Results are reported in micro-averaged F1 score.

Graph classification accuracy (%) on TUDatasets (Morris et al., 2020).

Accuracy (%) of variants of ωGCN on semi-supervised classification.

Node classification dataset statistics. Hom. score denotes the homophily score.

TUDatasets graph classification statistics.

Graph classification hyper-parameters. BS denotes batch size. Columns: Architecture, Dataset, LR_GNN, LR_oc, LR_ω, WD_GNN, WD_oc, WD_ω.

Training and inference GPU runtimes [ms] on Cora.

Ablation study on ωGAT.


Throughout the following, we denote by $c_{in}$ and $c_{out}$ the input and output channels, respectively, and $c$ denotes the number of features in the hidden layers (reported in Appendix F). We initialize the embedding and classifier layers with the Glorot initialization (Glorot & Bengio, 2010), and $K^{(l)}$ from equation 2 is initialized with an identity matrix of shape $c \times c$. The initialization of $\Omega^{(l)}$ is a vector of ones. We note that our initialization yields a standard smoothing process, which is then adapted to the data as the learning process progresses, and, if needed, changes the process to a non-smoothing one by means of mixed signs, as discussed earlier and specifically in Theorem 2. We denote the number of ωGNN layers by $L$, and the dropout probability by $p$. The main differences between the two architectures are as follows. First, for graph classification we use the standard add-pool operation as in GIN (Xu et al., 2019) to obtain a global graph feature. Second, we follow GIN and, in addition to the graph layer (which is ωGNN in our work), we add batch normalization (denoted by BN), a 1 × 1 convolution and a ReLU activation after each graph layer.
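To make the layer structure concrete, here is a minimal NumPy sketch of a single ωGCN layer under the initialization just described. This is hypothetical illustrative code, not the authors' implementation: the name `omega_gcn_layer`, the dense-matrix formulation, and the toy graph are all our assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

n, c = 10, 4                              # nodes, hidden channels

# Toy graph and its normalized Laplacian L_hat (as in Appendix A).
A = (rng.random((n, n)) < 0.3).astype(float)
A = np.triu(A, k=1)
A = A + A.T
Dm = np.diag((A.sum(axis=1) + 1.0) ** -0.5)
L_hat = Dm @ (np.diag(A.sum(axis=1)) - A) @ Dm

def omega_gcn_layer(f, omega, K):
    """One omega-GCN layer: per-channel propagation, then 1x1 conv and ReLU.

    Each channel j is propagated with its own weighting factor omega[j],
    mixing smoothing (omega <= 1) and sharpening (omega > 1 or omega < 0).
    """
    prop = f - (L_hat @ f) * omega        # per-channel (I - omega_j L_hat) f_j
    return np.maximum(prop @ K, 0.0)      # 1x1 convolution K, then ReLU

f = rng.standard_normal((n, c))
omega = np.ones(c)                        # initialization: standard smoothing
K = np.eye(c)                             # K initialized to the identity

out = omega_gcn_layer(f, omega, K)

# With omega = 1 and K = I, the layer reduces to ReLU(P_hat f).
P_hat = np.eye(n) - L_hat
assert np.allclose(out, np.maximum(P_hat @ f, 0.0))
```

The per-channel broadcast of `omega` is the whole mechanism: at initialization every channel performs the standard smoothing $\hat{P} f$, and training is free to move individual entries of ω above 1 or below 0, turning those channels into sharpening filters.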


F HYPER-PARAMETERS

We provide the selected hyper-parameters used in our experiments. We denote the learning rate of the ωGNN layers by LR_GNN, and the learning rate of the 1 × 1 opening and closing layers (as well as any additional classifier layers) by LR_oc. The weight decay of the ωGNN layers and of the opening and closing layers are denoted by WD_GNN and WD_oc, respectively. We denote the learning rate and weight decay of the ω parameters by LR_ω and WD_ω, respectively, and c denotes the number of hidden channels. In the case of ωGAT, the attention head vector a is learnt with the same learning rate and weight decay as LR_GNN and WD_GNN.

F.1 SEMI-SUPERVISED NODE CLASSIFICATION

The hyper-parameters for this experiment are summarized in Tab. 12.

F.2 FULLY-SUPERVISED NODE CLASSIFICATION

The hyper-parameters for this experiment are summarized in Tab. 13. The number of layers used in Tab. 3 is mentioned in brackets in the table. For Ogbn-arxiv and Actor from Tab. 4, 8-layer ωGCN and ωGAT were employed.

The hyper-parameters for the graph classification experiment on TUDatasets are reported in Tab. 14. We followed the same grid-search procedure as in GIN (Xu et al., 2019). In all experiments, a 5-layer (including the initial embedding layer) ωGCN and ωGAT are used, similarly to GIN.

F.5 ABLATION STUDY

In this experiment we used the same hyper-parameters as reported in Tab. 12.

G RUNTIMES

Following the computational cost discussion from Sec. 2.4 in the main paper, we also present in Tab. 15 the measured training and inference times of our baselines GCN and GAT with 2 layers, where we see that the addition of ω per layer and channel indeed requires a negligible addition of time, in return for a significantly more accurate GNN. We note that further accuracy gain can be achieved by adding more ωGNN layers, as reported in Tab. 2 in the main paper. However, since GCN and GAT over-smooth, the comparison here is done with 2 layers, where the highest accuracy is obtained for the baseline models.

I NODE CLASSIFICATION WITH RANDOM SPLITS

In our semi-supervised node classification experiments, the standard split of Yang et al. (2016) was considered, to allow a direct comparison with as many methods as possible. However, since this result reflects the accuracy of a single split, we also repeat this experiment with 100 random splits, as in Chamberlain et al. (2021), and compare with the applicable methods that also conducted such a statistical significance test. In Tab. 17, we report our obtained accuracy on Cora, Citeseer and Pubmed, alongside the following baselines (accuracy %, Cora / Citeseer / Pubmed):

GAT (Veličković et al., 2018): 81.8 / 71.4 / 78.7
MoNet (Monti et al., 2017): 81.3 / 71.2 / 78.6
GRAND-l (Chamberlain et al., 2021): 83.6 / 73.4 / 78.8
GRAND-nl (Chamberlain et al., 2021): 82.3 / 70.9 / 77.5
GRAND-nl-rw (Chamberlain et al., 2021): 83.3 / 74.1 / 78.1
GraphCON-GCN (Rusch et al., 2022): 81.9 / 72.9 / 78.8
GraphCON-GAT (Rusch et al., 2022): 83.2 / 73.2 / 79.5
GraphCON-Tran (Rusch et al., 2022): 84…

It is possible to see that in this experiment our ωGCN and ωGAT outperform or obtain similar results compared with the considered methods, which further highlights the performance advantage of our method.

J THE LEARNT $\vec{\omega}$

One of the main advantages of our method, presented in Sec. 2, is that it is capable of learning both smoothing and sharpening propagation operators, which cannot be obtained in most current GNNs. In Fig. 6 we present the actual learnt $\{\vec{\omega}^{(l)}\}_{l=1}^{L}$, as a matrix of size $L \times c$, for two datasets of different types, with high and low homophily scores (as defined in Pei et al. (2020)): the Cora dataset, with a high homophily score of 0.81, and the Texas dataset, with a low homophily score of 0.11 (i.e., a heterophilic dataset). We see that for a homophilic dataset like Cora, the network learnt to perform diffusion, albeit in a controlled manner, and not to simply employ the standard averaging operator $\hat{P}$. We can further see that for a heterophilic dataset, the ability to learn contrastive (i.e., sharpening) propagation operators in addition to diffusive kernels is beneficial; this is also reflected in our results in Tab. 3, where a larger improvement is achieved on datasets like Cornell, Texas and Wisconsin, which have low homophily scores (Rusch et al., 2022).
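For reference, the homophily score in the sense of Pei et al. (2020) is the average, over nodes, of the fraction of a node's neighbors that share its label. A minimal sketch (the function name and the toy graphs are ours):

```python
import numpy as np

def homophily(A, labels):
    """Node homophily: mean over nodes of the same-label neighbor fraction."""
    scores = []
    for i in range(len(labels)):
        nbrs = np.nonzero(A[i])[0]
        if nbrs.size:                    # nodes without neighbors are skipped
            scores.append(np.mean(labels[nbrs] == labels[i]))
    return float(np.mean(scores))

# Toy example: a 4-cycle with alternating labels is perfectly heterophilic ...
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
labels = np.array([0, 1, 0, 1])
assert homophily(A, labels) == 0.0

# ... while a constant labeling is perfectly homophilic.
labels_homo = np.array([0, 0, 0, 0])
assert homophily(A, labels_homo) == 1.0
```

On this scale, Cora's score of 0.81 means that on average 81% of a node's neighbors share its label, while for Texas (0.11) almost all neighbors disagree, which is exactly the regime where sharpening operators help.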

