ADAPTIVE UNIVERSAL GENERALIZED PAGERANK GRAPH NEURAL NETWORK

Abstract

In many important graph data processing applications the acquired information includes both node features and observations of the graph topology. Graph neural networks (GNNs) are designed to exploit both sources of evidence, but they do not optimally trade off their utility, nor do they integrate them in a manner that is universal. Here, universality refers to independence of homophily or heterophily assumptions about the graph. We address these issues by introducing a new Generalized PageRank (GPR) GNN architecture that adaptively learns the GPR weights so as to jointly optimize node feature and topological information extraction, regardless of the extent to which the node labels are homophilic or heterophilic. The learned GPR weights automatically adjust to the node label pattern, irrespective of the type of initialization, and thereby guarantee excellent learning performance for label patterns that are usually hard to handle. Furthermore, they allow one to avoid feature over-smoothing, a process which renders feature information nondiscriminative, without requiring the network to be shallow. Our accompanying theoretical analysis of the GPR-GNN method is facilitated by novel synthetic benchmark datasets generated by the so-called contextual stochastic block model. We also compare the performance of our GNN architecture with that of several state-of-the-art GNNs on the problem of node classification, using well-known benchmark homophilic and heterophilic datasets. The results demonstrate that GPR-GNN offers significant performance improvement compared to existing techniques on both synthetic and benchmark data. Our implementation is available online.

1. INTRODUCTION

Graph-centered machine learning has received significant interest in recent years due to the ubiquity of graph-structured data and its importance in solving numerous real-world problems such as semi-supervised node classification and graph classification (Zhu, 2005; Shervashidze et al., 2011; Lü & Zhou, 2011). Usually, the data at hand contains two sources of information: node features and graph topology. As an example, in social networks, nodes represent users that have different combinations of interests and properties captured by their corresponding feature vectors; edges, on the other hand, document observable friendship and collaboration relations that may or may not depend on the node features. Hence, learning methods that are able to simultaneously and adaptively exploit node features and the graph topology are highly desirable, as they make use of their latent connections and thereby improve learning on graphs.

Graph neural networks (GNNs) leverage their representational power to provide state-of-the-art performance when addressing the above-described application domains. Many GNNs use message passing (Gilmer et al., 2017; Battaglia et al., 2018) to manipulate node features and graph topology. They are constructed by stacking (graph) neural network layers which essentially propagate and transform node features over the given graph topology. Different types of layers have been proposed and used in practice, including graph convolutional layers (GCN) (Bruna et al., 2014; Kipf & Welling, 2017), graph attention layers (GAT) (Velickovic et al., 2018) and many others (Hamilton et al., 2017; Wijesinghe & Wang, 2019; Zeng et al., 2020; Abu-El-Haija et al., 2019).

However, most of the existing GNN architectures have two fundamental weaknesses which restrict their learning ability on general graph-structured data. First, most of them seem to be tailor-made to work on homophilic (associative) graphs.
The homophily principle (McPherson et al., 2001) in the context of node classification asserts that nodes from the same class tend to form edges. Homophily is also a common assumption in graph clustering (Von Luxburg, 2007; Tsourakakis, 2015; Dau & Milenkovic, 2017) and in the design of many GNNs (Klicpera et al., 2018). Methods developed for homophilic graphs are nonuniversal insofar as they fail to properly solve learning problems on heterophilic (disassortative) graphs (Pei et al., 2019; Bojchevski et al., 2019; 2020). In heterophilic graphs, nodes with distinct labels are more likely to link together. For example, many people tend to preferentially connect with people of the opposite sex in dating graphs, and different classes of amino acids are more likely to connect within many protein structures (Zhu et al., 2020). GNNs model the homophily principle by aggregating node features within graph neighborhoods. For this purpose, they use different mechanisms such as averaging in each network layer. Neighborhood aggregation is problematic and significantly more difficult for heterophilic graphs (Jia & Benson, 2020).

Second, most of the existing GNNs fail to be "deep enough". Although in principle an arbitrary number of layers may be stacked, practical models are usually shallow (2-4 layers), as these architectures are known to achieve better empirical performance than deep networks. A widely accepted explanation for the performance degradation of GNNs with increasing depth is feature over-smoothing, which may be intuitively explained as follows. The process of GNN feature propagation represents a form of random walk on "feature graphs," and under proper conditions, such random walks converge at an exponential rate to their stationary points. This essentially levels the expressive power of the features and renders them nondiscriminative. This intuitive reasoning was first described for linear settings in Li et al.
(2018) and has been recently studied in Oono & Suzuki (2020) for a setting involving nonlinear rectifiers.

We address these two weaknesses by combining GNNs with Generalized PageRank (GPR) techniques within a new model termed GPR-GNN. The GPR-GNN architecture is designed to first learn the hidden features and then to propagate them via GPR techniques. The focal component of the network is the GPR procedure that associates each step of feature propagation with a learnable weight. The weights depend on the contributions of different steps during the information propagation procedure, and they can be both positive and negative. This departure from common nonnegativity assumptions (Klicpera et al., 2018) allows the signs of the weights to adapt to the homophily/heterophily structure of the underlying graphs. The amplitudes of the weights trade off the degree of smoothing of node features and the aggregation power of topological features. These traits do not change with the choice of the initialization procedure and elucidate the process used to combine node features and the graph structure so as to achieve (near-)optimal predictions. In summary, the GPR-GNN method can simultaneously learn the node label patterns of disparate classes of graphs and prevent feature over-smoothing.

The excellent performance of GPR-GNN is demonstrated empirically, on real-world datasets, and further supported through a number of theoretical findings. In the latter setting, we show that the GPR procedure relates to general polynomial graph filtering, which can naturally deal with both the high- and low-frequency parts of the graph signals. In contrast, recent GNN models that utilize Personalized PageRanks (PPR) with fixed weights (Wu et al., 2019; Klicpera et al., 2018; 2019) inevitably act as low-pass filters. Thus, they fail to learn the labels of heterophilic graphs.
We also establish that GPR-GNN can provably mitigate the feature over-smoothing issue in an adaptive manner even after large-step propagation (i.e., after a large number of propagation steps). Hence, the method is able to make use of informative large-step propagation.

To test the performance of GPR-GNN on homophilic and heterophilic node label patterns and determine the trade-off between node and topological feature exploration, we first describe the recently proposed contextual stochastic block model (cSBM) (Deshpande et al., 2018). The cSBM allows for smoothly controlling the "informativeness ratio" between node features and graph topology, where the graph can vary from being highly homophilic to highly heterophilic. We show that GPR-GNN consistently outperforms all other baseline methods for the task of semi-supervised node classification on the cSBM, from strong homophily to strong heterophily. We then proceed to show that GPR-GNN offers state-of-the-art performance on real-world node-classification benchmark datasets which contain both homophilic and heterophilic graphs. Due to the space limit, we put all proofs, formal theorem statements, and the conclusion section in the Supplement.

Figure 1 (caption): The learnt GPR weights of the GPR-GNN on real-world datasets. Cora is homophilic while Texas is heterophilic (here, H stands for the level of homophily defined below). An interesting trend may be observed: for the heterophilic case the weights alternate from positive to negative with dampening amplitudes (more examples are provided in Section 5). The shaded region corresponds to a 95% confidence interval.

2. PRELIMINARIES

Let $G = (V, E)$ be an undirected graph with nodes $V$ and edges $E$. Let $n$ denote the number of nodes, each assumed to belong to one of $C \geq 2$ classes. The nodes are associated with the node feature matrix $X \in \mathbb{R}^{n \times f}$, where $f$ denotes the number of features per node. Throughout the paper, we use $X_{i:}$ to indicate the $i$-th row and $X_{:j}$ to indicate the $j$-th column of the matrix $X$, respectively. The symbol $\delta_{ij}$ is reserved for the Kronecker delta function. The graph $G$ is described by the adjacency matrix $A$, while $\tilde{A}$ stands for the adjacency matrix of the graph with added self-loops. We let $\tilde{D}$ be the diagonal degree matrix of $\tilde{A}$ and $\tilde{A}_{sym} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ denote the symmetric normalized adjacency matrix with self-loops.
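As a concrete illustration of these definitions, the following minimal sketch builds the symmetric normalized adjacency matrix with self-loops for a small graph in plain Python. The helper name `sym_normalized_adjacency`, the dense list-of-lists representation, and the three-node path graph are assumptions made purely for illustration; they are not taken from the paper's implementation.

```python
import math

def sym_normalized_adjacency(n, edges):
    """Return D^{-1/2} (A + I) D^{-1/2} as a dense list of lists.

    `edges` is a list of undirected pairs (u, v) with u != v (illustrative input format).
    """
    # Adjacency matrix with added self-loops.
    a_tilde = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        a_tilde[u][v] = 1.0
        a_tilde[v][u] = 1.0
    for i in range(n):
        a_tilde[i][i] = 1.0
    # Degrees are taken from the matrix that already includes self-loops.
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_tilde]
    return [[inv_sqrt_deg[i] * a_tilde[i][j] * inv_sqrt_deg[j] for j in range(n)]
            for i in range(n)]

# Path graph 0 - 1 - 2 (hypothetical toy example).
a_sym = sym_normalized_adjacency(3, [(0, 1), (1, 2)])
for row in a_sym:
    print([round(x, 4) for x in row])
```

The dense representation is chosen only for readability; for graphs of realistic size a sparse format would be the natural choice.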

3. GPR-GNNS: MOTIVATION AND CONTRIBUTIONS

Generalized PageRanks. Generalized PageRank (GPR) methods were first used in the context of unsupervised graph clustering, where they showed significant performance improvements over Personalized PageRank (Kloumann et al., 2017; Li et al., 2019). The operational principles of GPRs can be succinctly described as follows. Given a seed node $s \in V$ in some cluster of the graph, a one-dimensional feature vector $H^{(0)} \in \mathbb{R}^{n \times 1}$ is initialized according to $H^{(0)}_{v:} = \delta_{vs}$. The GPR score is defined as $\sum_{k=0}^{\infty} \gamma_k \tilde{A}_{sym}^k H^{(0)} = \sum_{k=0}^{\infty} \gamma_k H^{(k)}$, where the parameters $\gamma_k \in \mathbb{R}$, $k = 0, 1, 2, \ldots$, are referred to as the GPR weights. Clustering of the graph is performed locally by thresholding the GPR score. Certain PageRank methods, such as Personalized PageRank or heat-kernel PageRank (Chung, 2007), are associated with specific choices of GPR weights (Li et al., 2019). For an excellent in-depth discussion of PageRank methods, the interested reader is referred to Gleich (2015). The work in Li et al. (2019) recently introduced and theoretically analyzed a special form of GPR termed Inverse PR (IPR) and showed that long random walk paths are more beneficial for clustering than previously assumed, provided that the GPR weights are properly selected (note that IPR was developed for homophilic graphs, and optimal GPR weights for heterophilic graphs are not currently known).

Equivalence of the GPR method and polynomial graph filtering. If we truncate the infinite sum in the definition of GPR at some natural number $K$, then $\sum_{k=0}^{K} \gamma_k \tilde{A}_{sym}^k$ corresponds to a polynomial graph filter of order $K$. Thus, learning the optimal GPR weights is equivalent to learning the optimal polynomial graph filter. Note that one can approximate any graph filter using a polynomial graph filter (Shuman et al., 2013), and hence the GPR method is able to deal with a large range of different node label patterns. Also, increasing $K$ allows one to better approximate the underlying optimal graph filter.
This once again shows that large-step propagation is beneficial.

Universality with respect to node label patterns: Homophily versus heterophily. In their recent work, Pei et al. (2019) proposed an index to measure the level of homophily of nodes in a graph:

$$H(G) = \frac{1}{|V|} \sum_{v \in V} \frac{\text{Number of neighbors of } v \in V \text{ that have the same label as } v}{\text{Number of neighbors of } v}.$$

Note that $H(G) \to 1$ corresponds to strong homophily while $H(G) \to 0$ indicates strong heterophily. Figures 1 (b) and (c) plot the GPR weights learnt by our GPR-GNN method on a homophilic (Cora) and a heterophilic (Texas) dataset. The learnt GPR weights from Cora match the behavior of IPR (Li et al., 2019), which verifies that large-step propagation is indeed of great importance for homophilic graphs. The GPR weights learnt from Texas behave significantly differently from all known PR variants, taking a number of negative values. These differences in weight patterns are observed under random initialization, demonstrating that the weights are actually learned by the network and not forced by a specific initialization. Furthermore, the large difference in the GPR weights for these two graph models illustrates the learning power of GPR-GNN and its universal adaptability.

The over-smoothing problem. One of the key components in most GNN models is the graph convolutional layer, described by

$$H^{(k)}_{GCN} = \mathrm{ReLU}\left(\tilde{A}_{sym} H^{(k-1)}_{GCN} W^{(k)}\right), \quad P_{GCN} = \mathrm{softmax}\left(\tilde{A}_{sym} H^{(K-1)}_{GCN} W^{(K)}\right),$$

where $H^{(0)}_{GCN} = X$ and $W^{(k)}$ represents the trainable weight matrix for the $k$-th layer. The key issue that limits stacking multiple layers is the over-smoothing phenomenon: if one were to remove ReLU in the above expression, $\lim_{k \to \infty} \tilde{A}_{sym}^k H^{(0)} = H^{(\infty)}$, where each row of $H^{(\infty)}$ only depends on the degree of the corresponding node, provided that the graph is irreducible and aperiodic. This shows that the model loses the discriminative information provided by the node features as the number of layers increases.

Mitigating graph heterophily and over-smoothing issues with the GPR-GNN model. GPR-GNN first extracts hidden state features for each node and then uses GPR to propagate them.
The GPR-GNN process can be mathematically described as:

$$P = \mathrm{softmax}(Z), \quad Z = \sum_{k=0}^{K} \gamma_k H^{(k)}, \quad H^{(k)} = \tilde{A}_{sym} H^{(k-1)}, \quad H^{(0)}_{i:} = f_\theta(X_{i:}),$$

where $f_\theta(\cdot)$ represents a neural network with parameter set $\{\theta\}$ that generates the hidden state features $H^{(0)}$. The GPR weights $\gamma_k$ are trained together with $\{\theta\}$ in an end-to-end fashion. The GPR-GNN model is easy to interpret: as already pointed out, GPR-GNN has the ability to adaptively control the contribution of each propagation step and adjust it to the node label pattern. Examining the learnt GPR weights also helps with elucidating the properties of the topological information of a graph (i.e., determining the optimal polynomial graph filter), as illustrated in Figure 1 (b) and (c).

Placing GPR-GNNs in the context of related prior work. Among the methods that differ from repeated stacking of GCN layers, APPNP (Klicpera et al., 2018) represents one of the state-of-the-art GNNs that is related to our GPR-GNN approach. It can be easily seen that APPNP as well as SGC (Wu et al., 2019) are special cases of our model, since APPNP fixes $\gamma_k = \alpha(1-\alpha)^k$ for $k < K$ and $\gamma_K = (1-\alpha)^K$, while SGC removes all nonlinearities and uses $\gamma_k = \delta_{kK}$. These two weight choices correspond to Personalized PageRank (PPR) (Jeh & Widom, 2003), which is known to be suboptimal compared to the IPR framework when applied to homophilic node classification (Li et al., 2019). Fixing the GPR weights makes the model unable to adaptively learn the optimal propagation rules, which is of crucial importance: as we will show in Section 4, the fixed PPR weights correspond to low-pass graph filters, which makes them inadequate for learning on heterophilic graphs. The recent work of Klicpera et al. (2018) showed that fixed PPR weights (APPNP) can also provably resolve the over-smoothing problem. However, the way APPNP prevents over-smoothing is independent of the node label information.
In contrast, the escape of GPR-GNN from over-smoothing is guided by the node label information (Theorem 4.2). A detailed discussion of this phenomenon along with illustrative examples is relegated to the Supplement.

Among the GCN-like models, JK-Net (Xu et al., 2018) exhibits some similarities with GPR-GNN. It also aggregates the outputs of different GCN layers to arrive at the final output. On the other hand, the GCN-Cheby method (Defferrard et al., 2016; Kipf & Welling, 2017) is related to polynomial graph filtering, where each convolutional layer propagates multiple steps and the graph filter is related to Chebyshev polynomials. In both cases, the depth of the models is limited in practice (Klicpera et al., 2018) and they are not as easy to interpret as our GPR-GNN method. Some prior work also emphasizes adaptively learning the importance of different steps (Abu-El-Haija et al., 2018; Berberidis et al., 2018). Nevertheless, none of the above works is applicable to semi-supervised learning with GNNs or considers heterophilic graphs.
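The propagation rule of this section, and the way APPNP and SGC arise as fixed-weight special cases, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`matmul`, `gpr_propagate`, `appnp_weights`, `sgc_weights`) are hypothetical, the matrix is the normalized adjacency of a three-node path graph with self-loops, and `h0` is a hand-picked stand-in for the hidden features produced by the neural network. In the actual model the weights are trainable parameters rather than fixed lists.

```python
import math

def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def gpr_propagate(a_sym, h0, gammas):
    """Return Z = sum_k gammas[k] * H^(k), with H^(k) = a_sym @ H^(k-1) and H^(0) = h0."""
    h = h0
    z = [[gammas[0] * x for x in row] for row in h0]
    for gamma in gammas[1:]:
        h = matmul(a_sym, h)  # one more propagation step
        z = [[zv + gamma * hv for zv, hv in zip(zr, hr)] for zr, hr in zip(z, h)]
    return z

def appnp_weights(alpha, K):
    """Fixed PPR weights: alpha (1 - alpha)^k for k < K, and (1 - alpha)^K for k = K."""
    return [alpha * (1 - alpha) ** k for k in range(K)] + [(1 - alpha) ** K]

def sgc_weights(K):
    """Fixed weights delta_{kK}: only the K-th propagation step contributes."""
    return [0.0] * K + [1.0]

# Normalized adjacency (with self-loops) of the path graph 0 - 1 - 2.
s = 1 / math.sqrt(6)
a_sym = [[0.5, s, 0.0], [s, 1 / 3, s], [0.0, s, 0.5]]
h0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # stand-in for f_theta(X), chosen by hand

z_appnp = gpr_propagate(a_sym, h0, appnp_weights(0.1, 10))
z_sgc = gpr_propagate(a_sym, h0, sgc_weights(2))
# Signed weights, which fixed PPR weights cannot express.
z_signed = gpr_propagate(a_sym, h0, [(-0.5) ** k for k in range(11)])
```

A softmax over each row of `Z` would then give the class probabilities; it is omitted here to keep the sketch focused on the propagation step.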

4. THEORETICAL PROPERTIES OF GPR-GNNS

Graph filtering aspects of GPR-GNNs. As mentioned in Section 3, the GPR component of the network may be viewed as a polynomial graph filter. Let $\tilde{A}_{sym} = U \Lambda U^T$ be the eigenvalue decomposition of $\tilde{A}_{sym}$. Then, the corresponding polynomial graph filter equals $\sum_{k=0}^{K} \gamma_k \tilde{A}_{sym}^k = U g_{\gamma,K}(\Lambda) U^T$, where $g_{\gamma,K}(\Lambda)$ is applied element-wise and $g_{\gamma,K}(\lambda) = \sum_{k=0}^{K} \gamma_k \lambda^k$. We established the following result.

Theorem 4.1 (Informal). Assume that the graph $G$ is connected. If $\gamma_k \geq 0$ for all $k \in \{0, 1, \ldots, K\}$, $\sum_{k=0}^{K} \gamma_k = 1$ and there exists $k' > 0$ such that $\gamma_{k'} > 0$, then $g_{\gamma,K}(\cdot)$ is a low-pass graph filter. Also, if $\gamma_k = (-\alpha)^k$, $\alpha \in (0, 1)$ and $K$ is large enough, then $g_{\gamma,K}(\cdot)$ is a high-pass graph filter.

By Theorem 4.1 and from our discussion in Section 3, we know that both APPNP and SGC will invariably suppress the high-frequency components. Thus, they are inadequate for use on heterophilic graphs. In contrast, if one allows $\gamma_k$ to be negative and learned adaptively, the graph filter will pass relevant high frequencies. This is what allows GPR-GNN to perform exceptionally well on heterophilic graphs (see Figure 2 (c)).

GPR-GNN can escape from over-smoothing. As already emphasized, one crucial innovation of the GPR-GNN method is to make the GPR weights adaptively learnable, which allows GPR-GNN to avoid over-smoothing and trade off node and topology feature informativeness. Intuitively, when large-step propagation is not beneficial, it increases the training loss. Hence, the corresponding GPR weights should decay in magnitude. This observation is captured by the following result, whose more formal statement and proof are relegated to the Supplement due to space limitations.

Theorem 4.2 (Informal). Assume the graph $G$ is connected and the training set contains nodes from each of the classes. Also assume that $k'$ is large enough so that the over-smoothing effect occurs for $H^{(k)}$, $\forall k \geq k'$, which dominate the contribution to the final output $Z$. Then, the gradient with respect to $\gamma_k$ and $\gamma_k$ itself are identical in sign for all $k \geq k'$.

Theorem 4.2 shows that as long as over-smoothing happens, $|\gamma_k|$ will approach 0 for all $k \geq k'$ when we use an optimizer such as stochastic gradient descent (SGD) with a suitable learning rate decay. This reduces the contribution of the corresponding steps $H^{(k)}$ to the final output $Z$. When the weights $|\gamma_k|$ are small enough so that $H^{(k)}$ no longer dominates the value of the final output $Z$, the over-smoothing effect is eliminated.
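The contrast drawn in Theorem 4.1 can be checked numerically by evaluating the polynomial filter response at a few eigenvalues. The sketch below is an illustration only: the function name `filter_response` and the sample eigenvalues are assumptions, and "low-pass" is read here as the normalized response |g(λ)/g(1)| staying below 1 away from λ = 1, with "high-pass" meaning it exceeds 1. The formal definitions are in the Supplement and may differ in detail.

```python
def filter_response(gammas, lam):
    """Evaluate g(lambda) = sum_k gammas[k] * lambda**k."""
    return sum(g * lam ** k for k, g in enumerate(gammas))

K, alpha = 10, 0.5
# Nonnegative PPR weights as used by APPNP (they sum to 1).
ppr = [alpha * (1 - alpha) ** k for k in range(K)] + [(1 - alpha) ** K]
# Alternating-sign weights gamma_k = (-alpha)^k from the second part of Theorem 4.1.
alt = [(-alpha) ** k for k in range(K + 1)]

# Eigenvalues of the normalized adjacency lie in (-1, 1]; lambda = 1 is the lowest frequency.
for lam in (1.0, 0.5, 0.0, -0.5, -0.9):
    ppr_ratio = filter_response(ppr, lam) / filter_response(ppr, 1.0)
    alt_ratio = filter_response(alt, lam) / filter_response(alt, 1.0)
    print(f"lambda={lam:+.1f}  PPR ratio={ppr_ratio:.4f}  alternating ratio={alt_ratio:.4f}")
```

With the nonnegative weights the response is largest at λ = 1, whereas with alternating signs it grows as λ moves toward -1, which matches the intuition that negative weights are needed to emphasize high-frequency components.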

5. RESULTS FOR NEW CSBM SYNTHETIC AND REAL-WORLD DATASETS

Synthetic data. In order to test the ability of GNNs to learn labels on graphs with arbitrary levels of homophily and heterophily, we propose to use cSBMs (Deshpande et al., 2018) to generate synthetic graphs. We consider the case with two equal-size classes. In cSBMs, the node features are Gaussian random vectors, where the mean of the Gaussian depends on the community assignment. The difference of the means is controlled by a parameter $\mu$, while the difference of the edge densities within the communities and between the communities is controlled by a parameter $\lambda$. Hence $\mu$ and $\lambda$ capture the "relative informativeness" of node features and the graph topology, respectively. Moreover, positive values of $\lambda$ correspond to homophilic graphs while negative values of $\lambda$ correspond to heterophilic graphs. The information-theoretic limits of reconstruction for the cSBM are characterized in Deshpande et al. (2018). The results show that, asymptotically, one needs $\lambda^2 + \mu^2/\xi > 1$ to ensure a vanishing ratio of the misclassified nodes to the total number of nodes, where $\xi = n/f$ and $f$, as before, denotes the dimension of the node feature vector.

Note that given a tolerance value $\epsilon > 0$, $\lambda^2 + \mu^2/\xi = 1 + \epsilon$ is an arc of an ellipse for which $\lambda \geq 0$ and $\mu \geq 0$. To fairly and continuously control the extent of information carried by the node features and graph topology, we introduce a parameter $\phi = \arctan\left(\frac{\lambda \sqrt{\xi}}{\mu}\right) \times \frac{2}{\pi}$. The setting $\phi = 0$ indicates that only node features are informative, while $|\phi| = 1$ indicates that only the graph topology is informative. Moreover, $\phi = 1$ corresponds to strongly homophilic graphs while $\phi = -1$ corresponds to strongly heterophilic graphs. Note that the values $\phi$ and $-\phi$ convey the same amount of information regarding graph topology. This is due to the fact that $\lambda^2 = (-\lambda)^2$. Ideally, GNNs that are able to optimally learn on both homophilic and heterophilic graphs should have similar performance for $\phi$ and $-\phi$.
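To make the role of φ concrete, the sketch below converts a value of φ into a pair (λ, µ) lying on the curve λ² + µ²/ξ = 1 + ε, and back. The inversion formulas are derived here from the two relations stated above rather than quoted from the paper, and the function names and the example values of ξ and ε are hypothetical. `math.atan2` is used in place of a plain arctangent of the ratio so that the limit µ → 0 (|φ| = 1) does not divide by zero; for µ > 0 the two agree.

```python
import math

def csbm_params(phi, xi, eps):
    """Map phi in [-1, 1] to (lam, mu) with lam^2 + mu^2 / xi = 1 + eps."""
    scale = math.sqrt(1.0 + eps)
    lam = scale * math.sin(phi * math.pi / 2)
    mu = math.sqrt(xi) * scale * math.cos(phi * math.pi / 2)
    return lam, mu

def phi_from_params(lam, mu, xi):
    """Recover phi = arctan(lam * sqrt(xi) / mu) * 2 / pi (valid for mu >= 0)."""
    return math.atan2(lam * math.sqrt(xi), mu) * 2 / math.pi

xi, eps = 2.5, 3.25  # illustrative values only, not the paper's settings
for phi in (-1.0, -0.5, 0.0, 0.5, 1.0):
    lam, mu = csbm_params(phi, xi, eps)
    print(f"phi={phi:+.2f}  lambda={lam:+.4f}  mu={mu:.4f}  "
          f"lambda^2 + mu^2/xi={lam ** 2 + mu ** 2 / xi:.4f}")
```

Flipping the sign of φ flips the sign of λ while leaving µ unchanged, which is the symmetry between homophilic and heterophilic graphs noted in the text.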
Due to space limitations, we refer the interested reader to Deshpande et al. (2018) for a review of all formal theoretical results and only outline the cSBM properties needed for our analysis. Additional information is also available in the Supplement.

Our experimental setup examines the semi-supervised node classification task in the transductive setting. We consider two different choices for the random split into training/validation/test samples, which we call sparse splitting (2.5%/2.5%/95%) and dense splitting (60%/20%/20%), respectively. The sparse splitting is more similar to the original semi-supervised setting considered in Kipf & Welling (2017), while the dense setting is considered in Pei et al. (2019) for studying heterophilic graphs. We run each experiment 100 times with multiple random splits and different initializations.

Methods used for comparisons. We compare GPR-GNN with nine baseline models: MLP, GCN (Kipf & Welling, 2017), GAT (Velickovic et al., 2018), JK-Net (Xu et al., 2018), GCN-Cheby (Defferrard et al., 2016), APPNP (Klicpera et al., 2018), SGC (Wu et al., 2019), SAGE (Hamilton et al., 2017) and Geom-GCN (Pei et al., 2019). For all architectures, we use the corresponding PyTorch Geometric library implementations (Fey & Lenssen, 2019). For Geom-GCN, we directly use the code provided by the authors. We could not test Geom-GCN on cSBM and other datasets not originally tested in the paper due to a preprocessing subroutine that is not publicly available (Pei et al., 2019).

The GPR-GNN model setup and hyperparameter tuning. We choose random walk path lengths with $K = 10$ and use a 2-layer MLP with 64 hidden units for the NN component. For the GPR weights, we use different initializations including PPR with $\alpha \in \{0.1, 0.2, 0.5, 0.9\}$, $\gamma_k = \delta_{0k}$ or $\delta_{Kk}$, and the default random initialization in PyTorch. Similarly, for APPNP we search for the optimal $\alpha$ within $\{0.1, 0.2, 0.5, 0.9\}$.
For the other hyperparameters, we optimize the learning rate over $\{0.002, 0.01, 0.05\}$ and the weight decay over $\{0.0, 0.0005\}$ for all models. For Geom-GCN, we use the best variants in the original paper for each dataset. Finally, we use GPR-GNN(rand) to describe the results obtained with random initialization of the GPR weights. Further experimental settings are discussed in the Supplement.

Results. We examine the robustness of all baseline methods and GPR-GNN using cSBM-generated data with $\phi \in \{-1, -0.75, -0.5, \ldots, 1\}$, which includes graphs across the heterophily/homophily spectrum. The results are summarized in Figure 2. For both the sparse and dense settings, GPR-GNN significantly outperforms all other baseline models whenever $\phi < 0$ (heterophilic graphs). On the other hand, all baseline GNNs can be worse than a simple MLP when the graph information is weak ($\phi = 0, -0.25$). This shows that existing GNNs cannot be applied to arbitrary graphs, while GPR-GNN is clearly more robust. APPNP methods have the worst performance on strongly heterophilic graphs. This is in agreement with the result of Theorem 4.1, which asserts that APPNP intrinsically acts as a low-pass filter and is thus inadequate for strong heterophily settings. JK-Net, GCN-Cheby and SAGE are the only three baseline models that are able to learn strongly heterophilic graphs under dense splitting. This is also to be expected, since JK-Net is the only baseline model that combines results from different steps at the last layer, which is similar to what is done in GPR-GNN. GCN-Cheby uses multiple steps in each layer, which allows it to partially adapt to heterophilic settings, as each layer is related to a polynomial graph filter of higher order compared to that of GCN. SAGE treats ego-embeddings and embeddings from neighboring nodes differently and does not simply average them out.
This allows SAGE to adapt to the heterophilic case, since the ego-embeddings prevent nodes from being overwhelmed by information from their neighbors. Nevertheless, JK-Net, GCN-Cheby and SAGE are not deep in practice. Also, we observe that random initialization of our GPR weights only results in slight performance drops under dense splitting. The drop is more evident in the sparse splitting setting, but our method still outperforms the baseline models by a large margin for strongly heterophilic graphs. This is also to be expected, as we have less label information in the sparse splitting setting, where the implicit bias provided by a good GPR initialization is helpful. The implicit bias becomes irrelevant in the dense splitting setting, since the label information is sufficiently rich.

Besides the strong performance of GPR-GNN, its other benefit is interpretability. In Figure 3, we show the GPR weights learnt by GPR-GNN on cSBM with random initialization. When the graph is weakly homophilic ($\phi = 0.25$), the learnt GPR weights are decreasing. This is similar to the PPR weights used in APPNP, although the decay rate is different. When the graph is strongly homophilic ($\phi = 0.75$), the learnt GPR weights are increasing, which is significantly different from the PPR weights. This result matches the recent finding in Li et al. (2019), and the weights behave similarly to the IPR weights proposed by the authors. On the other hand, the learnt GPR weights have a zig-zag shape when the graph is heterophilic. This again validates Theorem 4.1, as GPR weights with alternating signs correspond to a high-pass filter. Interestingly, when $\phi = -0.25$ the magnitude of the learnt GPR weights is decreasing. This is because the graph information is weak and the node feature information is more important in this case. It makes sense that the learnt GPR weights focus on the first few steps. Hence, we have validated the interpretability of GPR-GNN.
In practice, one can use the learnt GPR weights to better understand the graph-structured data at hand. We showcase this benefit in the results on real-world benchmark datasets.

Real-world benchmark datasets. We use 5 homophilic benchmark datasets available from the PyTorch Geometric library, including the citation graphs Cora, CiteSeer, PubMed (Sen et al., 2008; Yang et al., 2016) and the Amazon co-purchase graphs Computers and Photo (McAuley et al., 2015; Shchur et al., 2018). We also use 5 heterophilic benchmark datasets tested in Pei et al. (2019), including the Wikipedia graphs Chameleon and Squirrel, the Actor co-occurrence graph, and the webpage graphs Texas and Cornell from WebKB. We summarize the dataset statistics in Table 1.

Results on real-world datasets. We use accuracy (the micro-F1 score) as the evaluation metric along with a 95% confidence interval. The relevant results are summarized in Table 2. For the homophilic datasets, we provide results for sparse splitting, which is more aligned with the original setting used in Kipf & Welling (2017) and Shchur et al. (2018). For the heterophilic datasets, we adopt dense splitting, which is used in Pei et al. (2019).

Published as a conference paper at ICLR 2021

Table 2 shows that, in general, GPR-GNN outperforms all tested methods. On homophilic datasets, GPR-GNN achieves state-of-the-art performance. On heterophilic datasets, GPR-GNN significantly outperforms all the other baseline models. It is important to point out that there are two different patterns to be observed among the heterophilic datasets. On Chameleon and Squirrel, MLP and APPNP perform worse than other baseline methods such as GCN and JK-Net. In contrast, MLP and APPNP outperform the other baseline methods on Actor, Texas and Cornell. We conjecture that this is due to the fact that the graph topology information is strong and weak, respectively.
Note that these two patterns match the results of the cSBM experiments for $\phi$ close to $-1$ and $0$, respectively (Figure 2). Furthermore, the homophily measure $H(G)$ proposed by Pei et al. (2019) cannot characterize such differences in heterophilic datasets. We relegate the more detailed discussion of this topic along with illustrative examples to the Supplement. For fairness, we also repeated the experiment involving Geom-GCN on homophilic datasets using a dense split; the observed performance pattern tends to be similar, and the results can be found in the Supplement.

We also examined the learned GPR weights on real datasets in Figure 4. Due to space limitations, a more comprehensive GPR weight analysis for other datasets is deferred to the Supplement. We can see that the learned GPR weights are all positive for homophilic datasets (PubMed and Photo). In contrast, some GPR weights learned from heterophilic datasets (Actor and Squirrel) are negative. These results agree with the patterns observed on cSBMs. Interestingly, the learned weight $\gamma_0$ has the largest magnitude for the Actor dataset. This indicates that most of the information is contained in node features. From Table 2 we can also see that MLPs indeed outperform most baseline GNNs (this is similar to the case of cSBM($\phi = -0.25$)). On the other hand, GPR weights learned from Squirrel have a zig-zag pattern. This implies that graph topology is more informative for Squirrel compared to Actor. From Table 2 we also see that baseline GNNs outperform MLPs on Squirrel.

Escaping from over-smoothing and dynamics of learning GPR weights. To demonstrate the ability of GPR-GNNs to escape from over-smoothing, we choose the initial GPR weights to be $\gamma_k = \delta_{kK}$. This ensures that over-smoothing effects are present with high probability at the very beginning of the learning process.
On cSBM($\phi = -1$) with dense splitting, we find that for 96 out of 100 runs, GPR-GNN predicts the same labels for all nodes at epoch 0, which implies that over-smoothing indeed occurs immediately. The final prediction is 98.79% accurate, which is much higher than the initial accuracy of 50.07% at epoch 0. Similar results can be observed for other datasets, and this verifies our theoretical findings. We plot the dynamics of the learned GPR weights in Figure 4 (e)-(h), which shows that the peak at the last step is indeed reduced while the GPR weights for other steps are significantly increased in magnitude. More results on the dynamics of learning GPR weights may be found in the Supplement.

Efficiency analysis. We also examine the computational complexity of GPR-GNNs compared to other baseline models. We report the empirical training time in Table 3. Compared to APPNP, we only need to learn $K+1$ additional GPR weights for GPR-GNN, and usually $K \leq 20$ (e.g., we choose $K = 10$ in our experiments). These additional computations are dominated by the computations performed by the neural network module $f_\theta$. We can observe from Table 3 that GPR-GNN indeed has a running time similar to that of APPNP. It is nevertheless worth pointing out that the authors of Bojchevski et al. (2020) successfully scaled APPNP to operate on large graphs. Whether the same techniques may be used to scale GPR-GNNs is an interesting open question.

6. CONCLUSIONS

We addressed two fundamental weaknesses of existing GNNs: failing to act as universal learners by not generalizing to heterophilic graphs, and failing to make use of a large number of propagation steps. We developed a novel GPR-GNN architecture which combines an adaptive generalized PageRank (GPR) scheme with GNNs. We theoretically showed that our method not only mitigates feature over-smoothing but also works on highly diverse node label patterns. We also tested GPR-GNNs on both homophilic and heterophilic node label patterns, and proposed novel synthetic benchmark datasets generated by the contextual stochastic block model. Our experiments on real-world benchmark datasets showed clear performance gains of GPR-GNN over state-of-the-art methods. Moreover, we showed that GPR-GNN has desirable interpretability properties, which is of independent interest.

A APPENDIX

A.1 DETAILED DISCUSSION ON PREVENTING OVER-SMOOTHING

As mentioned in Section 4, another method, APPNP, can also provably prevent over-smoothing (Klicpera et al., 2018). The authors of this study use the fact that the PPR propagation will converge to $\Pi_{ppr} H^{(0)}$, where $\Pi_{ppr} = \alpha (I_n - (1-\alpha)\tilde{A}_{sym})^{-1}$ is independent of the node label information provided in the training data. Each row of $\Pi_{ppr} H^{(0)}$ still depends on $H^{(0)}$ and thus APPNP will not suffer from the over-smoothing effect. However, since $\Pi_{ppr}$ is independent of the label information, it can cause undesired consequences that we discuss in what follows.

Let us consider a simple example shown in Figure 5 involving a connected and undirected graph G = (V, E) (Figure 5 (a)). Consider two different node label assignments shown in Figure 5 (b) and Figure 5 (c). Obviously, the graph topologies depicted in Figure 5 (b) and (c) are identical and the only difference is the class label assignment. In Figure 5 (b), the graph is homophilic and hence the optimal graph filter should emphasize the low-frequency part of the graph signal. In contrast, in Figure 5 (c), the graph is heterophilic as the graph is bipartite with respect to the labels. Hence, the optimal graph filter should emphasize the high-frequency part of the graph signal. This example illustrates that the optimal graph filter should depend on both the graph topology and the node label information. Recall that the equivalent graph filter that APPNP uses in the asymptotic regime is $\Pi_{ppr}$, which is independent of the node label information. Also, Theorem 4.1 established that APPNP intrinsically utilizes a low-pass filter. In contrast, GPR-GNN learns the GPR weights guided by the node label information, which allows it to account for both cases (homophilic and heterophilic) shown.

As mentioned in Section 5, the homophily measure H(G) is inadequate for characterizing whether a heterophilic graph topology is informative or not. Consider two simple examples depicted in Figure 6, where the color of the nodes indicates their label. In case 1, blue and green nodes link to all orange and purple nodes. In case 2, blue nodes only link to orange nodes and green nodes only link to purple nodes. From the definition of H(G) one can see that both cases have H(G) = 0, since in both cases nodes do not link to other nodes of the same label.
However, it is clear that the graph topology carries more node label information in case 2 than in case 1. In fact, for case 1 it is impossible to distinguish blue and green nodes merely from the graph topology (and the same is true of orange and purple nodes). One possible alternative for the homophily measure is the Chernoff-Hellinger divergence (Abbe, 2017) of the empirical edge probability matrix $B$; here $B_{ij}$ is the empirical probability of an edge with one end node labeled $i$ and the other labeled $j$. The intuition behind our suggestion lies in the fact that the Chernoff-Hellinger divergence characterizes the fundamental limit of SBMs. However, as many practical graph generative processes may significantly differ from SBMs, investigating alternative homophily/heterophily measures is another interesting open problem.
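The two cases can be checked numerically with a short sketch. It assumes that H(G) is the average, over nodes, of the fraction of a node's neighbors that share its label; this definition is not restated in this section, so treat it as an assumption. The group sizes (two nodes per color) and the helper names `homophily` and `bipartite_links` are hypothetical.

```python
import numpy as np

def homophily(A, labels):
    """Average over non-isolated nodes of the fraction of same-label neighbors
    (assumed form of the measure H(G))."""
    fractions = []
    for v in range(A.shape[0]):
        nbrs = np.flatnonzero(A[v])
        if nbrs.size:                           # skip isolated nodes
            fractions.append(np.mean(labels[nbrs] == labels[v]))
    return float(np.mean(fractions))

def bipartite_links(group_a, group_b, n):
    """Adjacency matrix linking every node in group_a to every node in group_b."""
    A = np.zeros((n, n))
    for a in group_a:
        for b in group_b:
            A[a, b] = A[b, a] = 1.0
    return A

# 0 = blue, 1 = green, 2 = orange, 3 = purple; two nodes per color (hypothetical).
labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])
blue, green, orange, purple = [0, 1], [2, 3], [4, 5], [6, 7]

# Case 1: blue and green link to all orange and purple nodes.
A_case1 = bipartite_links(blue + green, orange + purple, 8)
# Case 2: blue links only to orange, green links only to purple.
A_case2 = bipartite_links(blue, orange, 8) + bipartite_links(green, purple, 8)

H_case1 = homophily(A_case1, labels)
H_case2 = homophily(A_case2, labels)

# In case 1 a blue node and a green node have identical neighborhoods,
# so topology alone cannot separate them; in case 2 they differ.
same_nbrs_case1 = np.array_equal(A_case1[blue[0]], A_case1[green[0]])
same_nbrs_case2 = np.array_equal(A_case2[blue[0]], A_case2[green[0]])
```

Both graphs receive the same score under this measure even though only the second one lets the neighborhoods separate blue from green.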

A.3 PROOF OF THEOREM 4.1

We first state the formal version of Theorem 4.1.

Theorem A.1 (Formal version of Theorem 4.1). Assume the graph $G$ is connected. Let $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$ be the eigenvalues of $\tilde{A}_{sym}$. If $\gamma_k \ge 0$ $\forall k \in \{0, 1, \dots, K\}$, $\sum_{k=0}^{K} \gamma_k = 1$ and $\exists k > 0$ such that $\gamma_k > 0$, then $|g_{\gamma,K}(\lambda_i)/g_{\gamma,K}(\lambda_1)| < |\lambda_i/\lambda_1|$ $\forall i \ge 2$. Also, if $\gamma_k = (-\alpha)^k$, $\alpha \in (0, 1)$ and $K \to \infty$, then $|\lim_{K\to\infty} g_{\gamma,K}(\lambda_i)/\lim_{K\to\infty} g_{\gamma,K}(\lambda_1)| > |\lambda_i/\lambda_1|$ $\forall i \ge 2$.

Note that $|g_{\gamma,K}(\lambda_i)/g_{\gamma,K}(\lambda_1)| < |\lambda_i/\lambda_1|$ $\forall i \ge 2$ implies that after applying the graph filter $g_{\gamma,K}$, the lowest frequency component (corresponding to $\lambda_1$) further dominates. Hence $g_{\gamma,K}$ acts like a low pass filter in this case. In contrast, $|\lim_{K\to\infty} g_{\gamma,K}(\lambda_i)/\lim_{K\to\infty} g_{\gamma,K}(\lambda_1)| > |\lambda_i/\lambda_1|$ $\forall i \ge 2$ implies that after applying the graph filter, the lowest frequency component (corresponding to $\lambda_1$) no longer dominates. This corresponds to the high pass filter case.

Proof. We start with the low pass filter result. From basic spectral analysis (Von Luxburg, 2007) we know that $\lambda_1 = 1$ and $|\lambda_i| < 1$, $\forall i \ge 2$. One can also find the analysis in the proof of our Lemma A.2 in the Supplement. Then by assumption we know that $g_{\gamma,K}(\lambda_1) = \sum_{k=0}^{K} \gamma_k = 1$. Hence, proving Theorem A.1 is equivalent to showing that $|g_{\gamma,K}(\lambda_i)| < |\lambda_i|$ $\forall i \ge 2$. This is obvious since $g_{\gamma,K}(\lambda) = \sum_{k=0}^{K} \gamma_k \lambda^k$ is a polynomial of order $K$ with nonnegative coefficients. It is easy to check that $\forall k \ge 1$, $|\lambda|^k < |\lambda|$, $\forall |\lambda| < 1$. Combining this with the fact that all $\gamma_k$'s are nonnegative, we have

$$|g_{\gamma,K}(\lambda_i)| \le \sum_{k=0}^{K} \gamma_k |\lambda_i^k| = \sum_{k=0}^{K} \gamma_k |\lambda_i|^k \overset{(a)}{\le} \sum_{k=0}^{K} \gamma_k |\lambda_i| = |\lambda_i|.$$

Finally, note that the only possibility for equality to hold in (a) is $\gamma_k = \delta_{0k}$, since $\forall k \ge 1$, $|\lambda|^k < |\lambda|$, $\forall |\lambda| < 1$. However, by the assumptions $\sum_{k=0}^{K} \gamma_k = 1$ and $\exists k > 0$ such that $\gamma_k > 0$, we know that this is impossible. Hence (a) is a strict inequality $<$. Together, this completes the proof of the low pass filtering part.
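The two regimes in Theorem A.1 can be sanity-checked numerically with the short NumPy sketch below. It is illustrative only: the grid `lams` stands in for the eigenvalues $\lambda_i$, $i \ge 2$, the nonnegative weights `gamma_low` are a hypothetical choice (with $\gamma_0 = 0$ picked purely for illustration), and the high pass weights are truncated at a finite $K$ rather than taken to the limit.

```python
import numpy as np

def g(gamma, lam):
    """Polynomial graph filter g_{gamma,K}(lam) = sum_k gamma[k] * lam**k."""
    return sum(c * lam**k for k, c in enumerate(gamma))

# Stand-ins for eigenvalues lambda_i, i >= 2 (an even count avoids lam = 0).
lams = np.linspace(-0.95, 0.95, 38)

# Hypothetical nonnegative weights summing to 1, with gamma_0 = 0.
gamma_low = [0.0, 0.5, 0.3, 0.2]
ratio_low = np.abs(g(gamma_low, lams) / g(gamma_low, 1.0))

# gamma_k = (-alpha)^k, truncated at a large K to approximate K -> infinity.
alpha, K = 0.5, 200
gamma_high = [(-alpha) ** k for k in range(K + 1)]
ratio_high = np.abs(g(gamma_high, lams) / g(gamma_high, 1.0))

# Closed form used in the high pass part of the proof: 1 / (1 + alpha * lam).
closed_form = 1.0 / (1.0 + alpha * lams)
```

For these choices, `ratio_low` stays below `np.abs(lams)` (the component at $\lambda_1$ gains relative weight), while `ratio_high` stays above 1 (the remaining components gain relative weight).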
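For the high pass filter result, it is not hard to see that

$$\lim_{K\to\infty} g_{\gamma,K}(\lambda) = \lim_{K\to\infty} \sum_{k=0}^{K} \gamma_k \lambda^k = \lim_{K\to\infty} \sum_{k=0}^{K} (-\alpha\lambda)^k = \frac{1}{1 + \alpha\lambda},$$

where the last step is due to the fact that $\alpha \in (0, 1)$ and thus $\lim_{K\to\infty} (-\alpha\lambda)^K = 0$, $\forall |\lambda| \le 1$. Thus we have

$$\left| \frac{\lim_{K\to\infty} g_{\gamma,K}(\lambda_i)}{\lim_{K\to\infty} g_{\gamma,K}(\lambda_1)} \right| = \frac{1 + \alpha}{1 + \alpha\lambda_i} \overset{(b)}{>} 1 \overset{(c)}{>} |\lambda_i| \quad \forall i \ge 2.$$

Thus, the gradient of $\gamma_k$ will always be negative when $\gamma_k < 0$. Finally, it is not hard to check that the gradient is bounded in magnitude. Together we have shown that the gradient with respect to $\gamma_k$ and $\gamma_k$ itself are of the same sign. This directly implies that $|\gamma_k|$ will approach 0 until we escape from over-smoothing when we use a decreasing learning rate for the optimizer (e.g., SGD).

Proof. First, let us assume that over-smoothing takes place and that $\gamma_k > 0$ for the dominant term. By Definition A.3, we know that $Z_{:j} = c_0 \beta_j \pi$, $\forall j \in [C]$ for some $c_0 > 0$ and $K$ sufficiently large. By Lemma A.4 we have

$$\frac{\partial L}{\partial \gamma_k} = \sum_{i \in T} \eta \pi_i \left\langle \frac{e^{\eta Z_{i:}}}{\sum_{j \in [C]} e^{\eta Z_{ij}}} - Y_{i:}, \beta \right\rangle + o_k(1) \tag{5}$$

$$= \sum_{i \in T} \eta \pi_i \left\langle \frac{e^{\eta c_0 \pi_i \beta}}{\sum_{j \in [C]} e^{\eta c_0 \pi_i \beta_j}} - Y_{i:}, \beta \right\rangle + o_k(1), \tag{6}$$

where the last step follows from Definition A.3. Next, by Lemma A.5, we may approximate $\mathrm{softmax}_\eta$ by the true argmax for $\eta > 0$ large enough according to

$$\sum_{i \in T} \eta \pi_i \left\langle 1[c_0 \pi_i \beta] - Y_{i:}, \beta \right\rangle + o_k(1) + o_\eta(1) \tag{7}$$

$$= \sum_{i \in T} \eta \pi_i \left\langle 1[\beta] - Y_{i:}, \beta \right\rangle + o_k(1) + o_\eta(1) \tag{8}$$

$$= \sum_{i \in T} \eta \pi_i \left( \max_{j \in [C]} \beta_j - \beta_{1[Y_{i:}]} \right) + o_k(1) + o_\eta(1). \tag{9}$$

The first equality is due to the fact that $c_0 > 0$ and $\pi_i > 0$. Recall that by Lemma A.2, $\pi_i = \sqrt{\tilde{D}_{ii}} / \sqrt{\sum_{v \in V} \tilde{D}_{vv}}$. Since we have a self-loop for each node, $\tilde{D}_{ii} > 0$ and thus $\pi_i > 0$. For the case $\gamma_k < 0$, the same analysis is still valid up to (7). Hence we have

$$\sum_{i \in T} \eta \pi_i \left\langle 1[-c_0 \pi_i \beta] - Y_{i:}, \beta \right\rangle + o_k(1) + o_\eta(1) \tag{10}$$

$$= \sum_{i \in T} \eta \pi_i \left\langle 1[-\beta] - Y_{i:}, \beta \right\rangle + o_k(1) + o_\eta(1) \tag{11}$$

$$= \sum_{i \in T} \eta \pi_i \left( \min_{j \in [C]} \beta_j - \beta_{1[Y_{i:}]} \right) + o_k(1) + o_\eta(1). \tag{12}$$

Together, this completes the proof.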

A.5 CSBM DETAILS

The cSBM adds Gaussian random vectors as node features on top of the classical SBM. For simplicity, we assume $C = 2$ equally sized communities with node labels $v_i \in \{+1, -1\}$. Each node $i$ is associated with an $f$-dimensional Gaussian vector

$$b_i = \sqrt{\frac{\mu}{n}}\, v_i u + \frac{Z_i}{\sqrt{f}},$$

where $n$ is the number of nodes, $u \sim N(0, I/f)$ and $Z_i \in \mathbb{R}^f$ has independent standard normal entries. The (undirected) graph in the cSBM is described by the adjacency matrix $A$ defined as

$$P(A_{ij} = 1) = \begin{cases} \dfrac{d + \lambda\sqrt{d}}{n} & \text{if } v_i v_j > 0, \\[2mm] \dfrac{d - \lambda\sqrt{d}}{n} & \text{otherwise.} \end{cases}$$

Similar to the classical SBM, given the node labels the edges are independent. The symbol $d$ stands for the average degree of the graph. Also, recall that $\mu$ and $\lambda$ control the information strength carried by the node features and the graph structure, respectively.

One reason for using the cSBM to generate synthetic data is that the information-theoretic limit of the model is already characterized in Deshpande et al. (2018). This result is summarized below.

Theorem A.7 (Informal main result in Deshpande et al. (2018)). Assume that $n, f \to \infty$, $n/f \to \xi$ and $d \to \infty$. Then there exists an estimator $\hat{v}$ such that $\liminf_{n\to\infty} |\langle \hat{v}, v \rangle| / n$ is bounded away from 0 if and only if $\lambda^2 + \mu^2/\xi > 1$.

In our experiments, we set $n = 5000$, $f = 2000$ and thus have $\xi = 2.5$. We vary $\mu$ and $\lambda$ along the arc $\lambda^2 + \mu^2/\xi = 1 + \epsilon$ for some $\epsilon > 0$ to ensure that we are in the achievable parameter regime. We also choose $\epsilon = 3.25$ for all our experiments.

A.6 PROOF OF LEMMA A.2

Note that the proof of Lemma A.2 reduces to a standard analysis of random walks on graphs. We include it for completeness and refer interested readers to the tutorial by Von Luxburg (2007). We start by showing that the symmetric graph Laplacian $\tilde{L}_{sym} = I - \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} = I - \tilde{A}_{sym}$ is positive semi-definite.
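Let $u$ be any real vector of unit norm and $f = \tilde{D}^{-1/2} u$. Then we have

$$u^T \tilde{L}_{sym} u = u^T u - u^T \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} u = \sum_{i=1}^{n} u_i^2 - \sum_{i,j=1}^{n} f_i f_j \tilde{A}_{ij} \tag{14}$$

$$= \sum_{i=1}^{n} \tilde{D}_{ii} f_i^2 - \sum_{i,j=1}^{n} f_i f_j \tilde{A}_{ij} = \frac{1}{2} \left( \sum_{i=1}^{n} \tilde{D}_{ii} f_i^2 - 2 \sum_{i,j=1}^{n} f_i f_j \tilde{A}_{ij} + \sum_{j=1}^{n} \tilde{D}_{jj} f_j^2 \right) \tag{15}$$

$$= \frac{1}{2} \sum_{i,j=1}^{n} \tilde{A}_{ij} (f_i - f_j)^2, \tag{16}$$

where the last step follows from the definition of the degree. Next we show that 0 is indeed an eigenvalue of $\tilde{L}_{sym}$ associated with the unit eigenvector $\pi$, where $\pi_i = \sqrt{\tilde{D}_{ii}} / \sqrt{\sum_v \tilde{D}_{vv}}$. Let $\mathbf{1}$ be the all-one vector. Then, a direct calculation reveals that

$$\tilde{L}_{sym} \pi = \pi - \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} \pi = \pi - \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} \tilde{D}^{1/2} \mathbf{1} \times \frac{1}{\sqrt{\sum_v \tilde{D}_{vv}}} \tag{17}$$

$$= \pi - \tilde{D}^{-1/2} \tilde{A} \mathbf{1} \times \frac{1}{\sqrt{\sum_v \tilde{D}_{vv}}} = \pi - \tilde{D}^{-1/2} \tilde{D} \mathbf{1} \times \frac{1}{\sqrt{\sum_v \tilde{D}_{vv}}} \tag{18}$$

$$= \pi - \tilde{D}^{1/2} \mathbf{1} \times \frac{1}{\sqrt{\sum_v \tilde{D}_{vv}}} = \pi - \pi = 0. \tag{19}$$

Combining this result with the positive semi-definite property of the Laplacian shows that 0 is indeed the smallest eigenvalue of $\tilde{L}_{sym}$, associated with the eigenvector $\pi$. Moreover, from (16) and the assumption that the graph is connected, it is not hard to see that the multiplicity of the eigenvalue 0 is exactly 1 (see Propositions 2 and 4 in Von Luxburg (2007) for more detail). Finally, from (13) it is obvious that the largest eigenvalue of $\tilde{A}_{sym}$ is 1, which corresponds to the eigenvector $\pi$. Hence all other eigenvalues of $\tilde{A}_{sym}$ satisfy $1 > \lambda_2 \ge \dots \ge \lambda_n$.

Published as a conference paper at ICLR 2021

Next, we prove that $|\lambda_n| < 1$. This can also be shown directly from (16). Note that

$$u^T \tilde{L}_{sym} u = \frac{1}{2} \sum_{i,j=1}^{n} \tilde{A}_{ij} (f_i - f_j)^2 \tag{20}$$

$$\le \sum_{i,j=1}^{n} \tilde{A}_{ij} (f_i^2 + f_j^2) = 2 \sum_{i,j=1}^{n} \tilde{A}_{ij} f_i^2 = 2 \sum_{i,j=1}^{n} \tilde{A}_{ij} \frac{u_i^2}{\tilde{D}_{ii}} \tag{21}$$

$$= 2 \sum_{i=1}^{n} \frac{u_i^2}{\tilde{D}_{ii}} \sum_{j=1}^{n} \tilde{A}_{ij} = 2 \sum_{i=1}^{n} \frac{u_i^2}{\tilde{D}_{ii}} \tilde{D}_{ii} = 2 \sum_{i=1}^{n} u_i^2 = 2. \tag{22}$$

The inequality follows from an application of the Cauchy-Schwarz inequality. Consequently, the largest eigenvalue of $\tilde{L}_{sym}$ is bounded by 2, which means that $|\lambda_n| \le 1$. Note that equality holds if and only if the underlying graph is bipartite. However, this is impossible in our setting since we have added a self-loop to each node. Hence $|\lambda_n| < 1$. This means $\lim_{k\to\infty} \tilde{A}_{sym}^k = \pi\pi^T$.

Hence, for any $H^{(0)}$ we have $\lim_{k\to\infty} \tilde{A}_{sym}^k H^{(0)} = \pi\pi^T H^{(0)} = \pi\beta^T$. Note that this can also be written with the $o_k(1)$ term as

$$\tilde{A}_{sym}^k H^{(0)} = \pi\beta^T + o_k(1).$$

Then by taking the partial derivative of the loss function with respect to $\gamma_{k'}$ we have

$$\frac{\partial L}{\partial \gamma_{k'}} = \frac{\partial}{\partial \gamma_{k'}} \sum_{i \in T} \left( \log\Big( \sum_{m=1}^{C} e^{\eta Z_{im}} \Big) - \eta \langle Z_{i:}, Y_{i:} \rangle \right).$$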
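Next, recall that for GPR-GNN we also have $Z = \sum_{k=0}^{K} \gamma_k H^{(k)}$. Plugging this expression into the previous formula and applying the chain rule we obtain

$$\frac{\partial}{\partial \gamma_{k'}} \sum_{i \in T} \left( \log\Big( \sum_{m=1}^{C} e^{\eta Z_{im}} \Big) - \eta \langle Z_{i:}, Y_{i:} \rangle \right) = \sum_{i \in T} \left( \frac{\sum_{m=1}^{C} e^{\eta Z_{im}} \frac{\partial \eta Z_{im}}{\partial \gamma_{k'}}}{\sum_{m=1}^{C} e^{\eta Z_{im}}} - \eta \langle H^{(k')}_{i:}, Y_{i:} \rangle \right) \tag{28}$$

$$= \sum_{i \in T} \left( \frac{\sum_{m=1}^{C} e^{\eta Z_{im}} \eta H^{(k')}_{im}}{\sum_{m=1}^{C} e^{\eta Z_{im}}} - \eta \langle H^{(k')}_{i:}, Y_{i:} \rangle \right). \tag{29}$$

Setting $k' = k$ for large enough $k$, it follows from Lemma A.2 that

$$\frac{\partial L}{\partial \gamma_k} = \sum_{i \in T} \eta \left( \frac{\sum_{m=1}^{C} e^{\eta Z_{im}} H^{(k)}_{im}}{\sum_{m=1}^{C} e^{\eta Z_{im}}} - \langle H^{(k)}_{i:}, Y_{i:} \rangle \right) \tag{30}$$

$$= \sum_{i \in T} \eta \left( \frac{\sum_{m=1}^{C} e^{\eta Z_{im}} (\pi_i \beta_m + o_k(1))}{\sum_{m=1}^{C} e^{\eta Z_{im}}} - \langle \pi_i \beta + o_k(1), Y_{i:} \rangle \right) \tag{31}$$

$$= \sum_{i \in T} \pi_i \eta \left( \frac{\sum_{m=1}^{C} e^{\eta Z_{im}} \beta_m}{\sum_{m=1}^{C} e^{\eta Z_{im}}} - \langle \beta, Y_{i:} \rangle \right) + o_k(1) \tag{32}$$

$$= \sum_{i \in T} \pi_i \eta \left( \sum_{m=1}^{C} P_{im} \beta_m - \langle \beta, Y_{i:} \rangle \right) + o_k(1) = \sum_{i \in T} \eta \pi_i \langle P_{i:} - Y_{i:}, \beta \rangle + o_k(1). \tag{33}$$

Note that in (32) and (33) we used the definition of the soft prediction $P = \mathrm{softmax}_\eta(Z)$. This completes the proof.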
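The exact (pre-asymptotic) form of the gradient in (29) and (30), namely $\partial L / \partial \gamma_k = \sum_{i \in T} \eta \langle P_{i:} - Y_{i:}, H^{(k)}_{i:} \rangle$, can be checked against finite differences with the sketch below. The random tensors standing in for $H^{(k)}$, the sizes, and the function names are assumptions for illustration, and every node is treated as a training node.

```python
import numpy as np

def softmax_eta(Z, eta):
    """Row-wise softmax with smoothing parameter eta (numerically stabilized)."""
    E = np.exp(eta * Z - (eta * Z).max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def loss(gamma, H, Y, eta):
    """Cross entropy L = sum_i -log <P_i, Y_i> with Z = sum_k gamma[k] * H[k]."""
    Z = np.tensordot(gamma, H, axes=1)          # shape (n, C)
    P = softmax_eta(Z, eta)
    return -np.sum(np.log(np.sum(P * Y, axis=1)))

def grad_gamma(gamma, H, Y, eta):
    """Analytic gradient: dL/dgamma_k = sum_i eta * <P_i - Y_i, H[k]_i>."""
    Z = np.tensordot(gamma, H, axes=1)
    P = softmax_eta(Z, eta)
    return eta * np.einsum("ic,kic->k", P - Y, H)

# Hypothetical sizes and random stand-ins for H^(k), labels and GPR weights.
rng = np.random.default_rng(0)
n, C, K, eta = 5, 3, 3, 2.0
H = rng.normal(size=(K + 1, n, C))
Y = np.eye(C)[rng.integers(0, C, size=n)]       # one-hot label matrix
gamma = rng.normal(size=K + 1)

# Central finite differences, one coordinate of gamma at a time.
eps = 1e-6
numeric = np.array([
    (loss(gamma + eps * e_k, H, Y, eta) - loss(gamma - eps * e_k, H, Y, eta)) / (2 * eps)
    for e_k in np.eye(K + 1)
])
analytic = grad_gamma(gamma, H, Y, eta)
```

The asymptotic statement of Lemma A.4 follows from this exact expression once $H^{(k)}_{i:}$ is replaced by $\pi_i \beta + o_k(1)$.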

A.8 PROOF OF LEMMA A.5

Let $\bar{\beta} = \max(\beta)$. Then by the definition of $\mathrm{softmax}_\eta$ for $\eta > 0$ we have

$$\mathrm{softmax}_\eta(\beta)_j = \frac{e^{\eta\beta_j}}{\sum_{m} e^{\eta\beta_m}} = \frac{e^{-\eta(\bar{\beta} - \beta_j)}}{\sum_{m} e^{-\eta(\bar{\beta} - \beta_m)}}. \tag{34}$$

Note that $\bar{\beta} - \beta_m > 0$ when $\beta_m \neq \bar{\beta}$ and $\bar{\beta} - \beta_m = 0$ when $\beta_m = \bar{\beta}$. Without loss of generality we assume that there are $p$ maxima in $\beta$, where $1 \le p \le C$, and let $\mathcal{P}$ denote the set of indices of those maxima. Then, taking the limit $\eta \to \infty$ we have

$$\lim_{\eta\to\infty} \mathrm{softmax}_\eta(\beta)_j = \lim_{\eta\to\infty} \frac{e^{-\eta(\bar{\beta} - \beta_j)}}{\sum_{m \notin \mathcal{P}} e^{-\eta(\bar{\beta} - \beta_m)} + p} = \begin{cases} 0, & \text{if } \beta_j \neq \bar{\beta}, \\ \frac{1}{p}, & \text{otherwise.} \end{cases}$$

This implies that for $\eta > 0$ large enough one has $\mathrm{softmax}_\eta(\beta) = 1[\beta] + o_\eta(1)$. The above result completes the proof.

A.9 ADDITIONAL EXPERIMENTAL DETAILS

For all baseline models, we directly use the implementation available in the PyTorch Geometric library (Fey & Lenssen, 2019). We use early stopping with a window of 200 epochs and a maximum number of epochs equal to 1000 for both the real benchmark datasets and our cSBM synthetic datasets. All models use the Adam optimizer (Kingma & Ba, 2014). Note that the early stopping criterion is exactly the same as in PyTorch Geometric: when the epoch is greater than half of the maximum epoch, we check if the current validation loss is lower than the average over the past 200 epochs. If it is not lower, we stop the training process.

For GCN, we use 2 GCN layers with 64 hidden units. For GAT, we use 2 GAT convolutional layers, where the first layer has 8 attention heads and each head has 8 hidden units; the second layer has 1 attention head and 64 hidden units. For GCN-Cheby, we use 2-step propagation for each layer with 32 hidden units. Note that the number of equivalent hidden units for each layer is 64 in this case. For JK-Net, we use the GCN-based model with 2 layers and 16 hidden units in each layer. As for the layer aggregation part, we use an LSTM with 16 channels and 4 layers. For the MLP, we choose a 2-layer fully connected network with 64 hidden units. For APPNP we use the same 2-layer MLP with 10 steps of propagation. For GPR-GNN, we fix the dropout rate for the NN part to be 0.5, as in APPNP, and optimize the dropout rate for the GPR part among {0, 0.5, 0.7}.
For Geom-GCN, we choose the datasets already tested in the paper where the method was first described (Pei et al., 2019). For SGC, we use the default K = 2 layers after testing among {2, 3}. For SAGE, we use 2 SAGE convolutional layers with 64 hidden units.

The heterophilic datasets used in (Pei et al., 2019). The graphs Chameleon, Actor, Squirrel, Texas and Cornell in their original form are directed graphs (see the GitHub repository of (Pei et al., 2019)). Since the usual setting for semi-supervised node classification involves undirected graphs, we transformed the graphs into undirected ones to test them on all previously described benchmark methods. We keep the input graph directed for Geom-GCN as the method uses a fixed preprocessing scheme that was unfortunately not made public by the authors.

Table 8: Additional experiments illustrating that GPR-GNN escapes over-smoothing. We initialize the GPR weights as $\gamma_k = \delta_{kK}$, as described in Section 5. We report the mean accuracy at Epoch 0 and after training (Final epoch). The over-smoothing ratio indicates how many times out of the 100 runs the initial GPR-GNN predicted the same label for all nodes. For an illustration of how GPR weights change over different epochs, please check Figure 9. Note that the learned GPR weights are all positive for every homophilic dataset. There is at least one negative learned GPR weight for every heterophilic dataset.
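The early stopping rule described in Section A.9 can be sketched as a small helper. This is one reading of the prose rather than the library's code: the function name `should_stop` is hypothetical, and the sketch assumes that the "past 200 epochs" exclude the current epoch.

```python
def should_stop(val_losses, epoch, max_epochs=1000, window=200):
    """Return True if training should stop at `epoch`.

    val_losses: validation losses for epochs 0..epoch (inclusive).
    The check is only active in the second half of training.
    """
    if epoch <= max_epochs // 2:
        return False
    past = val_losses[-(window + 1):-1]         # the `window` epochs before this one
    return val_losses[-1] >= sum(past) / len(past)

# Usage: a steadily decreasing loss never triggers the rule, a jump does.
decreasing = [1.0 / (t + 1) for t in range(600)]
jump = [1.0] * 600 + [2.0]
keeps_training = should_stop(decreasing, epoch=599)
stops_on_jump = should_stop(jump, epoch=600)
too_early = should_stop(jump[:101], epoch=100)
```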



https://github.com/graphdml-uiuc-jlu/geom-gcn http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb



Figure 1: (a) Hidden state feature extraction is performed by a neural network using individual node features propagated via GPR. Note that both the GPR weights $\gamma_k$ and the parameter set {θ} of the neural network are learned simultaneously in an end-to-end fashion (as indicated in red). (b)-(c) The learned GPR weights of the GPR-GNN on real-world datasets. Cora is homophilic while Texas is heterophilic (here, H stands for the level of homophily defined below). An interesting trend may be observed: for the heterophilic case the weights alternate from positive to negative with dampening amplitudes (more examples are provided in Section 5). The shaded region corresponds to a 95% confidence interval.

Figure 3: Figures (a)-(d) show the learned GPR weights of GPR-GNN with random initialization on cSBM, dense split. The shaded region indicates a 95% confidence interval.

Figure 4: Figures (a)-(d) show the learned GPR weights of our GPR-GNN method with random initialization on various datasets, for dense splitting. Figures (e)-(f) show the learned weights of our GPR-GNN method with initialization δ kK on cSBM(φ = -1), for dense splitting. The shaded region indicates a 95% confidence interval.

Figure 5: A simple example demonstrating how GPR-GNN escapes over-smoothing.

Let us consider a simple example shown in Figure 5 involving a connected and undirected graph G = (V, E) (Figure 5 (a)). Consider two different node label assignments shown in Figure 5 (b) and Figure 5 (c). Obviously, the graph topologies depicted in Figure 5 (b) and (c) are identical and the only difference is the class label assignment. In Figure 5 (b), the graph is homophilic and hence the optimal graph filter should emphasize the low-frequency part of the graph signal. In contrast, in Figure 5 (c), the graph is heterophilic as the graph is bipartite with respect to the labels. Hence, the optimal graph filter should emphasize the high-frequency part of the graph signal. This example illustrates that the optimal graph filter should depend on both the graph topology and the node label information. Recall that the equivalent graph filter that APPNP uses in the asymptotic regime is $\Pi_{ppr}$, which is independent of the node label information. Also, Theorem 4.1 established that APPNP intrinsically utilizes a low-pass filter. In contrast, GPR-GNN learns the GPR weights guided by the node label information, which allows it to account for both cases (homophilic and heterophilic) shown.

(a) Case 1. (b) Case 2.

Figure 6: A simple example for explaining the insufficiency of the homophily measure H(G).

As mentioned in Section 5, the homophily measure H(G) is inadequate for characterizing whether a heterophilic graph topology is informative or not. Consider two simple examples depicted in Figure 6.

OF LEMMA A.5 Let β = max(β). Then by the definition of softmax η for η > 0 we have softmax η

Figure 7: Figures (a)-(i) show the learned GPR weights by GPR-GNN with random initialization on cSBM, dense splitting. The shaded region indicates a 95% confidence interval.

Actor, (H(G) = 0.008) (h) Squirrel, (H(G) = 0.055) (i) Texas, (H(G) = 0.016) (j) Cornell, (H(G) = 0.137)

Figure 8: Figures (a)-(j) show the learned GPR weights of GPR-GNN with random initialization on various benchmark datasets, dense splitting. The shaded region indicates a 95% confidence interval. Note that the learned GPR weights are all positive for every homophilic dataset. There is at least one negative learned GPR weight for every heterophilic dataset.

Benchmark dataset properties and statistics.

Results on real world benchmark datasets: Mean accuracy (%) ± 95% confidence interval. Boldface letters are used to mark the best results while underlined boldface letters indicate results within the given confidence interval of the best result.



Consequently, the largest eigenvalue of $\tilde{L}_{sym}$ is bounded by 2, which means that $|\lambda_n| \le 1$. Note that equality holds if and only if the underlying graph is bipartite. However, this is impossible in our setting since we have added a self-loop to each node. Hence $|\lambda_n| < 1$. This means $\lim_{k\to\infty} \tilde{A}_{sym}^k = \pi\pi^T$.
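The role of the self-loops in this argument can be illustrated numerically. The sketch below uses a 6-cycle, which is bipartite, as a hypothetical example; the helper name `sym_norm` is an assumption. Without self-loops the normalized adjacency has an eigenvalue equal to -1, while with self-loops the spectrum lies in (-1, 1] and the powers of $\tilde{A}_{sym}$ converge to $\pi\pi^T$.

```python
import numpy as np

n = 6
A = np.zeros((n, n))
for i in range(n):                              # 6-cycle: bipartite without self-loops
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0

def sym_norm(M):
    """Return D^{-1/2} M D^{-1/2}, where D is the diagonal matrix of row sums of M."""
    d_inv_sqrt = 1.0 / np.sqrt(M.sum(axis=1))
    return M * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

A_sym_plain = sym_norm(A)                       # no self-loops
A_sym_loops = sym_norm(A + np.eye(n))           # self-loop added to every node

evals_plain = np.linalg.eigvalsh(A_sym_plain)   # eigenvalues in ascending order
evals_loops = np.linalg.eigvalsh(A_sym_loops)

# pi_i = sqrt(D~_ii) / sqrt(sum_v D~_vv), the eigenvector for eigenvalue 1.
d = (A + np.eye(n)).sum(axis=1)
pi = np.sqrt(d) / np.sqrt(d.sum())

# A high matrix power approximates the limit pi pi^T.
limit = np.linalg.matrix_power(A_sym_loops, 300)
```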

The values of the homophily measure for cSBM datasets.

Linux Machine with 48 cores, 376GB of RAM, and an NVIDIA Tesla P100 GPU with 12GB of GPU memory. For the training set, we ensure that the number of nodes from each class is approximately the same and keep the total number of training nodes close to 2.5%/60%. For the validation set, we randomly sample 2.5%/20% of the nodes and place the remaining ones into the test set.
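The splitting procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `random_split`, its arguments, and the rounding rule for the per-class training quota are all hypothetical.

```python
import numpy as np

def random_split(labels, train_frac, val_frac, seed=0):
    """Class-balanced training set, random validation set, remainder as test set."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    classes = np.unique(labels)
    per_class = int(round(train_frac * n / len(classes)))   # same quota per class
    train = np.sort(np.concatenate([
        rng.permutation(np.flatnonzero(labels == c))[:per_class] for c in classes
    ]))
    rest = rng.permutation(np.setdiff1d(np.arange(n), train))
    n_val = int(round(val_frac * n))
    val, test = np.sort(rest[:n_val]), np.sort(rest[n_val:])
    return train, val, test

# Usage with a hypothetical two-class label vector and the 60%/20% setting.
labels = np.repeat([0, 1], 100)
train, val, test = random_split(labels, train_frac=0.6, val_frac=0.2)
```

If a class has fewer nodes than the quota, this sketch simply takes all of its nodes, which is why the resulting class counts are only approximately equal in general.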

Results on homophilic real-world benchmark datasets tested in (Pei et al., 2019), dense splitting: Mean accuracy (%) ± 95% confidence interval. Boldface values indicate the best results found, while boldface, underlined values indicate results within the confidence interval of the best result.

ACKNOWLEDGMENTS

The work was supported in part by the NSF Emerging Frontiers of Science of Information Grant 0939370 and the NSF CIF 1618366 Grant.

availability

://github.com/jianhao2016/GPRGNN

annex

Both strict inequalities (b) and (c) follow from the fact that $|\lambda_i| < 1$, $\forall i \ge 2$. Notably, the supremum over $\lambda \in (-1, 1]$ happens at the boundary $\lambda = -1$, which corresponds to the bipartite graph. It further shows that the graph filter with respect to the choice $\gamma_k = (-\alpha)^k$ emphasizes high frequency components and thus it is indeed acting as a high pass filter.

A.4 PROOF OF THEOREM 4.2

We start by introducing some additional notation, lemmas and a definition before we proceed to the formal statement of Theorem 4.2. The label matrix is denoted by $Y \in \mathbb{R}^{n \times C}$, where each row is a one-hot vector. We use $1[\beta] \in \mathbb{R}^C$ to denote the argmax of the vector $\beta \in \mathbb{R}^C$: we have $1[\beta]_i = 1$ if and only if $\beta_i = \max(\beta)$ (ties are broken evenly), and $1[\beta]_i = 0$ otherwise. Let us replace $\mathrm{softmax}(\cdot)$ with $\mathrm{softmax}_\eta(\cdot)$, where we let $\mathrm{softmax}_\eta(\beta)_i = e^{\eta\beta_i} / (\sum_j e^{\eta\beta_j})$ stand for the softmax with a smoothing parameter $\eta > 0$. Note that for $\eta = 1$ we recover the standard softmax. With a slight abuse of notation, for the vector $\beta$ we write $\exp(\beta)$ to denote element-wise exponentiation. We use $\langle \cdot, \cdot \rangle$ to denote the standard Euclidean inner product. Also, we use $L$ for the cross entropy loss.

Lemma A.2. Assume that the nodes in an undirected and connected graph $G$ have one of $C$ labels. Then, for $k$ large enough, we have $H^{(k)} = \pi\beta^T + o_k(1)$, where $\pi_i = \sqrt{\tilde{D}_{ii}} / \sqrt{\sum_{v \in V} \tilde{D}_{vv}}$ and $\beta^T = \pi^T H^{(0)}$.

For any $H^{(0)}$ and large enough $k \le K$, if the label prediction is dominated by $H^{(k)}$, all nodes will have a representation proportional to $\gamma_k \beta$. Hence, we will arrive at the same label for all nodes. This is what we refer to as the over-smoothing phenomenon.

Definition A.3 (The over-smoothing phenomenon). First, recall that $Z = \sum_k \gamma_k H^{(k)}$. If over-smoothing occurs in the GPR-GNN for $K$ sufficiently large, we have $Z_{:j} = c_0 \beta_j \pi$, $\forall j \in [C]$ for some $c_0 > 0$ if $\gamma_k > 0$, and $Z_{:j} = -c_0 \beta_j \pi$, $\forall j \in [C]$ for some $c_0 > 0$ if $\gamma_k < 0$.

Lemma A.4. Let $L = \sum_{i \in T} L_i = \sum_{i \in T} -\log(\langle P_{i:}, Y_{i:} \rangle)$ be the cross entropy loss and let $T$ be the training set. Under the same assumption as given in Lemma A.2, the gradient of $\gamma_k$ for $k$ large enough is

$$\frac{\partial L}{\partial \gamma_k} = \sum_{i \in T} \eta \pi_i \langle P_{i:} - Y_{i:}, \beta \rangle + o_k(1).$$

Lemma A.5. For any real vector $\beta \in \mathbb{R}^C$ and $\eta > 0$ large enough, we have $\mathrm{softmax}_\eta(\beta) = 1[\beta] + o_\eta(1)$.
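The limiting behaviour in Lemma A.5, including the even splitting of ties, can be seen in a small numerical sketch. The vector `beta`, the value of `eta`, and the function names are hypothetical choices for illustration.

```python
import numpy as np

def softmax_eta(beta, eta):
    """softmax_eta(beta)_i = exp(eta * beta_i) / sum_j exp(eta * beta_j), stabilized."""
    e = np.exp(eta * (beta - beta.max()))
    return e / e.sum()

def argmax_indicator(beta):
    """1[beta] with ties broken evenly: mass 1/p on each of the p maximizers."""
    mask = (beta == beta.max()).astype(float)
    return mask / mask.sum()

beta = np.array([0.2, 1.0, 1.0, -0.5])          # two tied maxima, so p = 2

standard = softmax_eta(beta, eta=1.0)           # ordinary softmax
sharp = softmax_eta(beta, eta=200.0)            # close to the argmax indicator
target = argmax_indicator(beta)
```

With `eta = 1.0` every class keeps a visible share of the mass, whereas with a large `eta` the output is numerically indistinguishable from `target`.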

Now we are ready to state the formal version of Theorem 4.2.

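Theorem A.6 (Formal version of Theorem 4.2). Under the same assumptions as those listed in Lemma A.2, if the training set contains nodes from each class, then the GPR-GNN method can always avoid over-smoothing. More specifically, for $k$, $\eta$ large enough we have

$$\frac{\partial L}{\partial \gamma_k} = \sum_{i \in T} \eta \pi_i \left( \max_{j \in [C]} \beta_j - \beta_{1[Y_{i:}]} \right) + o_k(1) + o_\eta(1) \quad \text{if } \gamma_k > 0, \tag{3}$$

$$\frac{\partial L}{\partial \gamma_k} = \sum_{i \in T} \eta \pi_i \left( \min_{j \in [C]} \beta_j - \beta_{1[Y_{i:}]} \right) + o_k(1) + o_\eta(1) \quad \text{if } \gamma_k < 0. \tag{4}$$

Note that when $\gamma_k > 0$, (3) $\ge 0$ when ignoring the $o(1)$ terms. The equality is achieved if and only if $\beta_{1[Y_{i:}]} = \max_{j \in [C]} \beta_j$ for all $i \in T$. This means that over-smoothing results in a prediction that perfectly aligns with the ground truth labels in the training set. However, if our training set contains at least one node from each class then the equality can never be attained. Thus, the gradient of $\gamma_k$ will always be positive when $\gamma_k > 0$. Similarly, when $\gamma_k < 0$, (4) $\le 0$ when ignoring the $o(1)$ terms. The equality is achieved if and only if $\beta_{1[Y_{i:}]} = \min_{j \in [C]} \beta_j$ for all $i \in T$. By the same reasoning we know that under the assumption on the training set the equality can never be attained. Thus, the gradient of $\gamma_k$ will always be negative when $\gamma_k < 0$.

We keep the input graph directed for Geom-GCN as the method uses a fixed preprocessing scheme that was unfortunately not made public by the authors. Our homophily measure values H(G) in Table 1 are all based on undirected graphs and hence the numbers are different from those reported in (Pei et al., 2019).

A.10 ADDITIONAL EXPERIMENTAL RESULTS

Note that the GPR weights $\{\gamma_k\}_{k=0}^{K}$ are identical to $\{-\gamma_k\}_{k=0}^{K}$ in terms of graph filtering.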

