A SIMPLE AND GENERAL GRAPH NEURAL NETWORK WITH STOCHASTIC MESSAGE PASSING

Abstract

Graph neural networks (GNNs) are emerging machine learning models on graphs. One key property behind the expressiveness of existing GNNs is that the learned node representations are permutation-equivariant. Though desirable for certain tasks, permutation-equivariance prevents GNNs from being proximity-aware, i.e., preserving the walk-based proximities between pairs of nodes, which is another critical property for graph analytical tasks. On the other hand, some GNN variants have been proposed to preserve node proximities, but they fail to maintain permutation-equivariance. How to empower GNNs to be proximity-aware while maintaining permutation-equivariance remains an open problem. In this paper, we propose Stochastic Message Passing (SMP), a general and simple GNN that maintains both the proximity-awareness and permutation-equivariance properties. Specifically, we augment existing GNNs with stochastic node representations learned to preserve node proximities. Though seemingly simple, we prove that such a mechanism enables GNNs to preserve node proximities in theory while maintaining permutation-equivariance under a certain parametrization. Extensive experimental results demonstrate the effectiveness and efficiency of SMP on tasks including node classification and link prediction.

1. INTRODUCTION

Graph neural networks (GNNs), as generalizations of neural networks for analyzing graphs, have attracted considerable research attention. GNNs have been widely applied to various applications such as social recommendation (Ma et al., 2019), physical simulation (Kipf et al., 2018), and protein interaction prediction (Zitnik & Leskovec, 2017). One key property of most existing GNNs is permutation-equivariance, i.e., if we randomly permute the IDs of nodes while maintaining the graph structure, the representations of the nodes are permuted accordingly. Mathematically, permutation-equivariance reflects a basic symmetry group of graph structures. Although it is a desirable property for tasks such as node or graph classification (Keriven & Peyré, 2019; Maron et al., 2019b), permutation-equivariance also prevents GNNs from being proximity-aware, i.e., permutation-equivariant GNNs cannot preserve walk-based proximities between nodes such as the shortest distance or high-order proximities (see Theorem 1). Pairwise proximities between nodes are crucial for graph analytical tasks such as link prediction (Hu et al., 2020; You et al., 2019). To enable a proximity-aware GNN, Position-aware GNN (P-GNN) (You et al., 2019) proposes a sophisticated GNN architecture and shows better performance on proximity-aware tasks. However, P-GNN needs to explicitly calculate the shortest distances between nodes, and its computational complexity is unaffordable for large graphs. Moreover, P-GNN completely ignores the permutation-equivariance property and therefore cannot produce satisfactory results when permutation-equivariance is helpful. In real-world scenarios, both proximity-awareness and permutation-equivariance are indispensable properties for GNNs. Firstly, different tasks may require different properties.
For example, recommendation applications usually require the model to be proximity-aware (Konstas et al., 2009), while permutation-equivariance is a basic assumption in centrality measurements (Borgatti, 2005). Even for the same task, different datasets may have different requirements regarding these two properties. Taking link prediction as an example, we observe that permutation-equivariant GNNs such as GCN (Kipf & Welling, 2017) and GAT (Velickovic et al., 2018) show better results than P-GNN on coauthor graphs, but the opposite holds on biological graphs (please see Section 5.2 for details). Unfortunately, in the current GNN frameworks, these two properties are contradictory, as we show in Theorem 1. Whether there exists a general GNN that is proximity-aware while maintaining permutation-equivariance remains an open problem. In this paper, we propose Stochastic Message Passing (SMP), a general and simple GNN that preserves both the proximity-awareness and permutation-equivariance properties. Specifically, we augment existing GNNs with stochastic node representations learned to preserve proximities. Though seemingly simple, we prove that our proposed SMP enables GNNs to preserve walk-based proximities in theory (see Theorem 2 and Theorem 3). Meanwhile, SMP is equivalent to a permutation-equivariant GNN under a certain parametrization and is thus at least as powerful as those GNNs on permutation-equivariant tasks (see Remark 1). Therefore, SMP is general and flexible in handling both proximity-aware and permutation-equivariant tasks, which is also demonstrated by our extensive experimental results. Besides, owing to its simple structure, SMP is computationally efficient, with a running time roughly the same as that of the simplest GNNs such as SGC (Wu et al., 2019), and is at least an order of magnitude faster than P-GNN on large graphs.
Ablation studies further show that a linear instantiation of SMP is expressive enough, as adding extra non-linearities does not lift the performance of SMP on the majority of datasets. Our contributions are as follows:
• We propose SMP, a simple and general GNN to handle both proximity-aware and permutation-equivariant graph analytical tasks.
• We prove that SMP has theoretical guarantees in preserving walk-based proximities and is at least as powerful as the existing GNNs on permutation-equivariant tasks.
• Extensive experimental results demonstrate the effectiveness and efficiency of SMP. We show that a linear instantiation of SMP is expressive enough on the majority of datasets.

2. RELATED WORK

We briefly review GNNs and their permutation-equivariance and proximity-awareness properties. The earliest GNNs adopt a recursive definition of node states (Scarselli et al., 2008; Gori et al., 2005) or a contextual realization (Micheli, 2009). GGS-NNs (Li et al., 2016) replace the recursive definition with recurrent neural networks (RNNs). Spectral GCNs (Bruna et al., 2014) define graph convolutions using graph signal processing (Shuman et al., 2013; Ortega et al., 2018), with ChebNet (Defferrard et al., 2016) and GCN (Kipf & Welling, 2017) approximating the spectral filters with a K-order Chebyshev polynomial and a first-order polynomial, respectively. MPNNs (Gilmer et al., 2017), GraphSAGE (Hamilton et al., 2017), and MoNet (Monti et al., 2017) are proposed as general frameworks characterizing GNNs with a message-passing function and an updating function. More advanced variants such as GAT (Velickovic et al., 2018), JK-Nets (Xu et al., 2018b), GIN (Xu et al., 2018a), and GraphNets (Battaglia et al., 2018) follow these frameworks. Li et al. (2018), Xu et al. (2018a), Morris et al. (2019), and Maron et al. (2019a) show the connection between GNNs and the Weisfeiler-Lehman algorithm (Shervashidze et al., 2011) for graph isomorphism tests, in which permutation-equivariance is a key constraint. Maron et al. (2019b) and Keriven & Peyré (2019) analyze the permutation-equivariance property of GNNs more theoretically. To date, most of the existing GNNs are permutation-equivariant and thus not proximity-aware. The only exception is P-GNN (You et al., 2019), which proposes to capture the positions of nodes using the relative distances between the target node and some randomly chosen anchor nodes. However, P-GNN cannot satisfy permutation-equivariance and is computationally expensive.
Very recently, motivated by enhancing the expressive power of GNNs in graph isomorphism tests and by the distributed computing literature (Angluin, 1980; Linial, 1992; Naor & Stockmeyer, 1995), some studies suggest assigning unique node identifiers in GNNs (Loukas, 2020), such as one-hot IDs (Murphy et al., 2019) or random numbers (Dasoulas et al., 2019; Sato et al., 2020; Corso et al., 2020). For example, Sato et al. (2020) show that random numbers can enhance GNNs in tackling two important combinatorial graph problems with theoretical guarantees, namely the minimum dominating set and the maximum matching problems, and Fey et al. (2020) empirically show the effectiveness of random features in the graph matching problem. Our work differs in that we systematically study how to preserve permutation-equivariance and proximity-awareness simultaneously in a simple yet effective framework, a new topic distinct from these existing works. Besides, we theoretically prove that our proposed method can preserve walk-based proximities using results from the random projection literature. We also demonstrate the effectiveness of our method on various large-scale benchmarks for both node- and edge-level tasks, while no similar results are reported in the literature. The design of our method is also inspired by the random projection literature in dimensionality reduction (Vempala, 2005), and to the best of our knowledge, we are the first to study random projection in the scope of GNNs. More remotely, our definition of node proximities is inspired by and inherited from graph kernels (Gärtner et al., 2003; Borgwardt & Kriegel, 2005), network embedding (Perozzi et al., 2014; Grover & Leskovec, 2016), and general studies of graphs (Newman, 2018).

3. MESSAGE-PASSING GNNS

We consider a graph G = (V, E, F), where V = {v_1, ..., v_N} is the set of N = |V| nodes, E ⊆ V × V is the set of M = |E| edges, and F ∈ R^{N×d_0} is a matrix of d_0-dimensional node features. The adjacency matrix is denoted as A, with its i-th row, j-th column, and (i, j)-th element denoted as A_{i,:}, A_{:,j}, and A_{i,j}, respectively. In this paper, we assume the graph is unweighted and undirected. The neighborhood of node v_i is denoted as N_i and Ñ_i = N_i ∪ {v_i}. The existing GNNs usually follow a message-passing framework (Gilmer et al., 2017), where the l-th layer adopts a neighborhood aggregation function AGG(·) and an updating function UPDATE(·):

m_i^{(l)} = AGG({h_j^{(l)}, ∀j ∈ Ñ_i}), h_i^{(l+1)} = UPDATE([h_i^{(l)}, m_i^{(l)}]),   (1)

where h_i^{(l)} ∈ R^{d_l} is the representation of node v_i in the l-th layer, d_l is the dimensionality, and m_i^{(l)} are the messages. We also denote H^{(l)} = [h_1^{(l)}, ..., h_N^{(l)}], and [·, ·] is the concatenation operation. The node representations are initialized as the node features, i.e., H^{(0)} = F. We denote a GNN following Eq. (1) with L layers as a parameterized function:

H^{(L)} = F_GNN(A, F; W),

where H^{(L)} are the final node representations learned by the GNN and W denotes all the parameters. One key property of the existing GNNs is permutation-equivariance.

Definition 1 (Permutation-equivariance). Consider a graph G = (V, E, F) and any permutation P: V → V so that G' = (V, E', F') has an adjacency matrix A' = PAP^T and a feature matrix F' = PF, where P ∈ {0, 1}^{N×N} is the permutation matrix corresponding to P, i.e., P_{i,j} = 1 iff P(v_i) = v_j. A GNN satisfies permutation-equivariance if the node representations are equivariant with respect to P, i.e., P F_GNN(A, F; W) = F_GNN(PAP^T, PF; W).

It is known that GNNs following Eq. (1) are permutation-equivariant (Maron et al., 2019b).

Definition 2 (Automorphism). A graph G is said to have a (non-trivial) automorphism if there exists a non-identity permutation matrix P ≠ I_N so that A = PAP^T and F = PF.
We denote the corresponding automorphic node pairs as C_G = ∪_{P ≠ I_N} {(i, j) | P_{i,j} ≠ 0, i ≠ j}.

Corollary 1. Using Definitions 1 and 2, if a graph has an automorphism, a permutation-equivariant GNN will produce identical node representations for automorphic node pairs: h_i^{(L)} = h_j^{(L)}, ∀(i, j) ∈ C_G.

Since the node representations are used for downstream tasks, the corollary shows that permutation-equivariant GNNs cannot differentiate automorphic node pairs. A direct consequence of Corollary 1 is that permutation-equivariant GNNs cannot preserve walk-based proximities between pairs of nodes. The formal definitions are as follows.

Definition 3 (Walk-based Proximities). For a given graph G = (V, E, F), we use a matrix S ∈ R^{N×N} to denote walk-based proximities between pairs of nodes, defined as S_{i,j} = S({v_i ⇝ v_j}), where v_i ⇝ v_j denotes walks from node v_i to node v_j and S(·) is an arbitrary real-valued function. The length of a walk-based proximity is the maximum length of all the walks in the proximity.

Typical examples of walk-based proximities include the shortest distance (You et al., 2019), the high-order proximities (a sum of walks weighted by their lengths) (Zhang et al., 2018), and random walk probabilities (Klicpera et al., 2019). Next, we give a definition of preserving walk-based proximities.

Definition 4. For a given walk-based proximity, a GNN is said to preserve the proximity if there exists a decoder function F_de(·) satisfying that, for any graph G = (V, E, F) and any ε > 0, there exist parameters W_G so that

|S_{i,j} − F_de(H_{i,:}^{(L)}, H_{j,:}^{(L)})| < ε, ∀i, j,

where H^{(L)} = F_GNN(A, F; W_G). Note that we do not constrain the GNN architecture as long as it follows Eq. (1), and the decoder function is also arbitrary (but notice that it cannot take the graph structure as input). In fact, both the GNN and the decoder function can be arbitrarily deep and have sufficient hidden units.

Theorem 1.
The existing permutation-equivariant GNNs cannot preserve any walk-based proximity, except for the trivial solution in which all node pairs have the same proximity. The formal statement and proof of the theorem are given in Appendix A.1. Since walk-based proximities are rather general and widely adopted in graph analytical tasks such as link prediction, the theorem shows that the existing permutation-equivariant GNNs cannot handle these tasks well.
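To make Corollary 1 concrete, the following minimal numpy sketch (our own illustration, not the paper's code) runs a sum-aggregation message-passing layer on a 3-node path graph, in which the two endpoints form an automorphic pair. Their representations stay identical after any number of layers, so no decoder applied to them can assign the pair different proximities.

```python
import numpy as np

def mp_layer(A, H, W):
    """One permutation-equivariant message-passing layer:
    sum-aggregate neighbours plus self-loop (AGG), then a
    shared non-linear update (UPDATE), as in Eq. (1)."""
    A_hat = A + np.eye(len(A))          # include the node itself
    return np.tanh(A_hat @ H @ W)

rng = np.random.default_rng(0)
# Path graph v0 - v1 - v2: v0 and v2 form an automorphic node pair.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.ones((3, 4))                     # identical node features
W = rng.normal(size=(4, 4))

for _ in range(3):                      # three message-passing rounds
    H = mp_layer(A, H, W)

# Automorphic nodes end up with identical representations (Corollary 1),
# while the centre node, whose neighbourhood differs, does not.
print(np.allclose(H[0], H[2]), np.allclose(H[0], H[1]))  # prints: True False
```

The same symmetry argument underlies the proof of Theorem 1: any decoder sees identical inputs for automorphic pairs, so it cannot output distinct proximities.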

4. THE MODEL

4.1. A GNN FRAMEWORK USING STOCHASTIC MESSAGE PASSING

A major shortcoming of permutation-equivariant GNNs is that they cannot differentiate automorphic node pairs. To solve this problem, we need to introduce a "symmetry breaking" mechanism, i.e., to enable GNNs to distinguish such nodes. To achieve this goal, we sample a stochastic matrix E ∈ R^{N×d} whose elements are drawn i.i.d. from the standard Gaussian distribution, so that the distance between the stochastic signals of any two nodes follows a chi distribution with d degrees of freedom:

(1/√2) ||E_{i,:} − E_{j,:}|| ∼ χ_d, ∀i ≠ j.

When d is reasonably large, e.g., d > 20, the probability of two signals being close is very low. Then, inspired by the message-passing framework, we apply a GNN to the stochastic matrix so that nodes can exchange information about the stochastic signals: Ẽ = F_GNN(A, E; W). We call Ẽ the stochastic representation of the nodes. Using the stochastic matrix and message passing, Ẽ can be used to preserve node proximities (see Theorem 2 and Theorem 3). Then, to let our model still utilize node features, we concatenate Ẽ with the node representations from another GNN that takes the node features as input:

H = F_output([Ẽ, H^{(L)}]), Ẽ = F_GNN(A, E; W), H^{(L)} = F_GNN(A, F; W'),   (10)

where F_output(·) is an aggregation function such as a linear function or simply the identity mapping. In a nutshell, our proposed method augments the existing GNNs with a stochastic representation learned by message passing to differentiate nodes and preserve node proximities. There is also a delicate choice worth mentioning, namely whether the stochastic matrix E is fixed or resampled in each epoch. By fixing E, the model can learn to memorize the stochastic representations and distinguish different nodes, but at the cost of being unable to handle nodes not seen during training. On the other hand, by resampling E in each epoch, the model gains better generalization ability, since it cannot simply remember one specific stochastic matrix. However, the node representations are then not fixed (though pairwise proximities are still preserved; see Theorem 2).
In these cases, Ẽ is more capable of handling pairwise tasks such as link prediction or pairwise node classification. In this paper, we use a fixed E for transductive datasets and resample E for inductive datasets. Time Complexity. From Eq. (10), the time complexity of our framework mainly depends on the two GNNs learning the stochastic and permutation-equivariant node representations. In this paper, we instantiate these two GNNs using simple message-passing GNNs such as GCN (Kipf & Welling, 2017) and SGC (Wu et al., 2019) (see Section 4.2 and Section 4.3). Thus, the time complexity of our method is the same as that of these models, which is O(M), i.e., linear with respect to the number of edges. We also empirically compare the running time of different models in Section 5.5. Besides, many acceleration schemes for GNNs, such as sampling (Chen et al., 2018a; b; Huang et al., 2018) or partitioning the graph (Chiang et al., 2019), can be directly applied to our framework.
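As a sanity check on the separation claim above, the numpy sketch below (our own illustration; the sizes are arbitrary) samples E with i.i.d. standard Gaussian entries and verifies that the scaled pairwise distances, which follow a χ_d distribution concentrated around √d, stay far from zero for every pair of nodes:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 64                 # d reasonably large, e.g. d > 20

# Stochastic matrix: one i.i.d. standard-Gaussian signal per node.
E = rng.normal(size=(N, d))

# (1 / sqrt(2)) * ||E_i - E_j|| follows a chi distribution with d
# degrees of freedom, which concentrates sharply around sqrt(d).
dist = np.linalg.norm(E[:, None, :] - E[None, :, :], axis=-1) / np.sqrt(2)
off_diag = dist[~np.eye(N, dtype=bool)]

print(off_diag.mean() / np.sqrt(d))   # close to 1: mean distance ~ sqrt(d)
print(off_diag.min())                 # far from 0: no two signals collide
```

Because no two rows of E come close to coinciding, the stochastic signals act as (soft) node identifiers, which is exactly the symmetry-breaking role they play in Eq. (10).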

4.2. A LINEAR INSTANTIATION

Based on the general framework shown in Eq. (10), we attempt to explore its minimal model instantiation, i.e., a linear model. Specifically, inspired by Simplified Graph Convolution (SGC) (Wu et al., 2019), we adopt linear message passing for both GNNs:

H = F_output([Ẽ, H^{(L)}]) = F_output([Ã^K E, Ã^K F]),   (11)

where Ã = (D + I)^{−1/2}(A + I)(D + I)^{−1/2} is the normalized graph adjacency matrix with self-loops proposed in GCN (Kipf & Welling, 2017) and K is the number of propagation steps. We also set F_output(·) in Eq. (11) as a linear mapping or the identity mapping. Though seemingly simple, we show that such an SMP instantiation possesses a theoretical guarantee in preserving walk-based proximities.

Theorem 2. An SMP following Eq. (11) with the message-passing matrix Ã and K propagation steps can preserve the walk-based proximity Ã^K(Ã^K)^T with high probability if the dimensionality d of the stochastic matrix is sufficiently large, where the superscript T denotes the matrix transpose.

The theorem holds regardless of whether E is fixed or resampled. The mathematical formulation and proof of the theorem are given in Appendix A.2. In addition, we show that SMP is equivalent to a permutation-equivariant GNN under a certain parametrization.

Remark 1. Suppose we adopt F_output(·) as a linear function whose output dimensionality is the same as that of F_GNN. Then, Eq. (10) is equivalent to the permutation-equivariant F_GNN(A, F; W') if the parameters in F_output(·) are all-zeros for Ẽ and an identity matrix for H^{(L)}.

The result follows directly from the definition. Then, we have the following corollary.

Corollary 2. For any task, Eq. (10) with the aforementioned linear F_output(·) is at least as powerful as the permutation-equivariant F_GNN(A, F; W'), i.e., the minimum training loss of using H in Eq. (10) is equal to or smaller than that of using H^{(L)} = F_GNN(A, F; W').
In other words, SMP does not hinder the performance even when the task is permutation-equivariant, since the stochastic representations are concatenated with the permutation-equivariant GNN representations and followed by a linear mapping. In these cases, the linear SMP is equivalent to SGC (Wu et al., 2019). Combining Theorem 2 and Corollary 2, the linear SMP instantiation in Eq. (11) is capable of handling both proximity-aware and permutation-equivariant tasks.
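The stochastic part of Eq. (11), together with the guarantee of Theorem 2, can be simulated in a few lines of numpy (an illustrative sketch with a toy ring graph and hypothetical sizes, not the authors' code): the inner products of the propagated stochastic signals, scaled by 1/d, approximate the walk-based proximity Ã^K(Ã^K)^T, and the approximation tightens as d grows.

```python
import numpy as np

def normalized_adj(A):
    """GCN-style normalization with self-loops:
    (D + I)^(-1/2) (A + I) (D + I)^(-1/2)."""
    A_hat = A + np.eye(len(A))
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]

rng = np.random.default_rng(0)
N, d, K = 30, 4096, 2          # toy sizes; d large for a tight estimate

# Toy ring graph.
A = np.zeros((N, N))
for i in range(N):
    A[i, (i + 1) % N] = A[(i + 1) % N, i] = 1.0

A_tilde = normalized_adj(A)
A_K = np.linalg.matrix_power(A_tilde, K)

# Linear SMP, stochastic part of Eq. (11): E_tilde = A_tilde^K E.
E = rng.normal(size=(N, d))
E_tilde = A_K @ E

# Theorem 2: (1/d) E_tilde E_tilde^T approximates A_tilde^K (A_tilde^K)^T.
S_true = A_K @ A_K.T
S_est = (E_tilde @ E_tilde.T) / d
print(np.abs(S_true - S_est).max())    # small; shrinks as d grows
```

Note that the feature branch Ã^K F is untouched by the stochastic signals, which mirrors Remark 1: a linear F_output can simply zero out the stochastic columns and recover SGC.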

4.3. NON-LINEAR EXTENSIONS

One may question whether a more sophisticated variant of Eq. (10) can further improve the expressiveness of SMP. There are three adjustable components in Eq. (10): the two GNNs propagating the stochastic matrix and the node features, respectively, and the output function. In theory, adopting non-linear models for any of these components can enhance the expressiveness of SMP. Indeed, if we use a sufficiently expressive GNN in learning Ẽ instead of linear propagation, we can prove a more general version of Theorem 2, as follows.

Theorem 3. An SMP variant following Eq. (10) with F_GNN(A, E; W) containing L layers can preserve any length-L walk-based proximity if the message-passing and updating functions in the GNN are sufficiently expressive.

In this theorem, we also assume the Gaussian random vectors in E are rounded to machine precision so that E is drawn from a countable subspace of R. The proof of the theorem is given in Appendix A.3. Similarly, we can adopt more advanced methods for F_output(·), such as gating or attention, so that the two GNNs are integrated more properly. Although non-linear extensions of SMP can, in theory, increase the model expressiveness, they also carry a higher risk of overfitting due to the added model complexity, not to mention the increased computational cost. In practice, we find in the ablation studies that the linear SMP instantiation in Eq. (11) works reasonably well on most of the datasets (please refer to Section 5.4 for further details).

Baselines. We adopt two sets of baselines. The first set consists of permutation-equivariant GNNs, including GCN (Kipf & Welling, 2017), GAT (Velickovic et al., 2018), and SGC (Wu et al., 2019), which are widely adopted GNN architectures. The second set contains P-GNN (You et al., 2019), the only proximity-aware GNN to date. We use the P-GNN-F version.

5. EXPERIMENTS

5.1. EXPERIMENTAL SETUP

In comparison with the baselines, we mainly evaluate two variants of SMP with different F_output(·): SMP-Identity, i.e., F_output(·) as an identity mapping, and SMP-Linear, i.e., F_output(·) as a linear mapping. Note that both variants adopt linear message-passing functions as in SGC. We conduct more ablation studies with different SMP variants in Section 5.4. For fair comparisons, we adopt the same architecture and hyper-parameters for all the methods (please refer to Appendix C.2 for the details). For datasets without node features, we adopt a constant vector as the node features. We experiment on two tasks: link prediction and node classification. Additional experiments on graph reconstruction, pairwise node classification, and running time comparison are provided in Appendix B. We repeat the experiments 10 times on all datasets except PPA, for which we repeat them 3 times, and report the average results.

5.2. LINK PREDICTION

Link prediction aims to predict missing links of a graph. Specifically, we split the edges into 80%-10%-10% subsets and use them for training, validation, and testing, respectively. Besides adopting the real edges as positive samples, we obtain negative samples by randomly sampling an equal number of node pairs that do not have edges. For all the methods, we use a simple classifier, Sigmoid(H_{i,:} H_{j,:}^T), i.e., the inner product, to predict whether a node pair (v_i, v_j) forms a link, and use AUC (area under the curve) as the evaluation metric. One exception to the aforementioned setting is the PPA dataset, for which we follow the splits and evaluation metric (i.e., Hits@100) provided with the dataset (Hu et al., 2020). The results except PPA are shown in Table 2. We make the following observations.
• Our proposed SMP achieves the best results on five out of the six datasets and is highly competitive (the second-best result) on the other (Physics). The results demonstrate the effectiveness of our proposed method on link prediction tasks. We attribute the strong performance of SMP to its capability of maintaining both the proximity-awareness and permutation-equivariance properties.
• On Grid, Communities, Email, and PPI, both SMP and P-GNN outperform the permutation-equivariant GNNs, demonstrating the importance of preserving node proximities. Although SMP is simpler and more computationally efficient than P-GNN, it reports even better results.
• When node features are available (CS, Physics, and PPI), SGC can outperform GCN and GAT. The results re-validate the experiments in SGC (Wu et al., 2019), showing that the non-linearity in GNNs is not necessarily indispensable. One plausible reason is that the additional model complexity brought by non-linear operators makes the models prone to overfitting and difficult to train (see Appendix B.6).
On those datasets, SMP retains comparable performance on the two coauthor graphs and shows better performance on PPI, possibly because node features on protein graphs are less informative than node features on coauthor graphs for predicting links, and thus preserving the graph structure is more beneficial on PPI.
• Since Email and PPI are evaluated in an inductive setting, i.e., using different graphs for training/validation/testing, the results show that SMP can handle inductive tasks as well.
The results on PPA are shown in Table 1. SMP again outperforms all the baselines, showing that it can handle large-scale graphs with millions of nodes and edges. PPA is part of the recently released Open Graph Benchmark (Hu et al., 2020). The superior performance on PPA further demonstrates the effectiveness of our proposed method on the link prediction task.
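For reference, the evaluation protocol described above can be sketched in numpy (our own simplified version; the representations and pair lists are hypothetical): each node pair is scored by the sigmoid of the inner product of its representations, and the positive/negative scores are summarized by AUC.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def auc_score(pos, neg):
    """AUC = probability that a random positive pair outscores a
    random negative pair, counting ties as 1/2."""
    diff = pos[:, None] - neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 8))             # hypothetical node representations
H[1] = H[0] + 0.1 * rng.normal(size=8)  # (v0, v1): a linked, "close" pair
H[3] = H[2] + 0.1 * rng.normal(size=8)  # (v2, v3): another linked pair

pos_pairs, neg_pairs = [(0, 1), (2, 3)], [(0, 4), (2, 5)]
pos = np.array([sigmoid(H[i] @ H[j]) for i, j in pos_pairs])
neg = np.array([sigmoid(H[i] @ H[j]) for i, j in neg_pairs])
print(auc_score(pos, neg))
```

Since the decoder is a fixed inner product, any gain in AUC must come from the learned representations, which is what makes this setup a clean probe of proximity-awareness.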

5.3. NODE CLASSIFICATION

Next, we conduct experiments on node classification, i.e., predicting the labels of nodes. Since we need ground-truths for the evaluation, we only adopt datasets with node labels. Specifically, for CS and Physics, following (Shchur et al., 2018), we adopt 20/30 labeled nodes per class for training/validation and the rest for testing. For Communities, we adjust the numbers to 5/5/10 labeled nodes per class for training/validation/testing. For Cora, CiteSeer, and PubMed, we use the default splits that come with the datasets. We do not adopt Email because some graphs in the dataset are too small to show stable results, and we exclude PPI as it is a multi-label dataset.

Under review as a conference paper at ICLR 2021

We use a softmax layer on the learned node representations as the classifier and adopt accuracy, i.e., the percentage of nodes that are correctly classified, as the evaluation metric. We omit the results of SMP-Identity for this task since the node representations in SMP-Identity have a fixed dimensionality that does not match the number of classes. The results are shown in Table 3. From the table, we observe that SMP reports nearly perfect results on Communities. Since the node labels on Communities are generated from the graph structure and there are no node features, a model needs to be proximity-aware to handle this dataset well. Interestingly, P-GNN, which shows promising results in the link prediction task, also fails miserably here. On the other five graphs, SMP reports highly competitive performance. These graphs are commonly used benchmarks for GNNs. P-GNN, which completely ignores permutation-equivariance, performs poorly as expected. In contrast, SMP manages to recover the permutation-equivariant GNNs and avoid being misled, as proven in Remark 1. In fact, SMP even shows better results than its counterpart, SGC, indicating that preserving proximities is also helpful on these datasets.
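The classification head used above is just a softmax layer on top of the learned representations; a minimal numpy sketch (with hypothetical, untrained weights in place of the cross-entropy-trained classifier) looks as follows.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def accuracy(scores, labels):
    """Fraction of nodes whose arg-max class equals the label."""
    return float((scores.argmax(axis=1) == labels).mean())

rng = np.random.default_rng(0)
H = rng.normal(size=(100, 16))   # learned node representations (hypothetical)
W = rng.normal(size=(16, 7))     # classifier weights; trained with
                                 # cross-entropy in practice, random here
probs = softmax(H @ W)           # per-node class distribution

labels = rng.integers(0, 7, size=100)
print(accuracy(probs, labels))   # ~1/7 for random weights and labels
```

This also shows why SMP-Identity is skipped for this task: with an identity F_output, the representation width is fixed by d and d_L, so it cannot be matched to the number of classes without a learnable output layer.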

5.4. ABLATION STUDIES

We conduct ablation studies by comparing different SMP variants, including SMP-Identity, SMP-Linear, and the following three additional variants:
• SMP-MLP: we set F_output(·) as a fully-connected network with 1 hidden layer.
• SMP-Linear-GCN_feat: we set F_GNN(A, F; W') in Eq. (10) to be a GCN (Kipf & Welling, 2017), i.e., introduce non-linearity in the message passing for the features. F_output(·) is still linear.
• SMP-Linear-GCN_both: we set both F_GNN(A, E; W) and F_GNN(A, F; W') to be GCNs (Kipf & Welling, 2017), i.e., introduce non-linearity in the message passing for both the features and the stochastic representations. F_output(·) is linear.
We show the results for link prediction tasks in Table 4. The results for node classification and pairwise node classification, which imply similar conclusions, are provided in Table 10 and Table 11 in Appendix B.5. We make the following observations.
• In general, SMP-Linear performs sufficiently well, achieving the best or second-best results on six datasets and being highly competitive on the other (Communities). SMP-Identity, which has no parameters in the output function, performs slightly worse. The results demonstrate the importance of adopting a learnable linear layer in the output function, which is consistent with Remark 1. SMP-MLP does not lift the performance in general, showing that adding extra complexity to F_output(·) brings no gain on these datasets.
• SMP-Linear-GCN_feat reports the best results on Communities, PPI, and PPA, indicating that adding extra non-linearities in propagating node features is helpful for some graphs.
• SMP-Linear-GCN_both reports the best results on Grid with a considerable margin. Recall that Grid has no node features. The results indicate that introducing non-linearities can help the stochastic representations capture more proximities, which is more helpful for featureless graphs.

5.5. EFFICIENCY COMPARISON

To compare the efficiency of different methods quantitatively, we report the running time of different methods in Table 5. The results are averaged over 3,000 epochs on an NVIDIA TESLA M40 GPU with 12 GB of memory. The results show that SMP is computationally efficient: it is only marginally slower than SGC and comparable to GCN. P-GNN is at least an order of magnitude slower, except on extremely small graphs such as Grid, Communities, and Email, which have no more than a thousand nodes. In addition, its expensive memory cost makes P-GNN unable to work on large-scale graphs.

5.6. MORE EXPERIMENTAL RESULTS

Besides the aforementioned experiments, we also conduct experiments on the following tasks: graph reconstruction (Appendix B.1), pairwise node classification (Appendix B.2), and comparing with one-hot IDs (Appendix B.3). Please refer to the Appendix for experimental results and corresponding analyses.

6. CONCLUSION

In this paper, we propose SMP, a general and simple GNN that maintains both the proximity-awareness and permutation-equivariance properties. We propose to augment the existing GNNs with stochastic node representations learned to preserve node proximities. We prove that SMP enables GNNs to preserve node proximities in theory and is equivalent to a permutation-equivariant GNN under a certain parametrization. Experimental results demonstrate the effectiveness and efficiency of SMP. Ablation studies show that a linear SMP instantiation works reasonably well on most of the datasets.

A THEOREMS AND PROOFS

A.1 THEOREM 1

Here we formulate and prove Theorem 1.

Theorem 1. For any walk-based proximity function S(·), a permutation-equivariant GNN cannot preserve S(·), except for the trivial solution in which all node pairs have the same proximity, i.e., S_{i,j} = c, ∀i, j, where c is a constant.

Proof. We prove the theorem by contradiction. Assume there exists a non-trivial S(·) that a permutation-equivariant GNN can preserve. Consider any graph G = (V, E, F) and denote N = |V|. We can create G' = (V', E', F') with |V'| = 2N so that:

E'_{i,j} = E_{i,j} if i ≤ N, j ≤ N; E_{i−N,j−N} if i > N, j > N; 0 otherwise,
F'_{i,:} = F_{i,:} if i ≤ N; F_{i−N,:} if i > N.

Basically, we generate two "copies" of the original graph, one indexed from 1 to N and the other indexed from N + 1 to 2N. By assumption, there exists a permutation-equivariant GNN that can preserve S(·) on G', and we denote the node representations as H^{(L)} = F_GNN(A', F'; W_{G'}). It is easy to see that nodes v_i and v_{i+N} in G' form an automorphic node pair. Using Corollary 1, their representations are identical in any permutation-equivariant GNN, i.e., H^{(L)}_{i,:} = H^{(L)}_{i+N,:}, ∀i ≤ N. Also, note that there exists no walk between the two copies, i.e., v_i ⇝ v_j = v_j ⇝ v_i = ∅, ∀i ≤ N, j > N. As a result, for ∀i ≤ N, j ≤ N and ∀ε > 0, we have:

|S_{i,j} − S(∅)| ≤ |S_{i,j} − F_de(H^{(L)}_{i,:}, H^{(L)}_{j,:})| + |S(∅) − F_de(H^{(L)}_{i,:}, H^{(L)}_{j,:})|
= |S_{i,j} − F_de(H^{(L)}_{i,:}, H^{(L)}_{j,:})| + |S_{i,j+N} − F_de(H^{(L)}_{i,:}, H^{(L)}_{j+N,:})| < 2ε.

We can prove the same for ∀i > N, j > N. The equation naturally holds if i ≤ N, j > N or i > N, j ≤ N, since v_i ⇝ v_j = ∅. Combining the results, we have |S_{i,j} − S(∅)| < 2ε, ∀ε > 0, ∀i, j. Since ε can be arbitrarily small, all node pairs have the same proximity c = S(∅), which leads to a contradiction and finishes our proof.
Notice that in our proof, G' can be constructed for any graph, so rather than designing one specific counter-example, we have shown that there always exists an infinite number of counter-examples by constructing automorphisms in the graph. Some may find that our counter-examples in the above proof lead to multiple connected components. Next, we give an alternative proof maintaining one connected component (assuming the original graph is connected) under the assumption that the walk-based proximity is of finite length.

Proof. Similar to the previous proof, we assume there exists a non-trivial S(·) that a permutation-equivariant GNN can preserve. Besides, we assume the length of S(·) is upper bounded by l_max, where l_max is any finite number, i.e., ∀i, j, S_{i,j} = S({v_i ⇝ v_j}) = S({v_i ⇝ v_j | len(v_i ⇝ v_j) ≤ l_max}).

A.2 THEOREM 2

Here we prove Theorem 2. During the proof, we only utilize Ẽ, which is easily achieved by ignoring H^{(L)} in F_output([Ẽ, H^{(L)}]); e.g., if we set F_output as a linear function, the model can learn to set the corresponding weights for H^{(L)} to all-zeros. We set the decoder function as a normalized inner product: F_de(H_{i,:}, H_{j,:}) = (1/d) H_{i,:} H_{j,:}^T. Then, denoting a_i = Ã^K_{i,:} (so that S_{i,j} = [Ã^K(Ã^K)^T]_{i,j} = a_i a_j^T) and recalling Ẽ = Ã^K E, we have:

|S_{i,j} − F_de(H_{i,:}, H_{j,:})| = |a_i a_j^T − (1/d) Ẽ_{i,:} Ẽ_{j,:}^T| = |a_i a_j^T − a_i (1/d) E E^T a_j^T|.

Since E is a Gaussian random matrix, from the Johnson-Lindenstrauss lemma (Vempala, 2005) (in its inner-product-preservation form, e.g., see Corollary 2.1 and its proof in (Sham & Greg, 2020)), for all 0 < ε < 1/2 we have:

P(|a_i a_j^T − a_i (1/d) E E^T a_j^T| ≤ (ε/2)(||a_i||² + ||a_j||²)) > 1 − 4e^{−(ε² − ε³)d/4}.

By setting ε = ε'/max_i ||a_i||², we have ε' ≥ (ε/2)(||a_i||² + ||a_j||²) and:

P(|S_{i,j} − F_de(H_{i,:}, H_{j,:})| < ε') > 1 − 4e^{−((ε'/max_i ||a_i||²)² − (ε'/max_i ||a_i||²)³) d / 4},

which leads to the theorem by solving for d and setting d_0 as follows:

4e^{−((ε'/max_i ||a_i||²)² − (ε'/max_i ||a_i||²)³) d_0 / 4} = δ  ⟹  d_0 = 4 log(4/δ) (max_i ||a_i||²)³ / (ε'² max_i ||a_i||² − ε'³).

A.3 THEOREM 3

Here we formulate and prove Theorem 3.
Note that some notations and definitions are introduced in Appendix A.1.

Theorem 3. For any length-L walk-based proximity, i.e., S_{i,j} = S({v_i ⇝ v_j}) = S({v_i ⇝ v_j | len(v_i ⇝ v_j) ≤ L}), where len(·) is the length of a walk, there exists an SMP variant in Eq. (10) with F_GNN(A, E; W) containing L layers (including the input layer) that preserves that proximity if the following conditions hold: (1) The stochastic matrix E contains unique signals for different nodes, i.e., E_{i,:} ≠ E_{j,:}, ∀i ≠ j. (2) The message-passing and updating functions in learning Ẽ are bijective. (3) The decoder function F_de(·) also takes E as input and is a universal approximator.

Proof. Similar to Theorem 2, we only utilize Ẽ in our proof. We use e^(l)_i, 0 ≤ l < L, to denote the node representations in the l-th layer of F_GNN(A, E; W), i.e., e^(0)_i = E_{i,:} and e^(L-1)_i = Ẽ_{i,:}. Our proof strategy is to show that the stochastic node representations can remember all the information about the walks. Firstly, as the message-passing and updating functions are bijective by assumption, we can recover from the node representations in each layer all of their neighborhood representations in the previous layer. Specifically, there exist F^(l)(·), 1 ≤ l < L, such that:

F^(l)(e^(l)_i) = (e^(l-1)_i, {e^(l-1)_j, j ∈ N_i}).foot_6

For notational convenience, we split the function into two parts, one for the node itself and one for its neighbors:

F^(l)_self(e^(l)_i) = e^(l-1)_i,  F^(l)_neighbor(e^(l)_i) = {e^(l-1)_j, j ∈ N_i}.

For the first function, if we successively apply such functions from the l-th layer down to the input layer, we can recover the input features of the GNN, i.e., E. Since the stochastic matrix E contains a unique signal for each node, we can decode the node ID from e^(0)_i, i.e., there exists F^(0)_self such that F^(0)_self(e^(0)_i; E) = i. For brevity, we denote applying these l + 1 functions to obtain the node ID as:

F^(0:l)_self(e^(l)_i) = F^(0)_self(F^(1)_self(... F^(l)_self(e^(l)_i) ...); E) = i.
For the second function, we can apply F^(l-1)_neighbor to the decoded vector set to recover the neighborhood representations in the (l - 2)-th layer, and so on. Next, we show that for e^(l-1)_j, there exists a length-l walk v_i ⇝ v_j = (v_{a_1}, v_{a_2}, ..., v_{a_l}), where v_{a_1} = v_i and v_{a_l} = v_j, if and only if F^(0:l-1)_self(e^(l-1)_j) = a_l = j and there exist e^(l-2), ..., e^(0) such that:

e^(l-2) ∈ F^(l-1)_neighbor(e^(l-1)_j),  F^(0:l-2)_self(e^(l-2)) = a_{l-1},
e^(l-3) ∈ F^(l-2)_neighbor(e^(l-2)),  F^(0:l-3)_self(e^(l-3)) = a_{l-2},
...
e^(0) ∈ F^(1)_neighbor(e^(1)),  F^(0:0)_self(e^(0)) = a_1 = i.

This result is easily verified as:

(v_{a_1}, v_{a_2}, ..., v_{a_l}) is a walk ⇔ A_{a_i,a_{i+1}} = A_{a_{i+1},a_i} = 1, ∀1 ≤ i < l ⇔ a_i ∈ N_{a_{i+1}}, ∀1 ≤ i < l ⇔ ∃e^(i-1) ∈ F^(i)_neighbor(e^(i)), F^(0:i-1)_self(e^(i-1)) = a_i, ∀1 ≤ i < l.   (30)

Note that all this information is encoded in Ẽ, i.e., we can decode {v_i ⇝ v_j | len(v_i ⇝ v_j) ≤ L} from e^(L-1)_j by successively applying F^(l)_self(·), 0 ≤ l < L, and F^(l)_neighbor(·), 1 ≤ l < L. In other words, there exists a function F(·,·), composed of these functions, such that F(e^(L-1)_j, e^(L-1)_i) = {v_i ⇝ v_j | len(v_i ⇝ v_j) ≤ L}. Applying the proximity function S(·) to this set yields S_{i,j}. We finish the proof by setting the real decoder function F_de(·) to arbitrarily approximate the desired function S(F(·,·)) under the universal approximation assumption.
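The walk sets {v_i ⇝ v_j | len(v_i ⇝ v_j) ≤ L} that the proof decodes from Ẽ can also be described through adjacency powers: (A^l)_{i,j} counts the length-l walks from v_i to v_j. A small illustrative check (the brute-force enumeration below exists only to verify this identity on a toy graph):

```python
import numpy as np
from itertools import product

A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], float)   # triangle graph

def count_walks(A, i, j, l):
    """Count length-l walks from v_i to v_j by brute-force enumeration of midpoints."""
    n = A.shape[0]
    total = 0
    for mids in product(range(n), repeat=l - 1):
        path = (i, *mids, j)
        if all(A[a, b] == 1 for a, b in zip(path, path[1:])):
            total += 1
    return total

L = 3
for l in range(1, L + 1):
    Al = np.linalg.matrix_power(A, l)
    assert all(Al[i, j] == count_walks(A, i, j, l)
               for i in range(3) for j in range(3))
print("A^l entries match brute-force walk counts for l <= 3")
```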

B.1 GRAPH RECONSTRUCTION

To verify that our proposed SMP can indeed preserve node proximities, we conduct graph reconstruction experiments (Wang et al., 2016), i.e., using the node representations learned by GNNs to reconstruct the edges of the graph. Graph reconstruction corresponds to the first-order proximity between nodes, i.e., whether two nodes are directly connected, which is the most straightforward node proximity (Tang et al., 2015). Specifically, following Section 5.2, we adopt the inner-product classifier Sigmoid(H_{i,:} H_{j,:}^T) and use AUC as the evaluation metric. To control for the impact of node features (since many graphs exhibit assortative mixing, even models using only node features can reconstruct the edges to a certain extent), we do not use node features for any of the models. We report the results in Table 6. The results show that SMP greatly outperforms permutation-equivariant GNNs such as GCN and GAT in graph reconstruction, clearly demonstrating that SMP can better preserve node proximities. P-GNN shows results highly competitive with SMP. However, as in the other tasks, its intensive memory usage makes P-GNN unable to handle medium-scale graphs such as Physics and PubMed.

B.2 PAIRWISE NODE CLASSIFICATION

Pairwise node classification, i.e., predicting whether two nodes share the same label, focuses on the relations between nodes and thus requires the model to be proximity-aware to perform well. Similar to link prediction, we split the positive samples (i.e., node pairs with the same label) into an 80%-10%-10% training-validation-testing set with an equal number of randomly sampled negative pairs. For large graphs, since enumerating all possible positive samples is intractable (i.e., O(N²)), we use a random subset. Since we also need node labels as the ground truth, we only conduct pairwise node classification on datasets where node labels are available. We also exclude the results on PPI since the dataset is multi-label and cannot be used in a pairwise setting (You et al., 2019). Similar to Section 5.2, we adopt a simple inner-product classifier and use AUC as the evaluation metric.
The results are shown in Table 7. We observe results consistent with link prediction in Section 5.2, i.e., SMP reports the best results on four datasets and the second-best results on the other three. These results again verify that SMP can effectively preserve and utilize node proximities when needed, while retaining comparable performance when the tasks are more permutation-equivariant-like, e.g., on CS and Physics.

B.3 COMPARISON WITH ONE-HOT IDS

We further compare SMP with augmenting GNNs using a one-hot encoding of node IDs, i.e., the identity matrix. Intuitively, since node IDs are unique, such a method does not suffer from the automorphism problem and should also enable GNNs to preserve node proximities. Theoretically, however, such a one-hot encoding has two major problems. Firstly, the dimensionality of the identity matrix is N × N, and thus the number of parameters in the first message-passing layer is also on the order of O(N). The method is therefore inevitably computationally expensive and may not scale to large graphs; a large number of parameters also makes overfitting more likely. Secondly, node IDs are not transferable across different graphs, i.e., the node v_1 in one graph and the node v_1 in another graph do not necessarily share a similar meaning. But since the parameters in the message passing depend on the node IDs (as they are input features), such a mechanism cannot handle inductive tasks well.foot_7

We also empirically compare this method with SMP and report the results in Table 8. The results show that SMP-Linear outperforms GCN-onehot in most cases. Besides, GCN-onehot fails to handle Physics, which is only a medium-scale graph, due to its heavy memory usage. One surprising result is that GCN-onehot outperforms SMP-Linear on Grid, the simulated graph where nodes are placed on a 20 × 20 grid.
A plausible reason is that, since the edges in Grid follow a specific rule, the one-hot encoding gives GCN-onehot enough flexibility to learn and remember the rule, and the model does not overfit because the graph is rather small.
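The scalability argument against one-hot IDs can be made concrete by counting first-layer parameters: with one-hot inputs the first weight matrix is N × h and grows with the graph, while SMP's stochastic features keep it d × h, independent of graph size. A small sketch (the graph size and dimensions below are illustrative):

```python
def first_layer_params(input_dim, hidden_dim):
    """Weight-matrix size of the first message-passing layer (bias omitted)."""
    return input_dim * hidden_dim

N, d, hidden = 1_000_000, 32, 32   # a large graph; d is the stochastic dimension

onehot_params = first_layer_params(N, hidden)  # grows linearly with N
smp_params = first_layer_params(d, hidden)     # constant in N

print(onehot_params, smp_params)  # 32000000 1024
```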

B.4 ADDITIONAL LINK PREDICTION RESULTS

We further report the results of link prediction on three GNN benchmarks: Cora, CiteSeer, and PubMed. The results are shown in Table 9. They follow similar trends to the other datasets presented in Section 5.2, i.e., SMP reports results comparable to other permutation-equivariant GNNs, while P-GNN fails to handle the task well.
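The scoring scheme shared by these experiments, an inner-product decoder Sigmoid(H_{i,:} H_{j,:}^T) evaluated with AUC, can be sketched in a few lines. The rank-based AUC below (no tie handling) and the toy embeddings are illustrative stand-ins:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def edge_scores(H, pairs):
    """Score node pairs with the inner-product decoder Sigmoid(H_i H_j^T)."""
    return sigmoid(np.sum(H[pairs[:, 0]] * H[pairs[:, 1]], axis=1))

def auc(pos_scores, neg_scores):
    """AUC = P(a positive pair outscores a negative pair), via the rank statistic."""
    scores = np.concatenate([pos_scores, neg_scores])
    ranks = scores.argsort().argsort() + 1   # 1-based ranks
    n_pos, n_neg = len(pos_scores), len(neg_scores)
    return (ranks[:n_pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Toy embeddings: positive pairs are aligned, negative pairs are orthogonal.
H = np.array([[3., 0.], [3., 0.], [0., 3.], [0., 3.], [3., 0.], [0., 3.]])
pos = np.array([[0, 1], [2, 3]])   # aligned pairs -> high score
neg = np.array([[0, 3], [4, 5]])   # orthogonal pairs -> score 0.5

print(auc(edge_scores(H, pos), edge_scores(H, neg)))  # 1.0
```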

B.5 ADDITIONAL ABLATION STUDIES

We report the ablation study results for the node classification task and the pairwise node classification task in Table 10 and Table 11, respectively. The results again show that SMP-Linear achieves good enough results on the majority of the datasets, and adding non-linearities does not necessarily lift the performance of SMP.

We also compare whether the stochastic signals E are fixed or resampled across training epochs for our proposed SMP. For brevity, we only report the results for the link prediction task in Table 12. The results show that fixing E usually leads to better results on transductive datasets (recall that all datasets except Email and PPI are transductive), while resampling E generally leads to better results on inductive datasets. The results are consistent with our analysis in Section 4.1.

To investigate the performance of linear and non-linear variants of permutation-equivariant GNNs on the link prediction task, we additionally report both the training accuracies and the testing accuracies of SGC, GCN, and GAT in Table 13. Notice that, to ensure a fair comparison, we do not adopt the early-stopping strategy here, so that different models have the same number of training epochs (otherwise, if a model tends to overfit, early stopping will terminate training after only a small number of epochs and result in a spurious underfitting phenomenon). The results show that the non-linear variants of GNNs (GCN and GAT) are more likely to overfit.

• PPIfoot_8 (Hamilton et al., 2017): 24 protein-protein interaction networks. Each node has a 50-dimensional feature vector.
• PPAfoot_10 (Hu et al., 2020): A network representing biological associations between proteins from 58 different species. The node features are one-hot vectors of the species that the proteins are taken from.
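The fixed-versus-resampled choice for E studied in Table 12 amounts to where the sampling happens in the training loop. A minimal sketch; the training step below is a placeholder stand-in, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, epochs = 100, 16, 3

def train_epoch(E):
    """Stand-in for one SMP training epoch consuming stochastic features E."""
    return float(np.linalg.norm(E))  # placeholder for a real parameter update

# Transductive setting: sample E once and keep it fixed across epochs.
E_fixed = rng.normal(size=(N, d))
fixed_runs = [train_epoch(E_fixed) for _ in range(epochs)]

# Inductive setting: resample E every epoch so the model cannot memorize
# any particular draw and must rely on the preserved proximities instead.
resampled_runs = [train_epoch(rng.normal(size=(N, d))) for _ in range(epochs)]

print(len(set(fixed_runs)) == 1, len(set(resampled_runs)) > 1)  # True True
```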



Footnotes:
• (foot_0) In (You et al., 2019), the authors consider the special case of the shortest distance between nodes and name this property "position-aware". In this paper, we consider the more general case of any walk-based proximity.
• Since the final layer of GNNs is task-specific, e.g., a softmax layer for node classification or a readout layer for graph classification, we only consider the GNN architecture up to its last hidden layer.
• Proposition 1 in (You et al., 2019) can be regarded as a special case of Theorem 1 using the shortest-distance proximity.
• Similar to previous works such as (Hamilton et al., 2017; Xu et al., 2018a), we only consider the minimum training loss because the optimization landscapes and generalization gaps are difficult to analyze analytically.
• The results of P-GNN are slightly different from those in the original paper because we adopt a more practical and common setting in which negative samples are not known a priori but are randomly sampled in each epoch.
• (foot_6) To let F^(l)(·) output a set with arbitrary length, we can adopt sequence-based models such as an LSTM.
• (foot_7) One may question whether SMP is transferable across different graphs since the stochastic features are independently drawn. Empirically, we find that SMP reports reasonably good results on inductive datasets such as Email and PPI. One plausible reason is that, since the proximities of nodes are preserved even when the random features themselves differ (see Theorem 2), all subsequent parameters based on proximities can be transferred.
• (foot_8) https://github.com/JiaxuanYou/P-GNN/tree/master/data
• (foot_9) https://github.com/shchur/gnn-benchmark/tree/master/data/npz/
• (foot_10) https://snap.stanford.edu/ogb/data/linkproppred/ppassoc.zip



EXPERIMENTAL SETUPS

Datasets. We conduct experiments on the following ten datasets: two simulation datasets, Grid and Communities (You et al., 2019); a communication dataset, Email (You et al., 2019); two coauthor networks, CS and Physics (Shchur et al., 2018); two protein interaction networks, PPI (Hamilton et al., 2017) and PPA (Hu et al., 2020); and three GNN benchmarks, Cora, CiteSeer, and PubMed (Yang et al., 2016). We only report the results on the three benchmarks for the node classification task; the results for the other tasks are shown in Appendix B due to the page limit. More details of the datasets, including their statistics, are provided in Appendix C.1. These datasets cover a wide spectrum of domains and sizes, both with and without node features. Since Email and PPI contain more than one graph, we conduct experiments on these two datasets in an inductive setting, i.e., the training, validation, and testing sets are split with respect to different graphs.


That is, the margins between the training accuracies and the testing accuracies are usually larger than those of the linear variant SGC. Besides, though possessing extra model expressiveness, non-linear GNNs are also more difficult to train, i.e., the training accuracies of GCN and GAT are not necessarily higher than those of SGC. These results are consistent with the literature (Wu et al., 2019; He et al., 2020).

C EXPERIMENTAL DETAILS FOR REPRODUCIBILITY

C.1 DATASETS

• Grid (You et al., 2019): A simulated 2D grid graph of size 20 × 20 with no node features.
• Communities (You et al., 2019): A simulated caveman graph (Watts, 1999) composed of 20 communities, each containing 20 nodes. The graph is perturbed by randomly rewiring 1% of the edges. It has no node features, and the label of each node indicates the community the node belongs to.
• Emailfoot_8 (You et al., 2019): Seven real-world email communication graphs. Each graph has six communities, and each node has an integer label indicating the community the node belongs to.
• Coauthor Networksfoot_9 (Shchur et al., 2018): Two networks from the Microsoft Academic Graph in CS and Physics, with nodes representing authors and edges representing co-authorships between authors. The node features are embeddings of the authors' paper keywords.

The stochastic matrix is E ∈ R^{N×d}, where each element follows an i.i.d. normal distribution N(0, 1).
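This stochastic matrix is exactly what Theorem 2 manipulates: propagating E gives Ẽ = Ã^K E, and the normalized inner product (1/d) Ẽ_{i,:} Ẽ_{j,:}^T concentrates around S_{i,j} = (Ã^K (Ã^K)^T)_{i,j} by the Johnson-Lindenstrauss lemma. A minimal numpy sketch; the row normalization of Ã and the graph parameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, d = 30, 2, 4096                             # a large d gives a tight estimate

A = (rng.random((N, N)) < 0.2).astype(float)
A = np.triu(A, 1); A = A + A.T                    # random symmetric adjacency
A_norm = A / A.sum(1, keepdims=True).clip(min=1)  # a simple row-normalized \tilde{A}

M = np.linalg.matrix_power(A_norm, K)             # rows a_i = \tilde{A}^K_{i,:}
S = M @ M.T                                       # proximity S = \tilde{A}^K (\tilde{A}^K)^T

E = rng.normal(size=(N, d))                       # stochastic matrix, entries i.i.d. N(0, 1)
E_tilde = M @ E                                   # \tilde{E} = \tilde{A}^K E
S_hat = (E_tilde @ E_tilde.T) / d                 # decoder F_de = (1/d) inner product

# The worst-case entry error is small, on the order of 1/sqrt(d).
print(np.abs(S - S_hat).max())
```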

The results of the link prediction tasks measured in AUC (%). The best results and the second-best results for each dataset are in bold and underlined, respectively.

The results of link prediction on the PPA dataset. The best result and the second-best result are in bold and underlined, respectively.

The results of node classification tasks measured by accuracy (%). The best results and the second-best results for each dataset, respectively, are in bold and underlined.

The ablation study of different SMP variants for the link prediction task. Datasets except PPA are measured by AUC (%); PPA is measured by Hits@100. The best results and the second-best results for each dataset are in bold and underlined, respectively.

The average running time (in milliseconds) per epoch (including both training and testing) on the link prediction task.

The results of graph reconstruction measured in AUC (%). The best and the second-best results for each dataset, respectively, are in bold and underlined. OOM represents out of memory.

The results of pairwise node classification tasks measured in AUC (%). The best results and the second-best results for each dataset, respectively, are in bold and underlined.

The results of comparing SMP with using one-hot IDs in GCNs. OOM represents out of memory; "-" represents that the task is unavailable.

The results of the link prediction task measured in AUC (%). The best results and the second-best results for each dataset, respectively, are in bold and underlined.

The ablation study of different SMP variants for the node classification task. The best results and the second-best results are in bold and underlined, respectively.

The ablation study of different SMP variants for the pairwise node classification task. The best results and the second-best results are in bold and underlined, respectively.


Then, for a connected graph G = (V, E, F), we create G' = (V', E', F') similar to Eq. (12). Specifically, denoting Ñ = N + l_max, we let G' have 3Ñ nodes. Intuitively, we create three "copies" of G and three "bridges" to connect the copies, which makes G' connected as well. It is also easy to see that the nodes v_i, v_{i+Ñ}, and v_{i+2Ñ} all form automorphic node pairs, and thus their representations in any permutation-equivariant GNN are identical.

Next, observe that the nodes in G' are divided into six parts (three copies and three bridges), which we denote as V_1, ..., V_6. Since the bridges have length l_max, any walk that crosses a bridge has a length larger than l_max. For example, let us focus on v_i ∈ V_1, i.e., i ≤ N: for v_j in another copy, any walk v_i ⇝ v_j will pass either the bridge V_2 or V_6 and thus has a length larger than l_max. As a result, S'_{i,j} = S(∅) for such pairs. If v_j ∈ V_1 or v_j ∈ V_2, i.e., j ≤ Ñ, we can use the fact that v_j and v_{j+Ñ} form an automorphic node pair, similar to Eq. (14), i.e., ∀ε > 0, |S'_{i,j} - S(∅)| < 2ε. Similarly, if v_j ∈ V_6, i.e., 2Ñ + N < j, we can use the fact that v_j and v_{j-Ñ} form an automorphic node pair to prove the same inequality. This establishes the claim for i ≤ N, and the same proof strategy can be applied to i > N. Since ε can be arbitrarily small, the results show that all node pairs have the same proximity S(∅), which leads to a contradiction and finishes our proof.

A.2 THEOREM 2

Here we formulate and prove Theorem 2. Note that some notations and definitions are introduced in Appendix A.1.

Theorem 2. For the walk-based proximity S = Ã^K (Ã^K)^T, SMP can preserve the proximity with high probability if the dimensionality of the stochastic matrix is sufficiently large, i.e., ∀ε > 0, ∀δ > 0, there exists d_0 such that for any d > d_0:

P(|S_{i,j} - F_de(H_{i,:}, H_{j,:})| < ε) > 1 - δ,

where H are the node representations obtained from SMP in Eq. (11). The result holds for any stochastic matrix and is thus regardless of whether E is fixed or resampled in each epoch.

Proof. Our proof is mostly based on standard random projection theory. Firstly, since we have proven in Theorem 1 that permutation-equivariant representations cannot preserve any walk-based proximity, here we prove that we can preserve the proximity using only Ẽ, which can be easily achieved by ignoring H^(L) in F_output([Ẽ, H^(L)]).

• Cora, CiteSeer, PubMedfoot_11 (Yang et al., 2016): Three citation graphs where nodes correspond to papers and edges correspond to citations between papers. The node features are bag-of-words vectors, and the node labels are the ground-truth topics of the papers.

We summarize the statistics of the datasets in Table 14.

C.2 HYPER-PARAMETERS

We use the following hyper-parameters:

• All datasets except PPA: we uniformly set the number of layers for all methods to 2, i.e., 2 message-passing steps, and set the dimensionality of the hidden layers to 32, i.e., H^(l) ∈ R^{N×32} for all 1 ≤ l ≤ L (for GAT, we use 4 heads, each containing 8 units). We use the Adam optimizer with an initial learning rate of 0.01 and decay the learning rate by 0.1 at epoch 200. The weight decay is 5e-4. We train each model for 1,000 epochs and evaluate it every 5 epochs. We adopt an early-stopping strategy by reporting the testing performance at the epoch that achieves the best validation performance. For SMP, the dimensionality of the stochastic matrix is d = 32. For P-GNN, we use the P-GNN-F version, which uses the truncated 2-hop shortest path distance instead of the exact shortest distance.

• PPA: as suggested in the original paper (Hu et al., 2020), we set the number of GNN layers to 3, with each layer containing 256 hidden units, and add a three-layer MLP as the predictor after taking the Hadamard product of pairwise node embeddings, i.e., MLP(H_{i,:} ⊙ H_{j,:}). We use the Adam optimizer with an initial learning rate of 0.01. We train for 40 epochs, evaluate on the validation set every epoch, and report the testing results of the model with the best validation performance. We also found that this dataset had issues with exploding gradients and adopt a gradient clipping strategy, limiting the maximum ℓ2-norm of the gradients to 1.0. The dimensionality of the stochastic matrix in SMP is d = 64.
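The schedule described above (decay the learning rate by 0.1 at epoch 200, evaluate every 5 epochs, and report the test score at the epoch with the best validation score) can be sketched as a generic loop. The model, optimizer step, and metric below are toy placeholders, not the paper's implementation:

```python
import numpy as np

def train_with_early_stopping(step, evaluate, epochs=1000, eval_every=5):
    """Generic loop: report the test score at the epoch with the best validation score."""
    lr, best_val, best_test = 0.01, -np.inf, None
    for epoch in range(1, epochs + 1):
        if epoch == 200:
            lr *= 0.1            # decay the learning rate by 0.1 at epoch 200
        step(lr)                 # one training epoch at the current learning rate
        if epoch % eval_every == 0:
            val, test = evaluate()
            if val > best_val:
                best_val, best_test = val, test
    return best_test

# Toy placeholders: validation peaks at "epoch" 500 and then degrades.
state = {"epoch": 0}
def step(lr): state["epoch"] += 1
def evaluate():
    e = state["epoch"]
    return -abs(e - 500), e      # (validation score, "test score" = the epoch)

best = train_with_early_stopping(step, evaluate)
print(best)  # 500
```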

C.3 HARDWARE AND SOFTWARE CONFIGURATIONS

All experiments are conducted on a server with the following configurations:

• Operating System: Ubuntu 18.04.1 LTS
• CPU: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
• GPU: NVIDIA TESLA M40 with 12 GB of memory

(foot_11) https://github.com/kimiyoung/planetoid/tree/master/data

