NON-LOCAL GRAPH NEURAL NETWORKS

Abstract

Modern graph neural networks (GNNs) learn node embeddings through multilayer local aggregation and achieve great success in applications on assortative graphs. However, tasks on disassortative graphs usually require non-local aggregation. In addition, we find that local aggregation is even harmful for some disassortative graphs. In this work, we propose a simple yet effective non-local aggregation framework with an efficient attention-guided sorting for GNNs. Based on it, we develop various non-local GNNs. We perform thorough experiments to analyze disassortative graph datasets and evaluate our non-local GNNs. Experimental results demonstrate that our non-local GNNs significantly outperform previous state-of-the-art methods on six benchmark datasets of disassortative graphs, in terms of both model performance and efficiency.

1. INTRODUCTION

Graph neural networks (GNNs) process graphs and map each node to an embedding vector (Zhang et al., 2018b; Wu et al., 2019). These node embeddings can be directly used for node-level applications, such as node classification (Kipf & Welling, 2017) and link prediction (Schütt et al., 2017). In addition, they can be used to learn the graph representation vector with graph pooling (Ying et al., 2018; Zhang et al., 2018a; Lee et al., 2019; Yuan & Ji, 2020), in order to fit graph-level tasks (Yanardag & Vishwanathan, 2015). Many variants of GNNs have been proposed, such as ChebNets (Defferrard et al., 2016), GCNs (Kipf & Welling, 2017), GraphSAGE (Hamilton et al., 2017), GATs (Veličković et al., 2018), LGCN (Gao et al., 2018) and GINs (Xu et al., 2019). Their advantages have been shown on various graph datasets and tasks (Errica et al., 2020). However, these GNNs share a multilayer local aggregation framework, which is similar to convolutional neural networks (CNNs) (LeCun et al., 1998) on grid-like data such as images and texts. In recent years, the importance of non-local aggregation has been demonstrated in many applications in the fields of computer vision (Wang et al., 2018; 2020) and natural language processing (Vaswani et al., 2017). In particular, the attention mechanism has been widely explored to achieve non-local aggregation and capture long-range dependencies from distant locations. Basically, the attention mechanism measures the similarity between every pair of locations and enables information to be communicated among distant but similar locations. In terms of graphs, non-local aggregation is also crucial for disassortative graphs, while previous studies of GNNs focus on assortative graph datasets (Section 2.2). In addition, we find that local aggregation is even harmful for some disassortative graphs (Section 4.3). The recently proposed Geom-GCN (Pei et al., 2020) attempts to capture long-range dependencies in disassortative graphs.
It contains an attention-like step that computes the Euclidean distance between every pair of nodes. However, this step is computationally prohibitive for large-scale graphs, as the computational complexity is quadratic in the number of nodes. In addition, Geom-GCN employs pre-trained node embeddings (Tenenbaum et al., 2000; Nickel & Kiela, 2017; Ribeiro et al., 2017) that are not task-specific, limiting the effectiveness and flexibility. In this work, we propose a simple yet effective non-local aggregation framework for GNNs. At the heart of the framework lies an efficient attention-guided sorting, which enables non-local aggregation through classic local aggregation operators in general deep learning. The proposed framework can be flexibly used to augment common GNNs with low computational costs. Based on the framework, we build various efficient non-local GNNs. In addition, we perform detailed analysis on existing disassortative graph datasets, and apply different non-local GNNs accordingly. Experimental results show that our non-local GNNs significantly outperform previous state-of-the-art methods on node classification tasks on six benchmark datasets of disassortative graphs.

2.1. GRAPH NEURAL NETWORKS

We focus on learning the embedding vector for each node through graph neural networks (GNNs). Most existing GNNs are inspired by convolutional neural networks (CNNs) (LeCun et al., 1998) and follow a local aggregation framework. In general, each layer of a GNN scans every node in the graph and aggregates local information from directly connected nodes, i.e., the 1-hop neighbors. Specifically, a common GNN layer performs a two-step processing similar to the depthwise separable convolution (Chollet, 2017): spatial aggregation and feature transformation. The first step updates each node embedding using the embedding vectors of spatially neighboring nodes. For example, GCNs (Kipf & Welling, 2017) and GATs (Veličković et al., 2018) compute a weighted sum of node embeddings within the 1-hop neighborhood, where the weights come from the degree of nodes and the interaction between nodes, respectively. GraphSAGE (Hamilton et al., 2017) applies max pooling, while GINs (Xu et al., 2019) simply sum the node embeddings. The feature transformation step is similar to the 1×1 convolution, where each node embedding vector is mapped into a new feature space through a shared linear transformation (Kipf & Welling, 2017; Hamilton et al., 2017; Veličković et al., 2018) or multilayer perceptron (MLP) (Xu et al., 2019). Different from these studies, LGCN (Gao et al., 2018) directly applies the regular convolution through top-k ranking. Nevertheless, each layer of these GNNs only aggregates local information within the 1-hop neighborhood. While stacking multiple layers can theoretically enable communication between nodes across the multi-hop neighborhood, the aggregation is essentially local. In addition, deep GNNs usually suffer from the over-smoothing problem (Xu et al., 2018; Li et al., 2018; Chen et al., 2020).
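The two-step layer described above (spatial aggregation followed by a shared feature transformation) can be sketched in plain Python. This is a minimal illustration, not the implementation of any of the cited models: the function names `aggregate`, `transform`, and `local_layer` are hypothetical, the "sum" and "max" modes only loosely mirror GIN-style and GraphSAGE-style aggregation, and the degree-based weighting of GCNs and the learned weights of GATs are not modeled.

```python
def aggregate(z, neighbors, v, mode="sum"):
    """Spatial aggregation over v and its 1-hop neighborhood.

    z: dict node -> embedding (list of floats)
    neighbors: dict node -> list of directly connected nodes
    """
    group = [z[v]] + [z[u] for u in neighbors[v]]
    columns = list(zip(*group))  # one tuple per feature dimension
    if mode == "sum":
        return [sum(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    if mode == "mean":
        return [sum(col) / len(group) for col in columns]
    raise ValueError("unknown aggregation mode: " + mode)


def transform(x, weight):
    """Shared linear map: output[j] = sum_i weight[j][i] * x[i]."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def local_layer(z, neighbors, weight, mode="sum"):
    """One local aggregation layer applied to every node."""
    return {v: transform(aggregate(z, neighbors, v, mode), weight) for v in z}
```

Each call to `local_layer` only mixes information across one hop, so reaching nodes that are L hops away requires stacking L such calls.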

2.2. ASSORTATIVE AND DISASSORTATIVE GRAPHS

There are many kinds of graphs in the literature, such as citation networks (Kipf & Welling, 2017) , community networks (Chen et al., 2020) , co-occurrence networks (Tang et al., 2009) , and webpage linking networks (Rozemberczki et al., 2019) . We focus on graph datasets corresponding to the node classification tasks. In particular, we categorize graph datasets into assortative and disassortative ones (Newman, 2002; Ribeiro et al., 2017) according to the node homophily in terms of labels, i.e., how likely nodes with the same label are near each other in the graph. Assortative graphs refer to those with a high node homophily. Common assortative graph datasets are citation networks and community networks. On the other hand, graphs in disassortative graph datasets contain more nodes that have the same label but are distant from each other. Example disassortative graph datasets are co-occurrence networks and webpage linking networks. As introduced above, most existing GNNs perform local aggregation only and achieve good performance on assortative graphs (Kipf & Welling, 2017; Hamilton et al., 2017; Veličković et al., 2018; Gao et al., 2018) . However, they may fail on disassortative graphs, where informative nodes in the same class tend to be out of the local multi-hop neighborhood and non-local aggregation is needed. Thus, in this work, we explore the non-local GNNs.

2.3. ATTENTION MECHANISM

The attention mechanism (Vaswani et al., 2017) has been widely used in GNNs (Veličković et al., 2018; Gao & Ji, 2019; Knyazev et al., 2019) as well as other deep learning models (Yang et al., 2016; Wang et al., 2018; 2020). A typical attention mechanism takes three groups of vectors as inputs, namely the query vector $q$, key vectors $(k_1, k_2, \ldots, k_n)$, and value vectors $(v_1, v_2, \ldots, v_n)$. Note that key and value vectors have a one-to-one correspondence and can sometimes be the same. The attention mechanism computes the output vector $o$ as

$$a_i = \mathrm{ATTEND}(q, k_i) \in \mathbb{R}, \quad i = 1, 2, \ldots, n; \qquad o = \sum_{i} a_i v_i, \tag{1}$$

where the ATTEND(·) function could be any function that outputs a scalar attention score $a_i$ from the interaction between $q$ and $k_i$, such as dot product (Gao & Ji, 2019) or even a neural network (Veličković et al., 2018). The definition of the three groups of input vectors depends on the models and applications.

Notably, existing GNNs usually use the attention mechanism for local aggregation (Veličković et al., 2018; Gao & Ji, 2019). Specifically, when aggregating information for node $v$, the query vector is the embedding vector of $v$ while the key and value vectors come from node embeddings of $v$'s directly connected nodes. The process is iterated for each $v \in V$. It is worth noting that the attention mechanism can be easily extended for non-local aggregation (Wang et al., 2018; 2020), by letting the key and value vectors correspond to all the nodes in the graph when aggregating information for each node. However, it is computationally prohibitive given large-scale graphs, as iterating it for each node in a graph of $n$ nodes requires $O(n^2)$ time. In this work, we propose a novel non-local aggregation method that only requires $O(n \log n)$ time.
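As a concrete reading of the attention operator described in this section, the sketch below uses a plain dot product for ATTEND and applies no normalization of the scores, since the formula as stated has none. The helper names (`dot`, `attend_dot`, `attention`, `nonlocal_attention_all_nodes`) are illustrative and do not come from the original work.

```python
def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def attend_dot(q, k):
    """ATTEND chosen as a dot product; any scalar-valued function would do."""
    return dot(q, k)


def attention(q, keys, values, attend=attend_dot):
    """Return (scores, o) with a_i = ATTEND(q, k_i) and o = sum_i a_i * v_i."""
    scores = [attend(q, k) for k in keys]
    out = [0.0] * len(values[0])
    for a, v in zip(scores, values):
        for j, vj in enumerate(v):
            out[j] += a * vj
    return scores, out


def nonlocal_attention_all_nodes(embeddings):
    """Naive non-local aggregation: every node attends to every node.

    For n nodes this evaluates ATTEND n * n times, which is the quadratic
    cost the text refers to.
    """
    return [attention(z, embeddings, embeddings)[1] for z in embeddings]
```

The last function makes the cost visible: the outer loop runs once per node and each iteration scores all n keys.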

3.1. NON-LOCAL AGGREGATION WITH ATTENTION-GUIDED SORTING

We consider a graph $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. Each edge $e \in E$ connects two nodes so that $E \subseteq V \times V$. Each node $v \in V$ has a node feature vector $x_v \in \mathbb{R}^d$. The $k$-hop neighborhood of $v$ refers to the set of nodes $\mathcal{N}_k(v)$ that can reach $v$ within $k$ edges. For example, the set of $v$'s directly connected nodes is its 1-hop neighborhood $\mathcal{N}_1(v)$. Our proposed non-local aggregation framework is composed of three steps, namely local embedding, attention-guided sorting, and non-local aggregation. In the following, we describe them one by one.

Local Embedding: Our proposed framework is built upon a local embedding step that extracts local node embeddings from the node feature vectors. The local embedding step can be as simple as

$$z_v = \mathrm{MLP}(x_v) \in \mathbb{R}^f, \quad \forall v \in V. \tag{2}$$

The MLP(·) function is a multilayer perceptron (MLP), and $f$ is the dimension of the local node embedding $z_v$. Note that the MLP(·) function is shared across all the nodes in the graph. Applying MLP only takes the node itself into consideration without aggregating information from the neighborhood. This property is very important on some disassortative graphs, as shown in Section 4.3.

On the other hand, graph neural networks (GNNs) can be used as the local embedding step as well, so that our proposed framework can be easily employed to augment existing GNNs. As introduced in Section 2.1, modern GNNs perform multilayer local aggregation. Typically, for each node, one layer of a GNN aggregates information from its 1-hop neighborhood. Stacking $L$ such local aggregation layers allows each node to access information that is $L$ hops away. To be specific, the $\ell$-th layer of an $L$-layer GNN ($\ell = 1, 2, \ldots, L$) can be described as

$$z_v^{(\ell)} = \mathrm{TRANSFORM}^{(\ell)}\left(\mathrm{AGGREGATE}^{(\ell)}\left(\left\{z_u^{(\ell-1)} : u \in \mathcal{N}_1(v) \cup \{v\}\right\}\right)\right) \in \mathbb{R}^f, \quad \forall v \in V, \tag{3}$$

where $z_v^{(0)} = x_v$, and $z_v = z_v^{(L)}$ represents the local node embedding.
The $\mathrm{AGGREGATE}^{(\ell)}(\cdot)$ and $\mathrm{TRANSFORM}^{(\ell)}(\cdot)$ functions represent the spatial aggregation and feature transformation steps introduced in Section 2.1, respectively. With the above framework, GNNs can capture the node feature information from nodes within a local neighborhood as well as the structural information.

When either MLP or a GNN is used as the local embedding step, the local node embedding $z_v$ only contains local information of a node $v$. However, $z_v$ can be used to guide non-local aggregation, as distant but informative nodes are likely to have similar node features and local structures. Based on this intuition, we propose the attention-guided sorting to enable the non-local aggregation.

Attention-Guided Sorting: The basic idea of the attention-guided sorting is to learn an ordering of nodes, where distant but informative nodes are put near each other. Specifically, given the local node embedding $z_v$ obtained through the local embedding step, we compute one set of attention scores by

$$a_v = \mathrm{ATTEND}(c, z_v) \in \mathbb{R}, \quad \forall v \in V, \tag{4}$$

where $c$ is a calibration vector that is randomly initialized and jointly learned during training (Yang et al., 2016). In this attention operator, $c$ serves as the query vector and the $z_v$ are the key vectors. In addition, we also treat the $z_v$ as the value vectors. However, unlike the attention mechanism introduced in Section 2.3, we use the attention scores to sort the value vectors instead of computing a weighted sum to aggregate them. Note that originally there is no ordering among nodes in a graph. To be specific, as $a_v$ and $z_v$ have a one-to-one correspondence through Equation (4), sorting the attention scores in non-decreasing order into $(a_1, a_2, \ldots, a_n)$ provides an ordering among nodes, where $n = |V|$ is the number of nodes in the graph. The resulting sequence of local node embeddings can be denoted as $(z_1, z_2, \ldots, z_n)$.
The attention process in Equation (4) can also be understood as a projection of local node embeddings onto a 1-dimensional space. The projection depends on the concrete ATTEND(·) function and the calibration vector $c$. As indicated by its name, the calibration vector $c$ is used to calibrate the 1-dimensional space, in order to push distant but informative nodes close to each other in this space. This goal is fulfilled through the following non-local aggregation step and the training of the calibration vector $c$, as demonstrated below.

Non-Local Aggregation: We point out that, with the attention-guided sorting, the non-local aggregation can be achieved by convolution, the most common local aggregation operator in deep learning. Specifically, given the sorted sequence of local node embeddings $(z_1, z_2, \ldots, z_n)$, we compute

$$(\hat{z}_1, \hat{z}_2, \ldots, \hat{z}_n) = \mathrm{CONV}(z_1, z_2, \ldots, z_n), \tag{5}$$

where the CONV(·) function represents a 1D convolution with appropriate padding. Note that the CONV(·) function can be replaced by a 1D convolutional neural network as long as the number of input and output vectors remains the same.

To see how the CONV(·) function performs non-local aggregation with the attention-guided sorting, we take an example where the CONV(·) function is a 1D convolution of kernel size $2s + 1$. In this case, $\hat{z}_i$ is computed from $(z_{i+s}, \ldots, z_{i-s})$, corresponding to the receptive field of the CONV(·) function. As a result, if the attention-guided sorting leads to $(z_{i+s}, \ldots, z_{i-s})$ containing nodes that are distant but informative to $z_i$, the output $\hat{z}_i$ aggregates non-local information. Another view is that we can consider the attention-guided sorting as re-connecting nodes in the graph, where $(z_{i+s}, \ldots, z_{i-s})$ can be treated as the 1-hop neighborhood of $z_i$.
After the CONV(·) function, $\hat{z}_i$ and $z_i$ are concatenated as the input to a classifier to predict the label of the corresponding node, where both non-local and local dependencies can be captured. In order to enable the end-to-end training of the calibration vector $c$, we modify Equation (5) into

$$(\hat{z}_1, \hat{z}_2, \ldots, \hat{z}_n) = \mathrm{CONV}(a_1 z_1, a_2 z_2, \ldots, a_n z_n), \tag{6}$$

where we multiply the attention score with the corresponding local node embedding. As a result, the calibration vector $c$ receives gradients through the attention scores during training.

The remaining question is how to make sure that the attention-guided sorting pushes distant but informative nodes together. The short answer is that it is not necessary to guarantee this, as the requirement of non-local aggregation depends on the concrete graphs. In fact, our proposed framework grants GNNs the ability of non-local aggregation but lets the end-to-end training process determine whether to use non-local information. The back-propagation from the supervised loss will tune the calibration vector $c$ and encourage $\hat{z}_i$ to capture useful information that is not encoded by $z_i$. In the case of disassortative graphs, $\hat{z}_i$ usually needs to aggregate information from distant but informative nodes. Hence, the calibration vector $c$ tends to arrange the attention-guided sorting to put distant but informative nodes together, as demonstrated experimentally in Section 4.5. On the other hand, nodes within the local neighborhood are usually much more informative than distant nodes in assortative graphs. In this situation, $\hat{z}_i$ may simply perform local aggregation that is similar to GNNs.

In Section 4, we demonstrate the effectiveness of our proposed non-local aggregation framework on six disassortative graph datasets. In particular, we achieve the state-of-the-art performance on all the datasets with significant improvements over previous methods.
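The three steps of this section can be put together in a short forward-pass sketch. It is a simplified illustration under stated assumptions rather than the authors' code: ATTEND is taken to be a dot product with the calibration vector, the convolution uses one scalar weight per window offset shared across feature dimensions (a real 1D convolution would use a weight matrix per offset), padding is zero padding, and nothing here is trained. All function names are hypothetical.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def attention_guided_sort(z, c):
    """Score each local embedding with a_v = c . z_v and sort ascending.

    Returns (order, scores); order[i] is the original index of the node
    placed at sorted position i.
    """
    scores = [dot(c, zv) for zv in z]
    order = sorted(range(len(z)), key=lambda v: scores[v])
    return order, scores


def conv1d_same(seq, kernel):
    """Sliding weighted sum over a sequence of vectors, zero padded so the
    output has as many vectors as the input. Kernel size is 2s + 1."""
    s = len(kernel) // 2
    n, f = len(seq), len(seq[0])
    out = []
    for i in range(n):
        acc = [0.0] * f
        for t, w in enumerate(kernel):
            j = i + t - s
            if 0 <= j < n:  # positions outside the sequence count as zeros
                for d in range(f):
                    acc[d] += w * seq[j][d]
        out.append(acc)
    return out


def non_local_aggregate(z, c, kernel):
    """Sort, scale by attention score, convolve, then concatenate with z."""
    order, scores = attention_guided_sort(z, c)
    scaled = [[scores[v] * x for x in z[v]] for v in order]  # a_i * z_i
    z_hat_sorted = conv1d_same(scaled, kernel)
    z_hat = [None] * len(z)
    for pos, v in enumerate(order):
        z_hat[v] = z_hat_sorted[pos]  # scatter back to original node order
    return [z_hat[v] + list(z[v]) for v in range(len(z))]
```

For example, with one-dimensional embeddings `[[3.0], [1.0], [2.0]]`, `c = [1.0]`, and an all-ones kernel of size 3, the sorted and scaled sequence is 1, 4, 9, so the node with embedding 1.0 receives 1 + 4 = 5 as its non-local feature. In a trainable version, the scaling by the attention score is what lets gradients reach the calibration vector, since the sorting itself is not differentiable.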

3.2. TIME COMPLEXITY ANALYSIS

We perform a theoretical analysis of the time complexity of our proposed framework. As discussed in Section 2.3, using the attention mechanism (Vaswani et al., 2017; Wang et al., 2018; 2020) to achieve non-local aggregation requires $O(n^2)$ time for a graph of $n$ nodes. Essentially, the $O(n^2)$ time complexity is due to the fact that the ATTEND(·) function needs to be computed between every pair of nodes. In particular, the recently proposed Geom-GCN (Pei et al., 2020) contains a similar non-local aggregation step. For each $v \in V$, Geom-GCN finds the set of nodes whose Euclidean distance to $v$ is less than a pre-defined number, which requires computing the Euclidean distance between every pair of nodes. As the computation of the Euclidean distance between two nodes can be understood as the ATTEND(·) function, Geom-GCN has at least $O(n^2)$ time complexity. In contrast, our proposed non-local aggregation framework requires only $O(n \log n)$ time. To see this, note that the ATTEND(·) function in Equation (4) only needs to be computed once per node against the single calibration vector, instead of being iterated over all nodes for each node. As a result, computing the attention scores only takes $O(n)$ time. Therefore, the time complexity of sorting, i.e., $O(n \log n)$, dominates the total time complexity of our proposed framework. In Section 4.6, we compare the real running time on different datasets among common GNNs, Geom-GCN, and our non-local GNNs as introduced in the next section.
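The argument above can be summarized as a cost breakdown. The block below is one reading of the text rather than a formula from the original: it assumes the embedding dimension and the convolution kernel size are constants independent of $n$, and it leaves out the cost of the local embedding step.

```latex
\begin{aligned}
T_{\text{pairwise attention}}(n)
  &= n \cdot \underbrace{O(n)}_{\text{ATTEND against all } n \text{ keys}}
   = O(n^2), \\
T_{\text{proposed}}(n)
  &= \underbrace{O(n)}_{\text{scores } a_v}
   + \underbrace{O(n \log n)}_{\text{sorting}}
   + \underbrace{O(n)}_{\text{1D convolution}}
   = O(n \log n).
\end{aligned}
```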

3.3. EFFICIENT NON-LOCAL GRAPH NEURAL NETWORKS

We apply our proposed non-local aggregation framework to build efficient non-local GNNs. Recall that our proposed framework starts with the local embedding step, followed by the attention-guided sorting and the non-local aggregation step.

In particular, the local embedding step can be implemented by either MLP or common GNNs, such as GCNs (Kipf & Welling, 2017) or GATs (Veličković et al., 2018). MLP extracts the local node embedding only from the node feature vector and excludes the information from nodes within the local neighborhood. This property can be helpful on some disassortative graphs, where nodes within the local neighborhood provide more noise than useful information. On other disassortative graphs, informative nodes are located in both the local neighborhood and distant locations. In this case, GNNs are more suitable as the local embedding step. Depending on the disassortative graphs at hand, we build different non-local GNNs with either MLP or GNNs as the local embedding step. In Section 4.3, we show that these two categories of disassortative graphs can be distinguished through simple experiments, where we apply different non-local GNNs accordingly. Specifically, the number of layers is set to 2 for both MLP and GNNs.

In terms of the attention-guided sorting, we only need to specify the ATTEND(·) function in Equation (4). In order to make it as efficient as possible, we choose the ATTEND(·) function as

$$a_v = \mathrm{ATTEND}(c, z_v) = c^T z_v \in \mathbb{R}, \quad \forall v \in V, \tag{7}$$

where $c$ is part of the training parameters, as described in Section 3.1.

With the attention-guided sorting, we can implement the non-local aggregation step through convolution, as explained in Section 3.1 and shown in Equation (6). Specifically, the CONV(·) function is set as a 2-layer convolutional neural network composed of two 1D convolutions. The kernel size is set to 3 or 5 depending on the datasets. The activation function is ReLU (Krizhevsky et al., 2012).

Finally, we use a linear classifier that takes the concatenation of $\hat{z}_i$ and $z_i$ as inputs and makes a prediction for the corresponding node. Depending on the local embedding step, we build three efficient non-local GNNs, namely non-local MLP (NLMLP), non-local GCN (NLGCN), and non-local GAT (NLGAT). The models can be trained end-to-end with the classification loss.

4.1. DATASETS

We perform experiments on six disassortative graph datasets (Rozemberczki et al., 2019; Tang et al., 2009; Pei et al., 2020) (Chameleon, Squirrel, Actor, Cornell, Texas, Wisconsin) and three assortative graph datasets (Sen et al., 2008) (Cora, Citeseer, Pubmed). These datasets are commonly used to evaluate GNNs on node classification tasks (Kipf & Welling, 2017; Veličković et al., 2018; Gao et al., 2018; Pei et al., 2020). We provide detailed descriptions of disassortative graph datasets in Appendix A.1. In order to distinguish assortative and disassortative graph datasets, Pei et al. (2020) propose a metric to measure the homophily of a graph $G$, defined as

$$H(G) = \frac{1}{|V|} \sum_{v \in V} \frac{\text{Number of } v\text{'s directly connected nodes who have the same label as } v}{\text{Number of } v\text{'s directly connected nodes}}. \tag{8}$$

In our experiments, we focus on comparing the model performance on disassortative graph datasets, in order to demonstrate the effectiveness of our non-local aggregation framework. The performances on assortative graph datasets are provided for reference, indicating that the proposed framework will not hurt the performance when non-local aggregation is not strongly desired.
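The homophily metric of Pei et al. (2020) averages, over all nodes, the fraction of each node's directly connected nodes that share its label. A minimal sketch follows; the function name `homophily`, the adjacency-dict input format, and the choice to let nodes without neighbors contribute zero are assumptions made for illustration.

```python
def homophily(neighbors, labels):
    """Average, over nodes, of the share of 1-hop neighbors with the same label.

    neighbors: dict node -> list of directly connected nodes
    labels:    dict node -> class label
    """
    total = 0.0
    for v, nbrs in neighbors.items():
        if not nbrs:
            continue  # assumption: an isolated node contributes 0 to the sum
        same = sum(1 for u in nbrs if labels[u] == labels[v])
        total += same / len(nbrs)
    return total / len(neighbors)
```

On a three-node path 0-1-2 with labels a, a, b, the per-node fractions are 1, 0.5, and 0, so the metric is 0.5.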

4.2. BASELINES

We compare our proposed non-local MLP (NLMLP), non-local GCN (NLGCN), and non-local GAT (NLGAT) with various baselines: (1) MLP is the simplest deep learning model. It makes predictions solely based on the node feature vectors, without aggregating any local or non-local information. (2) GCN (Kipf & Welling, 2017) and GAT (Veličković et al., 2018) are the most common GNNs. As introduced in Section 2.1, they only perform local aggregation. (3) Geom-GCN (Pei et al., 2020) is a recently proposed GNN that can capture long-range dependencies. It is the current state-of-the-art model on several disassortative graph datasets. Geom-GCN requires the use of different node embedding methods, such as Isomap (Tenenbaum et al., 2000), Poincare (Nickel & Kiela, 2017), and struc2vec (Ribeiro et al., 2017). We simply report the best results from Pei et al. (2020) for Geom-GCN and the following two variants without specifying the node embedding method. (4) Geom-GCN-g (Pei et al., 2020) is a variant of Geom-GCN that performs local aggregation only. It is similar to common GNNs. (5) Geom-GCN-s (Pei et al., 2020) is a variant of Geom-GCN that does not force local aggregation. The designed functionality is similar to our NLMLP.

We implement MLP, GCN, GAT, and our methods using PyTorch (Adam et al., 2017) and PyTorch Geometric (Fey & Lenssen, 2019). As has been discussed (see the OpenReview link in the footnote), in fair settings, the results of GCN and GAT differ from those in Pei et al. (2020).

On each dataset, we follow Pei et al. (2020) and randomly split nodes of each class into 60%, 20%, and 20% for training, validation, and testing. Each experiment is repeated 10 times with different random splits, and the average test accuracy over these 10 runs is reported. Testing is performed when validation accuracy achieves its maximum on each run. Apart from the details specified in Section 3.3, we tune the following hyperparameters individually for our proposed models: (1) the number of hidden units ∈ {16, 48, 96}, (2) dropout rate ∈ {0, 0.5, 0.8}, (3) weight decay ∈ {0, 5e-4, 5e-5, 5e-6}, and (4) learning rate ∈ {0.01, 0.05}.
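The per-class 60%/20%/20% random split described above could be produced along the following lines. This is a sketch, not the split code used in the experiments: the function name `split_per_class`, the integer-percentage interface, and the rule that any rounding remainder goes to the test set are assumptions.

```python
import random


def split_per_class(labels, seed, train_pct=60, val_pct=20):
    """Split node indices class by class into train/val/test lists.

    labels: list where labels[i] is the class of node i
    seed:   seed for this particular random split
    """
    rng = random.Random(seed)
    by_class = {}
    for node, y in enumerate(labels):
        by_class.setdefault(y, []).append(node)

    train, val, test = [], [], []
    for nodes in by_class.values():
        rng.shuffle(nodes)
        n_train = len(nodes) * train_pct // 100
        n_val = len(nodes) * val_pct // 100
        train += nodes[:n_train]
        val += nodes[n_train:n_train + n_val]
        test += nodes[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```

Calling the function with ten different seeds would give the ten random splits over which test accuracy is averaged.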

4.3. ANALYSIS OF DISASSORTATIVE GRAPH DATASETS

As discussed in Section 3.3, the disassortative graph datasets can be divided into two categories. Nodes within the local neighborhood provide more noises than useful information in disassortative graphs belonging to the first category. Therefore, local aggregation should be avoided in models on such disassortative graphs. As for the second category, informative nodes locate in both local neighborhood and distant locations. Intuitively, a graph with lower H(G) is more likely to be in the first category. However, it is not an accurate way to determine the two categories. Knowing the exact category of a disassortative graph is crucial, as we need to apply non-local GNNs accordingly. As analyzed above, the key difference lies in whether the local aggregation is useful. Hence, we can distinguish two categories of disassortative graphs by comparing the performance between MLP and common GNNs (GCN, GAT) on each of the six disassortative graph datasets. The results are summarized in Table 2 . We can see that Actor, Cornell, Texas, and Wisconsin fall into the first category, while Chameleon and Squirrel belong to the second category. We add the performance on assortative graph datasets for reference, where the local aggregation is effective so that GNNs tend to outperform MLP.

4.4. COMPARISONS WITH BASELINES

According to the insights from Section 4.3, we apply different non-local GNNs according to the category of disassortative graph datasets, and make comparisons with corresponding baselines.

Specifically, we employ NLMLP on Actor, Cornell, Texas, and Wisconsin. The corresponding baselines are MLP, Geom-GCN, and Geom-GCN-s, as Table 2 has shown that GCN and GAT perform much worse than MLP on these datasets. Geom-GCN-g is similar to GCN and performs worse than Geom-GCN-s, as shown in Appendix A.2. The comparison results are reported in Table 3. While Geom-GCN-s is the previous state-of-the-art GNN on these datasets (Pei et al., 2020), we find that MLP consistently outperforms Geom-GCN-s by large margins. In particular, although Geom-GCN-s does not explicitly perform local aggregation, it is still outperformed by MLP. A possible explanation is that Geom-GCN-s uses pre-trained node embeddings, which implicitly aggregate information from the local neighborhood. In contrast, our NLMLP is built upon MLP with the proposed non-local aggregation framework, which excludes the local noise and collects useful information from non-local informative nodes. The NLMLP sets the new state-of-the-art performance on these disassortative graph datasets.

On Chameleon and Squirrel, which belong to the second category of disassortative graph datasets, we apply NLGCN and NLGAT accordingly. The baselines are GCN, GAT, Geom-GCN, and Geom-GCN-g. On these datasets, the baselines that explicitly perform local aggregation show advantages over MLP and Geom-GCN-s, as shown in Appendix A.2. As shown in Table 4, our proposed NLGCN achieves the best performance on both datasets. In addition, it is worth noting that our NLGCN and NLGAT are built upon GCN and GAT, respectively. They show improvements over their counterparts, which indicates that the advantages of our proposed non-local aggregation framework are general for common GNNs.
We provide the results of all the models on all datasets in Appendix A.2 for reference.

4.5. ANALYSIS OF THE ATTENTION-GUIDED SORTING

We analyze the results of the attention-guided sorting in our proposed framework, in order to show that our non-local GNNs indeed perform non-local aggregation. Suppose the attention-guided sorting leads to the sorted sequence $(z_1, z_2, \ldots, z_n)$, which goes through a convolution or CNN into $(\hat{z}_1, \hat{z}_2, \ldots, \hat{z}_n)$. As discussed in Section 3.1, we can consider the sequence $(z_1, z_2, \ldots, z_n)$ as a re-connected graph $\hat{G}$, where we treat nodes within the receptive field of $\hat{z}_i$ as directly connected to $z_i$, i.e., $z_i$'s 1-hop neighborhood. The information within this new 1-hop neighborhood will be aggregated. If our non-local GNNs indeed perform non-local aggregation, the homophily of the re-connected graph should be larger than that of the original graph. Therefore, we compute $H(\hat{G})$ for each dataset to verify this statement. Following Section 4.4, we apply NLMLP on Actor, Cornell, Texas, and Wisconsin and NLGCN on Chameleon and Squirrel.

4.6. EFFICIENCY COMPARISONS

As analyzed in Section 3.2, our proposed non-local aggregation framework is more efficient than previous methods based on the original attention mechanism, such as Geom-GCN (Pei et al., 2020). Concretely, our method requires only $O(n \log n)$ computation time in contrast to $O(n^2)$. In this section, we compare the real running time to verify our analysis. Specifically, we compare NLGCN with Geom-GCN as well as GCN and GAT. For Geom-GCN, we use the code provided in Pei et al. (2020). Each model is trained for 500 epochs on each dataset and the average training time per epoch is reported. The results are shown in Table 5. Although our NLGCN is built upon GCN, it is just slightly slower than GCN and faster than GAT, showing the efficiency of our non-local aggregation framework. On the other hand, Geom-GCN is significantly slower due to the fact that it has $O(n^2)$ time complexity.

5. CONCLUSION

In this work, we propose a simple yet effective non-local aggregation framework for GNNs. The core of the framework is an efficient attention-guided sorting, which enables non-local aggregation through convolution. The proposed framework can be easily used to build non-local GNNs with low computational costs. We perform thorough experiments on node classification tasks to evaluate our proposed method. In particular, we experimentally analyze existing disassortative graph datasets and apply different non-local GNNs accordingly. The results show that our non-local GNNs significantly outperform previous state-of-the-art methods on six benchmark datasets of disassortative graphs, in terms of both accuracy and speed.



Footnote: https://openreview.net/forum?id=S1e2agrFvS&noteId=8tGKV1oSzCr



Figure 1 compares $H(\hat{G})$ with $H(G)$ for each dataset. We can observe that $H(\hat{G})$ is much larger than $H(G)$, indicating that distant but informative nodes are near each other in the re-connected graph $\hat{G}$. We also provide the visualizations of the sorted sequence for Cornell and Texas. We can see that nodes with the same label tend to be clustered together. These facts indicate that our non-local GNNs perform non-local aggregation with the attention-guided sorting.
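The homophily of the re-connected graph can be computed by treating the nodes inside each convolution window of the sorted sequence as 1-hop neighbors. The sketch below illustrates that idea under assumptions that the text does not pin down: the window half-width `s` is a free parameter, the node itself is excluded from its own neighborhood, and the function names are hypothetical.

```python
def reconnected_neighbors(order, s):
    """Build the neighborhoods of the re-connected graph.

    order: order[i] is the node placed at sorted position i
    s:     half-width of the receptive field (window covers 2s + 1 positions)
    """
    n = len(order)
    nbrs = {}
    for i, v in enumerate(order):
        window = range(max(0, i - s), min(n, i + s + 1))
        nbrs[v] = [order[j] for j in window if j != i]
    return nbrs


def homophily(neighbors, labels):
    """Average share of same-label nodes among each node's neighbors."""
    total = 0.0
    for v, nbrs in neighbors.items():
        if nbrs:
            total += sum(1 for u in nbrs if labels[u] == labels[v]) / len(nbrs)
    return total / len(neighbors)
```

If the learned ordering places same-label nodes next to each other, `homophily(reconnected_neighbors(order, s), labels)` will be high even when the original graph's value is low, which is the comparison the figure reports.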

Figure 1: (a) Comparisons of the homophily between the original graph and the re-connected graph given by our NLGCN on Chameleon and Squirrel. (b) Comparisons of the homophily between the original graph and the re-connected graph given by our NLMLP on Actor, Cornell, Texas, and Wisconsin. (c) Visualization of the sorted node sequence after the attention-guided sorting for Cornell and Texas. The colors denote node labels. Details are explained in Section 4.5.

Table 1: Statistics of the nine datasets used in our experiments. The definition of H(G) is provided in Section 4.1: it averages, over all nodes v, the fraction of v's directly connected nodes who have the same label as v. H(G) can be used to distinguish assortative and disassortative graph datasets.

Table 2: Comparisons between MLP and common GNNs. These analytical experiments are used to determine the two categories of disassortative graph datasets, as introduced in Section 4.3.

Table 3: Comparisons between our NLMLP and strong baselines on the four disassortative graph datasets belonging to the first category as defined in Section 4.3.

Table 4: Comparisons between our NLGCN, NLGAT and strong baselines on the two disassortative graph datasets belonging to the second category as defined in Section 4.3.

Table 5: Comparisons in terms of real running time (milliseconds).

A APPENDIX

A.1 DETAILS OF DISASSORTATIVE GRAPH DATASETS

Here are the details of the disassortative graph datasets used in our experiments:

• Chameleon and Squirrel are Wikipedia networks (Rozemberczki et al., 2019)

