NON-LOCAL GRAPH NEURAL NETWORKS

Abstract

Modern graph neural networks (GNNs) learn node embeddings through multilayer local aggregation and achieve great success in applications on assortative graphs. However, tasks on disassortative graphs usually require non-local aggregation. In addition, we find that local aggregation is even harmful for some disassortative graphs. In this work, we propose a simple yet effective non-local aggregation framework with an efficient attention-guided sorting for GNNs. Building upon this framework, we develop various non-local GNNs. We perform thorough experiments to analyze disassortative graph datasets and evaluate our non-local GNNs. Experimental results demonstrate that our non-local GNNs significantly outperform previous state-of-the-art methods on six benchmark datasets of disassortative graphs, in terms of both model performance and efficiency.

1. INTRODUCTION

Graph neural networks (GNNs) process graphs and map each node to an embedding vector (Zhang et al., 2018b; Wu et al., 2019). These node embeddings can be directly used for node-level applications, such as node classification (Kipf & Welling, 2017) and link prediction (Schütt et al., 2017). In addition, they can be used to learn the graph representation vector with graph pooling (Ying et al., 2018; Zhang et al., 2018a; Lee et al., 2019; Yuan & Ji, 2020), in order to fit graph-level tasks (Yanardag & Vishwanathan, 2015). Many variants of GNNs have been proposed, such as ChebNets (Defferrard et al., 2016), GCNs (Kipf & Welling, 2017), GraphSAGE (Hamilton et al., 2017), GATs (Veličković et al., 2018), LGCN (Gao et al., 2018), and GINs (Xu et al., 2019). Their advantages have been shown on various graph datasets and tasks (Errica et al., 2020). However, these GNNs share a multilayer local aggregation framework, which is similar to convolutional neural networks (CNNs) (LeCun et al., 1998) on grid-like data such as images and texts.

In recent years, the importance of non-local aggregation has been demonstrated in many applications in computer vision (Wang et al., 2018; 2020) and natural language processing (Vaswani et al., 2017). In particular, the attention mechanism has been widely explored to achieve non-local aggregation and capture long-range dependencies between distant locations. Basically, the attention mechanism measures the similarity between every pair of locations and enables information to be communicated among distant but similar locations. For graphs, non-local aggregation is likewise crucial for disassortative graphs, whereas previous studies of GNNs focus on assortative graph datasets (Section 2.2). In addition, we find that local aggregation is even harmful for some disassortative graphs (Section 4.3).

The recently proposed Geom-GCN (Pei et al., 2020) attempts to capture long-range dependencies in disassortative graphs. It contains an attention-like step that computes the Euclidean distance between every pair of nodes. However, this step is computationally prohibitive for large-scale graphs, as its computational complexity is quadratic in the number of nodes. In addition, Geom-GCN employs pre-trained node embeddings (Tenenbaum et al., 2000; Nickel & Kiela, 2017; Ribeiro et al., 2017) that are not task-specific, limiting its effectiveness and flexibility.

In this work, we propose a simple yet effective non-local aggregation framework for GNNs. At the heart of the framework lies an efficient attention-guided sorting, which enables non-local aggregation through classic local aggregation operators in general deep learning. The proposed framework can be flexibly used to augment common GNNs with low computational costs. Based on the framework, we build various efficient non-local GNNs. In addition, we perform detailed analysis on existing disassortative graph datasets and apply different non-local GNNs accordingly. Experimental results show that our non-local GNNs significantly outperform previous state-of-the-art methods on node classification tasks on six benchmark datasets of disassortative graphs.

2.1. GRAPH NEURAL NETWORKS

We focus on learning the embedding vector for each node through graph neural networks (GNNs). Most existing GNNs are inspired by convolutional neural networks (CNNs) (LeCun et al., 1998) and follow a local aggregation framework. In general, each layer of a GNN scans every node in the graph and aggregates local information from directly connected nodes, i.e., the 1-hop neighbors. Specifically, a common GNN layer performs a two-step processing similar to the depthwise separable convolution (Chollet, 2017): spatial aggregation and feature transformation. The first step updates each node embedding using the embedding vectors of spatially neighboring nodes. For example, GCNs (Kipf & Welling, 2017) and GATs (Veličković et al., 2018) compute a weighted sum of node embeddings within the 1-hop neighborhood, where the weights come from the degrees of nodes and the interactions between nodes, respectively. GraphSAGE (Hamilton et al., 2017) applies max pooling, while GINs (Xu et al., 2019) simply sum the node embeddings. The feature transformation step is similar to the 1×1 convolution, where each node embedding vector is mapped into a new feature space through a shared linear transformation (Kipf & Welling, 2017; Hamilton et al., 2017; Veličković et al., 2018) or a multilayer perceptron (MLP) (Xu et al., 2019). Different from these studies, LGCN (Gao et al., 2018) directly applies the regular convolution through top-k ranking. Nevertheless, each layer of these GNNs only aggregates local information within the 1-hop neighborhood. While stacking multiple layers can theoretically enable communication between nodes across the multi-hop neighborhood, the aggregation is essentially local. In addition, deep GNNs usually suffer from the over-smoothing problem (Xu et al., 2018; Li et al., 2018; Chen et al., 2020).
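The two-step local aggregation above can be illustrated with a minimal GCN-style layer. This is a sketch for exposition, not any paper's reference implementation; the degree-based normalization follows the formulation of Kipf & Welling (2017), and all names (`gcn_layer`, etc.) are illustrative.

```python
import numpy as np

def gcn_layer(adj, x, weight):
    """One GCN-style layer: spatial aggregation, then feature transformation.

    adj:    (n, n) binary adjacency matrix
    x:      (n, d_in) node embeddings
    weight: (d_in, d_out) shared linear transformation
    """
    n = adj.shape[0]
    # Step 1: spatial aggregation -- a degree-normalized weighted sum over
    # the 1-hop neighborhood, with self-loops added.
    a_hat = adj + np.eye(n)
    deg = a_hat.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    aggregated = d_inv_sqrt @ a_hat @ d_inv_sqrt @ x
    # Step 2: feature transformation -- a shared linear map (analogous to a
    # 1x1 convolution), followed by a ReLU nonlinearity.
    return np.maximum(aggregated @ weight, 0.0)
```

Note that the only graph structure the layer ever consults is `adj`, so a single application can never move information beyond 1-hop neighbors; multi-hop communication requires stacking layers.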

2.2. ASSORTATIVE AND DISASSORTATIVE GRAPHS

There are many kinds of graphs in the literature, such as citation networks (Kipf & Welling, 2017), community networks (Chen et al., 2020), co-occurrence networks (Tang et al., 2009), and webpage linking networks (Rozemberczki et al., 2019). We focus on graph datasets corresponding to node classification tasks. In particular, we categorize graph datasets into assortative and disassortative ones (Newman, 2002; Ribeiro et al., 2017) according to the node homophily in terms of labels, i.e., how likely nodes with the same label are to be near each other in the graph. Assortative graphs refer to those with high node homophily; common assortative graph datasets are citation networks and community networks. On the other hand, graphs in disassortative graph datasets contain more nodes that have the same label but are distant from each other; example disassortative graph datasets are co-occurrence networks and webpage linking networks. As introduced above, most existing GNNs perform local aggregation only and achieve good performance on assortative graphs (Kipf & Welling, 2017; Hamilton et al., 2017; Veličković et al., 2018; Gao et al., 2018). However, they may fail on disassortative graphs, where informative nodes in the same class tend to lie outside the local multi-hop neighborhood and non-local aggregation is needed. Thus, in this work, we explore non-local GNNs.
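One simple way to quantify the homophily notion above is edge homophily: the fraction of edges whose endpoints share a label. This is a minimal sketch under that assumption; the exact metric varies across benchmarks (e.g., Pei et al. (2020) use a node-level variant), and the function name is illustrative.

```python
def edge_homophily(edges, labels):
    """Fraction of edges connecting two nodes with the same label.

    edges:  iterable of (u, v) node-index pairs
    labels: sequence mapping node index -> class label
    Assortative graphs score close to 1; disassortative graphs score low.
    """
    edges = list(edges)
    if not edges:
        return 0.0
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)
```

For instance, a graph whose every edge joins differently labeled nodes (a perfectly disassortative case, such as a bipartite interaction graph) scores 0, and local aggregation there mixes mostly cross-class information.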

2.3. ATTENTION MECHANISM

The attention mechanism (Vaswani et al., 2017) has been widely used in GNNs (Veličković et al., 2018; Gao & Ji, 2019; Knyazev et al., 2019) as well as other deep learning models (Yang et al., 2016; Wang et al., 2018; 2020). A typical attention mechanism takes three groups of vectors as inputs, namely a query vector q, key vectors (k_1, k_2, ..., k_n), and value vectors (v_1, v_2, ..., v_n). Note that key and value vectors have a one-to-one correspondence and can sometimes be identical. The attention mechanism computes the output vector o as

a_i = ATTEND(q, k_i) ∈ R, i = 1, 2, ..., n;  o = Σ_{i=1}^{n} a_i v_i,

where the ATTEND(·) function could be any function that outputs a scalar attention score a_i from the interaction between q and k_i, such as the dot product (Gao & Ji, 2019) or even a neural net-
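The computation above can be sketched as follows with ATTEND(·) instantiated as the dot product. The softmax normalization of the scores is a common choice (the text leaves ATTEND(·) and any normalization general), and the function name is illustrative.

```python
import numpy as np

def attention(q, keys, values):
    """Dot-product attention: a_i = ATTEND(q, k_i), o = sum_i a_i v_i.

    q:      (d,) query vector
    keys:   (n, d) key vectors
    values: (n, d_v) value vectors, one per key
    """
    scores = keys @ q                 # a_i = q . k_i for every key
    scores = scores - scores.max()    # shift for numerical stability
    a = np.exp(scores)
    a = a / a.sum()                   # softmax-normalized attention scores
    return a @ values                 # o = sum_i a_i v_i
```

Because every query attends to every key, applying this over all n nodes of a graph costs O(n^2) score evaluations, which is exactly the scalability issue noted for Geom-GCN's pairwise distance step.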

