NEIGHBOR2SEQ: DEEP LEARNING ON MASSIVE GRAPHS BY TRANSFORMING NEIGHBORS TO SEQUENCES

Abstract

Modern graph neural networks (GNNs) use a message passing scheme and have achieved great success in many fields. However, this recursive design inherently leads to excessive computation and memory requirements, making these models inapplicable to massive real-world graphs. In this work, we propose Neighbor2Seq, which transforms the hierarchical neighborhood of each node into a sequence. This novel transformation enables the subsequent use of general deep learning operations, such as convolution and attention, that are designed for grid-like data. Because the Neighbor2Seq transformation can be precomputed, it naturally endows GNNs with the efficiency and advantages of deep learning operations on grid-like data. In addition, Neighbor2Seq can alleviate the over-squashing issue suffered by message passing GNNs. We evaluate our method on a massive graph, with more than 111 million nodes and 1.6 billion edges, as well as several medium-scale graphs. Results show that our proposed method is scalable to massive graphs and achieves superior performance across massive and medium-scale graphs.

1. INTRODUCTION

Graph neural networks (GNNs) have shown effectiveness in many fields with rich relational structures, such as citation networks (Kipf & Welling, 2016; Veličković et al., 2018), social networks (Hamilton et al., 2017), drug discovery (Gilmer et al., 2017; Stokes et al., 2020), physical systems (Battaglia et al., 2016), and point clouds (Wang et al., 2019). Most current GNNs follow a message passing scheme (Gilmer et al., 2017; Battaglia et al., 2018), in which the representation of each node is recursively updated by aggregating the representations of its neighbors. Various GNNs (Li et al., 2016; Kipf & Welling, 2016; Veličković et al., 2018; Xu et al., 2019) differ mainly in the forms of their aggregation functions. Real-world applications usually generate massive graphs, such as social networks. However, message passing methods have difficulty handling such large graphs, as the recursive message passing mechanism leads to prohibitive computation and memory requirements. To date, sampling methods (Hamilton et al., 2017; Ying et al., 2018; Chen et al., 2018a;b; Huang et al., 2018; Zou et al., 2019; Gao et al., 2018; Chiang et al., 2019; Zeng et al., 2020) and precomputing methods (Wu et al., 2019; Rossi et al., 2020; Bojchevski et al., 2020) have been proposed to scale GNNs to large graphs. While sampling methods can speed up training, they may introduce redundancy, still incur high computational complexity, lose performance, or introduce bias (see Section 2.2). Precomputing methods generally scale to larger graphs than sampling methods, since sampling methods still require recursive message passing.

In this work, we propose Neighbor2Seq, which transforms the hierarchical neighborhood of each node into a sequence in a precomputing step. After the Neighbor2Seq transformation, each node and its associated neighborhood tree are converted to an ordered sequence.
Therefore, each node can be viewed as an independent sample and is no longer constrained by the topological structure. This novel transformation from graphs to grid-like data enables mini-batch training for subsequent models. As a result, our models can be applied to extremely large graphs, as long as the Neighbor2Seq step can be precomputed. As a radical departure from existing precomputing methods, we treat the hierarchical neighborhood of each node as an ordered sequence, where the order corresponds to the number of hops between nodes. Thanks to this transformation, generic deep learning operations for grid-like data, such as convolution and attention, can be applied in subsequent models. In addition, Neighbor2Seq can alleviate the over-squashing issue (Alon & Yahav, 2020) suffered by current GNNs. Experimental results indicate that our proposed method can be applied to a massive graph where most current methods cannot, and that it achieves superior performance compared with previous sampling and precomputing methods.
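The core idea admits a simple sketch. The snippet below is an illustrative precomputation, not necessarily the paper's exact operator: the function name `neighbor2seq` and the mean-style neighbor normalization are our own assumptions. It builds, for each node, an ordered sequence whose k-th entry aggregates that node's k-hop neighborhood; the resulting tensor can then be fed to any sequence model (convolution, attention) with ordinary mini-batching.

```python
import numpy as np

def neighbor2seq(adj, X, num_hops):
    """Precompute a per-node sequence of hop-wise aggregated features.

    adj: (n, n) adjacency matrix; X: (n, d) node feature matrix.
    Returns an array of shape (n, num_hops + 1, d): for each node, an
    ordered sequence whose k-th entry aggregates its k-hop neighborhood.
    """
    # Row-normalize the adjacency so each hop averages over neighbors
    # (an illustrative choice; other normalizations are possible).
    deg = adj.sum(axis=1, keepdims=True)
    A_hat = adj / np.maximum(deg, 1)

    seq = [X]
    H = X
    for _ in range(num_hops):
        H = A_hat @ H            # one more hop of aggregation
        seq.append(H)
    # Stack along a new "hop" axis; each node is now an independent sample.
    return np.stack(seq, axis=1)

# Tiny example: a path graph 0-1-2 with 2-dimensional features.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
X = np.eye(3)[:, :2]
S = neighbor2seq(adj, X, num_hops=2)
print(S.shape)  # (3, 3, 2)
```

Because this step involves no trainable parameters, it can be run once offline over the whole graph; afterwards, each node's sequence is a self-contained training sample.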

2. ANALYSIS OF CURRENT GRAPH NEURAL NETWORK METHODS

We start by defining necessary notations. A graph is formally defined as G = (V, E), where V is the set of nodes and E ⊆ V × V is the set of edges. We use n = |V| and m = |E| to denote the numbers of nodes and edges, respectively. The nodes are indexed from 1 to n. We consider a node feature matrix X ∈ R^{n×d}, where each row x_i ∈ R^d is the d-dimensional feature vector associated with node i. The topology information of the graph is encoded in the adjacency matrix A ∈ R^{n×n}, where A_{ij} = 1 if an edge exists between node i and node j, and A_{ij} = 0 otherwise.
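As a concrete illustration of this notation, the snippet below builds A and X for a hypothetical toy graph; the edge list and dimensions are made up for the example.

```python
import numpy as np

# Hypothetical toy graph: n = 4 nodes (0-indexed here), undirected edges E.
edges = [(0, 1), (1, 2), (2, 3)]
n, d = 4, 3

# Adjacency matrix A in R^{n x n}: A[i, j] = 1 iff (i, j) is an edge.
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0   # undirected graph, so A is symmetric

# Feature matrix X in R^{n x d}: row x_i holds node i's features.
X = np.random.rand(n, d)

assert np.allclose(A, A.T)            # symmetry of undirected adjacency
assert int(A.sum()) == 2 * len(edges) # each edge appears twice, m = |E|
```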

2.1. GRAPH NEURAL NETWORKS VIA MESSAGE PASSING

Deep learning methods on graphs fall into two primary families (Bronstein et al.): spectral methods and spatial methods. The spectral method of Bruna et al. (2014) extends convolutional neural networks (LeCun et al., 1989) to the graph domain based on the spectrum of the graph Laplacian. The main limitation of spectral methods is their high computational complexity. ChebNet (Defferrard et al., 2016) and GCN (Kipf & Welling, 2016) simplify spectral methods and can also be understood from the spatial perspective. In this work, we focus on analyzing the current mainstream spatial methods. Generally, most existing spatial methods, such as ChebNet (Defferrard et al., 2016), GCN (Kipf & Welling, 2016), GG-NN (Li et al., 2016), GAT (Veličković et al., 2018), and GIN (Xu et al., 2019), can be understood from the message passing perspective (Gilmer et al., 2017; Battaglia et al., 2018). Specifically, the representation of each node is iteratively updated by aggregating the representations of its immediate neighbors. These message passing methods have been shown to be effective in many fields. However, they have inherent difficulties when applied to large graphs due to their excessive computation and memory requirements, as described in Section 2.2.
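A minimal sketch of one such message passing step is given below, assuming sum aggregation with self-loops, a shared linear transform, and a ReLU nonlinearity; these specific choices are illustrative, and the GNNs cited above differ mainly in the aggregation function they use.

```python
import numpy as np

def message_passing_layer(A, H, W):
    """One generic message passing step.

    Each node aggregates (here, sums) the representations of its
    neighbors and itself, then applies a shared linear transform W
    followed by a ReLU nonlinearity.
    """
    A_self = A + np.eye(A.shape[0])   # add self-loops: keep own message
    M = A_self @ H                    # sum aggregation over neighbors
    return np.maximum(M @ W, 0)       # linear transform + ReLU

rng = np.random.default_rng(0)
A = np.array([[0, 1],
              [1, 0]], dtype=float)   # two connected nodes
H = rng.standard_normal((2, 4))       # current node representations
W = rng.standard_normal((4, 8))       # layer weight matrix
H_next = message_passing_layer(A, H, W)
print(H_next.shape)  # (2, 8)
```

Stacking k such layers makes each node's representation depend on its entire k-hop neighborhood, which is exactly the recursive expansion that becomes prohibitive on large graphs.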

2.2. GRAPH NEURAL NETWORKS ON LARGE GRAPHS

The above message passing methods are often trained in full batch. This requires the whole graph, i.e., all node representations and edge connections, to reside in memory so that recursive message passing can be performed over the entire graph. Because the number of neighbors typically grows very rapidly as the receptive field expands, these methods cannot be applied directly to large-scale graphs due to the prohibitive computation and memory requirements. To enable deep learning on large graphs, two families of methods have been proposed: sampling-based and precomputing-based methods.

To circumvent the recursive expansion of neighbors across layers, sampling methods apply GNNs to a sampled subset of nodes with mini-batch training. Sampling methods can be further divided into three categories. First, node-wise sampling methods perform message passing for each node within its sampled neighborhood. This strategy was first proposed in GraphSAGE (Hamilton et al., 2017), where neighbors are sampled uniformly at random. PinSAGE (Ying et al., 2018) extends this by selecting neighbors based on random walks, and VR-GCN (Chen et al., 2018a) further uses variance reduction techniques to obtain a convergence guarantee. Although these node-wise sampling methods reduce computation, the remaining computation is still expensive and redundancy may be introduced, as described in Huang et al. (2018). Second, layer-wise sampling methods sample a fixed number of nodes for each layer. In particular, FastGCN (Chen et al., 2018b) samples a fixed number of nodes for each layer independently based on node degrees. AS-GCN (Huang et al., 2018) and LADIES (Zou et al., 2019) introduce between-layer dependencies during sampling, thus alleviating the loss of information. Layer-wise sampling methods can avoid the redundancy introduced by node-wise sampling methods.
However, the expensive sampling algorithms that aim to ensure performance may themselves incur high computational cost,

