DEEP GRAPH NEURAL NETWORKS WITH SHALLOW SUBGRAPH SAMPLERS

Abstract

While Graph Neural Networks (GNNs) are powerful models for learning representations on graphs, most state-of-the-art models show no significant accuracy gain beyond two to three layers. Deep GNNs fundamentally need to address (1) the expressivity challenge due to oversmoothing, and (2) the computation challenge due to neighborhood explosion. We propose a simple "deep GNN, shallow sampler" design principle to improve both GNN accuracy and efficiency: to generate the representation of a target node, we use a deep GNN to pass messages only within a shallow, localized subgraph. A properly sampled subgraph may exclude irrelevant or even noisy nodes, while still preserving the critical neighbor features and graph structures. The deep GNN then smooths the informative local signals to enhance feature learning, rather than oversmoothing the global graph signals into "white noise". We theoretically justify why the combination of deep GNNs with shallow samplers yields the best learning performance. We then propose various sampling algorithms and neural architecture extensions to achieve good empirical results. Experiments on five large graphs show that our models achieve significantly higher accuracy and efficiency than the state-of-the-art.

1. INTRODUCTION

Graph Neural Networks (GNNs) have become the state-of-the-art models for graph mining (Wu et al., 2020; Hamilton et al., 2017b; Zhang et al., 2019), facilitating applications such as social recommendation (Monti et al., 2017; Ying et al., 2018; Pal et al., 2020), knowledge understanding (Schlichtkrull et al., 2018; Park et al., 2019; Zhang et al., 2020) and drug discovery (Stokes et al., 2020; Lo et al., 2018). Despite the numerous architectures proposed (Kipf & Welling, 2016; Hamilton et al., 2017a; Veličković et al., 2018), it still remains an open question how to effectively design deep GNNs. There are two fundamental obstacles intrinsic to the underlying graph structure:

• Expressivity challenge: deep GNNs tend to oversmooth (Li et al., 2018). They collapse embeddings of different nodes into a fixed low-dimensional subspace after repeated neighbor mixing.
• Computation challenge: deep GNNs recursively expand the adjacent nodes along message passing edges. The neighborhood size may grow exponentially with the model depth (Chen et al., 2017).

Due to oversmoothing, one of the most popular GNN architectures, the Graph Convolutional Network (GCN) (Kipf & Welling, 2016), has been theoretically proven incapable of scaling to deep layers (Oono & Suzuki, 2020; Rong et al., 2020; Huang et al., 2020). Remedies to overcome the GCN limitations are twofold. From the neural architecture perspective, researchers are actively seeking more expressive neighbor aggregation operations (Veličković et al., 2018; Hamilton et al., 2017a; Xu et al., 2018a), or transferring design components (such as residual connections) from deep CNNs to GNNs (Xu et al., 2018b; Li et al., 2019; Huang et al., 2018). From the data perspective, various works (Klicpera et al., 2019a;b; Bojchevski et al., 2020) revisit classic graph analytic algorithms to reconstruct a graph with better topological properties.
The two kinds of works can also be combined to jointly improve the quality of message passing in deep GNNs. All the above GNN variants take a "global" view on the input graph G(V, E), i.e., all nodes are considered as belonging to the same G, whose size can often be massive. To generate the node embedding, no matter how we modify the architecture or the graph structure, a deep enough GNN would always propagate influence from the entire node set V into a single target node. Intuitively, for a large graph, most nodes in V barely provide any useful information to the target node. We thus regard such a "global view" on G as one of the root causes of both the expressivity and computation challenges discussed above. In this work, for the node embedding task, we take an alternative "local view" and interpret the GNN input as V = ⋃_{v∈V} V[v] and E = ⋃_{v∈V} E[v]. In other words, each target node v belongs to some small graph G[v] capturing the characteristics of only the node v. The entire input graph G is observed as the union of all such local yet latent G[v]. This simple global-to-local switch of perspective enables us to address both the expressivity and computation challenges without resorting to alternative GNN architectures or reconstructing the graph.

Present work: SHADOW-GNN. We propose a "deep GNN, shallow sampler" design principle that improves the expressive power and inference efficiency of various GNN architectures. We break the conventional thinking that an L-layer (deep) GNN has to aggregate L-hop (faraway) neighbors. We argue that the GNN receptive field for a target node should be shallower than the GNN depth. In other words, an L-layer GNN should only operate on a small subgraph G[v] surrounding the target node v, where G[v] consists of (part of) the L0-hop neighborhood. The deep vs. shallow comparison is reflected by setting L0 < L. We name such a GNN on G[v] a SHADOW-GNN. We justify our design principle from two aspects.
Firstly, why do we need the neighborhood to be shallow? As a motivating example, the average number of 4-hop neighbors for the ogbn-products graph (Hu et al., 2020) is 0.6M, corresponding to 25% of the full graph size. Blindly encoding the 0.6M node features into a single embedding vector can create an "information bottleneck" (Alon & Yahav, 2020). The irrelevant information from the majority of the 0.6M nodes may also "dilute" the truly useful signals from a small set of close neighbors. A simple solution to both issues is to manually create a shallow neighborhood by subgraph sampling. The second question regarding SHADOW-GNN is: why do we still need deep GNNs? Using more layers than the number of hops means the same pair of nodes may exchange messages with each other multiple times. Intuitively, this helps the GNN better absorb the subgraph information. Theoretically, we prove that a GNN deeper than the number of hops of the subgraph can be more powerful than the 1-dimensional Weisfeiler-Lehman test (Shervashidze et al., 2011). A shallow GNN, on the contrary, cannot accurately learn certain simple functions such as the unweighted mean of the shallow-neighborhood features. Note that with GCN as the backbone, a SHADOW-GCN still performs signal smoothing in each layer. However, the important distinction is that a deep GCN smooths the full G regardless of the target node, while a SHADOW-GCN constructs a customized smoothing domain G[v] for each target v. The variation across the smoothing domains created by SHADOW-GCN encourages variation among the node embedding vectors. With this intuition, our analysis shows that SHADOW-GNN does not oversmooth. Finally, since the sizes of the shallow neighborhoods are independent of the GNN depth, the computation challenge due to neighborhood explosion is automatically addressed.
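The decoupling of model depth L from neighborhood depth L0 described above can be sketched in a few lines. This is a minimal NumPy illustration of the design principle only, not the paper's implementation: the adjacency-list format, the mean-aggregation layer, and the weight matrices are hypothetical placeholders.

```python
import numpy as np
from collections import deque

def khop_subgraph(adj_list, target, num_hops):
    """Collect all nodes within `num_hops` of `target` via BFS (the shallow scope L0)."""
    seen = {target}
    frontier = deque([(target, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == num_hops:
            continue
        for nbr in adj_list[node]:
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, dist + 1))
    return sorted(seen)

def shadow_embed(adj_list, feats, target, num_hops, weights):
    """Run a deep GNN (len(weights) layers) on the shallow subgraph G[v].

    Depth L = len(weights) may exceed L0 = num_hops: messages keep
    circulating inside the small subgraph instead of reaching farther hops.
    """
    nodes = khop_subgraph(adj_list, target, num_hops)
    idx = {n: i for i, n in enumerate(nodes)}
    # Row-normalized adjacency (with self-loops), restricted to the subgraph.
    A = np.eye(len(nodes))
    for u in nodes:
        for v in adj_list[u]:
            if v in idx:
                A[idx[u], idx[v]] = 1.0
    A /= A.sum(axis=1, keepdims=True)
    H = feats[nodes]
    for W in weights:                   # L layers of mean-aggregate + transform
        H = np.maximum(A @ H @ W, 0.0)  # ReLU nonlinearity
    return H[idx[target]]               # embedding of the target node only
```

Note that the subgraph size, and hence the per-node cost, is fixed by `num_hops` (and the sampler's budget) rather than by the number of layers, which is exactly how the neighborhood-explosion problem is avoided.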
We propose various subgraph samplers for SHADOW-GNN, including the simplest k-hop sampler and a sampler based on personalized PageRank, to improve inference accuracy and computation efficiency. In experiments on five standard benchmarks, our SHADOW-SAGE and SHADOW-GAT models achieve significant accuracy gains over the original GraphSAGE and GAT models, while the inference cost is reduced by orders of magnitude.
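As a rough illustration of a personalized-PageRank-based sampler, the PPR scores relative to the target node can be computed by power iteration, and the top-scoring nodes kept as the subgraph G[v]. The teleport probability, node budget, and iteration count below are illustrative choices, not the paper's tuned values.

```python
import numpy as np

def ppr_topk_sampler(A, target, alpha=0.15, budget=4, num_iters=50):
    """Select the `budget` nodes with the highest personalized-PageRank score
    relative to `target`, computed by power iteration.

    A: dense symmetric adjacency matrix (n x n); alpha: teleport probability.
    Returns the sorted node ids of the sampled subgraph and the score vector.
    """
    n = A.shape[0]
    deg = A.sum(axis=1)
    P = A / np.maximum(deg, 1)[:, None]   # row-stochastic transition matrix
    e = np.zeros(n)
    e[target] = 1.0                       # restart distribution: the target node
    pi = e.copy()
    for _ in range(num_iters):            # contraction with factor (1 - alpha)
        pi = alpha * e + (1 - alpha) * (P.T @ pi)
    keep = np.argsort(-pi)[:budget]       # highest-scoring nodes form G[v]
    return np.sort(keep), pi
```

Unlike the k-hop sampler, this selection is not constrained to a fixed hop count: a well-connected node a few hops away can outscore a weakly-connected direct neighbor, which matches the intent of keeping only the most relevant nodes.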

2. RELATED WORK AND PRELIMINARIES

Deep GNNs. Recently, numerous GNN models (Kipf & Welling, 2016; Defferrard et al., 2016; Hamilton et al., 2017a; Veličković et al., 2018; Xu et al., 2018b;a) have been proposed. In general, the input to a GNN is the graph G, and the outputs are representation vectors for each node, capturing both the feature and structural information of the neighborhood. Most state-of-the-art GNNs use shallow models (i.e., 2 to 3 layers). As first proposed by Li et al. (2018) and further elaborated by Luan et al. (2019); Oono & Suzuki (2020); Zhao & Akoglu (2020); Huang et al. (2020), one of the major challenges in deepening GNNs is the "oversmoothing" of node features: each layer of aggregation pushes the neighbor features towards similar values, and repeated aggregation over many layers results in node features being averaged over the full graph. A deep GNN may thus generate indistinguishable embeddings for different nodes. Viewing oversmoothing as a limitation of the layer aggregation, researchers have developed alternative architectures. AS-GCN (Huang et al., 2018), DeepGCN (Li et al., 2019) and JK-net (Xu et al., 2018b) use skip-connections across layers. MixHop (Abu-El-Haija et al., 2019), Snowball (Luan et al., 2019) and DAGNN (Liu et al., 2020) enable multi-hop message passing.
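The oversmoothing effect described above is easy to observe numerically: repeatedly applying a degree-normalized neighbor-averaging step, i.e., the pure propagation part of a GCN layer with the weights and nonlinearity stripped away, drives all node features toward the same vector. The toy 4-node graph and features below are hypothetical, chosen only to make the collapse visible.

```python
import numpy as np

def smooth(A_hat, X, num_layers):
    """Apply `num_layers` rounds of normalized neighbor averaging
    (no weights, no nonlinearity): the propagation part of a GCN layer."""
    for _ in range(num_layers):
        X = A_hat @ X
    return X

# Hypothetical 4-node path graph with self-loops, row-normalized.
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)
A_hat = A / A.sum(axis=1, keepdims=True)
X = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 2.0], [-3.0, 4.0]])

shallow = smooth(A_hat, X, 2)    # node rows still differ noticeably
deep = smooth(A_hat, X, 100)     # node rows nearly identical: oversmoothed
spread_shallow = np.ptp(shallow, axis=0).max()   # max feature range across nodes
spread_deep = np.ptp(deep, axis=0).max()
```

After 100 propagation steps the per-feature spread across nodes collapses to essentially zero, while after 2 steps the nodes remain clearly distinguishable, which is the depth-dependent collapse the oversmoothing literature describes.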

