SCALABLE GRAPH NEURAL NETWORKS FOR HETEROGENEOUS GRAPHS

Abstract

Graph neural networks (GNNs) are a popular class of parametric models for learning over graph-structured data. Recent work has argued that GNNs primarily use the graph for feature smoothing, and has shown competitive results on benchmark tasks by simply operating on graph-smoothed node features, rather than using end-to-end learned feature hierarchies that are challenging to scale to large graphs. In this work, we ask whether these results can be extended to heterogeneous graphs, which encode multiple types of relationship between different entities. We propose Neighbor Averaging over Relation Subgraphs (NARS), which trains a classifier on neighbor-averaged features for randomly-sampled subgraphs of the "metagraph" of relations. We describe optimizations that allow these sets of node features to be computed in a memory-efficient way, both at training and inference time. NARS achieves a new state-of-the-art accuracy on several benchmark datasets, outperforming more expensive GNN-based methods.

1. INTRODUCTION

In recent years, deep learning on graphs has attracted a great deal of interest, with new applications ranging from social networks and recommender systems to biomedicine, scene understanding, and the modeling of physics (Wu et al., 2020). One popular branch of graph learning is based on the idea of stacking learned "graph convolutional" layers that perform feature transformation and neighbor aggregation (Kipf & Welling, 2017), and has led to an explosion of variants collectively referred to as Graph Neural Networks (GNNs) (Hamilton et al., 2017; Xu et al., 2018; Velickovic et al., 2018). Most benchmarks for learning on graphs focus on very small graphs, but the relevance of such models to large-scale social network and e-commerce datasets was quickly recognized (Ying et al., 2018). Since the computational cost of GNN training and inference scales poorly with graph size, a number of sampling approaches have been proposed that improve the time and memory cost of GNNs by operating on subsets of graph nodes or edges (Hamilton et al., 2017; Chen et al., 2017; Zou et al., 2019; Zeng et al., 2019; Chiang et al., 2019). Recently, several papers have argued that on a range of benchmark tasks - social network and e-commerce tasks in particular - GNNs primarily derive their benefits from performing feature smoothing over graph neighborhoods, rather than from learning non-linear hierarchies of features as implied by the analogy to CNNs (Wu et al., 2019; NT & Maehara, 2019; Chen et al., 2019; Rossi et al., 2020). Surprisingly, Rossi et al. (2020) demonstrate that a one-layer MLP operating on concatenated N-hop averaged features, which they call the Scalable Inception Graph Network (SIGN), performs competitively with state-of-the-art GNNs on large web datasets while being more scalable and simpler to use than sampling approaches. Neighbor-averaged features can be precomputed, reducing GNN training and inference to a standard classification task.
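As a concrete illustration of this neighbor-averaging idea, the N-hop averaged features used by SIGN-style models can be precomputed with nothing more than repeated multiplication by a normalized adjacency matrix. The sketch below uses dense NumPy matrices and a hypothetical `neighbor_averaged_features` helper for clarity; a real implementation would use sparse matrices and the paper's exact normalization.

```python
import numpy as np

def neighbor_averaged_features(adj, x, num_hops):
    """Precompute 0..num_hops-hop neighbor-averaged features (SIGN-style sketch).

    adj: dense adjacency matrix, shape (n, n).
    x:   node features, shape (n, d).
    Returns the list [X, A_hat X, A_hat^2 X, ...] of length num_hops + 1,
    where A_hat is the row-normalized adjacency (mean over neighbors).
    """
    deg = adj.sum(axis=1, keepdims=True)
    a_hat = adj / np.maximum(deg, 1.0)  # row-normalize; guard isolated nodes
    feats = [x]
    for _ in range(num_hops):
        feats.append(a_hat @ feats[-1])  # one more hop of neighbor averaging
    return feats
```

Because these matrix products involve no learned parameters, they can be computed once up front; training then reduces to fitting a standard classifier (e.g. a one-layer MLP) on the concatenated features.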
However, in practice the large graphs used in web-scale classification problems are often heterogeneous, encoding many types of relationship between different entities (Lerer et al., 2019). While GNNs extend naturally to these multi-relation graphs (Schlichtkrull et al., 2018) and specialized methods further improve the state-of-the-art on them (Hu et al., 2020b; Wang et al., 2019b), it is not clear how to extend neighbor-averaging approaches like SIGN to these graphs. In this work, we investigate whether neighbor-averaging approaches can be applied to heterogeneous graphs (HGs). We propose Neighbor Averaging over Relation Subgraphs (NARS), which computes neighbor-averaged features for random subsets of relation types, and combines them into a single set of features for a classifier using a 1D convolution. We find that this scalable approach exceeds the accuracy of state-of-the-art GNN methods for heterogeneous graphs on tasks from three benchmark datasets. Figure 1 illustrates the NARS pipeline: sampling relation subsets from the relation set of a heterogeneous graph, building the corresponding relation subgraphs, computing neighbor-averaged features on each, and combining them with 1D convolutions. One downside of NARS is that it requires a large amount of memory to store node features for many random subgraphs. We describe an approximate version that fixes the memory scaling issue, and show that it does not degrade accuracy on benchmark tasks.
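A minimal sketch of the NARS feature-generation step might look as follows. The function name `nars_features`, the dense NumPy adjacency matrices, and the merging of a sampled relation subset by summing its adjacency matrices are illustrative assumptions, not the paper's exact implementation; the learned 1D convolution that aggregates the per-subset features is omitted here.

```python
import numpy as np

def nars_features(relation_adjs, x, num_subsets, subset_size, num_hops, rng):
    """Sketch of NARS feature generation over sampled relation subgraphs.

    relation_adjs: dict mapping relation name -> dense adjacency, shape (n, n).
    x:             node features, shape (n, d).
    Returns an array of shape (num_subsets, num_hops + 1, n, d): for each
    sampled relation subset, the 0..num_hops-hop neighbor-averaged features
    on the merged relation subgraph.
    """
    names = list(relation_adjs)
    all_feats = []
    for _ in range(num_subsets):
        # Sample a subset of relation types without replacement.
        chosen = rng.choice(len(names), size=subset_size, replace=False)
        # Merge the chosen relations into a single subgraph (illustrative).
        adj = sum(relation_adjs[names[i]] for i in chosen)
        deg = adj.sum(axis=1, keepdims=True)
        a_hat = adj / np.maximum(deg, 1.0)  # row-normalized adjacency
        feats = [x]
        for _ in range(num_hops):
            feats.append(a_hat @ feats[-1])  # neighbor averaging per hop
        all_feats.append(np.stack(feats))
    return np.stack(all_feats)
```

In the full method, a 1D convolution then learns, for each hop, a weighted combination of these per-subset features before they are fed to the classifier.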

2. BACKGROUND

Graph Neural Networks are a type of neural model for graph data that uses graph structure to transform input node features into a more predictive representation for a supervised task. A popular flavor of graph neural network consists of stacked layers of operators composed of learned transformations and neighbor aggregation. These "message-passing" GNNs were inspired by spectral notions of graph convolution (Bruna et al., 2014; Defferrard et al., 2016; Kipf & Welling, 2017). Consider a graph G with n vertices and adjacency matrix A ∈ R^{n×n}. A graph convolution g ⋆ x of node features x by a filter g is defined as multiplication by g in the graph Fourier basis, just as a standard convolution is a multiplication in Fourier space. The Fourier basis for a graph is defined as the eigenvectors U of the normalized Laplacian, and can be thought of as a basis of functions of varying smoothness over the graph: g ⋆ x = U g Uᵀ x. Any convolution g can be approximated by a series of k-th order polynomials in the Laplacian, which depend on neighbors within a k-hop radius (Hammond et al., 2011). By limiting this approximation to k = 1, Kipf & Welling (2017) arrive at an operation that consists of multiplying node features by the normalized adjacency matrix, i.e. averaging each node's neighbor features. Such an operation can be viewed as a graph convolution by a particular smoothing kernel. A Graph Convolutional Network (GCN) is constructed by stacking multiple layers, each with a neighbor averaging step followed by a linear transformation. Many variants of this approach of stacked message-passing layers have since been proposed with different aggregation functions and for different applications (Velickovic et al., 2018; Xu et al., 2018; Hamilton et al., 2017; Schlichtkrull et al., 2018). Early GNN work focused on tasks with small graphs (thousands of nodes), and it is not straightforward to scale these methods to large-scale graphs.
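For reference, a single GCN layer as described above (neighbor averaging with a normalized adjacency matrix, followed by a linear transformation and a nonlinearity) can be sketched as follows. The dense NumPy matrices, the self-loop handling, and the ReLU nonlinearity are illustrative assumptions rather than the only possible choices.

```python
import numpy as np

def gcn_layer(adj, h, w):
    """One GCN layer in the style of Kipf & Welling: ReLU(A_hat @ h @ w).

    adj: adjacency matrix, shape (n, n); self-loops and symmetric
         normalization D^{-1/2} (A + I) D^{-1/2} are applied here.
    h:   input node representations, shape (n, d_in).
    w:   learned weight matrix, shape (d_in, d_out).
    """
    a = adj + np.eye(adj.shape[0])               # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))    # D^{-1/2}
    a_hat = a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_hat @ h @ w, 0.0)        # aggregate, transform, ReLU
```

A full GCN stacks several such layers, so each additional layer extends a node's receptive field by one hop.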
Applying neighbor aggregation by directly multiplying node features by the sparse adjacency matrix at each training step is computationally expensive and does not permit minibatch training. On the other hand, applying a GCN to a minibatch of labeled nodes requires aggregation over a receptive field (neighborhood) of diameter d equal to the GCN depth, whose size can grow exponentially in d. Recent work on scaling GNNs to very large graphs has focused on training the GNN on sampled subsets of neighbors or subgraphs to alleviate the computation and memory cost (Hamilton et al., 2017; Chen et al., 2017; Zou et al., 2019; Zeng et al., 2019; Chiang et al., 2019).



Figure 1: Neighbor Averaging over Relation Subgraphs on heterogeneous graph G. G has three node types: Paper (P), Author (A), and Field (F), and three relation types: Paper cites Paper (P→P), Paper belongs-to Field (P→F), Author writes Paper (A→P).

