LOCAL CLUSTERING GRAPH NEURAL NETWORKS

Abstract

Graph Neural Networks (GNNs), which benefit various real-world problems and applications, have emerged as a powerful technique for learning graph representations. The depth of a GNN model, denoted by K, restricts the receptive field of a node to its K-hop neighbors and plays a subtle role in the performance of GNNs. Recent works demonstrate how different choices of K produce a trade-off between increasing representation capacity and avoiding over-smoothing. We establish a theoretical connection between GNNs and local clustering, showing that short random walks in GNNs have a high probability of being stuck in a local cluster. Based on this theoretical analysis, we propose Local Clustering Graph Neural Networks (LCGNN), a GNN learning paradigm that utilizes local clustering to efficiently search for small but compact subgraphs for GNN training and inference. Compared to full-batch GNNs, sampling-based GNNs, and graph partition-based GNNs, LCGNN performs comparably or even better, achieving state-of-the-art results on four Open Graph Benchmark (OGB) datasets. The locality of LCGNN allows it to scale to graphs with 100M nodes and 1B edges on a single GPU.

1. INTRODUCTION

The recent emergence of Graph Neural Networks (GNNs), exemplified by models like ChebyNet (Defferrard et al., 2016), GCN (Kipf & Welling, 2017), GraphSAGE (Hamilton et al., 2017), GAT (Veličković et al., 2018), and GIN (Xu et al., 2019), has drastically reshaped the landscape of graph learning research. These methods generalize traditional deep learning algorithms to model graph-structured data by combining graph propagation and neural networks. Despite their conceptual simplicity, GNNs have established new state-of-the-art results in various graph learning tasks, such as node classification, link prediction, and graph classification (Hu et al., 2020; Dwivedi et al., 2020), and have served as key contributors to many real-world applications, such as recommendation systems (Ying et al., 2018), smart transportation (Luo et al., 2020), visual question answering (Teney et al., 2017), and molecular de-novo design (You et al., 2018). With the growth of real-world social and information networks (Leskovec et al., 2005), there is an urgent need to scale GNNs to massive graphs. For example, the recommendation systems in Alibaba (Zhu et al., 2019) and Pinterest (Ying et al., 2018) require training and inferring GNNs on graphs with billions of edges. Building such large-scale GNNs, however, is a notoriously expensive process. For instance, the GNN models in Pinterest are trained on a 500GB machine with 16 Tesla K80 GPUs, and served on a Hadoop cluster with 378 d2.8xlarge Amazon AWS machines. Although one may think model parameters are the main contributors to the huge resource consumption of GNNs, previous work (Ma et al., 2019) suggests the main bottleneck actually comes from the entanglement between graph propagation and neural networks, which leads to a large and irregular computation graph for GNNs.
This problem is further exacerbated by the small-world phenomenon (Watts & Strogatz, 1998), i.e., even a small number of graph propagation steps can involve full-graph computation. For example, in the Facebook college graph of Johns Hopkins (Traud et al., 2012), the 2-hop neighbors of node 1, as shown in Fig. 1a, cover 74.5% of the whole graph. A common strategy to reduce the overhead of GNNs is to make the graph smaller, but doing so may bring side effects. For instance, graph sampling techniques, such as the neighborhood sampling in GraphSAGE (Hamilton et al., 2017), may lead to the high-variance issue (Chen et al., 2018a). Alternatively, graph partition techniques, such as METIS (Karypis & Kumar, 1998) as adopted by Cluster-GCN (Chiang et al., 2019) and AliGraph (Zhu et al., 2019), essentially involve extra full-graph computation.

The rest of the paper is organized as follows. Section 2 gives a brief background summary, followed by a survey of related work in Section 3. In Section 4 and Section 5, we establish the connection between GNNs and local clustering, and then describe our LCGNN framework. Section 6 presents the experimental results and ablation study. Finally, we conclude this work in Section 7.
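To make the small-world effect concrete, the following sketch measures what fraction of a graph is covered within k hops of a source node via breadth-first search. The ring-with-shortcuts graph here is a toy stand-in chosen for illustration, not the actual Johns Hopkins data; `k_hop_coverage` and the adjacency structure are our own hypothetical names.

```python
from collections import deque

def k_hop_coverage(adj, source, k):
    """Fraction of all nodes reachable from `source` within k hops (BFS)."""
    visited = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for nbr in adj[node]:
            if nbr not in visited:
                visited.add(nbr)
                frontier.append((nbr, depth + 1))
    return len(visited) / len(adj)

# Toy small-world-like graph: a 10-node ring plus a few shortcut edges at node 0.
adj = {i: {(i - 1) % 10, (i + 1) % 10} for i in range(10)}
adj[0] |= {3, 5, 7}       # hub shortcuts
for j in (3, 5, 7):
    adj[j].add(0)         # keep the graph undirected

print(k_hop_coverage(adj, 0, 2))  # → 1.0: two hops already cover the whole graph
```

Even a few shortcut edges let a 2-hop neighborhood swallow the entire toy graph, mirroring the 74.5% coverage observed on the real network.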

2. BACKGROUND

In this section, we provide the necessary background on graphs, graph convolutional networks (GCN), (lazy) random walks on graphs, and graph conductance.

Graph Notations

The graph G = (V, E, A) consists of |V| = n nodes and |E| = m edges. A ∈ R_+^{n×n} is the adjacency matrix, where a nonzero entry A(i, j) denotes an edge between nodes i and j with edge weight A(i, j). In this work, we assume the input graph is undirected and unweighted; our analysis generalizes easily to the weighted case. For an undirected graph, the degree matrix D = diag(d(1), ..., d(n)) is a diagonal matrix where d(i) = Σ_j A(i, j) is the degree of node i. Moreover, each node in G is associated with an F-dimensional feature vector, denoted by x_i ∈ R^F. The entire feature matrix X ∈ R^{n×F} is the concatenation of the node feature vectors. Two matrices play important roles in the design and analysis of GCN (Kipf & Welling, 2017): the normalized graph Laplacian L = D^{-1/2} A D^{-1/2} and the random-walk transition probability matrix P = A D^{-1}. Note that the entry P(i, j) indicates the probability that the random walk goes from node j to node i.

Graph Convolutional Networks (GCN)

GCN (Kipf & Welling, 2017) initializes the node representation as the input feature matrix, H^{(0)} ← X, and iteratively applies non-linear transformation and graph propagation to the node representation: H^{(k)} ← ReLU(L H^{(k-1)} W^{(k)}), where left-multiplying H^{(k-1)} by the normalized graph Laplacian L acts as the graph propagation, and right-multiplying H^{(k-1)} by W^{(k)} together with the ReLU (Glorot et al., 2011) activation acts as the non-linear transformation. For the node classification task, a K-layer GCN predicts the node labels Y with a softmax layer applied to the final representation H^{(K)}.
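The matrices above are straightforward to assemble. The following minimal sketch, using a toy 4-node graph and random features assumed purely for illustration, builds D, L = D^{-1/2} A D^{-1/2}, and P = A D^{-1}, and applies a single GCN layer H ← ReLU(L H W):

```python
import numpy as np

# Adjacency matrix of a toy undirected, unweighted 4-node graph
# (any symmetric 0/1 matrix without isolated nodes works).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

d = A.sum(axis=1)                    # degrees d(i) = sum_j A(i, j)
D_inv_sqrt = np.diag(d ** -0.5)
L = D_inv_sqrt @ A @ D_inv_sqrt      # L = D^{-1/2} A D^{-1/2}
P = A @ np.diag(1.0 / d)             # P = A D^{-1}; column j sums to 1,
                                     # so P(i, j) is Pr[walk moves j -> i]

# One GCN layer: H <- ReLU(L H W), with random features and weights.
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 5))          # X: 4 nodes, F = 5 features
W = rng.normal(size=(5, 3))          # W^{(1)}: F x hidden
H1 = np.maximum(L @ H @ W, 0.0)      # ReLU
print(H1.shape)                      # (4, 3)
```

Note that L is symmetric while P is column-stochastic; the two are similar matrices (P = D^{1/2} L D^{-1/2}), which is what ties GCN propagation to random walks in the analysis that follows.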



Figure 1: Motivating examples from the Johns Hopkins graph. (a) The 2-hop neighbors of node 1 cover 74.5% of the graph. (b) A local cluster around node 1.

