LOCAL CLUSTERING GRAPH NEURAL NETWORKS

Abstract

Graph Neural Networks (GNNs), which benefit various real-world problems and applications, have emerged as a powerful technique for learning graph representations. The depth of a GNN model, denoted by K, restricts the receptive field of a node to its K-hop neighbors and plays a subtle role in the performance of GNNs. Recent works demonstrate how different choices of K produce a trade-off between increasing representation capacity and avoiding over-smoothing. We establish a theoretical connection between GNNs and local clustering, showing that short random walks in GNNs are likely to be stuck at a local cluster. Based on this theoretical analysis, we propose Local Clustering Graph Neural Networks (LCGNN), a GNN learning paradigm that utilizes local clustering to efficiently search for small but compact subgraphs for GNN training and inference. Compared to full-batch GNNs, sampling-based GNNs, and graph partition-based GNNs, LCGNN performs comparably or even better, achieving state-of-the-art results on four Open Graph Benchmark (OGB) datasets. The locality of LCGNN allows it to scale to graphs with 100M nodes and 1B edges on a single GPU.

1. INTRODUCTION

The recent emergence of Graph Neural Networks (GNNs), exemplified by models like ChebyNet (Defferrard et al., 2016), GCN (Kipf & Welling, 2017), GraphSAGE (Hamilton et al., 2017), GAT (Veličković et al., 2018), and GIN (Xu et al., 2019), has drastically reshaped the landscape of graph learning research. These methods generalize traditional deep learning algorithms to model graph-structured data by combining graph propagation and neural networks. Despite their conceptual simplicity, GNNs have established new state-of-the-art results in various graph learning tasks, such as node classification, link prediction, and graph classification (Hu et al., 2020; Dwivedi et al., 2020), and have also served as key contributors to many real-world applications, such as recommendation systems (Ying et al., 2018), smart transportation (Luo et al., 2020), visual question answering (Teney et al., 2017), and molecular de-novo design (You et al., 2018).

With the growth of real-world social and information networks (Leskovec et al., 2005), there is an urgent need to scale GNNs to massive graphs. For example, the recommendation systems at Alibaba (Zhu et al., 2019) and Pinterest (Ying et al., 2018) require training and inference of GNNs on graphs with billions of edges. Building such large-scale GNNs, however, is a notoriously expensive process. For instance, the GNN models at Pinterest are trained on a 500GB machine with 16 Tesla K80 GPUs and served on a Hadoop cluster with 378 d2.8xlarge Amazon AWS machines. Although one may think model parameters are the main contributors to the huge resource consumption of GNNs, previous work (Ma et al., 2019) suggests that the main bottleneck actually comes from the entanglement between graph propagation and neural networks, which leads to a large and irregular computation graph for GNNs.
This problem is further exacerbated by the small-world phenomenon (Watts & Strogatz, 1998), i.e., even a relatively small number of graph propagation steps can involve full-graph computation. For example, in the Facebook college graph of Johns Hopkins (Traud et al., 2012), the 2-hop neighbors of node 1, as shown in Fig. 1a, cover 74.5% of the whole graph. A common strategy to reduce the overhead of GNNs is to make the graph smaller, but this may bring side effects. For instance, graph sampling techniques, such as neighborhood sampling in GraphSAGE (Hamilton et al., 2017), may lead to the high variance issue (Chen et al., 2018a). Alternatively, graph partition techniques, such as METIS (Karypis & Kumar, 1998), which is adopted by Cluster-GCN (Chiang et al., 2019) and AliGraph (Zhu et al., 2019), essentially involve extra full-graph computation.

The rest of the paper is organized as follows. Section 2 gives a brief background summary, followed by a survey of related work in section 3. In section 4 and section 5, we establish the connection between GNNs and local clustering and then describe our LCGNN framework. Section 6 presents the experimental results and ablation study. Finally, we conclude this work in section 7.

2. BACKGROUND

In this section, we introduce the necessary background on graphs, graph convolutional networks (GCN), (lazy) random walks on graphs, and graph conductance.

Graph Notations

The graph $G = (V, E, A)$ consists of $|V| = n$ nodes and $|E| = m$ edges. $A \in \mathbb{R}_{+}^{n \times n}$ is the adjacency matrix, where a nonzero entry $A(i, j)$ denotes an edge between nodes $i$ and $j$ with edge weight $A(i, j)$. In this work, we assume the input graph is undirected and unweighted; our analysis generalizes easily to the weighted case. For an undirected graph, the degree matrix $D \triangleq \mathrm{diag}(d(1), \cdots, d(n))$ is a diagonal matrix where $d(i) \triangleq \sum_j A(i, j)$ is the degree of node $i$. Moreover, each node in $G$ is associated with an $F$-dimensional feature vector, denoted by $x_i \in \mathbb{R}^{F}$. The entire feature matrix $X \in \mathbb{R}^{n \times F}$ is the concatenation of node feature vectors. Two matrices play important roles in the design and analysis of GCN (Kipf & Welling, 2017): the normalized graph Laplacian $L \triangleq D^{-1/2} A D^{-1/2}$ and the random walk transition probability matrix $P \triangleq A D^{-1}$. Note that the entry $P(i, j)$ indicates the probability that the random walk goes from node $j$ to node $i$.

Graph Convolutional Networks (GCN)

GCN (Kipf & Welling, 2017) initializes the node representation as the input feature matrix, $H^{(0)} \leftarrow X$, and iteratively applies a non-linear transformation and graph propagation to the node representation: $H^{(k)} \leftarrow \mathrm{ReLU}(L H^{(k-1)} W^{(k)})$, where left-multiplying $H^{(k-1)}$ by the normalized graph Laplacian $L$ acts as the graph propagation, and right-multiplying $H^{(k-1)}$ by $W^{(k)}$ together with the ReLU (Glorot et al., 2011) activation acts as the non-linear transformation. For the node classification task, a $K$-layer GCN predicts node labels $Y$ with a softmax classifier: $Y \leftarrow \mathrm{softmax}(L H^{(K-1)} W^{(K)})$. Taking a two-layer GCN ($K = 2$) as a running example, the predicted node labels are $Y \leftarrow \mathrm{softmax}(L \, \mathrm{ReLU}(L H^{(0)} W^{(1)}) W^{(2)})$.

Lazy Random Walk

In practice, many GNNs add (weighted) self-loops to the graphs ($A \leftarrow A + \alpha I$; Kipf & Welling, 2017; Xu et al., 2019) or create residual connections (He et al., 2016) in neural networks (Li et al., 2019; Dehmamy et al., 2019).
Such techniques can be viewed as variants of the lazy random walk on graphs: at every step, with probability 1/2 the walker stays at the current node (through a self-loop), and with probability 1/2 the walker travels to a neighbor. The transition matrix of a lazy random walk is $M \triangleq (I + A D^{-1})/2$. In this work, we mainly consider the lazy random walk because it has several desirable properties and is consistent with the self-loops and residual connections used in practice.

Graph Conductance

For an undirected unweighted graph $G = (V, E, A)$, the graph volume of any non-empty node set $S \subset V$ is defined as $\mathrm{vol}(S) \triangleq \sum_{i \in S} d(i)$, which measures the total number of edges incident to $S$. The conductance of a non-empty node set $S \subset V$ is defined as $\Phi(S) \triangleq \frac{\sum_{i \in S} \sum_{j \in V - S} A(i, j)}{\min(\mathrm{vol}(S), \mathrm{vol}(V - S))}$. Roughly speaking, the conductance $\Phi(S)$ is the ratio of the number of edges between $S$ and $V - S$ to the number of edges incident to $S$, measuring the clusterability of a subset $S$. Low conductance indicates a good cluster because its internal connections are significantly richer than its external connections. Although it is NP-hard to minimize conductance (Šíma & Schaeffer, 2006), there are approximation algorithms with theoretical guarantees that identify clusters near a given node satisfying a target conductance condition, such as Spielman & Teng (2013), Andersen et al. (2006), and Chung (2007).
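The quantities defined in this section can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the experiments: the function names, the toy graph (two triangles joined by one edge), and the random weights are all assumptions introduced here, and the code assumes a graph without isolated nodes.

```python
import numpy as np

def normalized_adjacency(A):
    """L = D^{-1/2} A D^{-1/2}, the propagation matrix used by GCN in this section."""
    d = A.sum(axis=1)                      # degrees d(i) = sum_j A(i, j)
    return A / np.sqrt(np.outer(d, d))

def lazy_walk_matrix(A):
    """M = (I + A D^{-1}) / 2; column j holds the one-step distribution from node j."""
    d = A.sum(axis=0)
    return (np.eye(len(A)) + A / d) / 2    # A / d divides column j by d(j)

def conductance(A, S):
    """Phi(S) = cut(S, V - S) / min(vol(S), vol(V - S))."""
    mask = np.zeros(len(A), dtype=bool)
    mask[list(S)] = True
    d = A.sum(axis=1)
    cut = A[mask][:, ~mask].sum()
    return cut / min(d[mask].sum(), d[~mask].sum())

def gcn_two_layer(A, X, W1, W2):
    """Y = softmax(L ReLU(L X W1) W2), with a row-wise softmax."""
    L = normalized_adjacency(A)
    H1 = np.maximum(L @ X @ W1, 0.0)
    logits = L @ H1 @ W2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy graph (hypothetical): triangles {0, 1, 2} and {3, 4, 5} joined by edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

rng = np.random.default_rng(0)
M = lazy_walk_matrix(A)
phi = conductance(A, {0, 1, 2})            # one cut edge over volume 7
Y = gcn_two_layer(A, rng.normal(size=(6, 4)),
                  rng.normal(size=(4, 8)), rng.normal(size=(8, 3)))
```

On this toy graph the first triangle has one outgoing edge and volume 7, so its conductance is 1/7, which is the kind of low-conductance set the later sections look for.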

3. RELATED WORK

The design of scalable GNNs has attracted wide attention from the machine learning community. We review related work from three perspectives: (1) full-batch GNNs with co-design of systems and algorithms; (2) sampling-based GNNs; (3) graph partition-based GNNs.

Full-batch GNNs

A full-batch GNN takes a whole graph as input for the forward and backward passes. Consequently, its computational cost is proportional to the graph size. Earlier GNN models (Kipf & Welling, 2017; Veličković et al., 2018) were evaluated on relatively small graphs and thus could be trained in a full-batch manner. Scaling full-batch GNNs to large graphs requires the co-design of ML systems and ML algorithms (Jiang et al., 2020; Zhang et al., 2020; Ma et al., 2019). For example, NeuGraph (Ma et al., 2019) runs full-batch GNN models on a graph with 8.6M nodes and 231.6M edges on an eight-P100-GPU server. SGC (Wu et al., 2019) is another attempt at full-batch GNNs. It simplifies GCN by conducting graph propagation and classification separately and efficiently. However, such simplification may sacrifice performance in some downstream tasks.

GNNs based on Graph Sampling

GraphSAGE (Hamilton et al., 2017) first proposed the idea of neighborhood sampling, which was later applied in a real-world recommendation system by PinSAGE (Ying et al., 2018). At each GNN layer, GraphSAGE computes a node representation by first down-sampling the node's neighborhood and then aggregating the sampled neighbors. As a randomized algorithm, neighborhood sampling was further improved by FastGCN (Chen et al., 2018b), Stochastic GCN (Chen et al., 2018a), and Adaptive Sampling (Huang et al., 2018) for variance reduction. A recent sampling-based GNN is GraphSAINT (Zeng et al., 2020), which samples subgraphs (Leskovec & Faloutsos, 2006) and runs a full-batch GNN on the sampled subgraphs.

GNNs based on Graph Partition

Cluster-GCN (Chiang et al., 2019) is the work most closely related to ours.
Cluster-GCN adopts a global graph partition algorithm, METIS (Karypis & Kumar, 1998), to partition the input graph into subgraphs and runs a GNN on each subgraph. A similar idea was also proposed in AliGraph (Zhu et al., 2019). However, global graph partition algorithms involve additional whole-graph computation. Moreover, global graph partition algorithms are vulnerable to dynamic and evolving graphs (Xu et al., 2014; Vaquero et al., 2014), in which nodes and edges are constantly added and removed and which are very common in real-world applications.

4. SHORT RANDOM WALK AS LOCAL CLUSTERING

Most GNNs adopt short random walks to explore a graph. For example, the default 2-layer GCN in Kipf & Welling (2017) can be viewed as enumerating all length-2 paths and aggregating them with a neural network; Hamilton et al. (2017) use a 2-hop neighborhood sampling method, a variant of the 2-hop random walk, to sample neighbors in each GraphSAGE layer. GraphSAINT (Zeng et al., 2020) samples subgraphs by 2-hop random walks and then builds a full-batch GCN on them. SGC (Wu et al., 2019) conducts 2-hop feature propagation and then applies node-wise logistic regression.

We reveal the theoretical connection between short random walks and local clustering. More formally, let $q^{(K)}$ be the $K$-th step lazy random-walk distribution starting from an arbitrary node $u$ according to the transition probability matrix $M$, i.e., $q^{(K)} \leftarrow M^{K} 1_u$. We want to study the probability vector $q^{(K)}$ in terms of $K$, especially when $K$ is small (e.g., $K = 2$). Due to the small-world phenomenon (Watts & Strogatz, 1998), for most social and information networks, $q^{(K)}$ can have $O(n)$ non-zeros even when $K$ is small, e.g., $K = 2$ or $3$. However, the following theorem shows that the probability of a random walk escaping from a local cluster can be bounded by the cluster's conductance:

Theorem 1 (Escaping Mass, Proposition 2.5 in Spielman & Teng (2013)). For all $K \geq 0$ and all $S \subset V$, the probability that any $K$-step lazy random walk starting in $S$ escapes $S$ is at most $K\Phi(S)/2$, i.e., the escaping probability satisfies $q^{(K)}(V - S) \leq K\Phi(S)/2$.

The key point of Theorem 1 is to relate the $K$-th step random-walk probability to graph conductance: for a node $u$, suppose there exists a subset $S$ such that (1) $u \in S$ and (2) $\Phi(S)$ is small (low conductance). Theorem 1 then guarantees that a lazy random walk starting from node $u$ is very likely to be stuck at $S$. This reveals the following facts and potential problems of existing GNNs: (1) For full-batch GNNs, although the receptive field induced by $K$-hop neighbors may cover the whole graph, most probability mass still concentrates around a local cluster (if one exists), and the remaining probabilities (i.e., the escaping mass) are small and bounded. Consequently, the computation cost of full-batch GNNs can be largely reduced. (2) Sampling-based methods can be viewed as a randomized and implicit version of finding a local cluster, but without guarantees on their sample efficiency or variance. These facts encourage us to design local clustering-based GNNs.

A crucial question about the above analysis is the existence of a low-conductance $S$ for every node $u$ (or most nodes in the graph). This is generally not true for arbitrary graphs, e.g., a complete graph. However, evidence from network science and social science agrees with our assumption. For example, (1) many networks of interest in the sciences are found to divide naturally into communities (Girvan & Newman, 2002; Newman, 2006); (2) real-world social networks consist of compact communities with a size scale of around 100 nodes (Leskovec et al., 2009); (3) roughly 150 individuals are the upper limit on the size of a well-functioning human community (Dunbar, 1998).
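The escaping-mass bound can be checked numerically on a small example. The sketch below is illustrative only: the toy graph (two triangles joined by one edge) and all variable names are assumptions made here. It also starts the walk from the degree-proportional distribution over S, which is an assumption of this sketch; on a graph this small, a walk started from a single boundary node can escape faster than the bound suggests.

```python
import numpy as np

# Toy graph (hypothetical): triangles {0, 1, 2} and {3, 4, 5} joined by edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

d = A.sum(axis=0)
M = (np.eye(n) + A / d) / 2                 # lazy walk M = (I + A D^{-1}) / 2

S = np.array([0, 1, 2])                     # candidate local cluster
rest = np.array([3, 4, 5])                  # V - S
phi = A[np.ix_(S, rest)].sum() / min(d[S].sum(), d[rest].sum())

# Start distribution: proportional to degree inside S (assumption of this sketch).
q = np.zeros(n)
q[S] = d[S] / d[S].sum()

escaped, bounds = [], []
for K in range(6):
    escaped.append(q[rest].sum())           # q^{(K)}(V - S)
    bounds.append(K * phi / 2)              # K * Phi(S) / 2
    q = M @ q                               # one more lazy step

for K, (e, b) in enumerate(zip(escaped, bounds)):
    print(f"K={K}: escaping mass {e:.4f} <= bound {b:.4f}")
```

With Φ(S) = 1/7, the escaping mass after one step equals the bound 1/14, and it stays below KΦ(S)/2 for the later steps shown.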

5. LOCAL CLUSTERING GRAPH NEURAL NETWORKS (LCGNN)

The analysis in section 4 lays the theoretical foundation for the design of our LCGNN framework. In this section, we formally introduce LCGNN. Roughly speaking, our framework consists of two steps. In the first step, for each node $u \in V$, we run local clustering to produce a local cluster $S_u$ surrounding it. In the second step, we feed the subgraph induced by $S_u$ to a GNN encoder.

5.1. LOCAL CLUSTERING

Local clustering algorithms find a small cluster near given seed(s). Unlike global graph partition methods, which involve full-graph computation, local clustering explores the graph locally, and its running time depends only on the size of the output cluster. Over the past two decades, many local clustering algorithms have been developed (Spielman & Teng, 2013; Andersen et al., 2006; Chung, 2007; Li et al., 2015; Kloster & Gleich, 2014; Kloumann & Kleinberg, 2014; Whang et al., 2013; Yin et al., 2017; Fountoulakis et al., 2019). In this work, we mainly focus on PPR-Nibble (Andersen et al., 2006), one of the most popular spectral-based local clustering algorithms among the above methods.

As its name indicates, PPR-Nibble adopts the personalized PageRank (PPR) vector for local clustering. The PPR vector $p_u$ of a node $u$ is given by the equation $p_u = \alpha 1_u + (1 - \alpha) P p_u$, which is the stationary distribution of the following random walk: at each step of the random walk, with probability $\alpha$ the walker teleports back to the node $u$, and with probability $1 - \alpha$ the walker performs a normal random walk. However, the PPR vector $p_u$ is a dense vector and thus computationally expensive. Andersen et al. (2006) developed an efficient algorithm, named Approximate-PPR, to compute a sparse approximation $\tilde{p}_u$ such that $|p_u(v)/d(v) - \tilde{p}_u(v)/d(v)| \leq \epsilon$ for each node $v$.
As shown in Algorithm 1, the key idea is to gradually push probabilities from a residual vector $r$ to the approximate PPR vector $\tilde{p}_u$ (Lines 5-7 of Algorithm 1). After computing the approximate PPR vector $\tilde{p}_u$, a sweep procedure is adopted to extract a cluster $S$ with small conductance $\Phi(S)$. More formally, the sweep procedure first sorts nodes according to $D^{-1} \tilde{p}_u$ in descending order (Line 4 of Algorithm 2), then evaluates the conductance of each node prefix in the sorted list and outputs the one with the smallest conductance (Line 5 of Algorithm 2).

Algorithm 1: Approximate-PPR.
  1  Input: Graph $G = (V, E, A)$, seed node $u$, teleportation parameter $\alpha$, tolerance $\epsilon$;
  2  Output: An $\epsilon$-approximate PPR vector $\tilde{p}_u$;
  3  $\tilde{p}_u \leftarrow 0$; $r \leftarrow 1_u$;
  4  while $r(v)/d(v) \geq \epsilon$ for some $v \in V$ do
  5      $\rho \leftarrow r(v) - \frac{\epsilon}{2} d(v)$;
  6      $\tilde{p}_u(v) \leftarrow \tilde{p}_u(v) + \alpha\rho$;
  7      $r(v) \leftarrow \frac{\epsilon}{2} d(v)$;
  8      for each $(v, w) \in E$ do $r(w) \leftarrow r(w) + \frac{A(v, w)}{d(v)} (1 - \alpha)\rho$;
  9  return $\tilde{p}_u$;

Algorithm 2: PPR-Nibble.
  1  Input: Graph $G = (V, E, A)$, seed node $u$, teleportation parameter $\alpha$, tolerance $\epsilon$;
  2  Output: A local cluster $S \subset V$;
  3  $\tilde{p}_u \leftarrow$ Approximate-PPR$(G, u, \alpha, \epsilon)$;
  4  $\sigma_i \leftarrow$ the node with the $i$-th largest entry of $D^{-1} \tilde{p}_u$;
  5  return $S \leftarrow \arg\min_{S_\ell} \Phi(S_\ell)$, where $S_\ell = \{\sigma_1, \cdots, \sigma_\ell\}$;

Note that PPR-Nibble is a local algorithm (Spielman & Teng, 2013) with theoretical guarantees: (1) the input to the algorithm is a starting node $u$; (2) at each step of Approximate-PPR in Algorithm 1, it only examines nodes connected to those it has seen before. The following theorems characterize the complexity of Approximate-PPR and the error bound of PPR-Nibble, respectively.

Theorem 2 (Lemma 2 in Andersen et al. (2006)). Algorithm 1 runs in time $O\left(\frac{1}{\epsilon\alpha}\right)$, and the number of non-zeros in $\tilde{p}_u$ satisfies $\mathrm{nnz}(\tilde{p}_u) \leq \frac{1}{\epsilon\alpha}$.

Theorem 3 (Theorem 1 in Zhu et al. (2013); Theorem 4.3 in Yin et al. (2017)). Let $T \subset V$ be some unknown targeted cluster that we are trying to retrieve from an unweighted graph. Let $\eta$ be the inverse mixing time of the random walk on the subgraph induced by $T$. Then there exists $T^{g} \subseteq T$ with $\mathrm{vol}(T^{g}) \geq \mathrm{vol}(T)/2$, such that for any seed $u \in T^{g}$, Algorithm 2 with $\alpha = \Theta(\eta)$ and $\epsilon \in \left[\frac{1}{10\,\mathrm{vol}(T)}, \frac{1}{5\,\mathrm{vol}(T)}\right]$ outputs a set $S$ with $\Phi(S) \leq O\left(\min\left\{\sqrt{\Phi(T)}, \Phi(T)/\sqrt{\eta}\right\}\right)$.
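The push and sweep procedures can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the software used in the experiments: the adjacency-list representation, the function names, the queue-based order in which nodes are pushed, the `max_size` cap, and the toy graph are all choices made here. The total volume is passed in as a precomputed number so that the sweep itself touches only the nodes reached by the push.

```python
from collections import deque

def approximate_ppr(adj, seed, alpha, eps):
    """Push-based approximation of the PPR vector of `seed` (cf. Algorithm 1).

    adj maps each node to a list of neighbours (undirected, unweighted,
    no self-loops, no isolated nodes); d(v) = len(adj[v]).
    """
    p, r = {}, {seed: 1.0}
    queue = deque([seed])                   # nodes that may satisfy r(v)/d(v) >= eps
    while queue:
        v = queue.popleft()
        dv = len(adj[v])
        if r.get(v, 0.0) < eps * dv:
            continue                        # residual too small, nothing to push
        rho = r[v] - 0.5 * eps * dv         # mass removed from the residual of v
        p[v] = p.get(v, 0.0) + alpha * rho
        r[v] = 0.5 * eps * dv
        for w in adj[v]:
            old = r.get(w, 0.0)
            r[w] = old + (1.0 - alpha) * rho / dv
            if old < eps * len(adj[w]) <= r[w]:
                queue.append(w)             # w just crossed the push threshold
    return p

def sweep_cut(adj, p, total_vol, max_size=None):
    """Return the prefix of nodes (sorted by p(v)/d(v)) with the lowest conductance."""
    order = sorted(p, key=lambda v: p[v] / len(adj[v]), reverse=True)
    if max_size is not None:
        order = order[:max_size]
    in_set, vol, cut = set(), 0, 0
    best_phi, best_size = float("inf"), 0
    for size, v in enumerate(order, start=1):
        inside = sum(1 for w in adj[v] if w in in_set)
        vol += len(adj[v])
        cut += len(adj[v]) - 2 * inside     # edges to the set stop being cut edges
        in_set.add(v)
        denom = min(vol, total_vol - vol)
        if denom > 0 and cut / denom < best_phi:
            best_phi, best_size = cut / denom, size
    return order[:best_size], best_phi

def ppr_nibble(adj, seed, alpha, eps, total_vol, max_size=None):
    """Approximate-PPR followed by a sweep (cf. Algorithm 2)."""
    return sweep_cut(adj, approximate_ppr(adj, seed, alpha, eps), total_vol, max_size)

# Toy graph (hypothetical): triangles {0, 1, 2} and {3, 4, 5} joined by edge (2, 3).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
total_vol = sum(len(nbrs) for nbrs in adj.values())
cluster, phi = ppr_nibble(adj, seed=0, alpha=0.15, eps=1e-4, total_vol=total_vol)
```

Seeded at node 0, the sweep on this toy graph selects the first triangle, whose conductance is 1/7. The `max_size` argument restricts the sweep to the first few nodes of the sorted list, which is one way to cap the size of the returned cluster.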

5.2. LOCAL CLUSTER ENCODER

For each node $u \in V$, PPR-Nibble in Algorithm 2 produces a local cluster $S_u \subset V$ with $|S_u| \leq \frac{1}{\epsilon\alpha}$. We denote by $G_u$ the subgraph induced by the cluster $S_u$, which is then encoded into a hidden representation via an encoder (usually a GNN model): $h_u \leftarrow \mathrm{ENCODER}(G_u)$. The encoded hidden representation can be further used for various graph learning tasks. For the node classification task, we predict the label of node $u$ with a softmax classifier: $y_u \leftarrow \mathrm{softmax}(W h_u + b)$. For the link prediction task, we measure the likelihood of a link $e = (u, v)$ by first multiplying $h_u$ and $h_v$ element-wise and then feeding the result to an MLP, i.e., $y_e \leftarrow \mathrm{MLP}(h_u \odot h_v)$. The choice of the encoder is flexible. In this work, we mainly examine four candidate encoders:

GCN/GAT/GraphSAGE Encoders

Our first candidate encoders are traditional GNNs such as GCN, GAT, and GraphSAGE. We denote them as LCGNN-GCN/-GAT/-SAGE, respectively.

Transformer Encoder

We also examine a more complex and powerful encoder based on Transformer (Vaswani et al., 2017). Our hypothesis is that low-conductance subgraphs extracted by local clustering have such rich internal connections that we can almost treat them as complete graphs. Thus we adopt the Transformer encoder, whose attention mechanism allows dense interaction within a subgraph. We initialize the positional embedding in Transformer as the pre-trained Node2vec embedding on the input graph. We denote the Transformer-based encoder as LCGNN-Transformer.
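The two prediction heads can be illustrated with a small NumPy sketch. Everything specific in it is an assumption made for illustration: the encoder is a single GCN-style layer with self-loops followed by mean pooling (the text leaves the readout open), the MLP has one hidden layer with a sigmoid output, and the graph, clusters, and random weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_cluster(A, X, cluster, W):
    """Stand-in for ENCODER(G_u): one GCN-style layer on the induced subgraph,
    then mean pooling over the cluster (the readout is an assumption)."""
    idx = np.asarray(cluster)
    A_sub = A[np.ix_(idx, idx)] + np.eye(len(idx))   # induced subgraph plus self-loops
    d = A_sub.sum(axis=1)
    L_sub = A_sub / np.sqrt(np.outer(d, d))          # D^{-1/2} A D^{-1/2}
    H = np.maximum(L_sub @ X[idx] @ W, 0.0)          # ReLU(L X W)
    return H.mean(axis=0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def link_score(h_u, h_v, W1, w2):
    """y_e = MLP(h_u * h_v) with one hidden ReLU layer and a sigmoid output."""
    hidden = np.maximum(W1 @ (h_u * h_v), 0.0)       # element-wise product, then MLP
    return 1.0 / (1.0 + np.exp(-(w2 @ hidden)))

# Toy graph (hypothetical): triangles {0, 1, 2} and {3, 4, 5} joined by edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

X = rng.normal(size=(6, 4))                          # node features
W = 0.1 * rng.normal(size=(4, 8))                    # encoder weights
h_u = encode_cluster(A, X, [0, 1, 2], W)             # local cluster of node u = 0
h_v = encode_cluster(A, X, [3, 4, 5], W)             # local cluster of node v = 3

W_cls, b = 0.1 * rng.normal(size=(3, 8)), np.zeros(3)
y_u = softmax(W_cls @ h_u + b)                       # node classification head
y_e = link_score(h_u, h_v, 0.1 * rng.normal(size=(8, 8)), 0.1 * rng.normal(size=8))
```

Because each cluster is encoded independently, the per-node cost depends on the cluster size rather than on the size of the whole graph.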

6. EXPERIMENTS

In this section, we conduct experiments on two major tasks of graph learning: node classification and link prediction. For each task, we use datasets from the Open Graph Benchmark (OGB) (Hu et al., 2020), which present significant challenges of scalability to large-scale graphs and out-of-distribution generalization. The dataset statistics are summarized in Table 1. Another graph task, graph classification, is not explored in our experiments because it is unnecessary to utilize local clustering for small graphs with only hundreds of nodes. The average and standard deviation of test performance under 10 different seeds are reported in all experiments. For the local clustering algorithm, we use the software provided by Fountoulakis et al. (2018). We set α = 0.15 in Approximate-PPR and constrain the maximum cluster size to be 64 or 128 in the PPR-Nibble step, i.e., the sweep procedure only examines the prefix of the first 64 (128) nodes in Algorithm 2. The detailed hyper-parameter configuration of LCGNN can be found in Appendix A.2.

6.1. NODE CLASSIFICATION

Node classification datasets include products, arxiv, and papers100M at different scales. We train LCGNN on a single GPU on all three datasets. Due to limited space, the results on the arxiv dataset are reported in Appendix A.1 because relatively small datasets are not our target scenario.

Baselines. The OGB team provides MLP, Node2vec (Grover & Leskovec, 2016), GCN (Kipf & Welling, 2017), and GraphSAGE (Hamilton et al., 2017) as the common baselines for the products and arxiv datasets. For the large-scale papers100M dataset, the OGB team only provides MLP, Node2vec, and SGC (Wu et al., 2019). Other teams and researchers have also contributed numerous models to the leaderboards. For the products dataset, three GAT-based models with different mini-batch training techniques are also reported. DeeperGCN (Li et al., 2020) explores how to design and train deep GCNs. UniMP (Shi et al., 2020) is a very recent model (see footnote 1) which combines feature propagation and label propagation.

Results. The results on the products and papers100M datasets are listed in Table 2 and Table 3, respectively. On the papers100M dataset, SGC (Wu et al., 2019) is the only reported GNN model that can handle this large-scale dataset with more than 1 billion edges. SGC gets better performance than Node2vec and MLP due to the expressive power of (simplified) graph convolution. Compared with SGC, LCGNN is trained in a semi-supervised manner and can learn feature transformations during training. Our proposed LCGNN outperforms SGC by 2.73% absolute, which shows the stronger expressiveness of our model. On the products dataset, our LCGNN (rank 2 in Table 2) gets comparable results with other state-of-the-art GNN models. The arxiv dataset is relatively small, and well-tuned full-batch GNNs achieve the best results. Our LCGNN gets comparable results to full-batch GNNs and achieves better results than sampling-based GNNs (such as GAT with neighbor sampling), as shown in Table 7 in Appendix A.1.

Ablation Study. Table 2 suggests that LCGNN-GCN and LCGNN-SAGE surpass the corresponding full-batch GCN and GraphSAGE. Furthermore, LCGNN-SAGE and LCGNN-GAT perform competitively or even better on the products dataset compared to the corresponding GraphSAGE and GAT models with other training and sampling techniques, including Neighborhood Sampling (Hamilton et al., 2017), ClusterGCN (Chiang et al., 2019), and GraphSAINT (Zeng et al., 2020).

6.2. LINK PREDICTION

We evaluate LCGNN on three link prediction tasks: ppa, collab, and citation. We use a single GPU to train on the collab dataset and multiple GPUs to train on the ppa and citation datasets (5 GPUs for ppa and 4 GPUs for citation).

Baselines. The OGB team provides Matrix Factorization, Node2vec (Grover & Leskovec, 2016), GCN (Kipf & Welling, 2017), and GraphSAGE (Hamilton et al., 2017) as the common baselines. For the citation dataset, GCN/SAGE-based models with three different mini-batch training techniques are also provided by the OGB team. Other researchers have also contributed some state-of-the-art models to the leaderboards. DeepWalk (Perozzi et al., 2014) was submitted by other researchers using DGL (Wang et al., 2019). LRGA+GCN (Puny et al., 2020) is a recently proposed model which aligns the 2-folklore Weisfeiler-Lehman algorithm to improve the generalization of GNNs.

Results. The results on the ppa, collab, and citation datasets are listed in Tables 4, 5, and 6, respectively. We compare LCGNN with a recently developed model, LRGA+GCN (Puny et al., 2020), as well as traditional baselines. On all three link prediction datasets, our proposed LCGNN achieves the best results over state-of-the-art models, with 0.68% to 2.64% absolute improvements, showing the ability of local clustering and the Transformer encoder to boost link-prediction performance.

Ablation Study. We report the results on the collab dataset when the Transformer encoder is replaced with GCN and GraphSAGE encoders. Compared with full-batch GCN and GraphSAGE, our LCGNN-GCN and LCGNN-SAGE obtain much better performance, which suggests the significance of graph local clustering. LCGNN-Transformer gets better results than LCGNN-GCN and LCGNN-SAGE due to the powerful expressiveness of the Transformer encoder.
Overall, LCGNN not only achieves four first places (ogbn-papers100M, ogbl-ppa, ogbl-collab, and ogbl-citation) and one second place (ogbn-products) on OGB datasets, but also improves the scalability of GNN models for large-scale graphs.

7. CONCLUSION

In this work, we present Local Clustering Graph Neural Networks (LCGNN), a lightweight, effective, and scalable GNN framework with theoretical guarantees. LCGNN combines local clustering algorithms and graph neural network models to achieve state-of-the-art performance on four Open Graph Benchmark (OGB) datasets. By incorporating local clustering algorithms, LCGNN can run on compact and small subgraphs without conducting full-graph computation, scaling to graphs with 100 million nodes and 1 billion edges on a single GPU. In the future, it would be interesting to try more advanced local clustering algorithms other than PPR-Nibble. Applying LCGNN in real-world applications, such as recommendation systems, is also a promising direction.

Footnote 1: UniMP was submitted to the OGB leaderboard on Sep 8, 2020, within one month before the ICLR 2021 deadline.

(a) The 2-hop neighbors of node 1 cover 74.5% of the graph. (b) A local cluster around node 1.

Figure 1: Motivating examples from the Johns Hopkins graph.

Statistics of datasets for node classification and link prediction tasks.

ogbn-products leaderboard (collected on Oct. 1, 2020). Due to limited space, we only list top results. * denotes results that were submitted within one month before the ICLR 2021 deadline.

ogbn-papers100M leaderboard (collected on Oct. 1, 2020)



ogbl-collab leaderboard (collected on Oct. 1, 2020)

ogbl-citation leaderboard (collected on Oct. 1, 2020)

A APPENDIX

A.1 EXPERIMENTAL RESULTS

We report the results on the ogbn-arxiv dataset in Table 7. Some models are only evaluated on the smallest dataset (i.e., ogbn-arxiv), including GraphZoom (Deng et al., 2020), GaAN (Zhang et al., 2018), DAGNN (Liu et al., 2020), JKNet (Xu et al., 2018), and GCNII (Chen et al., 2020). Most of these models cannot handle ogbn-products with millions of nodes. Our LCGNN models get comparable results with most state-of-the-art GNN models, with and without mini-batch training techniques.

We run our experiments on a single machine with Intel Xeon CPUs (Platinum 8163 @ 2.50GHz), 330GB memory, and 8 NVIDIA Tesla V100 GPUs (16GB). The code is written in Python 3.6. We use PyTorch 1.5.1 on CUDA 10.1 to train our models.

A.2.2 HYPERPARAMETER CONFIGURATION

For our models, the optimizer used in our experiments is AdamW (Loshchilov & Hutter, 2019) with β1 = 0.9, β2 = 0.999, and eps = 10^-8. For LCGNN-GCN/SAGE/GAT, we use this optimizer with no warmup steps. For LCGNN-Transformer, we use a learning rate scheduler with warmup steps, similar to that of Transformer (Vaswani et al., 2017) except for an extra hyper-parameter lr_scale.

We use the wandb (Biewald, 2020) tool to help track experiments and search the hyper-parameters. The final hyper-parameters used for our models are listed in Table 8 and Table 9.
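As a hedged illustration of a warmup scheduler of this kind, the sketch below implements the schedule of Vaswani et al. (2017) and multiplies it by lr_scale. The multiplicative placement of lr_scale, the function name `warmup_lr`, and the example values of `d_model` and `warmup_steps` are assumptions of this sketch and are not taken from the experimental configuration.

```python
def warmup_lr(step, d_model, warmup_steps, lr_scale=1.0):
    """Transformer-style schedule scaled by lr_scale (the scaling is an assumption).

    The rate grows linearly for `warmup_steps` steps and then decays
    proportionally to the inverse square root of the step number.
    """
    step = max(step, 1)                       # avoid 0 ** -0.5 at the first step
    return lr_scale * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Hypothetical values, for illustration only.
schedule = [warmup_lr(s, d_model=512, warmup_steps=4000, lr_scale=0.5)
            for s in (1, 1000, 4000, 16000)]
```

In a PyTorch training loop, a function of this shape could be wrapped in `torch.optim.lr_scheduler.LambdaLR` on top of AdamW with a base learning rate of 1.0, so that the returned value is used directly as the learning rate.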

