LAYER-NEIGHBOR SAMPLING: DEFUSING NEIGHBORHOOD EXPLOSION IN GNNS

Abstract

Graph Neural Networks have recently received significant attention; however, training them at a large scale remains a challenge. Minibatch training coupled with sampling is used to alleviate this challenge. Even so, existing approaches either suffer from the neighborhood explosion phenomenon or have poor performance. To deal with these issues, we propose a new sampling algorithm called LAyer-neighBOR sampling (LABOR). It is designed to be a direct replacement for Neighbor Sampling with the same fanout hyperparameter, while sampling up to 7× fewer vertices without sacrificing quality. By design, the variance of the estimator of each vertex matches that of Neighbor Sampling from the point of view of a single vertex. Moreover, under the same vertex sampling budget constraints, LABOR converges faster than existing layer sampling approaches and can use up to 112× larger batch sizes compared to Neighbor Sampling.

1. INTRODUCTION

Graph Neural Networks (GNNs) (Hamilton et al., 2017; Kipf & Welling, 2017) have become the de facto models for representation learning on graph-structured data, and have started to be deployed in production systems (Ying et al., 2018; Niu et al., 2020). These models iteratively update the node embeddings by passing messages along the direction of the edges in the given graph, with nonlinearities between different layers. With l layers, the computed node embeddings contain information from the l-hop neighborhood of the seed vertex. In the production setting, GNN models need to be trained on billion-scale graphs (Ching et al., 2015; Ying et al., 2018). Training these models takes hours to days even on distributed systems (Zheng et al., 2022b;a). As with Deep Neural Networks (DNNs) in general, it is more efficient to use minibatch training (Bertsekas, 1994) on GNNs, even though it is trickier in this case. The node embeddings in GNNs depend recursively on their neighbors' embeddings, so when there are l layers, this dependency spans the l-hop neighborhood of the node. Real-world graphs usually have a very small diameter, and if l is large, the l-hop neighborhood may very well span the entire graph, a problem known as the Neighborhood Explosion Phenomenon (NEP) (Zeng et al., 2020). To address this, researchers proposed sampling a subgraph of the l-hop neighborhood of the nodes in the batch. There are three main approaches: node-based, layer-based and subgraph-based methods. Node-based sampling methods (Hamilton et al., 2017; Chen et al., 2018a; Liu et al., 2020; Zhang et al., 2021) sample independently and recursively for each node. It was noticed that node-based methods sample subgraphs that are too shallow, i.e., with a low ratio of edges to nodes.
Thus layer-based sampling methods were proposed (Chen et al., 2018b; Zou et al., 2019; Huang et al., 2018; Dong et al., 2021), where the sampling for the whole layer is done collectively. Subgraph sampling methods (Chiang et al., 2019; Zeng et al., 2020; Hu et al., 2020b; Zeng et al., 2021), on the other hand, do not use the recursive layer-by-layer sampling scheme of the node- and layer-based methods and instead tend to use the same subgraph for all of the layers. Some of these sampling methods take the magnitudes of embeddings into account (Liu et al., 2020; Zhang et al., 2021; Huang et al., 2018), while others, such as Chen et al. (2018a) and Cong et al. (2021), cache the historical embeddings to reduce the variance of the computed approximate embeddings. There are also methods that sample from a vertex cache filled with popular vertices (Dong et al., 2021). Most of these approaches are orthogonal to each other and can be incorporated into other sampling algorithms.

Node-based sampling methods suffer the most from the NEP, but they guarantee a good approximation for each embedding by ensuring each vertex gets k neighbors, where k is the only hyperparameter of the sampling algorithm. Layer-based sampling methods do not suffer as much from the NEP because the number of vertices sampled is a hyperparameter, but they cannot guarantee that each vertex's approximation is good enough, and their hyperparameters are hard to reason about: the number of nodes to sample at each layer depends highly on the graph structure (as the numbers in Table 2 show). Subgraph sampling methods usually have more bias than their node- and layer-based counterparts. Hence, in this paper, we focus on the node- and layer-based sampling methods and combine their advantages. The major contributions of this work can be listed as follows:

• We propose a new sampling algorithm called LABOR, combining the advantages of neighbor and layer sampling approaches using Poisson sampling. LABOR correlates the sampling procedures of the given set of seed nodes so that the vertices sampled from different seeds have a lot of overlap, resulting in up to a 7× reduction in computation, memory and communication. Furthermore, LABOR has the same hyperparameters as neighbor sampling, so it can be used as a drop-in replacement, and it can speed up training by up to 2.6×.
• We experimentally verify our findings and show that our proposed sampling algorithm LABOR outperforms both neighbor sampling and layer sampling approaches. LABOR can enjoy a batch size up to 112× larger than NS while sampling the same number of vertices.
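The correlation idea behind LABOR can be illustrated with a minimal sketch of the uniform-weight case (an illustrative simplification, not the authors' implementation; `labor0_sample` and its arguments are hypothetical names): every candidate vertex t draws a single uniform random number r_t that is shared across all seeds, and seed s keeps neighbor t whenever r_t ≤ k/d_s. Each seed still samples k neighbors in expectation, but seeds with common neighbors now tend to pick the same vertices, shrinking the union of sampled vertices.

```python
import random
from collections import defaultdict

def labor0_sample(neighbors, seeds, k, rng=random.Random(0)):
    """Correlated Poisson sampling sketch (uniform-weight case).

    neighbors: dict mapping each vertex s to its in-neighbor list N(s).
    Returns the sampled neighbor subset for each seed. A single random
    number r_t per candidate vertex t is shared across all seeds, so
    overlapping neighborhoods yield overlapping samples.
    """
    r = defaultdict(rng.random)  # lazily draws one r_t per vertex t
    sampled = {}
    for s in seeds:
        d_s = len(neighbors[s])
        if d_s <= k:              # small neighborhood: use N(s) exactly
            sampled[s] = list(neighbors[s])
        else:                     # keep t with probability k / d_s
            sampled[s] = [t for t in neighbors[s] if r[t] <= k / d_s]
    return sampled

# Two seeds sharing the same neighborhood sample identical subsets:
nbrs = {0: list(range(100)), 1: list(range(100))}
out = labor0_sample(nbrs, seeds=[0, 1], k=8)
assert out[0] == out[1]  # perfectly correlated samples
```

Independent neighbor sampling would instead draw fresh randomness per seed, making the two sampled subsets overlap only by chance.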

2. BACKGROUND

Graph Neural Networks: Given a directed graph G = (V, E), where V and E ⊆ V × V are the vertex and edge sets respectively, (t → s) ∈ E denotes an edge from a source vertex t ∈ V to a destination vertex s ∈ V, and A_{ts} denotes the corresponding edge weight if provided. If we have a batch of seed vertices S ⊂ V, let us define the l-hop neighborhood N^l(S) for the incoming edges as follows:

N(s) = \{t \mid (t \to s) \in E\}, \quad N^1(S) = N(S) = \bigcup_{s \in S} N(s), \quad N^l(S) = N(N^{l-1}(S))

Let us also define the degree d_s of vertex s as d_s = |N(s)|. To simplify the discussion, let us assume uniform edge weights, A_{ts} = 1, ∀(t → s) ∈ E. Then, our goal is to estimate the following for each vertex s ∈ S, where H^{(l-1)}_t is defined as the embedding of vertex t at layer l-1, W^{(l-1)} is the trainable weight matrix at layer l-1, and σ is the nonlinear activation function (Hamilton et al., 2017):

Z^{(l)}_s = \frac{1}{d_s} \sum_{t \to s} H^{(l-1)}_t W^{(l-1)}, \quad H^{(l)}_s = \sigma(Z^{(l)}_s)

Exact Stochastic Gradient Descent: If we have a node prediction task, V_t ⊆ V is the set of training vertices, y_s, s ∈ V_t are the labels of the prediction task, and ℓ is the loss function for the prediction task, then our goal is to minimize the following loss function:

\frac{1}{|V_t|} \sum_{s \in V_t} \ell(y_s, Z^{(l)}_s)

Replacing V_t in the loss function with S ⊂ V_t for each iteration of gradient descent, we get stochastic gradient descent for GNNs. However, with l layers, the computation dependency is on N^l(S), which reaches a large portion of real-world graphs, i.e., |N^l(S)| ≈ |V|, making each iteration costly both in terms of computation and memory.

Neighbor Sampling: The neighbor sampling approach was proposed by Hamilton et al. (2017) to approximate Z^{(l)}_s for each s ∈ S with a subset of N^l(S). Given a fanout hyperparameter k, this subset is computed recursively by randomly picking k neighbors from N(s) for each s ∈ S to form the next layer S^1, a subset of N^1(S). If d_s ≤ k, then the exact neighborhood N(s) is used.
For the next layer, S^1 is treated as the new set of seed vertices and this procedure is applied recursively.

Revisiting LADIES, Dependent Layer-based Sampling: From now on, we will drop the layer notation, focus on a single layer, and ignore the nonlinearities. Let us define M_t = H_t W as a shorthand notation. Then our goal is to approximate:

H_s = \frac{1}{d_s} \sum_{t \to s} M_t
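A quick numerical check (illustrative only, using NumPy) confirms why sampling-based approximations of this quantity are reasonable: the sample mean over a uniformly chosen subset of k neighbors is an unbiased estimator of H_s, so averaging many such estimates converges to the exact value.

```python
import numpy as np

rng = np.random.default_rng(0)
d_s, k = 50, 10
M = rng.normal(size=(d_s, 4))  # one row M_t per in-neighbor t of s
H_s = M.mean(axis=0)           # exact: (1/d_s) * sum over t->s of M_t

# Average many neighbor-sampled estimates of H_s:
trials = 20000
est = np.zeros_like(H_s)
for _ in range(trials):
    idx = rng.choice(d_s, size=k, replace=False)  # sample k of d_s neighbors
    est += M[idx].mean(axis=0)                    # sample-mean estimator
est /= trials

assert np.allclose(est, H_s, atol=0.05)  # unbiased: the average converges to H_s
```

What sampling algorithms differ in is not this expectation but the variance of a single estimate and how many distinct vertices they touch to achieve it.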




