(LA)YER-NEIGH(BOR) SAMPLING: DEFUSING NEIGHBORHOOD EXPLOSION IN GNNS

Abstract

Graph Neural Networks have recently received significant attention; however, training them at large scale remains a challenge. Minibatch training coupled with sampling is used to alleviate this challenge. Even so, existing approaches either suffer from the neighborhood explosion phenomenon or have poor performance. To deal with these issues, we propose a new sampling algorithm called LAyer-neighBOR sampling (LABOR). It is designed to be a direct replacement for Neighbor Sampling with the same fanout hyperparameter while sampling up to 7× fewer vertices, without sacrificing quality. By design, from the point of view of a single vertex, the variance of its estimator matches that of Neighbor Sampling. Moreover, under the same vertex sampling budget constraints, LABOR converges faster than existing layer sampling approaches and can use up to 112× larger batch sizes compared to Neighbor Sampling.

1. INTRODUCTION

Graph Neural Networks (GNNs) (Hamilton et al., 2017; Kipf & Welling, 2017) have become the de facto models for representation learning on graph-structured data, and they have consequently been deployed in production systems (Ying et al., 2018; Niu et al., 2020). These models iteratively update node embeddings by passing messages along the direction of the edges in the given graph, with nonlinearities between layers. With l layers, the computed node embeddings contain information from the l-hop neighborhood of the seed vertex. In production settings, GNN models need to be trained on billion-scale graphs (Ching et al., 2015; Ying et al., 2018), and training takes hours to days even on distributed systems (Zheng et al., 2022b;a). As with deep neural networks (DNNs) in general, it is more efficient to use minibatch training (Bertsekas, 1994) on GNNs, though it is trickier in this case. The node embeddings in GNNs depend recursively on their neighbors' embeddings, so with l layers this dependency spans the l-hop neighborhood of the node. Real-world graphs usually have a very small diameter, and if l is large, the l-hop neighborhood may well span the entire graph, a phenomenon known as Neighborhood Explosion (NEP) (Zeng et al., 2020). To address this, researchers proposed sampling a subgraph of the l-hop neighborhood of the nodes in the batch. There are three main approaches: node-based, layer-based, and subgraph-based methods. Node-based sampling methods (Hamilton et al., 2017; Chen et al., 2018a; Liu et al., 2020; Zhang et al., 2021) sample independently and recursively for each node. It was noticed that node-based methods sample subgraphs that are too shallow, i.e., with a low ratio of edges to nodes.
Thus layer-based sampling methods were proposed (Chen et al., 2018b; Zou et al., 2019; Huang et al., 2018; Dong et al., 2021), where the sampling for a whole layer is done collectively. Subgraph sampling methods (Chiang et al., 2019; Zeng et al., 2020; Hu et al., 2020b; Zeng et al., 2021), on the other hand, do not use the recursive layer-by-layer scheme of the node- and layer-based methods and instead tend to use the same subgraph for all layers. Some of these sampling methods take the magnitudes of embeddings into account (Liu et al., 2020; Zhang et al., 2021; Huang et al., 2018), while others, such as Chen et al. (2018a); Cong et al. (2021), cache historical embeddings to reduce the variance of the computed approximate embeddings. There are also methods that sample from a vertex cache filled with popular vertices (Dong et al., 2021). Most of these techniques are orthogonal to each other and can be incorporated into other sampling algorithms. Node-based sampling methods suffer the most from NEP, but they guarantee a good approximation for each embedding by ensuring that each vertex gets k neighbors, where k is the only hyperparameter of the sampling algorithm. Layer-based sampling methods do not suffer as much from NEP because the number of vertices sampled per layer is a hyperparameter, but they cannot guarantee that every vertex gets a good enough approximation, and their hyperparameters are hard to reason about: the number of nodes to sample at each layer depends highly on the graph structure (as the numbers in Table 2 show). Subgraph sampling methods usually have more bias than their node- and layer-based counterparts. Hence, in this paper, we focus on node- and layer-based sampling methods and combine their advantages. The major contributions of this work are as follows: • We propose a new sampling algorithm called LABOR, combining the advantages of neighbor and layer sampling approaches using Poisson sampling.
LABOR correlates the sampling procedures of the given set of seed nodes so that the vertices sampled from different seeds have a large overlap, resulting in up to a 7× reduction in computation, memory, and communication. Furthermore, LABOR has the same hyperparameters as Neighbor Sampling, so it can be used as a drop-in replacement, and it can speed up training by up to 2.6×. • We experimentally verify our findings and show that our proposed sampling algorithm LABOR outperforms both neighbor sampling and layer sampling approaches. LABOR can enjoy a batch size up to 112× larger than NS while sampling the same number of vertices.

2. BACKGROUND

Graph Neural Networks: Given a directed graph G = (V, E), where V and E ⊂ V × V are the vertex and edge sets respectively, (t → s) ∈ E denotes an edge from a source vertex t ∈ V to a destination vertex s ∈ V, and A_{ts} denotes the corresponding edge weight if provided. If we have a batch of seed vertices S ⊂ V, let us define the l-hop neighborhood N^l(S) for the incoming edges as follows:

N(s) = { t | (t → s) ∈ E },   N^1(S) = N(S) = ∪_{s∈S} N(s),   N^l(S) = N(N^{l-1}(S))   (1)

Let us also define the degree d_s of vertex s as d_s = |N(s)|. To simplify the discussion, assume uniform edge weights, A_{ts} = 1, ∀(t → s) ∈ E. Then our goal is to estimate the following for each vertex s ∈ S, where H_t^{(l-1)} is the embedding of vertex t at layer l - 1, W^{(l-1)} is the trainable weight matrix at layer l - 1, and σ is the nonlinear activation function (Hamilton et al., 2017):

Z_s^{(l)} = (1 / d_s) Σ_{t→s} H_t^{(l-1)} W^{(l-1)},   H_s^{(l)} = σ(Z_s^{(l)})   (2)

Exact Stochastic Gradient Descent: If we have a node prediction task, V_t ⊆ V is the set of training vertices, y_s, s ∈ V_t are the labels of the prediction task, and ℓ is the loss function for the prediction task, then our goal is to minimize the following loss function: (1 / |V_t|) Σ_{s∈V_t} ℓ(y_s, Z_s^{(l)}). Replacing V_t in the loss function with S ⊂ V_t for each iteration of gradient descent yields stochastic gradient descent for GNNs. However, with l layers, the computation depends on N^l(S), which reaches a large portion of real-world graphs, i.e., |N^l(S)| ≈ |V|, making each iteration costly in both computation and memory.

Neighbor Sampling: The neighbor sampling approach was proposed by Hamilton et al. (2017) to approximate Z_s^{(l)} for each s ∈ S with a subset of N^l(S). Given a fanout hyperparameter k, this subset is computed recursively by randomly picking k neighbors from N(s) for each s ∈ S to form the next layer S^1, a subset of N^1(S). If d_s ≤ k, the exact neighborhood N(s) is used.
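The mean-aggregation layer update Z_s^{(l)} above can be sketched in a few lines of NumPy; the toy graph, dimensions, and variable names below are illustrative assumptions, not part of the paper:

```python
import numpy as np

# Hypothetical toy graph: edges stored as (t, s) pairs, one per edge t -> s.
edges = [(0, 2), (1, 2), (0, 3), (2, 3)]
num_nodes, feat_dim, out_dim = 4, 5, 3

rng = np.random.default_rng(0)
H = rng.normal(size=(num_nodes, feat_dim))   # H^(l-1): input embeddings
W = rng.normal(size=(feat_dim, out_dim))     # W^(l-1): layer weight matrix

def gcn_layer(H, W, edges, num_nodes):
    """Mean aggregation: Z_s = (1/d_s) * sum_{t->s} H_t @ W, then sigma."""
    Z = np.zeros((num_nodes, W.shape[1]))
    deg = np.zeros(num_nodes)
    for t, s in edges:
        Z[s] += H[t] @ W
        deg[s] += 1
    nonzero = deg > 0
    Z[nonzero] /= deg[nonzero, None]
    return np.maximum(Z, 0)  # ReLU as the nonlinearity sigma

H_next = gcn_layer(H, W, edges, num_nodes)
```

Stacking l such layers makes H_next for a seed vertex depend on its full l-hop neighborhood, which is exactly the dependency that sampling truncates.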
For the next layer, S^1 is treated as the new set of seed vertices and this procedure is applied recursively.

Revisiting LADIES, Dependent Layer-based Sampling: From now on, we drop the layer notation, focus on a single layer, and ignore the nonlinearities. Let us define M_t = H_t W as a shorthand. Then our goal is to approximate:

H_s = (1 / d_s) Σ_{t→s} M_t   (3)

Assign probabilities π_t > 0, ∀t ∈ N(S), normalized so that Σ_{t∈N(S)} π_t = 1, and use sampling with replacement to sample T ⊂ N(S) with |T| = n, where n is the number of vertices to sample given as input to the LADIES algorithm and T is a multiset possibly containing multiple copies of the same vertex. Letting d̃_s = |T ∩ N(s)| be the number of sampled vertices for a given vertex s, we get the following two possible estimators for each vertex s ∈ S:

H'_s = (1 / (n d_s)) Σ_{t∈T∩N(s)} M_t / π_t   (4)

H''_s = (Σ_{t∈T∩N(s)} M_t / π_t) / (Σ_{t∈T∩N(s)} 1 / π_t)   (5)

Note that H'_s in Eq. 4 is the Horvitz-Thompson estimator and H''_s in Eq. 5 is the Hajek estimator. For a comparison between the two, and how to get an even better estimator by combining them, see Khan & Ugander (2021). The formulation in the LADIES paper uses H'_s, but it proposes to row-normalize the sampled adjacency matrix, meaning its implementation uses H''_s. However, analysing the variance of the Horvitz-Thompson estimator is simpler, and its variance serves as an upper bound for the variance of the Hajek estimator when |M_t| and π_t are uncorrelated (Khan & Ugander, 2021; Dorfman, 1997), which we assume to be true in our case:

Var(H''_s) ⪅ Var(H'_s) = (1 / (d̃_s d_s^2)) (Σ_{t→s} π_t) (Σ_{t'→s} Var(M_{t'}) / π_{t'})   (6)

Since we do not have access to the computed embeddings, and to simplify the analysis, we assume that Var(M_t) = 1 from now on. One can see that Var(H'_s) is minimized when π_t = p, ∀t → s, under the constraint Σ_{t→s} π_t ≤ p d_s for some constant p ∈ [0, 1]; hence any deviation from uniformity increases the variance.
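The two estimators in Eqs. 4 and 5 can be compared with a quick Monte Carlo sketch; the single-seed setup, message values, and probabilities below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical single seed s with d_s = 8 neighbors, scalar messages M_t,
# and nonuniform probabilities pi_t normalized over N(S) = N(s).
M = rng.normal(size=8)
d_s = len(M)
pi = np.arange(1, d_s + 1, dtype=float)
pi /= pi.sum()

n, trials = 4, 20000
ht_vals, hajek_vals = [], []
for _ in range(trials):
    idx = rng.choice(d_s, size=n, replace=True, p=pi)  # sampling with replacement
    w = 1.0 / pi[idx]
    ht_vals.append((M[idx] * w).sum() / (n * d_s))  # H'_s: Horvitz-Thompson form
    hajek_vals.append((M[idx] * w).sum() / w.sum())  # H''_s: Hajek (self-normalized)

exact = M.mean()  # H_s = (1/d_s) * sum_{t->s} M_t
ht_mean, hajek_mean = np.mean(ht_vals), np.mean(hajek_vals)
```

The Horvitz-Thompson average converges to the exact mean (it is unbiased), while the Hajek estimator is only asymptotically unbiased but typically has lower variance.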
The variance also shrinks as d̃_s grows. However, in theory and in practice, there is no guarantee that each vertex s ∈ S will get any neighbors in T, let alone equal numbers of neighbors. Some vertices will have very good estimators with thousands of samples and very low variance, while others might not get even a single sampled neighbor. For this reason, we designed LABOR so that every vertex in S samples enough neighbors in expectation. While LADIES is optimal from an approximate matrix multiplication perspective (Chen et al., 2022), it is far from optimal in the presence of nonlinearities and multiple layers. Even with a single layer, the loss functions used are nonlinear. Moreover, the nonlinearities between layers and the fact that there are multiple layers exacerbate this issue and necessitate that each vertex get a good enough estimator with low enough variance. Also, LADIES gives a formulation using sampling with replacement instead of without replacement, which is suboptimal from the variance perspective, while its implementation uses sampling without replacement without correcting the bias created thereby. In the next section, we show how all of these problems are addressed by our newly proposed Poisson sampling framework and LABOR sampling.

3. LABOR: LAYER NEIGHBOR SAMPLING

As mentioned previously, node-based sampling methods sample subgraphs that are too shallow, leading to NEP in just a few hops (e.g., see Table 2). Layer sampling methods (Zou et al., 2019) attempt to fix this by sampling a fixed number of vertices in each layer; however, they cannot ensure that the estimators for the vertices are of high quality, and it is hard to reason about how to choose the number of vertices to sample in each layer. The original LADIES paper (Zou et al., 2019) proposes using the same number for each layer, while papers evaluating it found it better to sample an increasing number of vertices in each layer (Liu et al., 2020; Chen et al., 2022). There is no systematic way to choose how many vertices to sample in each layer for the LADIES method, and since each graph has a different density and connectivity structure, this choice depends highly on the graph in question. Therefore, due to its simplicity and high-quality results, Neighbor Sampling currently seems to be the most popular sampling approach, with high-quality implementations on both CPUs and GPUs in the popular GNN frameworks (Wang et al., 2019; Fey & Lenssen, 2019). We propose a new approach that combines the advantages of layer and neighbor sampling using a vertex-centric, variance-based framework, drastically reducing the number of sampled vertices while ensuring that training quality does not suffer and matches that of neighbor sampling. Another advantage of our method is that the user only needs to choose the batch size and fanout hyperparameters, as in the Neighbor Sampling approach; the algorithm itself then samples the minimum number of vertices in the later layers in an unbiased way while ensuring each vertex gets enough neighbors and a good approximation.

3.1. LABOR SAMPLING

The design philosophy of LABOR sampling is to create a direct alternative to Neighbor Sampling while incorporating the advantages of layer sampling. In layer sampling, the main idea can be summarized as individual vertices making correlated decisions while sampling their neighbors: if a vertex t is sampled, all edges t → s into the seed vertices S (s ∈ S) are added to the sampled subgraph. This can be interpreted as the vertices in S making a collective decision on whether to sample t. The other thing to keep in mind is that the existing layer sampling methods use sampling with replacement when doing importance sampling with unequal probabilities, because it is nontrivial to compute the inclusion probabilities in the without-replacement case. The Hajek estimator in the without-replacement case with equal probabilities becomes:

H''_s = (Σ_{t∈T∩N(s)} M_t / π_t) / (Σ_{t∈T∩N(s)} 1 / π_t) = (1 / d̃_s) Σ_{t∈T∩N(s)} M_t   (7)

and it has the variance:

Var(H''_s) = ((d_s - d̃_s) / (d_s - 1)) (1 / d̃_s)   (8)

Keeping these two points in mind, we use Poisson sampling and design LABOR sampling around it. First, let us show how one can do layer sampling using Poisson sampling (PLADIES). Given probabilities π_t ∈ [0, 1], ∀t ∈ N(S), scaled so that Σ_{t∈N(S)} π_t = n, we include t ∈ N(S) in our sample T with probability π_t by flipping a coin for it, i.e., we sample r_t ∼ U(0, 1) and include t in T if r_t ≤ π_t. In the end, E[|T|] = n, and we can still use the Hajek estimator H''_s or the Horvitz-Thompson estimator H'_s to estimate H_s. This way of doing layer sampling is unbiased by construction and achieves the same goal in linear time, in contrast to the quadratic-time debiasing approach explained in Chen et al. (2022).
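The PLADIES coin-flipping step above amounts to one independent Bernoulli draw per candidate vertex; a minimal sketch, with an assumed candidate set and weight vector:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer: |N(S)| = 1000 candidates, expected sample size n = 100.
# Raw weights are scaled so that sum(pi) = n, with pi_t clipped into [0, 1].
num_candidates, n = 1000, 100
raw = rng.uniform(0.5, 2.0, size=num_candidates)
pi = np.minimum(1.0, n * raw / raw.sum())

# Poisson sampling: sample r_t ~ U(0, 1) and include t in T iff r_t <= pi_t.
r = rng.uniform(size=num_candidates)
sampled = r <= pi  # boolean mask for T
```

Unlike sampling with replacement, |T| is random here, but its expectation equals n by construction, and inclusion probabilities are exactly the π_t, so no debiasing pass is needed.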
In this case, the variance becomes (Williams et al., 1998):

Var(H''_s) ⪅ Var(H'_s) = (1 / d_s^2) Σ_{t→s} 1/π_t - 1/d_s   (9)

One can notice the minus term 1/d_s: it enables the variance to converge to 0 when all π_t = 1, in which case we get the exact result. In the sampling-with-replacement case, by contrast, the variance goes to 0 only as the sample size goes to infinity. This way of mimicking layer sampling with Poisson sampling still has the disadvantage that d̃_s varies wildly across different s. To overcome this and mimic Neighbor Sampling, where E[d̃_s] = min(d_s, k) for a given fanout hyperparameter k, we proceed as follows: for given unnormalized probabilities π_t ≥ 0, ∀t ∈ N(S), and a given s, let us define c_s as the quantity satisfying the following equality if k < d_s, and otherwise c_s = max_{t→s} 1/π_t:

(1 / d_s^2) Σ_{t→s} 1 / min(1, c_s π_t) - 1/d_s = 1/k - 1/d_s   (10)

Note that 1/k - 1/d_s is the variance when π_t = k/d_s, ∀t ∈ N(s), so that E[d̃_s] = k. Also note that:

(1/k - 1/d_s) - ((d_s - k)/(d_s - 1)) (1/k) = (d_s - k)/(k d_s) - ((d_s - k)/(d_s - 1)) (1/k) = ((d_s - k)/k) (1/d_s - 1/(d_s - 1)) < 0   (11)

meaning that the variance target we set through Eq. 10 is strictly better than Neighbor Sampling's variance in Eq. 8, and it results in E[d̃_s] ≥ k, with strict equality in the uniform probability case. Then each vertex s ∈ S samples each t → s with probability c_s π_t. To keep the collective decision making, we sample r_t ∼ U(0, 1), ∀t ∈ N(S), and vertex s samples vertex t if and only if r_t ≤ c_s π_t. Note that if we instead use a uniform random variable r_{ts} for each edge rather than r_t for each vertex, and π is uniformly initialized, then we get exactly the same behaviour as Neighbor Sampling.
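The contrast between sharing one variate r_t per vertex (LABOR) and drawing one variate r_{ts} per edge (the Neighbor Sampling analogue) can be demonstrated directly; the bipartite layer below is a synthetic assumption, with uniform π and the uniform-case c_s = k/d_s:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer: 50 seeds, each with 40 in-neighbors drawn from 500
# candidates, fanout k = 10, uniform pi_t = 1.
num_seeds, num_cand, k = 50, 500, 10
adj = [rng.choice(num_cand, size=40, replace=False) for _ in range(num_seeds)]
pi = np.full(num_cand, 1.0)
c = np.array([k / len(nbrs) for nbrs in adj])  # uniform case: c_s * pi_t = k/d_s

# LABOR: one shared variate r_t per candidate -> seeds decide collectively.
r_shared = rng.uniform(size=num_cand)
labor_sampled = set()
for s, nbrs in enumerate(adj):
    labor_sampled.update(t for t in nbrs if r_shared[t] <= c[s] * pi[t])

# Neighbor-sampling analogue: an independent variate r_ts per edge.
ns_sampled = set()
for s, nbrs in enumerate(adj):
    r_edge = rng.uniform(size=len(nbrs))
    ns_sampled.update(t for t, rt in zip(nbrs, r_edge) if rt <= c[s] * pi[t])
```

Each seed still samples k = 10 neighbors in expectation under both schemes, but the shared variates make the per-seed samples overlap, so the union |T| is much smaller for LABOR.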

3.2. IMPORTANCE SAMPLING

Given the sampling procedure above, one wonders how different choices of π ≥ 0 affect |T|, the total number of unique vertices sampled. In our case, its expectation is extremely easy to compute:

E[|T|] = Σ_{t∈N(S)} P(t ∈ T) = Σ_{t∈N(S)} min(1, π_t max_{t→s} c_s)   (12)

In particular, we need to find the π* ≥ 0 minimizing E[|T|]:

π* = arg min_{π≥0} Σ_{t∈N(S)} min(1, π_t max_{t→s} c_s)   (13)

Note that for any given π ≥ 0, E[|T|] is the same for any scalar multiple xπ, x ∈ R+, meaning that the objective function is homogeneous of degree 0.
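Eq. 12 can be evaluated in one pass once the per-seed scalars c_s are known; the tiny incidence structure and numbers below are made up for illustration:

```python
import numpy as np

# Hypothetical toy layer: 3 candidate vertices, 2 seeds. incidence[s] lists
# the candidates t with t -> s; c_s values are assumed precomputed.
pi = np.array([0.2, 0.5, 0.9])   # unnormalized weights pi_t
incidence = [[0, 1], [1, 2]]
c = np.array([1.5, 0.8])

# E[|T|] = sum_t min(1, pi_t * max_{s : t -> s} c_s)
max_c = np.zeros_like(pi)
for s, nbrs in enumerate(incidence):
    for t in nbrs:
        max_c[t] = max(max_c[t], c[s])
expected_T = np.minimum(1.0, pi * max_c).sum()
```

Here max_c = [1.5, 1.5, 0.8], so E[|T|] = 0.30 + 0.75 + 0.72 = 1.77. Scaling π by any x > 0 rescales each c_s by 1/x (c depends on π), leaving this sum unchanged, which is the degree-0 homogeneity noted above.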

3.3. COMPUTING c AND π *

Note that c_s was defined to be the scalar satisfying the following equality involving the variance of the estimator of H_s:

(1 / d_s^2) Σ_{t→s} 1 / min(1, c_s π_t) - 1/d_s = 1/k - 1/d_s   (14)

If we rearrange the terms, we get:

Σ_{t→s} 1 / min(1, c_s π_t) = d_s^2 / k   (15)

One can see that the left-hand side of the equality is monotonically decreasing with respect to c_s ≥ 0. Thus one can use binary search to find the c_s satisfying the equality to any precision needed. But we opt to use the following iterative algorithm to compute it:

v_s^(0) = 0,   c_s^(0) = (k / d_s^2) Σ_{t→s} 1/π_t   (16)

c_s^(i+1) = c_s^(i) (Σ_{t→s} 1 / min(1, c_s^(i) π_t) - v_s^(i)) / (d_s^2/k - v_s^(i)),   v_s^(i+1) = Σ_{t→s} 1[c_s^(i+1) π_t ≥ 1]   (17)

This iterative algorithm converges in at most d_s steps, and the convergence is exact and monotonic from below. One can also implement it in linear time O(d_s) if {π_t | t → s} is sorted, by making use of precomputed prefix-sum arrays. Note that c = c(π), meaning that c is a function of the given probability vector π. To compute π*, we use a similar fixed-point iteration:

π^(0) = 1;   ∀t ∈ N(S): π_t^(i+1) = π_t^(i) max_{t→s} c_s(π^(i))   (18)

Thus, we alternate between computing c = c(π), meaning c is computed with the current π, and updating π with the computed c values. Each step of this iteration is guaranteed to lower the objective function value in Eq. 13 until convergence to a fixed point; see Appendix A.1. The modified formulation for a given nonuniform weight matrix A_{ts} is explained in Appendix A.3.

Note that the variance of Poisson sampling when π_t = k/d_s is 1/k - 1/d_s. One might question why we try to match the variance of Neighbor Sampling and choose to use a fixed fanout for all the seed vertices. In the uniform probability case, if we have already sampled some set of edges for all vertices in S and want to sample one more edge, the question becomes: for which vertex in S should we sample the new edge?
Our answer to this question is the vertex s whose variance would improve the most. If vertex s currently has d̃_s edges sampled, then sampling one more edge for it would improve its variance from 1/d̃_s - 1/d_s to 1/(1 + d̃_s) - 1/d_s. Since the derivative of the variance with respect to d̃_s is monotonic, we are allowed to reason about the marginal improvements by comparing their derivatives:

∂(1/d̃_s - 1/d_s) / ∂d̃_s = -1/d̃_s^2

Notice that the derivative does not depend on the degree d_s of vertex s at all, and the greater the magnitude of the derivative, the more the variance of a vertex improves by sampling one more edge. Thus, choosing any vertex s with the least number of edges sampled works for us, that is: s = arg min_{s'∈S} d̃_{s'}. In light of this observation, one can see that it is optimal to sample an equal number of edges for each vertex in S. This is one of the reasons LADIES is not efficient with respect to the number of edges it samples: on graphs with skewed degree distributions, it samples thousands of edges for some seed vertices, which contribute very little to the variance of the estimator since it is already very close to 0.
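As the text notes, Eq. 15 can also be solved by binary search, since its left-hand side is monotone in c_s; a minimal sketch (the weights, degree, and fanout below are assumptions, and bisection stands in for the paper's exact iterative algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical seed s: d_s = 20 in-neighbors with positive unnormalized
# weights pi_t, and fanout k = 5 < d_s.
pi = rng.uniform(0.1, 1.0, size=20)
d_s, k = len(pi), 5
target = d_s * d_s / k  # Eq. 15: sum_t 1 / min(1, c_s * pi_t) = d_s^2 / k

def lhs(c):
    return (1.0 / np.minimum(1.0, c * pi)).sum()

# lhs(c) is monotonically decreasing in c, so bisection converges.
lo, hi = 1e-9, 1e9
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if lhs(mid) > target:
        lo = mid
    else:
        hi = mid
c_s = 0.5 * (lo + hi)
expected_fanout = np.minimum(1.0, c_s * pi).sum()  # E[d~_s]
```

By the Cauchy-Schwarz inequality, (Σ min(1, c_s π_t)) (Σ 1/min(1, c_s π_t)) ≥ d_s², so matching the variance target forces E[d̃_s] ≥ k, consistent with the claim following Eq. 11.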

4. EXPERIMENTS

In this section, we empirically evaluate the performance of each method in the node-prediction setting on the following datasets: reddit (Hamilton et al., 2017), products (Hu et al., 2020a), yelp, and flickr (Zeng et al., 2020). Details about these datasets are given in Table 1. We compare the LABOR variants LABOR-0, LABOR-1, and LABOR-*, where 0, 1, and * stand for the number of fixed-point iterations applied to optimize Eq. 13, against NS (Neighbor Sampling), LADIES, and PLADIES, where PLADIES is the unbiased Poisson sampling variant of LADIES introduced in Section 3.1. We do not include FastGCN in our comparisons as it is superseded by the LADIES paper. Methods that take into account additional information, such as historical embeddings or embedding magnitudes, or that have a different sampling structure, such as a vertex cache to sample from, are not included in the comparisons; the techniques in these papers are mostly orthogonal to the sampling problem and algorithms discussed in this paper. We evaluate all the baselines on the GCN model in Eq. 2 with 3 layers, a hidden dimension of 256, and residual skip connections enabled. We use the Adam optimizer (Kingma & Ba, 2014) with a 0.001 learning rate. We carried out our experiments using the DGL framework (Wang et al., 2019) with the PyTorch backend (Paszke et al., 2019). Experiments were repeated 100 times and averages are presented. We will first show in Section 4.1 that, despite the different numbers of sampled vertices, the LABOR and NS training loss curves are almost the same with the same fanout and batch-size hyperparameters. We will match the hyperparameters of LADIES to the number of vertices sampled by LABOR and see whose batches have better quality. Then, we will show what happens when different sampling algorithms are given the same budget and compare their vertex sampling efficiency in Section 4.2.
Section 4.3 shows the reduction in the number of vertices sampled with each fixed point iteration.

4.1. COMPARISON AGAINST NEIGHBOR SAMPLING AND LADIES

In this experiment, we set the batch size to 1,000 and the fanout k = 10 for the LABOR and NS methods to examine the difference in the sizes of the sampled subgraphs and whether the convergence behaviour is the same. In Figure 1, we can see that the convergence curves of both NS and the LABOR variants are almost the same, showing that sampling smaller subgraphs does not really affect the batch quality. Table 2 shows the difference in the sampled subgraph sizes in each layer. One can see that on reddit, LABOR-* samples 6.9× fewer vertices in the 3rd layer while keeping the same convergence behaviour. On the flickr dataset, however, LABOR-* samples only 1.3× fewer vertices. The size of the difference depends on two factors. The first is the amount of overlap of neighbors among the vertices in S: if the neighbors of vertices in S did not overlap at all, then one obviously could not do better than NS. The second is the average degree of the graph. With a fanout of 10, both Neighbor Sampling and LABOR have to copy the whole neighborhood of a vertex s with degree d_s ≤ 10. Thus, for such graphs, a small difference is expected because for many of the vertices the whole neighborhood is copied. Looking at Table 1, the average degree of the flickr graph is 10.09, which is why there is only a small difference between LABOR and NS. In Table 2, the number of sampled edges is another important metric. We can see that LABOR-0 reduces both the number of vertices and the number of edges sampled. On the other hand, when importance sampling is enabled, the number of vertices sampled goes down while the number of edges sampled goes up. This is because with importance sampling the inclusion probabilities become nonuniform, and it takes more edges per seed vertex to get a good approximation (see Eq. 10). The hyperparameters of LADIES and PLADIES were picked to match LABOR-* so that all methods have the same sampling budget in each layer (see Table 2).
Figure 1 shows that, in terms of the loss curve, LADIES and PLADIES perform almost the same on all but the flickr dataset, where there is a big difference between the two in favor of PLADIES. We also see that the LABOR variants either match the quality of PLADIES, on reddit, or outperform it, on products, yelp, and flickr. Looking at Table 2, we can see that LABOR-0 has the best runtime performance across all datasets. This is both because it lacks the overhead of performing the fixed-point iterations and because it samples the fewest edges among the LABOR variants. By design, all LABOR variants have the same convergence curves, as seen in Figure 1. The decision of which variant to use then depends on one factor: feature access speed. If vertex features are stored on a slow storage medium (such as host memory accessed over PCI-E), then minimizing the number of sampled vertices becomes the highest priority, in which case one should pick LABOR-*. Depending on the relative vertex feature access performance and the performance of the training processor, one can choose LABOR-j: the faster the feature access, the lower the j.

Table 2: Average number of vertices and edges sampled in different layers (all numbers are in thousands; lower is better). The last column shows iterations (minibatches) per second (it/s; higher is better). The hyperparameters of LADIES and PLADIES were picked to roughly match the number of vertices sampled by LABOR-* to get a fair comparison. The convergence curves can be found in Figure 1. The timing information was measured on an NVIDIA T4 GPU. Green stands for best, red stands for worst results, with a 5% cutoff.

4.2. EVALUATION OF VERTEX SAMPLING EFFICIENCY

In this experiment, we set a limit on the number of sampled vertices and modify the batch size to match the given vertex budget. The budgets used were picked to be of the same magnitude as the numbers in the |V_3| column of Table 2 and can be found in Table 1. Figure 2 displays the results of this experiment. Table 3 shows that the more vertex-efficient the sampling method, the larger the batch size it can use during training. The number of sampled vertices is not a function of the batch size for the LADIES algorithm, so we do not include it in this comparison. All of the experiments were repeated 100 times and their averages plotted, which is why our convergence plots are smooth and the differences are clear. The most striking result in this experiment is that there can be up to a 112× difference in the batch sizes of the LABOR-* and NS algorithms on the reddit dataset, which translates into faster convergence, as the training loss and validation F1-score curves in Figure 2 show.

4.3. IMPORTANCE SAMPLING, NUMBER OF FIXED POINT ITERATIONS

In this section, we look at the convergence behaviour of the fixed-point iterations described in Section 3.3. Table 4 shows the number of sampled vertices in the last layer with respect to the number of fixed-point iterations applied. In this table, ∞ stands for applying the fixed-point iterations until convergence, which in practice occurs within at most 15 iterations, when the relative change in the objective function drops below 10^-4. One can see that most of the reduction in the objective function (Eq. 13) occurs after the first iteration, and the remaining iterations have diminishing returns. Full convergence can save from 14% to 33% depending on the dataset. The monotonically decreasing numbers provide empirical evidence for the proof presented in Appendix A.1.

Figure 2: Validation F1-score and training loss curves to evaluate vertex sampling efficiency under the same sampling budget. The batch size is chosen so that the number of sampled vertices matches the vertex budget for each dataset and method, and is given in Table 3.

5. CONCLUSIONS

In this paper, we introduced LABOR sampling, a novel way to combine layer and neighbor sampling approaches using a vertex-variance-centric framework. We transform the sampling problem into an optimization problem whose constraint is to match the neighbor sampling variance for each vertex while sampling the fewest vertices, and we show how to minimize this new objective function via fixed-point iterations. On datasets with dense graphs, such as reddit, we show that our approach can sample a subgraph with 7× fewer vertices without degrading the batch quality. We also show that, compared to LADIES, LABOR converges faster with the same sampling budget.

A APPENDIX

A.1 FIXED POINT ITERATIONS

Given any π^(0) > 0, one iteration to get π^(1), and one more iteration using c(π^(1)) to get π^(2), we have the following observations:

π_t^(1) = π_t^(0) max_{t→s} c_s(π^(0))

(1 / d_s^2) Σ_{t→s} 1 / min(1, c_s(π^(0)) π_t^(0)) - 1/d_s = 1/k - 1/d_s

(1 / d_s^2) Σ_{t→s} 1 / min(1, c_s(π^(1)) π_t^(1)) = (1 / d_s^2) Σ_{t→s} 1 / min(1, c_s(π^(1)) max_{t→s'} c_{s'}(π^(0)) π_t^(0))

Now, note that for a given t ∈ N(s), max_{t→s'} c_{s'}(π^(0)) ≥ c_s(π^(0)), since s ∈ {s' | t → s'}. This implies that max_{t→s'} c_{s'}(π^(0)) π_t^(0) ≥ c_s(π^(0)) π_t^(0). Note that for π'_t = c_s(π^(0)) π_t^(0), ∀t → s, we have c_s(π') = 1 for any given s. Since max_{t→s'} c_{s'}(π^(0)) π_t^(0) ≥ c_s(π^(0)) π_t^(0) = π'_t, ∀t → s, this lets us conclude that c_s(π^(1)) ≤ 1, because the expression is monotonically increasing with respect to each of the π_t. By induction, this means that c_s^(i) ≤ 1, ∀i ≥ 1. Since π_t^(i) = π_t^(0) Π_{j=0}^{i-1} max_{t→s'} c_{s'}(π^(j)), π_t^(i) is monotonically decreasing. This means that the objective value in Eq. 13 is also monotonically decreasing, and it is clearly bounded from below by 0. Any monotonically decreasing sequence bounded from below has to converge, so our fixed-point iteration procedure is convergent as well. The intuition behind the proof above is that after updating π via Eq. 18, the probability c_s π_t of each edge t → s goes up, because the update takes the maximum c_s over all possible s. This means each vertex gets a higher-quality estimator than the set variance target, so we then have room to reduce the number of vertices sampled by choosing an appropriate c_s ≤ 1. To summarize, the π update step increases the quality of the batch, which in turn lets the c_s step reduce the total number of vertices sampled.
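The monotone decrease argued above can be checked empirically on a small synthetic layer; the graph, degrees, and the use of bisection (in place of the paper's exact O(d_s) iteration) for c_s are all assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer: 20 seeds with varying degrees over 200 candidates, k = 5.
num_seeds, num_cand, k = 20, 200, 5
degrees = rng.integers(10, 40, size=num_seeds)
adj = [rng.choice(num_cand, size=int(d), replace=False) for d in degrees]

def solve_c(pi_nbrs, k):
    """Bisection for c_s with sum_t 1/min(1, c*pi_t) = d_s^2 / k (Eq. 15)."""
    d = len(pi_nbrs)
    if k >= d:
        return 1.0 / pi_nbrs.min()  # full neighborhood: c_s = max_t 1/pi_t
    target = d * d / k
    lo, hi = 1e-9, 1e9
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if (1.0 / np.minimum(1.0, mid * pi_nbrs)).sum() > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def objective(pi, adj, k):
    """E[|T|] (Eq. 12) and max_{t->s} c_s for the pi update (Eq. 18)."""
    max_c = np.zeros_like(pi)
    for nbrs in adj:
        c_s = solve_c(pi[nbrs], k)
        max_c[nbrs] = np.maximum(max_c[nbrs], c_s)
    return np.minimum(1.0, pi * max_c).sum(), max_c

pi = np.ones(num_cand)  # pi^(0) = 1
sizes = []
for _ in range(4):
    size, max_c = objective(pi, adj, k)
    sizes.append(size)
    pi = pi * max_c  # fixed-point update: pi_t <- pi_t * max_{t->s} c_s(pi)
```

The recorded expected sample sizes decrease monotonically across iterations, mirroring the numbers reported in Table 4.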

A.2 EXACTLY SAMPLING A FIXED NUMBER OF NEIGHBORS

One can easily resort to sequential Poisson sampling (Ohlsson, 1998) if one wants d̃_s = min(k, d_s) instead of E[d̃_s] = min(k, d_s), to get exactly the same behaviour as Neighbor Sampling. Given π_t, c_s, and r_t, we pick the d̃_s = min(k, d_s) vertices t → s with the smallest values of r_t / (c_s π_t), which can be computed in expected linear time using the quickselect algorithm (Hoare, 1961).
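A minimal sketch of this selection, with assumed values for the degree, fanout, weights, and c_s; NumPy's argpartition plays the role of quickselect here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical seed s: d_s = 30 in-neighbors, fanout k = 10, shared variates
# r_t, and inclusion scores c_s * pi_t.
d_s, k = 30, 10
pi = rng.uniform(0.2, 1.0, size=d_s)
c_s = 0.9
r = rng.uniform(size=d_s)

# Sequential Poisson sampling: keep the k neighbors with the smallest rank
# r_t / (c_s * pi_t). argpartition is an O(d_s) selection, like quickselect.
rank = r / (c_s * pi)
chosen = np.argpartition(rank, k)[:k]
```

This yields exactly min(k, d_s) neighbors per seed while preserving the vertex-level correlation induced by the shared r_t, since neighbors with small r_t rank first for every seed they belong to.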

A.3 EXTENSION TO THE WEIGHTED CASE

If the given adjacency matrix $A$ has nonuniform weights, then we want to estimate the following:

$$H_s = \frac{1}{A_{*s}} \sum_{t \to s} A_{ts} M_t, \qquad A_{*s} = \sum_{t \to s} A_{ts}$$

If we have a probability for each edge, $\pi_{ts}, \forall (t \to s) \in E$, then the variance becomes:

$$\mathrm{Var}(H''_s) \le \mathrm{Var}(H'_s) = \frac{1}{(A_{*s})^2} \left( \sum_{t \to s} \frac{A_{ts}^2}{\min(1, c_s \pi_{ts})} - \sum_{t \to s} A_{ts}^2 \right)$$

In this case, we can still aim to reach the same variance target $v_s = \frac{1}{k} - \frac{1}{d_s}$, or any given custom target $v_s \in \mathbb{R}^+$, by finding the $c_s$ that satisfies the following equality:

$$\frac{1}{(A_{*s})^2} \left( \sum_{t \to s} \frac{A_{ts}^2}{\min(1, c_s \pi_{ts})} - \sum_{t \to s} A_{ts}^2 \right) = v_s$$

In this case, the objective function becomes:

$$\pi^* = \arg\min_{\pi \ge 0} \sum_{t \in N(S)} \min(1, \max_{t \to s} c_s \pi_{ts})$$

Optimizing the objective function above minimizes the expected number of vertices sampled. Given any $\pi_{ts} > 0, \forall (t \to s)$, the fixed-point iterations proposed for the unweighted case in Eq. 18 can be modified as follows:

$$\pi^{(0)} = A, \qquad \forall (t \to s): \pi^{(i+1)}_{ts} = \max_{t \to s'} c_{s'}(\pi^{(i)}) \, \pi^{(i)}_{ts'}$$

A more principled way to choose $v_s$ in the weighted case is by following the argument presented in Section 3.4. There, the discussion revolves around the derivative of the variance with respect to the expected number of vertices sampled for a given seed vertex $s$. If we apply the same argument, then we get:

$$v_s(c_s) = \frac{1}{(A_{*s})^2} \left( \sum_{t \to s} \frac{A_{ts}^2}{\min(1, c_s \pi_{ts})} - \sum_{t \to s} A_{ts}^2 \right), \qquad \frac{\partial v_s}{\partial c_s} = \frac{-1}{(A_{*s})^2} \sum_{t \to s} \mathbb{1}[c_s \pi_{ts} < 1] \frac{A_{ts}^2}{c_s^2 \pi_{ts}}$$

$$E[\tilde{d}_s](c_s) = \sum_{t \to s} \min(1, c_s \pi_{ts}), \qquad \frac{\partial E[\tilde{d}_s]}{\partial c_s} = \sum_{t \to s} \mathbb{1}[c_s \pi_{ts} < 1] \, \pi_{ts}$$

Then, to compute the derivative of the variance with respect to the expected number of vertices sampled for a given seed vertex $s$, which is $\frac{\partial v_s}{\partial E[\tilde{d}_s]}$, we can use the chain rule and get:

$$\frac{\partial v_s}{\partial E[\tilde{d}_s]} = \frac{\partial v_s}{\partial c_s} \bigg/ \frac{\partial E[\tilde{d}_s]}{\partial c_s} = \frac{-\frac{1}{(A_{*s})^2} \sum_{t \to s} \mathbb{1}[c_s \pi_{ts} < 1] \frac{A_{ts}^2}{c_s^2 \pi_{ts}}}{\sum_{t \to s} \mathbb{1}[c_s \pi_{ts} < 1] \, \pi_{ts}}$$
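To make the weighted case concrete, the following sketch (illustrative code under the definitions above, not the paper's implementation) solves for the $c_s$ that meets a given variance target $v_s$ by bisection, using the fact that the variance is non-increasing in $c_s$:

```python
def weighted_variance(c, A_ts, pi_ts):
    """Var(H''_s) as a function of c_s, for edge weights A_ts and probabilities pi_ts."""
    A_star = sum(A_ts)
    s = sum(a * a / min(1.0, c * p) for a, p in zip(A_ts, pi_ts))
    return (s - sum(a * a for a in A_ts)) / (A_star * A_star)

def solve_c_s(A_ts, pi_ts, v_s, iters=60):
    """Bisection for the c_s satisfying weighted_variance(c_s) == v_s.

    The variance diverges as c_s -> 0 and reaches 0 once c_s * pi_ts >= 1
    for all edges, so a root exists for any v_s > 0 and bisection applies.
    """
    lo, hi = 1e-12, 1.0
    while weighted_variance(hi, A_ts, pi_ts) > v_s:
        hi *= 2.0                       # grow the bracket until variance <= v_s
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if weighted_variance(mid, A_ts, pi_ts) > v_s:
            lo = mid
        else:
            hi = mid
    return hi
```

With uniform weights $A_{ts} = 1$ and $\pi_{ts} = 1/d_s$, this reproduces the unweighted behaviour: for $d_s = 10$, $k = 4$ and $v_s = \frac{1}{k} - \frac{1}{d_s} = 0.15$, bisection yields $c_s = 4$, so $E[\tilde{d}_s] = \sum_{t \to s} \min(1, c_s \pi_{ts}) = 4 = k$.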

A.4 CONVERGENCE SPEED WITH RESPECT TO WALL TIME

In this section, we perform hyperparameter optimization on the Neighbor Sampler and the LABOR sampler so that training converges to a target validation accuracy as fast as possible. We leave LADIES out of this experiment because it is too slow, as can be seen in the last column of Table 2. We ran this experiment on an A100 GPU and stored the input features in main memory, which were accessed directly over PCI-e during training by pinning their memory. This kind of training scenario is commonly used when training on large datasets whose input features do not fit in GPU memory. We use the larger of our datasets, products and yelp, for this experiment. For products, the validation accuracy target we set is 91.5%, and for yelp it is 60%. We tune the learning rate in $[10^{-4}, 10^{-1}]$, the batch size in $[2^{10}, 2^{15}]$ and the fanout of each layer in $[5, 25]$. We use the same model as in Section 4. For LABOR, we additionally tune the number of importance sampling iterations $i \in [0, 3]$, so that it can switch between the LABOR-i variants, as well as a layer-dependency boolean parameter that, when enabled, makes LABOR use the same random variates $r_t$ for different layers, which has the effect of increasing the overlap of sampled vertices across layers.

Figure 3: The runtimes to reach a validation F1-score of 91.5% on products and 60% on yelp, belonging to runs tried by the HEBO hyperparameter tuner, sorted with respect to their runtimes. Having a lower curve means that a method is faster overall compared to the others. HEBO could not find any hyperparameter configuration for GraphSAINT to reach the set target on products, hence its curve is left out.

We use the state-of-the-art hyperparameter tuner HEBO (Cowen-Rivers et al., 2020), the winning submission to the NeurIPS 2020 Black-Box Optimisation Challenge, to tune the parameters of each sampling algorithm with respect to the runtime required to reach the target validation accuracy, with a timeout of 300 seconds terminating a run if its configuration does not reach the target accuracy. We let HEBO run overnight and collect the minimum runtimes required to achieve the target accuracies. For products, the fastest configuration, corresponding to the 38.2s runtime, had fanouts (18, 5, 25), batch size 10500, learning rate 0.0145, and used LABOR-1 with layer dependency False. For Neighbor Sampler, the fastest configuration, corresponding to the 43.82s runtime, had fanouts (15, 5, 21), batch size 12000, learning rate 0.0144. For the Yelp dataset, the fastest configuration, corresponding to the 41.60s runtime, had fanouts (6, 5, 7), batch size 5400, learning rate 0.000748, and used LABOR-1 with layer dependency True. For Neighbor Sampler, the fastest configuration, corresponding to the 47.40s runtime, had fanouts (5, 6, 6), batch size 4600, learning rate 0.000931. These results indicate that training with LABOR is faster than with Neighbor Sampling when it comes to time to convergence.

We run the same experiment with GraphSAINT (Zeng et al., 2020) using the DGL example code, on both ogbn-products and yelp with the same model architecture. We used its edge sampler, tuning the edge sampling budget in $[2^{10}, 2^{15}]$ and the learning rate in $[10^{-4}, 10^{-1}]$. We disabled batch normalization to make the comparison with NS and LABOR fair, since the models they use do not have batch normalization. We use HEBO to tune these hyperparameters to reach the set validation accuracy and let it run overnight. The results show that HEBO was not able to find any hyperparameter configuration reaching 91.5% accuracy on products in under 1500 seconds. For the Yelp dataset, the fastest runtime to reach 60% accuracy was 92.2s, with an edge sampling budget of 12500 and learning rate 0.0214.
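The search space described above can be summarized in code. The sketch below uses plain random search as a stand-in for HEBO (whose actual API differs), and `objective` is a hypothetical callback wrapping the training loop with the 300-second timeout, returning the time-to-target in seconds or None on timeout:

```python
import math
import random

def sample_config(sampler="LABOR", num_layers=3, rng=random):
    """Draw one configuration from the search space described in the text:
    learning rate in [1e-4, 1e-1] (log scale), batch size in [2^10, 2^15],
    per-layer fanouts in [5, 25]; LABOR additionally tunes the number of
    importance sampling iterations i in [0, 3] and a layer-dependency flag."""
    cfg = {
        "lr": 10 ** rng.uniform(-4, -1),
        "batch_size": 2 ** rng.randint(10, 15),
        "fanouts": [rng.randint(5, 25) for _ in range(num_layers)],
    }
    if sampler == "LABOR":
        cfg["importance_iters"] = rng.randint(0, 3)   # LABOR-0 .. LABOR-3
        cfg["layer_dependency"] = rng.random() < 0.5  # share r_t across layers
    return cfg

def tune(objective, sampler="LABOR", trials=100, rng=random):
    """Random-search stand-in for HEBO: keep the configuration with the
    smallest time-to-target; objective returns None if the run timed out."""
    best_time, best_cfg = math.inf, None
    for _ in range(trials):
        cfg = sample_config(sampler, rng=rng)
        t = objective(cfg)
        if t is not None and t < best_time:
            best_time, best_cfg = t, cfg
    return best_time, best_cfg
```

In the actual experiment, HEBO replaces the random draw with Bayesian optimization over the same space, which is what makes overnight tuning sample-efficient enough to find the reported configurations.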



The code will be contributed to the DGL framework after the blind review process.



The works of Liu et al. (2020); Zhang et al. (2021); Huang et al. (2018); Cong et al. (2021); Dong et al. (

Figure 1: The validation F1-score and training loss curves on different datasets with the same batch size. The soft edges represent the confidence intervals. The numbers of sampled vertices and edges can be found in Table 2.

Then, what one would do is to choose a constant $C(k)$ as a function of the fanout parameter $k$, set $\frac{\partial v_s}{\partial E[\tilde{d}_s]} = C(k)$ and solve for $c_s$. $C(k)$ would be a negative quantity whose absolute value decreases as $k$ increases; it would probably look like $C(k) = -\frac{C'}{k^2}$ for some constant $C' > 0$. We leave this as future work.

Table 1: Datasets used in the experiments: numbers of vertices, edges, average degree, features, sampling budget used, and training, validation and test vertex splits.

The batch sizes used in Figure 2. These were chosen such that, in expectation, each method samples with the same budget given in Table 1. Having a larger batch size speeds up convergence.

Number of vertices (in thousands) in the 3rd layer w.r.t. the number of fixed-point iterations (its). ∞ denotes applying the fixed-point iterations until convergence, i.e., LABOR-*; 1 its stands for LABOR-1, etc.

Difan Zou, Ziniu Hu, Yewen Wang, Song Jiang, Yizhou Sun, and Quanquan Gu. Layer-dependent importance sampling for training deep and large graph convolutional networks. In Advances in Neural Information Processing Systems 32 (NeurIPS), 2019.

