ON BATCH SIZE SELECTION FOR STOCHASTIC TRAINING FOR GRAPH NEURAL NETWORKS Anonymous

Abstract

Batch size is an important hyper-parameter for training deep learning models with the stochastic gradient descent (SGD) method, and it has a great influence on training time and model performance. We study the batch size selection problem for training graph neural networks (GNNs) with SGD. To reduce the training time while keeping a decent model performance, we propose a metric that combines the variance of the gradients and the compute time for each mini-batch. We theoretically analyze how the batch size influences this metric and propose a formula to estimate a rough range for the optimal batch size. In GNNs, gradients evaluated on samples in a mini-batch are not independent, and it is challenging to evaluate the exact variance of the gradients. To address the dependency, we analyze an estimator for the gradients that considers the randomness arising from two consecutive layers in a GNN, and suggest a guideline for picking the appropriate scale of the batch size. We complement our theoretical results with extensive empirical experiments for ClusterGCN, FastGCN and GraphSAINT on 4 datasets: Ogbn-products, Ogbn-arxiv, Reddit and Pubmed. We demonstrate that, in contrast to conventional deep learning models, GNNs benefit from large batch sizes.

1. INTRODUCTION

Training large neural networks is often time-consuming. In many real-world scenarios, training might take hours or even days to converge Radford et al. (2018); Devlin et al. (2018). As a consequence, identifying strategies that reduce the training time while retaining accuracy is an important research objective. The most popular training algorithms for deep learning are Stochastic Gradient Descent (SGD) and its variants such as RMSProp or Adam Graves (2013); Kingma & Ba (2014). These algorithms work iteratively: in each epoch, the data is first partitioned into minibatches, and weight updates are then calculated using only the data in each minibatch. It has been observed that the size of the minibatches plays a crucial role in the network's accuracy, generalization capability and convergence time (Keskar et al. (2016); He et al. (2019); McCandlish et al. (2018); Radiuk (2017)). For typical deep learning tasks, practitioners have observed that small batch sizes, e.g., {4, 16, . . . , 512}, lead to better generalization performance and training efficiency Keskar et al. (2016). For Graph Neural Networks (GNNs), selecting the appropriate batch size remains more of a mystery, and to the best of our knowledge, there has been no published work that focuses on batch size selection for GNNs. The small batch size guidelines for conventional NNs do not carry over because the batches are used to approximate the graph aggregations or convolutions. The approximation error propagates and leads to a much more substantial variance in the gradients than is observed for NNs. In practice, based on released code, we see that implementations tend to either use the largest batch size that can fit into memory Li et al. (2020) or use a small batch size similar to those for non-graph settings Chen et al. (2018); Zou et al. (2019). In this work, we explore the choice of batch size for graph neural networks.
By means of a theoretical investigation, we develop guidelines for the choice of batch size that depend on the average degree and number of nodes of the graph. These guidelines lead to intermediate batch sizes, considerably larger than the small NN batch sizes but much smaller than the maximum size dictated by memory limits of a modern GPU. We provide empirical results that demonstrate that the batch sizes derived using our guidelines provide an excellent trade-off between training time and accuracy. Substantially smaller sizes may lead to faster convergence but reduced accuracy; using larger sizes can achieve similar accuracy but training may take much longer to converge.

2. RELATED WORK

Graph Neural Networks (GNNs) have become increasingly popular in addressing graph-based tasks (Kipf & Welling, 2016; Hamilton et al., 2017; Defferrard et al., 2016; Gilmer et al., 2017; Ying et al., 2018). One major line of research aims to improve the expressiveness of GNNs via 1) advanced aggregation functions (Veličković et al., 2017; Monti et al., 2017; Liu et al., 2019; Qu et al., 2019; Pei et al., 2020); 2) deeper architectures (Li et al., 2019; 2020); and 3) adaptive graph structure (Li et al., 2018; Vashishth et al., 2019; Zhang et al., 2019). However, training a large-scale GNN model remains challenging because of the large memory consumption, long convergence time, and heavy computation (Chiang et al., 2019). A full-batch gradient descent training scheme was commonly used in earlier GNN research. While this is suitable for relatively small graphs, it requires storing all intermediate embeddings, which is not scalable for large graphs. Convergence can also be slow, since the parameters are updated only once per epoch. Hamilton et al. (2017) and Ying et al. (2018) proposed training GNNs with mini-batch stochastic gradient descent (SGD) methods. Mini-batch SGD training suffers from neighborhood expansion, which leads to a time complexity that grows exponentially with the GNN depth. To reduce the exponential growth in the number of receptive nodes, Chen et al. (2018), Huang et al. (2018), and Zou et al. (2019) proposed layer-wise sampling, where a fixed number of nodes are sampled in each layer. Importance sampling techniques were incorporated to reduce variance. Unfortunately, the overhead of the iterative neighborhood sampling strategy is still significant and becomes worse as GNNs become progressively deeper. Chiang et al. (2019) and Zeng et al. (2020) proposed graph-wise sampling to further improve the sampling efficiency. This can be viewed as a special case of layer-wise sampling where the same set of nodes is sampled across all layers.
Chen et al. (2017) and Cong et al. (2020) proposed variance reduction stochastic training frameworks that maintain a cache for the intermediate embeddings of all nodes. This can improve convergence but results in large memory requirements, stretching the capabilities of GPUs when training over large graphs. Due to this drawback, we do not consider such approaches in this paper, but it is an intriguing direction for future work. Most existing graph neural network papers do not clearly address how they set the batch size. Experimentally, we observe that batch size is a critical hyper-parameter and can significantly influence training time and test accuracy. The importance of the batch size has been recognized for non-graph deep learning models. Keskar et al. (2016), He et al. (2019) and Masters & Luschi (2018) have shown that smaller batch sizes, in the range {4, 16, . . . , 512}, can achieve better generalization performance. The randomness of small batches proves beneficial. McCandlish et al. (2018) suggested that the batch size should be selected so that a balance is achieved between the "noise" and "signal" of the gradient. Radiuk (2017) showed that using larger batch sizes, of the order of 1024, can be beneficial when training convolutional neural network models. Gower et al. (2019), Alfarra et al. (2020), and Smith (2018) introduced adaptive batch size approaches to further improve the convergence rate and generalization performance. To the best of our knowledge, no existing work has directly addressed the selection of the batch size for stochastic training of graph neural networks. The objective of this paper is to fill that gap and provide guidelines for the GNN setting.

3. PRELIMINARIES

We represent a graph $G = (V, E)$ with a set of nodes $V = \{v_1, \dots, v_n\}$ and a set of edges $E = \{e_1, \dots, e_M\}$ by an adjacency matrix $A \in \mathbb{R}^{n \times n}$. For a node $v \in V$, we let $N(v)$ be the set of neighbors of $v$. In addition, we associate each node $v$ with a feature vector $x_v \in \mathbb{R}^{1 \times F}$, and let $X \in \mathbb{R}^{n \times F}$ be the corresponding feature matrix. Let $D$ be the degree matrix of the graph $G$, where $D_{i,i} = \sum_j A_{i,j}$ and $D_{i,j} = 0$ if $i \neq j$. To ease the presentation, we use symbols such as $R \lesssim T$ to denote that there exists an absolute constant $c$ such that $R \leq c \cdot T$.

3.1. GRAPH NEURAL NETWORK MODELS

Graph neural networks (GNNs) can be applied to node prediction, link prediction and graph prediction tasks. In this work, we focus on the node prediction task. We are given labels of nodes from a training set and we need to predict the labels for nodes in a testing set. One paradigm for solving this problem is to learn representations for all the nodes and then map the representations to labels. Graph neural networks aggregate the representations of neighbors into each node in order to integrate the graph structure into each node's representation. Specifically, let $H^l \in \mathbb{R}^{n \times F_l}$ be the representation for layer $l$, where the $i$-th row $h_i^l$ is the representation for node $i$ at layer $l$ and $F_l$ is the dimension of the representation ($H^0$ is set as the original node features $X$). The forward propagation for the hidden states is defined as:

$$H^{l+1} = \sigma(\tilde{H}^{l+1}) \quad \text{and} \quad \tilde{H}^{l+1} = \tilde{A} H^l W^l, \qquad (1)$$

where $W^l \in \mathbb{R}^{F_l \times F_{l+1}}$ are trainable parameters, $\sigma(\cdot)$ is an activation function, and $\tilde{A}$ is a normalization of the adjacency matrix $A$, e.g., the random walk normalization $\tilde{A} = D^{-1} A$, or the symmetric normalization $\tilde{A} = D^{-1/2} A D^{-1/2}$. The equations in (1) can be expressed for each node $i$ as:

$$h_i^{l+1} = \sigma(\tilde{h}_i^{l+1}) \quad \text{and} \quad \tilde{h}_i^{l+1} = \sum_{j \in N(i)} \tilde{A}_{ij} h_j^l W^l. \qquad (2)$$
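The forward propagation in equation (1) can be illustrated with a minimal NumPy sketch. This is not the implementation used in the experiments: the function names `normalize_adjacency` and `gcn_layer` are hypothetical, the adjacency matrix is dense for readability, and ReLU is assumed as one possible choice of the activation σ.

```python
import numpy as np

def normalize_adjacency(A, mode="rw"):
    """Return D^{-1} A (mode="rw") or D^{-1/2} A D^{-1/2} (mode="sym")."""
    deg = A.sum(axis=1)
    # Isolated nodes (degree 0) get a zero row instead of a division by zero.
    inv = np.where(deg > 0, 1.0 / np.maximum(deg, 1e-12), 0.0)
    if mode == "rw":
        return inv[:, None] * A
    inv_sqrt = np.sqrt(inv)
    return inv_sqrt[:, None] * A * inv_sqrt[None, :]

def gcn_layer(A_norm, H, W):
    """One full-batch layer: H^{l+1} = sigma(A_norm @ H^l @ W^l)."""
    H_tilde = A_norm @ H @ W
    return np.maximum(H_tilde, 0.0)  # ReLU, assumed here as the activation

# Toy usage on a 3-node path graph 0 - 1 - 2.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H0 = np.ones((3, 2))
W0 = np.eye(2)
H1 = gcn_layer(normalize_adjacency(A, "rw"), H0, W0)
```

With the random walk normalization every row of the normalized matrix sums to one, so aggregating constant features leaves them unchanged in this toy example.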

3.2. STOCHASTIC TRAINING FOR GNN MODELS

Sampling in training GNNs: Label sampling and neighbor sampling. In conventional deep learning models, every sample in a mini-batch contributes independently to the approximated gradient. Including more samples thus reduces the variance of the gradient estimate. However, in GNN models, samples in a mini-batch are no longer independent. In fact, we have two different concepts of sampling. First, we sample a mini-batch of nodes in the training set, and we call this label sampling. Since the representation of a node also depends on its neighbors, the receptive field for each selected node grows exponentially as the number of layers increases. Neighbor sampling is adopted to constrain the number of receptive neighbor nodes. Existing frameworks use three main approaches to handle sampling for GNNs: node-wise, layer-wise and graph-wise sampling.

Node-wise sampling. Hamilton et al. (2017) and Ying et al. (2018) adopt uniformly random sampling of the labels. For neighbor sampling, they recursively sample a certain number of neighbors for each layer. Specifically, to evaluate the aggregation for layer $l + 1$, for each node $i$, a set of nodes $S_i^l$ is sampled from the neighbors of node $i$, and equation (2) is evaluated as

$$\tilde{h}_i^{l+1} = \frac{|N(i)|}{|S_i^l|} \sum_{j \in S_i^l} \tilde{A}_{ij} h_j^l W^l.$$

$|S_i^l|$ is predefined to limit the number of sampled nodes, but the size of the receptive field for each included mini-batch node still grows exponentially as the number of layers increases.

Layer-wise sampling. Chen et al. (2018), Huang et al. (2018), and Zou et al. (2019) propose a different neighbor sampling method to reduce the number of receptive nodes. Nodes are sampled at the layer level instead of the node level. Importance sampling is adopted to reduce the variance of sampling and further improve convergence.
Specifically, given the set of sampled nodes $S^{l+1}$ in layer $l + 1$, nodes in layer $l$ are sampled with some probability distribution $q^l(i \mid S^{l+1})$ that is derived from minimizing the variance of the gradients Chen et al. (2018), where $i$ is the index of the node. For brevity, we denote the sampling distribution as $q^l(i)$. From Chen et al. (2018) and Zou et al. (2019), the forward propagation in equation (2) is defined as:

$$\tilde{h}_i^{l+1} = \frac{1}{|S^l|} \sum_{j \in S^l} \frac{\tilde{A}_{ij}}{q^l(j)} h_j^l W^l.$$

Note that by controlling $|S^l|$, the number of receptive nodes grows only linearly with the number of layers. In this framework, uniformly random sampling is used for labels.

Graph-wise sampling. Zeng et al. (2020) and Chiang et al. (2019) propose graph-wise sampling. This can be regarded as a special case of layer-wise sampling, which uses the same set of nodes as both the sampled labels and the sampled neighbors across all layers. Importance sampling is adopted in Zeng et al. (2020), and normalization is applied to both the loss and the neighbor aggregations to obtain an unbiased estimator for the gradients.

Since layer-wise sampling and graph-wise sampling are much more efficient in practice than node-wise sampling, we focus on the influence of the batch size for layer-wise sampling and graph-wise sampling. For layer-wise sampling, the number of included label samples (the batch size) can differ from the number of neighbors sampled at each layer. In our analysis, we focus on the case where these values are equal; see the supplementary material for further discussion.
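The layer-wise estimator of the neighbor aggregation can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function names are hypothetical, nodes are drawn with replacement, and the sampling distribution is assumed to be proportional to the squared column norms of the normalized adjacency matrix, which is one common importance-sampling choice rather than necessarily the exact distribution of any cited method.

```python
import numpy as np

def column_norm_importance(A_norm):
    """Assumed sampling distribution: q(j) proportional to ||A_norm[:, j]||^2."""
    sq = (A_norm ** 2).sum(axis=0)
    return sq / sq.sum()

def layerwise_aggregate(A_norm, H, q, num_samples, rng):
    """Estimate A_norm @ H as (1/|S|) * sum_{j in S} A_norm[:, j] h_j / q(j)."""
    n = A_norm.shape[0]
    idx = rng.choice(n, size=num_samples, replace=True, p=q)
    scale = 1.0 / (num_samples * q[idx])  # importance weights 1 / (|S| q(j))
    return (A_norm[:, idx] * scale) @ H[idx]

# Toy check on a 3-node path graph with random walk normalization.
rng = np.random.default_rng(0)
A_norm = np.array([[0.0, 1.0, 0.0], [0.5, 0.0, 0.5], [0.0, 1.0, 0.0]])
H = np.array([[1.0], [2.0], [3.0]])
q = column_norm_importance(A_norm)
estimate = layerwise_aggregate(A_norm, H, q, 20000, rng)
exact = A_norm @ H
```

Because each sampled term is divided by its sampling probability, the estimate is unbiased for the full aggregation, and the gap between `estimate` and `exact` shrinks as the number of sampled nodes grows.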

4. BATCH SIZE SELECTION: THEORETICAL ANALYSIS

When we use SGD training, we sample a small proportion of samples to approximate the true distribution of samples and estimate the gradients. In the non-graph setting, samples in the mini-batch contribute (approximately) independently to the estimates of gradients. In GNN models, the impact is more complicated, because node representations are derived using a sample of the neighborhood. If all nodes were included, a node's representation would be calculated based on its entire neighborhood; for small batch sizes, only a few neighbours are included, and the approximation of the aggregation can be very poor. This propagates through the layers leading to highly erroneous gradient estimates. In this section, we consider the selection of the batch size for layer-wise and graph-wise sampling in GNN training and derive guidelines. Our approach is to analyse a metric which captures both the variance of gradients and compute time for training a mini-batch.

4.1. VARIANCE OF GRADIENT AND VARIANCE ESTIMATOR

The essential goal of sampling in SGD is to approximate the gradients. We aim to obtain an unbiased approximation and minimize the variance of the gradients in each mini-batch. Recall the definition of the activation of a node $j$ from (2) and let $L$ be a loss function. By the chain rule, the gradient with respect to the variables in layer $l$, when all nodes are included (i.e., no sampling), is:

$$\frac{\partial L}{\partial W^l} = \frac{1}{|V^{l+1}|} \sum_{i \in V^{l+1}} \frac{\partial L}{\partial \tilde{h}_i^{l+1}} \frac{\partial \tilde{h}_i^{l+1}}{\partial W^l} = \frac{1}{|V^{l+1}|} \sum_{i \in V^{l+1}} \frac{\partial L}{\partial \tilde{h}_i^{l+1}} \sum_{j \in N(i)} \tilde{A}_{ij} h_j^l,$$

where $V^l$ denotes the receptive nodes in layer $l$. Under layer-wise sampling, the gradient can be expressed as:

$$\frac{\partial L}{\partial W^l} = \frac{1}{|S^{l+1}|} \sum_{i \in S^{l+1}} \frac{\partial L}{\partial \tilde{h}_i^{l+1}} \sum_{j \in S^l} \frac{\tilde{A}_{ij}}{|S^l| q_j^l} h_j^l,$$

where $S^l$ and $S^{l+1}$ are the sets of sampled nodes for layer $l$ and layer $l + 1$, respectively. Since $h_j^l$ is evaluated recursively on the samples of former layers, it introduces more randomness, and analyzing the exact variance in general is difficult. Instead, we analyze intermediate estimators, with the view that these can act as a valuable proxy for the variance of the gradients. In most existing work, the neighbor aggregation term $\sum_{j \in S^l} \frac{\tilde{A}_{ij}}{|S^l| q_j^l} h_j^l$ is used as the proxy estimator (Chen et al., 2018; Huang et al., 2018; Zou et al., 2019; Zeng et al., 2020). This proxy estimator does not adequately capture the correlation between layers. The variance of the gradients for the variables in layer $l$ is highly related to the nodes sampled in both layer $l + 1$ and layer $l$. To address this issue, we analyze a different estimator which considers the randomness arising from two consecutive layers. Specifically, we consider the following estimator.

Definition 4.1 Let $S_1, S_2 \subseteq V$ be such that each vertex in $V$ is selected to $S_1$ (respectively $S_2$) with probability $p$ (respectively $q$). For a weight matrix $W$, we define our estimator as:

$$\xi = \frac{1}{|S_1|} \sum_{v \in S_1} \frac{\mathbb{1}_{N(v) \cap S_2 \neq \emptyset}}{|S_2 \cap N(v)|} \sum_{u \in N(v) \cap S_2} \tilde{A}_{v,u} \cdot W \cdot x_u,$$

where $\mathbb{1}_Z$ is the indicator random variable for the event $Z$.
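To make Definition 4.1 concrete, the following sketch draws the two random node sets and evaluates the estimator once. It assumes scalar node attributes and a scalar weight `W` (the simplification also used in Remark 4.1), a dense normalized adjacency matrix, and a hypothetical convention of returning 0 when the first sampled set is empty; the function name `xi_estimator` is illustrative.

```python
import numpy as np

def xi_estimator(A_norm, neighbors, x, W, p, q, rng):
    """One draw of the two-layer estimator from Definition 4.1 (scalar case)."""
    n = len(neighbors)
    in_S1 = rng.random(n) < p  # each node joins S1 independently with prob. p
    in_S2 = rng.random(n) < q  # each node joins S2 independently with prob. q
    S1 = np.flatnonzero(in_S1)
    if S1.size == 0:
        return 0.0  # convention of this sketch only
    total = 0.0
    for v in S1:
        sampled_nb = [u for u in neighbors[v] if in_S2[u]]
        if not sampled_nb:
            continue  # indicator 1_{N(v) ∩ S2 ≠ ∅} is zero
        inner = sum(A_norm[v, u] * W * x[u] for u in sampled_nb)
        total += inner / len(sampled_nb)
    return total / S1.size

# Toy usage: 3-node path graph, random walk normalization.
neighbors = [[1], [0, 2], [1]]
A_norm = np.array([[0.0, 1.0, 0.0], [0.5, 0.0, 0.5], [0.0, 1.0, 0.0]])
x = np.array([1.0, 2.0, 3.0])
rng = np.random.default_rng(0)
full = xi_estimator(A_norm, neighbors, x, 2.0, 1.0, 1.0, rng)
```

With `p = q = 1` every node is selected, so the call is deterministic; repeated calls with smaller probabilities give samples whose empirical variance can be compared across batch sizes.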

4.2. PSEUDO PRECISION RATE

In practice, one of the major purposes of sampling is to improve training efficiency and reduce the time for training the model. Existing methods only aim to reduce the variance of the estimators and do not take the computational cost into account Chen et al. (2018); Huang et al. (2018); Zou et al. (2019); Zeng et al. (2020). When we evaluate the impact of batch size, a larger batch size generates a better approximation of the gradient, but this comes at the cost of significantly more computation. Therefore, we need a metric that captures the trade-off between variance reduction and computation cost. McCandlish et al. (2018) propose a metric that balances the noise scale and gradient value in each minibatch to determine the optimal batch size, but it does not explicitly model the computational cost. In the context of variance reduction for Monte Carlo sampling, Owen (2013) introduces an efficiency metric for an estimator. If there is a reference estimator that has compute time $c_0$ and achieves variance $\sigma_0^2$, then the efficiency of an alternative estimator with variance $\sigma_1^2$ and compute time $c_1$ is defined as $\frac{c_0 \sigma_0^2}{c_1 \sigma_1^2}$. Normalizing compute time so that the reference estimator satisfies $c_0 \sigma_0^2 = 1$, we derive the metric $\frac{1}{c \sigma^2}$, which we call the pseudo precision rate:

Definition 4.2 Let $\xi$ be an estimator with computation cost $c > 0$ and variance $\sigma^2 > 0$. Then the pseudo precision rate of $\xi$ is defined as

$$\rho(\xi) = \frac{1}{c \sigma^2}.$$

Intuitively, this metric characterizes how much we can reduce the variance per unit of computation time. By maximizing the pseudo precision rate, we can achieve a balance between variance reduction and computational cost.
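The two quantities above are simple enough to state directly in code. The sketch below is illustrative; the function names are hypothetical, and the cost can be in any consistent unit (for example, seconds per mini-batch).

```python
def pseudo_precision_rate(cost, variance):
    """rho = 1 / (c * sigma^2), as in Definition 4.2."""
    if cost <= 0 or variance <= 0:
        raise ValueError("cost and variance must be positive")
    return 1.0 / (cost * variance)

def relative_efficiency(c0, var0, c1, var1):
    """Efficiency of an alternative estimator (c1, var1) against a
    reference estimator (c0, var0): (c0 * var0) / (c1 * var1)."""
    return (c0 * var0) / (c1 * var1)

# If doubling the cost exactly halves the variance, the rate is unchanged.
same = pseudo_precision_rate(1.0, 1.0) == pseudo_precision_rate(2.0, 0.5)
```

For independent samples, variance scales inversely with batch size and cost scales linearly, so the rate stays flat; the GNN setting is of interest precisely because the variance does not follow this simple scaling.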

4.3. GUIDELINE FOR BATCH SIZE

We derive the guideline for selecting the batch size by analyzing how the batch size influences the pseudo precision rate (Definition 4.2) of the estimator $\xi$ from Definition 4.1. The computation cost $c$ is defined as the computation cost for training the model over the minibatch. This is approximately a constant times $(|S_1| + |S_2|) \cdot \bar{d}$, since we have to aggregate the neighbor information for the nodes sampled in $S_1$ and $S_2$. We derive a lower bound on the pseudo precision rate $\rho(\xi)$ and observe that this bound converges to $1/\varphi$ for some value $\varphi : (G, x_{v_1}, \dots, x_{v_n}) \to \mathbb{R}$ which is independent of the batch size. Therefore, for any batch size $m < n$, there is some monotone decreasing function $\delta(m)$ such that $\rho(\xi) \geq \frac{1}{\varphi(1+\delta)}$. The proof of the following proposition is provided in the supplementary material.

Proposition 4.1 Let $\tilde{A} \in \mathbb{R}^{n \times n}$ be the normalized adjacency matrix of a graph $G = (V, E)$ with minimum degree $d_{\min} > \log n$, and suppose that $\max_{v,u \in V} |\tilde{A}_{u,v}| = O(1)$ and that for each $v \in V$ the attribute $x_v = O(1)$. Let $S_1$ and $S_2$ be two random sets such that for every $i \in \{1, 2\}$, every $v \in V$ is picked to $S_i$ with probability $m/n > \log n / d_{\min}$, so that $\mathbb{E}_{S_i}[|S_i|] = m$. Let $\xi$ be the estimator from Definition 4.1, where $W$ is some weight matrix with $\max_{v,u} |W_{v,u}| = O(1)$. Then there exist $\varphi : (G, x_{v_1}, \dots, x_{v_n}) \to \mathbb{R}$ and a monotone decreasing function

$$\delta(m) = \frac{2 \bar{d} \sum_{(v,u) \in E} \frac{\tilde{A}_{v,u}^2 W_{v,u} x_u^2}{|N(v)|^2}}{m \cdot \varphi},$$

such that for every $m < n$ the pseudo precision of $\xi$ is

$$\rho(\xi) \geq \left(\varphi (1 + \delta(m))\right)^{-1},$$

where $\bar{d}$ is the average node degree of the graph $G$.

Remark 4.1 Note that, for simplicity of presentation, we assumed that the layers are the same size and that all the attributes $x_v$ are scalars. The bound can be generalized to any dimension by summing the variances of the individual coordinates of $\xi_i$.
From the expression above, we can see that the bound on the pseudo precision converges to $1/\varphi$ as $\delta(m)$ decreases, so that for any accuracy $\delta > 0$, there exists some $m^*$ such that the pseudo precision is at least $1/(\varphi(1 + \delta))$. For practical purposes, although the graphs we deal with are not $d$-regular, we propose to set the batch size to approximately $n/\bar{d}$, where $\bar{d}$ is the average degree of the graph. The intuition behind this guideline is that with this choice the bound on the pseudo precision rate reaches 1/2 of its maximum value. Beyond this setting, there are diminishing returns: the required compute time keeps increasing, but the variance has already been reduced sufficiently that additional decreases do not improve accuracy.

5. EXPERIMENTS

5.1. EXPERIMENT SETTING

We evaluate our theoretical findings regarding batch size selection using three state-of-the-art algorithms: ClusterGCN Chiang et al. (2019) (graph-wise sampling), FastGCN Chen et al. (2018) (layer-wise sampling) and GraphSAINT Zeng et al. (2020) (graph-wise sampling). We test each of the above algorithms on four public datasets: Pubmed, Reddit, Ogbn-arxiv and Ogbn-products Hu et al. (2020). The statistics of each dataset are shown in Table 1. Despite the different statistics of the datasets, we can evaluate the optimal scale of the batch size from our theoretical result, which is shown in Table 2. For the Pubmed dataset, we repartition the data with a train/validation/test split ratio of 6 : 2 : 2. We keep the original partition for the remaining datasets. For all the tests, we use a 3-layer GCN. We use the Adam optimizer with an initial learning rate of 0.01 and the default values for the remaining hyper-parameters. For Pubmed and Reddit, we run training for 100 epochs. For Ogbn-arxiv and Ogbn-products, we run training for 200 epochs. We use the "node" sampler in GraphSAINT. The remaining settings are shown in Table 2. For ClusterGCN and GraphSAINT, we conduct our experiments based on their published GitHub repositories. For FastGCN, we implemented a version that can utilize GPU computation. All of our experiments are run on a server equipped with an NVIDIA Tesla V100 GPU (32GB memory) and an Intel Xeon Gold 6140 CPU (2.30GHz).
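The suggested batch-size scale of roughly the number of nodes divided by the average degree can be computed directly from graph statistics. The sketch below is illustrative: the function names are hypothetical, the graph sizes are toy numbers rather than statistics of the datasets above, and `relative_bound` additionally assumes that δ(m) is inversely proportional to m and equals 1 at the suggested size, which matches the statement that the bound reaches half of its maximum there.

```python
def average_degree(num_nodes, num_directed_edges):
    """Average degree, counting each undirected edge in both directions."""
    return num_directed_edges / num_nodes

def suggested_batch_size(num_nodes, avg_degree):
    """Guideline scale: approximately n / d_bar, and at least 1."""
    return max(1, round(num_nodes / avg_degree))

def relative_bound(m, m_star):
    """Fraction of the limiting bound reached at batch size m, assuming
    delta(m) = m_star / m (an assumption of this sketch)."""
    return 1.0 / (1.0 + m_star / m)

# Toy graph: 100,000 nodes and 1,000,000 directed edges (hypothetical).
d_bar = average_degree(100_000, 1_000_000)
m_star = suggested_batch_size(100_000, d_bar)
fractions = [relative_bound(m, m_star) for m in (256, m_star, 4 * m_star)]
```

Under this assumed form of δ(m), a batch size of 256 reaches only a small fraction of the limiting bound on this toy graph, while quadrupling the suggested size moves the fraction from 0.5 to 0.8 at four times the compute cost, which illustrates the diminishing returns.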

5.2. NUMERICAL RESULTS

Fig. 1 shows the relation between validation accuracy and training time for different batch size settings, datasets and algorithms. We also mark the positions that achieve the best validation accuracy, 95% of the best accuracy and 99% of the best accuracy. Table 3, Table 4, Table 5 and Table 6 show the detailed results at the epoch with the best validation accuracy for Pubmed, Reddit, Ogbn-arxiv and Ogbn-products, respectively. With a batch size of 256, a typical setting for conventional deep learning models, all the scenarios show a slow convergence rate and poor validation/test accuracy. When the batch size increases, both the training efficiency and the testing accuracy improve substantially. The reason is that, when the batch size is small, the accuracy of neighbor aggregation improves rapidly as the batch size grows. For FastGCN and GraphSAINT, the testing accuracy increases as we increase the batch size. However, there is a turning point for each dataset, beyond which the testing accuracy does not increase much. This confirms our theoretical conclusion about diminishing returns. Reddit is a typical example, where the testing accuracy is boosted up to a batch size of 8k and grows slowly beyond that value. On the other hand, there is a sweet spot for the fastest convergence rate, which is not necessarily aligned with the turning point of the testing accuracy. Above the sweet spot, the variance of the gradients in each minibatch does not decrease much as the batch size grows, but the computation cost keeps increasing. Below the sweet spot, the rapid reduction of variance results in faster convergence. Our suggested scale of the optimal batch size usually falls close to those two important points. For ClusterGCN, the testing accuracy does not change much, but the sweet spot for convergence is also close to our suggested optimal batch size.
In general, our guideline suggests a batch size that is much larger than conventional batch size settings such as 512, and it is close to the spot where we can get a decent testing accuracy with an efficient convergence rate. Interestingly, on the Reddit dataset, Chen et al. (2018); Zou et al. (2019); Zeng et al. (2020); Chiang et al. (2019) report that graph-wise sampling (testing accuracy of 0.96+) performs much better than layer-wise sampling (testing accuracy of around 0.93). We found that the difference mainly comes from the fact that the graph-wise sampling experiments use a better batch size setting (8k in GraphSAINT), while the layer-wise experiments use a small batch size (400 in FastGCN). In our experiments, when we properly set the batch size, the performances of layer-wise sampling and graph-wise sampling are close, which indicates the importance of batch size selection in GNN training.
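The markers described above (best validation accuracy, and the first time 95% or 99% of it is reached) can be extracted from a training log as sketched below. The function name and the log values are hypothetical and serve only to illustrate the bookkeeping; they are not results from the experiments.

```python
def first_time_reaching(times, val_acc, fraction):
    """Return the earliest time at which validation accuracy reaches
    `fraction` of the best validation accuracy seen in the whole run."""
    target = fraction * max(val_acc)
    for t, acc in zip(times, val_acc):
        if acc >= target:
            return t
    return None  # only possible for an empty log or fraction > 1

# Hypothetical log: cumulative training time (s) and validation accuracy.
times = [1.0, 2.0, 3.0, 4.0]
val_acc = [0.5, 0.9, 0.96, 1.0]
t95 = first_time_reaching(times, val_acc, 0.95)
t99 = first_time_reaching(times, val_acc, 0.99)
```

Comparing these times across batch sizes gives one way to locate a convergence sweet spot separately from the turning point of the final accuracy.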

6. CONCLUSION

We studied batch size selection for SGD training of GNN models. We proposed the pseudo precision rate, a metric that reflects training efficiency. We analyzed how the batch size influences this metric for an estimator that considers the randomness arising from two consecutive layers in a GNN. Through extensive experiments, we showed that the batch size for GNN models should be much larger than the typical setting of {4, 16, . . . , 512} used for conventional deep learning models. With our suggested batch size scale of $n/\bar{d}$, where $n$ is the total number of nodes and $\bar{d}$ is the average node degree, a GNN model can achieve decent testing performance efficiently.

A BATCH SIZE ANALYSIS

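In this section, we prove Proposition 4.1 and Corollary 4.2. We start by stating a straightforward lemma from probability theory, whose proof is a direct application of Chernoff bounds.

Lemma A.1 Let $U$ be a finite set, and consider a random set $S \subseteq U$ drawn such that each element in $U$ is picked to $S$ with probability $p$ independently. Then, with probability at least $1 - o(1/\mathrm{poly}(n))$,

$$|S| \in \left[ p|U| - \sqrt{p|U| \log n},\; p|U| + \sqrt{p|U| \log n} \right].$$

For a fixed vertex $u$, we let $\tilde{x}_u = W x_u$, so that our estimator is

$$\xi = \frac{1}{|S_1|} \sum_{v \in S_1} \frac{\mathbb{1}_{N(v) \cap S_2 \neq \emptyset}}{|S_2 \cap N(v)|} \sum_{u \in N(v) \cap S_2} \tilde{A}_{v,u} \cdot \tilde{x}_u.$$

We start with the proof of Proposition 4.1. Note that we consider the more general case where $S_1$ is picked according to probability $p = m_1/n$ and $S_2$ is picked according to probability $q = m_2/n$.

Proof of Proposition 4.1: We start with computing the mean of our estimator.

$$\mathbb{E}_{S_1,S_2}\left[ \frac{1}{|S_1|} \sum_{v \in S_1} \chi_v \right] = \mathbb{E}_{S_1,S_2}\left[ \frac{1}{|S_1|} \sum_{v \in S_1} \frac{\mathbb{1}_{N(v) \cap S_2 \neq \emptyset}}{|N(v) \cap S_2|} \sum_{u \in N(v) \cap S_2} \tilde{A}_{v,u} \tilde{x}_u \right]$$

$$= \mathbb{E}_{S_1,S_2}\left[ \frac{1}{|S_1|} \sum_{v,u \in V} \frac{\mathbb{1}_{v \in S_1} \cdot \mathbb{1}_{N(v) \cap S_2 \neq \emptyset} \cdot \mathbb{1}_{u \in S_2 \cap N(v)} \cdot \tilde{A}_{v,u} \tilde{x}_u}{|S_2 \cap N(v)|} \right] = \sum_{(v,u) \in E} \mathbb{E}_{S_1,S_2}\left[ \frac{\mathbb{1}_{v \in S_1}}{|S_1|} \cdot \frac{\mathbb{1}_{u \in S_2}}{|N(v) \cap S_2|} \right] \cdot \tilde{A}_{v,u} \tilde{x}_u,$$

where for an event $Z$ we let $\mathbb{1}_Z$ denote the indicator random variable for $Z$. Note that since each vertex is picked to $S_1$ (respectively $S_2$) independently with probability $p$ (respectively $q$), we can apply Lemma A.1 and conclude that with very high probability

$$|S_1| = pn \pm \sqrt{pn \log n} = \Theta(pn).$$

By the independence of $S_1$ and $S_2$, and an application of Jensen's inequality, we can establish the following bound:

$$\mathbb{E}_{S_1,S_2}\left[ \frac{\mathbb{1}_{v \in S_1}}{|S_1|} \cdot \frac{\mathbb{1}_{u \in S_2}}{|N(v) \cap S_2|} \right] = \mathbb{E}_{S_1}\left[ \frac{\mathbb{1}_{v \in S_1}}{|S_1|} \right] \mathbb{E}_{S_2}\left[ \frac{\mathbb{1}_{u \in S_2}}{|N(v) \cap S_2|} \right] \geq \Omega\left( \frac{1}{n |N(v)|} \right),$$

leading to a mean of $\Omega\left( \frac{1}{n} \sum_{(v,u) \in E} \frac{\tilde{A}_{v,u} \tilde{x}_u}{|N(v)|} \right)$. Next, we compute the second moment of our estimator.

$$\mathbb{E}_{S_1,S_2}\left[ \frac{1}{|S_1|^2} \sum_{v_1,v_2 \in S_1} \chi_{v_1} \chi_{v_2} \right] = \mathbb{E}_{S_1,S_2}\left[ \frac{1}{|S_1|^2} \sum_{(v_1,u_1),(v_2,u_2) \in E} \frac{\mathbb{1}_{v_1,v_2 \in S_1} \cdot \mathbb{1}_{u_1,u_2 \in S_2}}{|N(v_1) \cap S_2| |N(v_2) \cap S_2|} \tilde{A}_{v_1,u_1} \tilde{x}_{u_1} \tilde{A}_{v_2,u_2} \tilde{x}_{u_2} \right]$$

$$= \sum_{(v_1,u_1),(v_2,u_2) \in E} \mathbb{E}_{S_1,S_2}\left[ \frac{\mathbb{1}_{v_1,v_2 \in S_1} \cdot \mathbb{1}_{u_1,u_2 \in S_2}}{|S_1|^2 \cdot \alpha_{v_1} \alpha_{v_2} |N(v_1) \cap S_2| |N(v_2) \cap S_2|} \right] \cdot \tilde{A}_{v_1,u_1} \tilde{x}_{u_1} \tilde{A}_{v_2,u_2} \tilde{x}_{u_2}.$$

Similarly to before, we inspect the above expectation. Here, there are four cases corresponding to the following sets.

1. $C_1 = \{(v_1,u_1),(v_2,u_2) \in E \mid v_1 \neq v_2, u_1 \neq u_2\}$: in a similar way to before, by Lemma A.1, we obtain

$$\mathbb{E}_{S_1,S_2}\left[ \frac{\mathbb{1}_{v_1,v_2 \in S_1} \cdot \mathbb{1}_{u_1,u_2 \in S_2}}{|S_1|^2 \cdot |N(v_1) \cap S_2| |N(v_2) \cap S_2|} \right] = \frac{p^2}{(pn)^2} \mathbb{E}_{S_2}\left[ \frac{\mathbb{1}_{u_1,u_2 \in S_2}}{|N(v_1) \cap S_2| |N(v_2) \cap S_2|} \right] = \frac{p^2 q^2}{(pn)^2} \mathbb{E}_{S_2}\left[ \frac{1}{|N(v_1) \cap S_2| |N(v_2) \cap S_2|} \right].$$

In order to analyze the above expectation, consider the random variable $|S \cap N(v_1)|$ (the case corresponding to $|S \cap N(v_2)|$ is identical) and note that

$$S \cap N(v_1) = \left( S \cap (N(v_1) \setminus N(v_1,v_2)) \right) \cup \left( S \cap N(v_1,v_2) \right),$$

and in particular, the sets $N(v_1) \setminus N(v_1,v_2)$ and $N(v_1,v_2)$ are disjoint, so that $|S \cap N(v_1)| = |S \cap (N(v_1) \setminus N(v_1,v_2))| + |S \cap N(v_1,v_2)|$. We now analyze each of these terms separately. By applying Chernoff bounds (Lemma A.1), we get that with probability at least $1 - o(1/\mathrm{poly}(n))$,

$$|S \cap (N(v_1) \setminus N(v_1,v_2))| = q |N(v_1) \setminus N(v_1,v_2)| \pm \sqrt{q |N(v_1) \setminus N(v_1,v_2)| \log n},$$

$$|S \cap N(v_1,v_2)| = q |N(v_1,v_2)| \pm \sqrt{q |N(v_1,v_2)| \log n},$$

which implies that

$$|S \cap N(v_1)| = q |N(v_1) \setminus N(v_1,v_2)| + q |N(v_1,v_2)| \pm \sqrt{q |N(v_1) \setminus N(v_1,v_2)| \log n} \pm \sqrt{q |N(v_1,v_2)| \log n}$$

$$= q |N(v_1)| \pm \sqrt{q \log n} \left( \sqrt{|N(v_1) \setminus N(v_1,v_2)|} + \sqrt{|N(v_1,v_2)|} \right) \leq q |N(v_1)| \pm 2 \sqrt{q |N(v_1)| \log n} = \Theta(q |N(v_1)|),$$

where the last step follows from the fact that $q > \log n / d_{\min}$, which makes the first term the dominant one. Now, with this at hand, we can union bound over $v_1$ and $v_2$ and get that with probability at least $1 - o(1/\mathrm{poly}(n))$

$$\mathbb{E}_{S}\left[ \frac{1}{|S \cap N(v_1)| |S \cap N(v_2)|} \right] \lesssim \frac{1}{q |N(v_1)| \cdot q |N(v_2)|},$$

so overall

$$\mathbb{E}_{S_1,S_2}\left[ \frac{\mathbb{1}_{v_1,v_2 \in S_1} \cdot \mathbb{1}_{u_1,u_2 \in S_2}}{|S_1|^2 |N(v_1) \cap S_2| |N(v_2) \cap S_2|} \right] \lesssim \frac{1}{n^2 |N(v_1)| \cdot |N(v_2)|}.$$

2. $C_2 = \{(v_1,u_1),(v_2,u_2) \in E \mid v_1 \neq v_2, u_1 = u_2\}$: similarly to the previous case,

$$\mathbb{E}_{S_1,S_2}\left[ \frac{\mathbb{1}_{v_1,v_2 \in S_1} \cdot \mathbb{1}_{u_1,u_2 \in S_2}}{|S_1|^2 \cdot |N(v_1) \cap S_2| |N(v_2) \cap S_2|} \right] \lesssim \frac{1}{q n^2 |N(v_1)| |N(v_2)|}.$$

3. $C_3 = \{(v_1,u_1),(v_2,u_2) \in E \mid v_1 = v_2, u_1 \neq u_2\}$. This case requires extra care, since here the neighborhoods of $v_1$ and $v_2$ are correlated (actually the same).

$$\mathbb{E}_{S_1,S_2}\left[ \frac{\mathbb{1}_{v_1,v_2 \in S_1} \cdot \mathbb{1}_{u_1,u_2 \in S_2}}{|S_1|^2 \cdot |N(v_1) \cap S_2| |N(v_2) \cap S_2|} \right] \lesssim \frac{p q^2}{(pn)^2} \mathbb{E}\left[ \frac{1}{|N(v) \cap S_2|^2} \right].$$

By Lemma A.1, with very high probability

$$|N(v) \cap S_2| \in \left[ q|N(v)| - \sqrt{q |N(v)| \log n},\; q|N(v)| + \sqrt{q |N(v)| \log n} \right],$$

and by our constraint that $q = \Omega(\log n / d_{\min}) = \Omega(\log n / |N(v)|)$, we have that with high probability

$$\mathbb{E}_{S_2}\left[ \frac{1}{|S_2 \cap N(v)|^2} \right] \lesssim \frac{1}{\left( q|N(v)| \pm \sqrt{q |N(v)| \log n} \right)^2} \lesssim O\left( \frac{1}{q^2 |N(v)|^2} \right),$$

so that

$$\frac{p q^2}{(pn)^2} \mathbb{E}\left[ \frac{1}{|N(v) \cap S_2|^2} \right] \lesssim \frac{1}{p n^2 |N(v)|^2}.$$

Under review as a conference paper at ICLR 2020

4. $C_4 = \{(v_1,u_1),(v_2,u_2) \in E \mid v_1 = v_2, u_1 = u_2\}$. Similarly to the previous case,

$$\mathbb{E}_{S_1,S_2}\left[ \frac{\mathbb{1}_{v_1,v_2 \in S_1} \cdot \mathbb{1}_{u_1,u_2 \in S_2}}{|S_1|^2 \cdot |N(v_1) \cap S_2| |N(v_2) \cap S_2|} \right] = \frac{pq}{(pn)^2} \mathbb{E}\left[ \frac{1}{|N(v) \cap S|^2} \right] \lesssim \frac{1}{n^2 p q |N(v)|^2}.$$

Combining the above and subtracting the expectation squared yields

$$\mathrm{Var}_{S_1,S_2}\left[ \frac{1}{|S_1|} \sum_{v \in S_1} \chi_v \right] \lesssim \frac{1}{n^2} \sum_{(v_1,u_1),(v_2,u_2) \in C_2} \frac{\tilde{A}_{v_1,u_1} \tilde{A}_{v_2,u_1} \tilde{x}_{u_1}^2}{q |N(v_1)| |N(v_2)|} + \frac{1}{n^2} \sum_{(v_1,u_1),(v_2,u_2) \in C_3} \frac{\tilde{A}_{v_1,u_1} \tilde{x}_{u_1} \tilde{A}_{v_1,u_2} \tilde{x}_{u_2}}{p |N(v)|^2} + \frac{1}{n^2} \sum_{(v_1,u_1),(v_2,u_2) \in C_4} \frac{\tilde{A}_{v,u}^2 \tilde{x}_{u_1}^2}{p q |N(v)|^2} - \left( \frac{1}{n} \sum_{(v,u) \in E} \frac{\tilde{A}_{v,u} \tilde{x}_u}{|N(v)|} \right)^2.$$

After rearranging we get

$$\frac{1}{n^2} \left( \sum_{(v_1,u_1),(v_2,u_2) \in C_2} \left( \frac{1}{q |N(v_1)| |N(v_2)|} - \frac{1}{|N(v_1)| |N(v_2)|} \right) \tilde{A}_{v_1,u_1} \tilde{A}_{v_2,u_1} \tilde{x}_{u_1}^2 + \sum_{(v_1,u_1),(v_2,u_2) \in C_3} \left( \frac{1}{p |N(v)|^2} - \frac{1}{|N(v)|^2} \right) \tilde{A}_{v_1,u_1} \tilde{x}_{u_1} \tilde{A}_{v_1,u_2} \tilde{x}_{u_2} + \sum_{(v,u) \in E} \left( \frac{1}{p q |N(v)|^2} - \frac{1}{|N(v)|^2} \right) \tilde{A}_{v,u}^2 \tilde{x}_u^2 \right).$$

If we let $p = m_1/n$ and $q = m_2/n$ so that $\mathbb{E}_{S_1}$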



EXPERIMENT SETTING We evaluate our theoretical findings regarding batch size selection using three state-of-the-art algorithms: ClusterGCN Chiang et al. (2019) (graph-wise sampling), FastGCN Chen et al. (2018) (layer-wise sampling) and GraphSAINT Zeng et al. (

Figure 1: Training time vs. validation accuracy with different batch size settings for various datasets with various algorithms. : best validation ACC (BA), ♦: the first reach of 0.95 BA, •: the first reach of 0.99 BA.

Figure 2: Box plot for training time and testing acc v.s. batch size.

Table 1: Dataset statistics.

Table 2: Hyper-parameter settings and our suggested optimal batch size scale.

Table 3: Training performance on Pubmed.

Table 4: Training performance on Reddit. The implementation from GraphSAINT reports a GPU memory error for the batch size settings of 128k and full batch.

Table 5: Training performance on Ogbn-arxiv.

Table 6: Training performance on Ogbn-products.

annex

$\tilde{A}_{v_1,u_1} \tilde{x}_{u_1} \tilde{A}_{v_1,u_2} \tilde{x}_{u_2}$ and note that as $m$ increases, the efficiency converges to $\varphi$. If we define $\delta$ as in Proposition 4.1, we get that $\rho(\xi) \geq \frac{1}{\varphi(1+\delta)}$, as claimed. Now we show that if we assume that the graph is $d$-regular, we can get a clean relation between the efficiency of our estimator and the size of the batch. Proof of Corollary 4.2: Fix any $\delta > 0$. By Equation (12), by assuming that the graph is $d$-regular, and using the definition of $\varphi$ m 2 d

B ADDITIONAL EXPERIMENT RESULTS

To better demonstrate the results shown in Table 3 , Table 4 , Table 5 and Table 6 , we create the box plot for training time and testing accuracy for different batch size settings in Fig. 2 .

