ON BATCH SIZE SELECTION FOR STOCHASTIC TRAINING FOR GRAPH NEURAL NETWORKS

Anonymous

Abstract

Batch size is an important hyper-parameter for training deep learning models with the stochastic gradient descent (SGD) method, and it strongly influences both training time and model performance. We study the batch size selection problem for training graph neural networks (GNNs) with SGD. To reduce training time while maintaining good model performance, we propose a metric that combines the variance of the gradients with the compute time of each mini-batch. We theoretically analyze how the batch size influences this metric and derive a formula that estimates a rough range for the optimal batch size. In GNNs, gradients evaluated on samples within a mini-batch are not independent, which makes it challenging to evaluate the exact variance of the gradients. To address this dependency, we analyze a gradient estimator that accounts for the randomness arising from two consecutive layers of a GNN, and we suggest a guideline for picking the appropriate scale of the batch size. We complement our theoretical results with extensive empirical experiments for ClusterGCN, FastGCN and GraphSAINT on four datasets: Ogbn-products, Ogbn-arxiv, Reddit and Pubmed. We demonstrate that, in contrast to conventional deep learning models, GNNs benefit from large batch sizes.

1. INTRODUCTION

Training large neural networks is often time consuming. In many real-world scenarios training may take hours or even days to converge (Radford et al., 2018; Devlin et al., 2018). As a consequence, identifying strategies that reduce training time while retaining accuracy is an important research objective. The most popular training algorithms for deep learning are Stochastic Gradient Descent (SGD) and its variants such as RMSProp or Adam (Graves, 2013; Kingma & Ba, 2014). These algorithms work iteratively: in each epoch, the data is first partitioned into mini-batches and weight updates are then calculated using only the data in each mini-batch. It has been observed that the size of the mini-batches plays a crucial role in the network's accuracy, generalization capability and convergence time (Keskar et al., 2016; He et al., 2019; McCandlish et al., 2018; Radiuk, 2017). For typical deep learning tasks, practitioners have observed that small batch sizes, e.g., {4, 16, . . . , 512}, lead to better generalization performance and training efficiency (Keskar et al., 2016). For Graph Neural Networks (GNNs), selecting the appropriate batch size remains more of a mystery, and to the best of our knowledge, no published work has focused on batch size selection for GNNs. The small-batch guidelines for conventional NNs do not carry over because in GNNs the mini-batches are used to approximate the graph aggregations or convolutions. The approximation error propagates and leads to a much more substantial variance in the gradients than is observed for conventional NNs. In practice, based on released code, we see that implementations tend to either use the largest batch size that fits into memory (Li et al., 2020) or use a small batch size similar to those for non-graph settings (Chen et al., 2018; Zou et al., 2019). In this work, we explore the choice of batch size for graph neural networks.
By means of a theoretical investigation, we develop guidelines for the choice of batch size that depend on the average degree and number of nodes of the graph. These guidelines lead to intermediate batch sizes, considerably larger than the small NN batch sizes but much smaller than the maximum size dictated by memory limits of a modern GPU. We provide empirical results that demonstrate that the batch sizes derived using our guidelines provide an excellent trade-off between training time and accuracy. Substantially smaller sizes may lead to faster convergence but reduced accuracy; using larger sizes can achieve similar accuracy but training may take much longer to converge.
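The variance/compute-time trade-off underlying these guidelines can be probed empirically. The sketch below is a hypothetical illustration only, not the paper's metric or its GNN analysis: it estimates the variance of mini-batch gradients for a toy one-parameter linear model with i.i.d. samples, and pairs it with an assumed linear cost model for per-batch compute time. For independent samples the variance scales as 1/b, so variance times cost is roughly flat; the point of the paper is that dependence between samples in a GNN mini-batch changes this picture.

```python
import random
import statistics

random.seed(0)

# Toy i.i.d. data for a one-parameter linear model y = w * x (a stand-in for
# a GNN; in a real GNN, per-sample gradients are NOT independent within a
# mini-batch, which is exactly what the paper's analysis addresses).
n = 2000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]
ys = [2.0 * x + random.gauss(0.0, 0.5) for x in xs]
w = 0.0  # current parameter value

def sample_grad(i):
    """Per-sample gradient of the squared loss 0.5*(w*x - y)^2 w.r.t. w."""
    return (w * xs[i] - ys[i]) * xs[i]

def minibatch_grad(batch):
    """Mini-batch gradient: the mean of the per-sample gradients."""
    return sum(sample_grad(i) for i in batch) / len(batch)

def grad_variance(b, trials=500):
    """Empirical variance of the mini-batch gradient at batch size b."""
    grads = [minibatch_grad(random.sample(range(n), b)) for _ in range(trials)]
    return statistics.pvariance(grads)

for b in (4, 16, 64, 256):
    var = grad_variance(b)
    # Hypothetical cost model: per-batch compute time grows linearly in b,
    # so var * b is one plausible variance-vs-time trade-off score.
    print(f"batch={b:4d}  grad_var={var:.4f}  var*time∝{var * b:.4f}")
```

Under i.i.d. sampling the `var*time` column stays roughly constant as `b` grows, i.e., there is no batch-size sweet spot; the dependent gradients in GNN mini-batches are what make the choice of batch size non-trivial.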

2. RELATED WORK

Graph Neural Networks (GNNs) have become increasingly popular for addressing graph-based tasks (Kipf & Welling, 2016; Hamilton et al., 2017; Defferrard et al., 2016; Gilmer et al., 2017; Ying et al., 2018). One major line of research aims to improve the expressiveness of GNNs via 1) advanced aggregation functions (Veličković et al., 2017; Monti et al., 2017; Liu et al., 2019; Qu et al., 2019; Pei et al., 2020); 2) deeper architectures (Li et al., 2019; 2020); and 3) adaptive graph structure (Li et al., 2018; Vashishth et al., 2019; Zhang et al., 2019). However, training a large-scale GNN model remains challenging because of the large memory consumption, long convergence time, and heavy computation (Chiang et al., 2019). A full-batch gradient descent training scheme was commonly used in earlier GNN research. While this is suitable for relatively small graphs, it requires storing all intermediate embeddings, which is not scalable to large graphs, and convergence can be slow since the parameters are updated only once per epoch. Chen et al. (2017) and Cong et al. (2020) proposed variance-reduction stochastic training frameworks that maintain a cache for the intermediate embeddings of all nodes. This can improve convergence but results in large memory requirements, stretching the capabilities of GPUs when training over large graphs. Due to this drawback, we do not consider such approaches in this paper, but they are an intriguing direction for future work. Most existing graph neural network papers do not clearly state how they set the batch size. Experimentally, we observe that batch size is a critical hyper-parameter and can significantly influence training time and test accuracy. The importance of the batch size has been recognized for non-graph deep learning models. Keskar et al. (2016), He et al. (2019) and Masters & Luschi (2018) have shown that smaller batch sizes, in the range {4, 16, . . . , 512}, can achieve better generalization performance; the randomness of small batches proves beneficial.
McCandlish et al. (2018) suggested that the batch size should be selected to balance the "noise" and "signal" of the gradient. Radiuk (2017) showed that larger batch sizes, of the order of 1024, can be beneficial when training convolutional neural network models. Gower et al. (2019), Alfarra et al. (2020), and Smith (2018) introduced adaptive batch size approaches to further improve the convergence rate and generalization performance. To the best of our knowledge, no existing work has directly addressed the selection of the batch size for stochastic training of graph neural networks; the objective of this paper is to fill that gap and provide guidelines for the GNN setting.

3. PRELIMINARIES

We represent a graph G = (V, E) with a set of nodes V = {v_1, . . . , v_n} and a set of edges E = {e_1, . . . , e_M} by an adjacency matrix A ∈ R^{n×n}. For a node v ∈ V, we let N(v) be the set of neighbors of v. In addition, we associate each node v with a feature vector x_v ∈ R^{1×F}, and let X ∈ R^{n×F} be the corresponding feature matrix. Let D be the degree matrix of the graph G, where D_{i,i} = Σ_j A_{i,j} and D_{i,j} = 0 if i ≠ j. To ease the presentation, we write R ≲ T to denote that there exists an absolute constant c such that R ≤ c · T.
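As a concrete illustration of this notation, the following minimal sketch builds the adjacency matrix A, the degree matrix D, and the neighbor sets N(v) for a small toy undirected graph (the edge list is an arbitrary example, not from the paper's datasets):

```python
# Toy undirected graph with n = 4 nodes and hypothetical example edges.
n = 4
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

# Adjacency matrix A (n x n); symmetric since the graph is undirected.
A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = 1
    A[j][i] = 1

# Degree matrix D: D[i][i] = sum_j A[i][j], and zero off the diagonal.
D = [[sum(A[i]) if i == j else 0 for j in range(n)] for i in range(n)]

# Neighbor set N(v) = {u : A[v][u] = 1}.
def neighbors(v):
    return {u for u in range(n) if A[v][u]}

print(D[2][2])       # node 2 has degree 3
print(neighbors(0))  # {1, 2}
```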



Hamilton et al. (2017) and Ying et al. (2018) proposed training GNNs with mini-batch stochastic gradient descent (SGD) methods. Mini-batch SGD training suffers from neighborhood expansion, which leads to a time complexity that grows exponentially with the GNN depth. To reduce this exponential growth of the receptive field, Chen et al. (2018), Huang et al. (2018), and Zou et al. (2019) proposed layer-wise sampling, where a fixed number of nodes is sampled in each layer, and incorporated importance sampling techniques to reduce variance. Unfortunately, the overhead of the iterative neighborhood sampling strategy is still significant and becomes worse as GNNs become progressively deeper. Chiang et al. (2019) and Zeng et al. (2020) proposed graph-wise sampling to further improve sampling efficiency; this can be viewed as a special case of layer-wise sampling where the same set of nodes is sampled across all layers.
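To make the neighborhood-expansion issue concrete, the toy sketch below contrasts the exact multi-hop receptive field with a layer-wise sampler that caps each layer at a fixed number of nodes. It is illustrative only: `adj` is a synthetic circulant graph, and uniform sampling stands in for the importance sampling used by the cited methods.

```python
import random

random.seed(0)

# Synthetic graph on 10 nodes: each node v links to (v+1, v+2, v+3) mod 10.
adj = {v: [(v + k) % 10 for k in (1, 2, 3)] for v in range(10)}

def full_receptive_field(batch, depth):
    """Exact set of nodes an L-layer GNN needs for this batch; it grows
    with every hop (the neighborhood-expansion problem)."""
    frontier = set(batch)
    for _ in range(depth):
        frontier |= {u for v in frontier for u in adj[v]}
    return frontier

def layerwise_sample(batch, depth, nodes_per_layer):
    """Layer-wise sampling: cap each layer at a fixed number of nodes
    (uniform here; the cited methods use importance sampling instead)."""
    layers = [set(batch)]
    for _ in range(depth):
        candidates = list({u for v in layers[-1] for u in adj[v]})
        k = min(nodes_per_layer, len(candidates))
        layers.append(set(random.sample(candidates, k)))
    return layers

batch = [0, 5]
print(len(full_receptive_field(batch, depth=3)))        # 10: all nodes reached
print([len(L) for L in layerwise_sample(batch, 3, 4)])  # each layer capped at 4
```

Graph-wise sampling, as described above, corresponds to reusing one sampled node set for every layer instead of drawing a fresh sample per layer.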

