ON BATCH SIZE SELECTION FOR STOCHASTIC TRAINING FOR GRAPH NEURAL NETWORKS

Anonymous

ABSTRACT

Batch size is an important hyper-parameter for training deep learning models with the stochastic gradient descent (SGD) method, and it has a great influence on both training time and model performance. We study the batch size selection problem for training graph neural networks (GNNs) with SGD. To reduce training time while maintaining decent model performance, we propose a metric that combines both the variance of gradients and the compute time of each mini-batch. We theoretically analyze how batch size influences this metric and derive a formula to estimate a rough range for the optimal batch size. In GNNs, gradients evaluated on samples within a mini-batch are not independent, and it is challenging to evaluate the exact variance of the gradients. To address this dependency, we analyze an estimator for gradients that accounts for the randomness arising from two consecutive layers in a GNN, and suggest a guideline for picking the appropriate scale of the batch size. We complement our theoretical results with extensive empirical experiments for ClusterGCN, FastGCN and GraphSAINT on four datasets: Ogbn-products, Ogbn-arxiv, Reddit and Pubmed. We demonstrate that, in contrast to conventional deep learning models, GNNs benefit from large batch sizes.

1. INTRODUCTION

Training large neural networks is often time-consuming. In many real-world scenarios, training might take hours or even days to converge Radford et al. (2018); Devlin et al. (2018). As a consequence, identifying strategies that reduce training time while retaining accuracy is an important research objective. The most popular training algorithms for deep learning are Stochastic Gradient Descent (SGD) and its variants such as RMSProp or Adam Graves (2013); Kingma & Ba (2014). These algorithms work in an iterative manner: in each epoch, the data is first partitioned into mini-batches, and weight updates are then calculated using only the data in each mini-batch. It has been observed that the size of the mini-batches plays a crucial role in the network's accuracy, generalization capability, and convergence time (Keskar et al. (2016); He et al. (2019); McCandlish et al. (2018); Radiuk (2017)).

For typical deep learning tasks, practitioners have observed that small batch sizes, e.g., {4, 16, . . . , 512}, lead to better generalization performance and training efficiency Keskar et al. (2016). For Graph Neural Networks (GNNs), selecting the appropriate batch size remains more of a mystery, and to the best of our knowledge, no published work focuses on batch size selection for GNNs. The small-batch-size guidelines for conventional NNs do not carry over because in GNNs the batches are used to approximate the graph aggregations or convolutions. The approximation error propagates across layers and leads to a much more substantial variance in the gradients than is observed for conventional NNs. In practice, based on released code, implementations tend either to use the largest batch size that fits into memory Li et al. (2020) or to use a small batch size similar to those used in non-graph settings Chen et al. (2018); Zou et al. (2019). In this work, we explore the choice of batch size for graph neural networks.

By means of a theoretical investigation, we develop guidelines for the choice of batch size that depend on the average degree and number of nodes of the graph. These guidelines lead to intermediate batch sizes: considerably larger than the small batch sizes typical for NNs, but much smaller than the maximum size dictated by the memory limits of a modern GPU. We provide empirical results demonstrating that batch sizes derived using our guidelines offer an excellent trade-off between training time and accuracy.
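The trade-off that motivates our metric can be made concrete on a toy problem. The sketch below (an illustration under simplifying assumptions, not the estimator analyzed in this paper) measures, for several batch sizes, the average squared deviation of mini-batch gradients from the full gradient and the wall-clock time per mini-batch, on an i.i.d. least-squares problem; the combined score shown here (variance times time per batch) is one hypothetical way to fold both quantities into a single number. Note that in a GNN the per-node gradients additionally depend on sampled neighbors, which is precisely the dependency our analysis addresses and which this i.i.d. sketch ignores.

```python
import time
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem standing in for a training set. (In a GNN, the
# gradient at each node also depends on its sampled neighbors, so mini-batch
# gradients are not independent; this sketch deliberately ignores that.)
n, d = 2048, 16
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)
w = np.zeros(d)  # evaluate gradients at a fixed point

def batch_gradient(w, idx):
    """Mean squared-error gradient over the mini-batch indexed by idx."""
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)

full_grad = batch_gradient(w, np.arange(n))

def variance_and_time(batch_size):
    """Average squared deviation from the full gradient and wall-clock time
    per mini-batch, over one random epoch partition."""
    perm = rng.permutation(n)
    devs, times = [], []
    for start in range(0, n, batch_size):
        idx = perm[start:start + batch_size]
        t0 = time.perf_counter()
        g = batch_gradient(w, idx)
        times.append(time.perf_counter() - t0)
        devs.append(np.sum((g - full_grad) ** 2))
    return np.mean(devs), np.mean(times)

for bs in (32, 256, 2048):
    var, t = variance_and_time(bs)
    # One hypothetical combined score: gradient variance times compute time.
    print(f"batch={bs:5d}  variance={var:10.4f}  "
          f"time/batch={t:.2e}  score={var * t:.2e}")
```

As expected, the gradient variance shrinks as the batch size grows (vanishing at the full-batch extreme) while the compute time per batch grows, which is why an intermediate batch size can minimize a combined score of this form.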

