UNDERSTANDING THE EFFECTS OF DATA PARALLELISM AND SPARSITY ON NEURAL NETWORK TRAINING

Abstract

We study two factors in neural network training: data parallelism and sparsity. Here, data parallelism means processing training data in parallel using distributed systems (or, equivalently, increasing the batch size) so that training can be accelerated, and sparsity refers to pruning parameters in a neural network model so as to reduce computational and memory cost. Despite their promising benefits, their effects on neural network training remain poorly understood. In this work, we first measure these effects rigorously by conducting extensive experiments while tuning all metaparameters involved in the optimization. As a result, we find, across various workloads of data set, network model, and optimization algorithm, a general scaling trend between batch size and the number of training steps to convergence as the effect of data parallelism and, further, a difficulty of training under sparsity. Then, we develop a theoretical analysis based on the convergence properties of stochastic gradient methods and the smoothness of the optimization landscape, which illustrates the observed phenomena precisely and generally, establishing a better account of the effects of data parallelism and sparsity on neural network training.

1. INTRODUCTION

Data parallelism is a straightforward and common approach to accelerating neural network training by processing training data in parallel using distributed systems. Being model-agnostic, it is applicable to training any neural network, and in synchronized settings the degree of parallelism equates to the mini-batch size, in contrast to other forms of parallelism such as task or model parallelism. While its utility has attracted much attention in recent years, distributing and updating large network models at distributed communication rounds remains a bottleneck (Dean et al., 2012; Hoffer et al., 2017; Goyal et al., 2017; Smith et al., 2018; Shallue et al., 2019; Lin et al., 2020).

Meanwhile, diverse approaches to compressing such large network models have been developed, and network pruning, the sparsification process that zeros out many parameters of a network to reduce the computation and memory associated with these zero values, has been widely employed (Reed, 1993; Han et al., 2015). In fact, recent studies discovered that pruning can be done at initialization prior to training (Lee et al., 2019; Wang et al., 2020). Separating the training process from pruning entirely not only saves tremendous time and effort in finding trainable sparse networks, but also facilitates the analysis of pruned sparse networks in isolation. Nevertheless, there has been little study of the subsequent training of these sparse networks, and various aspects of the optimization of sparse networks remain largely unknown.

In this work, we focus on studying data parallelism and sparsity¹, and provide clear explanations for their effects on neural network training. Despite a surge of recent interest in their complementary benefits in modern deep learning, there is a lack of fundamental understanding of their effects. For example, Shallue et al. (2019) provide comprehensive yet empirical evaluations of the effect of data parallelism, while Zhang et al. (2019) use a simple noisy quadratic model to describe the effect; for sparsity, Lee et al. (2020) approach the difficulty of training under sparsity solely from the perspective of initialization.

In this regard, we first accurately measure their effects by performing an extensive metaparameter search independently for each and every study case of batch size and sparsity level. As a result, we find a general scaling trend as the effect of data parallelism in training sparse neural networks, across varying sparsity levels and workloads of data set, model, and optimization algorithm. Also, the critical batch size turns out to be no smaller for sparse networks, despite the general difficulty of training sparse networks. We formalize our observation and theoretically prove the effect of data parallelism based on the convergence properties of generalized stochastic gradient methods, irrespective of sparsity levels. We take this result further to understand the effect of sparsity based on Lipschitz smoothness analysis, and find that pruning results in a sparse network whose gradient changes relatively too quickly. Notably, this result is developed under standard assumptions used in the optimization literature and applies generally to training using any stochastic gradient method with a nonconvex objective and a learning rate schedule. Being precise and general, our results could help in understanding the effects of data parallelism and sparsity on neural network training.

2. SETUP

We closely follow the experiment settings used in Shallue et al. (2019). We describe more details, including the scale of our experiments, in Appendix B, and provide additional results in Appendix D. The code can be found here: https://github.com/namhoonlee/effect-dps-public

Experiment protocol. For a given workload (data set, network model, optimization algorithm) and study (batch size, sparsity level) setting, we measure the number of training steps required to reach a predefined goal error. We repeat this process for a budget of runs while searching for the best metaparameters involved in the optimization (e.g., learning rate, momentum), so as to record the lowest number of steps, namely steps-to-result, as our primary quantity of interest. To this end, we regularly evaluate intermediate models on the entire validation set for each training run.

Workload and study. We consider the workloads as combinations of the following: (data set) MNIST, Fashion-MNIST, CIFAR-10; (network model) Simple-CNN, ResNet-8; (optimization algorithm) SGD, Momentum, Nesterov, with either a fixed or decaying learning rate schedule. For the study setting, we consider batch sizes from 2 up to 16384 and sparsity levels from 0% to 90%.

Metaparameter search. We perform a quasi-random search to tune metaparameters efficiently. More precisely, we first generate Sobol low-discrepancy sequences in a unit hypercube and convert them into the metaparameters of interest, taking into account a predefined search space for each metaparameter. The sequence generated for each metaparameter has a length equal to the budget of trials, and the search space is designed based on preliminary experimental results.

Pruning. Sparse networks can be obtained in many different ways; for the purpose of this work, however, they must not undergo any training beforehand, so that the effects of data parallelism are measured while training from scratch. Recent pruning-at-initialization approaches satisfy this requirement, and we adopt the connection sensitivity criterion in Lee et al. (2019) to obtain sparse networks.
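The experiment protocol and quasi-random metaparameter search described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a Halton sequence as a simpler low-discrepancy stand-in for the Sobol sequences named in the text, the search ranges and the log-uniform learning-rate mapping are assumed for illustration, and `toy_train` is a hypothetical placeholder (heavy-ball momentum on a one-dimensional quadratic) for an actual training run. All function names are hypothetical.

```python
import math


def van_der_corput(index, base):
    """Radical inverse of `index` in `base`: one coordinate of a Halton point."""
    value, denom = 0.0, 1.0
    while index > 0:
        index, digit = divmod(index, base)
        denom *= base
        value += digit / denom
    return value


def halton_points(num_points, bases=(2, 3)):
    """Low-discrepancy points in the unit hypercube (stand-in for Sobol)."""
    return [tuple(van_der_corput(i, b) for b in bases)
            for i in range(1, num_points + 1)]


def to_metaparameters(point, lr_range=(1e-4, 1.0), momentum_range=(0.0, 0.99)):
    """Map a unit-square point to (learning rate, momentum).

    The ranges and the log-uniform learning-rate scale are assumptions.
    """
    u, v = point
    log_lo, log_hi = math.log(lr_range[0]), math.log(lr_range[1])
    lr = math.exp(log_lo + u * (log_hi - log_lo))
    momentum = momentum_range[0] + v * (momentum_range[1] - momentum_range[0])
    return lr, momentum


def toy_train(lr, momentum, goal_error, max_steps):
    """Hypothetical stand-in for one training run.

    Minimizes 0.5 * x**2 with heavy-ball momentum and returns the first step
    at which the "error" |x| reaches goal_error, or None if it never does.
    """
    x, velocity = 1.0, 0.0
    for step in range(1, max_steps + 1):
        velocity = momentum * velocity - lr * x  # gradient of 0.5*x^2 is x
        x += velocity
        if abs(x) <= goal_error:
            return step
    return None


def steps_to_result(train_fn, budget, goal_error, max_steps):
    """Lowest number of steps to reach goal_error over `budget` trials."""
    best = None
    for point in halton_points(budget):
        lr, momentum = to_metaparameters(point)
        steps = train_fn(lr, momentum, goal_error, max_steps)
        if steps is not None and (best is None or steps < best):
            best = steps
    return best


if __name__ == "__main__":
    print(steps_to_result(toy_train, budget=16, goal_error=0.01, max_steps=1000))
```

In an actual study, this search would be repeated independently for every (batch size, sparsity level) pair, with `toy_train` replaced by a run that evaluates intermediate models on the validation set at regular intervals.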

3.1. MEASURING THE EFFECT OF DATA PARALLELISM

First of all, we observe at every sparsity level and across different workloads a general scaling trend in the relationship between batch size and steps-to-result as the effect of data parallelism (see the 1st and 2nd columns in Figure 1): initially, we observe a period of linear scaling where doubling the batch size halves the number of steps needed to achieve the goal error (i.e., it aligns closely with the dashed line), followed by a region of diminishing returns where the reduction in the required number of steps from increasing the batch size is less than inversely proportional (i.e., it starts to deviate from the linear scaling region), which eventually arrives at a maximal data parallelism (i.e., it hits a



¹ For the purpose of this work, we equate data parallelism and sparsity to increasing batch size and pruning model parameters, respectively; we explain these in more detail in Appendix E.

