LOCAL SGD MEETS ASYNCHRONY

Abstract

Distributed variants of stochastic gradient descent (SGD) are central to training deep neural networks on massive datasets. Several scalable versions of data-parallel SGD have been developed, leveraging asynchrony, communication compression, and local gradient steps. Current research seeks a balance between distributed scalability (minimizing the amount of synchronization needed) and generalization performance (achieving the same or better accuracy relative to the sequential baseline). However, a key issue in this regime remains largely unaddressed: if "local" data-parallelism is applied aggressively to better utilize the computing resources available at each worker, the generalization performance of the trained model degrades. In this paper, we present a method to improve the "local scalability" of decentralized SGD. In particular, we propose two key techniques: (a) shared-memory-based asynchronous gradient updates at decentralized workers that keep the local minibatch size small, and (b) asynchronous non-blocking in-place averaging that overlaps with the local updates, thus utilizing all compute resources at all times without the need for large minibatches. Empirically, the additional noise introduced by this procedure proves to be a boon for generalization. On the theoretical side, we show that this method guarantees ergodic convergence for non-convex objectives, and achieves the classic sublinear rate under standard assumptions. On the practical side, we show that it improves upon the performance of local SGD and related schemes, without compromising accuracy.

1. INTRODUCTION

In this paper, we consider the classic problem of minimizing an empirical risk, defined simply as

min_{x ∈ R^d} Σ_{i ∈ [I]} f_i(x),    (1)

where d is the dimension, x ∈ R^d denotes the set of model parameters, [I] is the training set, and f_i(x) : R^d → R is the loss on the training sample i ∈ [I]. Stochastic gradient descent (SGD) (Robbins & Monro, 1951) is an extremely popular iterative approach to solving this problem:

x_{k+1} = x_k - α_k ∇f_{B_k}(x_k),    (2)

where ∇f_{B_k}(x_k) = (1/|B_k|) Σ_{i ∈ B_k} ∇f_i(x_k) is the average of the gradients computed over the samples of a minibatch B_k ⊆ [I], typically selected uniformly at random, and α_k is the learning rate at iteration k.

1.1 BACKGROUND ON DECENTRALIZED DATA-PARALLEL SGD

For better or worse, SGD and its variants currently represent the computational backbone for many large-scale optimization tasks, most notably the training of deep neural networks (DNNs). Arguably the most popular SGD variant is minibatch SGD (MB-SGD) (Bottou, 2012). In a distributed setting with decentralized workers q ∈ [Q], it follows the iteration

x_{k+1} = x_k - α_k (1/Q) Σ_{q=1}^{Q} ∇f_{B^q_k}(x_k),    (3)

where B^q_k ⊆ [I] is a local minibatch selected by worker q ∈ [Q] at iteration k. This strategy is straightforward to scale in a data-parallel way, as each worker can process a subset of the samples in parallel, and the model is then updated by the average of the workers' gradient computations. For convenience, we assume the same batch size per worker. This approach has achieved tremendous popularity recently, and there has been significant interest in running training with increasingly large batch sizes aggregated over a large number of GPUs, e.g., Goyal et al. (2017).

An alternative approach is parallel, or local, SGD (L-SGD) (Zinkevich et al., 2010; Zhang et al., 2016c; Lin et al., 2020):

x^q_{j,t+1} = x^q_{j,t} - α_{j,t} ∇f_{B^q_{j,t}}(x^q_{j,t}), 0 ≤ t < K_j;    x^q_{j+1,0} = (1/Q) Σ_q x^q_{j,K_j},    (4)

where x^q_{j,t} denotes the local model at worker q ∈ [Q] after j synchronization rounds followed by t local gradient updates, and B^q_{j,t} is the local minibatch sampled at the same iteration.
K_j denotes the number of local gradient update steps before the j-th synchronization. Essentially, workers run SGD without any communication for several local steps, after which they globally average the resulting local models. This method is intuitively easy to scale, since it reduces the frequency of communication. Recently, a variant called post-local SGD (PL-SGD) (Lin et al., 2020) was introduced to address the loss in generalization performance of L-SGD: the averaging frequency is kept high during the initial phase of training and is reduced later, once optimization stabilizes. Although very popular in practice, these two approaches suffer from the same limitation: their generalization accuracy decreases for larger local batch sizes, which would be necessary to fully utilize the computational power offered by the workers. We illustrate this in Table 1, which reports the results of training RESNET-20 (He et al., 2016) on CIFAR-10 (Krizhevsky, 2009) with MB-SGD and PL-SGD on a workstation packing two Nvidia GeForce RTX 2080 Ti GPUs (a current standard). We observe that as the local batch size B_loc grows, the throughput improves significantly; however, the optimization results degrade sharply, more glaringly so with PL-SGD. Clearly, these methods cannot tolerate a larger B_loc, even though the GPUs can support it. This shortcoming of the existing methods in harnessing the growing data-parallelism has also been identified in empirical studies in the literature (Golmant et al., 2018; Shallue et al., 2019). To our knowledge, no effective remedy yet exists to address this challenge. Notice that our core target here is maximally harnessing the local data-parallelism, and therefore a larger local batch size, as opposed to the existing trend in the literature wherein a large number of GPUs are deployed to obtain a large aggregated global batch size with a relatively small B_loc.
For example, refer to the performance of MB-SGD and PL-SGD as listed in Table 1 of Lin et al. (2020). Notice that with 16 GPUs, each with B_loc = 128 (thus a total minibatch size of 2048, identical to the setting above with 2 GPUs each with B_loc = 1024), and with exactly the same LR scaling and warmup strategy, neither MB-SGD nor PL-SGD faces generalization degradation. Unfortunately, however, such an implementation setting incurs excessive wastage of the available data-parallel compute resources on each of the GPUs. Indeed, existing techniques such as LARS (You et al., 2017), designed to address the issue of poor generalization for large global batch training, are insufficient for larger local minibatch sizes; we describe this empirically in Section 3 (Table 11).
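To make the baselines concrete, the following is a minimal, sequentially simulated sketch of L-SGD (Equation 4) on a toy least-squares objective: each of Q workers takes K local minibatch-SGD steps from the current global model, and the workers then average their models. All problem sizes and hyperparameter values here are illustrative, not those of the paper's experiments.

```python
import numpy as np

def local_sgd(X, y, Q=2, K=16, rounds=50, b_loc=4, lr=0.05, seed=0):
    """Illustrative L-SGD on least squares: Q workers each take K local
    minibatch-SGD steps, then synchronize by averaging their models."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    x_global = np.zeros(d)
    for j in range(rounds):
        local_models = []
        for q in range(Q):                      # in practice, runs in parallel
            x_q = x_global.copy()
            for t in range(K):                  # K local gradient updates
                idx = rng.choice(len(X), size=b_loc, replace=False)
                grad = X[idx].T @ (X[idx] @ x_q - y[idx]) / b_loc
                x_q -= lr * grad
            local_models.append(x_q)
        x_global = np.mean(local_models, axis=0)  # synchronization round j
    return x_global

# Toy usage: recover a planted linear model.
rng = np.random.default_rng(1)
x_true = rng.normal(size=5)
X = rng.normal(size=(256, 5))
y = X @ x_true
x_hat = local_sgd(X, y)
```

Setting K = 1 in this sketch recovers MB-SGD (Equation 3); PL-SGD corresponds to switching from a small to a large K partway through training.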

1.2. LOCALLY-ASYNCHRONOUS PARALLEL SGD

Now, consider the following implementation scheme:

1. In a decentralized L-SGD setting, i.e., wherein each worker q ∈ [Q] has a local model x^q undergoing local SGD updates as described earlier, multiple concurrent local processes u ∈ U_q share the model x^q. The processes u ∈ U_q perform asynchronous concurrent gradient updates locally.

2. The workers average their models whenever any one of them has performed at least K_j local shared updates, where K_j is as in Equation 4. The averaging is performed asynchronously and in a non-blocking way by (averaging) processes a^q on behalf of each worker q ∈ [Q].

Essentially, the decentralized workers run shared-memory-based asynchronous SGD locally and periodically synchronize in a totally non-blocking fashion. More formally, consider Algorithm 1. The model x^q on a GPU q ∈ [Q] is shared locally by the processes p ∈ P_q = {a^q} ∪ U_q. The processes p ∈ P_q also maintain a shared counter S^q, initialized to 0. The operation read-and-inc implements an atomic (with lock) read and increment of S^q, whereas read provides an atomic read. S^q essentially enables ordering the shared gradient updates. In turn, this order streamlines the synchronization among workers, thereby determining the averaging rounds j. The (updater) processes u ∈ U_q asynchronously and lock-freely update x^q with
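The shared-counter machinery for a single worker can be sketched as follows. This is an illustrative threading sketch of the pattern described above, not the paper's Algorithm 1: read-and-inc is lock-protected, while the in-place model update itself is deliberately left unsynchronized, mimicking lock-free updates whose benign races add gradient noise. The quadratic objective and all constants are hypothetical.

```python
import threading
import numpy as np

class SharedCounter:
    """Shared counter S^q: read-and-inc is an atomic (lock-protected)
    read and increment; read is an atomic read."""
    def __init__(self):
        self._val = 0
        self._lock = threading.Lock()

    def read_and_inc(self):
        with self._lock:
            v = self._val
            self._val += 1
            return v

    def read(self):
        with self._lock:
            return self._val

def updater(x, counter, steps, lr, grad_fn):
    """Updater process u: asynchronous in-place gradient updates on the
    shared model x. The '-=' is not protected by any lock."""
    for _ in range(steps):
        counter.read_and_inc()       # order the shared updates via S^q
        x -= lr * grad_fn(x)         # lock-free in-place update

# Hypothetical usage on f(x) = ||x||^2 / 2, with 4 updater threads.
x_shared = np.ones(8)
S = SharedCounter()
grad = lambda x: x + 0.01 * np.random.randn(8)   # noisy gradient
threads = [threading.Thread(target=updater,
                            args=(x_shared, S, 100, 0.05, grad))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In the full scheme, a separate averaging process a^q would poll S^q via read and trigger a non-blocking model average once the counter crosses the K_j threshold.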




Table 1: RESNET-20/CIFAR-10 training for 300 epochs on 2 GPUs. Throughout training, the local batch size is kept constant across workers, i.e., |B^q_k| = B_loc, ∀q ∈ [Q], k ≥ 0 for MB-SGD, and |B^q_{j,t}| = B_loc, ∀q ∈ [Q], j, t ≥ 0 for PL-SGD. The LR is warmed up over the first 5 epochs to scale from α_0 to α_0 × (B_loc × Q)/B̂_loc, where B̂_loc = 128, Q is the number of workers (2 here), and α_0 = 0.1. In PL-SGD, we average the models after each gradient update for the first 150 epochs; thereafter the averaging frequency K is set to 16, as in Lin et al. (2020); other HPs are identical to theirs. The listed results are averages of 3 runs with different seeds.
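The linear warmup rule used in this experimental setup can be sketched as follows; the function name and the linear interpolation schedule are illustrative assumptions, with the reference batch size 128, Q = 2, and α_0 = 0.1 taken from the caption above.

```python
def warmup_lr(epoch, alpha0=0.1, b_loc=1024, q=2, base_bs=128, warmup_epochs=5):
    """Linearly scale the LR from alpha0 at epoch 0 up to
    alpha0 * (B_loc * Q) / base_bs by the end of warmup."""
    target = alpha0 * (b_loc * q) / base_bs
    if epoch >= warmup_epochs:
        return target            # warmup finished: hold the scaled LR
    # linear interpolation between alpha0 and the scaled target
    return alpha0 + (target - alpha0) * epoch / warmup_epochs
```

For example, with B_loc = 1024 and Q = 2 the target LR is 0.1 × 2048/128 = 1.6, reached at epoch 5.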

