LOCAL SGD MEETS ASYNCHRONY

Abstract

Distributed variants of stochastic gradient descent (SGD) are central to training deep neural networks on massive datasets. Several scalable versions of data-parallel SGD have been developed, leveraging asynchrony, communication compression, and local gradient steps. Current research seeks a balance between distributed scalability (minimizing the amount of synchronization needed) and generalization performance (achieving the same or better accuracy relative to the sequential baseline). However, a key issue in this regime is largely unaddressed: if "local" data-parallelism is aggressively applied to better utilize the computing resources available to workers, the generalization performance of the trained model degrades. In this paper, we present a method to improve the "local scalability" of decentralized SGD. In particular, we propose two key techniques: (a) shared-memory-based asynchronous gradient updates at decentralized workers, keeping the local minibatch size small, and (b) asynchronous, non-blocking, in-place averaging that overlaps with the local updates, thus essentially utilizing all compute resources at all times without the need for large minibatches. Empirically, the additional noise introduced by the procedure proves to be a boon for better generalization. On the theoretical side, we show that this method guarantees ergodic convergence for non-convex objectives, and achieves the classic sublinear rate under standard assumptions. On the practical side, we show that it improves upon the performance of local SGD and related schemes, without compromising accuracy.

1. INTRODUCTION

In this paper, we consider the classic problem of minimizing an empirical risk, defined simply as

\min_{x \in \mathbb{R}^d} \sum_{i \in [I]} f_i(x), \qquad (1)

where d is the dimension, x \in \mathbb{R}^d denotes the set of model parameters, [I] is the training set, and f_i(x) : \mathbb{R}^d \to \mathbb{R} is the loss on the training sample i \in [I]. Stochastic gradient descent (SGD) (Robbins & Monro, 1951) is an extremely popular iterative approach to solving this problem:

x_{k+1} = x_k - \alpha_k \nabla f_{B_k}(x_k), \qquad (2)

where \nabla f_{B_k}(x_k) = \frac{1}{|B_k|} \sum_{i \in B_k} \nabla f_i(x_k) is the average of the gradients computed over the samples of a minibatch B_k \subseteq [I], typically selected uniformly at random, and \alpha_k is the learning rate at iteration k.

1.1 BACKGROUND ON DECENTRALIZED DATA-PARALLEL SGD

For better or worse, SGD and its variants currently represent the computational backbone for many large-scale optimization tasks, most notably the training of deep neural networks (DNNs). Arguably the most popular SGD variant is minibatch SGD (MB-SGD) (Bottou, 2012). In a distributed setting with decentralized workers q \in [Q], it follows the iteration

x_{k+1} = x_k - \alpha_k \frac{1}{Q} \sum_{q=1}^{Q} \nabla f_{B_k^q}(x_k),

where B_k^q \subseteq [I] is a local minibatch selected by worker q \in [Q] at iteration k. This strategy is straightforward to scale in a data-parallel way: each worker processes a subset of the samples in parallel, and the model is then updated with the average of the workers' gradient computations. For convenience, we assume the same batch size per worker. This approach has achieved tremendous popularity recently, and there has been significant interest in training with increasingly large batch sizes aggregated over a large number of GPUs, e.g. Goyal et al. (2017).

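As a concrete illustration, the MB-SGD iteration above can be simulated in a few lines of NumPy. The zero-residual least-squares objective, worker count, and step count below are hypothetical stand-ins chosen only so the sketch runs quickly; each "worker" here is simulated sequentially rather than on separate hardware.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical objective for illustration: f_i(x) = 0.5 * (a_i^T x - b_i)^2,
# constructed so that the optimum x_true is known exactly.
I, d, Q, batch, lr = 1024, 10, 4, 32, 0.1
A = rng.normal(size=(I, d))
x_true = rng.normal(size=d)
b = A @ x_true

def minibatch_grad(x, idx):
    """Average gradient over the samples in minibatch `idx`."""
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ x - bi) / len(idx)

x = np.zeros(d)
for k in range(200):
    # Each of the Q workers draws its own local minibatch B_k^q ...
    grads = [minibatch_grad(x, rng.choice(I, size=batch, replace=False))
             for q in range(Q)]
    # ... and the model moves by the average of the workers' gradients.
    x = x - lr * np.mean(grads, axis=0)

print(np.linalg.norm(x - x_true))  # small: the iterates approach x_true
```

Averaging the Q gradients before the update makes the scheme equivalent to a single SGD step with an effective batch of Q times the per-worker batch size, which is exactly the large-batch regime the abstract cautions about.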
An alternative approach is parallel, or local, SGD (L-SGD) (Zinkevich et al., 2010; Zhang et al., 2016c; Lin et al., 2020):

x_{j,t+1}^q = x_{j,t}^q - \alpha_{j,t} \nabla f_{B_{j,t}^q}(x_{j,t}^q), \quad 0 \le t < K_j; \qquad x_{j+1,0}^q = \frac{1}{Q} \sum_{q=1}^{Q} x_{j,K_j}^q,

where, within round j, each worker q takes K_j local steps on its own minibatches before the local models are averaged.

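The L-SGD recurrence above can be sketched under the same hypothetical least-squares setup as before; the round count, number of local steps K, and learning rate are illustrative choices, and the Q workers are again simulated sequentially.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical zero-residual least-squares problem (illustration only).
I, d, Q, K, batch, lr = 1024, 10, 4, 8, 32, 0.1
A = rng.normal(size=(I, d))
x_true = rng.normal(size=d)
b = A @ x_true

def minibatch_grad(x, idx):
    """Average gradient over the samples in minibatch `idx`."""
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ x - bi) / len(idx)

x = np.zeros(d)          # x_{j,0}: identical on every worker at a round's start
for j in range(30):      # communication rounds
    local = []
    for q in range(Q):
        xq = x.copy()
        for t in range(K):   # K local steps with no communication at all
            idx = rng.choice(I, size=batch, replace=False)
            xq -= lr * minibatch_grad(xq, idx)
        local.append(xq)
    x = np.mean(local, axis=0)   # x_{j+1,0} = (1/Q) sum_q x^q_{j,K}

print(np.linalg.norm(x - x_true))  # small: the averaged model approaches x_true
```

Compared with MB-SGD, communication happens only once every K steps, which is the source of L-SGD's scalability; the local models drift apart between averaging points, which is where the generalization trade-offs discussed in this paper arise.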
