WHY DOES DECENTRALIZED TRAINING OUTPERFORM SYNCHRONOUS TRAINING IN THE LARGE BATCH SETTING?

Abstract

Distributed Deep Learning (DDL) is essential for large-scale Deep Learning (DL) training. Using a sufficiently large batch size is critical to achieving DDL runtime speedup. In a large batch setting, the learning rate must be increased to compensate for the reduced number of parameter updates. However, a large batch size may converge to sharp minima with poor generalization, and a large learning rate may harm convergence. Synchronous Stochastic Gradient Descent (SSGD) is the de facto DDL optimization method. Recently, Decentralized Parallel SGD (DPSGD) has been proven to achieve a similar convergence rate as SGD and to guarantee linear speedup for non-convex optimization problems. While there has been anecdotal evidence that DPSGD outperforms SSGD in the large-batch setting, no systematic study has been conducted to explain why this is the case. Based on a detailed analysis of the DPSGD learning dynamics, we find that DPSGD introduces additional landscape-dependent noise, which has two benefits in the large-batch setting: 1) it automatically adjusts the learning rate to improve convergence; 2) it enhances weight space search by escaping local traps (e.g., saddle points) to find flat minima with better generalization. We conduct extensive studies over 12 state-of-the-art DL models/tasks and demonstrate that DPSGD consistently outperforms SSGD in the large batch setting, and that DPSGD converges in cases where SSGD diverges for large learning rates. Our findings are consistent across different application domains (Computer Vision and Automatic Speech Recognition) and different neural network models (Convolutional Neural Networks and Long Short-Term Memory Recurrent Neural Networks).

1. INTRODUCTION

Deep Learning (DL) has revolutionized AI training across application domains: Computer Vision (CV) (Krizhevsky et al., 2012; He et al., 2015), Natural Language Processing (NLP) (Vaswani et al., 2017), and Automatic Speech Recognition (ASR) (Hinton et al., 2012). Stochastic Gradient Descent (SGD) is the fundamental optimization method used in DL training. Due to massive computational requirements, Distributed Deep Learning (DDL) is the preferred mechanism to train large-scale DL tasks. In the early days, Parameter Server (PS) based Asynchronous SGD (ASGD) training was the preferred DDL approach (Dean et al., 2012; Li et al., 2014), as it did not require strict system-wide synchronization. Recently, ASGD has lost popularity due to its unpredictability and often inferior convergence behavior (Zhang et al., 2016b). Practitioners now favor deploying Synchronous SGD (SSGD) on homogeneous High Performance Computing (HPC) systems. The degree of parallelism in a DDL system is dictated by batch size: the larger the batch size, the more parallelism and the higher the expected speedup. However, large batches require a larger learning rate, and together they may negatively affect model accuracy because 1) large batch training usually converges to sharp minima, which do not generalize well (Keskar et al., 2016), and 2) large learning rates may violate the conditions (i.e., the smoothness parameter) required for convergence in nonconvex optimization theory (Ghadimi & Lan, 2013). Although training longer with large batches could lead to better generalization (Hoffer et al., 2017), doing so gives up some or all of the speedup we seek. Through meticulous hyper-parameter design (e.g., learning rate) tailored to each specific task, SSGD-based DDL systems have enabled large batch training and shortened training time for some challenging CV tasks (Goyal et al., 2017; You et al., 2017) and NLP tasks (You et al., 2019) from weeks to hours or less.
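The large-batch recipes cited above typically pair the batch size with a linearly scaled learning rate plus a gradual warmup (Goyal et al., 2017). The sketch below is a minimal illustration of that rule; the base learning rate, base batch size, and warmup length are illustrative constants, not values from this paper.

```python
def scaled_lr(batch_size, epoch, base_lr=0.1, base_batch=256, warmup_epochs=5):
    """Linear-scaling rule with gradual warmup (Goyal et al., 2017):
    the target learning rate grows linearly with batch size, and is
    ramped up from base_lr over the first few epochs to avoid early
    divergence. Constants here are illustrative defaults."""
    target = base_lr * batch_size / base_batch
    if epoch < warmup_epochs:
        # linear warmup from base_lr to the scaled target
        return base_lr + (target - base_lr) * epoch / warmup_epochs
    return target
```

For example, at batch size 8192 the target learning rate is 32x the base rate; it is exactly this kind of aggressive scaling that can run into the convergence issues discussed next.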
However, it has been observed that SSGD with a large batch size leads to large training loss and inferior model quality for ASR tasks (Zhang et al., 2019b), as illustrated in Figure 1a (red curve). In this paper we find that large batch SSGD has the same problem for other types of tasks (e.g., CV) and DL models (Figure 1b and Figure 1c). The cause of this problem could be that training gets trapped at saddle points, since large batches reduce the magnitude of noise in the stochastic gradient and prevent the algorithm from exploring the whole parameter space. To solve this problem, one may add isotropic noise (e.g., spherical Gaussian) to help SSGD escape from saddle points (Ge et al., 2015). However, this is not a good solution for high-dimensional DL training, as shown in the blue curves of Figure 1. One possible reason is that the complexity of escaping a saddle point by adding isotropic noise has a polynomial dependency on the dimension of the parameter space, so adding such noise in a high-dimensional space (such as deep learning) does not bring significant benefits. In this paper, we find that Decentralized Parallel SGD (DPSGD) (Lian et al., 2017b) greatly improves large batch training performance, as illustrated by the green curves in Figure 1. Unlike SSGD, where each learner updates its weights by taking a global average of all learners' weights, DPSGD updates each learner's weights by taking a partial average (i.e., across a subset of neighboring learners).

Figure 1: SSGD (red) does not converge in the large batch setting. Figure 1a plots the heldout loss (lower is better). Figure 1b and Figure 1c plot model accuracy (higher is better). By injecting Gaussian noise, SSGD might escape early traps but results in a much worse model (blue) compared to DPSGD (green) in the large batch setting. Detailed task descriptions and training recipes are described in Section 4.3. BS stands for Batch Size.
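The contrast between the two update rules can be made concrete with a toy sketch. This is a simplified illustration, not the authors' implementation: the ring topology and the uniform three-way (self plus two neighbors) averaging are our assumptions for the example.

```python
import numpy as np

def ssgd_step(weights, grads, lr):
    """SSGD: every learner applies its gradient, then all learners
    average globally, so they end each step with identical weights."""
    updated = [w - lr * g for w, g in zip(weights, grads)]
    mean = np.mean(updated, axis=0)
    return [mean.copy() for _ in weights]

def dpsgd_step(weights, grads, lr):
    """DPSGD (toy ring topology): each learner averages only with its
    two ring neighbors, then applies its own local gradient.
    Learners' weights therefore stay distinct across the step."""
    n = len(weights)
    new_weights = []
    for j in range(n):
        # partial average over the neighborhood {j-1, j, j+1} on the ring
        partial = (weights[j] + weights[(j - 1) % n] + weights[(j + 1) % n]) / 3.0
        new_weights.append(partial - lr * grads[j])
    return new_weights
```

Note that because the ring averaging matrix is doubly stochastic, the mean of the DPSGD learners' weights after one step matches the SSGD average; the learners differ only in how they deviate from that mean, and it is this learner-to-learner spread that acts as the extra noise studied below.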
Therefore, in DPSGD, each learner's weights differ from the weights of other learners.¹ The key difference among SSGD, SSGD with Gaussian noise,² and DPSGD is the source of noise during the update, and this noise directly affects performance in deep learning. This naturally motivates us to ask: why does decentralized training outperform synchronous training in the large batch setting? More specifically, we try to understand whether their performance difference is caused by their different noise. We answer these questions from both theoretical and empirical perspectives. Our contributions are:

• We analyze the dynamics of DDL algorithms, including both SSGD and DPSGD. We show, both theoretically and empirically, that the intrinsic noise in DPSGD can 1) reduce the effective learning rate when the gradient is large, helping convergence; 2) enhance the search in weight space for flat minima with better generalization.

• We conduct extensive empirical studies of 12 CV and ASR tasks with state-of-the-art CNN and LSTM models. Our experimental results demonstrate that DPSGD consistently outperforms SSGD across application domains and Neural Network (NN) architectures in the large batch setting, without any hyper-parameter tuning. To the best of our knowledge, no generic algorithm has been shown to improve SSGD large batch training on this many models/tasks.

The remainder of this paper is organized as follows. Section 2 details the problem formulation and learning dynamics analysis of SSGD, SSGD+Gaussian, and DPSGD; Section 3 and Section 4 present the empirical results; and Section 5 concludes the paper.

2. ANALYSIS OF STOCHASTIC LEARNING DYNAMICS AND EFFECTS OF LANDSCAPE-DEPENDENT NOISE

We first formulate the dynamics of an SGD-based learning algorithm with multiple ($n > 1$) learners indexed by $j = 1, 2, \ldots, n$, following the same theoretical framework established for a single learner (Chaudhari & Soatto, 2018). At each given time (iteration) $t$, each learner has its own weight vector $w_j(t)$, and the average weight vector $w_a(t)$ is defined as: $w_a(t) \equiv \frac{1}{n} \sum_{j=1}^{n} w_j(t)$.
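This multi-learner setup can be simulated directly. The sketch below is a toy numerical illustration of the quantities just defined, not the paper's experimental setup: the quadratic loss and the $1/\sqrt{B}$ mini-batch noise scale are our assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(w, batch_size):
    """Gradient of a toy quadratic loss L(w) = 0.5 * ||w||^2, plus
    mini-batch sampling noise whose scale shrinks as 1/sqrt(batch_size),
    mimicking how larger batches reduce stochastic-gradient noise."""
    return w + rng.normal(scale=1.0 / np.sqrt(batch_size), size=w.shape)

n, lr, steps = 4, 0.1, 50
learners = [rng.normal(size=8) for _ in range(n)]   # w_j(0), j = 1..n
for t in range(steps):
    # each learner takes its own independent SGD step
    learners = [w - lr * noisy_grad(w, batch_size=32) for w in learners]
w_a = np.mean(learners, axis=0)  # w_a(t) = (1/n) * sum_j w_j(t)
```

On this convex toy loss all learners contract toward the minimum at the origin, and the average weight vector $w_a(t)$ fluctuates less than any individual $w_j(t)$ because the independent gradient noise partially cancels in the average.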



¹ The detailed DPSGD algorithm and its learning dynamics are described in Section 2.
² We use the terms "SSGD with Gaussian noise" and "SSGD*" interchangeably in this paper.



Figure 1 panels: (a) LSTM, SWB300, BS 8192; (b) EfficientNet, CIFAR-10, BS 8192; (c) SENet-18, CIFAR-10, BS 8192.

