WHY DOES DECENTRALIZED TRAINING OUTPERFORM SYNCHRONOUS TRAINING IN THE LARGE BATCH SETTING?

Abstract

Distributed Deep Learning (DDL) is essential for large-scale Deep Learning (DL) training. Using a sufficiently large batch size is critical to achieving DDL runtime speedup, and in a large batch setting the learning rate must be increased to compensate for the reduced number of parameter updates. However, a large batch size may converge to sharp minima with poor generalization, and a large learning rate may harm convergence. Synchronous Stochastic Gradient Descent (SSGD) is the de facto DDL optimization method. Recently, Decentralized Parallel SGD (DPSGD) has been proven to achieve a convergence rate similar to that of SGD and to guarantee linear speedup for non-convex optimization problems. While there is anecdotal evidence that DPSGD outperforms SSGD in the large-batch setting, no systematic study has explained why this is the case. Based on a detailed analysis of the DPSGD learning dynamics, we find that DPSGD introduces additional landscape-dependent noise, which has two benefits in the large-batch setting: 1) it automatically adjusts the learning rate to improve convergence; 2) it enhances weight-space search by escaping local traps (e.g., saddle points) to find flat minima with better generalization. We conduct extensive studies over 12 state-of-the-art DL models/tasks and demonstrate that DPSGD consistently outperforms SSGD in the large batch setting, and that DPSGD converges in cases where SSGD diverges under large learning rates. Our findings are consistent across different application domains (Computer Vision and Automatic Speech Recognition) and different neural network models (Convolutional Neural Networks and Long Short-Term Memory Recurrent Neural Networks).

1. INTRODUCTION

Deep Learning (DL) has revolutionized AI training across application domains: Computer Vision (CV) (Krizhevsky et al., 2012; He et al., 2015), Natural Language Processing (NLP) (Vaswani et al., 2017), and Automatic Speech Recognition (ASR) (Hinton et al., 2012). Stochastic Gradient Descent (SGD) is the fundamental optimization method used in DL training. Due to massive computational requirements, Distributed Deep Learning (DDL) is the preferred mechanism to train large-scale DL tasks. In the early days, Parameter Server (PS) based Asynchronous SGD (ASGD) training was the preferred DDL approach (Dean et al., 2012; Li et al., 2014), as it did not require strict system-wide synchronization. More recently, ASGD has lost popularity due to its unpredictability and often inferior convergence behavior (Zhang et al., 2016b). Practitioners now favor deploying Synchronous SGD (SSGD) on homogeneous High Performance Computing (HPC) systems. The degree of parallelism in a DDL system is dictated by batch size: the larger the batch size, the more parallelism and the higher the speedup that can be expected. However, large batches require a larger learning rate, and together they may degrade model accuracy because 1) large batch training usually converges to sharp minima, which do not generalize well (Keskar et al., 2016), and 2) large learning rates may violate the conditions (i.e., the smoothness parameter) required for convergence in nonconvex optimization theory (Ghadimi & Lan, 2013). Although training longer with large batches could lead to better generalization (Hoffer et al., 2017), doing so gives up some or all of the speedup we seek. Through meticulous hyper-parameter design (e.g., learning rate) tailored to each specific task, SSGD-based DDL systems have enabled large batch training and shortened training time for some challenging CV tasks (Goyal et al., 2017; You et al., 2017) and NLP tasks (You et al., 2019) from weeks to hours or less.
However, it has been observed that SSGD with a large batch size leads to large training loss and inferior model quality for ASR tasks (Zhang et al., 2019b), as illustrated in Figure 1a (red curve). In this paper we find that large batch SSGD exhibits the same problem for other types of tasks (e.g., CV) and DL models (Figure 1b and Figure 1c). A possible cause is that training gets trapped at saddle points, since large batches reduce the magnitude of noise in the stochastic gradients.
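To make the contrast between the two training schemes concrete, the following toy NumPy sketch (an illustration under our own assumptions, not the paper's implementation) performs one SSGD step and one DPSGD step on a simple quadratic loss f(w) = 0.5‖w‖². In SSGD every worker applies the same globally averaged gradient, so replicas that start identical stay identical; in DPSGD each worker averages parameters with its ring neighbors and then applies its own local stochastic gradient, so replicas drift apart, which is the extra parameter-space noise the paper attributes to DPSGD. The worker count, ring topology, and noise scale are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim, lr = 4, 3, 0.1

# All workers start from the same parameter vector.
w0 = rng.normal(size=dim)
weights = [w0.copy() for _ in range(n_workers)]

# Different minibatches yield different stochastic gradients;
# for f(w) = 0.5*||w||^2 the true gradient is w, plus simulated noise.
grads = [w + 0.1 * rng.normal(size=dim) for w in weights]

# --- SSGD step: every worker applies the globally averaged gradient. ---
avg_grad = np.mean(grads, axis=0)
ssgd_weights = [w - lr * avg_grad for w in weights]

# --- DPSGD step: each worker first averages parameters with its two ring
#     neighbors, then applies its own local gradient. ---
dpsgd_weights = []
for i in range(n_workers):
    left = weights[(i - 1) % n_workers]
    right = weights[(i + 1) % n_workers]
    mixed = (left + weights[i] + right) / 3.0  # neighbor averaging
    dpsgd_weights.append(mixed - lr * grads[i])

# Replica spread after one step: zero for SSGD, nonzero for DPSGD.
print("SSGD spread: ", np.std(ssgd_weights, axis=0))
print("DPSGD spread:", np.std(dpsgd_weights, axis=0))
```

After one step the SSGD replicas are still bitwise-identical, while the DPSGD replicas differ by the disagreement among local gradients; it is this replica disagreement that injects the landscape-dependent noise analyzed in the rest of the paper.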

