GRADIENTMIX: A SIMPLE YET EFFECTIVE REGULARIZATION FOR LARGE BATCH TRAINING

Abstract

Stochastic gradient descent (SGD) is the core tool for training deep neural networks. As modern deep learning tasks become more complex and state-of-the-art architectures grow larger, network training with SGD takes a huge amount of time; for example, training ResNet on the ImageNet dataset or pre-training BERT can take from days to dozens of days. To reduce training time, distributed learning with a large batch size for SGD has been one of the main active research areas in recent years, but this approach entails a significant degradation in generalization. To address this issue, we propose a simple yet effective regularization technique, GradientMix, for large-scale distributed learning. GradientMix enhances generalization in the large batch regime by injecting appropriate noise through a mixup of the local gradients computed at multiple devices, in contrast to the convention of simply averaging local gradients. Furthermore, GradientMix is optimizer-agnostic and hence can be applied to any popular optimization algorithm as long as the overall loss is expressed as the sum of subgroup losses. Our extensive experiments show its effectiveness on both small- and large-scale problems; in particular, we consistently achieve state-of-the-art performance with various optimizers when training ResNet-50 on ImageNet with a 32K batch size.

1. INTRODUCTION

Stochastic gradient descent (SGD) is a classical training approach commonly used for most deep learning models, in which model parameters are updated in the direction opposite to the gradient of the loss function computed on a mini-batch. Recent developments in hardware such as GPUs and TPUs enable complex, state-of-the-art models to be trained with large batches whose gradients are computed efficiently via data parallelism. Unfortunately, however, it is known that large batch training typically suffers from severe degradation in generalization compared to small batch training. The reason for the poor generalization of large batch training has still not been fully uncovered, which has led many researchers to actively investigate this phenomenon in several directions.

Many efforts have been made to shed light on whether generalization is related to sharp minima from the perspective of the loss landscape. As one of the pioneering works, Keskar et al. (2016) argued that the generalization gap is primarily due to the sharp local minima obtained from large batch training. Another study, He et al. (2019a), further showed that local minima as well as sharp minima lie on an asymmetric valley and that a local minimum biased towards the flat side generalizes better than the exact empirical minimizer. Despite some evidence of the relationship between sharp minima and generalization, Dinh et al. (2017) substantiated that the performance degradation in large batch training has nothing to do with sharp minima by showing that all minima of a neural network can be made arbitrarily sharp via reparametrizations that do not change the network output.

Given the limitations of enhancing generalization using the curvature information of the loss function, designing an optimization algorithm specific to the large batch regime arises as another direction. The first representative in this line of work is the LARS optimizer (You et al., 2017), which exploits the ratio of the parameter norm to the gradient norm. Yet, as pointed out in Zhang et al. (2020), LARS performs poorly for attention-based models such as the Transformer on the grounds that LARS is an algorithm based on vanilla SGD. To address this issue, a second representative optimizer, LAMB, was proposed in You et al. (2019), which inherits the spirit of LARS and of one of the adaptive gradient methods, Adam (Kingma & Ba, 2015). The aforementioned optimizers make it possible to train ResNet-50 on the ImageNet dataset and BERT in around an hour by scaling the batch size to more than 32K. However, Nado et al. (2021) empirically confirmed that traditional optimizers such as Nesterov momentum can achieve performance as high as that of large batch optimizers (e.g., LARS, LAMB, AdaScale SGD, etc.) if one puts the same effort into hyperparameter tuning for each optimizer. Given this observation, it is debatable whether the superiority of large-batch optimizers like LARS and LAMB is due to their well-designed concepts or merely due to dense hyperparameter tuning.

Last but not least, there have been several studies elucidating how noise can enhance generalization. Bottou (1991); Neelakantan et al. (2015) showed that noise can play a significant role in training neural networks, although they do not focus on large batch training. In addition, Zhou et al. (2019) theoretically demonstrated that adding noise to the gradient helps circumvent spurious local optima, thus allowing convergence to a global optimum, and Liu et al. (2021) showed that noisy gradient descent finds a flat minimum in non-convex matrix factorization. Smith et al. (2020) empirically corroborated that noise in stochastic gradients can enhance generalization, and Wen et al. (2018) investigated how injecting curvature noise can be useful in large batch training. In light of these studies on gradient noise, one can say that "certain noise can regularize the gradient descent procedure well". However, most of the aforementioned studies on gradient noise are restricted to Gaussian-type noise.
Therefore, it is still questionable whether other types of noise could be beneficial for generalization, especially in the large batch regime. In this paper, we focus on improving generalization for large batch training with the use of the mixup technique (Zhang et al., 2018), which has recently been highlighted in the data augmentation context (a short sketch of the mixup operation is given after the contribution list below). Toward this, we propose a novel way of adding noise to the optimizer by mixing up gradients and show that this technique in fact encourages convergence to a flat minimum. Our main contributions are as follows:

• We propose a simple yet effective regularization, GradientMix, for large batch training. GradientMix performs a linear combination of the local gradients computed at each device (or each sample) with arbitrary mixing rates. GradientMix is optimizer-agnostic, hence it can be applied to any optimization algorithm, including LARS or LAMB for large batch training.

• We mathematically show that optimization with GradientMix can reduce the trace of the generalized Gauss-Newton matrix of the objective. We then provide a convergence analysis of optimization with GradientMix in the non-convex regime under standard assumptions.

• We validate GradientMix on popular problems in the deep learning community. Our extensive experiments show that GradientMix yields better generalization, especially in large batch settings. Specifically, various optimizers with GradientMix consistently achieve state-of-the-art performance on training ResNet-50 on ImageNet with a 32K batch size.
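For readers unfamiliar with mixup, the following minimal NumPy sketch shows the standard input-level operation that GradientMix transfers to the gradient level. The Beta(α, α) sampling of the mixing rate follows the original mixup paper; the function name and array shapes are illustrative assumptions rather than part of this work.

```python
import numpy as np

def mixup_batch(x, y, alpha=0.2, rng=None):
    """Input-level mixup (Zhang et al., 2018): mix random pairs of examples
    and their one-hot labels with rate lambda ~ Beta(alpha, alpha)."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # mixing rate shared by the whole batch
    perm = rng.permutation(len(x))        # random pairing within the batch
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]
    return x_mix, y_mix
```

GradientMix applies the same kind of convex combination to the K local gradients instead of the K inputs, as detailed in Section 2.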

2. METHOD

In this section, we introduce our main regularization technique, GradientMix. Then, we show that GradientMix encourages reducing the trace of the generalized Gauss-Newton (GGN) matrix of the loss function.

GRADIENTMIX: MIXING GRADIENTS WITH RANDOM MIXING RATE

As a motivating example, we start from distributed training with multiple GPU machines. Let D be the training dataset with n inputs X = {x : (x, y) ∈ D} and the corresponding labels Y = {y : (x, y) ∈ D}. We study solving the optimization problem under the distributed environment:

$$\min_{\theta \in \mathbb{R}^d} \; \mathcal{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i; \theta), y_i\big), \tag{1}$$

where θ ∈ R^d is the model parameter and ℓ(ŷ, y) denotes the instance-wise loss function between the prediction ŷ and the true label y. Given K devices, vanilla SGD updates the model parameter θ as

$$g_{\text{final}} = \frac{1}{K} \sum_{k=1}^{K} g_k, \qquad \theta_{t+1} = \theta_t - \eta_t \, g_{\text{final}}, \tag{2}$$

where g_k denotes the local gradient computed on the mini-batch assigned to the k-th device and η_t is the learning rate.
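To make the contrast with Equation (2) concrete, the NumPy sketch below implements one data-parallel update step: the baseline simply averages the K local gradients, while the GradientMix-style variant draws random mixing rates that sum to one and takes the corresponding convex combination. Sampling the rates from a symmetric Dirichlet distribution (a multi-device analogue of mixup's Beta sampling) is our illustrative assumption; the text above only requires arbitrary random mixing rates, and the function names are hypothetical.

```python
import numpy as np

def average_gradients(local_grads):
    """Baseline of Eq. (2): uniform average of the K local gradients."""
    return np.mean(local_grads, axis=0)

def gradientmix(local_grads, alpha=1.0, rng=None):
    """GradientMix-style combination (illustrative sketch): a random convex
    combination of the K local gradients instead of their plain average.
    The Dirichlet(alpha) sampling of the mixing rates is an assumption."""
    if rng is None:
        rng = np.random.default_rng()
    k = len(local_grads)
    lam = rng.dirichlet(alpha * np.ones(k))            # mixing rates, sum to 1
    return np.tensordot(lam, np.asarray(local_grads), axes=1)

# One SGD step with K = 4 devices on a toy parameter vector.
rng = np.random.default_rng(0)
theta = rng.normal(size=10)
local_grads = [rng.normal(size=10) for _ in range(4)]  # stand-ins for per-device gradients
eta = 0.1
theta = theta - eta * gradientmix(local_grads, alpha=1.0, rng=rng)
```

Under the symmetric-Dirichlet assumption the mixing rates have mean 1/K, so the mixed gradient equals the plain average of Eq. (2) in expectation and the extra randomness acts purely as noise around the usual update. Because the optimizer still consumes a single aggregated gradient, the same mixing step can be dropped in front of SGD, LARS, LAMB, or any other update rule, which is what makes the technique optimizer-agnostic.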

