GRADIENTMIX: A SIMPLE YET EFFECTIVE REGULARIZATION FOR LARGE BATCH TRAINING

Abstract

Stochastic gradient descent (SGD) is the core tool for training deep neural networks. As modern deep learning tasks become more complex and state-of-the-art architectures grow as well, network training with SGD takes a huge amount of time; for example, training ResNet on the ImageNet dataset or BERT pre-training can take days to dozens of days. To reduce the network training time, distributed learning using a large batch size for SGD has been one of the main active research areas in recent years, but this approach entails a significant degradation in generalization. To address this issue, in this paper, we propose a simple yet effective regularization technique, GradientMix, for large-scale distributed learning. GradientMix can enhance the generalization in large batch regimes by giving appropriate noise through a mixup of local gradients computed at multiple devices, which is contrary to the conventions that simply average local gradients. Furthermore, GradientMix is optimizer-agnostic, hence can be applied to any popular optimization algorithm as long as the overall loss is expressed as the sum of the subgroup losses. Our extensive experiments show the effectiveness in both small and large-scale problems, and especially we consistently achieve state-of-the-art performance for various optimizers on training ResNet-50 on the ImageNet dataset with 32K batch size.

1. INTRODUCTION

Stochastic gradient descent (SGD) is a classical approach used to train most deep learning models, in which model parameters are updated in the opposite direction of the gradient of the loss function computed on a mini-batch. Recent developments in hardware such as GPUs and TPUs enable complex state-of-the-art models to be trained with large batches whose gradients are computed efficiently via data parallelism. Unfortunately, large batch training is known to suffer from severe degradation in generalization compared to small batch training. The reason for this poor generalization has not been fully uncovered, which has led many researchers to investigate the phenomenon in several directions.

Many efforts have been made to shed light on whether generalization is related to sharp minima from the perspective of the loss landscape. In one of the pioneering works, Keskar et al. (2016) argued that the generalization gap is primarily due to the sharp local minima obtained from large batch training. Another study, He et al. (2019a), further showed that local minima as well as sharp minima lie in an asymmetric valley and that a local minimum biased towards the flat side generalizes better than the exact empirical minimizer. Despite some evidence of a relationship between sharp minima and generalization, Dinh et al. (2017) argued that the performance degradation in large batch training is not explained by sharp minima, by showing that all minima of a neural network can be made arbitrarily sharp through a reparametrization that does not change the network output.

Given these limitations of using curvature information of the loss function to enhance generalization, designing optimization algorithms specific to the large batch regime has emerged as another direction. The first representative of this line of work is the LARS optimizer (You et al., 2017), which exploits the ratio of the parameter norm to the gradient norm. Yet, as pointed out in Zhang et al. (2020), LARS performs poorly for attention-based models such as Transformer because it is based on vanilla SGD. To address this issue, the second representative optimizer, LAMB, was proposed in You et al. (2019); it inherits the spirit of LARS and of one of the adaptive gradient methods, Adam (Kingma & Ba, 2015). These optimizers make it possible to train ResNet-50 on the ImageNet dataset and BERT in around an hour by scaling the batch size to more than 32K. However, Nado et al. (2021) empirically confirmed that traditional optimizers such as Nesterov can achieve performance as high as large batch optimizers (e.g., LARS, LAMB, and AdaScale SGD) if one puts the same effort into hyperparameter tuning for each optimizer. Owing to this observation, it is debatable whether the superiority of large-batch optimizers like LARS and LAMB is due to their design or merely due to dense hyperparameter tuning.

Last but not least, several studies have sought to elucidate how noise can enhance generalization. Bottou (1991) and Neelakantan et al. (2015) showed that noise can play a significant role in training neural networks, although they do not focus on large batch training. In addition, Zhou et al. (2019) theoretically demonstrated that adding noise to the gradient helps circumvent spurious local optima, thus allowing convergence to a global optimum, and Liu et al. (2021) showed that noisy gradient descent finds a flat minimum in non-convex matrix factorization. Smith et al. (2020) empirically corroborated that noise in stochastic gradients can enhance generalization, and Wen et al. (2018) examined how injecting curvature noise can be useful in large batch training.

In light of previous studies on gradient noise, one can say that "certain noise can regularize the gradient descent procedure well". However, most of the aforementioned studies are restricted to Gaussian-type noise. It therefore remains an open question whether other types of noise could benefit generalization, especially in the large batch regime. In this paper, we focus on improving generalization for large batch training using the mixup technique (Zhang et al., 2018), which has recently been highlighted in the data augmentation context. To this end, we propose a novel way of adding noise to the optimizer by mixing up the gradients, and we show that this technique encourages convergence to a flat minimum. Our main contributions are as follows:

• We propose a simple yet effective regularization, GradientMix, for large batch training. GradientMix performs a linear combination of local gradients computed at each device (or for each sample) with arbitrary mixing rates. GradientMix is optimizer-agnostic, hence it can be applied to any optimization algorithm, including LARS and LAMB, for large batch training.
• We show mathematically that optimization with GradientMix could reduce the trace of the generalized Gauss-Newton matrix of the objective. We then provide a convergence analysis of optimization with GradientMix in the non-convex regime under standard assumptions.
• We validate GradientMix on popular problems in the deep learning community. Our extensive experiments show that GradientMix could result in better generalization, especially in large batch settings. Specifically, various optimizers with GradientMix consistently achieve state-of-the-art performance on the task of training ResNet-50 on ImageNet with 32K batch size.

2. METHOD

In this section, we introduce our main regularization technique, GradientMix. Then, we show that GradientMix encourages reducing the trace of generalized Gauss-Newton (GGN) of the loss function.

2.1. GRADIENTMIX: MIXING GRADIENTS WITH RANDOM MIXING RATE

As a motivating example, we start from distributed training with multiple GPU machines. Let $D$ be the training dataset with $n$ inputs $X = \{x : (x, y) \in D\}$ and the corresponding labels $Y = \{y : (x, y) \in D\}$. We study the following optimization problem in the distributed environment:

$$\min_{\theta \in \mathbb{R}^d} \; L(\theta) := \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i; \theta), y_i\big) \tag{1}$$

where $\theta \in \mathbb{R}^d$ is the model parameter and $\ell(\hat{y}, y)$ denotes the instance-wise loss function between the prediction $\hat{y}$ and the true label $y$. Given $K$ devices, vanilla SGD updates the model parameter $\theta$ as

$$g_{\text{final}} = \frac{1}{K} \sum_{k=1}^{K} \underbrace{\frac{1}{B} \sum_{b=1}^{B} \nabla \ell^{k_b}_f(\theta)}_{\text{local gradient}}, \qquad \theta_{t+1} = \theta_t - \eta_t \, g_{\text{final}} \tag{2}$$

where $B$ is the local batch size for each device and $k_b$ represents the $b$-th datapoint at device $k$ (hence, $b \in [B]$). For convenience, we abbreviate the gradient $\nabla \ell(f(x_i; \theta), y_i)$ evaluated at the datapoint $(x_i, y_i)$ as $\nabla \ell^{i}_f(\theta)$. In Eq. 2, the local gradients computed at each device $k$ are averaged across all devices to yield the final gradient.

Algorithm 1 A general optimization framework with GradientMix
Input: Training dataset $D$, stepsizes $\{\eta_t\}_{t=1}^{T}$, variance parameter $\kappa > 0$, noise distribution $P(\pi)$ satisfying Eq. 12, optimization algorithm $\mathcal{A}$ (e.g., SGD, Adam), momentum parameter $\mu \in [0, 1)$.
Initialize: Model parameter $\theta_1 \in \mathbb{R}^d$, (Nesterov) momentum $v_1 \in \mathbb{R}^d$.
for $t = 1, 2, \ldots, T$ do
    $\theta_{t+\frac{1}{2}} \leftarrow \theta_t + \mu v_t$
    for $b = 1, 2, \ldots, B$ do
        Randomly sample the $b$-th datapoint $x^{(b)}_t$ from $D$    ▷ This can be done in parallel
        $g^{(b)}_t \leftarrow \nabla_\theta L(x^{(b)}_t; \theta_{t+\frac{1}{2}})$    ▷ Sample-wise gradient
    end for
    Sample random noise from the noise distribution, $(\pi_1, \pi_2, \ldots, \pi_B) \sim P(\pi)$
    $g_t \leftarrow \sum_{b=1}^{B} \pi_b \, g^{(b)}_t$    ▷ GradientMix in a sample-wise manner
    $v_{t+1} \leftarrow \mu v_t - \eta_t \sum_{b=1}^{B} \pi^{(b)}_t \nabla f_b(\theta_{t+\frac{1}{2}})$    ▷ Nesterov momentum construction
    $\theta_{t+1} \leftarrow \theta_t - \mathcal{A}(g_1, g_2, \ldots, g_t; \eta_t)$    ▷ Parameter update
end for
Output: $\theta_{T+1}$

Inspired by several studies (Srivastava et al., 2014; Zhu et al., 2019; Simsekli et al., 2019; Smith et al., 2020) on the importance of noise for generalization, we propose mixing up the local gradients across devices. To this end, at every iteration we first draw random noise $\pi_k$ from some distribution $P(\pi)$ for each device $k$ and form a linear combination of the local gradients using $\pi_k$ as the mixing coefficients. In summary, the final gradient is computed by

$$\pi_k \sim P(\pi) \ \text{with} \ \mathbb{E}[\pi_k] = 1/K \ \ \forall k \in [K], \qquad g_{\text{final}} = \sum_{k=1}^{K} \frac{\pi_k}{B} \sum_{b=1}^{B} \nabla \ell^{k_b}_f(\theta) \tag{3}$$

To preserve the unbiasedness of the stochastic gradient, we assume that each noise $\pi_k$ satisfies $\mathbb{E}[\pi_k] = 1/K$ in Eq. 3; one example is a Gaussian distribution with mean $1/K$. The gradient mixing in Eq. 3 can also be applied in a sample-wise manner. Suppose that the gradient is computed with batch size $B$; then the sample-wise gradient mixing is

$$\pi_b \sim P(\pi) \ \text{with} \ \mathbb{E}[\pi_b] = 1/B \ \ \forall b \in [B], \qquad g_{\text{final}} = \sum_{b=1}^{B} \pi_b \, \nabla \ell^{b}_f(\theta) \tag{4}$$

where $\pi_b$ is a sample-wise noise. Sample-wise mixing in Eq. 4 generalizes the batch-wise version in Eq. 3. We call the procedure in Eq. 4 GradientMix. Note that GradientMix is optimizer-agnostic since it does not require any specific update rule; it is therefore applicable to any popular optimizer such as Adam, LARS, or LAMB. Thus, we consider a unified framework for GradientMix. Let $\mathcal{A}$ be a randomized optimization algorithm (e.g., SGD) and let $\mathcal{A}(g_1, g_2, \ldots, g_t)$ produce the update vector at time $t$, where $g_\tau$ is the gradient computed at time $\tau$. Algorithm 1 summarizes an optimization framework with GradientMix. An important property of GradientMix is that the variance of the mixed gradient is never lower than that of the traditional averaged gradient, since the variance of the mixed gradient in Eq. 4 is smallest when the noise satisfies $\pi_b = 1/B$ for all $b \in [B]$.
We investigate how this noise affects the loss landscape and the convergence of optimization in the following sections.
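The sample-wise mixing in Eq. 4 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the helper names `sample_mixing_rates` and `gradient_mix` are hypothetical, and the parametrization π_b = 1/B + κ ε_b (with ε_b standard Gaussian or Rademacher) is an assumed way to obtain mean 1/B and per-coordinate variance κ².

```python
import random


def sample_mixing_rates(batch_size, kappa, kind="gaussian", rng=random):
    """Draw pi with E[pi_b] = 1/B and Var[pi_b] = kappa**2 (independent entries).

    Assumed parametrization: pi_b = 1/B + kappa * eps_b with zero-mean,
    unit-variance eps_b. The paper only requires the moments in Eq. 12.
    """
    mean = 1.0 / batch_size
    if kind == "gaussian":
        return [mean + kappa * rng.gauss(0.0, 1.0) for _ in range(batch_size)]
    if kind == "rademacher":
        return [mean + kappa * rng.choice((-1.0, 1.0)) for _ in range(batch_size)]
    raise ValueError("kind must be 'gaussian' or 'rademacher'")


def gradient_mix(per_sample_grads, pi):
    """Return sum_b pi_b * g_b for a list of equal-length gradient vectors."""
    dim = len(per_sample_grads[0])
    mixed = [0.0] * dim
    for weight, grad in zip(pi, per_sample_grads):
        for j in range(dim):
            mixed[j] += weight * grad[j]
    return mixed


# Toy per-sample gradients for a batch of four samples in two dimensions.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
pi = sample_mixing_rates(len(grads), kappa=0.05, kind="rademacher")
mixed_gradient = gradient_mix(grads, pi)
```

With κ = 0 every mixing rate collapses to 1/B and `gradient_mix` returns the ordinary averaged gradient, which is the baseline GradientMix is compared against.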

2.2. GRADIENTMIX CAN REDUCE THE TRACE OF GENERALIZED GAUSS-NEWTON MATRIX

In this section, we provide more intuition for GradientMix. To this end, we consider an $L$-layer deep neural network, following the notation of Lee et al. (2019), defined by

$$h^{(l+1)} = x^{(l)} W^{(l+1)} + b^{(l+1)}, \qquad x^{(l+1)} = \phi\big(h^{(l+1)}\big)$$

for $x^{(0)} = x$ and $l = 0, 1, \ldots, L$. Here, $\phi(\cdot)$ is an element-wise (non-linear) activation function such as ReLU, and $W^{(l)}$ and $b^{(l)}$ are the weight and bias parameters of appropriate shape at the $l$-th layer, respectively. For simplicity, let $\theta \in \mathbb{R}^d$ be the vector of all network parameters and let $f(x; \theta) = h^{(L+1)}(x) \in \mathbb{R}^k$ denote the network output (or logit). Under this setup, we first introduce the generalized Gauss-Newton (GGN) matrix. To this end, we revisit the optimization problem in Eq. 1:

$$\min_{\theta} \; L(\theta) := \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i; \theta), y_i\big) = \frac{1}{n} \sum_{i=1}^{n} \ell^{i}_f(\theta).$$

Following Schraudolph (2002) and Kunstner et al. (2019), the Hessian of the objective (with respect to $\theta$) can be written as

$$H(\theta) := \nabla^2_\theta L(\theta) = \underbrace{\frac{1}{n} \sum_{i=1}^{n} J_\theta(x_i)^{T} \, \nabla^2_f \ell^{i}_f(\theta) \, J_\theta(x_i)}_{G(\theta)} + \text{Remaining term} \tag{5}$$

where $J_\theta(x_i)$ represents the sample-wise Jacobian $\partial f / \partial \theta$ evaluated at a single datapoint $(x_i, y_i)$ and $\nabla^2_f \ell^{i}_f(\theta)$ is the Hessian of $\ell^{i}$ with respect to the network output. The matrix $G(\theta)$ defined in Eq. 5 is known as the generalized Gauss-Newton (GGN) matrix. For a convex loss $\ell$, the GGN is always positive (semi)definite and is a well-justified approximation of the Hessian $H(\theta)$ (Kunstner et al., 2019), so it is a popular curvature approximation in non-convex problems. We now study the relationship between GradientMix and the GGN matrix. The gradient of the objective function can be written as

$$\nabla_\theta L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \ell^{i}_f(\theta) = \frac{1}{n} \sum_{i=1}^{n} J_\theta(x_i)^{T} \nabla_f \ell^{i}_f(\theta) = A z \tag{6}$$

where $A = [\nabla_\theta \ell^{1}_f(\theta) \mid \cdots \mid \nabla_\theta \ell^{n}_f(\theta)] \in \mathbb{R}^{d \times n}$ is the matrix of all sample-wise gradients and $z = (1/n, \ldots, 1/n) = \frac{1}{n} \mathbf{1} \in \mathbb{R}^n$ is a vector whose entries are all $1/n$.

Under this formulation, GradientMix in effect replaces the deterministic vector $z$ with a noisy one $\tilde{z}$:

$$\widetilde{\nabla}_\theta L(\theta) = A \tilde{z} \tag{7}$$

where the vector $\tilde{z}$ is a random variable satisfying $\mathbb{E}[\tilde{z}] = \frac{1}{n} \mathbf{1}$ so that the gradient estimator is unbiased, as in Eq. 3. We call Eq. 7 a mixed gradient. In optimization analysis, convergence is generally measured by how fast the gradient norm $\|\nabla_\theta L(\theta)\|^2$ approaches zero. GradientMix changes this convergence criterion to

$$\|\widetilde{\nabla}_\theta L(\theta)\|^2 \to 0 \tag{8}$$

Plugging the mixed gradient in Eq. 7 into Eq. 8, the convergence measure in Eq. 8 is nothing but

$$\mathbb{E}\big[\tilde{z}^{T} A^{T} A \tilde{z}\big] \to 0 \tag{9}$$

Provided that $\tilde{z}$ satisfies $\mathrm{Cov}[\tilde{z}] = \kappa^2 I$ for some positive constant $\kappa > 0$, Hutchinson's trace estimator (Hutchinson, 1989) implies that Eq. 9 is equivalent to

$$\mathrm{Tr}(A^{T} A) \to 0 \tag{10}$$

Rearranging the matrix inside the trace operator, we finally arrive at the following:

$$\mathrm{Tr}\left( \sum_{i=1}^{n} J_\theta(x_i)^{T} \, \nabla_f \ell^{i}_f(\theta) \, \nabla_f \ell^{i}_f(\theta)^{T} \, J_\theta(x_i) \right) \to 0 \tag{11}$$

For a negative log-likelihood loss $\ell$ (as is the case in many deep learning problems), the matrix inside the trace in Eq. 11 is the GGN matrix $G(\theta)$ (refer to Proposition 1 in Kunstner et al. (2019)). Consequently, convergence with GradientMix roughly boils down to reducing the trace of the GGN matrix, which can represent the sharpness of minima to some extent, as in several previous works (Zhu et al., 2019; Simsekli et al., 2019; Lin et al., 2020). Hence, GradientMix might help to find a flatter solution than the one obtained via the usual averaged gradient. Note that this equivalence does not rely on the batch size, but we expect GradientMix to be most helpful in the large batch regime, which is especially vulnerable to sharp curvature. The remaining questions are which noise class can reduce the trace of the GGN matrix and whether the gradient norm really goes to zero when optimization is equipped with GradientMix.

For the first question, we have already assumed mild conditions on the noise given the batch size $B$:

$$\mathbb{E}[\pi] = \frac{1}{B} \mathbf{1}_B \quad \text{and} \quad \mathrm{Cov}[\pi] = \kappa^2 I_B \tag{12}$$

where $\mathbf{1}_B \in \mathbb{R}^B$ is a vector of ones. Several distributions satisfy Eq. 12, and we adopt two practical ones: (i) Gaussian and (ii) Rademacher. Based on this mathematical intuition, we give empirical evidence on the trace of the GGN matrix. We train ResNet-18 on the CIFAR-10 dataset with 8K batch size as the large batch setting. Table 1 reports the trace of the GGN matrix of the final models trained with each method. As expected, GradientMix indeed reduces the trace of the GGN to a meaningful extent, which suggests that GradientMix could find a flatter solution than the usual averaging approach with almost no additional computational overhead. We show the effectiveness of GradientMix with these noise structures in large batch training in the experimental section. We answer the second question in the next section.
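The step from Eq. 9 to the trace criterion can be made explicit with the standard identity for the expectation of a quadratic form. The following is an added worked step, not part of the original derivation; it assumes only the stated moments of the noisy vector, namely mean (1/n)1 and covariance κ²I.

```latex
\begin{aligned}
\mathbb{E}\big[\tilde z^{\top} A^{\top} A \,\tilde z\big]
  &= \operatorname{Tr}\!\big(A^{\top} A \,\operatorname{Cov}[\tilde z]\big)
     + \mathbb{E}[\tilde z]^{\top} A^{\top} A \,\mathbb{E}[\tilde z] \\
  &= \kappa^{2} \operatorname{Tr}(A^{\top} A) + \|A z\|^{2}
   = \kappa^{2} \operatorname{Tr}(A^{\top} A) + \|\nabla_\theta L(\theta)\|^{2}.
\end{aligned}
```

Both terms are non-negative, so driving the left-hand side to zero requires the usual gradient norm and κ² Tr(AᵀA) to vanish together. This is the sense in which the mixed-gradient criterion additionally penalizes the trace.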

3. CONVERGENCE ANALYSIS

In this section, we provide the convergence analysis of optimization with GradientMix in Algorithm 1. Our goal is to find an $\epsilon$-stationary point for the problem in Eq. 1. For this purpose, we need the standard assumptions for the analysis of non-convex optimization:

(C-1) (L-smoothness) The loss function $f$ is differentiable, $L$-smooth, and lower-bounded: $\forall x, y$, $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$ and $f(x^*) > -\infty$ for the optimal solution $x^*$.
(C-2) (Bounded variance) The stochastic gradient $g_t$ at time $t$ is unbiased and has bounded variance: $\mathbb{E}_\xi[g_t] = \nabla f(\theta_t)$ and $\mathbb{E}_\xi \|g_t - \nabla f(\theta_t)\|^2 \le \sigma^2$, where $\xi$ represents the randomness from the data distribution.

The conditions (C-1) and (C-2) are standard in the analysis of non-convex optimization (Ghadimi & Lan, 2013; Zaheer et al., 2018; Chen et al., 2019; Yun et al., 2022). Also, GradientMix preserves the unbiasedness of the stochastic gradient, thereby satisfying the unbiasedness part of condition (C-2). Since Algorithm 1 presents a general framework with GradientMix, we specifically provide the convergence analysis for vanilla SGD with Nesterov momentum, one of the most widely used optimizers in the deep learning community. The update rule for SGD + Nesterov momentum with GradientMix is

$$\theta_{t+\frac{1}{2}} = \theta_t + \mu v_t, \qquad v_{t+1} = \mu v_t - \eta \sum_{b=1}^{B} \pi^{(b)}_t \nabla f_b(\theta_{t+\frac{1}{2}}), \qquad \theta_{t+1} = \theta_t + v_{t+1} \tag{13}$$

where $\mu$ denotes the momentum parameter and $\pi^{(b)}_t$ is the sampled noise of the $b$-th datapoint at time $t$ satisfying Eq. 12. We further define the quantities $\gamma_t$ and $\gamma$ as

$$\gamma_t = B \sum_{b=1}^{B} \big(\pi^{(b)}_t\big)^2, \qquad \gamma = \frac{1}{T} \sum_{t=0}^{T-1} \gamma_t$$

where $T$ is the total number of iterations. Note that $\gamma_t$ roughly measures the total amount of noise injected into the gradient at time $t$, and $\gamma$ is the average amount of noise over all iterations. We are now ready to state our main theorem.

Theorem 1 (Convergence for SGD + Nesterov momentum with GradientMix). Let $\theta_a$ denote an iterate chosen uniformly at random from $\{\theta_{\frac{1}{2}}, \ldots, \theta_{T+\frac{1}{2}}\}$. Under the conditions (C-1) and (C-2) with the stepsize $\eta \le \frac{2(1-\mu)^2}{L(\mu^3+1)}$, Algorithm 1 with SGD + Nesterov momentum yields

$$\mathbb{E}_a \|\nabla f(\theta_a)\|^2 \le O\left( \frac{L \Delta (\mu^3 + 1)}{T(1-\mu)} + \sqrt{\frac{2 L \Delta \sigma^2 \gamma}{B T (1-\mu)}} \right) \tag{14}$$

where $\Delta = f(\theta_0) - f(\theta^*)$ with optimal point $\theta^*$.

Remarks. Theorem 1 is a generalized version of Theorem 4.2 in Lin et al. (2020). Importantly, the second term in the upper bound in Eq. 14 is asymptotically dominant under the batch size condition $B = O\left( \frac{1-\mu}{(\mu^3+1)^2} \times \frac{\sigma^2 T \gamma}{L \Delta} \right)$. In this regime, we obtain a linear speedup as the batch size increases, with the total number of iterations of order $O\left( \frac{L \sigma^2 \Delta \gamma}{B \epsilon^2} \right)$ for an $\epsilon$-stationary point. Note that the second factor in the batch size condition involves the average noise $\gamma$, which attains its smallest value 1 (so $\gamma \ge 1$) when the sampled noise is $\pi^{(b)}_t = 1/B$ for all $b$ and $t$; this corresponds to the traditional averaged gradient. In other words, the mixed gradient with random noise $\pi^{(b)}_t$ always has a larger value of $\gamma$ than the averaged gradient, which in effect allows a larger critical batch size at the cost of more iterations. For this reason, GradientMix might show performance degradation under a tight budget of epochs, but tuning the variance parameter $\kappa$ in Eq. 12, which in turn affects $\gamma$, can balance the batch size against the total number of iterations. Since GradientMix does not hurt the unbiasedness of the stochastic gradient and only slightly increases its variance, one can show the convergence of each optimizer with GradientMix by building on previous results. We defer the convergence analysis of LARS and LAMB with GradientMix to the Appendix.
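The update in Eq. 13 and the noise quantity γ_t can be illustrated with a small sketch. The function names `noise_level` and `nesterov_gradientmix_step` are hypothetical, parameters are plain lists of floats, and the per-sample gradients come from toy quadratic losses chosen only for illustration; none of this is taken from the authors' code.

```python
def noise_level(pi):
    """gamma_t = B * sum_b pi_b**2 for one draw of mixing rates."""
    return len(pi) * sum(p * p for p in pi)


def nesterov_gradientmix_step(theta, v, pi, grad_fns, eta, mu):
    """One update of Eq. 13 on a list-of-floats parameter vector.

    grad_fns[b](theta) must return the gradient of the b-th sample loss.
    """
    # Look-ahead point theta_{t+1/2} = theta_t + mu * v_t.
    theta_half = [th + mu * vi for th, vi in zip(theta, v)]
    # Per-sample gradients evaluated at the look-ahead point.
    grads = [fn(theta_half) for fn in grad_fns]
    # Mixed gradient sum_b pi_b * grad_b.
    mixed = [sum(p * g[j] for p, g in zip(pi, grads)) for j in range(len(theta))]
    # Momentum and parameter updates.
    v_next = [mu * vi - eta * m for vi, m in zip(v, mixed)]
    theta_next = [th + vn for th, vn in zip(theta, v_next)]
    return theta_next, v_next


# Toy losses f_b(theta) = 0.5 * (theta - c_b)^2 with gradient theta - c_b.
grad_fns = [lambda th, c=c: [th[0] - c] for c in (1.0, 3.0)]
theta, v = [0.0], [0.0]
theta, v = nesterov_gradientmix_step(theta, v, [0.5, 0.5], grad_fns, eta=1.0, mu=0.0)
```

For noise satisfying Eq. 12, E[(π_b)²] = κ² + 1/B², so E[γ_t] = 1 + B²κ². This makes explicit how the variance parameter κ and the batch size B jointly control the extra noise term in Eq. 14.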

4. RELATED WORK

Large batch training. Many researchers have paid attention to large-scale training to speed up training while achieving accuracy as high as small minibatch training. To reduce the generalization gap, Goyal et al. (2017) propose a linear scaling rule that increases the learning rate proportionally to the minibatch size. As a large learning rate can cause a neural network to diverge, You et al. (2017) suggest Layer-wise Adaptive Rate Scaling (LARS), which scales the updates of each layer by the ratio of the ℓ2-norm of the weights to that of the gradients. Furthermore, You et al. (2019) apply this strategy to the Adam optimizer (Kingma & Ba, 2015), yielding LAMB, so as to accelerate the training of attention-based models, especially BERT, with large batch sizes. In contrast to You et al. (2017; 2019), Nado et al. (2021) show that standard optimizers without any layer-wise normalization are sufficient for large batch training given sufficient hyperparameter tuning. Apart from these, in order to narrow the generalization gap, Lin et al. (2020) employ an extragradient technique for smoothing and stabilizing optimization dynamics, and Johnson et al. (2020) propose AdaScale SGD, which reliably and automatically adjusts the learning rate depending on the variance of the gradients. However, all the methods mentioned above are based on the average of gradients, which is likely to preclude exploration during training.

Mixup. Mixup (Zhang et al., 2018) is one of the popular methods for enhancing the generalization and robustness of a neural network. Using mixing coefficients drawn from a beta distribution, Zhang et al. (2018) interpolate two inputs and their corresponding labels linearly to generate new data, and Verma et al. (2019) interpolate hidden representations linearly to obtain a smooth decision boundary. Following these works, several extensions (Yun et al., 2019; Kim et al., 2020) have been proposed that replace a region of one input with a patch from another input, but all the aforementioned studies are discussed only in the context of data augmentation.

Benefit of noise for network training. In recent years, there have been several attempts to understand the connection between noise and generalization. Zhu et al. (2019) showed that noise in stochastic gradients helps escape from sharp minima and converge to flat minima, which generalize well. Simsekli et al. (2019) analyzed the dynamics of stochastic gradient descent driven by noise and showed that stochastic gradient descent prefers wide minima, which perform better than narrow minima on the test set. Smith et al. (2020) verified that the noise in the stochastic gradient can improve generalization. Wu et al. (2020) showed that injecting multiplicative noise into the full-batch gradient generalizes as well as the usual stochastic gradient evaluated on a mini-batch, but the noise structure they consider is not applicable to reducing the trace of the curvature as discussed in Section 2.2. Also, the effect of their noise on generalization is guaranteed in theory only for linear regression. To enable more exploration in large batch training than the averaged gradient allows, our work is the first in-depth study that applies mixup to the local gradients using a suitable distribution, and in the next section we empirically show that injecting appropriate noise into the gradient can also bridge the generalization gap in large batch training.

5. EXPERIMENTS

We consider two sets of experiments. The first set examines the effectiveness of GradientMix on relatively small-scale problems, and the second set evaluates GradientMix at large scale. The details of the experimental settings are provided in each section.

5.1. CIFAR CLASSIFICATION

We train ResNet-18 (He et al., 2016) on the CIFAR datasets, one of the benchmark tasks in large batch training, using two sets of optimizers. The first set compares SGD with momentum and LARS, and the second set compares Adam and LAMB, since LARS and LAMB are extensions of SGD and Adam respectively. For large batch training, polynomial LR scheduling with gradual warmup (Nado et al., 2021) is recommended; we adopt it together with the linear LR scaling rule (Goyal et al., 2017), with the base LR $\eta_{\text{base}} = 0.1$ at batch size 200 for SGD and LARS. Similarly, we choose $\eta_{\text{base}} = 10^{-3}$ at batch size 200 for Adam and LAMB. Throughout this experiment, we train for 200 epochs in total, of which 20 are warmup epochs. The trust coefficient of LARS is fixed to $10^{-3}$. While our main goal is to measure how much GradientMix can improve performance in the large batch regime, we also include results for small batch sizes for better understanding.

Figure 1 illustrates the results for the CIFAR-10 dataset. GradientMix shows a consistent improvement in generalization for all optimizers considered across all batch sizes except the smallest batch size, 0.2K. This exception might be because a small batch already carries enough noise, in which case GradientMix can interfere with model generalization. An interesting point in Figure 1 is that, for all optimizers, the performance with GradientMix tends to increase slightly and then decrease as the batch size increases (see 0.2K ∼ 2K on the x-axis), while the generalization of all baselines only gets worse as the batch size grows. At large batch sizes such as 2K or 5K, which are our main focus, GradientMix shows consistent superiority over the baselines without overlapping error bars.

Similar dynamics can be seen in Figure 2 for the CIFAR-100 dataset. As in the CIFAR-10 experiments, all the optimizers with GradientMix achieve a large improvement over the baseline at the largest batch size, 5K. We also observe the same increase-then-decrease behavior in Figure 2 (see 0.2K ∼ 2K on the x-axis) for all optimizers considered.
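The linear LR scaling rule used in this section amounts to a one-line computation. The sketch below is illustrative only: the function name `scaled_lr` is hypothetical, and reading the batch sizes "2K" and "5K" as 2000 and 5000 samples is an assumption.

```python
def scaled_lr(base_lr, batch_size, base_batch_size=200):
    """Linear scaling rule: the LR grows proportionally to the batch size."""
    return base_lr * (batch_size / base_batch_size)


# Peak learning rates under the assumed batch sizes.
sgd_lr_5k = scaled_lr(0.1, 5000)    # base LR 0.1 for SGD and LARS
adam_lr_2k = scaled_lr(1e-3, 2000)  # base LR 1e-3 for Adam and LAMB
```

Under these assumptions, a batch size of 5000 is 25 times the base batch size, so the base LR of 0.1 scales to a peak LR of 2.5 for SGD and LARS.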

5.2. TRANSFORMER ON MULTI30K

To evaluate GradientMix on a broader range of deep learning problems, we consider a language modeling task. To this end, we train the Transformer base model (Vaswani et al., 2017) on the Multi30k dataset (Elliott et al., 2016). Following the experimental settings in Lin et al. (2020), we employ the linear LR scaling and gradual warmup scheme (Goyal et al., 2017) with inverse square root scheduling (Vaswani et al., 2017). The number of warmup steps is set to 4000 for the base batch size 64 and is decreased linearly with the batch size. Adaptive gradient methods are known to work well for attention models, so we compare the Adam and LAMB optimizers in this experiment. Figure 3 compares the validation accuracy. Similar to the results in Section 5.1, GradientMix tends to be more effective as the batch size increases for both Adam and LAMB. Importantly, GradientMix achieves large improvements at large batch sizes such as 2K and 5K. Combined with the results in Section 5.1, we believe that GradientMix has the potential to improve generalization for other tasks and optimizers in deep learning.

5.3. RESNET-50 ON IMAGENET WITH 32K BATCH SIZE

Training ResNet-50 on ImageNet is considered a standard benchmark in large batch optimization, so we evaluate GradientMix on this task using various optimization algorithms. In large batch training, several regularization techniques and hyperparameters must be chosen carefully to achieve performance comparable to small batch training. We therefore first describe the experimental settings for each optimizer.

The recent study of Nado et al. (2021) empirically shows that traditional optimizers such as Nesterov can be competitive with optimizers specifically designed for large batch training, such as LARS or LAMB, given the same effort in hyperparameter tuning. Inspired by this work, we investigate whether GradientMix can further enhance generalization even under highly fine-tuned hyperparameters. As recommended in Nado et al. (2021), weight decay is applied only to the network weight parameters, excluding the bias and batch normalization parameters (You et al., 2017; Goyal et al., 2017). One of the most essential tricks for improving generalization (He et al., 2019b; Nado et al., 2021) is to tune the initial scale parameter $\gamma_0$ of the final batch normalization layer of each residual block (Goyal et al., 2017). We use polynomial LR scheduling (Nado et al., 2021) and tune each hyperparameter. The hyperparameter details are summarized in the Appendix.

For Adam and LAMB, unfortunately, there is no previous reference with highly fine-tuned hyperparameters for ResNet-50 training, so we follow the experimental pipeline of You et al. (2019). We use the momentum parameters $(\beta_1, \beta_2) = (0.9, 0.999)$ with $\epsilon = 10^{-6}$ for both Adam and LAMB as in You et al. (2019), and we employ polynomial LR scheduling with gradual warmup (Nado et al., 2021). The number of warmup epochs $t_{\text{warmup}}$ is set to 20 out of 90 total epochs.

Table 2 reports the Top-1 accuracy of each method for 32K batch size. For all baseline optimizers, GradientMix achieves state-of-the-art generalization both under highly fine-tuned hyperparameters (see Nesterov and LARS) and under standard hyperparameter settings (see Adam and LAMB). As seen in Table 2, the performance gain of GradientMix with Nesterov and Adam is slightly larger than that with LARS and LAMB respectively. This might be because LARS and LAMB are already designed specifically for large batch training to some extent; nevertheless, our results indicate that GradientMix is indeed optimizer-agnostic. We emphasize that our improvement (without overlapping confidence intervals) is not marginal, since the baselines are already highly optimized for the ResNet-50 architecture. Our improvements are also larger than those achieved in recent studies on regularization in the large batch regime, such as Yuan et al. (2020) and Huo et al. (2021).

6. CONCLUSION

We proposed GradientMix, an optimizer-agnostic and effective regularization for large-batch training, which mixes up local gradients with arbitrarily sampled noise. We also showed that GradientMix could reduce the sharpness of the loss landscape. Finally, we empirically verified the effectiveness of GradientMix on various datasets and models and achieved state-of-the-art performance on the benchmark task of training ResNet-50 on the ImageNet dataset. As future work, we plan to investigate, both in theory and in practice, how various types of noise affect the performance of large-batch training.

SUPPLEMENTARY MATERIALS

A HYPERPARAMETER DETAILS FOR NESTEROV AND LARS IN SECTION 5.3

As described in Section 5.3, we follow the same hyperparameter settings as Nado et al. (2021) for the Nesterov and LARS optimizers. First, we employ the polynomial learning rate scheduling designed for large-batch training:

$$\eta_t = \begin{cases} \eta_{\text{init}} + (\eta_{\text{peak}} - \eta_{\text{init}}) \left( \dfrac{t}{t_{\text{warmup}}} \right)^{p_{\text{warmup}}}, & t \le t_{\text{warmup}} \\[2ex] \eta_{\text{final}} + (\eta_{\text{peak}} - \eta_{\text{final}}) \left( \dfrac{T - t}{T - t_{\text{warmup}}} \right)^{p_{\text{decay}}}, & t > t_{\text{warmup}} \end{cases} \tag{15}$$

We train ResNet-50 on the ImageNet dataset with the batch size $B = 32768$ (32K). We summarize the detailed values for each hyperparameter introduced in Section 5.3 in the following table.
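The schedule in Eq. 15 translates directly into code. The sketch below is a minimal illustration: the function name `polynomial_lr` is hypothetical, and the numeric values in the usage lines are placeholders chosen for readability, not the tuned hyperparameters of the paper.

```python
def polynomial_lr(t, total_steps, warmup_steps, eta_init, eta_peak, eta_final,
                  p_warmup=1.0, p_decay=2.0):
    """Learning rate at step t following the piecewise polynomial rule in Eq. 15.

    Assumes 0 < warmup_steps < total_steps and 0 <= t <= total_steps.
    """
    if t <= warmup_steps:
        # Warmup phase: rise from eta_init to eta_peak.
        return eta_init + (eta_peak - eta_init) * (t / warmup_steps) ** p_warmup
    # Decay phase: fall from eta_peak to eta_final.
    frac = (total_steps - t) / (total_steps - warmup_steps)
    return eta_final + (eta_peak - eta_final) * frac ** p_decay


# Placeholder setting: 90 epochs with 20 warmup epochs, peak LR 8.0.
schedule = [polynomial_lr(t, 90, 20, 0.0, 8.0, 0.0) for t in range(91)]
```

The two branches agree at t = t_warmup, where both evaluate to η_peak, so the schedule is continuous at the end of warmup.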

B PROOF OF THEOREM 1

Our analysis is based on the convergence proof of SGD + Nesterov momentum Lin et al. (2020) . We first derive the convergence for batch-wise GradientMix and then present the analysis for sample-wise GradientMix. Recall that the iterate of SGD + Nesterov momentum with GradientMix is θ t+ 1 2 = θt + µvt, vt+1 = µvt - η B K k=1 π (k) t B b=1 ∇f k b (θ t+ 1 2 ), θt+1 = θt + vt+1 where π (k) t means that the sampled probability of k-th device at time t and µ is the momentum parameter. For simplicity, we define the gt as gt = 1 B K k=1 π (k) t B b=1 ∇f k b (θt) Then, the momentum update would become vt+1 = µvt -ηg t+ 1 2 We also define the quantities ζt and ζ as ζt = K k=1 (π (k) t ) 2 , ζ = 1 T T -1 t=0 ζt The main changes in our analysis is how to bound the term E 1 T T -1 t=0 ∥g t+ 1 2 ∥ 2 . Toward this, we have E g t+ 1 2 2 = E 1 B K k=1 π (k) t B b=1 ∇f k b (θ t+ 1 2 ) 2 (16) = Var 1 B K k=1 π (k) t B b=1 ∇f k b (θ t+ 1 2 ) + E 1 B K k=1 π (k) t B b=1 ∇f k b (θ t+ 1 2 ) 2 (17) = ζt B σ 2 + ∇f (θ t+ 1 2 ) 2 Hence, we obtain E 1 T T -1 t=0 g t+ 1 2 2 ≤ ζ B σ 2 + 1 T T -1 t=0 ∇f (θ t+ 1 2 ) 2 Following the proofs in Lin et al. (2020) , we define an auxiliary sequence yt as yt = x 1 2 = x0 if t = 0 1 1-µ θ t+ 1 2 -µ 1-µ θ t-1 2 + ηµ 1-µ g t-1 2 if t ≥ 1 (20) Lemma 1 (Lemma A.1 in Lin et al. (2020) ). The sequence {yt} in equation 20 satisfies yt+1 -yt = - η 1 -η g t+ 1 2 Lemma 2 (Lemma A.2 in Lin et al. (2020) ). For a sequence {x t+ 1 2 } for t ≥ 0, the following holds T -1 t=0 yt -θ t+ 1 2 2 ≤ µ 4 η 2 (1 -µ) 4 T -1 t=0 g t+ 1 2 2 Now, we derive the convergence bound. From smoothness condition, we have E[f (yt+1) -f (yt)] ≤ E ⟨∇f (yt), yt+1 -yt⟩ + L 2 ∥yt+1 -yt∥ 2 = E - η 1 -µ ∇f (θ t+ 1 2 ) 2 - η 1 -µ ∇f (yt) -∇f (x t+ 1 2 ), ∇f (θ t+ 1 2 ) + L 2 η 1 -µ g t+ 1 2 2 by Lemma 1. 
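Further, we have

$$
\begin{aligned}
-\frac{\eta}{1-\mu}\Big\langle \nabla f(y_t) - \nabla f(\theta_{t+\frac{1}{2}}),\, \nabla f(\theta_{t+\frac{1}{2}})\Big\rangle
&= -\bigg\langle \frac{\sqrt{1-\mu}}{\sqrt{L}\,\mu^{3/2}}\Big(\nabla f(y_t) - \nabla f(\theta_{t+\frac{1}{2}})\Big),\, \frac{\eta\sqrt{L}\,\mu^{3/2}}{(1-\mu)^{3/2}}\nabla f(\theta_{t+\frac{1}{2}})\bigg\rangle\\
&\le \frac{1-\mu}{2L\mu^3}\big\|\nabla f(y_t) - \nabla f(\theta_{t+\frac{1}{2}})\big\|^2 + \frac{\eta^2 L\mu^3}{2(1-\mu)^3}\big\|\nabla f(\theta_{t+\frac{1}{2}})\big\|^2
\end{aligned}
$$

by the inequality $\langle x, y\rangle \le \frac{1}{2}(\|x\|^2 + \|y\|^2)$. Then, we get

$$
\begin{aligned}
\mathbb{E}[f(y_{t+1}) - f(y_t)]
&\le \mathbb{E}\bigg[-\frac{\eta}{1-\mu}\big\|\nabla f(\theta_{t+\frac{1}{2}})\big\|^2 + \frac{1-\mu}{2L\mu^3}\big\|\nabla f(y_t) - \nabla f(\theta_{t+\frac{1}{2}})\big\|^2 + \frac{\eta^2 L\mu^3}{2(1-\mu)^3}\big\|\nabla f(\theta_{t+\frac{1}{2}})\big\|^2 + \frac{\eta^2 L}{2(1-\mu)^2}\big\|g_{t+\frac{1}{2}}\big\|^2\bigg]\\
&\le \mathbb{E}\bigg[\Big(-\frac{\eta}{1-\mu} + \frac{\eta^2 L\mu^3}{2(1-\mu)^3}\Big)\big\|\nabla f(\theta_{t+\frac{1}{2}})\big\|^2 + \frac{(1-\mu)L}{2\mu^3}\big\|y_t - \theta_{t+\frac{1}{2}}\big\|^2 + \frac{\eta^2 L}{2(1-\mu)^2}\big\|g_{t+\frac{1}{2}}\big\|^2\bigg]
\end{aligned}
$$

where the second inequality uses the $L$-smoothness of $f$. By telescoping over $t = 0 \sim T-1$ and applying Lemma 2, we obtain

$$
\begin{aligned}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}[f(y_{t+1}) - f(y_t)]
&= \frac{1}{T}\mathbb{E}[f(y_T) - f(y_0)]\\
&\le \Big(-\frac{\eta}{1-\mu} + \frac{L\eta^2\mu^3}{2(1-\mu)^3}\Big)\frac{1}{T}\sum_{t=0}^{T-1}\big\|\nabla f(\theta_{t+\frac{1}{2}})\big\|^2\\
&\quad + \mathbb{E}\bigg[\frac{(1-\mu)L\eta^2}{2\mu^3}\cdot\frac{\mu^4}{(1-\mu)^4}\cdot\frac{1}{T}\sum_{t=0}^{T-1}\big\|g_{t+\frac{1}{2}}\big\|^2\bigg] + \mathbb{E}\bigg[\frac{\eta^2 L}{2(1-\mu)^2}\cdot\frac{1}{T}\sum_{t=0}^{T-1}\big\|g_{t+\frac{1}{2}}\big\|^2\bigg]
\end{aligned}
$$

Here, we apply our derivation in equation 18 and then have

$$
\begin{aligned}
\frac{1}{T}\mathbb{E}[f(y_T) - f(y_0)]
&\le \Big(-\frac{\eta}{1-\mu} + \frac{L\eta^2\mu^3}{2(1-\mu)^3} + \frac{\eta^2 L}{2(1-\mu)^2} + \frac{L\mu\eta^2}{2(1-\mu)^3}\Big)\frac{1}{T}\sum_{t=0}^{T-1}\big\|\nabla f(\theta_{t+\frac{1}{2}})\big\|^2\\
&\quad + \Big(\frac{\eta^2 L}{2(1-\mu)^2} + \frac{L\mu\eta^2}{2(1-\mu)^3}\Big)\frac{\zeta\sigma^2}{B}
\end{aligned}
$$

By rearranging all the terms, we have

$$
\frac{1}{T}\sum_{t=0}^{T-1}\big\|\nabla f(\theta_{t+\frac{1}{2}})\big\|^2 \le \frac{1}{1 - \frac{L\eta(\mu^3+1)}{2(1-\mu)^2}}\bigg(\frac{\Delta}{T\frac{\eta}{1-\mu}} + \frac{\eta L}{2(1-\mu)^2}\cdot\frac{\zeta\sigma^2}{B}\bigg)
$$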
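As a sanity check on the auxiliary sequence, the identity in Lemma 1 can be verified numerically. The sketch below runs the Nesterov iterate with mixed gradients on a toy one-dimensional problem and measures how far $y_{t+1} - y_t$ deviates from $-\frac{\eta}{1-\mu} g_{t+\frac{1}{2}}$. Everything specific here is an assumption made for illustration: the quadratic losses $f_b^k(\theta) = \frac{1}{2}(\theta - c_b^k)^2$, the uniform-on-the-simplex sampling of $\pi_t$, and the helper names.

```python
import random


def sample_simplex(k, rng):
    """Draw k non-negative weights that sum to 1 (assumed sampling scheme)."""
    raw = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(raw)
    return [r / total for r in raw]


def mixed_gradient(theta, targets, weights):
    """g = (1/B) * sum_k pi_k * sum_b grad f_b^k(theta) for the toy loss
    f_b^k(theta) = 0.5 * (theta - c_b^k)^2, whose gradient is theta - c_b^k."""
    batch = len(targets[0])
    return sum(w * sum(theta - c for c in dev)
               for w, dev in zip(weights, targets)) / batch


def lemma1_residual(steps=50, devices=4, batch=8, eta=0.05, mu=0.9, seed=0):
    """Return max_t |(y_{t+1} - y_t) + eta / (1 - mu) * g_{t+1/2}|."""
    rng = random.Random(seed)
    theta, v = 1.0, 0.0
    prev_half = prev_g = prev_y = None
    worst = 0.0
    for t in range(steps):
        half = theta + mu * v                      # theta_{t+1/2}
        if t == 0:
            y = half                               # y_0 = theta_{1/2} = theta_0 since v_0 = 0
        else:
            # Auxiliary sequence from equation 20.
            y = (half - mu * prev_half + eta * mu * prev_g) / (1.0 - mu)
            worst = max(worst, abs((y - prev_y) + eta / (1.0 - mu) * prev_g))
        targets = [[rng.gauss(0.0, 1.0) for _ in range(batch)]
                   for _ in range(devices)]
        pi = sample_simplex(devices, rng)
        g = mixed_gradient(half, targets, pi)      # g_{t+1/2}
        v = mu * v - eta * g
        theta = theta + v
        prev_half, prev_g, prev_y = half, g, y
    return worst


if __name__ == "__main__":
    print(lemma1_residual())
```

The residual should be at the level of floating-point round-off, because the identity is purely algebraic and does not depend on how the weights $\pi_t^{(k)}$ are drawn.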

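In order to get the final results, we need the following lemma from Lin et al. (2020).

Lemma 3 (Lemma A.4 in Lin et al. (2020)). For every non-negative sequence $\{r_t\}_{t\ge 0}$ and any parameters $d \ge 0$, $c \ge 0$, and $T \ge 0$, there exists a constant $\eta \le 1/d$ such that for any constant stepsizes $\eta_t = \eta$, it holds

$$
\Psi_T := \frac{1}{T+1}\sum_{t=0}^{T}\Big(\frac{r_t}{\eta_t} - \frac{r_{t+1}}{\eta_t} + c\eta_t\Big) \le \frac{d\Delta}{\eta(T+1)} + c\eta
$$

Then, using $\Psi'_T := \dfrac{\Delta}{T\frac{\eta}{1-\mu}} + \dfrac{\eta L}{2(1-\mu)^2}\cdot\dfrac{\zeta\sigma^2}{B}$ in Lemma 3, we arrive at the following result by a case study on whether $\dfrac{\Delta B}{L\sigma^2\zeta T} \le \dfrac{1}{L^2}$ or $> \dfrac{1}{L^2}$, similar to Lin et al. (2020):

$$
\mathbb{E}\bigg[\frac{1}{T}\sum_{t=0}^{T-1}\big\|\nabla f(\theta_{t+\frac{1}{2}})\big\|^2\bigg] = O\bigg(\frac{L\Delta(\mu^3+1)}{T(1-\mu)} + \sqrt{\frac{2L\Delta\sigma^2\zeta}{BT(1-\mu)}}\bigg)
$$

The square-root term is the value obtained by balancing the two terms of $\Psi'_T$ over $\eta$. Now, if we set $K = B_0$ and $B = 1$ for some batch size $B_0$, we obtain the convergence of sample-wise GradientMix. Going further, for batch size $B_0$, the noise quantity $\zeta_t$ in equation 16 should be at least of order $1/B_0$; in other words,

$$
\zeta_t = \sum_{b=1}^{B_0}\big(\pi_t^{(b)}\big)^2 \ge \frac{1}{B_0} = \Omega(1/B_0)
$$

since $\zeta_t$ is smallest when all the sampled probabilities satisfy $\pi_t^{(b)} = 1/B_0$ (by the Cauchy-Schwarz inequality). For a clear analysis, in order to remove the dependency on the batch size $B_0$, we define the following quantities:

$$
\gamma_t = B_0\sum_{b=1}^{B_0}\big(\pi_t^{(b)}\big)^2, \qquad \gamma = \frac{1}{T}\sum_{t=0}^{T-1}\gamma_t
$$

Then, the final convergence bound for Nesterov with sample-wise GradientMix is

$$
\mathbb{E}\bigg[\frac{1}{T}\sum_{t=0}^{T-1}\big\|\nabla f(\theta_{t+\frac{1}{2}})\big\|^2\bigg] = O\bigg(\frac{L\Delta(\mu^3+1)}{T(1-\mu)} + \sqrt{\frac{2L\Delta\sigma^2\gamma}{B_0T(1-\mu)}}\bigg)
$$

for the batch size $B_0$ (in effect, $\zeta$ is replaced with $\gamma/B_0$).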
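The lower bound $\zeta_t \ge 1/B_0$, equivalently $\gamma_t \ge 1$, is easy to check numerically. The sketch below computes both quantities for a few weight vectors. Drawing the weights from a symmetric Dirichlet distribution is an assumption made only for this illustration, as are the function names; the bound itself holds for any point on the simplex.

```python
import random


def zeta(weights):
    """zeta_t = sum_b pi_b^2 for one vector of mixing weights."""
    return sum(w * w for w in weights)


def gamma(weights):
    """gamma_t = B0 * zeta_t, which equals 1 for uniform weights."""
    return len(weights) * zeta(weights)


def dirichlet(k, alpha, rng):
    """Sample from a symmetric Dirichlet(alpha) by normalizing gamma draws
    (assumed sampling scheme, used here only for illustration)."""
    raw = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(raw)
    return [r / total for r in raw]


if __name__ == "__main__":
    rng = random.Random(0)
    k = 8
    print("uniform:", gamma([1.0 / k] * k))
    for alpha in (0.5, 1.0, 10.0):
        avg = sum(gamma(dirichlet(k, alpha, rng)) for _ in range(1000)) / 1000
        print("alpha =", alpha, "average gamma_t =", avg)
```

Uniform weights recover plain gradient averaging with $\gamma_t = 1$, while a one-hot weight vector gives the largest value $\gamma_t = B_0$, so $\gamma$ measures how much extra noise the mixing injects relative to averaging.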

B.1 CONVERGENCE ANALYSIS OF LARS WITH GradientMix

In this section, we analyze the convergence of LARS/LAMB with gradient mixup, building on the analysis in You et al. (2019). We state the conditions specific to LARS and LAMB as follows. We assume that the loss function $f(\cdot)$ is $L_l$-smooth with respect to the parameter of the $l$-th layer $\theta^{(l)}$ for $l \in [h]$, which means

$$
\|\nabla_l f(x, s) - \nabla_l f(y, s)\| \le L_l\|x^{(l)} - y^{(l)}\|, \quad \forall x, y \in \mathbb{R}^d
$$

where $s$ represents a random datapoint from the data distribution. We use $L = (L_1, \cdots, L_h)^T$. We also assume bounded variance of the stochastic gradient:

$$
\mathbb{E}\big[\|\nabla_l f(x, s) - \nabla_l f(x)\|^2\big] \le \sigma_l^2, \quad \forall x \in \mathbb{R}^d
$$

for all layers $l \in [h]$. Furthermore, as in You et al. (2019), we assume that

$$
\mathbb{E}\big[\|[\nabla f(x, s)]_j - [\nabla f(x)]_j\|^2\big] \le \tilde{\sigma}_j^2, \quad \forall x \in \mathbb{R}^d
$$

for $j \in [d]$. Under these conditions, we use the following notation for simplicity:

$$
\sigma = (\sigma_1, \cdots, \sigma_h)^T, \qquad \tilde{\sigma} = (\tilde{\sigma}_1, \cdots, \tilde{\sigma}_d)^T
$$

Lastly, we assume the bounded gradient condition $[\nabla f(x, s)]_j \le G$ for all $j \in [d]$ and $x \in \mathbb{R}^d$. Throughout, $\Delta = f(\theta_1) - f(\theta^*)$, where $\theta^*$ is the optimal point. Recall that the iterate of LARS (You et al., 2019) is

$$
\theta_{t+1}^{(l)} = \theta_t^{(l)} - \eta\,\phi\big(\|\theta_t^{(l)}\|\big)\frac{g_t^{(l)}}{\|g_t^{(l)}\|} \tag{24}
$$
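The layer-wise LARS iterate with a sample-wise mixed gradient can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: it uses the clipping choice $\phi(z) = \min\{\max\{z, \gamma_l\}, \gamma_u\}$ and the mixed gradient $g_t = \sum_b \pi_t^{(b)} \nabla f(\theta_t, s_b)$ from this analysis, represents each layer as a plain list of floats, and omits momentum and weight decay. The function names and the zero-gradient guard are introduced here.

```python
import math


def clip_phi(z, gamma_l, gamma_u):
    """Scaling function phi(z) = min(max(z, gamma_l), gamma_u)."""
    return min(max(z, gamma_l), gamma_u)


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def mix_gradients(per_sample_grads, weights):
    """Mixed gradient g = sum_b pi_b * grad_b, computed layer by layer.

    per_sample_grads[b][l] is the layer-l gradient (list of floats) of sample b;
    weights is a point on the unit simplex with one entry per sample.
    """
    num_layers = len(per_sample_grads[0])
    mixed = []
    for l in range(num_layers):
        dim = len(per_sample_grads[0][l])
        mixed.append([sum(w * g[l][j] for w, g in zip(weights, per_sample_grads))
                      for j in range(dim)])
    return mixed


def lars_step(params, grads, eta, gamma_l, gamma_u):
    """One LARS update: theta_l <- theta_l - eta * phi(||theta_l||) * g_l / ||g_l||."""
    new_params = []
    for theta_l, g_l in zip(params, grads):
        g_norm = l2_norm(g_l)
        if g_norm == 0.0:
            # Guard added for this sketch: leave the layer unchanged.
            new_params.append(list(theta_l))
            continue
        scale = eta * clip_phi(l2_norm(theta_l), gamma_l, gamma_u) / g_norm
        new_params.append([p - scale * g for p, g in zip(theta_l, g_l)])
    return new_params


if __name__ == "__main__":
    params = [[3.0, 4.0]]
    per_sample = [[[0.0, 4.0]], [[0.0, 0.0]]]
    g = mix_gradients(per_sample, [0.5, 0.5])
    print(lars_step(params, g, eta=0.5, gamma_l=0.1, gamma_u=1.0))
```

Because the gradient is normalized per layer, each layer moves by exactly $\eta\,\phi(\|\theta_t^{(l)}\|)$ in norm, so the mixing weights affect only the direction of the step, not its length.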

B.2 CONVERGENCE ANALYSIS OF LAMB WITH GradientMix

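In this section, we provide the convergence analysis of LAMB with the update rule of You et al. (2019). As in You et al. (2019), we deal with the case of $\beta_1 = 0$ and $\lambda = 0$, but $\beta_2 > 0$, and we denote the $j$-th coordinate of the resulting update vector by $r_{t,j}$. Here, the mixed gradient has a different bound from the averaged gradient:

$$
\mathbb{E}[T_1] \le -\eta\alpha_l\sqrt{\frac{L(1-\beta_2)}{G^2 d}}\,\|\nabla f(\theta_t)\|^2 + \eta\alpha_u\zeta_t\sum_{l=1}^{h}\sum_{j=1}^{d_l}\tilde{\sigma}_{l,j}
$$

Therefore, we have

$$
\mathbb{E}[f(\theta_{t+1})] \le f(\theta_t) - \eta\alpha_l\sqrt{\frac{L(1-\beta_2)}{G^2 d}}\,\|\nabla f(\theta_t)\|^2 + \eta\alpha_u\zeta_t\|\tilde{\sigma}\|_1 + \frac{\eta^2\alpha_u^2}{2}\|L\|_1
$$

Telescoping over $t = 1 \sim T$ yields

$$
\mathbb{E}[f(\theta_{T+1})] \le f(\theta_1) - \eta\alpha_l\sqrt{\frac{L(1-\beta_2)}{G^2 d}}\sum_{t=1}^{T}\mathbb{E}\big[\|\nabla f(\theta_t)\|^2\big] + \eta T\alpha_u\zeta\|\tilde{\sigma}\|_1 + \frac{\eta^2\alpha_u^2 T}{2}\|L\|_1
$$

Finally, we have the following convergence guarantee for the LAMB optimizer with gradient mixup:

$$
\sqrt{\frac{L(1-\beta_2)}{G^2 d}}\cdot\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[\|\nabla f(\theta_t)\|^2\big] \le \frac{\Delta}{T\eta\alpha_l} + \frac{\alpha_u\zeta\|\tilde{\sigma}\|_1}{\alpha_l} + \frac{\eta\alpha_u^2}{2\alpha_l}\|L\|_1
$$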



Figure 1: Results on training ResNet-18 on the CIFAR-10 dataset with GradientMix.

Figure 3: Results on training Transformer on Multi30k dataset with GradientMix.

In the LARS update (equation 24), $l \in [L]$ ranges over all layers and $\theta_t^{(l)}$ represents the parameter of the $l$-th layer at time $t$. The scaling function $\phi(\cdot)$ satisfies $\alpha_l \le \phi(\cdot) \le \alpha_u$, and in practice we use just a clipping function $\phi(z) = \min\{\max\{z, \gamma_l\}, \gamma_u\}$. Here, the stochastic gradient $g_t$ with gradient mixup is computed as

$$
g_t = \sum_{b=1}^{B}\pi_t^{(b)}\nabla f(\theta_t, s_b) \tag{25}
$$

where $s_b$ denotes the $b$-th sample in a mini-batch of size $B$ and $\{\pi_t^{(b)}\}_{b=1}^{B}$ lies on the unit simplex at time $t$. Note that this is the more general setting of gradient mixup: datapoint-wise gradient mixup covers the case of minibatch-wise gradient mixup. We define the quantities $\zeta_t$ and $\zeta$ similarly to the Nesterov momentum case. Here, $\nabla_l f(\theta_t)$ denotes the gradient of the loss function computed at time $t$ with respect to the parameter $\theta_t^{(l)}$ of the $l$-th layer. The above term can be easily bounded under the bounded variance condition, as in inequality (6) in Appendix A of You et al. (2019), with our equation 31. This gives a one-step bound on $\mathbb{E}[f(\theta_{t+1})]$ in terms of $f(\theta_t)$, and telescoping over $t = 1 \sim T$ gives a bound on $\mathbb{E}[f(\theta_{T+1})]$ in terms of $f(\theta_1)$.

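Similar to the convergence of LARS, the major change in our analysis is in how to bound the term from inequality (7) in the Appendix of You et al. (2019). That term involves the scaling $\phi(\|\theta_t\|)$, the gradient coordinates $[\nabla_l f(\theta_t)]_j$, and the sign-mismatch probability $\mathbb{P}\big(\mathrm{sign}([\nabla_l f(\theta_t)]_j) \ne \mathrm{sign}(g_{t,j}^{(l)})\big)$.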

The trace comparison for trained ResNet-18 on CIFAR-10 dataset.

Top-1 accuracy with GradientMix using 32K batch size for training ResNet-50 on ImageNet.

The highly fine-tuned hyperparameter configuration for training ResNet-50 on ImageNet dataset for Nesterov and LARS optimizers.

