STABLE WEIGHT DECAY REGULARIZATION

Abstract

Weight decay is a popular regularization technique for training deep neural networks. Modern deep learning libraries mainly use L2 regularization as the default implementation of weight decay. Loshchilov & Hutter (2018) demonstrated that L2 regularization is not identical to weight decay for adaptive gradient methods, such as Adaptive Momentum Estimation (Adam), and proposed Adam with Decoupled Weight Decay (AdamW). However, we found that the popular implementations of weight decay in modern deep learning libraries, including both L2 regularization and decoupled weight decay, usually damage performance. First, L2 regularization is an unstable form of weight decay for all optimizers that use momentum, such as stochastic gradient descent (SGD). Second, decoupled weight decay is highly unstable for all adaptive gradient methods. We further propose the Stable Weight Decay (SWD) method to fix the unstable weight decay problem from a dynamical perspective. The proposed SWD method makes significant improvements over L2 regularization and decoupled weight decay in our experiments. Simply fixing weight decay in Adam by SWD, with no extra hyperparameter, can outperform complex Adam variants that have more hyperparameters.

1. INTRODUCTION

Weight decay is a popular and even necessary regularization technique for training deep neural networks that generalize well (Krogh & Hertz, 1992). L2 regularization is commonly used as "weight decay" for training deep neural networks and is interpreted as a Gaussian prior over the model weights. This is true for vanilla SGD. However, Loshchilov & Hutter (2018) revealed that, when the learning rate is adaptive, the commonly used L2 regularization is not identical to the vanilla weight decay proposed by Hanson & Pratt (1989): θ_t = (1 − λ0) θ_{t−1} − η g_t, where λ0 is the weight decay hyperparameter, θ_t is the model parameter vector at the t-th step, η is the learning rate, and g_t is the gradient of the minibatch loss function L(θ) at θ_{t−1}. Zhang et al. (2018) revealed three different roles of weight decay, but quantitative measures for weight decay are still missing.

Adaptive gradient methods that use an adaptive learning rate, such as AdaGrad (Duchi et al., 2011), RMSprop (Hinton et al., 2012), Adadelta (Zeiler, 2012), and Adam (Kingma & Ba, 2015), are among the most popular methods for accelerating the training of deep neural networks. Loshchilov & Hutter (2018) reported that, in adaptive gradient methods, the correct implementation of weight decay should be applied to the weights directly and decoupled from the gradients. We show the different implementations of decoupled weight decay and L2 regularization for Adam in Algorithm 3.

It has been widely observed that adaptive gradient methods usually do not generalize as well as SGD (Wilson et al., 2017). A few Adam variants tried to fix hidden problems in adaptive gradient methods, including AdamW (Loshchilov & Hutter, 2018), AMSGrad (Reddi et al., 2019), and Yogi (Zaheer et al., 2018). A recent line of research, including AdaBound (Luo et al., 2019), Padam (Chen & Gu, 2018), and RAdam (Liu et al., 2019), argues that controlling the adaptivity of learning rates may improve generalization. This line of research usually introduces extra hyperparameters to control the adaptivity, which requires more effort in hyperparameter tuning.

Although popular optimizers have achieved great empirical success for training deep neural networks, we discover that nonoptimal weight decay implementations have been widely used in
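To make the distinction concrete, the two weight decay implementations discussed above can be contrasted in code. The following is a minimal illustrative sketch, not the paper's implementation: hypothetical single-step update functions for SGD with momentum, one applying L2 regularization (decay added to the gradient, so it enters the momentum buffer) and one applying decoupled weight decay in the Hanson & Pratt style (weights shrunk directly, bypassing the buffer).

```python
import numpy as np

def sgd_momentum_l2(theta, grad, buf, lr=0.1, momentum=0.9, wd=1e-2):
    """L2 regularization: the decay term is folded into the gradient,
    so it also accumulates in the momentum buffer."""
    g = grad + wd * theta           # decay coupled with the gradient
    buf = momentum * buf + g        # decay enters the momentum buffer
    theta = theta - lr * buf
    return theta, buf

def sgd_momentum_decoupled(theta, grad, buf, lr=0.1, momentum=0.9, wd=1e-2):
    """Decoupled weight decay: the weights are multiplied by (1 - wd)
    directly, bypassing the momentum buffer entirely."""
    buf = momentum * buf + grad
    theta = (1.0 - wd) * theta - lr * buf
    return theta, buf

theta = np.array([1.0, -2.0])
grad = np.array([0.5, 0.5])
buf = np.zeros(2)
t1, _ = sgd_momentum_l2(theta, grad, buf.copy())
t2, _ = sgd_momentum_decoupled(theta, grad, buf.copy())
# t1 == theta * (1 - lr*wd) - lr*grad, while t2 == theta * (1 - wd) - lr*grad:
# the effective shrinkage differs already at step one, and the gap grows
# over subsequent steps as past decay terms are re-amplified by momentum.
```

For vanilla SGD (no momentum, no adaptivity) the two variants coincide after rescaling the decay coefficient; with momentum or adaptive per-coordinate learning rates they produce genuinely different trajectories, which is the gap the paper sets out to analyze.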

