ADMETA: A NOVEL DOUBLE EXPONENTIAL MOVING AVERAGE TO ADAPTIVE AND NON-ADAPTIVE MOMENTUM OPTIMIZERS WITH BIDIRECTIONAL LOOKING

Abstract

The optimizer is an essential component of successful deep learning: it guides the neural network in updating its parameters according to the loss on the training set. SGD and Adam are two classical and effective optimizers on which researchers have proposed many variants, such as SGDM and RAdam. In this paper, we combine the backward-looking and forward-looking aspects of optimizer design and propose a novel ADMETA (A Double exponential Moving averagE To Adaptive and non-adaptive momentum) optimizer framework. For the backward-looking part, we propose a DEMA variant scheme, motivated by a metric from the stock market, to replace the common exponential moving average scheme. For the forward-looking part, we present a dynamic lookahead strategy whose weight asymptotically approaches a set value, maintaining speed in the early stage and high convergence performance in the final stage. Based on this idea, we provide two optimizer implementations, ADMETAR and ADMETAS, the former based on RAdam and the latter based on SGDM. Through extensive experiments on diverse tasks, we find that the proposed ADMETA optimizer outperforms its base optimizers and shows advantages over recently proposed competitive optimizers. We also provide theoretical proofs for these two algorithms, which verify the convergence of our proposed ADMETA.

1. INTRODUCTION

The field of neural network training has long been dominated by first-order gradient descent optimizers. Typical examples include SGD (Robbins & Monro, 1951) and SGD with momentum (SGDM) (Sutskever et al., 2013), which are simple yet efficient algorithms and often enjoy even better final convergence than many recently proposed optimizers. However, they suffer from low speed in the initial stage and poor performance on sparse training datasets. This shortcoming cannot be ignored: as deep learning has developed, datasets have grown much larger and models much more complex, and the time to train a network is itself an important metric when evaluating an optimizer. To address this issue, optimizers with adaptive learning rates have been proposed, which use nonuniform step sizes to scale the gradient during training; the usual implementation scales the gradient by the square root of some combination of the squared values of historical gradients. By far the most widely used are Adam (Kingma & Ba, 2014) and AdamW (Loshchilov & Hutter, 2017), owing to their simplicity and high training speed in the early stage. Despite their popularity, Adam and many of its variants (such as RAdam (Liu et al., 2019)) tend to achieve worse generalization than non-adaptive optimizers, as their performance quickly plateaus on validation sets. To achieve a better tradeoff, researchers have made many improvements based on the SGD and Adam optimizer families. One attempt is switching from adaptive learning rate methods to SGD, based on the idea of complementing each other's advantages. However, an abrupt change from one optimizer to another at a set epoch or step is not viable, because different algorithms make characteristic choices at saddle points and tend to converge to final points near which the loss function has different geometry (Im et al., 2016).
Therefore, many optimizers based on this idea seek a smooth switch; representative ones are AdaBound (Luo et al., 2019) and SWATS (Keskar & Socher, 2017). A second attempt is proposing new methods to further accelerate SGDM, including introducing a power exponent (pbSGD (Zhou et al., 2020)), aggregated momentum (AggMo (Lucas et al., 2018)) and warm restarts (SGDR (Loshchilov & Hutter, 2016)). A third attempt is modifying the update process of optimizers with adaptive learning rates to reach better local optima, which is the most active direction in recent research (Zhuang et al., 2020; Li et al., 2020). Due to space constraints, please see further related work in Appendix A. In this paper we focus on the use of historical and future information about the model's optimization process, both of which we argue are important for models to reach their optimal points. To this end, we introduce a bidirectional view: backward-looking and forward-looking. In the backward-looking view, the exponential moving average (EMA) is an exponentially decreasing weighted moving average, used as a trend-type indicator for the optimization process. Since training uses a mini-batch strategy, each batch is likely to deviate from the whole dataset, which may mislead the model toward a local optimum. Inspired by stock market indicators, DEMA (Mulloy, 1994) is an exponential average calculated from the traditional EMA and the current input, which effectively maintains the trend while reducing the impact of short-term bias. We therefore replace the traditional exponential moving average (EMA) with the double exponential moving average (DEMA). It is worth noting that our usage is not equivalent to the original DEMA, but rather a variant of it.
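To make the backward-looking idea concrete, the following sketch computes the original stock-market DEMA of Mulloy (1994), DEMA_t = 2·EMA_t − EMA(EMA)_t (recall that our optimizer uses a variant of this, not this exact formula). The second smoothing term cancels part of the lag of a single EMA, so DEMA tracks a trend more closely while still damping short-term noise:

```python
def ema_update(prev, x, beta):
    """One exponential-moving-average step: EMA_t = beta*EMA_{t-1} + (1-beta)*x_t."""
    return beta * prev + (1.0 - beta) * x

def dema(series, beta=0.9):
    """Double exponential moving average (Mulloy, 1994):
    DEMA_t = 2*EMA_t - EMA(EMA)_t,
    where the second EMA is taken over the first EMA's outputs.
    Both EMAs are initialized to the first element of the series.
    """
    ema1 = ema2 = series[0]
    out = []
    for x in series:
        ema1 = ema_update(ema1, x, beta)   # smooth the raw input
        ema2 = ema_update(ema2, ema1, beta)  # smooth the smoothed signal
        out.append(2.0 * ema1 - ema2)      # subtract the lagging component
    return out
```

On a rising sequence such as 0, 1, 2, 3, 4 the DEMA output stays closer to the latest value than a plain EMA with the same decay, illustrating the reduced lag that motivates its use as a trend indicator.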
In the forward-looking part, we observe that the constant weight adopted by the original Lookahead optimizer (Zhang et al., 2019) to control the balance of fast and slow weights in each synchronization period makes early-stage training slow and lossy, so we propose a new dynamic strategy that uses an asymptotic weight instead. Applying these two ideas, we propose the ADMETA optimizer with two implementations, ADMETAR and ADMETAS, based on RAdam and SGDM respectively. Extensive experiments on computer vision (CV), natural language processing (NLP) and audio processing tasks demonstrate that our method achieves better convergence results than other recently proposed optimizers. Further analysis shows that ADMETAS achieves higher generalization ability than SGDM, and that ADMETAR achieves better convergence results than other adaptive learning rate methods while maintaining high speed in the initial stage. We further find that DEMA and the dynamic lookahead strategy improve performance over EMA and the constant strategy, respectively. In addition, we provide convergence proofs for our proposed ADMETA in convex and non-convex optimization.
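The forward-looking idea can be sketched as follows. In Lookahead, the slow weights are pulled toward the fast weights once per synchronization period, slow ← slow + α·(fast − slow), and the fast weights are then reset to the slow weights; the original method keeps α constant. The schedule below is a hypothetical illustration of an asymptotic weight (the exact schedule and its parameters, here `alpha_inf` and `decay`, are our illustrative assumptions, not the paper's formula): α starts at 1, so early synchronizations do not drag the fast weights back, and it approaches a set value, recovering standard Lookahead-style averaging late in training:

```python
import math

def dynamic_alpha(sync_step, alpha_inf=0.5, decay=0.05):
    """Hypothetical asymptotic weight schedule: alpha(0) = 1 (slow weights
    simply track the fast weights, preserving early-stage speed) and
    alpha -> alpha_inf as sync_step grows (stable averaging at the end)."""
    return alpha_inf + (1.0 - alpha_inf) * math.exp(-decay * sync_step)

def lookahead_sync(slow, fast, alpha):
    """One Lookahead synchronization: interpolate slow weights toward the
    fast weights, then reset the fast weights to the new slow weights."""
    slow = [s + alpha * (f - s) for s, f in zip(slow, fast)]
    return slow, list(slow)
```

With this shape, the optimizer behaves like its base (fast) optimizer at the start and gradually acquires the stabilizing effect of a fixed-weight Lookahead, matching the stated goal of early speed and late convergence quality.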

2.1. BACKGROUND

The role of the optimizer in model training is to minimize the loss on the training set and thus drive the learning of the model parameters. Formally, consider a loss function f : R^d → R that is bounded from below, where R denotes the field of real numbers and d the dimension of the parameter, so that R^d is the d-dimensional Euclidean space. The optimization problem can be formulated as: min_{θ∈F} f(θ), where θ is a parameter whose domain is F ⊂ R^d. If we denote the optimum of the above loss function by θ*, the optimization objective can be written as: θ* = arg min_{θ∈F} f(θ). Optimizers iteratively update the parameters to bring them closer to the optimum as the training step t increases, that is, to make: lim_{t→∞} ∥θ_t − θ*∥ = 0. The stochastic gradient algorithm SGD (Robbins & Monro, 1951) optimizes f by iteratively updating the parameter θ_t at step t in the opposite direction of the stochastic gradient g(θ_{t−1}; ξ_t), where ξ_t is the input of the t-th mini-batch of the training dataset. For the sake of clarity, we abbreviate g(θ_{t−1}; ξ_t) as g_t for the rest of the paper unless specified. SGD computes the updated model parameters from the previous parameters, the current gradient and the learning rate: with learning rate α_t, the update is θ_t = θ_{t−1} − α_t g_t. Plain SGD tends to oscillate along the optimization trajectory because of the mini-batch strategy and its disregard of past gradients; this also causes long plateaus in valleys and at saddle points, slowing convergence. To smooth the oscillation and speed up the convergence rate, momentum, also known as Polyak's heavy ball (Polyak, 1964), is introduced to modify SGD. Momentum at step
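The two update rules described above can be sketched in a few lines; the minimal one-dimensional example below (our illustration, using f(θ) = θ² with gradient 2θ) shows both the plain SGD step θ_t = θ_{t−1} − α_t g_t and the heavy-ball variant, where a velocity term accumulates past gradients:

```python
def sgd_step(theta, grad, lr):
    """Plain SGD: theta_t = theta_{t-1} - alpha_t * g_t."""
    return theta - lr * grad

def heavy_ball_step(theta, velocity, grad, lr, mu=0.9):
    """SGD with Polyak momentum (heavy ball):
    v_t = mu * v_{t-1} + g_t;  theta_t = theta_{t-1} - alpha_t * v_t.
    The velocity smooths mini-batch noise and speeds progress along valleys."""
    velocity = mu * velocity + grad
    return theta - lr * velocity, velocity

# Minimizing f(theta) = theta^2, whose gradient is 2*theta:
theta = 5.0
for _ in range(100):
    theta = sgd_step(theta, 2.0 * theta, lr=0.1)
# theta contracts by a factor of 0.8 per step and approaches the optimum 0.
```

The same contraction happens with the heavy-ball step; the difference appears on noisy or ill-conditioned objectives, where the accumulated velocity damps oscillation, which is exactly the motivation given above for adding momentum to SGD.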

