ADMETA: A NOVEL DOUBLE EXPONENTIAL MOVING AVERAGE TO ADAPTIVE AND NON-ADAPTIVE MOMENTUM OPTIMIZERS WITH BIDIRECTIONAL LOOKING

Abstract

The optimizer is an essential component in the success of deep learning: it guides the neural network in updating its parameters according to the loss on the training set. SGD and Adam are two classical and effective optimizers on which researchers have proposed many variants, such as SGDM and RAdam. In this paper, we innovatively combine the backward-looking and forward-looking aspects of optimizer algorithms and propose a novel ADMETA (A Double exponential Moving averagE To Adaptive and non-adaptive momentum) optimizer framework. For the backward-looking part, we propose a DEMA variant scheme, motivated by an indicator from the stock market, to replace the common exponential moving average scheme. For the forward-looking part, we present a dynamic lookahead strategy that asymptotically approaches a set value, maintaining the optimizer's speed in the early stage and high convergence performance in the final stage. Based on this idea, we provide two optimizer implementations, ADMETAR and ADMETAS, the former based on RAdam and the latter based on SGDM. Through extensive experiments on diverse tasks, we find that the proposed ADMETA optimizers outperform their base optimizers and show advantages over recently proposed competitive optimizers. We also provide theoretical proofs for both algorithms, verifying the convergence of our proposed ADMETA.
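For context, the stock-market indicator that motivates the backward-looking scheme is the standard double exponential moving average (DEMA) from technical analysis, which combines an EMA with an EMA of that EMA to reduce lag. The sketch below shows this classical DEMA on a generic sequence (with Adam-style bias correction added for short sequences); it is an illustration of the motivating indicator, not the paper's exact variant:

```python
import numpy as np

def ema(x, beta):
    """Bias-corrected exponential moving average with smoothing factor beta."""
    out, m = [], 0.0
    for t, v in enumerate(x, start=1):
        m = beta * m + (1 - beta) * v
        out.append(m / (1 - beta ** t))  # bias correction, as in Adam
    return np.array(out)

def dema(x, beta):
    """Classical technical-analysis DEMA: 2*EMA(x) - EMA(EMA(x)).

    The second term cancels most of the lag of a plain EMA
    while retaining its smoothing effect.
    """
    e = ema(x, beta)
    return 2 * e - ema(e, beta)
```

On a linearly trending sequence, a plain EMA trails the latest value by roughly beta/(1 - beta) steps, while the DEMA tracks it almost exactly; this lag reduction is the property that makes a DEMA-style average attractive for smoothing gradients.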

1. INTRODUCTION

The field of neural network training has long been dominated by gradient descent optimizers, which use first-order methods. Typical examples include SGD (Robbins & Monro, 1951) and SGD with momentum (SGDM) (Sutskever et al., 2013), which are simple yet efficient algorithms and often achieve even better final convergence than many recently proposed optimizers. However, they suffer from slow progress in the initial stage and poor performance on sparse training datasets. This shortcoming cannot be ignored: as deep learning has developed, datasets have grown much larger and models much more complex, so the time needed to train a network is itself an important metric when evaluating an optimizer. To address this issue, optimizers with adaptive learning rates have been proposed, which use nonuniform stepsizes to scale the gradient during training; the usual implementation scales the gradient by the square root of some combination of the squared values of historical gradients. By far the most widely used are Adam (Kingma & Ba, 2014) and AdamW (Loshchilov & Hutter, 2017), owing to their simplicity and high training speed in the early stage. Despite their popularity, Adam and many of its variants (such as RAdam (Liu et al., 2019)) tend to achieve worse generalization than non-adaptive optimizers, as observed in their performance quickly plateauing on validation sets. To achieve a better tradeoff, researchers have made many improvements based on the SGD and Adam families of optimizers. One attempt is switching from adaptive learning rate methods to SGD, based on the idea of complementing each other's advantages. However, a sudden change from one optimizer to another at a fixed epoch or step is not practical, because different algorithms make characteristic choices at saddle points and tend to converge to final points near which the loss function has different geometry (Im et al., 2016).
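The adaptive scaling described above (dividing the gradient by the square root of an exponential moving average of squared historical gradients) is the core of the Adam update. A minimal sketch of one such step, following the standard Adam formulas with bias correction (the function name and state dictionary here are illustrative, not from the paper):

```python
import numpy as np

def adam_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update: keep EMAs of the gradient and its square,
    bias-correct both, then take a per-coordinate (adaptive) step."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad          # first moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2     # second moment
    m_hat = state["m"] / (1 - beta1 ** state["t"])                # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)            # adaptive step
```

Because each coordinate is divided by its own gradient-magnitude estimate, early steps are close to `lr` in size regardless of gradient scale, which is exactly the fast initial progress (and, on sparse gradients, the robustness) attributed to adaptive methods above.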
Therefore, many optimizers based on this idea seek a smooth switch; representative ones are AdaBound (Luo et al., 2019) and SWATS (Keskar & Socher, 2017). A second attempt is to propose new methods that further accelerate SGDM, including introducing power

