ON THE MARGINAL REGRET BOUND MINIMIZATION OF ADAPTIVE METHODS

Anonymous authors
Paper under double-blind review

Abstract

Numerous adaptive algorithms, such as AMSGrad and Radam, have been proposed and applied to deep learning in recent years. However, these modifications do not improve the convergence rate of adaptive algorithms, and whether a better algorithm exists remains an open question. In this work, we propose a new motivation for designing the proximal function of adaptive algorithms, which we call marginal regret bound minimization. Based on this idea, we propose a new class of adaptive algorithms that not only achieves marginal optimality, but can also potentially converge much faster than any existing adaptive algorithm in the long term. We demonstrate the superiority of the new class of adaptive algorithms both theoretically and empirically, using experiments in deep learning.

1. INTRODUCTION

Accelerating the convergence of optimization algorithms is a central concern of the machine learning community. After stochastic gradient descent (SGD) was introduced, quite a few variants of SGD became popular, such as momentum (Polyak, 1964) and AdaGrad (Duchi et al., 2011). Instead of moving parameters directly in the negative direction of the gradient, AdaGrad scales the gradient by a matrix, namely the matrix in the proximal function of the composite mirror descent rule (Duchi et al., 2011). The diagonal version of AdaGrad sets this matrix to the square root of the global average of the squared gradients. Duchi et al. (2011) proved that this algorithm can be faster than SGD when the gradients are sparse. However, AdaGrad's performance is known to deteriorate when the gradients are dense, especially in high-dimensional problems such as deep learning (Reddi et al., 2018). To tackle this issue, many new algorithms have been proposed to boost the performance of AdaGrad. Most of these algorithms focus on changing the design of the matrix in the proximal function. For example, RMSProp (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015) replaced the global-average design of AdaGrad with an exponential moving average. However, Reddi et al. (2018) proved that this modification has convergence issues in the presence of high-frequency noise, and added a max operation to the matrix of Adam, leading to the AMSGrad algorithm. Other modifications, such as Padam (Chen & Gu, 2018), AdaShift (Zhou et al., 2019), NosAdam (Huang et al., 2019), and Radam (Liu et al., 2019), are likewise based on different designs of this matrix. However, none of the aforementioned works improved the convergence rate of AdaGrad; they simply supported their designs with experiments and synthetic examples. A theoretical foundation for the design of this matrix, one that improves convergence and guides future adaptive algorithms, is very much needed.
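To make the distinction between these matrix designs concrete, the following is a minimal sketch (not taken from this paper or any of the cited implementations) of the diagonal second-moment accumulators that define the proximal-function matrix in AdaGrad, Adam/RMSProp, and AMSGrad. The names `v`, `v_hat`, and `beta2` are conventional, and the per-coordinate update would divide the (momentum-adjusted) gradient by the square root of the returned accumulator plus a small epsilon.

```python
import numpy as np

def adagrad_v(grads):
    """AdaGrad: global average of the squared gradients over all steps."""
    v = np.zeros_like(grads[0])
    for g in grads:
        v += g * g                      # cumulative sum of g_t^2
    return v / len(grads)               # global (uniform) average

def adam_v(grads, beta2=0.999):
    """Adam / RMSProp: exponential moving average of squared gradients."""
    v = np.zeros_like(grads[0])
    for g in grads:
        v = beta2 * v + (1 - beta2) * g * g
    return v

def amsgrad_v(grads, beta2=0.999):
    """AMSGrad: element-wise running max of the Adam accumulator."""
    v = np.zeros_like(grads[0])
    v_hat = np.zeros_like(grads[0])
    for g in grads:
        v = beta2 * v + (1 - beta2) * g * g
        v_hat = np.maximum(v_hat, v)    # max keeps v_hat non-decreasing
    return v_hat
```

The max operation in `amsgrad_v` is exactly what prevents the effective step size from growing again after a large gradient has been observed, which is the fix Reddi et al. (2018) introduced for Adam's convergence issue.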
In this work, we bring new insights to the design of the matrix in the proximal function. In particular, our major contributions are as follows:

• We propose a new motivation for designing the proximal function in adaptive algorithms. Specifically, we derive a marginally optimal design: the best matrix at each time step, obtained by minimizing the marginal increment of the regret bound.

• Based on marginal regret bound minimization, we create a new class of adaptive algorithms, named AMX. We prove theoretically that AMX converges with a regret bound of size Õ(√τ), where τ is smaller than T. Such a regret bound is potentially much smaller than those of common adaptive algorithms and can make AMX converge much faster in the long term.

