ON THE MARGINAL REGRET BOUND MINIMIZATION OF ADAPTIVE METHODS

Anonymous authors
Paper under double-blind review

Abstract

Numerous adaptive algorithms such as AMSGrad and Radam have been proposed and applied to deep learning recently. However, these modifications do not improve the convergence rate of adaptive algorithms, and whether a better algorithm exists remains an open question. In this work, we propose a new motivation for designing the proximal function of adaptive algorithms, which we call marginal regret bound minimization. Based on this idea, we propose a new class of adaptive algorithms that not only achieves marginal optimality, but can also potentially converge much faster than any existing adaptive algorithm in the long term. We demonstrate the advantages of this new class of adaptive algorithms both theoretically and empirically through experiments in deep learning.

1. INTRODUCTION

Accelerating the convergence of optimization algorithms is a central concern of the machine learning community. After stochastic gradient descent (SGD) was introduced, quite a few variants of SGD became popular, such as momentum (Polyak, 1964) and AdaGrad (Duchi et al., 2011). Instead of directly moving parameters in the negative direction of the gradient, AdaGrad scales the gradient by a matrix, namely the matrix in the proximal function of the composite mirror descent rule (Duchi et al., 2011). The diagonal version of AdaGrad sets this matrix to the square root of the global average of the squared gradients. Duchi et al. (2011) proved that this algorithm can be faster than SGD when the gradients are sparse. However, AdaGrad's performance is known to deteriorate when the gradients are dense, especially in high-dimensional problems such as deep learning (Reddi et al., 2018). To tackle this issue, many new algorithms were proposed to boost the performance of AdaGrad. Most of these algorithms focused on changing the design of the matrix in the proximal function. For example, RMSProp (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015) replaced the global average in AdaGrad with an exponential moving average. However, Reddi et al. (2018) proved that this modification has convergence issues in the presence of high-frequency noise, and added a max operation to the matrix of Adam, leading to the AMSGrad algorithm. Other modifications, such as Padam (Chen & Gu, 2018), AdaShift (Zhou et al., 2019), NosAdam (Huang et al., 2019), and Radam (Liu et al., 2019), were likewise based on various designs of this matrix. However, none of the aforementioned works improved the convergence rate of AdaGrad, and they supported their designs only with experiments and synthetic examples. A theoretical foundation for the design of this matrix that improves convergence and guides future adaptive algorithms is very much needed.
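The designs surveyed above differ only in how the (diagonal) matrix in the proximal function accumulates squared gradients. As a minimal NumPy sketch, not the authors' method, the three accumulators mentioned in this paragraph can be written side by side; the function names and the eps smoothing constant are our own illustrative choices:

```python
import numpy as np

def adagrad_accumulator(grads, eps=1e-8):
    """AdaGrad: square root of the running sum of squared gradients."""
    v = np.zeros_like(grads[0])
    for g in grads:
        v += g ** 2
    return np.sqrt(v) + eps

def ema_accumulator(grads, beta2=0.999, eps=1e-8):
    """RMSProp/Adam: replace the global sum with an exponential moving average."""
    v = np.zeros_like(grads[0])
    for g in grads:
        v = beta2 * v + (1 - beta2) * g ** 2
    return np.sqrt(v) + eps

def amsgrad_accumulator(grads, beta2=0.999, eps=1e-8):
    """AMSGrad: element-wise max over the EMA history, so the matrix never shrinks."""
    v = np.zeros_like(grads[0])
    v_hat = np.zeros_like(grads[0])
    for g in grads:
        v = beta2 * v + (1 - beta2) * g ** 2
        v_hat = np.maximum(v_hat, v)
    return np.sqrt(v_hat) + eps
```

Each returned vector is the diagonal of the matrix H_t used to scale the gradient; the max operation in the AMSGrad variant is exactly the fix of Reddi et al. (2018) that makes the accumulator non-decreasing.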
In this work, we bring new insights to the design of the matrix in the proximal function. In particular, our major contributions are as follows:

• We propose a new motivation for designing the proximal function in adaptive algorithms. Specifically, we find a marginally optimal design, i.e., the best matrix at each time step, by minimizing the marginal increment of the regret bound.

• Based on our proposal of marginal regret bound minimization, we create a new class of adaptive algorithms, named AMX. We prove theoretically that AMX converges with a regret bound of size Õ(√τ), where τ is smaller than T. Such a regret bound is potentially much smaller than those of common adaptive algorithms and can make AMX converge much faster than any existing adaptive algorithm, depending on τ. In the worst case, we show it is at least as fast as AMSGrad and AdaGrad under the same assumptions.

• We evaluate AMX's empirical performance on different tasks in deep learning. All experiments show that our algorithm converges fast and achieves good testing performance.

2. BACKGROUND

Notation: We denote the set of all positive definite matrices in R^{d×d} by S_d^+. For any two vectors a, b ∈ R^d, we use √a for the element-wise square root, a^2 for the element-wise square, |a| for the element-wise absolute value, a/b for element-wise division, and max(a, b) for the element-wise maximum of a and b. We also frequently use the notation g_{1:T,i} = [g_{1,i}, g_{2,i}, ..., g_{T,i}], i.e., the vector of all the i-th elements of the vectors g_1, g_2, ..., g_T. For a vector a, we use diag(a) to represent the diagonal matrix whose diagonal entries are a. For two functions f(t), g(t), f(t) = o(g(t)) means f(t)/g(t) → 0 as t goes to infinity. We use Õ(·) to omit logarithmic terms in big-O notation. We say a space X has a bounded diameter D_∞ if ‖x − y‖_∞ ≤ D_∞ for all x, y ∈ X.

Online Learning Framework. We adopt the online learning framework to analyze all the algorithms in this paper. In this framework, an algorithm picks a new x_t ∈ X according to its update rule at each iteration t, where X ⊆ R^d is the set of feasible values of x_t. The composite loss function f_t + φ is then revealed, where φ is a regularization function that controls the complexity of x and f_t can be viewed as the instantaneous loss at time t. In the convex setting, f_t and φ are both convex. The regularized regret with respect to an optimal predictor x* is defined as R(T) = Σ_{t=1}^{T} [f_t(x_t) − f_t(x*) + φ(x_t) − φ(x*)]. Our goal is to find algorithms that ensure sub-linear regret, i.e., R(T) = o(T), which means that the average regret converges to zero. For example, online gradient descent is proved to have a regret of O(√(dT)) (Zinkevich, 2003), where d is the dimension of X. Note that stochastic optimization and online learning are essentially interchangeable (Cesa-Bianchi et al., 2004). Therefore, we will refer to online algorithms and their stochastic counterparts by the same names; for example, we will use stochastic gradient descent (SGD) for online gradient descent, as it is more widely known.

Composite Mirror Descent Setup. In this paper, we revisit the general composite mirror descent method (Duchi et al., 2010b) used in the creation of the first adaptive algorithm, AdaGrad, to bring new insights into adaptive methods. This general framework is preferred because it covers a wide range of algorithms, including both SGD and all the adaptive methods, and thus simplifies the discussion. The composite mirror descent rule at time step t + 1 solves

x_{t+1} = argmin_{x ∈ X} {α_t ⟨g_t, x⟩ + α_t φ(x) + B_{ψ_t}(x, x_t)},   (1)

where g_t is the gradient, φ(x) is the regularization function in the dual space, and α_t is the step size. Here, ψ_t is a strongly convex and differentiable function, called the proximal function, and B_{ψ_t}(x, x_t) is the Bregman divergence associated with ψ_t, defined as B_{ψ_t}(x, y) = ψ_t(x) − ψ_t(y) − ⟨∇ψ_t(y), x − y⟩. The general update rule (1) is mostly determined by the function ψ_t. We first observe that it becomes the projected SGD algorithm when ψ_t(x) = x^T x and φ(x) = 0:

x_{t+1} = argmin_{x ∈ X} {α_t ⟨g_t, x⟩ + ‖x − x_t‖_2^2} = Π_X(x_t − α_t g_t),

where Π_X(x) = argmin_{y ∈ X} ‖x − y‖_2 is the projection operation that ensures the updated parameter remains in the feasible set. On the other hand, adaptive algorithms choose proximal functions of the form ψ_t(x) = ⟨x, H_t x⟩, where H_t can be any full or diagonal symmetric positive definite matrix:

x_{t+1} = argmin_{x ∈ X} {α_t ⟨g_t, x⟩ + α_t φ(x) + ⟨x − x_t, H_t(x − x_t)⟩}.   (Adaptive)

Another popular representation of adaptive algorithms is the generalized projection rule x_{t+1} = Π_{X, H_t}(x_t − α_t H_t^{-1} g_t), where Π_{X, H_t}(x) = argmin_{y ∈ X} ‖H_t^{1/2}(x − y)‖_2, which is used in a lot of
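The generalized projection rule above is easy to make concrete in the diagonal case. The following is a minimal sketch, not the paper's algorithm: it assumes H_t = diag(h_diag) and a box-shaped feasible set X = [lower, upper]^d, for which the H_t-weighted projection decouples per coordinate and reduces to element-wise clipping (the function and argument names are ours):

```python
import numpy as np

def adaptive_step(x, g, alpha, h_diag, lower=None, upper=None):
    """One generalized-projection step x_{t+1} = Pi_{X,H_t}(x_t - alpha_t H_t^{-1} g_t)
    with a diagonal H_t = diag(h_diag), h_diag > 0.

    For a box X = [lower, upper]^d, argmin_y ||H_t^{1/2}(x - y)||_2 separates
    across coordinates (H_t is diagonal), so the weighted projection is just
    element-wise clipping to the box.
    """
    x_new = x - alpha * g / h_diag  # H_t^{-1} g_t is element-wise division
    if lower is not None or upper is not None:
        x_new = np.clip(x_new, lower, upper)
    return x_new
```

With h_diag equal to the all-ones vector and no box constraints, this recovers plain SGD, matching the observation that projected SGD is the special case ψ_t(x) = x^T x of the composite mirror descent rule.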

