ON THE MARGINAL REGRET BOUND MINIMIZATION OF ADAPTIVE METHODS

Anonymous authors
Paper under double-blind review

Abstract

Numerous adaptive algorithms, such as AMSGrad and Radam, have recently been proposed and applied to deep learning. However, these modifications do not improve the convergence rate of adaptive algorithms, and whether a better algorithm exists remains an open question. In this work, we propose a new motivation for designing the proximal function of adaptive algorithms, which we call marginal regret bound minimization. Based on this idea, we propose a new class of adaptive algorithms that not only achieves marginal optimality, but can also potentially converge much faster than existing adaptive algorithms in the long term. We demonstrate the superiority of this new class of algorithms both theoretically and empirically through deep learning experiments.

1. INTRODUCTION

Accelerating the convergence of optimization algorithms is a central concern of the machine learning community. After stochastic gradient descent (SGD) was introduced, quite a few variants of SGD became popular, such as momentum (Polyak, 1964) and AdaGrad (Duchi et al., 2011). Instead of moving the parameters directly in the negative direction of the gradient, AdaGrad scales the gradient by a matrix, namely the matrix in the proximal function of the composite mirror descent rule (Duchi et al., 2011). The diagonal version of AdaGrad sets this matrix to the square root of the global average of the squared gradients. Duchi et al. (2011) proved that this algorithm can be faster than SGD when the gradients are sparse. However, AdaGrad's performance is known to deteriorate when the gradients are dense, especially in high-dimensional problems such as deep learning (Reddi et al., 2018). To tackle this issue, many new algorithms have been proposed to boost the performance of AdaGrad, most of which focus on changing the design of the matrix in the proximal function. For example, RMSProp (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015) replaced the global average in AdaGrad with an exponential moving average. However, Reddi et al. (2018) proved that this modification causes convergence issues in the presence of high-frequency noise and added a max operation to the matrix of Adam, leading to the AMSGrad algorithm. Other modifications, such as Padam (Chen & Gu, 2018), AdaShift (Zhou et al., 2019), NosAdam (Huang et al., 2019), and Radam (Liu et al., 2019), are likewise based on various designs of this matrix. However, none of the aforementioned works improves the convergence rate of AdaGrad; they support their designs only with experiments and synthetic examples. A theoretical foundation for the design of this matrix, one that improves convergence and guides future adaptive algorithms, is very much needed.
In this work, we bring new insights to the design of the matrix in the proximal function. Our major contributions are as follows:

• We propose a new motivation for designing the proximal function in adaptive algorithms. Specifically, we find a marginally optimal design, i.e. the best matrix at each time step, obtained by minimizing the marginal increment of the regret bound.

• Based on marginal regret bound minimization, we create a new class of adaptive algorithms, named AMX. We prove theoretically that AMX converges with a regret bound of size $\tilde{O}(\sqrt{\tau})$, where $\tau$ is smaller than $T$. Such a regret bound is potentially much smaller than those of common adaptive algorithms and, depending on $\tau$, can make AMX converge much faster than any existing adaptive algorithm. In the worst case, we show it is at least as fast as AMSGrad and AdaGrad under the same assumptions.

• We evaluate AMX's empirical performance on different deep learning tasks. All experiments show that our algorithm converges fast and achieves good testing performance.

2. BACKGROUND

Notation. We denote the set of all positive definite matrices in $\mathbb{R}^{d\times d}$ by $\mathcal{S}^+_d$. For any two vectors $a, b \in \mathbb{R}^d$, we use $\sqrt{a}$ for the element-wise square root, $a^2$ for the element-wise square, $|a|$ for the element-wise absolute value, $a/b$ for element-wise division, and $\max(a, b)$ for the element-wise maximum of $a$ and $b$. We also frequently use the notation $g_{1:T,i} = [g_{1,i}, g_{2,i}, \cdots, g_{T,i}]$, i.e. the vector of all the $i$-th elements of the vectors $g_1, g_2, \cdots, g_T$. For a vector $a$, $\mathrm{diag}(a)$ denotes the diagonal matrix whose diagonal entries are $a$. For two functions $f(t), g(t)$, $f(t) = o(g(t))$ means $f(t)/g(t) \to 0$ as $t$ goes to infinity. We use $\tilde{O}(\cdot)$ to omit logarithmic factors in big-$O$ notation. We say a space $\mathcal{X}$ has a bounded diameter $D_\infty$ if $\|x - y\|_\infty \le D_\infty$, $\forall x, y \in \mathcal{X}$.

Online Learning Framework. We adopt the online learning framework to analyze all the algorithms in this paper. In this framework, an algorithm picks a new $x_t \in \mathcal{X}$ according to its update rule at each iteration $t$, where $\mathcal{X} \subseteq \mathbb{R}^d$ is the set of feasible values of $x_t$. The composite loss function $f_t + \phi$ is then revealed, where $\phi$ is a regularization function that controls the complexity of $x$ and $f_t$ can be regarded as the instantaneous loss at time $t$. In the convex setting, $f_t$ and $\phi$ are both convex functions. The regularized regret is defined with respect to an optimal predictor $x^*$ as $R(T) = \sum_{t=1}^T \left[f_t(x_t) - f_t(x^*) + \phi(x_t) - \phi(x^*)\right]$. Our goal is to find algorithms that ensure sub-linear regret, i.e. $R(T) = o(T)$, which means the average regret converges to zero. For example, online gradient descent is proved to have a regret of $O(\sqrt{dT})$ (Zinkevich, 2003), where $d$ is the dimension of $\mathcal{X}$. Note that stochastic optimization and online learning are essentially interchangeable (Cesa-Bianchi et al., 2004). Therefore, we will refer to online algorithms and their stochastic counterparts by the same names.
For example, we will use stochastic gradient descent (SGD) to refer to online gradient descent, as the former is better known.

Composite Mirror Descent Setup. In this paper, we revisit the general composite mirror descent method (Duchi et al., 2010b) used in the creation of the first adaptive algorithm, AdaGrad, to bring new insights into adaptive methods. This general framework is preferred because it covers a wide range of algorithms, including both SGD and all the adaptive methods, and thus simplifies the discussion. The composite mirror descent rule at time step $t+1$ solves
$$x_{t+1} = \arg\min_{x \in \mathcal{X}} \{\alpha_t \langle g_t, x \rangle + \alpha_t \phi(x) + B_{\psi_t}(x, x_t)\}, \quad (1)$$
where $g_t$ is the gradient, $\phi(x)$ is the regularization function in the dual space, and $\alpha_t$ is the step size. Here $\psi_t$ is a strongly convex and differentiable function, called the proximal function, and $B_{\psi_t}(x, x_t)$ is the Bregman divergence associated with $\psi_t$, defined as $B_{\psi_t}(x, y) = \psi_t(x) - \psi_t(y) - \langle \nabla\psi_t(y), x - y \rangle$. The general update rule (1) is mostly determined by the function $\psi_t$. We first observe that it becomes the projected SGD algorithm when $\psi_t(x) = x^T x$ and $\phi(x) = 0$:
$$x_{t+1} = \arg\min_{x \in \mathcal{X}} \{\alpha_t \langle g_t, x \rangle + \|x - x_t\|_2^2\} = \Pi_{\mathcal{X}}(x_t - \alpha_t g_t), \quad \text{(SGD)}$$
where $\Pi_{\mathcal{X}}(x) = \arg\min_{y \in \mathcal{X}} \|x - y\|_2$ is the projection that ensures the updated parameter stays in the feasible set. Adaptive algorithms, on the other hand, choose proximal functions $\psi_t(x) = \langle x, H_t x \rangle$, where $H_t$ can be any full or diagonal symmetric positive definite matrix:
$$x_{t+1} = \arg\min_{x \in \mathcal{X}} \{\alpha_t \langle g_t, x \rangle + \alpha_t \phi(x) + \langle x - x_t, H_t (x - x_t) \rangle\}.$$
Another popular representation of adaptive algorithms is the generalized projection rule $x_{t+1} = \Pi_{\mathcal{X}, H_t}(x_t - \alpha_t H_t^{-1} g_t)$, where $\Pi_{\mathcal{X}, H_t}(x) = \arg\min_{y \in \mathcal{X}} \|H_t^{1/2}(x - y)\|_2$, which is used in much of the recent literature such as Reddi et al. (2018); Huang et al. (2019).
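For concreteness, the two special cases above can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the paper: the feasible set is assumed to be an axis-aligned box, for which the $H_t$-weighted projection decouples coordinate-wise into plain clipping.

```python
import numpy as np

def adaptive_step(x, g, h, alpha, lo=-1.0, hi=1.0):
    """One generalized projected step x <- Pi_{X,H}(x - alpha * H^{-1} g)
    with diagonal H = diag(h). For a box X = [lo, hi]^d, the weighted
    projection argmin_y ||H^{1/2}(x - y)|| separates per coordinate and
    reduces to ordinary clipping."""
    return np.clip(x - alpha * g / h, lo, hi)

def sgd_step(x, g, alpha, lo=-1.0, hi=1.0):
    """Projected SGD is the special case psi_t(x) = x^T x, i.e. h = 1."""
    return adaptive_step(x, g, np.ones_like(x), alpha, lo, hi)
```

Note how a larger entry of $h$ shrinks the effective step on that coordinate, which is exactly the role of the proximal function.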
We show in Appendix A.1 that these two rules are equivalent when $\phi(x) = 0$, so the regret bounds obtained in the two representations can be carried over. A few recent works have shown that adaptive algorithms work well with special designs of the step sizes $\alpha_t$ (Choi et al., 2019; Vaswani et al., 2020). In this work, we choose the more standard $\alpha_t = \alpha/\sqrt{t}$, as it is used in most analyses. We also restrict the discussion in the main text to diagonal proximal functions, i.e. $H_t = \mathrm{diag}(h_t)$, $h_t \in \mathbb{R}^d$. Extensions of our results to full-matrix proximal functions are discussed in Appendix B.

Different Designs of the Proximal Function. Researchers have proposed numerous designs of $H_t = \mathrm{diag}(h_t)$, such as AdaGrad (Duchi et al., 2011), Adam (Kingma & Ba, 2015), AMSGrad (Reddi et al., 2018) and NosAdam (Huang et al., 2019), to name a few. It is impossible to cover all the proposed designs in this section, so we review the two most famous ones. The first adaptive algorithm, AdaGrad, uses the square root of the average of the past squared gradients as the diagonal $h_t$ of the matrix in the proximal function (Duchi et al., 2011), i.e.
$$h_t = \left(\frac{\sum_{i=1}^t g_i^2}{t}\right)^{1/2}. \quad \text{(AdaGrad)}$$
Normally, a small constant $\epsilon$ is added to $h_t$ at each iteration. Some recent works have shown that tuning this constant can benefit the performance of adaptive algorithms (Zaheer et al., 2018; Savarese et al., 2019). In this work, however, we assume it is small and fixed for simplicity, as it is originally designed to compute the pseudo-inverse, or equivalently, to avoid division by zero in the generalized projected descent. Kingma & Ba (2015) proposed Adam, which replaces the simple average in AdaGrad by an exponential moving average, but Reddi et al. (2018) showed that there was a mistake in Adam's convergence analysis, which leads to the divergence of Adam even in simple convex problems.
They therefore proposed the following simple modification of Adam to ensure convergence:
$$h_t = \max_{s \le t} \left\{\left(\sum_{i=1}^{s} (1 - \beta_2)\beta_2^{\,s-i} g_i^2\right)^{1/2}\right\}, \quad \text{(AMSGrad)}$$
where $\beta_2 \in (0, 1)$ is a constant. We propose the following theorem, which generalizes the regret bounds for most designs of the diagonal proximal function.

Theorem 2.1 Let the sequence $\{x_t\}$ be defined by the update rule (1) and, for any $x^*$, denote $D_{t,\infty}^2 = \|x_t - x^*\|_\infty^2$. When $\psi_t(x) = \langle x, H_t x \rangle$ with $H_t = \mathrm{diag}(h_{t,1}, h_{t,2}, \cdots, h_{t,d})$, assume without loss of generality that $\phi(x_1) = 0$ and $H_0 = 0$. If $h_{t,i}/\alpha_t \ge h_{t-1,i}/\alpha_{t-1}$, then
$$R(T) \le \sum_{t=1}^T \sum_{i=1}^d \left(\frac{h_{t,i}}{\alpha_t} - \frac{h_{t-1,i}}{\alpha_{t-1}}\right) D_{t,\infty}^2 + \sum_{t=1}^T \sum_{i=1}^d \frac{\alpha_t g_{t,i}^2}{2 h_{t,i}}. \quad (2)$$
The proof is relegated to Appendix A.3. The above regret bound holds for any design of $h_t$ that satisfies the condition $h_{t,i}/\alpha_t \ge h_{t-1,i}/\alpha_{t-1}$. This condition is crucial: if it is violated, the regret $R(T)$ might diverge. In fact, the divergence of Adam in simple convex problems results from not satisfying this constraint, as proved by Reddi et al. (2018). With $\alpha_t = \alpha/\sqrt{t}$ in Theorem 2.1, most recent adaptive algorithms have a regret bound of the following form (Duchi et al., 2011; Reddi et al., 2018; Huang et al., 2019; Luo et al., 2019):
$$R(T) \le C_1 \sqrt{T} \sum_{i=1}^d h_{T,i} + f(T) \sum_{i=1}^d \|g_{1:T,i}\|_2 + C_2, \quad (3)$$
where $C_1, C_2$ are constants and $f(T) = o(\sqrt{T})$. These algorithms are supposed to converge faster than SGD when the gradients are sparse or small, i.e. when $\sum_{i=1}^d h_{T,i} \ll \sqrt{d}$ and $\sum_{i=1}^d \|g_{1:T,i}\|_2 \ll \sqrt{dT}$.
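As a concrete sketch, both diagonal designs reviewed above can be computed in a few lines; the running max in AMSGrad is what enforces the monotonicity condition of Theorem 2.1 when $\alpha_t$ is non-increasing. This is an illustrative sketch (bias correction and the small constant $\epsilon$ are omitted):

```python
import numpy as np

def adagrad_h(grads):
    """AdaGrad's diagonal as written in the text: the element-wise square
    root of the running average of the squared gradients g_1..g_t."""
    return np.sqrt(sum(g**2 for g in grads) / len(grads))

def amsgrad_h(grads, beta2=0.99):
    """AMSGrad's diagonal: an exponential moving average of squared
    gradients combined with a running element-wise max, which keeps h_t
    non-decreasing in t and hence satisfies h_t/alpha_t >= h_{t-1}/alpha_{t-1}
    whenever alpha_t is non-increasing."""
    v = np.zeros_like(grads[0], dtype=float)
    h = np.zeros_like(v)
    hs = []
    for g in grads:
        v = beta2 * v + (1 - beta2) * g**2   # EMA of g^2
        h = np.maximum(h, np.sqrt(v))        # running max keeps h monotone
        hs.append(h)
    return hs
```

Without the running max (i.e. plain Adam), $h_t$ can decrease when the gradients shrink, which is exactly how the constraint of Theorem 2.1 gets violated.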

3. THE MOTIVATION-MARGINAL REGRET BOUND MINIMIZATION

In this section, we introduce the motivation behind our new class of algorithms. Although we find it difficult to determine the optimal proximal function globally, we show that it is possible to find the best proximal function at each iteration through marginal regret bound minimization. Let $\bar{R}(T)$ denote the regret upper bound (the right-hand side of inequality (2)) in Theorem 2.1. At time step $T$, we define the marginal regret bound increment $\Delta\bar{R}(T)$ as
$$\Delta\bar{R}(T) := \bar{R}(T) - \bar{R}(T-1) = \sum_{i=1}^d \left(\frac{h_{T,i}}{\alpha_T} - \frac{h_{T-1,i}}{\alpha_{T-1}}\right) D_{T,\infty}^2 + \sum_{i=1}^d \frac{\alpha_T g_{T,i}^2}{2 h_{T,i}}.$$
By definition, $\Delta\bar{R}(T)$ is the increment of the regret bound $\bar{R}(T)$ once $h_T$ is determined. An important observation is that $h_{T-1}$ is a given constant at time $T$, so $\Delta\bar{R}(T)$ is a function of $h_T$ only. Therefore, the best design of $h_T$ available at this moment is the one that minimizes $\Delta\bar{R}(T)$ while satisfying the constraint in Theorem 2.1. Consider the minimization problem
$$\min_{h_T} \Delta\bar{R}(T), \quad \text{s.t.} \quad \frac{h_{T,i}}{\alpha_T} \ge \frac{h_{T-1,i}}{\alpha_{T-1}} \ge 0. \quad (4)$$
We propose the following proposition, which solves this problem.

Proposition 3.1 With $\alpha_t = \alpha/\sqrt{t}$, the minimum of problem (4) is attained at
$$h_T^* = \max\left(\sqrt{\frac{T-1}{T}}\, h_{T-1},\ \left(\frac{\alpha^2}{2 T D_{T,\infty}^2}\right)^{1/2} |g_T|\right). \quad (5)$$
To see this, define the Lagrangian with multiplier $\mu$:
$$L(h_T, \mu) = \sum_{i=1}^d \left(\frac{h_{T,i}}{\alpha_T} - \frac{h_{T-1,i}}{\alpha_{T-1}}\right) D_{T,\infty}^2 + \sum_{i=1}^d \frac{\alpha_T g_{T,i}^2}{2 h_{T,i}} - \left\langle \frac{h_T}{\alpha_T} - \frac{h_{T-1}}{\alpha_{T-1}},\ \mu \right\rangle.$$
Taking the partial derivative with respect to each $h_{T,i}$ and setting it to zero gives $D_{T,\infty}^2 = \alpha_T^2 g_{T,i}^2 / (2 h_{T,i}^2) + \mu_i$. By complementary slackness, either $\mu_i = 0$ or $h_{T,i} = (\alpha_T/\alpha_{T-1}) h_{T-1,i}$. When $\mu_i = 0$, $h_{T,i} = (\alpha_T^2 / (2 D_{T,\infty}^2))^{1/2} |g_{T,i}|$, and the constraint $h_{T,i} \ge (\alpha_T/\alpha_{T-1}) h_{T-1,i}$ must hold. Hence, setting $\alpha_T = \alpha/\sqrt{T}$ yields the solution in (5). Solution (5) is the best diagonal proximal function in terms of the regret bound increment at time $T$.
Therefore, replacing $T$ by $t$ in the subscripts yields a greedy choice of the proximal function $h_t$ that minimizes the marginal regret bound increment at each time step. Intuitively, this solution achieves the minimum because it balances the two terms of $\Delta\bar{R}(T)$. On each dimension $i$, it makes the first term of $\Delta\bar{R}(T)$ zero when the derivative is small, i.e. when we do not need an even larger $h_{T,i}$ to slow down. When the derivative is too large, $h_{T,i}$ adapts to the size of the derivative so that the second term of $\Delta\bar{R}(T)$ is constrained. However, like other greedy algorithms, solution (5) is only suboptimal, as it minimizes the regret bound marginally rather than globally. Moreover, the parameter $D_{t,\infty}$ is often unknown during optimization because $x^*$ is usually unknown. Therefore, stronger theoretical motivation is needed to trust that solution (5) or similar algorithms can be useful and beneficial.
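To make solution (5) concrete, the following one-dimensional sketch computes $h_T^*$ and checks numerically that it minimizes $\Delta\bar{R}(T)$ over the feasible set. Here $D_\infty$ is assumed to be known, which, as noted above, is usually not the case in practice:

```python
import numpy as np

def greedy_h(h_prev, g, T, alpha, D):
    """Marginally optimal h_T from solution (5), with alpha_t = alpha/sqrt(t)."""
    keep = np.sqrt((T - 1) / T) * h_prev                 # constraint-binding branch
    grow = np.sqrt(alpha**2 / (2 * T * D**2)) * abs(g)   # stationary-point branch
    return max(keep, grow)

def delta_bound(h, h_prev, g, T, alpha, D):
    """Marginal regret bound increment Delta R(T) as a function of h_T (d = 1)."""
    a_T, a_prev = alpha / np.sqrt(T), alpha / np.sqrt(T - 1)
    return (h / a_T - h_prev / a_prev) * D**2 + a_T * g**2 / (2 * h)
```

Scanning a grid of feasible values confirms that no feasible $h_T$ achieves a smaller increment than $h_T^*$.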

4. A NEW CLASS OF ADAPTIVE ALGORITHMS -AMX

Motivated by the greedy choice of $h_t$ in Section 3, we now focus on a more general design of the diagonal matrix in the proximal function, of the following form, and show why such greedy designs can be beneficial in the long term. Consider
$$h_t = \max\left(\sqrt{\frac{t-1}{t}}\, h_{t-1},\ c_t |g_t|\right), \quad \text{(AMX)}$$
where $c_t$ is an arbitrary function of $t$, for example solution (5), or as simple as $c_t = 1$. The corresponding new class of adaptive algorithms, which we name AMX, is given in Algorithm 1.

Algorithm 1 AMX Algorithm (Diagonal, Composite Mirror Descent Form)
1: Input: $x_1 \in \mathcal{F}$, $\alpha_t = \alpha/\sqrt{t}$, $\{c_t\}_{t=1}^T$, $\phi(x)$, $\epsilon$
2: Initialize $h_0 = 0$, $H_0 = 0$
3: for $t = 1$ to $T$ do
4:   $g_t = \nabla f_t(x_t)$
5:   $h_t = \max\left(\sqrt{(t-1)/t}\, h_{t-1},\ c_t |g_t|\right) + \epsilon$
6:   $x_{t+1} = \arg\min_{x \in \mathcal{X}} \{\alpha_t \langle g_t, x \rangle + \alpha_t \phi(x) + \langle x - x_t, \mathrm{diag}(h_t)(x - x_t) \rangle\}$
7: end for

Note that the diagonal proximal function performs all operations coordinate-wise, so we start our analysis from a single dimension $i$. Denote the $i$-th dimensional component of the regret bound as
$$\bar{R}^{(i)}(T) := \sum_{t=1}^T \left(\frac{h_{t,i}}{\alpha_t} - \frac{h_{t-1,i}}{\alpha_{t-1}}\right) D_{t,\infty}^2 + \sum_{t=1}^T \frac{\alpha_t g_{t,i}^2}{2 h_{t,i}}. \quad (6)$$
For a sequence of gradients $g_1, g_2, \cdots, g_T$, denote the time steps $t$ at which $h_{t,i} = c_t |g_{t,i}|$ by $\tau_1^{(i)}, \tau_2^{(i)}, \cdots, \tau_{m_i}^{(i)}$. Note that the $\tau_j^{(i)}$'s may differ across dimensions, so the analysis applies to each dimension independently. These are the time steps at which the gradient term $c_t |g_t|$ dominates $h_t$ on the $i$-th dimension, i.e. $h_{t,i} = c_t |g_{t,i}|$ when $t = \tau_j^{(i)}$, $\forall j = 1, \cdots, m_i$. Since $h_{t,i}$ equals $\sqrt{(t-1)/t}\, h_{t-1,i}$ between the $\tau_j^{(i)}$'s, which is exactly the situation in Section 3, the increment of the first term on the right-hand side of (6) is always 0 there. Therefore, we have the following proposition.

Proposition 4.1 For any $\tau \in (\tau_j^{(i)}, \tau_{j+1}^{(i)})$, the regret bound increment on the $i$-th dimension is
$$\bar{R}^{(i)}(\tau) - \bar{R}^{(i)}(\tau_j^{(i)}) = \sum_{t=\tau_j^{(i)}+1}^{\tau} \frac{\alpha_t g_{t,i}^2}{2 h_{t,i}}.$$
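The following is a minimal runnable sketch of Algorithm 1 for $\phi = 0$ and a box-shaped feasible set, in which case step 6 reduces (up to a constant factor in the step size, per the equivalence shown in Appendix A.1) to a preconditioned projected update. The box bounds and the gradient oracle `grad_fn` are illustrative assumptions, not part of the paper's algorithm:

```python
import numpy as np

def amx(grad_fn, x0, T, alpha=0.1, c_fn=lambda t: 1.0, eps=1e-8, lo=-1.0, hi=1.0):
    """Sketch of AMX (Algorithm 1) with phi = 0 on the box [lo, hi]^d.
    grad_fn(x, t) returns the (sub)gradient g_t at the iterate x_t."""
    x = np.array(x0, dtype=float)
    h = np.zeros_like(x)
    for t in range(1, T + 1):
        g = grad_fn(x, t)
        # step 5: h_t = max(sqrt((t-1)/t) h_{t-1}, c_t |g_t|) + eps
        h = np.maximum(np.sqrt((t - 1) / t) * h, c_fn(t) * np.abs(g)) + eps
        # step 6 for a box: preconditioned step followed by clipping
        x = np.clip(x - (alpha / np.sqrt(t)) * g / h, lo, hi)
    return x
```

On the toy problem $f_t(x) = x^2$ over $[-1, 1]$, the iterate contracts towards the minimizer 0 without leaving the box, since the history term $\sqrt{(t-1)/t}\,h_{t-1}$ keeps $h_t$ from collapsing faster than the gradients.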
The above proposition indicates that the regret increments between the $\tau_j^{(i)}$'s involve only the second term of $\bar{R}^{(i)}(T)$. Note that the first term of $\bar{R}^{(i)}(T)$ is a major reason why the regret bound is $O(\sqrt{T})$, because the summation contains a $1/\alpha_T$ term. Such designs of $h_{t,i}$ therefore constrain the regret increments between the $\tau_j^{(i)}$'s and hence can potentially keep the regret bound small. More importantly, Proposition 4.1 also holds for the time steps between the last $\tau_{m_i}^{(i)}$ and $T + 1$. Let $D_\infty$ be the bounded diameter of the parameter space $\mathcal{X}$. Because the regret bound increment involves only the second term after the last $\tau_{m_i}^{(i)}$, the bound in equation (6) becomes
$$\bar{R}^{(i)}(T) \le \sum_{t=1}^T \left(\frac{h_{t,i}}{\alpha_t} - \frac{h_{t-1,i}}{\alpha_{t-1}}\right) D_\infty^2 + \sum_{t=1}^T \frac{\alpha_t g_{t,i}^2}{2 h_{t,i}} \le \frac{D_\infty^2 \sqrt{\tau_{m_i}^{(i)}}}{\alpha}\, h_{\tau_{m_i}^{(i)}, i} + \sum_{t=1}^T \frac{\alpha}{2\sqrt{t}\, c_t} |g_{t,i}|. \quad (7)$$
Since $\bar{R}(T) = \sum_{i=1}^d \bar{R}^{(i)}(T)$, the total regret upper bound across all dimensions is
$$R(T) \le \sum_{i=1}^d \bar{R}^{(i)}(T) \le \sum_{i=1}^d \frac{D_\infty^2 \sqrt{\tau_{m_i}^{(i)}}}{\alpha}\, h_{\tau_{m_i}^{(i)}, i} + \sum_{i=1}^d \sum_{t=1}^T \frac{\alpha}{2\sqrt{t}\, c_t} |g_{t,i}|.$$
The first term on the right-hand side can be considered better than the $O(\sqrt{T})$ regret bound of SGD or common adaptive algorithms when some or all of the $\tau_{m_i}^{(i)}$ are much smaller than $T$ and the $h_{t,i}$'s are bounded. Therefore, if we can ensure that the second term also grows much more slowly than $O(\sqrt{T})$, which is decided by the design of $c_t$, then the AMX class of algorithms is potentially much faster than SGD and the other adaptive algorithms. Note that we need $h_{t,i}$ to be bounded in the above arguments, so $c_t$ can be at most a constant. Fortunately, one simple yet very effective design we have found is $c_t = 1$. We formalize the above statements in the following theorem.

Theorem 4.1 Let $\{x_t\}$ and $\{h_t\}$ be the sequences obtained from Algorithm 1 with $\alpha_t = \alpha/\sqrt{t}$ and $c_t = 1$. Let $\{\tau_{m_i}^{(i)}\}_{i=1}^d$ be the largest time steps $t$ on each dimension at which $h_{t,i} = c_t |g_{t,i}|$. Assume that $\mathcal{F}$ has bounded diameter $\|x - y\|_\infty \le D_\infty$, $\forall x, y \in \mathcal{F}$.
Then we have the following bound on the regret:
$$R(T) \le \sum_{i=1}^d \frac{D_\infty^2 \sqrt{\tau_{m_i}^{(i)}}}{\alpha}\, h_{\tau_{m_i}^{(i)}, i} + \frac{\alpha}{2} \sum_{i=1}^d \sqrt{1 + \log \tau_{m_i}^{(i)}}\, \|g_{1:\tau_{m_i}^{(i)}, i}\|_2 + \frac{\alpha}{2} \sum_{i=1}^d \sqrt{\tau_{m_i}^{(i)}}\, \log\left(\frac{T}{\tau_{m_i}^{(i)}}\right) |g_{\tau_{m_i}^{(i)}, i}|.$$
An important remark is that using a different constant $c_t = c$ in Theorem 4.1 is equivalent to scaling the step size by $1/c$, as it magnifies all $|g_t|$ simultaneously in the algorithm. Using a decaying $c_t$ can further improve the first term in Theorem 4.1, but it also enlarges the second and third terms, so the regret may no longer be $O(\sqrt{T})$. We focus on $c_t = 1$ to prove that AMX can potentially converge faster in the rest of this paper; a detailed discussion of the possible choices of $c_t$ for future designs of the AMX algorithm is provided in Appendix A.5, which proves that $c_t$ cannot be $O(1/\sqrt{t})$ without further assumptions. Now, since most of the bound in Theorem 4.1 depends on the time steps $\{\tau_{m_i}^{(i)}\}_{i=1}^d$ instead of $T$ (only through a log term), the specific AMX algorithm can be potentially much faster than common adaptive algorithms. To make our argument clearer, we propose the following corollary.

Corollary 4.1 Let $\tau = \max_i \tau_{m_i}^{(i)}$. Under the assumptions of Theorem 4.1, the regret bound is of size $O\left(\max\left(\sqrt{\tau},\ \sqrt{\tau} \log \frac{T}{\tau}\right)\right)$.

The corollary indicates that the regret bound is approximately of size $\tilde{O}(\sqrt{\tau})$ if we omit the log terms. As far as we are aware, this is the first algorithm with a regret bound that is not asymptotically $O(\sqrt{T})$, so AMX can be potentially much faster than any existing algorithm. The time step $\tau$ can be understood as "the time when the gradients start to converge", and whether it accelerates convergence depends on the distribution of the gradients. For example, if $\tau = \sqrt{T}$, the regret bound is only of size $O(T^{1/4} \log T)$. We emphasize that a small $\tau$ is not an assumption on the gradient distribution, but rather a condition under which the regret bound increases only logarithmically, so the algorithm converges very fast.
Moreover, the regret bounds of the other adaptive algorithms grow to $O(\sqrt{T})$ even under such conditions, because their regret bound increments are not minimized. We use a simple example to illustrate why AMX has this unique advantage.

Example. Suppose the domain is the hyper-cube $\mathcal{X} = \{\|x\|_\infty \le 1\}$, so $D_\infty = 2$. Assume that on each dimension the gradient decreases as $|g_{t,i}| = (1/\sqrt{t})|g_{1,i}|$, with $|g_{1,i}| \le 1$, $\forall i$. Note that this is an example where adaptive algorithms should work well, since $\|g_{1:T,i}\|_2 \le |g_{1,i}|\sqrt{1 + \log T} \ll \sqrt{T}$. A very important property of AMX in this case is that $\tau$ is the first time step, so its regret bound increases only logarithmically. The regret bounds of the other algorithms, however, still grow to $O(\sqrt{T})$. We plot the regret bounds of AMX, AMSGrad and AdaGrad in Figure 4. Note that the regret bound of AMX increases much more slowly than those of AdaGrad and AMSGrad, hence AMX is much faster in this example. One may argue that the example is extreme, since $\tau = 1$ rarely happens in practice. However, the regret in this example can be understood as the regret increment after $\tau$ in real training processes: before $\tau$, AMX is only asymptotically as fast as the other adaptive algorithms, but after $\tau$, since the regret increment of AMX is very small, it converges very fast. More details of this example can be found in Appendix D.1. Besides, since the term $\sqrt{\tau}\log(T/\tau)$ in Corollary 4.1 is at most $O(\sqrt{T})$, the AMX algorithm is at least as fast as AdaGrad and AMSGrad under the same assumptions. We propose the following theorem, corresponding to the general results in Section 2, to prove this claim.

Theorem 4.2 Let $\{x_t\}$ and $\{h_t\}$ be the sequences obtained from Algorithm 1 with $\alpha_t = \alpha/\sqrt{t}$ and $c_t = 1$. Assume that $\mathcal{F}$ has bounded diameter $\|x - y\|_\infty \le D_\infty$, $\forall x, y \in \mathcal{F}$. Then we have the following bound on the regret:
R(T ) ≤ D 2 ∞ √ T α d i=1 h T,i + α 2 1 + log T d i=1 g 1:T,i 2 , The above bound can be considered as being better than the regret of SGD, i.e., O( Duchi et al., 2011) . Therefore, AMX can be at least much faster than SGD when the gradients are sparse or small, and it can be potentially even faster. To keep up with the current popular adaptive algorithms such as Adam, we also provide the detailed implementation of adding first order momentum into AMX and include some discussions on its convergence properties in Appendix C. Similar to Algorithm 1, the AMX with momentum algorithm has a regret bound that (mostly) depends on τ instead of T and hence enjoys the acceleration. √ dT ), when d i=1 h T,i √ d and d i=1 g 1:T,i 2 √ dT (

5. EXPERIMENTS

In this section, we evaluate the effectiveness of the specific AMX algorithm proposed in Section 4 (i.e. $c_t = 1$) on different deep learning tasks. We relegate the details of parameter tuning and step size decay strategies to Appendix D.2-D.5. Moreover, an empirical analysis showing that different designs of $\{c_t\}_{t=1}^T$ yield different performances is provided in Appendix D.6. We compared our AMX algorithm with SGD with momentum (SGDM), Adam, AdaGrad and AMSGrad on different deep learning tasks. The hyper-parameter of AMX was set to $c_t = 1$ throughout this section. For the language modeling and neural machine translation tasks, because SGDM typically performs much worse than adaptive algorithms, we did not include it in the comparisons. Following Loshchilov & Hutter (2019), we used decoupled weight decay in all the algorithms.

Image Classification. We first conducted experiments in which $\tau$ was possibly very large and AMX was only as fast as the other adaptive algorithms, yet still achieved better testing performance. The image classification task was performed on the CIFAR (Krizhevsky et al., 2009) datasets. We used the publicly available code by Li et al. (2020) and DeVries & Taylor (2017) to train ResNet-20 and ResNet-18 (He et al., 2016) on CIFAR-10 and CIFAR-100 respectively, with a batch size of 128. The performances of the different algorithms are summarized in Figure 2 and Table 1. As observed, AMX started slightly more slowly, but it quickly caught up with the other adaptive algorithms and converged much faster than SGDM. This was possibly because the time when the gradients start to converge (the $\tau$ of Section 4) was large in the image classification tasks, so AMX could only converge asymptotically as fast as the other adaptive algorithms, in line with Theorem 4.1. However, its final testing performance was as good as that of SGDM, so it converged both fast and well at the same time.
The other adaptive algorithms such as Adam and AMSGrad trained faster in the beginning, but ended up with much worse final accuracy than SGDM and AMX.

Image Segmentation. Further experiments supported our claim that AMX can be much faster and produce even better testing performance. For the segmentation task, we used the publicly released implementation of the Deeplab model (Chen et al., 2016) by Kazuto1011 (2016) and evaluated the algorithms on the PASCAL VOC2012 Segmentation dataset (Everingham et al., 2014). We used a small batch size of 4 and a polynomially decaying step size over 20k iterations. The trained models were evaluated at 5k, 10k, 15k and 20k iterations, with mean Intersection over Union (IoU) as the evaluation metric. The results are provided in Figures 3(a), 3(b) and Table 1. As shown, AMX was not only the fastest adaptive algorithm but also achieved the best IoU score, comparable to that of SGDM. The other algorithms did not perform similarly.

Language Modeling. We trained three-layer LSTMs (Hochreiter & Schmidhuber, 1997) on the character-level Penn Tree Bank (PTB) dataset, adopting the general setup of Merity et al. (2017). Specifically, we trained the model for 500 epochs with batch size 128. The validation perplexity curve and the final validation perplexity are shown in Figure 3(c) and Table 2. AMX was the fastest algorithm and achieved the lowest perplexity among all the adaptive algorithms, supporting our claim that AMX can be much faster.

Neural Machine Translation. We used the publicly released code by pcyin (2018) to train basic attentional neural machine translation models (Luong et al., 2015) on the IWSLT'14 DE-EN (Ranzato et al., 2015) dataset. We used a batch size of 64 and decreased the step size by a factor of 2 every 5 iterations.
The validation perplexity curve and the final BLEU score were reported in Figure 3 (d) and Table 2 . AMX not only had a much smoother validation perplexity curve, but also achieved the best BLEU score among all the adaptive algorithms, showing that AMX was indeed a better choice.

6. CONCLUSION

In this paper, we propose a design of the best proximal function at each time step based on marginal regret bound minimization. We then show that a more general class of adaptive algorithms can not only achieve marginal optimality in this sense, but also converge much faster than existing adaptive algorithms, depending on the distribution of the gradients. We evaluate one particular instance of this new class of algorithms on different deep learning tasks and show its effectiveness. This work provides a new framework for adaptive algorithms and can hopefully obviate the trial-and-error search for better designs of the proximal function. Future research can concentrate on finding better choices of the sequence $\{c_t\}_{t=1}^T$ to obtain better algorithms.

A PROOFS OF RESULTS IN THE MAIN TEXT

In this appendix, we provide proofs of all the results and theorems in the main text, except for Theorem 4.2, which is proved in Section C since it is the special case $\beta_1 = 0$ of the momentum design. An important remark is that adding $\epsilon$ to $h_t$ does not affect the proofs: when $h_{t,i}$ appears in a denominator, $1/h_{t,i} \ge 1/(h_{t,i} + \epsilon)$, and when $h_{t,i}$ appears in a numerator, we generally use $h_{t,i}$ without assigning it a specific form. We also set $\epsilon$ to be very small ($10^{-8}$) in the experiments.

Further Notations. In addition to the notations in the main text, the following are needed in this appendix. Given a norm $\|\cdot\|$, its dual norm is defined as $\|y\|_* = \sup_{\|x\| \le 1} \langle x, y \rangle$. For example, the dual norm of the Mahalanobis norm $\|\cdot\|_A = \sqrt{\langle \cdot, A\,\cdot \rangle}$, $A \succ 0$, is the Mahalanobis norm $\|\cdot\|_{A^{-1}}$. A function $f$ is 1-strongly convex with respect to the norm $\|\cdot\|$ if $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{2}\|x - y\|^2$.

A.1 EQUIVALENCE BETWEEN COMPOSITE MIRROR DESCENT AND GENERALIZED PROJECTED ALGORITHMS

Beginning with the generalized projected adaptive algorithm, we find that
$$x_{t+1} = \Pi_{\mathcal{X}, H_t}(x_t - \alpha_t H_t^{-1} g_t) := \arg\min_{x \in \mathcal{X}} \|x - (x_t - \alpha_t H_t^{-1} g_t)\|_{H_t}^2 = \arg\min_{x \in \mathcal{X}} \|H_t^{1/2}(x - (x_t - \alpha_t H_t^{-1} g_t))\|^2$$
$$= \arg\min_{x \in \mathcal{X}} \|H_t^{1/2}(x - x_t) + \alpha_t H_t^{-1/2} g_t\|^2 = \arg\min_{x \in \mathcal{X}} \{2\alpha_t \langle g_t, x - x_t \rangle + \langle x - x_t, H_t(x - x_t) \rangle\}$$
$$= \arg\min_{x \in \mathcal{X}} \{2\alpha_t \langle g_t, x \rangle + \langle x - x_t, H_t(x - x_t) \rangle\}. \quad (12)$$
Given $\psi_t(x) = \langle x, H_t x \rangle$, it can be shown directly (also in A.3) that $B_{\psi_t}(x, x_t) = \langle x - x_t, H_t(x - x_t) \rangle$. Comparing equation (12) with equation (1), we see that the two rules are equivalent when the regularization term $\phi(x) = 0$, except that the step sizes $\alpha_t$ need to be doubled.
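The equivalence above can be verified numerically in the separable case: with $H_t = \mathrm{diag}(h)$, $\phi = 0$, and a box-shaped feasible set (an illustrative assumption), both argmins have closed forms, coordinate-wise quadratic minimization followed by clipping, and the mirror descent step with step size $2\alpha_t$ matches the generalized projection with step size $\alpha_t$:

```python
import numpy as np

def mirror_step(x, g, h, alpha, lo=-1.0, hi=1.0):
    """argmin over the box of alpha<g, y> + <y - x, diag(h)(y - x)>:
    the objective is separable and quadratic in each coordinate, so the
    unconstrained minimizer x - alpha*g/(2h) is simply clipped to [lo, hi]."""
    return np.clip(x - alpha * g / (2 * h), lo, hi)

def projection_step(x, g, h, alpha, lo=-1.0, hi=1.0):
    """Pi_{X,H}(x - alpha H^{-1} g); for diag(h) and a box, the weighted
    projection reduces to ordinary clipping."""
    return np.clip(x - alpha * g / h, lo, hi)
```

Doubling the step size of `mirror_step` reproduces `projection_step` exactly, which is the content of the equivalence.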

A.2 PROOF OF LEMMA A.1

Lemma A.1 Let $\{x_t\}$ be a sequence of outputs of the update rule (1) and assume that each $\psi_t$ is 1-strongly convex with respect to the norm $\|\cdot\|_{\psi_t}$, with corresponding dual norm $\|\cdot\|_{\psi_t^*}$. Also, without loss of generality, let $B_{\psi_0}(x^*, x_0) = 0$. Then for any $x^*$,
$$\sum_{t=1}^T f_t(x_t) - f_t(x^*) + \phi(x_t) - \phi(x^*) \le \sum_{t=1}^T \left[\frac{B_{\psi_t}(x^*, x_t)}{\alpha_t} - \frac{B_{\psi_{t-1}}(x^*, x_t)}{\alpha_{t-1}}\right] + \sum_{t=1}^T \frac{\alpha_t}{2}\|f_t'(x_t)\|_{\psi_t^*}^2.$$
Proof. Since $x_{t+1}$ satisfies equation (1), for all $x \in \mathcal{X}$ and $\phi'(x_{t+1}) \in \partial\phi(x_{t+1})$,
$$\langle x - x_{t+1},\ \alpha_t g_t + \alpha_t \phi'(x_{t+1}) + \nabla\psi_t(x_{t+1}) - \nabla\psi_t(x_t) \rangle \ge 0. \quad (14)$$
Also, since the $f_t$'s are convex functions, $f_t(x) \ge f_t(x_t) + \langle f_t'(x_t), x - x_t \rangle$, and likewise for $\phi(x_{t+1})$. We therefore have
$$\alpha_t(f_t(x_t) - f_t(x^*)) + \alpha_t(\phi(x_{t+1}) - \phi(x^*)) \le \alpha_t \langle f_t'(x_t), x_t - x^* \rangle + \alpha_t \langle \phi'(x_{t+1}), x_{t+1} - x^* \rangle$$
$$= \alpha_t \langle f_t'(x_t), x_{t+1} - x^* \rangle + \alpha_t \langle \phi'(x_{t+1}), x_{t+1} - x^* \rangle + \alpha_t \langle f_t'(x_t), x_t - x_{t+1} \rangle$$
$$= \langle x^* - x_{t+1},\ \nabla\psi_t(x_{t+1}) - \nabla\psi_t(x_t) - \alpha_t f_t'(x_t) - \alpha_t \phi'(x_{t+1}) \rangle + \langle x^* - x_{t+1},\ \nabla\psi_t(x_t) - \nabla\psi_t(x_{t+1}) \rangle + \alpha_t \langle f_t'(x_t), x_t - x_{t+1} \rangle$$
$$\le \langle x^* - x_{t+1},\ \nabla\psi_t(x_t) - \nabla\psi_t(x_{t+1}) \rangle + \alpha_t \langle f_t'(x_t), x_t - x_{t+1} \rangle,$$
where the first inequality is due to the convexity of $\phi$ and $f_t$, and the second inequality is due to the non-negativity in equation (14). By definition,
$$B_{\psi_t}(x^*, x_t) = \psi_t(x^*) - \psi_t(x_t) - \langle \nabla\psi_t(x_t), x^* - x_t \rangle,$$
$$B_{\psi_t}(x^*, x_{t+1}) = \psi_t(x^*) - \psi_t(x_{t+1}) - \langle \nabla\psi_t(x_{t+1}), x^* - x_{t+1} \rangle,$$
$$B_{\psi_t}(x_{t+1}, x_t) = \psi_t(x_{t+1}) - \psi_t(x_t) - \langle \nabla\psi_t(x_t), x_{t+1} - x_t \rangle. \quad (16)$$
Therefore,
$$\alpha_t(f_t(x_t) - f_t(x^*)) + \alpha_t(\phi(x_{t+1}) - \phi(x^*)) \le B_{\psi_t}(x^*, x_t) - B_{\psi_t}(x^*, x_{t+1}) - B_{\psi_t}(x_{t+1}, x_t) + \alpha_t \langle f_t'(x_t), x_t - x_{t+1} \rangle$$
$$\le B_{\psi_t}(x^*, x_t) - B_{\psi_t}(x^*, x_{t+1}) - B_{\psi_t}(x_{t+1}, x_t) + \frac{1}{2}\|x_t - x_{t+1}\|_{\psi_t}^2 + \frac{\alpha_t^2}{2}\|f_t'(x_t)\|_{\psi_t^*}^2$$
$$\le B_{\psi_t}(x^*, x_t) - B_{\psi_t}(x^*, x_{t+1}) + \frac{\alpha_t^2}{2}\|f_t'(x_t)\|_{\psi_t^*}^2,$$
where the second inequality is due to Fenchel's inequality applied to the conjugate pair $\frac{1}{2}\|\cdot\|_{\psi_t}^2$ and $\frac{1}{2}\|\cdot\|_{\psi_t^*}^2$.
The last inequality is due to the strong convexity of $\psi_t$, which gives $B_{\psi_t}(x_{t+1}, x_t) \ge \frac{1}{2}\|x_t - x_{t+1}\|_{\psi_t}^2$. The lemma then follows by dividing both sides by $\alpha_t$ and summing over $t$.

A.3 PROOF OF THEOREM 2.1

Proof. When $\psi_t(x) = \langle x, H_t x \rangle$, the dual norm $\|\cdot\|_{\psi_t^*}$ is the Mahalanobis norm $\|\cdot\|_{H_t^{-1}}$, and
$$B_{\psi_t}(x, y) = \psi_t(x) - \psi_t(y) - \langle \nabla\psi_t(y), x - y \rangle = \langle x, H_t x \rangle - \langle y, H_t y \rangle - 2\langle H_t y, x - y \rangle = \langle x - y, H_t(x - y) \rangle.$$
Therefore, by Lemma A.1,
$$\sum_{t=1}^T f_t(x_t) - f_t(x^*) + \phi(x_t) - \phi(x^*) \le \sum_{t=1}^T \left[\frac{B_{\psi_t}(x^*, x_t)}{\alpha_t} - \frac{B_{\psi_t}(x^*, x_{t+1})}{\alpha_t}\right] + \sum_{t=1}^T \frac{\alpha_t}{2}\|f_t'(x_t)\|_{\psi_t^*}^2$$
$$= \sum_{t=1}^T \left[\frac{\|H_t^{1/2}(x_t - x^*)\|^2}{\alpha_t} - \frac{\|H_t^{1/2}(x_{t+1} - x^*)\|^2}{\alpha_t}\right] + \sum_{t=1}^T \frac{\alpha_t}{2}\|f_t'(x_t)\|_{\psi_t^*}^2$$
$$\le \frac{\|H_1^{1/2}(x_1 - x^*)\|^2}{\alpha_1} + \sum_{t=2}^T \left[\frac{\|H_t^{1/2}(x_t - x^*)\|^2}{\alpha_t} - \frac{\|H_{t-1}^{1/2}(x_t - x^*)\|^2}{\alpha_{t-1}}\right] + \sum_{t=1}^T \frac{\alpha_t}{2}\|f_t'(x_t)\|_{\psi_t^*}^2$$
$$\le \sum_{i=1}^d \frac{h_{1,i}}{\alpha_1}(x_{1,i} - x_i^*)^2 + \sum_{t=2}^T \sum_{i=1}^d \left(\frac{h_{t,i}}{\alpha_t} - \frac{h_{t-1,i}}{\alpha_{t-1}}\right)(x_{t,i} - x_i^*)^2 + \sum_{t=1}^T \frac{\alpha_t}{2}\|f_t'(x_t)\|_{\psi_t^*}^2$$
$$= \sum_{t=1}^T \sum_{i=1}^d \left(\frac{h_{t,i}}{\alpha_t} - \frac{h_{t-1,i}}{\alpha_{t-1}}\right)(x_{t,i} - x_i^*)^2 + \sum_{t=1}^T \sum_{i=1}^d \frac{\alpha_t f_{t,i}'(x_t)^2}{2 h_{t,i}}$$
$$\le \sum_{t=1}^T \sum_{i=1}^d \left(\frac{h_{t,i}}{\alpha_t} - \frac{h_{t-1,i}}{\alpha_{t-1}}\right) D_{t,\infty}^2 + \sum_{t=1}^T \sum_{i=1}^d \frac{\alpha_t f_{t,i}'(x_t)^2}{2 h_{t,i}},$$
where the second inequality is obtained by re-arranging the sum and dropping the last negative term.

A.4 PROOF OF THEOREM 4.1

Proof. Based on Theorem 2.1 and Proposition 4.1, take $h_t$ to be the design in Algorithm 1 with $c_t = 1$. Then, after the final $\tau_{m_i}^{(i)}$ (i.e. the last "large" gradient) on each dimension $i$, the first term does not increase and the second term can be bounded as follows.
$$\begin{aligned}
R(T) &\le \sum_{t=1}^T\sum_{i=1}^d \left(\frac{h_{t,i}}{\alpha_t} - \frac{h_{t-1,i}}{\alpha_{t-1}}\right)D_{t,\infty}^2 + \sum_{t=1}^T\sum_{i=1}^d \frac{\alpha_t g_{t,i}^2}{2h_{t,i}}\\
&\le \sum_{i=1}^d \frac{h_{\tau_{m_i}^{(i)},i}}{\alpha_{\tau_{m_i}^{(i)}}}D_\infty^2 + \sum_{t=1}^T\sum_{i=1}^d \frac{\alpha_t}{2}|g_{t,i}|\\
&= \sum_{i=1}^d \frac{h_{\tau_{m_i}^{(i)},i}}{\alpha_{\tau_{m_i}^{(i)}}}D_\infty^2 + \sum_{t=1}^{\tau_{m_i}^{(i)}}\sum_{i=1}^d \frac{\alpha_t}{2}|g_{t,i}| + \sum_{i=1}^d\sum_{t=\tau_{m_i}^{(i)}+1}^T \frac{\alpha_t}{2}|g_{t,i}|\\
&\le \sum_{i=1}^d \frac{h_{\tau_{m_i}^{(i)},i}\sqrt{\tau_{m_i}^{(i)}}}{\alpha}D_\infty^2 + \sum_{t=1}^{\tau_{m_i}^{(i)}}\sum_{i=1}^d \frac{\alpha_t}{2}|g_{t,i}| + \sum_{i=1}^d\sum_{t=\tau_{m_i}^{(i)}+1}^T \frac{\alpha\sqrt{\tau_{m_i}^{(i)}}}{2t}|g_{\tau_{m_i}^{(i)},i}|\\
&\le \sum_{i=1}^d \frac{h_{\tau_{m_i}^{(i)},i}\sqrt{\tau_{m_i}^{(i)}}}{\alpha}D_\infty^2 + \sum_{t=1}^{\tau_{m_i}^{(i)}}\sum_{i=1}^d \frac{\alpha_t}{2}|g_{t,i}| + \sum_{i=1}^d \frac{\alpha}{2}\sqrt{\tau_{m_i}^{(i)}}\,|g_{\tau_{m_i}^{(i)},i}|\log\Big(\frac{T}{\tau_{m_i}^{(i)}}\Big)\\
&\le \sum_{i=1}^d \frac{D_\infty^2\sqrt{\tau_{m_i}^{(i)}}}{\alpha}h_{\tau_{m_i}^{(i)},i} + \sum_{i=1}^d \frac{\alpha}{2}\sqrt{1 + \log\tau_{m_i}^{(i)}}\,\|g_{1:\tau_{m_i}^{(i)},i}\|_2 + \sum_{i=1}^d \frac{\alpha}{2}\sqrt{\tau_{m_i}^{(i)}}\,|g_{\tau_{m_i}^{(i)},i}|\log\Big(\frac{T}{\tau_{m_i}^{(i)}}\Big), \quad (20)
\end{aligned}$$
where the first inequality is by Theorem 2.1 and the second one is by Proposition 4.1. The third inequality is by the fact that $|g_{t,i}| \le \sqrt{\tau_{m_i}^{(i)}/t}\,h_{\tau_{m_i}^{(i)},i} = \sqrt{\tau_{m_i}^{(i)}/t}\,|g_{\tau_{m_i}^{(i)},i}|$ for all $t \ge \tau_{m_i}^{(i)} + 1$. The fourth inequality is by $\sum_{t=\tau_{m_i}^{(i)}+1}^T 1/t \le \int_{\tau_{m_i}^{(i)}}^T \frac{1}{t}\,dt = \log(T/\tau_{m_i}^{(i)})$. The last inequality is by the fact that
$$\sum_{t=1}^{\tau_{m_i}^{(i)}}\sum_{i=1}^d \frac{\alpha_t}{2}|g_{t,i}| \le \sum_{i=1}^d \frac{\alpha}{2}\|g_{1:\tau_{m_i}^{(i)},i}\|_2\sqrt{\sum_{t=1}^{\tau_{m_i}^{(i)}}\frac{1}{t}} \le \sum_{i=1}^d \frac{\alpha}{2}\sqrt{1 + \log\tau_{m_i}^{(i)}}\,\|g_{1:\tau_{m_i}^{(i)},i}\|_2, \quad (21)$$
where the first inequality is by the Cauchy–Schwarz inequality.
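The logarithmic tail bound used in the fourth inequality above can be sanity-checked numerically; the helper name below is ours, not from the paper.

```python
import math

def harmonic_tail(tau, T):
    """Sum of 1/t for t = tau+1, ..., T."""
    return sum(1.0 / t for t in range(tau + 1, T + 1))

# The proof bounds this tail by the integral of 1/t over [tau, T], i.e. log(T / tau).
for tau, T in [(1, 100), (10, 1000), (50, 50000)]:
    assert harmonic_tail(tau, T) <= math.log(T / tau)
```

The comparison with the integral is tight up to an additive constant, which is why the regret bound picks up only a $\log(T/\tau)$ factor on the tail.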

A.5 DISCUSSIONS ABOUT THE POSSIBLE DESIGNS OF c t

We first show our claim that, given a design of $c_t$, scaling it by a constant $a$ is the same as scaling the step size $\alpha_t$ by $1/a$. Note that the maximum operation commutes with scaling by a positive constant; therefore, by unrolling the maximum operation,
$$\begin{aligned}
h_t &= \max\Big(\sqrt{\tfrac{t-1}{t}}\,h_{t-1},\; a c_t|g_t|\Big) = a\cdot\max\Big(\tfrac{1}{a}\sqrt{\tfrac{t-1}{t}}\,h_{t-1},\; c_t|g_t|\Big)\\
&= a\cdot\max\Big(\tfrac{1}{a}\sqrt{\tfrac{t-1}{t}}\max\big(\sqrt{\tfrac{t-2}{t-1}}\,h_{t-2},\; a c_{t-1}|g_{t-1}|\big),\; c_t|g_t|\Big)\\
&= a\cdot\max\Big(\tfrac{1}{a}\sqrt{\tfrac{t-2}{t}}\,h_{t-2},\; \sqrt{\tfrac{t-1}{t}}\,c_{t-1}|g_{t-1}|,\; c_t|g_t|\Big)\\
&= \cdots = a\cdot\max_{0\le j\le t-1}\Big\{\sqrt{\tfrac{t-j}{t}}\,c_{t-j}|g_{t-j}|\Big\}.
\end{aligned}$$
Therefore, scaling $c_t$ by $a$ magnifies all the terms $c_t|g_t|$ at the same time, which is equivalent to scaling $\alpha_t$ by $1/a$. Given the discussions in Section 4, we know that $c_t$ can be at most a constant: if $c_t$ grows strictly faster than $\Omega(1)$, then the regret is strictly worse than $O(\sqrt{T})$. We could instead use a decaying $c_t$. However, if $c_t$ is too small, the regret may also be strictly worse than $O(\sqrt{T})$, because the second term in Theorem 2.1 can become very large. We show that $c_t = O(1/\sqrt{t})$ is unacceptable by giving a counterexample sequence $g_t$ for which the regret becomes large. Suppose $|g_t| = (1/t)^{1/4}|g_1|$ with $|g_{1,i}| > 0$ for all $i$, and $c_t = 1/\sqrt{t}$. Since the size of the gradients keeps decreasing,
$$h_t = \max_{0\le j\le t-1}\Big\{\sqrt{\tfrac{t-j}{t}}\,c_{t-j}|g_{t-j}|\Big\} = \max_{0\le j\le t-1}\Big\{\sqrt{\tfrac{t-j}{t}}\frac{1}{\sqrt{t-j}}|g_{t-j}|\Big\} = \frac{1}{\sqrt{t}}\max_{0\le j\le t-1}\{|g_{t-j}|\} = \frac{1}{\sqrt{t}}|g_1|. \quad (23)$$
This gradient distribution meets the requirement for adaptive algorithms to work well, because $\|g_{1:T,i}\|_2 = |g_{1,i}|\sqrt{\sum_{t=1}^T 1/\sqrt{t}} = O(T^{1/4})$. Now, although the first term of the regret bound in Theorem 2.1 becomes a constant, the second term becomes very large:
$$R(T) \le \sum_{t=1}^T\sum_{i=1}^d \Big(\frac{h_{t,i}}{\alpha_t} - \frac{h_{t-1,i}}{\alpha_{t-1}}\Big)D_{t,\infty}^2 + \sum_{t=1}^T\sum_{i=1}^d \frac{\alpha_t g_{t,i}^2}{2h_{t,i}} = \sum_{i=1}^d \frac{h_{1,i}}{\alpha_1}D_\infty^2 + \sum_{t=1}^T\sum_{i=1}^d \frac{\alpha g_{1,i}^2}{2\sqrt{t}\,|g_{1,i}|} = \sum_{i=1}^d \frac{h_{1,i}}{\alpha_1}D_\infty^2 + \sum_{i=1}^d\sum_{t=1}^T \frac{\alpha|g_{1,i}|}{2\sqrt{t}}. \quad (24)$$
Now notice that
$$\sqrt{T} \le \sum_{t=1}^T \frac{1}{\sqrt{t}} \le 2\sqrt{T} - 1. \quad (25)$$
Therefore the regret upper bound is already $\Theta(\sqrt{T})$.
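Two facts used above — that the unrolled maximum commutes with a constant scaling of $c_t$, and the two-sided bound (25) on $\sum_t 1/\sqrt{t}$ — can be checked numerically. The recursion below is a sketch of the diagonal $h_t$ update under our reading of Algorithm 1; the function name is ours.

```python
import math
import random

def h_sequence(grads, c, a=1.0):
    """Run the recursion h_t = max(sqrt((t-1)/t) * h_{t-1}, a * c(t) * |g_t|)."""
    h, hs = 0.0, []
    for t, g in enumerate(grads, start=1):
        h = max(math.sqrt((t - 1) / t) * h, a * c(t) * abs(g))
        hs.append(h)
    return hs

random.seed(0)
grads = [random.gauss(0, 1) for _ in range(200)]
c = lambda t: 1.0
h1 = h_sequence(grads, c, a=1.0)
h3 = h_sequence(grads, c, a=3.0)
# Scaling c_t by a scales every h_t by a, i.e. it only rescales the step size.
assert all(abs(3.0 * x - y) < 1e-12 for x, y in zip(h1, h3))

# The two-sided bound (25) on the sum of 1/sqrt(t).
for T in [10, 100, 10000]:
    s = sum(1.0 / math.sqrt(t) for t in range(1, T + 1))
    assert math.sqrt(T) <= s <= 2 * math.sqrt(T) - 1
```

The first assertion holds up to floating-point rounding, which is why a small tolerance is used rather than exact equality.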
Making the gradients even larger, for example $|g_t| = (1/t)^{1/8}|g_1|$, still satisfies the $\|g_{1:T,i}\|_2 \ll \sqrt{T}$ condition but makes the regret bound even larger ($\Theta(T^{3/4})$). Therefore, using $c_t = O(1/\sqrt{t})$ is unacceptable in terms of the regret bound. Note that we can of course change the initial time step; the argument still holds as long as the first few $c_t$'s are not smaller than $\Theta(1/\sqrt{t})$ in order. Next, we show that this regret bound is indeed attained by the regret itself. We propose the following theorem:

Theorem A.1 If $c_t = O(1/\sqrt{t})$ in Algorithm 1, then there exists an online convex optimization problem where $\|g_{1:T,i}\|_2 \ll \sqrt{T}$ and AMX has a regret of size $\Omega(\sqrt{T})$.

Proof. We recall the terms in the proof of Lemma A.1:
$$\begin{aligned}
\alpha_t(f_t(x_t) - f_t(x^*)) + \alpha_t(\phi(x_{t+1}) - \phi(x^*)) &\le \alpha_t\langle f_t'(x_t), x_t - x^*\rangle + \alpha_t\langle\phi'(x_{t+1}), x_{t+1} - x^*\rangle\\
&= \alpha_t\langle f_t'(x_t), x_{t+1} - x^*\rangle + \alpha_t\langle\phi'(x_{t+1}), x_{t+1} - x^*\rangle + \alpha_t\langle f_t'(x_t), x_t - x_{t+1}\rangle.
\end{aligned}$$
If we set $\phi(x) = 0$ in the above inequality and let $f_t(x) = \langle g_t, x\rangle$ be a linear loss function, then the first inequality becomes an equality. Moreover, it is possible to make the term $\alpha_t\langle f_t'(x_t), x_{t+1} - x^*\rangle > 0$ for all $t$: we only need the loss to be positive when $x_{t+1} > x^*$ and negative when $x_{t+1} < x^*$. Then the regret satisfies
$$R(T) = \sum_{t=1}^T f_t(x_t) - f_t(x^*) = \sum_{t=1}^T \langle f_t'(x_t), x_{t+1} - x^*\rangle + \langle f_t'(x_t), x_t - x_{t+1}\rangle \ge \sum_{t=1}^T \langle f_t'(x_t), x_t - x_{t+1}\rangle = \sum_{t=1}^T \Big\langle g_t, \frac{\alpha_t}{2}H_t^{-1}g_t\Big\rangle = \sum_{t=1}^T\sum_{i=1}^d \frac{\alpha_t g_{t,i}^2}{2h_{t,i}}.$$
The claims made before Theorem A.1 are precisely that this last term is $\Omega(\sqrt{T})$ for such designs of $c_t$, which proves the claim.

B FULL MATRIX PROXIMAL FUNCTION

When $H_t$ is not a diagonal matrix but a full matrix, we can similarly prove a general regret bound for any such matrix, given in Theorem B.1 below.

Theorem B.1 Let the sequence $\{x_t\}$ be defined by the update rule (1) and, for any $x^*$, denote $D_{t,2}^2 = \|x_t - x^*\|_2^2$. When $\psi_t(x) = \langle x, H_t x\rangle$, where $H_t \in \mathcal{S}_d^+$ is a general matrix, assume without loss of generality that $\phi(x_1) = 0$ and $H_0 = 0$. If $\mathrm{tr}\big(\frac{H_t}{\alpha_t}\big) \ge \mathrm{tr}\big(\frac{H_{t-1}}{\alpha_{t-1}}\big)$, then
$$R(T) \le \sum_{t=1}^T D_{t,2}^2\,\mathrm{tr}\Big(\frac{H_t}{\alpha_t} - \frac{H_{t-1}}{\alpha_{t-1}}\Big) + \sum_{t=1}^T \frac{\alpha_t}{2}g_t^T H_t^{-1}g_t.$$
Proof. When $H_t$ is a full matrix, similarly to Theorem 2.1, we can get the regret bound
$$R(T) \le \sum_{t=1}^T \frac{B_{\psi_t}(x^*, x_t)}{\alpha_t} - \frac{B_{\psi_{t-1}}(x^*, x_t)}{\alpha_{t-1}} + \frac{\alpha_t}{2}\langle g_t, H_t^{-1}g_t\rangle. \quad (29)$$
Let $\lambda_{\max}(M)$ denote the largest eigenvalue of a matrix $M$. Then
$$\frac{B_{\psi_t}(x^*, x_t)}{\alpha_t} - \frac{B_{\psi_{t-1}}(x^*, x_t)}{\alpha_{t-1}} = \Big\langle x^* - x_t, \Big(\frac{H_t}{\alpha_t} - \frac{H_{t-1}}{\alpha_{t-1}}\Big)(x^* - x_t)\Big\rangle \le \|x^* - x_t\|_2^2\,\lambda_{\max}\Big(\frac{H_t}{\alpha_t} - \frac{H_{t-1}}{\alpha_{t-1}}\Big) \le \|x^* - x_t\|_2^2\,\mathrm{tr}\Big(\frac{H_t}{\alpha_t} - \frac{H_{t-1}}{\alpha_{t-1}}\Big),$$
where the last step uses $\mathrm{tr}\big(\frac{H_t}{\alpha_t}\big) \ge \mathrm{tr}\big(\frac{H_{t-1}}{\alpha_{t-1}}\big)$ and $D_{t,2} = \|x_t - x^*\|_2$. Next, we propose a proposition showing that the optimal full-matrix solution is the one that attains the infimum of the marginal problem.

Proposition B.1

The following matrix
$$H_t^* = \max\left\{\frac{\mathrm{tr}\big(\sqrt{\tfrac{t-1}{t}}H_{t-1}\big)}{\mathrm{tr}\big(\big(\tfrac{\alpha_t^2}{2D_{t,2}^2}g_t g_t^T\big)^{1/2}\big)},\; 1\right\}\Big(\frac{\alpha_t^2}{2D_{t,2}^2}g_t g_t^T\Big)^{1/2} \qquad \text{(Full)}$$
and its Moore–Penrose pseudoinverse $H_t^{*-}$ give an infimum to problem (31), i.e., to
$$\inf_{H_t \succeq 0,\; \mathrm{tr}(H_t/\alpha_t) \ge \mathrm{tr}(H_{t-1}/\alpha_{t-1})}\; D_{t,2}^2\,\mathrm{tr}\Big(\frac{H_t}{\alpha_t}\Big) + \frac{\alpha_t}{2}\mathrm{tr}\big(H_t^{-1}g_t g_t^T\big).$$
Proof. As before, we first construct the Lagrangian of the marginal regret bound minimization problem. Let $\theta \ge 0$ denote the Lagrangian parameter for the trace constraint and $Z \succeq 0$ the one for the positive semidefiniteness constraint. Note that, for $\alpha_t = \alpha/\sqrt{t}$, the constraint $\mathrm{tr}(H_t/\alpha_t) \ge \mathrm{tr}(H_{t-1}/\alpha_{t-1})$ can be rewritten equivalently as $\mathrm{tr}(H_t) \ge \mathrm{tr}\big(\sqrt{\tfrac{t-1}{t}}H_{t-1}\big)$. The Lagrangian problem is
$$L(H_t, \theta, Z) = D_{t,2}^2\,\mathrm{tr}\Big(\frac{H_t}{\alpha_t}\Big) + \frac{\alpha_t}{2}\mathrm{tr}\big(H_t^{-1}g_t g_t^T\big) - \theta\Big(\mathrm{tr}(H_t) - \mathrm{tr}\Big(\sqrt{\tfrac{t-1}{t}}H_{t-1}\Big)\Big) - \mathrm{tr}(H_t Z).$$
Taking the derivative with respect to $H_t$, we get
$$\frac{D_{t,2}^2}{\alpha_t}I - \frac{\alpha_t}{2}H_t^{-1}g_t g_t^T H_t^{-1} - \theta I - Z = 0,$$
where $I$ is the identity matrix. If an invertible $H_t$ could be found, then the generalized complementarity condition (Boyd & Vandenberghe, 2004) would imply that $Z = 0$ and either $\theta = 0$ or $\mathrm{tr}\big(\frac{H_t}{\alpha_t}\big) = \mathrm{tr}\big(\frac{H_{t-1}}{\alpha_{t-1}}\big)$. If $\theta = 0$, then
$$H_t = \Big(\frac{\alpha_t^2}{2D_{t,2}^2}g_t g_t^T\Big)^{1/2} \quad \text{and} \quad \mathrm{tr}\Big(\frac{H_t}{\alpha_t}\Big) \ge \mathrm{tr}\Big(\frac{H_{t-1}}{\alpha_{t-1}}\Big).$$
However, this is not an acceptable solution, since it is not invertible. If $\theta \ne 0$, note that we need $\theta$ to be real, and no invertible solution exists because $g_t g_t^T$ has rank at most 1; even when $\theta = D_{t,2}^2/\alpha_t$, there is no matrix that makes the term $H_t^{-1}g_t g_t^T H_t^{-1}$ zero. Instead, we propose that the following matrix attains the infimum of problem (31):
$$H_t^* = \max\left\{\frac{\mathrm{tr}\big(\sqrt{\tfrac{t-1}{t}}H_{t-1}\big)}{\mathrm{tr}\big(\big(\tfrac{\alpha_t^2}{2D_{t,2}^2}g_t g_t^T\big)^{1/2}\big)},\; 1\right\}\Big(\frac{\alpha_t^2}{2D_{t,2}^2}g_t g_t^T\Big)^{1/2}. \quad (37)$$
It can be understood as a special maximum operation that ensures
$$\mathrm{tr}(H_t^*) = \max\Big\{\mathrm{tr}\Big(\sqrt{\tfrac{t-1}{t}}H_{t-1}\Big),\; \frac{\alpha_t}{\sqrt{2}D_{t,2}}\mathrm{tr}\big((g_t g_t^T)^{1/2}\big)\Big\}.$$
A very important remark here is that this solution is not invertible, because $g_t g_t^T$ has rank at most 1. Therefore, equation (37) is not acceptable as a direct solution to problem (31).
However, setting $H_t$ as in equation (37) attains the infimum in Proposition B.1. Let $g_t g_t^T$ be diagonalized as
$$g_t g_t^T = Q\begin{pmatrix}v & 0\\ 0 & 0\end{pmatrix}Q^T, \qquad v = g_t^T g_t = \sum_{i=1}^d g_{t,i}^2,$$
and define the matrices
$$H_t(\delta) = \max\left\{\frac{\mathrm{tr}\big(\sqrt{\tfrac{t-1}{t}}H_{t-1}\big)}{\sqrt{v}+n\delta},\; \frac{\alpha_t\sqrt{v}}{\sqrt{2}D_{t,2}(\sqrt{v}+n\delta)}\right\} Q\begin{pmatrix}\sqrt{v} & 0\\ 0 & \delta I\end{pmatrix}Q^T.$$
Then $\lim_{\delta\to 0}H_t(\delta) = H_t^*$, and, up to terms of order $\delta$,
$$D_{t,2}^2\,\mathrm{tr}\Big(\frac{H_t(\delta)}{\alpha_t}\Big) + \frac{\alpha_t}{2}\mathrm{tr}\big(H_t(\delta)^{-1}g_t g_t^T\big) = D_{t,2}^2\max\left\{\frac{\mathrm{tr}\big(\sqrt{\tfrac{t-1}{t}}H_{t-1}\big)}{\alpha_t},\; \frac{\sqrt{v}}{\sqrt{2}D_{t,2}}\right\} + \min\left\{\frac{\alpha_t\sqrt{v}(\sqrt{v}+n\delta)}{2\,\mathrm{tr}\big(\sqrt{\tfrac{t-1}{t}}H_{t-1}\big)},\; \frac{\sqrt{2}D_{t,2}(\sqrt{v}+n\delta)}{2}\right\}.$$
Hence
$$\lim_{\delta\to 0}\left[D_{t,2}^2\,\mathrm{tr}\Big(\frac{H_t(\delta)}{\alpha_t}\Big) + \frac{\alpha_t}{2}\mathrm{tr}\big(H_t(\delta)^{-1}g_t g_t^T\big)\right] = D_{t,2}^2\,\mathrm{tr}\Big(\frac{H_t^*}{\alpha_t}\Big) + \frac{\alpha_t}{2}\mathrm{tr}\big(H_t^{*-}g_t g_t^T\big) = \begin{cases}\dfrac{D_{t,2}^2}{\alpha_t}\mathrm{tr}\Big(\sqrt{\tfrac{t-1}{t}}H_{t-1}\Big) + \dfrac{\alpha_t v}{2\,\mathrm{tr}\big(\sqrt{\frac{t-1}{t}}H_{t-1}\big)} & \text{if } \mathrm{tr}\big(\sqrt{\tfrac{t-1}{t}}H_{t-1}\big) \text{ is larger},\\[2mm] \sqrt{2}\,D_{t,2}\sqrt{v} & \text{if } \tfrac{\alpha_t}{\sqrt{2}D_{t,2}}\mathrm{tr}\big((g_t g_t^T)^{1/2}\big) \text{ is larger}.\end{cases} \quad (41)$$
Now let $g(\theta) = \inf_{H_t} L(H_t, \theta, Z(\theta))$ be the dual of problem (31), where for $\frac{D_{t,2}^2}{\alpha_t} > \theta$ we define the matrices
$$Z(\theta) = Q\begin{pmatrix}0 & 0\\ 0 & \big(\tfrac{D_{t,2}^2}{\alpha_t}-\theta\big)I\end{pmatrix}Q^T, \qquad H_t(\theta,\delta) = \frac{\alpha_t}{\sqrt{2(D_{t,2}^2-\theta\alpha_t)}}\,Q\begin{pmatrix}\sqrt{v} & 0\\ 0 & \delta I\end{pmatrix}Q^T.$$
From the derivative with respect to $H_t$, we then have
$$\Big(\frac{D_{t,2}^2}{\alpha_t}-\theta\Big)I - \frac{\alpha_t}{2}H_t(\theta,\delta)^{-1}g_t g_t^T H_t(\theta,\delta)^{-1} - Z(\theta) = 0,$$
so $H_t(\theta,\delta)$ achieves the minimum in the dual. Moreover, substituting $H_t(\theta,\delta)$ into the Lagrangian and collecting the terms that vanish with $\delta$ gives
$$g(\theta) = \sqrt{2v\big(D_{t,2}^2-\theta\alpha_t\big)} + \theta\,\mathrm{tr}\Big(\sqrt{\tfrac{t-1}{t}}H_{t-1}\Big) + O(\delta).$$
Taking $\theta = 0$ gives $\lim_{\delta\to 0}g(0) = \sqrt{2}D_{t,2}\sqrt{v}$, while taking $\theta$ to satisfy
$$\sqrt{2v\big(D_{t,2}^2-\theta\alpha_t\big)} = \frac{\alpha_t v}{\mathrm{tr}\big(\sqrt{\tfrac{t-1}{t}}H_{t-1}\big)}$$
(which makes the trace constraint tight, since $\mathrm{tr}\big(\sqrt{\tfrac{t-1}{t}}H_{t-1}\big) = \mathrm{tr}\big(\tfrac{\alpha_t}{\alpha_{t-1}}H_{t-1}\big)$ when $\alpha_t = \alpha/\sqrt{t}$) gives
$$g(\theta) = D_{t,2}^2\,\mathrm{tr}\Big(\frac{H_{t-1}}{\alpha_{t-1}}\Big) + \frac{\alpha_t v}{2\,\mathrm{tr}\big(\sqrt{\frac{t-1}{t}}H_{t-1}\big)}.$$
These two values equal the two cases of (41), respectively; therefore the duality gap of this problem is zero and $H_t^*$ indeed attains the infimum. As in the diagonal case, $H_t^*$ is thus at least as good as any other full-matrix proximal function at step $t$ in terms of the marginal regret bound. A unified algorithm is provided in Algorithm 2, where the $\max^*$ operation is a special maximum that shows how the diagonal AMX algorithm parallels the full-matrix one, in the sense that it selects the larger trace between the two matrices, i.e.,

$$\mathrm{tr}(H_t^*) = \max\Big\{\mathrm{tr}\Big(\sqrt{\tfrac{t-1}{t}}H_{t-1}\Big),\; \mathrm{tr}\big((c_t g_t g_t^T)^{1/2}\big)\Big\}.$$
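As a concrete illustration of the rule (Full) and of the trace identity above, the NumPy sketch below builds $H_t^*$ from the rank-one identity $(g_t g_t^T)^{1/2} = g_t g_t^T/\|g_t\|_2$. The function name and the test values are ours, not from the paper.

```python
import numpy as np

def full_matrix_H(H_prev, g, t, alpha_t, D):
    """One step of the full-matrix rule (Full): scale the rank-one square
    root so that its trace is at least that of the decayed previous matrix."""
    G = (alpha_t ** 2 / (2 * D ** 2)) * np.outer(g, g)
    G_sqrt = G / np.sqrt(np.trace(G))   # PSD square root of a rank-one matrix
    scale = max(np.trace(np.sqrt((t - 1) / t) * H_prev) / np.trace(G_sqrt), 1.0)
    return scale * G_sqrt

rng = np.random.default_rng(0)
H_prev = np.diag(rng.random(4))
g = rng.standard_normal(4)
H_star = full_matrix_H(H_prev, g, t=5, alpha_t=0.1, D=2.0)

# tr(H_t^*) = max{ tr(sqrt((t-1)/t) H_{t-1}), tr((c_t g g^T)^{1/2}) }
lhs = np.trace(H_star)
rhs = max(np.trace(np.sqrt(4 / 5) * H_prev),
          np.sqrt(np.trace((0.1 ** 2 / (2 * 2.0 ** 2)) * np.outer(g, g))))
assert abs(lhs - rhs) < 1e-10
```

Because $g_t g_t^T$ has rank one, $H_t^*$ is singular, which is why the update must use the Moore–Penrose pseudoinverse (or the $\delta$-perturbed matrices above) rather than an ordinary inverse.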

C DISCUSSIONS ON AMX ALGORITHM WITH MOMENTUM

We provide the detailed implementation of the diagonal AMX algorithm with momentum and decoupled weight decay here. The momentum term is implemented in line 5 of Algorithm 3, where $m_t = \beta_{1t}m_{t-1} + (1-\beta_{1t})g_t$ and $\{\beta_{1t}\}_{t=1}^T$ are called the momentum parameters. This implementation is exactly the same as that of modern adaptive algorithms such as Kingma & Ba (2015); Reddi et al. (2018); Huang et al. (2019); and Li et al. (2020).

Algorithm 2 AMX Algorithm (Composite Mirror Descent Form)
1: Input: $x_1 \in \mathcal{F}$, $\{\alpha_t\}_{t=1}^T$, $\{c_t\}_{t=1}^T$, $\phi(x)$
2: Initialize $h_0 = 0$, $H_0 = 0$
3: for $t = 1$ to $T$ do
4: $\quad g_t = \nabla f_t(x_t)$
5: $\quad$ if $H_t$ is a diagonal matrix then
6: $\qquad h_t = \max\big(\sqrt{\tfrac{t-1}{t}}\,h_{t-1},\; (c_t g_t^2)^{1/2}\big)$
7: $\qquad H_t = \mathrm{diag}(h_t) + \epsilon I$
8: $\quad$ else
9: $\qquad H_t = \max^*\big(\sqrt{\tfrac{t-1}{t}}\,H_{t-1},\; (c_t g_t g_t^T)^{1/2}\big) + \epsilon I$
10: $\quad$ end if
11: $\quad x_{t+1} = \mathrm{argmin}_{x\in\mathcal{X}}\{\alpha_t\langle g_t, x\rangle + \alpha_t\phi(x) + \langle x - x_t, H_t(x - x_t)\rangle\}$
12: end for

Algorithm 3 AMX Algorithm with Momentum (Diagonal)
1: Input: $x_1 \in \mathcal{F}$, $\{\alpha_t\}_{t=1}^T$, $\{\beta_{1t}\}_{t=1}^T$, $c_t = 1$, $\epsilon =$ 1e-8, $\lambda = $ 5e-2
2: Initialize $m_0 = 0$, $h_0 = 0$
3: for $t = 1$ to $T$ do
4: $\quad g_t = \nabla f_t(x_t)$
5: $\quad m_t = \beta_{1t}m_{t-1} + (1-\beta_{1t})g_t$
6: $\quad h_t = \big(\max\big(\tfrac{t-1}{t}h_{t-1}^2,\; c_t g_t^2\big)\big)^{1/2} + \epsilon$
7: $\quad H_t = \mathrm{diag}(h_{t,1}, h_{t,2}, \dots, h_{t,d})$
8: $\quad x_{t+1} = \Pi_{\mathcal{F},H_t}\big(x_t - \alpha_t m_t/h_t - \lambda\alpha_t x_t\big)$
9: end for

The following theorem provides a regret bound for Algorithm 3 and shows its convergence.

Theorem C.1 Let $\{x_t\}$ and $\{h_t\}$ be the sequences obtained from Algorithm 3 with $\alpha_t = \alpha/\sqrt{t}$, $c_t = 1$, $\beta_{1,1} = \beta_1$, and $\beta_{1t} \le \beta_1$ for all $t \in [T]$. Assume that $\mathcal{F}$ has bounded diameter, $\|x - y\|_\infty \le D_\infty$ for all $x, y \in \mathcal{F}$, and that $\|\nabla f_t(x)\|_\infty \le G_\infty$ for all $t \in [T]$ and $x \in \mathcal{F}$. Then, for $x_t$ generated using Algorithm 3, we have the following bound on the regret:
$$R(T) \le \frac{D_\infty^2\sqrt{T}}{2\alpha(1-\beta_1)}\sum_{i=1}^d h_{T,i} + \frac{D_\infty^2}{2(1-\beta_1)}\sum_{t=1}^T\sum_{i=1}^d \frac{\beta_{1t}h_{t,i}}{\alpha_t} + \frac{\alpha\sqrt{1+\log T}}{(1-\beta_1)^3}\sum_{i=1}^d \|g_{1:T,i}\|_2. \quad (47)$$
Corollary C.1 follows immediately from the above theorem.
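For concreteness, the following is a minimal Python sketch of Algorithm 3 as we read it. The projection onto $\mathcal{F}$ is omitted, the placement of $\epsilon$ follows common Adam-style implementations, and the function name is ours.

```python
import numpy as np

def amx_momentum(grad_fn, x0, T, alpha=1e-3, beta1=0.9, c=1.0,
                 eps=1e-8, lam=5e-2):
    """Sketch of the diagonal AMX update with momentum and decoupled
    weight decay (Algorithm 3); the projection step is omitted."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    h2 = np.zeros_like(x)                       # running h_t^2
    for t in range(1, T + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g         # momentum, beta_{1t} = beta1
        h2 = np.maximum((t - 1) / t * h2, c * g ** 2)
        h = np.sqrt(h2) + eps
        alpha_t = alpha / np.sqrt(t)
        x = x - alpha_t * m / h - lam * alpha_t * x   # decoupled weight decay
    return x

# Minimal usage on the toy quadratic f(x) = ||x||^2 / 2 (gradient is x).
x_final = amx_momentum(lambda x: x, np.ones(3), T=500, lam=0.0)
assert np.linalg.norm(x_final) < np.linalg.norm(np.ones(3))
```

Note how the only difference from an AMSGrad-style update is the $h_t$ recursion: a decayed running maximum replaces the exponential moving average of squared gradients.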
Corollary C.1 Setting $\beta_{1t} = \beta_1\lambda^{t-1}$ in Theorem C.1, we have
$$R(T) \le \frac{D_\infty^2\sqrt{T}}{2\alpha(1-\beta_1)}\sum_{i=1}^d h_{T,i} + \frac{D_\infty^2 G_\infty}{2(1-\beta_1)(1-\lambda)^2} + \frac{\alpha\sqrt{1+\log T}}{(1-\beta_1)^3}\sum_{i=1}^d \|g_{1:T,i}\|_2.$$
Similarly, the bound above can be considered better than the regret of SGD when $\sum_{i=1}^d \|g_{1:T,i}\|_2 \ll \sqrt{dT}$ (Duchi et al., 2011), which is the same as the claim we make in Theorem 4.2. Another important observation is that Theorem 4.1 also holds here with similar arguments, because the first term is almost the same as the first term without momentum, the second term is a constant, and the third term is almost the same as well. We provide a similar theorem here.

Theorem C.2 Let $\{x_t\}$ and $\{h_t\}$ be the sequences obtained from Algorithm 3 with $\alpha_t = \alpha/\sqrt{t}$, $c_t = 1$, $\beta_{1,1} = \beta_1$, and $\beta_{1t} = \beta_1\lambda^{t-1}$ for all $t \in [T]$. Assume that $\mathcal{F}$ has bounded diameter, $\|x - y\|_\infty \le D_\infty$ for all $x, y \in \mathcal{F}$, and that $\|\nabla f_t(x)\|_\infty \le G_\infty$ for all $t \in [T]$ and $x \in \mathcal{F}$. Let $\{\tau_{m_i}^{(i)}\}_{i=1}^d$ be the largest time step $t$ on each dimension at which $h_{t,i} = |g_{t,i}|$. Then, for $x_t$ generated using Algorithm 3, we have the following bound on the regret:
$$R(T) \le \frac{D_\infty^2}{2\alpha(1-\beta_1)}\sum_{i=1}^d h_{\tau_{m_i}^{(i)},i}\sqrt{\tau_{m_i}^{(i)}} + \frac{D_\infty^2 G_\infty}{2(1-\beta_1)(1-\lambda)^2} + \frac{\alpha}{(1-\beta_1)^3}\sum_{i=1}^d \sqrt{1+\log\tau_{m_i}^{(i)}}\,\|g_{1:\tau_{m_i}^{(i)},i}\|_2 + \frac{\alpha}{(1-\beta_1)^3}\sum_{i=1}^d \sqrt{\tau_{m_i}^{(i)}}\log\Big(\frac{T}{\tau_{m_i}^{(i)}}\Big)|g_{\tau_{m_i}^{(i)},i}|.$$
The above regret bound can be considered significantly better than $O(\sqrt{T})$, because it mostly depends on the time steps $\tau_{m_i}^{(i)}$ instead of the horizon $T$. Given a distribution of gradients, the $\tau_{m_i}^{(i)}$ can be much smaller than $T$, in which case the regret is much smaller than that of common adaptive optimizers; it is at most $O(\sqrt{T})$, the same as AdaGrad and AMSGrad. We also conducted experiments to examine the effect of momentum on the convergence and performance of our algorithm on CIFAR-10. As shown in Figure 4, the choice of $\beta_1$ did not affect the convergence rate or the performance of AMX much.

Proof of Theorem C.1. The update rule can be written as
$$x_{t+1} = \Pi_{\mathcal{F},H_t}(x_t - \alpha_t H_t^{-1}m_t) = \mathrm{argmin}_{x\in\mathcal{F}}\big\|H_t^{1/2}\big(x - (x_t - \alpha_t H_t^{-1}m_t)\big)\big\|.$$
Using Lemma 4 in Reddi et al.
(2018) with a direct substitution of $z_1 = x_t - \alpha_t H_t^{-1}m_t$, $Q = H_t$, and $z_2 = x^*$ for $x^* \in \mathcal{F}$, the following inequality holds:
$$\begin{aligned}
\|H_t^{1/2}(x_{t+1} - x^*)\|^2 &\le \|H_t^{1/2}(x_t - \alpha_t H_t^{-1}m_t - x^*)\|^2\\
&= \|H_t^{1/2}(x_t - x^*)\|^2 + \alpha_t^2\|H_t^{-1/2}m_t\|^2 - 2\alpha_t\langle m_t, x_t - x^*\rangle\\
&= \|H_t^{1/2}(x_t - x^*)\|^2 + \alpha_t^2\|H_t^{-1/2}m_t\|^2 - 2\alpha_t\langle \beta_{1t}m_{t-1} + (1-\beta_{1t})g_t,\; x_t - x^*\rangle, \quad (51)
\end{aligned}$$
where the first inequality is due to $\Pi_{\mathcal{F},H_t}(x^*) = x^*$. Rearranging the last inequality, we obtain
$$\begin{aligned}
(1-\beta_{1t})\langle g_t, x_t - x^*\rangle &\le \frac{1}{2\alpha_t}\Big(\|H_t^{1/2}(x_t - x^*)\|^2 - \|H_t^{1/2}(x_{t+1} - x^*)\|^2\Big) + \frac{\alpha_t}{2}\|H_t^{-1/2}m_t\|^2 - \beta_{1t}\langle m_{t-1}, x_t - x^*\rangle\\
&\le \frac{1}{2\alpha_t}\Big(\|H_t^{1/2}(x_t - x^*)\|^2 - \|H_t^{1/2}(x_{t+1} - x^*)\|^2\Big) + \frac{\alpha_t}{2}\|H_t^{-1/2}m_t\|^2 + \frac{\beta_{1t}\alpha_t}{2}\|H_t^{-1/2}m_{t-1}\|^2 + \frac{\beta_{1t}}{2\alpha_t}\|H_t^{1/2}(x_t - x^*)\|^2. \quad (52)
\end{aligned}$$
The second inequality follows from applying the Cauchy–Schwarz and Young's inequalities. By the convexity of $f_t$ together with Lemma C.1 and Lemma C.2, we have
$$\begin{aligned}
\sum_{t=1}^T f_t(x_t) - f_t(x^*) &\le \sum_{t=1}^T \langle g_t, x_t - x^*\rangle\\
&\le \sum_{t=1}^T \frac{1}{2\alpha_t(1-\beta_{1t})}\Big(\|H_t^{1/2}(x_t - x^*)\|^2 - \|H_t^{1/2}(x_{t+1} - x^*)\|^2\Big) + \frac{\alpha_t}{2(1-\beta_{1t})}\|H_t^{-1/2}m_t\|^2\\
&\quad + \frac{\beta_{1t}\alpha_t}{2(1-\beta_{1t})}\|H_t^{-1/2}m_{t-1}\|^2 + \frac{\beta_{1t}}{2\alpha_t(1-\beta_{1t})}\|H_t^{1/2}(x_t - x^*)\|^2\\
&\le \frac{D_\infty^2}{2\alpha_T(1-\beta_1)}\sum_{i=1}^d h_{T,i} + \sum_{t=1}^T \frac{\beta_{1t}}{2\alpha_t(1-\beta_1)}\|H_t^{1/2}(x_t - x^*)\|^2 + \frac{\alpha\sqrt{1+\log T}}{(1-\beta_1)^3}\sum_{i=1}^d \|g_{1:T,i}\|_2\\
&= \frac{D_\infty^2}{2\alpha_T(1-\beta_1)}\sum_{i=1}^d h_{T,i} + \sum_{t=1}^T\sum_{i=1}^d \frac{\beta_{1t}h_{t,i}(x_{t,i} - x_i^*)^2}{2\alpha_t(1-\beta_1)} + \frac{\alpha\sqrt{1+\log T}}{(1-\beta_1)^3}\sum_{i=1}^d \|g_{1:T,i}\|_2\\
&\le \frac{D_\infty^2\sqrt{T}}{2\alpha(1-\beta_1)}\sum_{i=1}^d h_{T,i} + \frac{D_\infty^2}{2(1-\beta_1)}\sum_{t=1}^T\sum_{i=1}^d \frac{\beta_{1t}h_{t,i}}{\alpha_t} + \frac{\alpha\sqrt{1+\log T}}{(1-\beta_1)^3}\sum_{i=1}^d \|g_{1:T,i}\|_2. \quad (53)
\end{aligned}$$
To obtain the regret bound in Theorem 4.2, we can take $\beta_1 = 0$ and replace $\alpha$ by $\alpha/2$ (by the equivalence discussed in Appendix A.1) in the above bound, and the regret bound follows:
$$R(T) \le \frac{D_\infty^2\sqrt{T}}{\alpha}\sum_{i=1}^d h_{T,i} + \frac{\alpha}{2}\sqrt{1+\log T}\sum_{i=1}^d \|g_{1:T,i}\|_2.$$

Lemma C.1 For the parameter settings and conditions assumed in Theorem C.1, we have
$$\sum_{t=1}^T \alpha_t\|H_t^{-1/2}m_t\|^2 \le \frac{\alpha\sqrt{1+\log T}}{(1-\beta_1)^2}\sum_{i=1}^d \|g_{1:T,i}\|_2. \quad (55)$$
Proof. From the update rules, note that
$$m_{T,i} = \sum_{j=1}^T (1-\beta_{1j})\prod_{k=1}^{T-j}\beta_{1(T-k+1)}\,g_{j,i}, \qquad h_{T,i}^2 = \max\Big(\frac{T-1}{T}h_{T-1,i}^2,\; g_{T,i}^2\Big).$$
Then
$$\begin{aligned}
\sum_{t=1}^T \alpha_t\|H_t^{-1/2}m_t\|^2 &= \sum_{t=1}^{T-1}\alpha_t\|H_t^{-1/2}m_t\|^2 + \alpha_T\sum_{i=1}^d \frac{m_{T,i}^2}{h_{T,i}}\\
&= \sum_{t=1}^{T-1}\alpha_t\|H_t^{-1/2}m_t\|^2 + \alpha\sum_{i=1}^d \frac{\big(\sum_{j=1}^T(1-\beta_{1j})\prod_{k=1}^{T-j}\beta_{1(T-k+1)}g_{j,i}\big)^2}{\sqrt{T\max\big(\frac{T-1}{T}h_{T-1,i}^2,\; g_{T,i}^2\big)}}\\
&\le \sum_{t=1}^{T-1}\alpha_t\|H_t^{-1/2}m_t\|^2 + \alpha\sum_{i=1}^d \frac{\big(\sum_{j=1}^T\prod_{k=1}^{T-j}\beta_{1(T-k+1)}\big)\big(\sum_{j=1}^T(1-\beta_{1j})^2\prod_{k=1}^{T-j}\beta_{1(T-k+1)}g_{j,i}^2\big)}{\sqrt{T\max\big(\frac{T-1}{T}h_{T-1,i}^2,\; g_{T,i}^2\big)}}\\
&\le \sum_{t=1}^{T-1}\alpha_t\|H_t^{-1/2}m_t\|^2 + \alpha\sum_{i=1}^d \frac{\big(\sum_{j=1}^T\beta_1^{T-j}\big)\big(\sum_{j=1}^T\beta_1^{T-j}g_{j,i}^2\big)}{\sqrt{T\max\big(\frac{T-1}{T}h_{T-1,i}^2,\; g_{T,i}^2\big)}}\\
&\le \sum_{t=1}^{T-1}\alpha_t\|H_t^{-1/2}m_t\|^2 + \frac{\alpha}{1-\beta_1}\sum_{i=1}^d \sum_{j=1}^T \frac{\beta_1^{T-j}g_{j,i}^2}{\sqrt{T\max\big(\frac{T-1}{T}h_{T-1,i}^2,\; g_{T,i}^2\big)}}\\
&\le \sum_{t=1}^{T-1}\alpha_t\|H_t^{-1/2}m_t\|^2 + \frac{\alpha}{1-\beta_1}\sum_{i=1}^d \sum_{j=1}^T \frac{\beta_1^{T-j}g_{j,i}^2}{\sqrt{j\,g_{j,i}^2}} = \sum_{t=1}^{T-1}\alpha_t\|H_t^{-1/2}m_t\|^2 + \frac{\alpha}{1-\beta_1}\sum_{i=1}^d \sum_{j=1}^T \frac{\beta_1^{T-j}|g_{j,i}|}{\sqrt{j}}, \quad (57)
\end{aligned}$$
where the first inequality is from the Cauchy–Schwarz inequality, the second is due to the fact that $\beta_{1t} \le \beta_1$ for all $t$, the third follows from $\sum_{j=1}^T\beta_1^{T-j} \le 1/(1-\beta_1)$ and $1-\beta_{1j} \le 1$, and the fourth comes from the facts that $h_{t,i}^2 \ge \frac{t-1}{t}h_{t-1,i}^2$ and $h_{t,i}^2 \ge g_{t,i}^2$. Therefore, applying the same bound recursively,
$$\begin{aligned}
\sum_{t=1}^T \alpha_t\|H_t^{-1/2}m_t\|^2 &\le \frac{\alpha}{1-\beta_1}\sum_{i=1}^d\sum_{t=1}^T\sum_{j=1}^t \frac{\beta_1^{t-j}|g_{j,i}|}{\sqrt{j}} = \frac{\alpha}{1-\beta_1}\sum_{i=1}^d\sum_{t=1}^T \frac{|g_{t,i}|}{\sqrt{t}}\sum_{j=t}^T\beta_1^{j-t}\\
&\le \frac{\alpha}{(1-\beta_1)^2}\sum_{i=1}^d\sum_{t=1}^T \frac{|g_{t,i}|}{\sqrt{t}} \le \frac{\alpha}{(1-\beta_1)^2}\sum_{i=1}^d \|g_{1:T,i}\|_2\sqrt{\sum_{t=1}^T\frac{1}{t}} \le \frac{\alpha\sqrt{1+\log T}}{(1-\beta_1)^2}\sum_{i=1}^d \|g_{1:T,i}\|_2, \quad (58)
\end{aligned}$$
where the second equality is from rearranging the order of summation, the next inequality follows from $\sum_{j=t}^T\beta_1^{j-t} \le 1/(1-\beta_1)$, and the second-to-last inequality is due to the Cauchy–Schwarz inequality.
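Lemma C.1 can be sanity-checked numerically. The one-dimensional simulation below runs the $m_t$ and $h_t$ recursions on random gradients and verifies the final bound; all variable names and parameter values are ours.

```python
import math
import random

random.seed(1)
T, beta1, alpha = 500, 0.9, 0.1
g = [random.gauss(0, 1) for _ in range(T)]

m, h2, lhs = 0.0, 0.0, 0.0
for t in range(1, T + 1):
    m = beta1 * m + (1 - beta1) * g[t - 1]       # momentum with beta_{1t} = beta1
    h2 = max((t - 1) / t * h2, g[t - 1] ** 2)    # h_t^2 recursion with c_t = 1
    lhs += (alpha / math.sqrt(t)) * m ** 2 / math.sqrt(h2)   # alpha_t * m_t^2 / h_t

rhs = (alpha * math.sqrt(1 + math.log(T)) / (1 - beta1) ** 2
       * math.sqrt(sum(x * x for x in g)))       # RHS of the lemma (d = 1)
assert lhs <= rhs
```

In practice the bound holds with a large margin here, since the $(1-\beta_1)^{-2}$ factor is loose for constant $\beta_{1t}$.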
Lemma C.2 For the parameter settings and conditions assumed in Theorem C.1, we have
$$\sum_{t=1}^T \frac{1}{\alpha_t}\Big(\|H_t^{1/2}(x_t - x^*)\|^2 - \|H_t^{1/2}(x_{t+1} - x^*)\|^2\Big) \le \frac{D_\infty^2}{\alpha_T}\sum_{i=1}^d h_{T,i}.$$
Proof. Since $\frac{h_{t,i}}{\alpha_t} \ge \frac{h_{t-1,i}}{\alpha_{t-1}}$ by the conditions of problem (4),
$$\begin{aligned}
\sum_{t=1}^T \frac{1}{\alpha_t}\Big(\|H_t^{1/2}(x_t - x^*)\|^2 - \|H_t^{1/2}(x_{t+1} - x^*)\|^2\Big) &\le \frac{1}{\alpha_1}\|H_1^{1/2}(x_1 - x^*)\|^2 + \sum_{t=2}^T \frac{\|H_t^{1/2}(x_t - x^*)\|^2}{\alpha_t} - \frac{\|H_{t-1}^{1/2}(x_t - x^*)\|^2}{\alpha_{t-1}}\\
&= \frac{1}{\alpha_1}\sum_{i=1}^d h_{1,i}(x_{1,i} - x_i^*)^2 + \sum_{t=2}^T\sum_{i=1}^d \Big(\frac{h_{t,i}}{\alpha_t} - \frac{h_{t-1,i}}{\alpha_{t-1}}\Big)(x_{t,i} - x_i^*)^2\\
&\le \frac{D_\infty^2}{\alpha_T}\sum_{i=1}^d h_{T,i},
\end{aligned}$$
where the first inequality comes from deleting the last negative term in the summation, and the last inequality follows from the telescoping sum together with the bounded diameter, $\|x - x^*\|_\infty \le D_\infty$.

D EXPERIMENT DETAILS AND MORE EXPERIMENTS

For all the algorithms, we used the default momentum hyper-parameters: $\gamma = 0.9$ for SGDM, $(\beta_1, \beta_2) = (0.9, 0.999)$ for Adam and AMSGrad, and $(\beta_1, c_t) = (0.9, 1)$ for AMX in Algorithm 3. The small constant $\epsilon$ is set to 1e-8 to avoid division by zero.

D.1 SYNTHETIC EXAMPLE

For the synthetic example in Section 4, we used $|g_{t,i}| = \frac{1}{\sqrt{t}}|g_{1,i}|$ for all $i$, with $|g_{1,i}| = 0.01$ across all dimensions. The dimension is set to $d = 3$, and the step sizes are set to $\alpha = 0.5$, $\alpha_t = \alpha/\sqrt{t}$ for all the algorithms. These step sizes are much larger than what we usually use in real applications, because they make the increment in the regret bound of AMX much more visible: the increment in the first term of AMX is already zero, and the increment in the second term is very small. If we used a smaller step size $\alpha$, we could barely see any increment in the regret bound of AMX in the figures. The hyper-parameter of AMSGrad is set to $\beta_2 = 0.999$.
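The telescoping behavior of the first term can be reproduced directly: for this gradient sequence $h_t = |g_1|/\sqrt{t}$, so $h_{t,i}/\alpha_t$ is constant and the first term of the bound in Theorem 2.1 stops growing after $t = 1$. The $D_\infty^2$ value below is an illustrative placeholder, not a value from the paper.

```python
import math

# Synthetic example: |g_{t,i}| = |g_{1,i}| / sqrt(t) on every dimension.
d, g1, alpha, D2, T = 3, 0.01, 0.5, 4.0, 1000   # D2 stands in for D_infty^2
h = [0.0] * d
prev_ratio = [0.0] * d
first_term = 0.0
for t in range(1, T + 1):
    alpha_t = alpha / math.sqrt(t)
    for i in range(d):
        g = g1 / math.sqrt(t)
        h[i] = max(math.sqrt((t - 1) / t) * h[i], g)   # AMX recursion, c_t = 1
        ratio = h[i] / alpha_t
        first_term += (ratio - prev_ratio[i]) * D2     # increment of the first term
        prev_ratio[i] = ratio
# h_t / alpha_t is constant, so only the t = 1 increment survives.
assert abs(first_term - d * (g1 / alpha) * D2) < 1e-9
```

This is why the first term of AMX appears flat in Figure 1 while those of AdaGrad and AMSGrad keep growing.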

D.2 IMAGE CLASSIFICATION

For CIFAR-10 and CIFAR-100, the 32 × 32 images were zero-padded with 4 pixels on each margin and then randomly cropped and horizontally flipped to generate the input. The input images were also normalized using the dataset mean and standard deviation. For the step sizes on CIFAR-10, we searched over {1e-4, 5e-4, 1e-3, 2e-3, 3e-3, 5e-3, 1e-2, 2e-2} for all the adaptive methods and found that 5e-3 worked the best for AMX. The best step sizes for AdaGrad, Adam, and AMSGrad were 1e-2, 1e-3, and 1e-3 respectively. For SGDM, the best step size was 1e-1. We also grid-searched the weight decay over {1e-1, 5e-2, 1e-2, 5e-3, 1e-3, 5e-4, 1e-4}. On CIFAR-10, a weight decay of 1e-1 worked the best for Adam and AMSGrad, 5e-2 worked the best for AMX, and 5e-4 worked the best for AdaGrad and SGD. On CIFAR-100, all adaptive algorithms worked the best when the weight decay was 1e-1, but SGD still needed the 5e-4 weight decay. The increments in the first and second terms of the regret bounds are plotted in Figures 5(a) and 5(b). On CIFAR-10, the step sizes of all algorithms were decreased by a factor of 0.1 at the 100th and 150th epochs. On CIFAR-100, the step sizes were decreased by a factor of 0.2 at the 60th, 120th, and 160th epochs.

D.3 IMAGE SEGMENTATION

The Deeplab-ASPP model implemented by Kazuto1011 (2016) was used in this task. We followed their settings and reported the mean IoU values averaged over three independent runs. The model was pretrained on the MS-COCO dataset (Lin et al., 2014). We did not use the CRF post-processing technique. We tried the initial step sizes {5e-3, 1e-3, 5e-4, 1e-4, 5e-5, 3e-5, 1e-5, 5e-6} for all the optimizers and found that 1e-6 worked the best for Adam and AMSGrad, 5e-5 was the best for AMX, and 1e-3 was the best for SGD. Similar to the experiments on CIFAR-10, a 0.1 weight decay was applied to Adam and AMSGrad, a 5e-2 weight decay was applied to AMX, and a 5e-4 weight decay was applied to SGDM.

D.4 LANGUAGE MODELING

We trained three-layer LSTMs using the instructions provided by Kazuto1011 (2016). Specifically, the LSTMs used an embedding size of 200 and 1,000 hidden units. For the dropout probabilities and the batch size, we followed the default values. A 1.2e-6 weight decay was applied to all the algorithms, and we tuned the initial step sizes over {1e-4, 5e-4, 1e-3, 2e-3, 3e-3, 4e-3, 5e-3, 1e-2, 5e-2}. We found that the algorithms were not very sensitive to the initial step sizes, and we reported the results of 2e-3 for Adam and AMSGrad, 3e-3 for AMX, and 1e-2 for AdaGrad, which were the best results among all the candidate step sizes. We decayed the step size by a factor of 0.1 at the 300th and 400th epochs.

D.5 NEURAL MACHINE TRANSLATION

We used the basic implementation of the attentional neural machine translation model by pcyin (2018) and followed their settings. The hidden size of the LSTM was 256 and the embedding size was 256. Label smoothing was set to 0.1 and the dropout probability was 0.2. We tuned over the step sizes {5e-2, 1e-2, 5e-3, 1e-3, 5e-4, 1e-4} for all the adaptive optimizers and found that, similar to CIFAR-10, 1e-3 worked the best for Adam and AMSGrad, while 5e-3 and 1e-2 were the best step sizes for AMX and AdaGrad respectively. We averaged the BLEU scores on the IWSLT'14 German-to-English dataset (Ranzato et al., 2015) over three independent runs and reported them in Table 2.

D.6 EMPIRICAL STUDY OF THE HYPER-PARAMETER

The hyper-parameter $c_t$ plays an important role in our class of AMX algorithms. We conducted an empirical study on some of the designs of $c_t$ and compared the performance of the corresponding algorithms. We compared the designs $c_t = 1, 0.5, 0.1, 1/\sqrt{t}, 1/t$ on CIFAR-10 and reported their top-1 accuracy in Figure 6 and Table 3. All the algorithms were trained using the same settings as in Appendix D.2 with the same initial step size of 5e-4. As observed, changing the constant $c_t$ to 0.1 or 0.5 did not affect the convergence speed or the final accuracy much, which means the performance is not sensitive to the value of a constant $c_t$. However, $c_t = 1/\sqrt{t}$ and $c_t = 1/t$ resulted in slower convergence and worse performance, meaning that these designs are not preferred. These observations correspond to our claims in Appendix A.5: although these designs make the first term in Theorem 2.1 even smaller, they also make the second term much larger, so the algorithm becomes slower.



The regret bound of AdaGrad given in the original paper was $O(\sum_i \|g_{1:T,i}\|_2)$ because Duchi et al. (2011) used a constant learning rate $\alpha_t = \alpha$ and hence $h_t = \sqrt{\sum_{i=1}^t g_i^2}$. When changed into the same setting where $\alpha_t = \alpha/\sqrt{t}$, $h_t$ becomes $\sqrt{\sum_{i=1}^t g_i^2/t}$ and the regret also has the form of (3), as shown in Reddi et al. (2018).

Because $(h_{t,i}/\alpha_t) - (h_{t-1,i}/\alpha_{t-1}) = 0$ when $h_{t,i} = (\alpha_t/\alpha_{t-1})h_{t-1,i}$.

This claim can be easily proved by taking the derivative with respect to $\tau$ and finding the maximum of $\sqrt{\tau}\log(T/\tau)$.



, all existing regret bounds are still O( √ T ), which makes it hard to compare different proximal functions. Whether the best proximal function exists and whether the O( √ T ) regret bound can be further improved still remain open questions.

Corollary 4.1 Let τ = max i {τ (i) mi } in Theorem 4.1, under the same assumptions as AdaGrad and AMSGrad, Algorithm 1 converges with regret bound

Figure 1: The regret bounds of AMX, AdaGrad, AMSGrad in the example.


Figure 2: Training and Testing Top-1 accuracy on CIFAR-10 and CIFAR-100.

Figure 3: (a), (b). Training Loss and Testing IoU curves on the VOC2012 Segmentation dataset. (c). Validation perplexity curve on the Penn Tree Bank (PTB) dataset. (d). Validation Perplexity curve on the IWSLT'14 DE-EN machine translation dataset.

Figure 4: Training and testing Top-1 accuracy curve on CIFAR-10 with different momentum parameters β 1



Figure 5: The first and second terms of the regret bound in equation (2), with the diameter $D_{t,\infty}$ replaced by $D_\infty = 2$. As can be observed, the first term of AMX stops increasing after $\tau$, which is the first time step. The other algorithms do not have such a nice property. Therefore, even though the second term of AMX is slightly larger than that of AdaGrad, the overall regret of AMX is much smaller. That means AMX converges much faster than AdaGrad and AMSGrad in this example.

Figure 6: Training and testing Top-1 accuracy curve on CIFAR-10 with different designs of c t

Testing Top-1 accuracy on the CIFAR-10 and CIFAR-100 datasets, and testing IoU on the VOC2012 Segmentation dataset. The results were averaged over 5 independent runs. Our results are shown in bold.


Testing Top-1 Accuracy on CIFAR-10 with different designs of c t . The results were averaged over 3 independent runs.

