ON THE CONVERGENCE OF ADAGRAD(NORM) ON R^d: BEYOND CONVEXITY, NON-ASYMPTOTIC RATE AND ACCELERATION

Abstract

Existing analyses of AdaGrad and other adaptive methods for smooth convex optimization are typically restricted to functions with bounded domain diameter. For unconstrained problems, previous works guarantee an asymptotic convergence rate without an explicit constant factor that holds for the entire function class. Furthermore, in the stochastic setting, only a modified version of AdaGrad has been analyzed, different from the one commonly used in practice, in which the latest gradient is not used to update the stepsize. Our paper aims to bridge these gaps and develop a deeper understanding of AdaGrad and its variants in the standard setting of smooth convex functions as well as the more general setting of quasar convex functions. First, we demonstrate new techniques to explicitly bound the convergence rate of the vanilla AdaGrad for unconstrained problems in both deterministic and stochastic settings. Second, we propose a variant of AdaGrad for which we can show the convergence of the last iterate, instead of the average iterate. Finally, we give new accelerated adaptive algorithms and their convergence guarantees in the deterministic setting with explicit dependency on the problem parameters, improving upon the asymptotic rates shown in previous works.

1. INTRODUCTION

In recent years, the prevalence of machine learning models has motivated the development of new optimization tools, among which adaptive methods such as Adam (Kingma & Ba, 2014), AmsGrad (Reddi et al., 2018), and AdaGrad (Duchi et al., 2011) emerge as the most important class of algorithms. Unlike traditional methods such as SGD, these methods do not require knowledge of the problem parameters to set the stepsize, while still showing robust performance in many ML tasks. However, it remains a challenge to analyze and understand the properties of these methods. Take AdaGrad and its variants for example. In its vanilla scalar form, also known as AdaGradNorm, the stepsize is set using the cumulative sum of the gradient norms of all iterates so far. The work of Ward et al. (2020) has shown the convergence of this algorithm for non-convex functions by bounding the decay of the gradient norms. However, in convex optimization, we usually require a stronger convergence criterion: bounding the function value gap. This is where we lack theoretical understanding. Even in the deterministic setting, most existing works (Levy, 2017; Levy et al., 2018; Ene et al., 2021) rely on the assumption that the domain of the function is bounded. The dependence on the domain diameter can become an issue if it is unknown or cannot be readily estimated. Other works for unconstrained problems (Antonakopoulos et al., 2020; 2022) offer a convergence rate that depends on the limit of the stepsize sequence. This limit is shown to exist for each function, but without an explicit value, and more importantly, it is not shown to be a constant for the entire function class. This means that these methods essentially do not tell us how fast the algorithm converges in the worst case.
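To make the stepsize rule concrete, the scalar (AdaGradNorm) update described above can be sketched as follows. This is a minimal illustration, not the paper's analyzed algorithm: the function names, the base stepsize eta, and the quadratic test objective are all assumptions made for the example. Note that the accumulator includes the latest gradient before the step is taken, matching the vanilla form used in practice.

```python
import numpy as np

def adagrad_norm(grad, x0, eta=1.0, eps=1e-8, iters=100):
    """Sketch of scalar AdaGrad (AdaGradNorm): one stepsize for all
    coordinates, driven by the cumulative sum of squared gradient norms."""
    x = np.asarray(x0, dtype=float)
    b2 = eps  # accumulator: running sum of ||g_t||^2 (eps avoids division by zero)
    for _ in range(iters):
        g = grad(x)
        b2 += np.dot(g, g)               # latest gradient IS included in the accumulator
        x = x - (eta / np.sqrt(b2)) * g  # stepsize eta / sqrt(sum_t ||g_t||^2)
    return x

# Illustrative run on the smooth convex quadratic f(x) = 0.5 ||x||^2 (gradient = x)
x_final = adagrad_norm(lambda x: x, x0=[1.0, -2.0], iters=500)
```

The key contrast with the "delayed" variants mentioned in the abstract is the order of the two update lines: moving the accumulator update after the step would give a stepsize that ignores the current gradient, which is the modification most prior stochastic analyses required.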
Another work by Ene & Nguyen (2022) gives an explicit rate of convergence for the entire function class, but it requires the strong assumption that the gradients are bounded, even in the smooth setting, and its convergence guarantee has additional error terms depending on this bound.

