ON THE CONVERGENCE OF ADAGRAD(NORM) ON $\mathbb{R}^d$: BEYOND CONVEXITY, NON-ASYMPTOTIC RATE AND ACCELERATION

Abstract

Existing analyses of AdaGrad and other adaptive methods for smooth convex optimization typically assume functions with bounded domain diameter. For unconstrained problems, previous works guarantee an asymptotic convergence rate without an explicit constant factor that holds for the entire function class. Furthermore, in the stochastic setting, only a modified version of AdaGrad, different from the one commonly used in practice, in which the latest gradient is not used to update the stepsize, has been analyzed. Our paper aims to bridge these gaps and to develop a deeper understanding of AdaGrad and its variants in the standard setting of smooth convex functions as well as the more general setting of quasar convex functions. First, we demonstrate new techniques to explicitly bound the convergence rate of the vanilla AdaGrad for unconstrained problems in both deterministic and stochastic settings. Second, we propose a variant of AdaGrad for which we can show the convergence of the last iterate, instead of the average iterate. Finally, we give new accelerated adaptive algorithms and their convergence guarantees in the deterministic setting with explicit dependence on the problem parameters, improving upon the asymptotic rate shown in previous works.

1. INTRODUCTION

In recent years, the prevalence of machine learning models has motivated the development of new optimization tools, among which adaptive methods such as Adam (Kingma & Ba, 2014), AmsGrad (Reddi et al., 2018), and AdaGrad (Duchi et al., 2011) emerge as the most important class of algorithms. Unlike traditional methods such as SGD, these methods do not require knowledge of the problem parameters to set the stepsize, while still showing robust performance in many ML tasks. However, it remains a challenge to analyze and understand the properties of these methods. Take AdaGrad and its variants for example. In its vanilla scalar form, also known as AdaGradNorm, the step size is set using the cumulative sum of the squared gradient norms of all iterates so far. The work of Ward et al. (2020) has shown the convergence of this algorithm for non-convex functions by bounding the decay of the gradient norms. However, in convex optimization, a stronger convergence criterion is usually required: bounding the function value gap. This is where we lack theoretical understanding. Even in the deterministic setting, most existing works (Levy, 2017; Levy et al., 2018; Ene et al., 2021) rely on the assumption that the domain of the function is bounded. The dependence on the domain diameter can become an issue if it is unknown or cannot be readily estimated. Other works for unconstrained problems (Antonakopoulos et al., 2020; 2022) offer a convergence rate that depends on the limit of the step size sequence. This limit is shown to exist for each function, but without an explicit value, and more importantly, it is not shown to be a constant for the entire function class. This means that these methods essentially do not tell us how fast the algorithm converges in the worst case.
Another work by Ene & Nguyen (2022) gives an explicit rate of convergence for the entire class but requires the strong assumption that the gradients are bounded even in the smooth setting, and the convergence guarantee has additional error terms depending on this bound. In the stochastic setting, one common approach is to analyze a modified version of AdaGrad with off-by-one step size, i.e., the gradient at the current time step is not taken into account when setting the new step size. This is where the gap between theory and practice exists.
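To make the distinction concrete, the following is a minimal sketch of the two update rules discussed above, written in NumPy. The function names, the parameters `eta` (base stepsize), `b0` (initial accumulator value), and `T` (iteration budget) are illustrative choices, not the paper's notation: in the vanilla AdaGradNorm the current gradient enters the accumulator before the step, while the off-by-one variant steps with only past gradients.

```python
import numpy as np

def adagradnorm(grad, x0, eta=1.0, b0=1.0, T=1000):
    """Vanilla (scalar) AdaGradNorm: the stepsize eta / sqrt(b_t^2)
    uses the running sum of squared gradient norms *including* the
    current gradient g_t."""
    x = np.asarray(x0, dtype=float)
    b2 = b0 ** 2  # running sum of squared gradient norms
    for _ in range(T):
        g = grad(x)
        b2 += float(np.dot(g, g))        # current gradient is included
        x = x - (eta / np.sqrt(b2)) * g
    return x

def adagradnorm_off_by_one(grad, x0, eta=1.0, b0=1.0, T=1000):
    """Off-by-one variant often analyzed in the stochastic setting:
    the step uses only past gradients; g_t is accumulated afterwards."""
    x = np.asarray(x0, dtype=float)
    b2 = b0 ** 2
    for _ in range(T):
        g = grad(x)
        x = x - (eta / np.sqrt(b2)) * g  # stepsize independent of g_t
        b2 += float(np.dot(g, g))
    return x
```

For instance, minimizing $F(x) = \tfrac12\|x\|^2$ (so `grad = lambda x: x`) from any starting point, both variants drive the iterates to the minimizer; the only difference is whether `b2` is updated before or after the step.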

1.1. OUR CONTRIBUTION

In this paper, we make the following contributions. First, we demonstrate a method to show an explicit non-asymptotic convergence rate of AdaGradNorm and AdaGrad on $\mathbb{R}^d$ in the deterministic setting. Our method extends to a more general function class known as γ-quasar convex functions with a weaker condition for smoothness. To the best of our knowledge, we are the first to prove this result. Second, we present new techniques to analyze stochastic AdaGradNorm and offer an explicit convergence guarantee for γ-quasar convex optimization on $\mathbb{R}^d$ under a mild assumption on the noise of the gradient estimates. We also propose two new variants of AdaGradNorm for which we demonstrate the convergence of the last iterate, instead of the average iterate as shown for AdaGradNorm. Finally, we propose a new accelerated algorithm with two variants and show their non-asymptotic convergence rates in the deterministic setting.

1.2. RELATED WORK

Adaptive methods There has been a long line of work on adaptive methods, including AdaGrad (Duchi et al., 2011), RMSProp (Tieleman et al., 2012) and Adam (Kingma & Ba, 2014). AdaGrad was first designed for stochastic online optimization; subsequent works (Levy, 2017; Kavis et al., 2019; Bach & Levy, 2019; Antonakopoulos et al., 2020; Ene et al., 2021) analyzed AdaGrad and various adaptive algorithms for convex optimization and generalized them to variational inequality problems. These works commonly assume that the optimization problem is constrained to a set with bounded diameter. Li & Orabona (2019) are the first to analyze a variant of AdaGrad for unbounded domains in which the latest gradient is not used to construct the step size, which differs from the standard version of AdaGrad commonly used in practice. However, the algorithm and analysis of Li & Orabona (2019) set the initial step size based on the smoothness parameter and thus do not adapt to it. Other works provide convergence guarantees for adaptive methods on unbounded domains, yet without explicit dependence on the problem parameters (Antonakopoulos et al., 2020; 2022), or only for a class of strongly convex functions (Xie et al., 2020). Another work by Ene & Nguyen (2022) requires the strong assumption that the gradients are bounded even for smooth functions, and the convergence guarantee has additional error terms depending on the gradient upper bound. Our work analyzes the standard version of AdaGrad for unconstrained and general convex problems and shows explicit convergence rates in both the deterministic and stochastic settings.
Accelerated adaptive methods have been designed to achieve $O(1/T^2)$ and $O(1/\sqrt{T})$ rates in the deterministic and stochastic settings, respectively, in the works of Levy et al. (2018); Ene & Nguyen (2022); Antonakopoulos et al. (2022). We show different variants and demonstrate the same, but explicit, accelerated convergence rate in the deterministic setting for unconstrained problems.
Analysis beyond convexity The convergence of some variants of AdaGrad has been established for nonconvex functions in the works of Li & Orabona (2019); Ward et al. (2020); Faw et al. (2022) under various assumptions. Other works (Li & Orabona, 2020; Kavis et al., 2022) demonstrate convergence with high probability. We refer the reader to Faw et al. (2022) for a more detailed survey of AdaGrad-style methods for nonconvex optimization. In general, the criterion used to study these convergence rates is the gradient norm of the function, which is weaker than the function value gap normally used in the study of convex functions.
In comparison, we study the convergence of AdaGrad via the function value gap for a broader notion of convexity, known as quasar-convexity, as well as a more generalized definition of smoothness.

2. PRELIMINARIES

We consider the following optimization problem: minimize $F(x)$ over $x \in \mathbb{R}^d$, where $F$ is differentiable and satisfies $F^* = \inf_{x \in \mathbb{R}^d} F(x) > -\infty$ and $\arg\min_{x \in \mathbb{R}^d} F(x) \neq \emptyset$, with $x^* \in \arg\min_{x \in \mathbb{R}^d} F(x)$. We will use the following notation throughout the paper: $a_+ = \max\{a, 0\}$, $a \vee b = \max\{a, b\}$, $[n] = \{1, 2, \dots, n\}$, and $\|\cdot\|$ denotes the $\ell_2$-norm $\|\cdot\|_2$ for simplicity.
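Since γ-quasar convexity is used throughout, we recall one standard form of the definition as a point of reference; the precise condition assumed in the analysis may differ in detail. A differentiable function $F$ is $\gamma$-quasar convex with $\gamma \in (0, 1]$ with respect to a minimizer $x^*$ if

```latex
% Standard form of gamma-quasar convexity (stated here for reference):
F(x^*) \;\geq\; F(x) + \frac{1}{\gamma}\,\langle \nabla F(x),\, x^* - x \rangle
\qquad \text{for all } x \in \mathbb{R}^d.
```

With $\gamma = 1$ this recovers star-convexity, and every convex function satisfies it, so the class strictly generalizes smooth convex minimization.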

