ADAPTIVE GRADIENT METHODS CAN BE PROVABLY FASTER THAN SGD WITH RANDOM SHUFFLING

Abstract

Adaptive gradient methods have been shown to outperform SGD in many tasks of training neural networks. However, this acceleration effect is yet to be explained in the non-convex setting, since the best known convergence rate of adaptive gradient methods is worse than that of SGD in the literature. In this paper, we prove that adaptive gradient methods exhibit an Õ(T^{-1/2}) convergence rate for finding first-order stationary points under the strong growth condition, which improves the previous best convergence results of adaptive gradient methods and random shuffling SGD by factors of O(T^{1/4}) and O(T^{1/6}), respectively. In particular, we study two variants of AdaGrad with random shuffling for finite sum minimization. Our analysis suggests that the combination of random shuffling and adaptive learning rates gives rise to better convergence.

1. INTRODUCTION

We consider the finite sum minimization problem in stochastic optimization:

min_{x ∈ R^d} f(x) = (1/n) ∑_{i=1}^{n} f_i(x),

where f is the objective function and its component functions f_i : R^d → R are smooth and possibly non-convex. This formulation is used extensively in training neural networks today. Stochastic gradient descent (SGD) and its variants have proven quite effective for solving this problem, while recent works demonstrate another prominent line of gradient-based algorithms that introduce adaptive step sizes to automatically adjust the learning rate (Duchi et al., 2011; Tieleman & Hinton, 2012; Kingma & Ba, 2014). Despite the superior performance of adaptive gradient methods in many tasks (Devlin et al., 2019; Vaswani et al., 2017), their theoretical convergence guarantees for non-convex objectives remain the same as, or even worse than, those of SGD.

In general non-convex settings, it is often impractical to discuss optimal solutions, so the analysis turns to stationary points instead. Table 1 summarizes some previous best-known results for finding first-order stationary points. One might notice that the best dependence on the total iteration number T of adaptive gradient methods matches that of vanilla SGD. In addition, with the introduction of incremental sampling techniques, an even better convergence rate for SGD can be obtained (Haochen & Sra, 2019; Nguyen et al., 2020). This gap between the theory and practice of adaptive gradient methods has been an open problem that we aim to resolve in this paper.

Motivated by the analysis of sampling techniques, we rigorously prove that adaptive gradient methods exhibit a faster non-asymptotic convergence rate that matches the best result on SGD. In particular, we make the following contributions:

• Our main contribution (Theorems 1, 2, and 3) is to prove that two variants of AdaGrad can find Õ(T^{-1/2})-approximate first-order stationary points under the strong growth condition (Schmidt & Roux, 2013).
This improves the previous best convergence results of adaptive gradient methods and shuffling SGD by factors of O(T^{1/4}) and O(T^{1/6}), respectively.
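To make the setting concrete, the following sketch implements diagonal AdaGrad with random shuffling for a finite sum: each epoch draws a fresh random permutation of the n component functions, so every f_i is used exactly once per epoch (sampling without replacement). The step-size constant, toy quadratic objective, and function names below are illustrative assumptions for this sketch, not the paper's exact algorithm or hyperparameters.

```python
import numpy as np

def adagrad_random_shuffling(grad_fns, x0, lr=0.5, eps=1e-8, epochs=200, seed=0):
    """Diagonal AdaGrad over a finite sum f(x) = (1/n) sum_i f_i(x),
    with components visited in a fresh random permutation each epoch."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)  # per-coordinate running sum of squared gradients
    n = len(grad_fns)
    for _ in range(epochs):
        for i in rng.permutation(n):  # random shuffling: each f_i used once per epoch
            g = grad_fns[i](x)
            v += g * g  # AdaGrad accumulator
            x -= lr * g / (np.sqrt(v) + eps)  # coordinate-wise adaptive step
    return x

# Hypothetical toy finite sum: f_i(x) = 0.5 * ||x - a_i||^2, so f is minimized
# at the mean of the a_i; gradient of f_i is simply x - a_i.
a = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([2.0, 3.0])]
grads = [(lambda x, ai=ai: x - ai) for ai in a]
x_star = adagrad_random_shuffling(grads, x0=np.zeros(2))
```

On this strongly convex toy problem the iterate approaches the minimizer (the mean of the a_i) as the accumulated squared gradients shrink the effective step size, mirroring the 1/√t decay that drives the convergence analysis.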



Many works have been proposed to study first-order (Chen et al., 2019; Zhou et al., 2018; Zaheer et al., 2018; Ward et al., 2018) and second-order (Allen-Zhu, 2018; Staib et al., 2019) stationary points.

