ADAPTIVE GRADIENT METHODS CAN BE PROVABLY FASTER THAN SGD WITH RANDOM SHUFFLING

Abstract

Adaptive gradient methods have been shown to outperform SGD in many tasks of training neural networks. However, this acceleration effect is yet to be explained in the non-convex setting, since the best known convergence rate of adaptive gradient methods is worse than that of SGD in the literature. In this paper, we prove that adaptive gradient methods exhibit an Õ(T^{-1/2})-convergence rate for finding first-order stationary points under the strong growth condition, which improves the previous best convergence results of adaptive gradient methods and random shuffling SGD by factors of O(T^{1/4}) and O(T^{1/6}), respectively. In particular, we study two variants of AdaGrad with random shuffling for finite sum minimization. Our analysis suggests that the combination of random shuffling and adaptive learning rates gives rise to better convergence.

1. INTRODUCTION

We consider the finite sum minimization problem in stochastic optimization:

$$\min_{x \in \mathbb{R}^d} f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x), \qquad (1)$$

where f is the objective function and its component functions $f_i : \mathbb{R}^d \to \mathbb{R}$ are smooth and possibly non-convex. This formulation is used extensively in training neural networks today. Stochastic gradient descent (SGD) and its variants have been shown to be quite effective for solving this problem, whereas recent works demonstrate another prominent line of gradient-based algorithms by introducing adaptive step sizes to automatically adjust the learning rate (Duchi et al., 2011; Tieleman & Hinton, 2012; Kingma & Ba, 2014). Despite the superior performance of adaptive gradient methods in many tasks (Devlin et al., 2019; Vaswani et al., 2017), their theoretical convergence remains the same or even worse for non-convex objectives, compared to SGD.

In general non-convex settings, it is often impractical to discuss optimal solutions. Therefore, the analysis turns to stationary points instead. Many works have studied first-order (Chen et al., 2019; Zhou et al., 2018; Zaheer et al., 2018; Ward et al., 2018) and second-order (Allen-Zhu, 2018; Staib et al., 2019) stationary points. Table 1 summarizes some previous best-known results for finding first-order stationary points. One might notice that the best dependence on the total iteration number T of adaptive gradient methods matches that of vanilla SGD. In addition, with the introduction of incremental sampling techniques, an even better convergence rate for SGD can be obtained (Haochen & Sra, 2019; Nguyen et al., 2020). This gap between theory and practice of adaptive gradient methods has been an open problem that we aim to solve in this paper. Motivated by the analysis of sampling techniques, we rigorously prove that adaptive gradient methods exhibit a faster non-asymptotic convergence rate that matches the best result on SGD.
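To make the setting concrete, the following is a minimal sketch (not the paper's exact algorithms) of an AdaGrad-style update applied to a toy instance of problem (1) with random shuffling; the quadratic components and all constants are illustrative assumptions:

```python
import numpy as np

def adagrad_step(x, grad, accum, lr=0.1, eps=1e-8):
    """AdaGrad-style update: the effective step size shrinks with the
    accumulated squared gradients (Duchi et al., 2011)."""
    accum = accum + grad ** 2
    return x - lr * grad / (np.sqrt(accum) + eps), accum

# Toy finite sum: f(x) = (1/n) * sum_i 0.5 * a_i * x^2, so grad f_i(x) = a_i * x.
rng = np.random.default_rng(0)
a = rng.uniform(0.5, 2.0, size=8)      # hypothetical component curvatures
x, accum = 5.0, 0.0
for epoch in range(20):
    for i in rng.permutation(len(a)):  # random shuffling: each f_i once per epoch
        x, accum = adagrad_step(x, a[i] * x, accum)
```

The inner loop samples the components without replacement, which is the random-shuffling scheme analyzed in this paper; replacing `rng.permutation(len(a))` with i.i.d. index draws recovers the usual with-replacement SGD-style sampling.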
In particular, we make the following contributions:

• Our main contribution (Theorems 1, 2, and 3) is to prove that two variants of AdaGrad can find Õ(T^{-1/2})-approximate first-order stationary points under the strong growth condition assumption (Schmidt & Roux, 2013). This improves the previous best convergence results of adaptive gradient methods and shuffling SGD by factors of O(T^{1/4}) and O(T^{1/6}), respectively. As a result, this bridges the gap between the analysis and practice of adaptive gradient methods by proving that adaptive gradient methods can be faster than SGD with random shuffling in theory.

• We study the strong growth condition under which our convergence rate is derived. This condition has previously been used to study SGD in the expectation minimization setting (Schmidt & Roux, 2013; Vaswani et al., 2019). We prove that it is satisfied by two general types of models under some additional assumptions.

• We conduct preliminary experiments to demonstrate the combined acceleration effect of random shuffling and adaptive learning rates.

Our analysis points out two key components that lead to better convergence results of adaptive gradient methods: the epoch-wise analysis of random shuffling incorporates the benefit of full gradients, and the adaptive learning rates, together with the strong growth condition, provide a better improvement of the objective value across consecutive epochs.

[Table 1 (excerpt): shuffling SGD, Õ(n^{1/3} · T^{-1/3}); Ours (bounded gradients, strong growth condition), Õ(n^{1/2} · T^{-1/2}). T denotes the number of iterations; n is the number of samples in the finite sum minimization problem.]

Finite Sum Minimization vs. Expectation Minimization The comparison in Table 1 shows the convergence rates in the non-convex setting with respect to first-order stationary points. The results in the first two categories apply to the general expectation minimization problem with f(x) = E_z f(x, z).
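For reference, one common statement of the strong growth condition in the finite sum setting (following Schmidt & Roux, 2013; the constant name ρ is ours) is:

```latex
% Strong growth condition: component gradients vanish wherever the full
% gradient does, uniformly controlled by a constant \rho \ge 1.
\frac{1}{n}\sum_{i=1}^{n} \big\| \nabla f_i(x) \big\|^2
  \;\le\; \rho \, \big\| \nabla f(x) \big\|^2
  \qquad \text{for all } x \in \mathbb{R}^d .
```

Intuitively, the condition rules out components whose gradients stay large at stationary points of f, as happens, for instance, in interpolation regimes where every f_i is minimized simultaneously.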
While convergence results for expectation minimization naturally transfer to finite sum minimization, the statements remain asymptotic, i.e., $\mathbb{E}\|\nabla f(x)\| \sim O(T^{-\alpha})$, where the expectation is taken to compensate for the stochastic gradients. Many efforts have been made to reduce variance in finite sum minimization (Johnson & Zhang, 2013; Reddi et al., 2016; Haochen & Sra, 2019). In particular, non-asymptotic results can be obtained using random shuffling, under which the dependency on the sample size n seems to be unavoidable (Haochen & Sra, 2019).
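The epoch-wise benefit of random shuffling mentioned above can be made concrete: within one epoch, sampling without replacement visits every component exactly once, so the epoch's average gradient (evaluated at a frozen point) equals the full gradient exactly, whereas with-replacement sampling only matches it in expectation. A small illustrative check, with hypothetical per-component gradients:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 3
A = rng.normal(size=(n, d))            # per-component gradients at a fixed point x

# With-replacement (i.i.d.) sampling: some components may repeat or be missed,
# so one pass matches the full gradient only in expectation.
idx_iid = rng.integers(0, n, size=n)
avg_iid = A[idx_iid].mean(axis=0)

# Random shuffling: one epoch visits each component exactly once, so the
# epoch average recovers the full gradient at the frozen point exactly.
idx_shuffled = rng.permutation(n)
avg_shuffled = A[idx_shuffled].mean(axis=0)

exact = np.allclose(avg_shuffled, A.mean(axis=0))  # True by construction
```

This exact-recovery property is what lets epoch-wise analyses of shuffling methods borrow the behavior of full-gradient descent, at the cost of the dependence on n noted above.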

2. PRELIMINARIES

A typical setting of machine learning using gradient methods is the finite sum minimization in equation (1). In this problem, the number of samples n is usually very large, rendering the evaluation of full gradients expensive. Therefore, a mini-batch gradient is introduced to approximate the full



Table 1: Convergence rate comparisons for the non-convex optimization problem. The first two categories, 'SGD' and 'Adaptive Gradient Methods', are based on the expectation minimization problem, whereas the last category is based on the finite sum minimization problem.

