THE CONVERGENCE RATE OF SGD'S FINAL ITERATE: ANALYSIS ON DIMENSION DEPENDENCE

Anonymous authors
Paper under double-blind review

Abstract

Stochastic Gradient Descent (SGD) is among the simplest and most popular optimization methods in machine learning. Running SGD with a fixed step size and outputting the final iterate is the ideal strategy one can hope for, but it is still not well understood even though SGD has been studied extensively for over 70 years. Given the Θ(log T) gap between the current upper and lower bounds for running SGD for T steps, Koren & Segal (2020) asked how to characterize the final-iterate convergence of SGD with a fixed step size in the constant-dimension setting, i.e., d = O(1). In this paper, we consider the more general setting of any d ≤ T, proving Ω(log d/√T) lower bounds for the sub-optimality of the final iterate of SGD in minimizing non-smooth Lipschitz convex functions with standard step sizes. Our results provide the first general dimension-dependent lower bound on the convergence of SGD's final iterate, partially resolving the COLT open question raised by Koren & Segal (2020). Moreover, we present a new method in one dimension based on martingales and Freedman's inequality, which attains the tight O(1/√T) upper bound under mild assumptions.

1. INTRODUCTION

Stochastic gradient descent (SGD) was first introduced by Robbins & Monro (1951). It soon became one of the most popular tools in applied machine learning, e.g., Johnson & Zhang (2013); Schmidt et al. (2017), due to its simplicity and effectiveness. SGD works by iteratively taking a small step in the opposite direction of an unbiased estimate of a sub-gradient and is widely used to minimize a convex function f over a convex domain K. Formally speaking, given a stochastic gradient oracle, for an input x ∈ K the oracle returns a random vector ĝ whose expectation is equal to one of the sub-gradients of f at x. Given an initial point x_1, SGD generates a sequence of points x_1, ..., x_{T+1} according to the update rule

x_{t+1} = Π_K(x_t − η_t ĝ_t),    (1)

where Π_K denotes projection onto K and {η_t}_{t≥1} is a sequence of step sizes. Theoretical analyses of SGD usually adopt the running average, i.e., outputting (1/T) Σ_{t=1}^{T} x_t in the end, to get optimal rates of convergence in the stochastic approximation setting. Optimal convergence rates have been achieved in both the convex and strongly convex settings when averaging of iterates is used (Nemirovskij & Yudin (1983); Zinkevich (2003); Kakade & Tewari (2008); Cesa-Bianchi et al. (2004)). Nonetheless, the final iterate of SGD, which is often preferred over the running average, as pointed out by Shalev-Shwartz et al. (2011), has not been well studied from the theoretical perspective, and convergence results for the final iterate are relatively scarce compared with the running average schedule.

Standard choices of step sizes for convex functions include η_t = 1/√t for unknown horizon T and η_t = 1/√T for known T, and η_t = 1/t for strongly convex functions. In these cases, it is known that the final-iterate convergence rate of SGD is optimal when f is both smooth and strongly convex (Nemirovski et al. (2009)). However, in practice, the convex functions we want to minimize are often non-smooth; see Cohen et al. (2016); Lee et al. (2013) for more details. The convergence rate of SGD's final iterate with standard step sizes in the non-smooth setting is much less explored. Understanding this problem is essential as the final iterate of SGD is popular and often used. If the last iterate of SGD performs as well as the running average, it yields a very simple, implementable, interpretable form of SGD. If there is a lower bound saying the last iterate of SGD is worse than the running average, we may need to compare the last iterate and the running average when implementing the algorithm.

A line of works attempts to understand the convergence rate of the final iterate of SGD. A seminal work, Shamir & Zhang (2013), first established a near-optimal O(log T/√T) convergence rate for the final iterate of SGD with a STANDARD step size schedule η_t = 1/√t. Jain et al. (2019) proved an information-theoretically optimal O(1/√T) upper bound using a rather NON-STANDARD step size schedule. Roughly speaking, the T steps are divided into log T phases, and the step size decreases by half when entering the next phase. Many implementations take ever-shrinking step sizes, which is somewhat consistent with this theoretical result. Harvey et al. (2019a) gave an Ω(log T/√T) lower bound for the STANDARD η_t = 1/√t step size schedule, but their construction requires the dimension d to be no less than T, which is restrictive. See Table 1 for more details.

Table 1: Convergence results for the expected sub-optimality of the final iterate of SGD for minimizing non-smooth convex functions in various settings. GD denotes the sub-gradient descent method, and lower bounds of GD also hold for SGD. The lower bounds for Lipschitz convex functions in Shamir & Zhang (2013); Harvey et al. (2019a) can also be extended to the fixed step size 1/√T.

A natural question arises:

Question: What is the dependence on the dimension d of the convergence rate of SGD's final iterate with standard step sizes when d ≤ T?

In a recent COLT open question raised by Koren & Segal (2020), the same problem was posed, but mainly for the more restrictive constant-dimension setting. Moreover, they conjectured that the right convergence rate of SGD with standard step sizes in the constant-dimensional case is Θ(1/√T). As preliminary supporting evidence for their conjecture, they analyzed a one-dimensional one-sided random walk as a special case. However, this result is limited to the one-dimensional setting for the particular absolute-value function and thus cannot be easily generalized. Analyzing the final-iterate convergence rate of SGD in general dimension for general convex functions is a more exciting and challenging question. In particular, Koren & Segal (2020) wrote: "For dimension d > 1, a natural conjecture is that the right convergence rate is Θ(log d/√T), but we have no indication to corroborate this." Motivated by this, in this paper we mainly focus on analyzing the final iterate of SGD with standard step sizes in general dimension d ≤ T, without smoothness assumptions.

OUR CONTRIBUTIONS

Our first main result is an Ω(log d/√T) lower bound for SGD minimizing Lipschitz convex functions with a fixed step size η_t = 1/√T when the dimension d ≤ T, generalizing the result in Harvey et al. (2019a). Our main observation is that we can let the initial point x_1 stay still for any number of steps as long as 0 is one of the sub-gradients of f at x_1. By modifying the original construction of Harvey et al. (2019a), we can keep x_1 at 0 for T − d steps and then 'kick' it to start taking a route similar to that in Harvey et al. (2019a) in a d-dimensional space, which incurs an Ω(log d/√T) sub-optimality.
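To make the setup concrete, the following is a minimal sketch of update rule (1) in one dimension, contrasting the final iterate with the running average. The instance used here, f(x) = |x| on K = [−1, 1] with Gaussian oracle noise, is an illustrative assumption echoing the absolute-value example discussed above, not the lower-bound construction of this paper; all helper names are ours.

```python
import math
import random

def projected_sgd(subgrad_oracle, project, x1, T, step):
    """Projected SGD, update rule (1): x_{t+1} = Pi_K(x_t - eta_t * g_t).

    Returns the final iterate x_{T+1} and the running average (1/T) * sum_t x_t.
    """
    x = x1
    avg = 0.0
    for t in range(1, T + 1):
        avg += x / T
        g = subgrad_oracle(x)          # unbiased estimate of a sub-gradient at x
        x = project(x - step(t) * g)   # gradient step, then projection onto K
    return x, avg

# Illustrative 1-D instance: f(x) = |x| on K = [-1, 1].
random.seed(0)

def oracle(x):
    g = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)  # a sub-gradient of |x|
    return g + random.gauss(0.0, 1.0)                # zero-mean noise keeps it unbiased

project = lambda x: max(-1.0, min(1.0, x))           # Pi_K for K = [-1, 1]
step = lambda t: 1.0 / math.sqrt(t)                  # standard schedule eta_t = 1/sqrt(t)

x_final, x_avg = projected_sgd(oracle, project, x1=1.0, T=10000, step=step)
print(abs(x_final), abs(x_avg))  # sub-optimality f(x) - f(x*), since x* = 0
```

Swapping `step` for the non-standard phase-based schedule (constant within each of the ~log T phases, halving between phases) changes only the step-size function, which is what separates the O(1/√T) upper bound from the standard-schedule lower bounds discussed above.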

