THE CONVERGENCE RATE OF SGD'S FINAL ITERATE: ANALYSIS ON DIMENSION DEPENDENCE

Anonymous authors
Paper under double-blind review

Abstract

Stochastic Gradient Descent (SGD) is among the simplest and most popular methods in optimization and machine learning. Running SGD with a fixed step size and outputting the final iterate is the simplest strategy one could hope for, yet its behavior is still not well understood even though SGD has been studied extensively for over 70 years. Given the Θ(log T) gap between the current upper and lower bounds for running SGD for T steps, Koren & Segal (2020) asked how to characterize the final-iterate convergence of SGD with a fixed step size in the constant-dimension setting, i.e., d = O(1). In this paper, we consider the more general setting of any d ≤ T, proving Ω(log d/√T) lower bounds on the sub-optimality of the final iterate of SGD in minimizing non-smooth Lipschitz convex functions with standard step sizes. Our results provide the first general dimension-dependent lower bound on the convergence of SGD's final iterate, partially resolving the COLT open question raised by Koren & Segal (2020). Moreover, we present a new method in one dimension based on martingales and Freedman's inequality, which achieves the tight O(1/√T) upper bound under mild assumptions.

1. INTRODUCTION

Stochastic gradient descent (SGD) was first introduced by Robbins & Monro (1951). It soon became one of the most popular tools in applied machine learning, e.g., Johnson & Zhang (2013); Schmidt et al. (2017), due to its simplicity and effectiveness. SGD works by iteratively taking a small step in the opposite direction of an unbiased estimate of a sub-gradient, and is widely used to minimize a convex function f over a convex domain K. Formally, given a stochastic gradient oracle, for an input x ∈ K the oracle returns a random vector ĝ whose expectation equals one of the sub-gradients of f at x. Given an initial point x_1, SGD generates a sequence of points x_1, ..., x_{T+1} according to the update rule

x_{t+1} = Π_K(x_t − η_t ĝ_t),    (1)

where Π_K denotes projection onto K and {η_t}_{t≥1} is a sequence of step sizes. Theoretical analyses of SGD usually adopt iterate averaging, i.e., outputting (1/T) Σ_{t=1}^{T} x_t at the end, to obtain optimal rates of convergence in the stochastic approximation setting. Optimal convergence rates have been achieved in both the convex and strongly convex settings when averaging of iterates is used (Nemirovskij & Yudin (1983); Zinkevich (2003); Kakade & Tewari (2008); Cesa-Bianchi et al. (2004)). Nonetheless, the final iterate of SGD, which is often preferred over the running average, as pointed out by Shalev-Shwartz et al. (2011), has not been well studied from a theoretical perspective, and convergence results for the final iterate are relatively scarce compared with the averaging schedule. Standard choices of step sizes include η_t = 1/√t for convex functions with unknown horizon T, η_t = 1/√T for known T, and η_t = 1/t for strongly convex functions. In these cases, it is known that the final-iterate convergence rate of SGD is optimal when f is both smooth and strongly convex (Nemirovski et al. (2009)). However, in practice, the convex functions we want to minimize are often non-smooth; see Cohen et al. (2016); Lee et al. (2013) for more details. The convergence rate of SGD's final iterate with standard step sizes in the non-smooth setting is much less explored. Understanding this problem is essential as the final iterate of SGD is popular and often used. If the last iterate of SGD performs as well as the running average, it yields a very simple, implementable,
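To make the setup concrete, the projected-SGD update (1) with the fixed step size η_t = 1/√T can be sketched as follows. This is only an illustrative toy instance, not a construction from the paper: the objective f(x) = |x| on K = [−1, 1] is a non-smooth 1-Lipschitz convex function, and the stochastic oracle adds zero-mean Gaussian noise to a true sub-gradient.

```python
import math
import random

def projected_sgd(subgrad_oracle, project, x1, T, eta):
    """Run SGD with a fixed step size eta for T steps; return the FINAL iterate,
    following the update rule (1): x_{t+1} = Pi_K(x_t - eta * g_t)."""
    x = x1
    for _ in range(T):
        g = subgrad_oracle(x)      # unbiased estimate of a sub-gradient at x
        x = project(x - eta * g)   # gradient step, then projection back onto K
    return x

# Toy instance (illustrative, not from the paper): f(x) = |x| on K = [-1, 1],
# minimized at x* = 0, so f(x) - f(x*) = |x|.
random.seed(0)

def noisy_subgrad(x):
    # sign(x) is a valid sub-gradient of |x|; add zero-mean noise.
    g = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)
    return g + random.gauss(0.0, 0.1)

def project(x):
    # Euclidean projection onto the interval [-1, 1].
    return max(-1.0, min(1.0, x))

T = 10_000
x_final = projected_sgd(noisy_subgrad, project, x1=1.0, T=T, eta=1.0 / math.sqrt(T))
print(abs(x_final))  # sub-optimality of the final iterate
```

With η = 1/√T the iterate drifts toward the minimizer and then oscillates within a neighborhood of size on the order of the step size; the question studied in the paper is how the sub-optimality of this final iterate scales with T and the dimension d.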

