THE CONVERGENCE RATE OF SGD'S FINAL ITERATE: ANALYSIS ON DIMENSION DEPENDENCE

Anonymous authors. Paper under double-blind review.

Abstract

Stochastic Gradient Descent (SGD) is among the simplest and most popular optimization methods in machine learning. Running SGD with a fixed step size and outputting the final iterate is the simplest strategy one could hope for, but it is still not well understood even though SGD has been studied extensively for over 70 years. Given the Θ(log T) gap between the current upper and lower bounds for running SGD for T steps, Koren & Segal (2020) asked how to characterize the final-iterate convergence of SGD with a fixed step size in the constant-dimension setting, i.e., d = O(1). In this paper, we consider the more general setting of any d ≤ T, proving Ω(log d/√T) lower bounds on the sub-optimality of the final iterate of SGD in minimizing non-smooth Lipschitz convex functions with standard step sizes. Our results provide the first general dimension-dependent lower bound on the convergence of SGD's final iterate, partially resolving the COLT open question raised by Koren & Segal (2020). Moreover, we present a new method in one dimension based on martingales and Freedman's inequality, which attains the tight O(1/√T) upper bound under mild assumptions.

1. INTRODUCTION

Stochastic gradient descent (SGD) was first introduced by Robbins & Monro (1951). It soon became one of the most popular tools in applied machine learning, e.g., Johnson & Zhang (2013); Schmidt et al. (2017), due to its simplicity and effectiveness. SGD works by iteratively taking a small step in the opposite direction of an unbiased estimate of a sub-gradient, and is widely used for minimizing a convex function f over a convex domain K. Formally, given a stochastic gradient oracle and an input x ∈ K, the oracle returns a random vector ĝ whose expectation is one of the sub-gradients of f at x. Given an initial point x_1, SGD generates a sequence of points x_1, ..., x_{T+1} according to the update rule

x_{t+1} = Π_K(x_t − η_t ĝ_t),    (1)

where Π_K denotes projection onto K and {η_t}_{t≥1} is a sequence of step sizes. Theoretical analyses of SGD usually adopt the running average, i.e., outputting (1/T) Σ_{t=1}^T x_t in the end, to obtain optimal rates of convergence in the stochastic approximation setting. Optimal convergence rates have been achieved in both the convex and strongly convex settings when averaging of iterates is used (Nemirovskij & Yudin (1983); Zinkevich (2003); Kakade & Tewari (2008); Cesa-Bianchi et al. (2004)). Nonetheless, the final iterate of SGD, which is often preferred over the running average, as pointed out by Shalev-Shwartz et al. (2011), has not been well studied from the theoretical perspective, and convergence results for the final iterate are relatively scarce compared with the running-average schedule. Standard choices of step sizes for convex functions include η_t = 1/√t for an unknown horizon T and η_t = 1/√T for a known T, and η_t = 1/t for strongly convex functions. In these cases, it is known that the final-iterate convergence rate of SGD is optimal when f is both smooth and strongly convex (Nemirovski et al. (2009)). However, in practice, the convex functions we want to minimize are often non-smooth. See Cohen et al.
(2016); Lee et al. (2013) for more details. The convergence rate of SGD's final iterate with standard step sizes in the non-smooth setting is much less explored. Understanding this problem is essential, as the final iterate of SGD is popular and frequently used. If the last iterate of SGD performs as well as the running average, it yields a very simple, implementable,

and interpretable form of SGD. If there is a lower bound saying the last iterate of SGD is worse than the running average, we may need to compare the last iterate and the running average when implementing the algorithm. A line of works attempts to understand the convergence rate of the final iterate of SGD. A seminal work, Shamir & Zhang (2013), first established a near-optimal O(log T/√T) convergence rate for the final iterate of SGD with a STANDARD step size schedule η_t = 1/√t. Jain et al. (2019) proved an information-theoretically optimal O(1/√T) upper bound using a rather NON-STANDARD step size schedule. Roughly speaking, the T steps are divided into log T phases, and the step size is halved when entering the next phase. Many implementations take ever-shrinking step sizes, which is somewhat consistent with this theoretical result. Harvey et al. (2019a) gave an Ω(log T/√T) lower bound for the STANDARD η_t = 1/√t step size schedule, but their construction requires the dimension d to be no less than T, which is restrictive. See Table 1 for more details. A natural question arises:

Question: What is the dependence on the dimension d of the convergence rate of SGD's final iterate with standard step sizes when d ≤ T?

In a recent COLT open question raised by Koren & Segal (2020), the same problem was posed, but mainly for the more restrictive constant-dimension setting. Moreover, they conjectured that the right convergence rate of SGD with standard step sizes in the constant-dimensional case is Θ(1/√T). As preliminary supporting evidence for their conjecture, they analyzed a one-dimensional one-sided random walk special case. However, this result is limited to the one-dimensional setting for the particular absolute-value function and thus cannot be easily generalized. Analyzing the final-iterate convergence rate of SGD in general dimension for general convex functions is a more exciting and challenging question. In particular, Koren & Segal (2020) wrote: "For dimension d > 1, a natural conjecture is that the right convergence rate is Θ(log d/√T), but we have no indication to corroborate this." Motivated by this, in this paper we mainly focus on analyzing the final iterate of SGD with standard step sizes in general dimension d ≤ T, without smoothness assumptions.

[Table 1 here; its rows (under the column headers Rate, Method, Convexity, Step size, Assumptions) are not recoverable from the extraction.] Table 1: Convergence results for the expected sub-optimality of the final iterate of SGD for minimizing non-smooth convex functions in various settings. GD denotes the sub-gradient descent method, and lower bounds for GD also hold for SGD. The lower bounds for Lipschitz convex functions in Shamir & Zhang (2013); Harvey et al. (2019a) can also be extended to the fixed step size 1/√T, as observed by Koren & Segal (2020).
This result generalizes to Lipschitz convex functions with the 1/√t decreasing step size schedule with the same sub-optimality, and an Ω(log d/T) lower bound for strongly convex functions with the 1/t step size schedule is constructed with a similar technique. Our lower bound results imply that the last iterate of SGD with a fixed step size has a theoretically sub-optimal convergence rate in general, even though it is used a lot in practice. As for the upper bound, we present a new method based on martingales and Freedman's inequality to analyze the one-dimensional case. Though seemingly straightforward, the convergence rate of fixed-step-size SGD for one-dimensional linear functions is still open and non-trivial. Koren & Segal (2020) considered minimizing a linear function with a restricted SGD oracle that only outputs ±1, reducing the problem to a one-sided random walk. We relax the restriction on the SGD oracle and prove an O(1/√T) optimal rate for a class of convex functions which we call nearly linear convex functions, with the help of martingale theory. The class of nearly linear functions captures many common functions, such as linear functions, |x|, e^x, x^2 + x, and −sin(x) on [0, 1]. Our contributions are summarized as follows:

• We prove an Ω(log d/√T) lower bound on the sub-optimality of the final iterate of SGD minimizing non-smooth Lipschitz convex functions with the η_t = 1/√T step size schedule. We generalize this bound to the η_t = 1/√t decreasing step size schedule, and also prove an Ω(log d/T) lower bound for non-smooth strongly convex functions with η_t = 1/t. To the best of our knowledge, our results are the first that characterize the general dimension dependence in analyzing the final-iterate convergence of SGD with standard step sizes.
• We prove an optimal O(1/√T) upper bound on the sub-optimality of the final iterate of SGD minimizing nearly linear Lipschitz convex functions with fixed Θ(1/√T) step sizes in one dimension; this class captures a broad family of convex functions, including linear functions.

2. PRELIMINARIES

Given a bounded convex set K ⊂ R^d and a convex function f : K → R defined on K, our goal is to solve min_{x∈K} f(x). In black-box optimization, there is no explicit representation of f. Instead, we can use a stochastic oracle to query sub-gradients of f at x ∈ K. The set K is given in the form of a projection oracle, which outputs the closest point in K to a given point x in the Euclidean norm. We introduce several standard definitions.

Definition 1 (Sub-gradient). A sub-gradient g ∈ R^d of a convex function f : K → R at a point x is a vector satisfying, for any x' ∈ K,

f(x') − f(x) ≥ g^⊤(x' − x).    (2)

We use ∂f(x) to denote the set of all sub-gradients of f at x.

Definition 2 (Strong Convexity). A function f : K → R is said to be α-strongly convex if for any x, y ∈ K and g ∈ ∂f(x), the following holds:

f(y) − f(x) ≥ g^⊤(y − x) + (α/2)‖y − x‖₂².    (3)

Definition 3 (Lipschitz Function). A function f : K → R is called G-Lipschitz (with respect to the ℓ₂ norm) if for any x, y ∈ K, we have:

|f(x) − f(y)| ≤ G‖x − y‖₂.    (4)

Further, if f is convex, the above definition is equivalent to ‖g‖₂ ≤ G for every sub-gradient g. Let Π_K denote the projection operator onto K; the (projected) stochastic gradient descent (SGD) method is described in Algorithm 1.

Algorithm 1 Stochastic gradient descent with the final-iterate output
1: Given K ⊂ R^d, initial point x_1 ∈ K, step size schedule η_t:
2: for t = 1, ..., T do
3:   Query the stochastic gradient oracle at x_t for ĝ_t such that E[ĝ_t | ĝ_1, ..., ĝ_{t−1}] ∈ ∂f(x_t)
4:   y_{t+1} = x_t − η_t ĝ_t
5:   x_{t+1} = Π_K(y_{t+1})
6: end for
7: return x_{T+1}

We make the following standard assumptions on the convex objective f and the run of SGD throughout this paper:

Assumption 1 (Standard Assumption). We assume the following for the objective f and the run of SGD:

• The domain K ⊂ R^d is convex and bounded with diameter D.

• The objective f : K → R is convex and G-Lipschitz, and not necessarily differentiable.
• The stochastic gradients output by the oracle are bounded, ‖ĝ_t‖₂ ≤ G, and satisfy E[ĝ_t | ĝ_1, ..., ĝ_{t−1}] ∈ ∂f(x_t).

The first two items hold for both our lower bound and our upper bound. Our results are in the strong versions regarding the third item. In particular, our lower bound holds even for Gradient Descent (GD), i.e., even if the gradient oracle always outputs ĝ_t ∈ ∂f(x_t) rather than only in expectation, one still has the lower bound Ω(log d/√T). Our upper bound works for SGD, where the oracle's outputs can be stochastic and one only assumes their expectations are sub-gradients.
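For concreteness, Algorithm 1 can be sketched in a few lines of Python. This is our illustration, not code from the paper; the function names (`sgd_final_iterate`, `noisy_abs_oracle`) and the toy objective f(x) = |x| on [−1, 1] are our choices.

```python
import math
import random

def sgd_final_iterate(grad_oracle, project, x0, step_size, T):
    """Run projected SGD (Algorithm 1) and return the FINAL iterate x_{T+1}.

    grad_oracle(t, x) returns a stochastic sub-gradient estimate g_t whose
    expectation lies in the sub-differential of f at x; project maps a point
    back to the feasible set K; step_size(t) gives eta_t.
    """
    x = x0
    for t in range(1, T + 1):
        g = grad_oracle(t, x)
        y = [xj - step_size(t) * gj for xj, gj in zip(x, g)]
        x = project(y)
    return x

# Toy usage: minimize f(x) = |x| on K = [-1, 1] with a noisy oracle.
# The oracle returns sign(x) plus zero-mean +/-1 noise, so its conditional
# expectation is a sub-gradient of |x| and its magnitude is at most G = 2.
def noisy_abs_oracle(t, x):
    sign = 1.0 if x[0] > 0 else (-1.0 if x[0] < 0 else 0.0)
    return [sign + random.choice((-1.0, 1.0))]

def project_interval(y):
    return [min(1.0, max(-1.0, y[0]))]

T = 10_000
random.seed(0)
x_final = sgd_final_iterate(noisy_abs_oracle, project_interval,
                            [0.5], lambda t: 1.0 / math.sqrt(T), T)
```

With the fixed step size 1/√T, the final iterate of this toy run ends up close to the minimizer 0, in line with the one-dimensional upper bound discussed in Section 4.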

3. LOWER BOUNDS

In this section we prove our main result: the final iterate of SGD for (non-smooth) Lipschitz convex functions with the fixed step size η_t = 1/√T has sub-optimality Ω(log d/√T), even with a deterministic oracle. We build upon the construction in Harvey et al. (2019a), which is a variant of classical lower bound constructions (Nesterov (2003)) and proves an Ω(log T/√T) lower bound in the high-dimensional case d ≥ T. In a nutshell, we consider the setting d ≤ T and construct a function f along with a special sub-gradient oracle such that in Algorithm 1 the initial point stays still for the first T − d steps and then starts moving, so that the final iterate satisfies f(x_{T+1}) = Ω(log d/√T). We then extend the analysis to decreasing step sizes and strongly convex functions.

Let [j] be the set of positive integers no larger than j. For simplicity, we consider convex functions over the d-dimensional Euclidean unit ball. Let 0 be the d-dimensional all-zero vector. We first present our proof for general convex functions with fixed step sizes. For decreasing step sizes and strongly convex functions, it is straightforward to scale our construction and obtain the corresponding lower bounds; we defer the proofs to the Appendix.

Theorem 4. For any positive integer T > 0 and 1 ≤ d ≤ T, there exists a 1-Lipschitz convex function f : K → R, where K ⊂ R^d is the Euclidean unit ball, and a non-stochastic sub-gradient oracle satisfying Assumption 1, such that when executing Algorithm 1 on f with initial point 0 and step size schedule η_t = 1/√T, the last iterate satisfies:

f(x_{T+1}) − min_{x∈K} f(x) ≥ log d / (32√T).

Proof. Let B^d be the Euclidean unit ball and define f : B^d → R, for i ∈ [d + 1] ∪ {0}, by

f(x) = max_{0≤i≤d+1} H_i(x), where H_i(x) = h_i^⊤ x,

and for i ≥ 1 we define the j-th coordinate of h_i by

h_{i,j} = a_j (if 1 ≤ j < i); −b_i (if i = j ≤ d); 0 (if i < j ≤ d),

with

a_j = 1/(8(d + 1 − j)), b_j = 1/2 (for j ∈ [d]).

Additionally, let h_0 = 0 and H_0(x) = 0. It is straightforward to check that f is 1-Lipschitz on K, with a minimum value of 0.
Furthermore, ∂f(x) is the convex hull of {h_i | i ∈ I(x)}, where I(x) = {i ≥ 0 | H_i(x) = f(x)}, which is a standard fact in convex analysis (Hiriart-Urruty & Lemaréchal (2013)). Setting x_1 = 0, we observe that f(x_1) = 0, which attains the global minimum, and by the characterization of ∂f(x) above, h_0 = 0 is a sub-gradient at x_1. This observation allows our non-stochastic sub-gradient oracle to output 0 for the first T − d steps and output h_i with i = min I(x) \ {0} for the last d steps. Define z_1 = ⋯ = z_{T−d+1} = 0, let T* := T − d, and for t > T* + 1 further define

z_{t,j} = b_j/√T − a_j (t − j − T* − 1)/√T (if 1 ≤ j < t − T*); 0 (if t − T* ≤ j ≤ d).

We show inductively that these are precisely the first T iterates produced by Algorithm 1 when using the sub-gradient oracle defined above. The following claim is easy to verify from the definition.

Claim 5. The following hold:
• z_t is non-negative. In particular, z_{t,j} ≥ 1/(4√T) for j < t − T* and z_{t,j} = 0 for j ≥ t − T*.
• z_{t,j} ≤ 1/(2√T) for all j. In particular, z_t ∈ K.

Proof. It is evident from the definition that z_{t,j} = 0 for j ≥ t − T*. As b_j/√T = 1/(2√T), it suffices to prove that 0 ≤ a_j (t − j − T* − 1)/√T ≤ 1/(4√T), which is direct since 0 ≤ t − j − T* − 1 ≤ d + 1 − j.

We can now determine the value and sub-gradient at z_t. The case of the first T* steps is trivial, as the sub-gradient oracle always outputs 0 and x_1 never moves. For the last d steps, z_t is supported on its first t − T* − 1 coordinates, and hence h_{t−T*}^⊤ z_t = h_{i−T*}^⊤ z_t for all i > t > T*. For the other case T* + 1 ≤ i < t, one has

z_t^⊤(h_{t−T*} − h_{i−T*}) = Σ_{j=i−T*}^{t−T*} z_{t,j}(h_{t−T*,j} − h_{i−T*,j})
= Σ_{j=i−T*}^{t−T*−1} z_{t,j}(h_{t−T*,j} − h_{i−T*,j})
= Σ_{j=i−T*+1}^{t−T*−1} z_{t,j}(h_{t−T*,j} − h_{i−T*,j}) + z_{t,i−T*}(h_{t−T*,i−T*} − h_{i−T*,i−T*})
= Σ_{j=i−T*+1}^{t−T*−1} z_{t,j} a_j + z_{t,i−T*}(a_{i−T*} + 1/2) > 0,

which means z_t^⊤ h_{t−T*} > z_t^⊤ h_{i−T*} for all T* + 1 ≤ i < t.
These two results together guarantee that H_{t−T*}(z_t) ≥ H_{i−T*}(z_t) for all i ≥ T* + 1, and further f(z_t) = H_{t−T*}(z_t). Combining with the fact that I(z_t) = {t − T*, ..., d + 1}, we conclude that the sub-gradient oracle outputs h_{t−T*} at time t.

Lemma 6. For the function constructed in this section, the t-th iterate of Algorithm 1 equals z_t for every T* < t ≤ T + 1.

Proof. We prove this lemma by induction. For the base case t = T* + 1, we know that z_t = 0 = x_t. Next, suppose z_t = x_t holds for some t. Then

y_{t+1,j} = z_{t,j} − (1/√T) h_{t−T*,j}
= [b_j/√T − a_j (t − j − T* − 1)/√T (for 1 ≤ j < t − T*); 0 (for j ≥ t − T*)] − (1/√T)[a_j (if 1 ≤ j < t − T*); −b_{t−T*} (if t − T* = j ≤ d); 0 (if t − T* < j ≤ d)]
= [b_j/√T − a_j (t − j − T*)/√T (for j < t − T*); b_{t−T*}/√T = 1/(2√T) (for j = t − T*); 0 (for j > t − T*)].

So y_{t+1} = z_{t+1}. Since z_{t+1} ∈ K, we have x_{t+1} = z_{t+1}.

From the above equivalence, the vector x_t in Algorithm 1 equals z_t for t ∈ [T + 1], which allows us to evaluate the final iterate:

f(x_{T+1}) = f(z_{T+1}) = H_{d+1}(z_{T+1}) ≥ Σ_{j=1}^d h_{d+1,j} z_{T+1,j} ≥ Σ_{j=1}^d (1/(8(d + 1 − j))) · (1/(4√T)) > log d / (32√T).

Remark 7. For the case d = 1, we still have the Ω(1/√T) lower bound, by not using Σ_{i=1}^d 1/i > log d in the last step.

Remark 8. Notably, our lower bound is valid even for GD, where one has access to a noiseless sub-gradient oracle. Theorem 4 improves the previously known lower bound by a factor of log d, implying an inevitable dependence on the dimension in the convergence of SGD's final iterate. Though our proof builds upon Harvey et al. (2019a), their construction does not apply directly. Other natural adaptations, for example, cyclic (the gradient oracle repeatedly goes over each coordinate) or repeated (the gradient oracle stays at one coordinate for T/d steps, then moves to the next), do not work here.
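The construction in the proof of Theorem 4 is concrete enough to run. The sketch below is our own code (not from the paper; `run_theorem4` and the choice T = 400, d = 20 are ours): it executes sub-gradient descent against the oracle defined above and checks the claimed bound numerically.

```python
import math

def run_theorem4(T, d):
    """Sub-gradient descent on f(x) = max_i <h_i, x> from Theorem 4, with
    a_j = 1/(8(d+1-j)), b_j = 1/2 and fixed step size 1/sqrt(T). The oracle
    returns h_0 = 0 for the first T - d steps, then h_{min I(x) \\ {0}}."""
    a = [1.0 / (8 * (d + 1 - j)) for j in range(1, d + 1)]   # a[j-1] = a_j
    b = [0.5] * d                                            # b[j-1] = b_j
    h = [[0.0] * d]                                          # h_0 = 0
    for i in range(1, d + 2):
        h.append([a[j] if j + 1 < i
                  else (-b[j] if (j + 1 == i and i <= d) else 0.0)
                  for j in range(d)])
    eta, tol = 1.0 / math.sqrt(T), 1e-12
    x = [0.0] * d
    for t in range(1, T + 1):
        if t <= T - d:
            g = h[0]                                         # oracle outputs 0
        else:
            vals = [sum(u * v for u, v in zip(hi, x)) for hi in h]
            fx = max(vals)
            # min I(x) \ {0}, with a small tolerance for float ties
            idx = min(i for i in range(1, d + 2) if vals[i] >= fx - tol)
            g = h[idx]
        y = [xj - eta * gj for xj, gj in zip(x, g)]
        nrm = math.sqrt(sum(v * v for v in y))
        x = [v / max(1.0, nrm) for v in y]                   # project onto B^d
    return max(sum(u * v for u, v in zip(hi, x)) for hi in h)  # f(x_{T+1})

T, d = 400, 20
f_final = run_theorem4(T, d)
lower = math.log(d) / (32 * math.sqrt(T))  # the bound of Theorem 4
```

For these parameters the returned value of f(x_{T+1}) indeed exceeds log d/(32√T), matching Lemma 6 and the final inequality of the proof.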
Next, we extend this result to Lipschitz convex functions with step sizes η_t = 1/√t and strongly convex functions with step sizes η_t = 1/t, both known to be the optimal choices of learning rate schedule. The proofs are mostly similar to that of Theorem 4, and we defer them to the Appendix.

Corollary 9. For any T and 1 ≤ d ≤ T, there exists a 1-Lipschitz convex function f : K → R, where K ⊂ R^d is the Euclidean unit ball, and a non-stochastic sub-gradient oracle satisfying Assumption 1, such that when executing Algorithm 1 on f with initial point 0 and step size schedule η_t = 1/√t, the last iterate satisfies:

f(x_{T+1}) − min_{x∈K} f(x) ≥ log d / (32√T).

Corollary 10. For any T and 1 ≤ d ≤ T, there exists a 3-Lipschitz and 1-strongly convex function f : K → R, where K ⊂ R^d is the Euclidean unit ball, and a non-stochastic sub-gradient oracle satisfying Assumption 1, such that when executing Algorithm 1 on f with initial point 0 (the global minimum) and step size schedule η_t = 1/t, the final iterate satisfies:

f(x_{T+1}) − min_{x∈K} f(x) ≥ log d / (5T).
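Corollary 9's construction (b_j = √(j + T − d)/(2√T), step size 1/√t, detailed in Appendix B.1) can be checked numerically in the same way. Again this is our own sketch, with our function name and parameter choices, not code from the paper.

```python
import math

def run_corollary9(T, d):
    """Sub-gradient descent with decreasing steps eta_t = 1/sqrt(t) on the
    Corollary 9 construction: a_j = 1/(8(d+1-j)), b_j = sqrt(j+T-d)/(2 sqrt(T)).
    The oracle returns h_0 = 0 for the first T - d steps, then h_{min I(x) \\ {0}}."""
    a = [1.0 / (8 * (d + 1 - j)) for j in range(1, d + 1)]
    b = [math.sqrt(j + T - d) / (2 * math.sqrt(T)) for j in range(1, d + 1)]
    h = [[0.0] * d]                                          # h_0 = 0
    for i in range(1, d + 2):
        h.append([a[j] if j + 1 < i
                  else (-b[j] if (j + 1 == i and i <= d) else 0.0)
                  for j in range(d)])
    tol = 1e-12
    x = [0.0] * d
    for t in range(1, T + 1):
        if t <= T - d:
            g = h[0]
        else:
            vals = [sum(u * v for u, v in zip(hi, x)) for hi in h]
            fx = max(vals)
            idx = min(i for i in range(1, d + 2) if vals[i] >= fx - tol)
            g = h[idx]
        eta = 1.0 / math.sqrt(t)                             # decreasing step size
        y = [xj - eta * gj for xj, gj in zip(x, g)]
        nrm = math.sqrt(sum(v * v for v in y))
        x = [v / max(1.0, nrm) for v in y]                   # project onto B^d
    return max(sum(u * v for u, v in zip(hi, x)) for hi in h)  # f(x_{T+1})

T, d = 400, 20
f_final = run_corollary9(T, d)
lower = math.log(d) / (32 * math.sqrt(T))  # the bound of Corollary 9
```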

4. UPPER BOUND IN ONE DIMENSION

Given our lower bound, it is natural to conjecture that the optimal rate is Θ(log d/√T) when d ≤ T. In particular, it is believed that in the one-dimensional case the optimal rate is Θ(1/√T). As mentioned in the introduction, Koren & Segal (2020) considered a random walk induced by a linear function as evidence for this conjecture in one dimension, which is somewhat restricted. In this section, we relax their assumptions by considering a function class that we call nearly linear functions, which captures a broad class of functions, including linear functions, and prove an optimal rate of O(1/√T). For the general Lipschitz convex function class, our analysis also recovers the previously best known bound O(log T/√T).

4.1. NEARLY LINEAR FUNCTIONS

Let f* = min_{x∈K} f(x). We need the following definition before defining nearly linear functions.

Definition 11. We say a point x is good if f(x) − f* ≤ 4GD/√T, and define the set of good points S by S = {x ∈ K : f(x) − f* ≤ 4GD/√T}.

Now we can define the convex function family. In a nutshell, a nearly linear function is one whose sub-gradients do not have too small an absolute value at any point that is not good. Formally:

Definition 12 (Nearly Linear Function). We call a convex function f : K → R nearly linear if there exists a constant 0 < c ≤ 1 such that for any iterate x_t ∉ S, i.e., any iterate that does not belong to the set of good points, we have |E[ĝ_t | ĝ_1, ..., ĝ_{t−1}]| ∈ [cG, G].

We note that any Lipschitz convex function is nearly linear with c = 1/√T, and our later analysis recovers the previously best known bound O(log T/√T) under this interpretation. Therefore our method is a strict improvement over previous results. The family of nearly linear functions captures functions whose sub-gradients do not change drastically outside the set of good points, for example, |x|, e^x, x^2 + x, and −sin(x). The linear functions considered in Koren & Segal (2020) are a special case of this class.
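As an empirical sanity check (our experiment, not from the paper), one can run fixed-step SGD on the nearly linear function f(x) = |x| with a noisy-sign oracle and measure the final-iterate sub-optimality. Here G = 2 and D = 2 bound the oracle and the domain [−1, 1], and the oracle's conditional mean is sign(x), so the class constant is c = 1/2.

```python
import math
import random

def sgd_abs(T, eta, x0, rng):
    """SGD on f(x) = |x| over K = [-1, 1] with a noisy oracle:
    g_t = sign(x_t) + xi_t with xi_t = +/-1 uniform, so E[g_t] = sign(x_t)
    is a sub-gradient and |g_t| <= G = 2 (f is trivially 2-Lipschitz)."""
    x = x0
    for _ in range(T):
        g = (1.0 if x > 0 else (-1.0 if x < 0 else 0.0)) + rng.choice((-1.0, 1.0))
        x = min(1.0, max(-1.0, x - eta * g))                 # project onto [-1, 1]
    return x

T = 400
G, D = 2.0, 2.0
eta = 4 * D / (G * math.sqrt(T))          # the fixed step size used in Section 4
rng = random.Random(0)
mean_subopt = sum(abs(sgd_abs(T, eta, 1.0, rng)) for _ in range(200)) / 200
# Theorem 16 predicts E[f(x_{T+1}) - f*] = O(GD / sqrt(T)) = O(0.2) here.
```

The measured mean sub-optimality stays on the order of GD/√T, as the upper bound of this section predicts.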

4.2. ANALYSIS

We show how to bound the convergence of the last iterate of SGD with the fixed step size η = 4D/(G√T) in one dimension for nearly linear functions. The proof consists of two parts. In the first part, we prove that when running SGD with fixed step sizes on any convex function satisfying Assumption 1, with very high probability the solution enters the set of good points at least once. In some sense, this is consistent with the known result that the averaging scheme achieves the optimal rate. The following lemma is a straightforward consequence of convexity.

Lemma 14. For any x ∈ K \ S and any ∇f(x) ∈ ∂f(x), one has |∇f(x)| > G/√T.

Suppose we start from an arbitrary point x_1 ∈ K, and denote the (random) sequence of the SGD algorithm with the fixed step size η by x_1, x_2, ..., x_{T+1}, i.e., x_{t+1} = Π_K(x_t − η ĝ_t). The following lemma says that with very high probability, the solution enters S at least once.

Lemma 15. Given any x_1 ∈ K, let η = 4D/(G√T). For any nearly linear function f under Assumption 1, define τ_t := ∞ if SGD never enters S in the first t steps, and τ_t := min{1 ≤ i ≤ t | x_i ∈ S} otherwise. If t ≥ T + 1, we have Pr[τ_t = ∞ | x_1] ≤ 2 exp(−Ω(T)).

Lemma 15 shows that the probability that x_t never enters S in the first T steps is negligible; its proof can be found in the Appendix. In the second part, we bound the tail probability of the sub-optimality of the last iterate for nearly linear functions, from which we can bound the expectation of the sub-optimality. Roughly speaking, we consider the events that f(x_{T+1}) − f* ≥ GDk/√T and the last T + 1 − i steps all lie outside the set of good points, and bound their probability by exp(−Ω(k + (T + 1 − i))). By a union bound, the tail probability satisfies Pr[f(x_{T+1}) − f* ≥ GDk/√T] ≤ exp(−Ω(k)), which is enough to get the optimal bound O(GD/√T).

Theorem 16.
Given a positive integer T > 0 that is large enough, running SGD with a fixed step size η = 4D/(G√T) on any nearly linear function f under Assumption 1 for T steps, one has E[f(x_{T+1}) − f*] = O(GD/√T), where f* = min_{x∈K} f(x).

Proof. We bound the tail probability Pr[f(x_{T+1}) − f* ≥ GDk/√T] for any k ≥ 10. We define t := ∞ if SGD never enters the set S, and t := max{1 ≤ i ≤ T + 1 | x_i ∈ S} otherwise. One has

Pr[f(x_{T+1}) − f* ≥ GDk/√T]
= Σ_{i=1}^{T+1} Pr[f(x_{T+1}) − f* ≥ GDk/√T ∧ t = i] + Pr[f(x_{T+1}) − f* ≥ GDk/√T ∧ t = ∞]
= Σ_{i=1}^{T} Pr[f(x_{T+1}) − f* ≥ GDk/√T ∧ t = i] + Pr[f(x_{T+1}) − f* ≥ GDk/√T ∧ t = ∞],

where the second equality follows from the fact that Pr[f(x_{T+1}) − f* ≥ GDk/√T ∧ t = T + 1] = 0 by the definition of S and k ≥ 10. By Lemma 15, we have

Pr[f(x_{T+1}) − f* ≥ GDk/√T ∧ t = ∞] ≤ Pr[t = ∞] ≤ 2 exp(−Ω(T)),

which is negligible when T is large enough.

Now we bound Pr[f(x_{T+1}) − f* ≥ GDk/√T ∧ t = i]. We use y_i = x_i − x_{i−1} to capture the movement of the solution. Let n_L = inf_{x∈S} x and n_R = sup_{x∈S} x, which exist because the domain is bounded and the function is continuous. By Definition 11, there exists x* ∈ argmin_{x∈K} f(x) such that either |n_R − x*| ≥ 4D/√T or |n_L − x*| ≥ 4D/√T. By our choice of step size η, if x_j > n_R for some j, it is impossible that x_{j+1} < n_L, and vice versa. Consider the event t = i, and assume first that x_{i+1} > n_R; hence x_j > n_R for all i < j ≤ T + 1. By the assumption that f is nearly linear, we have E[y_j | F_{j−1}] ∈ [−ηG, −cηG] for the constant c ∈ (0, 1] of Definition 12, where F_{j−1} denotes the natural filtration. Let ỹ_j = y_j − E[y_j | F_{j−1}]. Clearly E[ỹ_j | F_{j−1}] = 0 and |ỹ_j| ≤ 2ηG, so {ỹ_j} is a martingale difference sequence. Since |y_j| ≤ ηG, we know that

W_{(i,T+1]} := Σ_{j=i+1}^{T+1} E[ỹ_j² | F_{j−1}] ≤ Σ_{j=i+1}^{T+1} E[y_j² | F_{j−1}] ≤ η²G²(T + 1 − i).

Let µ = Σ_{j=i+1}^{T+1} E[y_j | F_{j−1}]. It is evident that µ ≤ −cηG(T + 1 − i) by the nearly linear assumption and x_j > n_R. Condition on f(x_{T+1}) − f* ≥ GDk/√T ∧ t = i. It follows that Σ_{j=i+1}^{T+1} y_j ≥ D(k − 4)/√T. More specifically, as x_i ∈ S and thus f(x_i) − f* ≤ 4GD/√T, we have f(x_{T+1}) − f(x_i) ≥ GD(k − 4)/√T and further x_{T+1} − x_i = Σ_{j=i+1}^{T+1} y_j ≥ D(k − 4)/√T. Moreover, we know W_{(i,T+1]} ≤ η²G²(T + 1 − i) and µ ≤ −cηG(T + 1 − i), which means

Σ_{j=i+1}^{T+1} ỹ_j = Σ_{j=i+1}^{T+1} y_j − µ ≥ D(k − 4)/√T − µ = D(k − 4)/√T + |µ|.

As for the case x_{i+1} < n_L, conditioning on f(x_{T+1}) − f* ≥ GDk/√T ∧ t = i ∧ x_{i+1} < n_L, it is similar to get Σ_{j=i+1}^{T+1} ỹ_j ≤ −D(k − 4)/√T − |µ| with W_{(i,T+1]} ≤ η²G²(T + 1 − i) and µ ≥ cηG(T + 1 − i) as well. Hence we have

Pr[f(x_{T+1}) − f* ≥ GDk/√T ∧ t = i]
≤ Pr[|Σ_{j=i+1}^{T+1} y_j| ≥ D(k − 4)/√T ∧ t = i]
≤ Pr[|Σ_{j=i+1}^{T+1} ỹ_j| ≥ D(k − 4)/√T + |µ| ∧ W_{(i,T+1]} ≤ η²G²(T + 1 − i) ∧ |µ| ≥ cηG(T + 1 − i)],    (8)

where the second inequality follows from the analysis above.
Applying Freedman's inequality (Theorem 13) to Equation (8), with µ = Σ_{j=i+1}^{T+1} E[y_j | F_{j−1}] as above, one has

Pr[f(x_{T+1}) − f* ≥ GDk/√T ∧ t = i]
≤ max_{|µ| ≥ cηG(T+1−i)} 2 exp( − (D(k − 4)/√T + |µ|)²/2 / (η²G²(T + 1 − i) + 2ηG(D(k − 4)/√T + |µ|)/3) )
≤ max_{|µ| ≥ cηG(T+1−i)} 2 exp( − (D(k − 4)/√T + |µ|)/2 / (ηG/c + 2ηG/3) )
≤ 2 exp( − (3c/(10ηG)) (D(k − 4)/√T + cηG(T + 1 − i)) )
= 2 exp( − 3c(k − 4)/40 − (3/10) c²(T + 1 − i) ).

Further, for k ≥ 10, we have

Pr[f(x_{T+1}) − f* ≥ GDk/√T]
= Σ_{i=1}^{T} Pr[f(x_{T+1}) − f* ≥ GDk/√T ∧ t = i] + Pr[f(x_{T+1}) − f* ≥ GDk/√T ∧ t = ∞]
≤ Σ_{i=1}^{T} 2 exp( − 3c(k − 4)/40 − (3/10) c²(T + 1 − i) ) + 2 exp(−Ω(T))
≤ (20/(3c²)) exp( − 3c(k − 4)/40 ) + 2 exp(−Ω(T)),

where the last step follows from the fact that for any constant C > 0, one has Σ_{i=1}^{T} exp(−Ci) ≤ ∫_0^∞ exp(−Cx) dx = 1/C. As a result, for h ≥ 10GD/√T, we have Pr[f(x_{T+1}) − f* ≥ h] = O(exp(−hλ)), where λ = Θ(√T/(GD)). The conclusion follows from

E[f(x_{T+1}) − f*] = ∫_0^{GD} Pr[f(x_{T+1}) − f* ≥ h] dh = O(1/λ) = O(GD/√T).
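The tail bound above rests on Freedman's inequality for martingales with bounded increments and bounded predictable quadratic variation. A minimal numerical sanity check (ours, not from the paper): for ±1 increments, the predictable quadratic variation is exactly n, and the empirical tail of the martingale sum sits below the Freedman bound.

```python
import math
import random

def freedman_bound(lam, sigma2, M):
    """Freedman's tail bound: for a martingale S_n with increments bounded
    by M and predictable quadratic variation W_n,
    Pr[S_n >= lam and W_n <= sigma2] <= exp(-lam^2 / (2 (sigma2 + M lam / 3)))."""
    return math.exp(-lam ** 2 / (2 * (sigma2 + M * lam / 3.0)))

# Empirical check with +/-1 increments (M = 1, W_n = n deterministically).
rng = random.Random(1)
n, lam, trials = 100, 20.0, 20000
hits = sum(1 for _ in range(trials)
           if sum(rng.choice((-1.0, 1.0)) for _ in range(n)) >= lam)
empirical = hits / trials
bound = freedman_bound(lam, float(n), 1.0)
```

With n = 100 and λ = 20, the bound evaluates to exp(−1.875) ≈ 0.15, comfortably above the empirical tail frequency.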

5. CONCLUSION

In this paper, we analyze the final-iterate convergence rate of SGD with standard step size schedules, proving Ω(log d/√T) and Ω(log d/T) lower bounds on the sub-optimality of SGD minimizing non-smooth general convex and strongly convex functions, respectively. We also prove a tight O(1/√T) upper bound for one-dimensional nearly linear functions, a more general setting than that of Koren & Segal (2020). To the best of our knowledge, this work is the first that characterizes the dependence on the dimension in the general d ≤ T setting, and we hope it can advance the theoretical understanding of the final-iterate convergence of SGD with standard step sizes and guide implementations in practice.

A MORE PRELIMINARIES

A.1 PRELIMINARIES ON MARTINGALES

We recall some basic definitions related to martingales, which are used in the proofs.

Definition 17 (Martingale). A sequence Y_1, Y_2, ... is said to be a martingale with respect to another sequence X_1, X_2, ... if for all n:
• E(|Y_n|) < ∞;
• E(Y_{n+1} | X_1, ..., X_n) = Y_n.

Definition 18 (Martingale Difference). Consider an adapted sequence {X_t, F_t}_{−∞}^{∞} on a probability space. {X_t} is a martingale difference sequence (MDS) if it satisfies the following two conditions for all t:
• E|X_t| < ∞;
• E[X_t | F_{t−1}] = 0, a.s.

Definition 19 (Stopping Time). A stopping time with respect to a sequence of random variables X_1, X_2, X_3, ... is a random variable τ with the property that for each t, the occurrence or non-occurrence of the event τ = t depends only on the values of X_1, X_2, ..., X_t.

B OMITTED PROOFS FOR SECTION 3

B.1 PROOF OF COROLLARY 9

Proof. Define f : K = B^d → R and h_i ∈ R^d for i ∈ [d + 1] ∪ {0} by f(x) = max_{0≤i≤d+1} H_i(x), where H_i(x) = h_i^⊤ x. For i ≥ 1 we define

h_{i,j} = a_j (if 1 ≤ j < i); −b_i (if i = j ≤ d); 0 (if i < j ≤ d),

with

a_j = 1/(8(d + 1 − j)), b_j = √(j + T − d)/(2√T) (for j ∈ [d]).

Additionally, let h_0 = 0 and H_0(x) = 0. It is easy to check that f is 1-Lipschitz, with minimal value 0. We have that ∂f(x) is the convex hull of {h_i | i ∈ I(x)}, where I(x) = {i ≥ 0 | H_i(x) = f(x)}. Our non-stochastic sub-gradient oracle outputs 0 for the first T − d steps and outputs h_i with i = min I(x) \ {0} for the last d steps. Define z_1 = ⋯ = z_{T−d+1} = 0, let T* := T − d, and for t > T* + 1 define

z_{t,j} = b_j/√(j + T*) − a_j Σ_{k=j+T*+1}^{t−1} 1/√k (if 1 ≤ j < t − T*); 0 (if t − T* ≤ j ≤ d).

We will show inductively that these are precisely the first T iterates produced by Algorithm 1 when using the sub-gradient oracle defined above. The following claim follows from the definition.

Claim 20. The following hold:
• z_t is non-negative.
In particular, z_{t,j} ≥ 1/(4√T) for j < t − T* and z_{t,j} = 0 for j ≥ t − T*.
• z_{t,j} ≤ 1/(2√T) for all j. In particular, z_t ∈ K.

Proof. It is obvious from the definition that z_{t,j} = 0 for j ≥ t − T*. As b_j/√(j + T*) = 1/(2√T), it suffices to prove that 0 ≤ a_j Σ_{k=j+T*+1}^{t−1} 1/√k ≤ 1/(4√T). We have

0 ≤ Σ_{k=j+T*+1}^{t−1} 1/√k ≤ ∫_{j+T*}^{t−1} (1/√x) dx = 2(t − 1 − j − T*)/(√(t − 1) + √(j + T*)) ≤ 2(t − j − T*)/√(t − 1),

and further (t − j − T*)/√(t − 1) ≤ (T + 1 − j − T*)/√T = (d + 1 − j)/√T by monotonicity. Thus 0 ≤ a_j Σ_{k=j+T*+1}^{t−1} 1/√k ≤ 1/(4√T) follows from the definition of a_j.

We can now determine the value and sub-differential at z_t. The case of the first T* steps is trivial, as the sub-gradient oracle always outputs 0 and x_1 never moves. For the last d steps, z_t is supported on its first t − T* − 1 coordinates, and hence h_{t−T*}^⊤ z_t = h_{i−T*}^⊤ z_t for all i > t > T*. For T* + 1 ≤ i < t, one has

z_t^⊤(h_{t−T*} − h_{i−T*}) = Σ_{j=i−T*}^{t−T*} z_{t,j}(h_{t−T*,j} − h_{i−T*,j}) = z_{t,i−T*}(a_{i−T*} + b_{i−T*}) + Σ_{j=i−T*+1}^{t−T*−1} z_{t,j} a_j > 0,

which means z_t^⊤ h_{t−T*} > z_t^⊤ h_{i−T*} for all T* + 1 ≤ i < t. These two results together guarantee that H_{t−T*}(z_t) ≥ H_{i−T*}(z_t) for all i ≥ T* + 1, and further f(z_t) = H_{t−T*}(z_t). Combining with the fact that I(z_t) = {t − T*, ..., d + 1}, we conclude that the sub-gradient oracle outputs h_{t−T*}.

Lemma 21. For the function constructed in this section, the t-th iterate of Algorithm 1 equals z_t for every T* < t ≤ T + 1.

Proof. We prove this lemma by induction. For the base case t = T* + 1, we know that z_t = 0 = x_t. Next, suppose z_t = x_t holds for some t. Then

y_{t+1,j} = z_{t,j} − (1/√t) h_{t−T*,j}
= [b_j/√(j + T*) − a_j Σ_{k=j+T*+1}^{t−1} 1/√k (for 1 ≤ j < t − T*); 0 (for j ≥ t − T*)] − (1/√t)[a_j (if 1 ≤ j < t − T*); −b_{t−T*} (if t − T* = j ≤ d); 0 (if t − T* < j ≤ d)]
= [b_j/√(j + T*) − a_j Σ_{k=j+T*+1}^{t} 1/√k (for j < t − T*); b_{t−T*}/√t = b_j/√(j + T*) (for j = t − T*); 0 (for j > t − T*)].
So y_{t+1} = z_{t+1}. Since z_{t+1} ∈ K, we have x_{t+1} = z_{t+1}.

From the above claim, the vector x_t in Algorithm 1 equals z_t for t ∈ [T + 1], which allows us to evaluate the final iterate:

f(x_{T+1}) = f(z_{T+1}) = H_{d+1}(z_{T+1}) ≥ Σ_{j=1}^d h_{d+1,j} z_{T+1,j} ≥ Σ_{j=1}^d (1/(8(d + 1 − j))) · (1/(4√T)) > log d / (32√T).

B.2 PROOF OF COROLLARY 10

Proof. Define f : K = B^d → R via H_i for i ∈ [d + 1] ∪ {0} by

f(x) = max_{0≤i≤d+1} H_i(x), where H_i(x) = h_i^⊤ x + (1/2)‖x‖²,

and for i ≥ 1 we define the j-th coordinate of h_i by

h_{i,j} = a_j (if 1 ≤ j < i); −1 (if i = j ≤ d); 0 (if i < j ≤ d),

with a_j = 1/(2(d + 1 − j)) (for j ∈ [d]). Additionally, let h_0 = 0 and H_0(x) = (1/2)‖x‖². It is straightforward to check that f is 3-Lipschitz and 1-strongly convex on K, with minimal value 0. Furthermore, ∂f(x) is the convex hull of {h_i + x | i ∈ I(x)}, where I(x) = {i ≥ 0 | H_i(x) = f(x)}, a standard fact in convex analysis (Hiriart-Urruty & Lemaréchal (2013)). Setting x_1 = 0, we observe that f(x_1) = 0, which attains the global minimum, and by the characterization of ∂f(x) above, h_0 + x_1 = 0 is a sub-gradient at x_1. This observation allows our non-stochastic sub-gradient oracle to output 0 for the first T − d steps and output h_i + x with i = min I(x) \ {0} for the last d steps, since outputting 0 in the first T − d steps makes x_1 = ... = x_{T−d+1} = 0 by the update rule of SGD. Define z_1 = ⋯ = z_{T−d+1} = 0, let T* := T − d, and for t > T* + 1 define

z_{t,j} = (1 − (t − T* − j − 1)a_j)/(t − 1) (if 1 ≤ j < t − T*); 0 (if t − T* ≤ j ≤ d).

We will show inductively that these are precisely the first T iterates produced by Algorithm 1 when using the sub-gradient oracle defined above. The following claim is easy to verify from the definition.

Claim 22. The following hold:
• z_t is non-negative. In particular, z_{t,j} ≥ 1/(2(t − 1)) for j < t − T* and z_{t,j} = 0 for j ≥ t − T*.
• z t = 0 for t ∈ [T * + 1] and z t 2 ≤ 1 t-1 for t > T * + 1. Thus z t ∈ K for all t. Proof. The first claim simply follows from the fact that t-T * -j-1 d-j+1 ≤ 1. The second claim follows from that (t -T * -1) 1 (t-1) 2 ≤ 1 t-1 . We can now determine the value and sub-differential at z t . The case for the first T * steps is trivial as the sub-gradient oracle always outputs 0 and x 1 never moves. For the value of last d steps we observe that z t is supported on its first t -T * coordinates by definition, and as a result h t-T * z t = h i-T * z t for all i > t > T * . For the other case T * + 1 ≤ i < t, one have that z t (h t-T * -h i-T * ) = t-1 j=i z t,j (h t-T * ,j -h i-T * ,j ) = z t,i (a i + 1) + t-1 j=i+1 z t,j a j > 0. which means z t h t-T * > z t h i-T * for all T * + 1 ≤ i < t. The two results together guarantee that H t-T * (z t ) ≥ H i-T * (z t ) for all T * + 1 ≤ i and thus f (z t ) = H t-T * (z t ). Combining with the fact I(z t ) = {t -T * , ..., d + 1}, we conclude that the sub-gradient oracle outputs h t-T * + z t . Lemma 23. For the function f and its gradient oracle constructed in the proof, the output x t of t-th step in Algorithm 1 equals to z t for every T * < t ≤ T + 1. Proof. We prove this lemma by induction. For base case t = T * + 1, we know by the definition of z t that z t = 0 = x t holds. Next, when z t = x t for some t holds, we have that y t+1,j =z t,j -1 t (h t-T * ,j + z t,j ) = t -1 t 1-(t-T * -j-1)aj t-1 ( for 1 ≤ j < t -T * ) 0 ( for j ≥ t -T * ) - 1 t a j ( if 1 ≤ j < t -T * ) -1 ( if t -T * = j ≤ d) 0 ( if t -T * < j ≤ d) = 1 t 1 -(t -T * -j -1)a j ( for 1 ≤ j < t -T * ) 0 ( for j ≥ t -T * ) - 1 t a j ( if 1 ≤ j < t -T * ) -1 ( if t -T * = j ≤ d) 0 ( if t -T * < j ≤ d) =    1-(t-T * -j)aj t ( for j < t -T * ) 1 t ( for j = t -T * ) 0 ( fro j > t -T * )    . So y t+1 = z t+1 . Since z t+1 ∈ K, we have that x t+1 = z t+1 . 
From the above equivalence we have that the vector x t in algorithm 1 is equal to z t for t ∈ [T + 1], which allows the determination of the value of the final iterate: f (x T +1 ) = f (z T +1 ) = H d+1 (z T +1 ) ≥ d j=1 h d+1,j z T +1,j ≥ d j=1 1 2(d + 1 -j) 1 2T > log d 5T . C OMITTED PROOFS OF SECTION 4 C.1 PROOF OF LEMMA 14 Proof. We prove this statement by contradiction. Suppose there exists x ∈ K \S such that |∇f (x)| ≤ G √ T . By the convexity of f and the definition of sub-gradient and let x * ∈ K be a minimizer (arbitrarily if the minimizers are not unique), one has f (x * ) ≥ f (x) + ∇f (x)(x -x * ), which implies that f (x) -f (x * ) ≤∇f (x)(x * -x) ≤ GD √ T . This means x ∈ S and thus is a contradiction.
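As a sanity check, the strongly convex construction in the proof of Corollary 10 above can be simulated directly. The following Python sketch is illustrative only (the function name, the numerical tolerance, and the values $d = 32$, $T = 1024$ are our choices, not the paper's): it runs projected SGD with step size $1/t$ on $f(x) = \max_i H_i(x)$ over the unit ball, using the tie-breaking oracle $i = \min I(x) \setminus \{0\}$ from the proof, and checks the claimed final-iterate gap $f(x_{T+1}) > \log d / (5T)$.

```python
import math

def run_sgd_lower_bound(d, T, tol=1e-9):
    """Simulate SGD with step size 1/t on f(x) = max_i (h_i.x + ||x||^2/2)
    over the unit ball, with the oracle from the proof; return f(x_{T+1})."""
    assert T >= d
    a = [1.0 / (2 * (d + 1 - j)) for j in range(1, d + 1)]  # a_j for j = 1..d

    def h(i, j):  # j-th coordinate (1-indexed) of h_i, for i = 1..d+1
        if j < i:
            return a[j - 1]
        if j == i and i <= d:
            return -1.0
        return 0.0

    def H(i, x):  # H_i(x) = h_i.x + ||x||^2 / 2, with h_0 = 0
        lin = sum(h(i, j) * x[j - 1] for j in range(1, d + 1)) if i >= 1 else 0.0
        return lin + 0.5 * sum(v * v for v in x)

    def f(x):
        return max(H(i, x) for i in range(0, d + 2))

    x = [0.0] * d
    t_star = T - d
    for t in range(1, T + 1):
        if t <= t_star:
            g = [0.0] * d  # 0 is a valid sub-gradient at x = 0
        else:
            fx = f(x)
            # i = min I(x) \ {0}; tolerance handles floating-point ties
            i = min(i for i in range(1, d + 2) if H(i, x) >= fx - tol)
            g = [h(i, j) + x[j - 1] for j in range(1, d + 1)]  # h_i + x
        y = [x[j] - g[j] / t for j in range(d)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y] if norm > 1 else y  # project onto unit ball
    return f(x)

d, T = 32, 1024
final_value = run_sgd_lower_bound(d, T)
# the minimum of f is 0, so this is exactly the sub-optimality of x_{T+1}
assert final_value > math.log(d) / (5 * T)
```

In the simulation the projection never activates, matching Claim 22's guarantee that every iterate stays inside the unit ball.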

C.2 PROOF OF LEMMA 15

Proof. Let $n_L = \inf_{x \in S} x$ and $n_R = \sup_{x \in S} x$, which exist because the domain is bounded. By our setting of parameters and the definitions, if $x_j > n_R$ then it is impossible that $x_{j+1} < n_L$, and vice versa. As we are considering $\tau = \infty$, either $x_i > n_R$ for all $1 \le i \le t$, or $x_i < n_L$ for all $1 \le i \le t$. Without loss of generality, we first consider the case where $x_i > n_R$ for all $1 \le i \le t$.

We define the random variable $y_i = x_i - x_{i-1}$ to capture the movement of the solution for $2 \le i \le t$. Conditioning on $\tau = \infty$, i.e., $x_i > n_R$ for all $1 \le i \le t$, we have that $\mathbb{E}[y_i \mid F_{i-1}] \le -c\eta G = -4D/T$ for $i \ge 2$ by Lemma 14 (the projection only makes the expectation smaller). Following the standard arguments, let $F_i$ be the filtration and $\tilde{y}_i = y_i - \mathbb{E}[y_i \mid F_{i-1}]$. It is easy to verify that $\{\tilde{y}_i\}$ is a martingale difference sequence:
$$\mathbb{E}[\tilde{y}_i \mid F_{i-1}] = \mathbb{E}[y_i \mid F_{i-1}] - \mathbb{E}[y_i \mid F_{i-1}] = 0. \qquad (12)$$
Obviously, one has $|\tilde{y}_i| \le G\eta = \frac{4D}{\sqrt{T}}$ by the third line of Assumption 1; in particular, $\mathbb{E}[|\tilde{y}_i|] \le G\eta < \infty$. As a result,
$$\mathbb{E}[\tilde{y}_i^2 \mid F_{i-1}] = \mathbb{E}[y_i^2 \mid F_{i-1}] - \left(\mathbb{E}[y_i \mid F_{i-1}]\right)^2 \le \mathbb{E}[y_i^2 \mid F_{i-1}] \le \eta^2 G^2.$$
Hence, we get the estimate
$$W_t = \sum_{i=2}^{t} \mathbb{E}[\tilde{y}_i^2 \mid F_{i-1}] \le (t-1)\,\eta^2 G^2.$$
Let $\Delta := \sum_{i=2}^{t} \mathbb{E}[y_i \mid F_{i-1}]$. We know that $\Delta \le -(t-1)\,\eta c G$. So far, we have shown that if $x_i > n_R$ for all $1 \le i \le t$, then we must have $\Delta \le -(t-1)\,\eta c G$, and at the same time $D \ge \sum_{i=2}^{t} y_i \ge -D$, which must happen since the diameter of the domain is $D$. Similarly, if $x_i < n_L$ for all $1 \le i \le t$, then we have $\Delta \ge (t-1)\,\eta c G$ and $-D \le \sum_{i=2}^{t} y_i \le D$. If we can show that the probability that $|\Delta| \ge (t-1)\,\eta c G$ and $|\sum_{i=2}^{t} y_i| \le D$ happen simultaneously is small, we are done. By Freedman's inequality, if $t = T+1$, one has that this probability is indeed small. We complete the proof.

OUR CONTRIBUTIONS

Our first main result is an $\Omega(\log d/\sqrt{T})$ lower bound for SGD minimizing Lipschitz convex functions with the fixed step size $\eta_t = 1/\sqrt{T}$ when the dimension $d \le T$, generalizing the result of Harvey et al. (2019a). Our main observation is that we can let the initial point $x_1$ stay still for any number of steps as long as $0$ is one of the sub-gradients of $f$ at $x_1$. By modifying the original construction of Harvey et al. (2019a), we can keep $x_1$ at $0$ for $T-d$ steps and then 'kick' it to start taking a route similar to that of Harvey et al. (2019a) in a $d$-dimensional space, which incurs an $\Omega(\log d/\sqrt{T})$ sub-optimality.

… lie in this family. The nice property of nearly linear functions allows a martingale-based analysis, which gives an improved $O(1/\sqrt{T})$ upper bound.
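The Freedman-type tail bound invoked in the proof of Lemma 15 above can be checked numerically. The sketch below is illustrative only (the Rademacher increments and the parameters $n$, $\lambda$, and the trial count are our assumptions, not the paper's setting): it draws bounded martingale differences with $|\tilde{y}_i| \le R = 1$ and predictable variance $W_n = n$, estimates $P(\sum_i \tilde{y}_i \ge \lambda)$ by Monte Carlo, and confirms it sits below Freedman's bound $\exp\!\left(-\lambda^2 / (2(\sigma^2 + R\lambda/3))\right)$.

```python
import math
import random

def freedman_bound(lam, var_sum, increment_bound):
    """Freedman's inequality: P(S_n >= lam and W_n <= var_sum) is at most
    exp(-lam^2 / (2 * (var_sum + increment_bound * lam / 3)))."""
    return math.exp(-lam ** 2 / (2 * (var_sum + increment_bound * lam / 3)))

def tail_estimate(n, lam, trials, seed=0):
    """Monte-Carlo estimate of P(sum of n Rademacher increments >= lam)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    hits = 0
    for _ in range(trials):
        s = sum(1 if rng.random() < 0.5 else -1 for _ in range(n))
        if s >= lam:
            hits += 1
    return hits / trials

# |y_i| <= 1 and E[y_i^2 | F_{i-1}] = 1, so W_n = n deterministically
n, lam = 500, 70
empirical = tail_estimate(n, lam, trials=5000)
bound = freedman_bound(lam, var_sum=n, increment_bound=1.0)
assert empirical <= bound  # the martingale tail sits below Freedman's bound
```

The empirical tail is in fact far below the bound here, since Freedman's inequality is not tight for symmetric increments; the point is only to illustrate the shape of the estimate used in the proof.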

