ADAPTIVE GRADIENT METHODS CONVERGE FASTER WITH OVER-PARAMETERIZATION (AND YOU CAN DO A LINE-SEARCH)

Abstract

Adaptive gradient methods are typically used for training over-parameterized models capable of exactly fitting the data; we thus study their convergence in this interpolation setting. Under an interpolation assumption, we prove that AMSGrad with a constant step-size and momentum can converge to the minimizer at the faster O(1/T) rate for smooth, convex functions. Furthermore, in this setting, we show that AdaGrad can achieve an O(1) regret in the online convex optimization framework. When interpolation is only approximately satisfied, we show that constant step-size AMSGrad converges to a neighbourhood of the solution. On the other hand, we prove that AdaGrad is robust to the violation of interpolation and converges to the minimizer at the optimal rate. However, we demonstrate that even for simple, convex problems satisfying interpolation, the empirical performance of these methods heavily depends on the step-size and requires tuning. We alleviate this problem by using stochastic line-search (SLS) and Polyak step-sizes (SPS) to help these methods adapt to the function's local smoothness. Using these techniques, we prove that AdaGrad and AMSGrad do not require knowledge of problem-dependent constants and retain the convergence guarantees of their constant step-size counterparts. Experimentally, we show that these techniques improve the convergence and generalization performance across tasks, from binary classification with kernel mappings to classification with deep neural networks.

RMSProp and Adam maintain an exponential moving average of past stochastic gradients, but as Reddi et al. (2018) pointed out, unlike AdaGrad, the corresponding preconditioners do not guarantee that A_{k+1} ⪰ A_k, and the resulting per-dimension step-sizes do not go to zero. This can lead to large fluctuations in the effective step-size and prevent these methods from converging. To mitigate this problem, they proposed AMSGrad, which ensures A_{k+1} ⪰ A_k and the convergence of the iterates.
Consequently, our theoretical results focus on AdaGrad, AMSGrad and other adaptive gradient methods that ensure this monotonicity. However, we also consider RMSProp and Adam in our experimental evaluation. Although our theory holds for both the full-matrix and diagonal variants (where A_k is a diagonal matrix) of these methods, we use only the latter in experiments for scalability.

1. INTRODUCTION

Adaptive gradient methods such as AdaGrad (Duchi et al., 2011), RMSProp (Tieleman & Hinton, 2012), AdaDelta (Zeiler, 2012), Adam (Kingma & Ba, 2015), and AMSGrad (Reddi et al., 2018) are popular optimizers for training deep neural networks (Goodfellow et al., 2016). These methods scale well and exhibit good performance across problems, making them the default choice for many machine learning applications. Theoretically, these methods are usually studied in the non-smooth, online convex optimization setting (Duchi et al., 2011; Reddi et al., 2018), with recent extensions to the strongly-convex (Mukkamala & Hein, 2017; Wang et al., 2020; Xie et al., 2020) and non-convex settings (Li & Orabona, 2019; Ward et al., 2019; Zhou et al., 2018; Chen et al., 2019; Wu et al., 2019; Défossez et al., 2020; Staib et al., 2019). An online-to-batch reduction gives guarantees similar to stochastic gradient descent (SGD) in the offline setting (Cesa-Bianchi et al., 2004; Hazan & Kale, 2014; Levy et al., 2018). However, there are several discrepancies between the theory and application of these methods. Although the theory advocates for using decreasing step-sizes for Adam, AMSGrad and their variants (Kingma & Ba, 2015; Reddi et al., 2018), a constant step-size is typically used in practice (Paszke et al., 2019). Similarly, the standard analysis of these methods assumes a decreasing momentum parameter; in practice, however, the momentum is fixed. On the other hand, AdaGrad (Duchi et al., 2011) has been shown to be "universal" in that it attains the best known convergence rates in both the stochastic smooth and non-smooth settings (Levy et al., 2018), but its empirical performance is rather disappointing when training deep models (Kingma & Ba, 2015). Improving the empirical performance was indeed the main motivation behind Adam and the other methods (Tieleman & Hinton, 2012; Zeiler, 2012) that followed AdaGrad.
Although these methods have better empirical performance, they are not guaranteed to converge to the solution with a constant step-size and momentum parameter. Another inconsistency is that although the standard theoretical results are for non-smooth functions, these methods are also extensively used in the easier, smooth setting. More importantly, adaptive gradient methods are generally used to train highly expressive, large over-parameterized models (Zhang et al., 2017; Liang & Rakhlin, 2018) capable of interpolating the data. However, the standard theoretical analyses do not take advantage of these additional properties. On the other hand, a line of recent work (Schmidt & Le Roux, 2013; Jain et al., 2018; Ma et al., 2018; Liu & Belkin, 2020; Cevher & Vũ, 2019; Vaswani et al., 2019a;b; Wu et al., 2019; Loizou et al., 2020) focuses on the convergence of SGD in this interpolation setting. In the standard finite-sum case, interpolation implies that all the functions in the sum are minimized at the same solution. Under this additional assumption, these works show that SGD with a constant step-size converges to the minimizer at a faster rate for both convex and non-convex smooth functions. In this work, we aim to resolve some of the discrepancies in the theory and practice of adaptive gradient methods. To theoretically analyze these methods, we consider a simplistic setting: smooth, convex functions under interpolation. Using the intuition gained from the theory, we propose better techniques to adaptively set the step-size for these methods, dramatically improving their empirical performance when training over-parameterized models.

1.1. BACKGROUND AND CONTRIBUTIONS

Constant step-size. We focus on the theoretical convergence of two adaptive gradient methods: AdaGrad and AMSGrad. For smooth, convex functions, Levy et al. (2018) prove that AdaGrad with a constant step-size adapts to the smoothness and gradient noise, resulting in an O(1/T + ζ/√T) convergence rate, where T is the number of iterations and ζ² is a global bound on the variance of the stochastic gradients. This convergence rate matches that of SGD in the same setting (Moulines & Bach, 2011). In Section 3, we show that constant step-size AdaGrad also adapts to interpolation and prove an O(1/T + σ/√T) rate, where σ measures the extent to which interpolation is violated. In the over-parameterized setting, σ² can be much smaller than ζ² (Zhang & Zhou, 2019), implying faster convergence. When interpolation is exactly satisfied, σ² = 0 and we obtain an O(1/T) rate, while ζ² can still be large. In the online convex optimization framework, for smooth functions, we show that the regret of AdaGrad improves from O(√T) to O(1) when interpolation is satisfied, and that AdaGrad retains its O(√T)-regret guarantee in the general setting (Appendix C.2). Assuming its preconditioner remains bounded, we show that AMSGrad with a constant step-size and constant momentum parameter also converges at the rate O(1/T) under interpolation (Section 4). However, unlike AdaGrad, it requires specific step-sizes that depend on the problem's smoothness. More generally, constant step-size AMSGrad converges to a neighbourhood of the solution, attaining an O(1/T + σ²) rate, which matches the rate of constant step-size SGD in the same setting (Schmidt & Le Roux, 2013; Vaswani et al., 2019a). When training over-parameterized models, this result provides some justification for the faster (O(1/T) vs. O(1/√T)) convergence of the AMSGrad variant typically used in practice.

Adaptive step-size. Although AdaGrad converges at the same asymptotic rate for any step-size (up to constants), it is unclear how to choose this step-size without manually trying different values. Similarly, AMSGrad is sensitive to the step-size, converging only for a specific range in both theory and practice. In Section 5, we experimentally show that even for simple, convex problems, the step-size has a large impact on the empirical performance of AdaGrad and AMSGrad. To overcome this limitation, we use recent methods (Vaswani et al., 2019a; Loizou et al., 2020) that automatically set the step-size for SGD. These works use stochastic variants of the classical Armijo line-search (Armijo, 1966) or the Polyak step-size (Polyak, 1963) in the interpolation setting. We combine these techniques with adaptive gradient methods and show that a variant of stochastic line-search (SLS) enables AdaGrad to adapt to the smoothness of the underlying function, resulting in faster empirical convergence while retaining its favourable convergence properties (Section 3). Similarly, AMSGrad with variants of SLS and SPS can match the convergence rate of its constant step-size counterpart, but without knowledge of the underlying smoothness properties (Section 4).

Experimental results. Finally, in Section 5, we benchmark our results against SGD variants with SLS (Vaswani et al., 2019b), SPS (Loizou et al., 2020), tuned Adam and its recently proposed variants (Luo et al., 2019; Liu et al., 2020). We demonstrate that the proposed techniques for setting the step-size improve the empirical performance of adaptive gradient methods. These improvements are consistent across tasks, ranging from binary classification with a kernel mapping to multi-class classification using standard deep neural network architectures.

2. PROBLEM SETUP

We consider the unconstrained minimization of an objective f : R^d → R with a finite-sum structure, f(w) = (1/n) Σ_{i=1}^n f_i(w). In supervised learning, n represents the number of training examples, and f_i is the loss function on training example i. Although we focus on the finite-sum setting, our results can be easily generalized to the online optimization setting. The objective of our analysis is to better understand the effect of the step-size and line-searches when interpolation is (almost) satisfied. This is complicated by the fact that adaptive methods are still poorly understood; state-of-the-art analyses do not show an improvement over gradient descent in the worst case. To focus on the effect of step-sizes, we make the simplifying assumptions described in this section. We assume f and each f_i are differentiable, convex, and lower-bounded by f* and f_i*, respectively. Furthermore, we assume that each function f_i in the finite-sum is L_i-smooth, implying that f is L_max-smooth, where L_max = max_i L_i. We also make the standard assumption that the iterates remain bounded in a ball of radius D around a global minimizer, ‖w_k - w*‖ ≤ D for all w_k (Ahn et al., 2020). We remark that the bounded-iterates assumption simplifies the analysis but is not essential; similar to Reddi et al. (2018), Duchi et al. (2011) and Levy et al. (2018), our theoretical results can be extended to include a projection step. We include the formal definitions of these properties (Nemirovski et al., 2009) in Appendix A. The interpolation assumption means that the gradient of each f_i in the finite-sum converges to zero at an optimum: if the overall objective f is minimized at w*, so that ∇f(w*) = 0, then for all f_i we have ∇f_i(w*) = 0.
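As a concrete illustration of the setup, the following sketch (our own toy over-parameterized least-squares problem in NumPy; the problem sizes are illustrative, not from the paper) shows that when d > n, a minimizer of the finite-sum objective also makes every individual gradient vanish, i.e., interpolation holds:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 10  # d > n: over-parameterized least-squares, so interpolation holds
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)

def f_i(w, i):       # loss on training example i
    return 0.5 * (X[i] @ w - y[i]) ** 2

def grad_f_i(w, i):  # its gradient
    return (X[i] @ w - y[i]) * X[i]

def f(w):            # finite-sum objective f(w) = (1/n) sum_i f_i(w)
    return np.mean([f_i(w, i) for i in range(n)])

# Minimum-norm interpolating solution: X w* = y exactly, since d > n
w_star = np.linalg.lstsq(X, y, rcond=None)[0]

# Every individual gradient vanishes at w*, so sigma^2 = 0
assert f(w_star) < 1e-18
assert all(np.linalg.norm(grad_f_i(w_star, i)) < 1e-8 for i in range(n))
```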
The interpolation condition can be exactly satisfied for many over-parameterized machine learning models such as non-parametric kernel regression without regularization (Belkin et al., 2019; Liang & Rakhlin, 2018) and over-parameterized deep neural networks (Zhang et al., 2017). We measure the extent to which interpolation is violated by the disagreement between the minimum overall function value f* and the minimum value of each individual function f_i*: σ² := E_i[f* - f_i*] ∈ [0, ∞) (Loizou et al., 2020). The minimizer of f need not be unique for σ² to be uniquely defined, as it only depends on the minimum function values. Interpolation is said to be exactly satisfied if σ² = 0, and we also study the setting where σ² > 0.

For a preconditioner matrix A_k and a constant momentum parameter β ∈ [0, 1), the update for a generic adaptive gradient method at iteration k can be expressed as:

w_{k+1} = w_k - η_k A_k^{-1} m_k;   m_k = β m_{k-1} + (1 - β) ∇f_{i_k}(w_k).   (1)

Here, ∇f_{i_k}(w_k) is the stochastic gradient of a randomly chosen function f_{i_k}, and η_k is the step-size. Adaptive gradient methods typically differ in how their preconditioners are constructed and whether or not they include the momentum term β m_{k-1} (see Table 1 for a list of common methods).

Table 1: Common adaptive gradient methods (with ∇_k := ∇f_{i_k}(w_k)).

Method   | G_k                                                  | A_k       | β
AdaGrad  | G_{k-1} + diag(∇_k ∇_k^T)                            | G_k^{1/2} | 0
RMSProp  | β_2 G_{k-1} + (1 - β_2) diag(∇_k ∇_k^T)              | G_k^{1/2} | 0
Adam     | (β_2 G_{k-1} + (1 - β_2) diag(∇_k ∇_k^T))/(1 - β_2^k) | G_k^{1/2} | > 0

The diagonal variants perform a per-dimension scaling of the gradient and avoid computing the full matrix inverse, so their per-iteration cost is the same as SGD, although with an additional O(d) memory. For AMSGrad, we assume that the corresponding preconditioners are well-behaved in the sense that their eigenvalues are bounded in an interval [a_min, a_max]. This is a common assumption made in the analysis of adaptive methods.
Moreover, for diagonal preconditioners, such a boundedness property is easy to verify, and it is also inexpensive to maintain the desired range by projection. Our main theoretical results for AdaGrad (Section 3) and AMSGrad (Section 4) are summarized in Table 2.

Table 2: Results for smooth, convex functions.

Method               | Step-size                 | Adapts to smoothness | Rate           | Reference
AdaGrad              | Constant                  | No                   | O(1/T + σ/√T)  | Theorem 1
AdaGrad              | Conservative Lipschitz LS | Yes                  | O(1/T + σ/√T)  | Theorem 2
AMSGrad              | Constant                  | No                   | O(1/T + σ²)    | Theorem 3
AMSGrad w/o momentum | Armijo SLS                | Yes                  | O(1/T + σ²)    | Theorem 4
AMSGrad              | Conservative Armijo SPS   | Yes                  | O(1/T + σ²)    | Theorem 5
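To make the generic update in Eq. (1) concrete, here is a minimal NumPy sketch of the diagonal variants (our own simplified implementation; the hyperparameter defaults are illustrative, and the AMSGrad branch omits bias correction, as in our analysis):

```python
import numpy as np

def adaptive_step(w, grad, state, eta=0.1, beta=0.9, beta2=0.99,
                  method="amsgrad", eps=1e-8):
    """One step of the generic update w <- w - eta * A^{-1} m, with diagonal A_k."""
    g2 = grad * grad
    if method == "adagrad":         # G_k = G_{k-1} + diag(g g^T), beta = 0
        state["G"] += g2
        m = grad
    else:                           # AMSGrad: EMA plus running max => A_k never shrinks
        state["v"] = beta2 * state["v"] + (1 - beta2) * g2
        state["G"] = np.maximum(state["G"], state["v"])
        state["m"] = beta * state["m"] + (1 - beta) * grad
        m = state["m"]
    A = np.sqrt(state["G"]) + eps   # A_k = G_k^{1/2} (diagonal)
    return w - eta * m / A

# Usage on f(w) = 0.5 ||w||^2, whose gradient at w is w
w, state = np.ones(2), {"G": np.zeros(2)}
for _ in range(50):
    w = adaptive_step(w, w, state, method="adagrad")
assert np.linalg.norm(w) < 1.0      # AdaGrad makes progress on the quadratic
```

The running max in the AMSGrad branch is what enforces the monotonicity A_k ⪰ A_{k-1} discussed above.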

3. ADAGRAD

For smooth, convex objectives, Levy et al. (2018) showed that AdaGrad converges at a rate O(1/T + ζ/√T), where ζ² = sup_w E_i[‖∇f(w) - ∇f_i(w)‖²] is a uniform bound on the variance of the stochastic gradients. In the over-parameterized setting, we show that AdaGrad achieves the O(1/T) rate when interpolation is exactly satisfied and a slower convergence to the solution if interpolation is violated. The proofs for this section are in Appendix C.

Theorem 1 (Constant step-size AdaGrad). Assuming (i) convexity and (ii) L_max-smoothness of each f_i, and (iii) bounded iterates, AdaGrad with a constant step-size η and uniform averaging, w̄_T = (1/T) Σ_{k=1}^T w_k, converges at a rate

E[f(w̄_T) - f*] ≤ α/T + √α σ/√T,  where α = D²/(2η) + 2η² d L_max.

When interpolation is exactly satisfied, a similar proof technique can be used to show that AdaGrad incurs only O(1) regret in the online convex optimization setting (Theorem 6 in Appendix C.2). The above theorem shows that AdaGrad is robust to the violation of interpolation and converges to the minimizer at the desired rate for any reasonable step-size. Although this is a favourable property, the best constant step-size depends on the problem, and as we demonstrate experimentally in Section 5, the performance of AdaGrad depends on correctly tuning this step-size. To overcome this limitation, we use a conservative Lipschitz line-search that sets the step-size on the fly, improving the empirical performance of AdaGrad while retaining its favourable convergence guarantees. At each iteration, this line-search selects a step-size η_k that satisfies

f_{i_k}(w_k - η_k ∇f_{i_k}(w_k)) ≤ f_{i_k}(w_k) - c η_k ‖∇f_{i_k}(w_k)‖²,  and  η_k ≤ η_{k-1}.   (2)

The resulting step-size is then used in the standard AdaGrad update in Eq. (1). To find an acceptable step, our experiments use a backtracking line-search, described in Appendix F.
For simplicity, the theoretical results assume access to the largest step-size that satisfies the above condition. Here, c is a hyperparameter determined theoretically and typically set to 1/2 in our experiments. The "conservative" part of the line-search is the non-increasing constraint on the step-sizes, which is essential for convergence to the minimizer when interpolation is violated. We refer to it as the Lipschitz line-search since it is only used to estimate the local Lipschitz constant. Unlike the classical Armijo line-search for preconditioned gradient descent, the line-search in Eq. (2) is in the gradient direction, even though the update is in the preconditioned direction. The resulting step-size is guaranteed to lie in the range [2(1 - c)/L_max, η_{k-1}] (Vaswani et al., 2019b) and allows us to prove the following theorem.

Theorem 2. Under the same assumptions as Theorem 1, AdaGrad with a conservative Lipschitz line-search with c = 1/2, a step-size upper bound η_max and uniform averaging converges at a rate

E[f(w̄_T) - f*] ≤ α/T + √α σ/√T,  where α = (D²/2) max{1/η_max, L_max} + 2η_max² d L_max.

Intuitively, the Lipschitz line-search enables AdaGrad to take larger steps at iterates where the underlying function is smoother. It retains the favourable convergence guarantees of constant step-size AdaGrad, while improving its empirical performance (Section 5). Moreover, if interpolation is exactly satisfied, we can obtain an O(1/T) convergence without the conservative constraint η_k ≤ η_{k-1} on the step-sizes (Appendix C.3).
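A minimal backtracking implementation of the conservative Lipschitz line-search in Eq. (2) can be sketched as follows (our own code; the shrinking factor and the floor on η are implementation choices, not from the analysis):

```python
import numpy as np

def lipschitz_line_search(f_ik, grad_ik, w, eta_prev, c=0.5, shrink=0.5):
    """Backtrack until f_ik(w - eta*g) <= f_ik(w) - c*eta*||g||^2.
    Starting from eta_prev keeps the step-sizes non-increasing
    (the 'conservative' constraint eta_k <= eta_{k-1})."""
    g = grad_ik(w)
    gnorm2 = g @ g
    fw = f_ik(w)
    eta = eta_prev
    while f_ik(w - eta * g) > fw - c * eta * gnorm2 and eta > 1e-12:
        eta *= shrink
    return eta

# Usage on an L-smooth quadratic: the accepted step is 2(1-c)/L = 1/L here
L = 4.0
f = lambda v: 0.5 * L * (v @ v)
grad = lambda v: L * v
w = np.array([1.0, -2.0])
eta = lipschitz_line_search(f, grad, w, eta_prev=1.0)
assert abs(eta - 0.25) < 1e-12
```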

4. AMSGRAD AND NON-DECREASING PRECONDITIONERS

In this section, we consider AMSGrad and, more generally, methods with non-decreasing preconditioners satisfying A_k ⪰ A_{k-1}. As our focus is on the behaviour of the algorithm with respect to the overall step-size, we make the simplifying assumption that the effect of the preconditioning is bounded, meaning that the eigenvalues of A_k lie in the [a_min, a_max] range. This is a common assumption made in the analyses of adaptive methods (Reddi et al., 2018; Alacaoglu et al., 2020) that prove worst-case convergence rates matching those of SGD. For our theoretical results, we consider the variant of AMSGrad without bias correction, as its effect is minimal after the first few iterations. The proofs for this section are in Appendix D and Appendix E. The original analysis of AMSGrad (Reddi et al., 2018) uses a decreasing step-size and a decreasing momentum parameter, and shows an O(1/√T) convergence for AMSGrad in both the smooth and non-smooth convex settings. Recently, Alacaoglu et al. (2020) showed that this analysis is loose and that AMSGrad does not require a decreasing momentum parameter to obtain the O(1/√T) rate. However, in practice, AMSGrad is typically used with both a constant step-size and a constant momentum parameter. Next, we present the convergence result for this commonly-used variant of AMSGrad.

Theorem 3. Under the same assumptions as Theorem 1, and assuming (iv) non-decreasing preconditioners and (v) eigenvalues bounded in the [a_min, a_max] interval with κ = a_max/a_min, AMSGrad with β ∈ [0, 1), constant step-size η = ((1 - β)/(1 + β)) · (a_min/(2L_max)) and uniform averaging converges at a rate

E[f(w̄_T) - f*] ≤ ((1 + β)/(1 - β))² · (2L_max D² d κ)/T + σ².

When σ = 0, we obtain an O(1/T) convergence to the minimizer. However, when interpolation is only approximately satisfied, we obtain convergence to a neighbourhood whose size depends on σ². We observe that the noise σ² is not amplified by the non-decreasing momentum (or step-size).
A similar distinction between the convergence of constant step-size Adam (or AMSGrad) vs. AdaGrad has also been recently discussed in the non-convex setting (Défossez et al., 2020). Unfortunately, the final bound is minimized by setting β = 0, and our theoretical analysis does not show an advantage of using momentum. Note that this is a common drawback in the analyses of heavy-ball momentum for non-quadratic functions in both the stochastic and deterministic settings (Ghadimi et al., 2015; Reddi et al., 2018; Alacaoglu et al., 2020; Sebbouh et al., 2020). Since AMSGrad is typically used for optimizing over-parameterized models, the violation σ² is small, even when interpolation is not exactly satisfied. Another reason that constant step-size AMSGrad is practically useful is the use of large batch-sizes, which result in a smaller effective neighbourhood. To get some intuition about the effect of the batch-size, note that if we use a batch-size of b, the resulting neighbourhood depends on σ_b² := E_{B: |B|=b}[f_B(w*) - f_B(w_B*)], where w_B* is the minimizer of f_B, the average loss over a batch B of training examples. By convexity, σ_b² ≤ E[‖w* - w_B*‖ ‖∇f_B(w*)‖]. If we assume that the distance ‖w* - w_B*‖ is bounded, then σ_b² ∝ E‖∇f_B(w*)‖. Since the examples in each batch are sampled without replacement, using the bounds in Lohr (2009), σ_b² ∝ ((n - b)/(nb)) E_i‖∇f_i(w*)‖, showing that the effective neighbourhood shrinks as the batch-size becomes larger, becoming zero for the full-batch variant. With over-parameterization and large batch-sizes, the effective neighbourhood is small enough for machine learning tasks that do not require exact convergence to the solution. The constant step-size required for the above result depends on L_max, which is typically unknown. Furthermore, using a global bound on L_max usually results in slower convergence since the local Lipschitz constant can vary considerably during the optimization.
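The shrinking-neighbourhood argument can be checked numerically. The sketch below (our own construction on a small least-squares problem where interpolation fails; batches are sampled without replacement, matching the bound above) estimates E‖∇f_B(w*)‖ for increasing batch-sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 5                       # n > d: interpolation is violated
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)
w_star = np.linalg.lstsq(X, y, rcond=None)[0]  # minimizer of the full objective
resid = X @ w_star - y

def grad_batch(B):                 # gradient of f_B at w* (mean over the batch)
    return (X[B].T @ resid[B]) / len(B)

def avg_grad_norm(b, trials=2000):
    norms = [np.linalg.norm(grad_batch(rng.choice(n, size=b, replace=False)))
             for _ in range(trials)]
    return np.mean(norms)

g1, g10, gn = avg_grad_norm(1), avg_grad_norm(10), avg_grad_norm(n, trials=1)
assert g1 > g10 > gn and gn < 1e-8  # the neighbourhood shrinks with batch size
```

The full-batch gradient at w* is (numerically) zero, while single-example gradients are not: the effective neighbourhood vanishes as b → n.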
To overcome these issues, we use a stochastic variant of the Armijo line-search. Unlike the Lipschitz line-search, whose sole purpose is to estimate the Lipschitz constant, the Armijo line-search selects a suitable step-size in the preconditioned gradient direction, and as we show in Section 5, it results in better empirical performance. Similar to the constant step-size, when interpolation is violated, we only obtain convergence to a neighbourhood of the solution. The stochastic Armijo line-search returns the largest step-size η_k satisfying the following conditions at iteration k:

f_{i_k}(w_k - η_k A_k^{-1} ∇f_{i_k}(w_k)) ≤ f_{i_k}(w_k) - c η_k ‖∇f_{i_k}(w_k)‖²_{A_k^{-1}},  and  η_k ≤ η_max.   (3)

The step-size is artificially upper-bounded by η_max (typically chosen to be a large value). The line-search guarantees descent on the current function f_{i_k} and that η_k lies in the [2 a_min (1 - c)/L_max, η_max] range. In the next theorem, we first consider the variant of AMSGrad without momentum (β = 0) and show that using the Armijo line-search retains the O(1/T) convergence rate without the need to know the Lipschitz constant.

Theorem 4. Under the same assumptions as Theorem 1, AMSGrad with zero momentum, Armijo line-search with c = 3/4, a step-size upper bound η_max and uniform averaging converges at a rate

E[f(w̄_T) - f*] ≤ (3 D² d a_max)/(2T) + 3 η_max σ² max{1/η_max, 2L_max/a_min}.

Comparing this rate with that of the constant step-size (Theorem 3), we observe that the Armijo line-search results in a worse constant in the convergence rate and a larger neighbourhood. These dependencies can be improved by considering a conservative version of the Armijo line-search. However, we experimentally show that the proposed line-search drastically improves the empirical performance of AMSGrad. We show that a similar bound also holds for AdaGrad (Theorem 7 in Appendix C).
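For a diagonal preconditioner, the search in Eq. (3) can be implemented by backtracking, as in this sketch (our own code; the shrinking factor and floor are implementation choices):

```python
import numpy as np

def armijo_sls(f_ik, g, A_diag, w, eta_max=10.0, c=0.75, shrink=0.8):
    """Backtrack from eta_max until
    f_ik(w - eta * A^{-1} g) <= f_ik(w) - c * eta * ||g||^2_{A^{-1}},
    where A is the diagonal preconditioner A_diag."""
    direction = g / A_diag             # A_k^{-1} grad
    gnorm2_pre = g @ direction         # ||grad||^2 in the A_k^{-1} norm
    fw, eta = f_ik(w), eta_max
    while f_ik(w - eta * direction) > fw - c * eta * gnorm2_pre and eta > 1e-12:
        eta *= shrink
    return eta

# Usage on f(w) = 0.5 ||w||^2 with A = I: steps up to 2(1-c)/L = 0.5 are accepted
w = np.ones(3)
f = lambda v: 0.5 * (v @ v)
eta = armijo_sls(f, w.copy(), np.ones(3), w)
assert 0.3 < eta <= 0.5
```

Backtracking returns a step within a factor of `shrink` of the largest acceptable one, which is why the test checks a range rather than the exact boundary 0.5.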
AdaGrad with an Armijo line-search converges to a neighbourhood in the absence of interpolation (unlike the results in Section 3). Moreover, the above bound depends on a_min, which can be O(ε) in the worst case, resulting in an unsatisfactory worst-case rate of O(1/(εT)) even in the interpolation setting. However, like AMSGrad, AdaGrad with the Armijo line-search has excellent empirical performance, implying the need for a different theoretical assumption in the future. Before considering techniques to set the step-size for AMSGrad with momentum, we present the details of the stochastic Polyak step-size (SPS) (Loizou et al., 2020; Berrada et al., 2019) and Armijo SPS, our modification to the adaptive setting. These variants set the step-size as:

SPS: η_k = min{ (f_{i_k}(w_k) - f_{i_k}*) / (c ‖∇f_{i_k}(w_k)‖²), η_max },
Armijo SPS: η_k = min{ (f_{i_k}(w_k) - f_{i_k}*) / (c ‖∇f_{i_k}(w_k)‖²_{A_k^{-1}}), η_max }.

Here, f_{i_k}* is the minimum value of the function f_{i_k}. The advantage of SPS over a line-search is that it does not require a potentially expensive backtracking procedure to set the step-size. Moreover, it can be shown that this step-size is always larger than the one returned by the line-search, which can lead to faster convergence. However, SPS requires knowledge of f_i* for each function in the finite-sum. This value is difficult to obtain for general functions but is readily available in the interpolation setting for many machine learning applications. Common loss functions are lower-bounded by zero, and the interpolation setting ensures that these lower bounds are tight. Consequently, using SPS with f_i* = 0 has been shown to yield good performance for over-parameterized problems (Loizou et al., 2020; Berrada et al., 2019). In Appendix D, we show that the Armijo line-search used for the previous results can be replaced by Armijo SPS, resulting in similar convergence rates.
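Both step-sizes are closed-form, requiring no backtracking. A minimal sketch (our own; for interpolating, non-negative losses one would pass f_star_i = 0):

```python
import numpy as np

def sps(f_val, f_star_i, g, c=0.5, eta_max=10.0):
    """Stochastic Polyak step-size: (f_ik(w) - f_ik*) / (c ||g||^2), capped."""
    return min((f_val - f_star_i) / (c * (g @ g)), eta_max)

def armijo_sps(f_val, f_star_i, g, A_diag, c=0.5, eta_max=10.0):
    """Armijo SPS: same numerator, but the preconditioned norm ||g||^2_{A^{-1}}."""
    return min((f_val - f_star_i) / (c * (g @ (g / A_diag))), eta_max)

# On f(w) = 0.5 ||w||^2 with f* = 0 and c = 1/2, SPS gives the exact step to the
# minimizer (eta = 1); a diagonal preconditioner A = 2I doubles it
w = np.ones(2)
assert sps(0.5 * (w @ w), 0.0, w) == 1.0
assert armijo_sps(0.5 * (w @ w), 0.0, w, 2.0 * np.ones(2)) == 2.0
```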
For AMSGrad with momentum, we propose to use a conservative variant of Armijo SPS that sets η_max = η_{k-1} at iteration k, ensuring that η_k ≤ η_{k-1}. This is because using a potentially increasing step-size sequence along with momentum can make the optimization unstable and result in divergence. Using this step-size, we prove the following result.

Theorem 5. Under the same assumptions as Theorem 1, and assuming (iv) non-decreasing preconditioners and (v) eigenvalues bounded in the [a_min, a_max] interval with κ = a_max/a_min, AMSGrad with β ∈ [0, 1), conservative Armijo SPS with c = (1 + β)/(1 - β) and uniform averaging converges at a rate

E[f(w̄_T) - f*] ≤ ((1 + β)/(1 - β))² · (2L_max D² d κ)/T + σ².

The above result exactly matches the convergence rate in Theorem 3 but does not require knowledge of the smoothness constant to set the step-size. Moreover, the conservative step-size enables convergence without requiring an artificial upper bound η_max as in Theorem 8. We note that a similar convergence rate can be obtained when using a conservative variant of Armijo SLS (Appendix E.2), although our theoretical techniques only allow for a restricted range of β. When A_k = I_d, the AMSGrad update is equivalent to the update for SGD with heavy-ball momentum (Sebbouh et al., 2020). By setting A_k = I_d in the above result, we recover an O(1/T + σ²) rate for SGD (using SPS to set the step-size) with heavy-ball momentum. In the smooth, convex setting, our rate matches that of Sebbouh et al. (2020); however, unlike their result, we do not require knowledge of the Lipschitz constant. This result also provides theoretical justification for the heuristic used to incorporate heavy-ball momentum for SLS in Vaswani et al. (2019b). For a general preconditioner, the AMSGrad update in Eq. (1) is not equivalent to heavy-ball momentum.
With a constant momentum parameter γ ∈ [0, 1), the general heavy-ball update (Loizou & Richtárik, 2017) is given as w_{k+1} = w_k - α_k A_k^{-1} ∇f_{i_k}(w_k) + γ(w_k - w_{k-1}) (refer to Appendix E.1 for the relation between the two updates). Unlike this update, AMSGrad also preconditions the momentum direction (w_k - w_{k-1}). If we view the zero-momentum variant of adaptive gradient methods as preconditioned gradient descent, the above update is a more natural way to incorporate momentum. We explore this alternate method and prove the same O(1/T + σ²) convergence rate for the constant step-size, conservative Armijo SPS and Armijo SLS techniques in Appendix E.3. In the next section, we use the above techniques for training large over-parameterized deep networks.
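The difference between the two momentum schemes can be seen side by side in this sketch (our own illustrative code with a diagonal preconditioner; the step-size and momentum values are arbitrary):

```python
import numpy as np

def amsgrad_momentum_step(w, g, m, A_diag, eta, beta):
    """AMSGrad-style: momentum is accumulated first, then the whole
    direction m is preconditioned by A^{-1}."""
    m = beta * m + (1 - beta) * g
    return w - eta * m / A_diag, m

def heavy_ball_step(w, w_prev, g, A_diag, alpha, gamma):
    """Heavy-ball: only the fresh gradient is preconditioned; the momentum
    term gamma * (w - w_prev) is added un-preconditioned."""
    return w - alpha * g / A_diag + gamma * (w - w_prev)

# Both variants make progress on f(w) = 0.5 ||w||^2 (gradient at w is w)
A = np.ones(3)
w, m = np.ones(3), np.zeros(3)
for _ in range(30):
    w, m = amsgrad_momentum_step(w, w, m, A, eta=0.1, beta=0.9)
assert np.linalg.norm(w) < 1.0
```

With A_k = I the two updates coincide up to a reparameterization of the step-size, which is the A_k = I_d equivalence used above.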

5. EXPERIMENTS

Synthetic experiment:

We first present an experiment to show that AdaGrad and AMSGrad with a constant step-size are not robust even for simple, convex problems. We use their PyTorch implementations (Paszke et al., 2019) on a binary classification task with logistic regression. Following the protocol of Meng et al. (2020), we generate linearly-separable datasets (ensuring interpolation is satisfied) with d = 20 features and varying margins. For AdaGrad and AMSGrad with a batch-size of 100, we show the training loss for a grid of step-sizes in the [10^-3, 10^3] range and also plot their default (in PyTorch) variants. For AdaGrad, we compare against the proposed Lipschitz line-search and Armijo SLS variants. As suggested by the theory, for each of these variants, we set c = 1/2. For AMSGrad, we compare against the variant employing the Armijo SLS with c = 1/2, and use the default (in PyTorch) momentum parameter of β = 0.9. In Fig. 1, we observe a large variance across step-sizes and poor performance of the default step-size. The best-performing variant of AdaGrad/AMSGrad has a step-size of order 10^2. The line-search variants have good performance across margins, often better than the best-performing constant step-size.

Real experiments: Following the protocol in (Luo et al., 2019; Vaswani et al., 2019b; Loizou et al., 2020), we consider training standard neural network architectures for multi-class classification on CIFAR-10, CIFAR-100 and variants of the ImageNet datasets. For each of these experiments, we use a batch-size of 128 and compare against Adam with the best constant step-size found by grid-search. We also include recent improved variants of Adam: RAdam (Liu et al., 2020) and AdaBound (Luo et al., 2019). To see the effect of preconditioning, we compare against SGD with SLS (Vaswani et al., 2019a) and SPS (Loizou et al., 2020).
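A sketch of how such a margin-separable dataset can be generated (our own simplified construction; Meng et al. (2020) may differ in details such as the rejection scheme and dataset size, which are our assumptions here). Points too close to a ground-truth hyperplane are pushed out along its normal, so every example has margin at least `margin` and the logistic loss can be driven to zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, margin = 1000, 20, 0.5

w_true = rng.standard_normal(d)
w_true /= np.linalg.norm(w_true)         # unit-norm ground-truth separator
X = rng.standard_normal((n, d))
proj = X @ w_true
# Enforce the margin along w_true, then shift X so that X @ w_true == z
z = np.where(np.abs(proj) < margin, np.sign(proj) * margin, proj)
X = X + np.outer(z - proj, w_true)
y = np.sign(z)                            # labels in {-1, +1}, margin >= 0.5

def logistic_loss(w):
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

# Separable data: scaling the separator drives the loss to zero (interpolation)
assert logistic_loss(100.0 * w_true) < 1e-6
```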
We find that SGD with SLS is more stable and has consistently better test performance than SPS, and hence we only show results for SLS. We also compared against tuned constant step-size SGD and, similar to Vaswani et al. (2019a), we observe that it is consistently outperformed by SGD with SLS. For the proposed methods, we consider the combinations with theoretical guarantees in the convex setting, specifically AdaGrad and AMSGrad with the Armijo SLS. For AdaGrad, we only show Armijo SLS since it consistently outperforms the Lipschitz line-search. For all variants with Armijo SLS, we use c = 0.5 for the convex experiments (suggested by Theorem 4 and Vaswani et al. (2019a)). Since we do not have a theoretical analysis for non-convex problems, we follow the protocol in Vaswani et al. (2019a) and set c = 0.1 for all the non-convex experiments. Throughout, we set β = 0.9 for AMSGrad. We also compare to the AMSGrad variant with heavy-ball (HB) momentum (with γ = 0.25 found by grid-search). We refer to Appendix F for a detailed discussion of the practical considerations and pseudocode for the SLS/SPS variants. We show a subset of results for CIFAR-10, CIFAR-100 and Tiny ImageNet and defer the rest to Appendix G. From Fig. 2, we make the following observations: (i) in terms of generalization, AdaGrad and AMSGrad with Armijo SLS consistently have the best performance, while SGD with SLS is often competitive; (ii) the AdaGrad and AMSGrad variants not only converge faster than Adam and RAdam but also attain considerably better test performance, while AdaBound has comparable convergence in terms of training loss but does not generalize as well; (iii) AMSGrad momentum is consistently better than the heavy-ball (HB) variant. Moreover, we observed that HB momentum was quite sensitive to the setting of γ, whereas AMSGrad is robust to β.
In Appendix G, we include ablation results for AMSGrad with Armijo SLS but without momentum, and conclude that momentum does indeed improve the performance. We also plot the wall-clock time for the SLS variants and verify that the performance gains justify the increase in wall-clock time per epoch. In the appendix, we show the variation of the step-size across epochs, observing a warm-up phase where the step-size increases, followed by a constant or decreasing step-size (Goyal et al., 2017). In Appendix G, we also consider binary classification with RBF kernels for datasets from LIBSVM (Chang & Lin, 2011) and study the effect of over-parameterization for deep matrix factorization (Rolinek & Martius, 2018; Vaswani et al., 2019b). We show that the same trends hold across different datasets, deep models, deep matrix factorization, and binary classification using kernels. Our results indicate that simply setting the correct step-size on the fly can lead to substantial empirical gains, often larger than those obtained by designing a different preconditioner. Furthermore, we see that with an appropriate step-size adaptation, adaptive gradient methods can generalize better than SGD. By disentangling the effect of the step-size from the preconditioner, our results show that AdaGrad has good empirical performance, contradicting common knowledge. Moreover, our techniques are orthogonal to designing better preconditioners and can be used with other adaptive gradient or even second-order methods.

6. DISCUSSION

When training over-parameterized models in the interpolation setting, we showed that for smooth, convex functions, constant step-size variants of both AdaGrad and AMSGrad are guaranteed to converge to the minimizer at $O(1/T)$ rates. We proposed to use stochastic line-search techniques to help these methods adapt to the function's local smoothness, alleviating the need to tune their step-size and resulting in consistent empirical improvements across tasks. Although adaptive gradient methods outperform SGD in practice, their convergence rates are worse than constant step-size SGD, and we hope to address this discrepancy in the future.

A SETUP AND ASSUMPTIONS

We restate the main notation in Table 3 and the main assumptions required for our theoretical results. We assume our objective $f : \mathbb{R}^d \to \mathbb{R}$ has a finite-sum structure, $f(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w)$, and analyze the following update, with $i_k$ selected uniformly at random,
$$w_{k+1} = w_k - \eta_k A_k^{-1} m_k; \qquad m_k = \beta m_{k-1} + (1-\beta) \nabla f_{i_k}(w_k), \tag{Update rule}$$
where $\eta_k$ is either a pre-specified constant or selected on the fly. We consider AdaGrad and AMSGrad and use the fact that their preconditioners are non-decreasing, i.e. $A_k \succeq A_{k-1}$. For AdaGrad, $\beta = 0$. For AMSGrad, we further assume that the preconditioners remain bounded, with eigenvalues in the range $[a_{\min}, a_{\max}]$,
$$a_{\min} I \preceq A_k \preceq a_{\max} I. \tag{Bounded preconditioner}$$
For all algorithms, we assume that the iterates do not diverge and remain in a ball of radius $D$ around the minimizer, as is standard in the literature on online learning (Duchi et al., 2011; Levy et al., 2018) and adaptive gradient methods (Reddi et al., 2018),
$$\|w_k - w^*\| \le D. \tag{Bounded iterates}$$
Our main assumptions are that each individual function $f_i$ is convex, differentiable, has a finite minimum $f_i^*$, and is $L_i$-smooth, meaning that for all $v$ and $w$,
$$f_i(v) \ge f_i(w) + \langle \nabla f_i(w), v - w \rangle, \tag{Individual Convexity}$$
$$f_i(v) \le f_i(w) + \langle \nabla f_i(w), v - w \rangle + \frac{L_i}{2} \|v - w\|^2, \tag{Individual Smoothness}$$
which also imply that $f$ is convex and $L_{\max}$-smooth, where $L_{\max}$ is the maximum smoothness constant of the individual functions. A consequence of smoothness is the following bound on the norm of the stochastic gradients: $\|\nabla f_i(w)\|^2 \le 2 L_{\max} (f_i(w) - f_i^*)$. To characterize interpolation, we define the expected difference between the value of the individual functions at the minimizer $w^*$ of $f$ and their minima $f_i^*$,
$$\sigma^2 = \mathbb{E}_i [f_i(w^*) - f_i^*] < \infty. \tag{Noise}$$
When interpolation is exactly satisfied, every data point can be fit exactly, so that $f_i(w^*) = f_i^* = 0$ for all $i$, and hence $\sigma^2 = 0$.
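To make the abstractions concrete, the update rule with the two preconditioners can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation: it uses diagonal preconditioners, and the function names and the small stabilization constant `eps` are our own choices.

```python
import numpy as np

def adagrad_precond(state, g):
    # AdaGrad: A_k = diag(sum of squared gradients)^(1/2); non-decreasing by construction.
    state["G"] += g ** 2
    return np.sqrt(state["G"]) + state["eps"]

def amsgrad_precond(state, g, beta2=0.999):
    # AMSGrad: EMA of squared gradients with an element-wise max, so that A_k >= A_{k-1}.
    state["v"] = beta2 * state["v"] + (1 - beta2) * g ** 2
    state["v_hat"] = np.maximum(state["v_hat"], state["v"])
    return np.sqrt(state["v_hat"]) + state["eps"]

def step(w, g, state, eta=0.1, beta=0.9, method="amsgrad"):
    """One iteration of the update rule w_{k+1} = w_k - eta_k A_k^{-1} m_k."""
    state["m"] = beta * state["m"] + (1 - beta) * g  # beta = 0 recovers AdaGrad
    A = adagrad_precond(state, g) if method == "adagrad" else amsgrad_precond(state, g)
    return w - eta * state["m"] / A

# Toy usage: minimize f(w) = 0.5 ||w||^2, whose gradient is w itself.
state = {"m": 0.0, "G": 0.0, "v": 0.0, "v_hat": 0.0, "eps": 1e-8}
w = np.array([1.0, -2.0])
for _ in range(200):
    w = step(w, w, state, method="amsgrad")
```

The element-wise `maximum` in `amsgrad_precond` is exactly the modification of Reddi et al. (2018) discussed above: it restores the monotonicity $A_{k+1} \succeq A_k$ that RMSProp and Adam lack.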

B LINE-SEARCH AND POLYAK STEP-SIZES

We now give the main guarantees on the step-sizes returned by the line-search. In practice, we use a backtracking line-search to find a step-size that satisfies the constraints, described in Algorithm 1 (Appendix F). For simplicity of presentation, we assume here that the line-search returns the largest step-size that satisfies the constraints. When interpolation is not exactly satisfied, the procedures need to be equipped with an additional safety mechanism, either by capping the maximum step-size at some $\eta_{\max}$ or by ensuring non-increasing step-sizes, $\eta_k \le \eta_{k-1}$. In this case, $\eta_{\max}$ ensures that a bad iteration of the line-search procedure does not result in divergence. When interpolation is satisfied, those conditions can be dropped (e.g., by setting $\eta_{\max} \to \infty$) and the rate does not depend on $\eta_{\max}$. The line-searches depend on a parameter $c \in (0, 1)$ that controls how much decrease is necessary to accept a step (larger $c$ means more decrease is demanded). Assuming the Lipschitz and Armijo line-searches select the largest $\eta$ such that
$$f_i(w - \eta \nabla f_i(w)) \le f_i(w) - c \eta \|\nabla f_i(w)\|^2, \quad \eta \le \eta_{\max}, \tag{Lipschitz line-search}$$
$$f_i(w - \eta A^{-1} \nabla f_i(w)) \le f_i(w) - c \eta \|\nabla f_i(w)\|^2_{A^{-1}}, \quad \eta \le \eta_{\max}, \tag{Armijo line-search}$$
the following lemma holds.

Lemma 1 (Line-search). If $f_i$ is $L_i$-smooth, the Lipschitz and Armijo line-searches ensure
$$\eta \|\nabla f_i(w)\|^2 \le \frac{1}{c} (f_i(w) - f_i^*), \quad \text{and} \quad \min\left\{\eta_{\max}, \frac{2(1-c)}{L_i}\right\} \le \eta \le \eta_{\max},$$
$$\eta \|\nabla f_i(w)\|^2_{A^{-1}} \le \frac{1}{c} (f_i(w) - f_i^*), \quad \text{and} \quad \min\left\{\eta_{\max}, \frac{2 \lambda_{\min}(A) (1-c)}{L_i}\right\} \le \eta \le \eta_{\max}.$$
We do not include the backtracking line-search parameters in the analysis for simplicity, as the same bounds hold up to a constant. With a backtracking line-search, we start with a large enough candidate step-size and multiply it by some constant $\gamma < 1$ until the Lipschitz or Armijo line-search condition is satisfied.
If $\eta'$ was a proposed step-size that did not satisfy the constraint, but $\gamma \eta'$ does, the largest step-size $\eta$ that satisfies the constraint must lie in the range $\gamma \eta' \le \eta < \eta'$.

Proof of Lemma 1. Recall that if $f_i$ is $L_i$-smooth, then for an arbitrary direction $d$, $f_i(w - d) \le f_i(w) - \langle \nabla f_i(w), d \rangle + \frac{L_i}{2} \|d\|^2$. For the Lipschitz line-search, $d = \eta \nabla f_i(w)$. The smoothness and the line-search conditions are then
$$\text{Smoothness:} \quad f_i(w - \eta \nabla f_i(w)) - f_i(w) \le \left(\frac{L_i}{2} \eta^2 - \eta\right) \|\nabla f_i(w)\|^2,$$
$$\text{Line-search:} \quad f_i(w - \eta \nabla f_i(w)) - f_i(w) \le -c \eta \|\nabla f_i(w)\|^2.$$
As illustrated in Fig. 3, the line-search condition is looser than smoothness if $\frac{L_i}{2} \eta^2 - \eta \le -c \eta$. This inequality is satisfied for any $\eta \in [a, b]$, where $a, b$ are the values of $\eta$ that satisfy it with equality, $a = 0$ and $b = 2(1-c)/L_i$, so the line-search condition holds for all $\eta \le 2(1-c)/L_i$. As the line-search selects the largest feasible step-size, $\eta \ge 2(1-c)/L_i$. If the step-size is capped at $\eta_{\max}$, we have $\eta \ge \min\{\eta_{\max}, 2(1-c)/L_i\}$, and the proof for the Lipschitz line-search is complete.

The proof for the Armijo line-search is identical, except that the smoothness property is applied to the direction $d = \eta A^{-1} \nabla f_i(w)$;
$$f_i(w - \eta A^{-1} \nabla f_i(w)) \le f_i(w) - \eta \langle \nabla f_i(w), A^{-1} \nabla f_i(w) \rangle + \frac{L_i}{2} \eta^2 \|A^{-1} \nabla f_i(w)\|^2 \le f_i(w) + \left(\frac{L_i}{2 \lambda_{\min}(A)} \eta^2 - \eta\right) \|\nabla f_i(w)\|^2_{A^{-1}},$$
where the second inequality comes from $\|A^{-1} \nabla f_i(w)\|^2 \le \frac{1}{\lambda_{\min}(A)} \|\nabla f_i(w)\|^2_{A^{-1}}$.

Similarly, the stochastic Polyak step-sizes (SPS) for $f_i$ at $w$ are defined as
$$\text{SPS:} \quad \eta = \min\left\{\frac{f_i(w) - f_i^*}{c \|\nabla f_i(w)\|^2}, \eta_{\max}\right\}, \qquad \text{Armijo SPS:} \quad \eta = \min\left\{\frac{f_i(w) - f_i^*}{c \|\nabla f_i(w)\|^2_{A^{-1}}}, \eta_{\max}\right\},$$
where the parameter $c > 0$ controls the scaling of the step (larger $c$ means smaller steps).

Lemma 2 (SPS guarantees).
If $f_i$ is $L_i$-smooth, SPS and Armijo SPS ensure
$$\text{SPS:} \quad \eta \|\nabla f_i(w)\|^2 \le \frac{1}{c} (f_i(w) - f_i^*), \quad \min\left\{\eta_{\max}, \frac{1}{2 c L_i}\right\} \le \eta \le \eta_{\max},$$
$$\text{Armijo SPS:} \quad \eta \|\nabla f_i(w)\|^2_{A^{-1}} \le \frac{1}{c} (f_i(w) - f_i^*), \quad \min\left\{\eta_{\max}, \frac{\lambda_{\min}(A)}{2 c L_i}\right\} \le \eta \le \eta_{\max}.$$
Proof of Lemma 2. The first guarantee follows directly from the definition of the step-size. For SPS,
$$\eta \|\nabla f_i(w)\|^2 = \min\left\{\frac{f_i(w) - f_i^*}{c \|\nabla f_i(w)\|^2}, \eta_{\max}\right\} \|\nabla f_i(w)\|^2 = \min\left\{\frac{f_i(w) - f_i^*}{c}, \eta_{\max} \|\nabla f_i(w)\|^2\right\} \le \frac{1}{c} (f_i(w) - f_i^*).$$
The same inequalities hold for Armijo SPS with $\|\nabla f_i(w)\|^2_{A^{-1}}$. To lower-bound the step-size, we use the $L_i$-smoothness of $f_i$, which implies $f_i(w) - f_i^* \ge \frac{1}{2 L_i} \|\nabla f_i(w)\|^2$. For SPS,
$$\frac{f_i(w) - f_i^*}{c \|\nabla f_i(w)\|^2} \ge \frac{\frac{1}{2 L_i} \|\nabla f_i(w)\|^2}{c \|\nabla f_i(w)\|^2} = \frac{1}{2 c L_i}.$$
For Armijo SPS, we additionally use $\|\nabla f_i(w)\|^2_{A^{-1}} \le \frac{1}{\lambda_{\min}(A)} \|\nabla f_i(w)\|^2$,
$$\frac{f_i(w) - f_i^*}{c \|\nabla f_i(w)\|^2_{A^{-1}}} \ge \frac{\frac{1}{2 L_i} \|\nabla f_i(w)\|^2}{\frac{c}{\lambda_{\min}(A)} \|\nabla f_i(w)\|^2} = \frac{\lambda_{\min}(A)}{2 c L_i}.$$
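The backtracking procedure described above can be sketched as follows. This is our own minimal sketch, not Algorithm 1 from Appendix F: it assumes a diagonal preconditioner (passed as the diagonal of $A^{-1}$), and the function name and defaults are ours. Per Lemma 1, the accepted step-size is at most the largest feasible one and, with backtracking factor $\gamma$, at least $\gamma$ times it.

```python
import numpy as np

def backtracking_armijo(f_i, grad_i, w, A_inv_diag, eta_init=1.0, c=0.5,
                        gamma=0.8, eta_max=10.0, max_iter=100):
    """Backtrack until the Armijo condition holds:
    f_i(w - eta * A^{-1} grad) <= f_i(w) - c * eta * ||grad||^2_{A^{-1}}.
    A_inv_diag is the diagonal of A^{-1}."""
    eta = min(eta_init, eta_max)          # never propose more than eta_max
    fw = f_i(w)
    g = grad_i(w)
    sq_norm = float(g @ (A_inv_diag * g))  # squared gradient norm in the A^{-1} metric
    for _ in range(max_iter):
        if f_i(w - eta * A_inv_diag * g) <= fw - c * eta * sq_norm:
            return eta                     # condition satisfied: accept
        eta *= gamma                       # shrink the candidate and retry
    return eta
```

For example, on the quadratic $f(w) = 2\|w\|^2$ (so $L = 4$) with $A = I$ and $c = 1/2$, the largest feasible step is $2(1-c)/L = 0.25$, and the routine returns a value in $[\gamma \cdot 0.25, 0.25]$.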

C PROOFS FOR ADAGRAD

We now prove the convergence of AdaGrad in the smooth setting with a constant step-size (Theorem 1) and with the conservative Lipschitz line-search (Theorem 2). We first give a rate for an arbitrary step-size $\eta_k$ in the range $[\eta_{\min}, \eta_{\max}]$, and derive the rates of Theorems 1 and 2 by specializing the range to a constant step-size or to the line-search.

Proposition 1 (AdaGrad with non-increasing step-sizes). Assuming (i) convexity and (ii) $L_{\max}$-smoothness of each $f_i$, and (iii) bounded iterates, AdaGrad with non-increasing ($\eta_k \le \eta_{k-1}$), bounded step-sizes ($\eta_k \in [\eta_{\min}, \eta_{\max}]$) and uniform averaging $\bar{w}_T = \frac{1}{T} \sum_{k=1}^T w_k$ converges at a rate
$$\mathbb{E}[f(\bar{w}_T) - f^*] \le \frac{\alpha}{T} + \frac{\sqrt{\alpha} \sigma}{\sqrt{T}}, \quad \text{where} \quad \alpha = \frac{1}{2}\left(\frac{D^2}{\eta_{\min}} + 2 \eta_{\max}\right)^2 d L_{\max}.$$
We first use the above result to prove Theorems 1 and 2. The proof of Theorem 1 is immediate by plugging $\eta = \eta_{\min} = \eta_{\max}$ into Proposition 1. We recall its statement;

Theorem 1 (Constant step-size AdaGrad). Assuming (i) convexity and (ii) $L_{\max}$-smoothness of each $f_i$, and (iii) bounded iterates, AdaGrad with a constant step-size $\eta$ and uniform averaging $\bar{w}_T = \frac{1}{T} \sum_{k=1}^T w_k$ converges at a rate
$$\mathbb{E}[f(\bar{w}_T) - f^*] \le \frac{\alpha}{T} + \frac{\sqrt{\alpha} \sigma}{\sqrt{T}}, \quad \text{where} \quad \alpha = \frac{1}{2}\left(\frac{D^2}{\eta} + 2 \eta\right)^2 d L_{\max}.$$
For Theorem 2, we use the properties of the conservative Lipschitz line-search. We recall its statement;

Theorem 2. Under the same assumptions as Theorem 1, AdaGrad with the conservative Lipschitz line-search with $c = 1/2$, a step-size upper bound $\eta_{\max}$ and uniform averaging converges at a rate
$$\mathbb{E}[f(\bar{w}_T) - f^*] \le \frac{\alpha}{T} + \frac{\sqrt{\alpha} \sigma}{\sqrt{T}}, \quad \text{where} \quad \alpha = \frac{1}{2}\left(D^2 \max\left\{\frac{1}{\eta_{\max}}, L_{\max}\right\} + 2 \eta_{\max}\right)^2 d L_{\max}.$$
Proof of Theorem 2. Using Lemma 1, the step-size $\eta_k$ selected by the Lipschitz line-search satisfies $\eta_k \ge 2(1-c)/L_{\max}$. Setting $c = 1/2$ and using a maximum step-size $\eta_{\max}$, we have
$$\min\left\{\eta_{\max}, \frac{1}{L_{\max}}\right\} \le \eta_k \le \eta_{\max} \quad \Longrightarrow \quad \frac{1}{\eta_{\min}} = \max\left\{\frac{1}{\eta_{\max}}, L_{\max}\right\}.$$
Before going into the proof of Proposition 1, we recall some standard lemmas from the adaptive gradient literature (Theorem 7 & Lemma 10 in Duchi et al. (2011), Lemmas 5.15 & 5.16 in Hazan (2016)) and a useful quadratic inequality (part of Theorem 4.2 in Levy et al. (2018)). We include proofs in Appendix C.1 for completeness.

Lemma 3. If the preconditioners are non-decreasing ($A_k \succeq A_{k-1}$), the step-sizes are non-increasing ($\eta_k \le \eta_{k-1}$), and the iterates stay within a ball of radius $D$ of the minimizer,
$$\sum_{k=1}^T \|w_k - w^*\|^2_{\frac{1}{\eta_k} A_k - \frac{1}{\eta_{k-1}} A_{k-1}} \le \frac{D^2}{\eta_T} \mathrm{Tr}(A_T).$$
Lemma 4. For AdaGrad, $A_k = \left(\sum_{j=1}^k \nabla f_{i_j}(w_j) \nabla f_{i_j}(w_j)^\top\right)^{1/2}$ and satisfies
$$\sum_{k=1}^T \|\nabla f_{i_k}(w_k)\|^2_{A_k^{-1}} \le 2 \mathrm{Tr}(A_T), \qquad \mathrm{Tr}(A_T) \le \sqrt{d \sum_{k=1}^T \|\nabla f_{i_k}(w_k)\|^2}.$$
Lemma 5. If $x^2 \le a(x + b)$ for $a \ge 0$ and $b \ge 0$, then $x \le \frac{1}{2}\left(\sqrt{a^2 + 4ab} + a\right) \le a + \sqrt{ab}$.

We now prove Proposition 1.

Proof of Proposition 1. We first give an overview of the main steps. Using the definition of the update rule along with Lemmas 3 and 4, we will show that
$$2 \sum_{k=1}^T \langle \nabla f_{i_k}(w_k), w_k - w^* \rangle \le \left(\frac{D^2}{\eta_{\min}} + 2 \eta_{\max}\right) \mathrm{Tr}(A_T). \tag{5}$$
Using the definition of $A_T$, individual smoothness and convexity, we then show that for a constant $a$,
$$\left(\sum_{k=1}^T \mathbb{E}[f(w_k) - f^*]\right)^2 \le a^2 \left(\mathbb{E}\left[\sum_{k=1}^T f_{i_k}(w_k) - f_{i_k}(w^*)\right] + T \sigma^2\right). \tag{6}$$
Using the quadratic inequality (Lemma 5), averaging, and using Jensen's inequality finishes the proof. To derive Eq. (5), we start with the update rule, measuring distances to $w^*$ in the $\|\cdot\|_{A_k}$ norm,
$$\|w_{k+1} - w^*\|^2_{A_k} = \|w_k - w^*\|^2_{A_k} - 2 \eta_k \langle \nabla f_{i_k}(w_k), w_k - w^* \rangle + \eta_k^2 \|\nabla f_{i_k}(w_k)\|^2_{A_k^{-1}}.$$
Dividing by $\eta_k$, reorganizing, and summing across iterations yields
$$2 \sum_{k=1}^T \langle \nabla f_{i_k}(w_k), w_k - w^* \rangle \le \sum_{k=1}^T \|w_k - w^*\|^2_{\frac{A_k}{\eta_k} - \frac{A_{k-1}}{\eta_{k-1}}} + \eta_{\max} \sum_{k=1}^T \|\nabla f_{i_k}(w_k)\|^2_{A_k^{-1}}.$$
We use Lemmas 3 and 4 to bound the right-hand side by the trace of the last preconditioner,
$$\le \frac{D^2}{\eta_T} \mathrm{Tr}(A_T) + 2 \eta_{\max} \mathrm{Tr}(A_T) \le \left(\frac{D^2}{\eta_{\min}} + 2 \eta_{\max}\right) \mathrm{Tr}(A_T). \quad (\eta_T \ge \eta_{\min})$$
To derive Eq. (6), we bound the trace of $A_T$ using Lemma 4 and individual smoothness,
$$\mathrm{Tr}(A_T) \le \sqrt{d \sum_{k=1}^T \|\nabla f_{i_k}(w_k)\|^2} \le \sqrt{2 d L_{\max} \sum_{k=1}^T \left(f_{i_k}(w_k) - f_{i_k}^*\right)} = \sqrt{2 d L_{\max} \sum_{k=1}^T \left(f_{i_k}(w_k) - f_{i_k}(w^*) + f_{i_k}(w^*) - f_{i_k}^*\right)}. \quad (\pm f_{i_k}(w^*))$$
Combining the above inequalities with $\delta_{i_k} = f_{i_k}(w^*) - f_{i_k}^*$ and $a = \frac{1}{2}\left(\frac{D^2}{\eta_{\min}} + 2 \eta_{\max}\right) \sqrt{2 d L_{\max}}$,
$$\sum_{k=1}^T \langle \nabla f_{i_k}(w_k), w_k - w^* \rangle \le a \sqrt{\sum_{k=1}^T \left(f_{i_k}(w_k) - f_{i_k}(w^*) + \delta_{i_k}\right)}.$$
Using individual convexity and taking expectations,
$$\sum_{k=1}^T \mathbb{E}[f(w_k) - f^*] \le a \, \mathbb{E}\left[\sqrt{\sum_{k=1}^T f_{i_k}(w_k) - f_{i_k}(w^*) + \delta_{i_k}}\right] \le a \sqrt{\mathbb{E}\left[\sum_{k=1}^T f_{i_k}(w_k) - f_{i_k}(w^*) + \delta_{i_k}\right]}. \quad \text{(Jensen's inequality)}$$
Letting $\sigma^2 := \mathbb{E}_i[\delta_i] = \mathbb{E}_i[f_i(w^*) - f_i^*]$ and squaring both sides yields
$$\left(\sum_{k=1}^T \mathbb{E}[f(w_k) - f^*]\right)^2 \le a^2 \left(\mathbb{E}\left[\sum_{k=1}^T f_{i_k}(w_k) - f_{i_k}(w^*)\right] + T \sigma^2\right).$$
The quadratic bound (Lemma 5), $x^2 \le \alpha(x + \beta) \implies x \le \alpha + \sqrt{\alpha \beta}$, with
$$x = \sum_{k=1}^T \mathbb{E}[f(w_k) - f^*], \quad \alpha = \frac{1}{2}\left(\frac{D^2}{\eta_{\min}} + 2 \eta_{\max}\right)^2 d L_{\max} = a^2, \quad \beta = T \sigma^2,$$
gives the first bound below. Averaging $\bar{w}_T = \frac{1}{T} \sum_{k=1}^T w_k$ and using Jensen's inequality gives the result;
$$\sum_{k=1}^T \mathbb{E}[f(w_k) - f^*] \le \alpha + \sqrt{\alpha \beta} \implies \mathbb{E}[f(\bar{w}_T) - f^*] \le \frac{\alpha}{T} + \frac{\sqrt{\alpha} \sigma}{\sqrt{T}}.$$
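The behavior that Theorem 1 predicts can be observed on a small simulation. This is our own illustration, not an experiment from the paper: an over-parameterized least-squares problem ($d > n$) where some $w^*$ fits every row exactly, so interpolation holds and $\sigma^2 = 0$; the dimensions and step-size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Over-parameterized least squares: d > n, so some w* fits all rows exactly
# (interpolation: every f_i attains its minimum value 0 at w*).
n, d = 10, 50
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)

def f_i(w, i):
    return 0.5 * (X[i] @ w - y[i]) ** 2

def grad_i(w, i):
    return (X[i] @ w - y[i]) * X[i]

# Constant step-size diagonal AdaGrad.
w, G, eta = np.zeros(d), np.zeros(d), 0.5
init_loss = np.mean([f_i(w, i) for i in range(n)])
for k in range(3000):
    i = rng.integers(n)
    g = grad_i(w, i)
    G += g ** 2
    w -= eta * g / (np.sqrt(G) + 1e-10)
train_loss = np.mean([f_i(w, i) for i in range(n)])
```

Under interpolation, the training loss keeps decreasing despite the constant step-size, consistent with the $O(1/T)$ term dominating the rate when $\sigma^2 = 0$.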

C.1 PROOFS OF ADAPTIVE GRADIENT LEMMAS

For completeness, we give proofs for the lemmas used in the previous section. We restate them here;

Lemma 3. If the preconditioners are non-decreasing ($A_k \succeq A_{k-1}$), the step-sizes are non-increasing ($\eta_k \le \eta_{k-1}$), and the iterates stay within a ball of radius $D$ of the minimizer,
$$\sum_{k=1}^T \|w_k - w^*\|^2_{\frac{1}{\eta_k} A_k - \frac{1}{\eta_{k-1}} A_{k-1}} \le \frac{D^2}{\eta_T} \mathrm{Tr}(A_T).$$
Proof of Lemma 3. Under the assumptions that $A_k$ is non-decreasing and $\eta_k$ is non-increasing, $\frac{1}{\eta_k} A_k - \frac{1}{\eta_{k-1}} A_{k-1} \succeq 0$, so we can use the bounded iterates assumption to bound
$$\sum_{k=1}^T \|w_k - w^*\|^2_{\frac{A_k}{\eta_k} - \frac{A_{k-1}}{\eta_{k-1}}} \le \sum_{k=1}^T \lambda_{\max}\left(\frac{A_k}{\eta_k} - \frac{A_{k-1}}{\eta_{k-1}}\right) \|w_k - w^*\|^2 \le D^2 \sum_{k=1}^T \lambda_{\max}\left(\frac{A_k}{\eta_k} - \frac{A_{k-1}}{\eta_{k-1}}\right).$$
We then upper-bound $\lambda_{\max}$ by the trace and use the linearity of the trace to telescope the sum,
$$\le D^2 \sum_{k=1}^T \left[\mathrm{Tr}\left(\frac{A_k}{\eta_k}\right) - \mathrm{Tr}\left(\frac{A_{k-1}}{\eta_{k-1}}\right)\right] = D^2 \left[\mathrm{Tr}\left(\frac{A_T}{\eta_T}\right) - \mathrm{Tr}\left(\frac{A_0}{\eta_0}\right)\right] \le \frac{D^2}{\eta_T} \mathrm{Tr}(A_T).$$
Lemma 4. For AdaGrad, $A_k = \left(\sum_{j=1}^k \nabla f_{i_j}(w_j) \nabla f_{i_j}(w_j)^\top\right)^{1/2}$ and satisfies
$$\sum_{k=1}^T \|\nabla f_{i_k}(w_k)\|^2_{A_k^{-1}} \le 2 \mathrm{Tr}(A_T), \qquad \mathrm{Tr}(A_T) \le \sqrt{d \sum_{k=1}^T \|\nabla f_{i_k}(w_k)\|^2}.$$
Proof of Lemma 4. For ease of notation, let $\nabla_k := \nabla f_{i_k}(w_k)$. By induction, starting with $T = 1$,
$$\|\nabla_1\|^2_{A_1^{-1}} = \nabla_1^\top A_1^{-1} \nabla_1 = \mathrm{Tr}\left(\nabla_1^\top A_1^{-1} \nabla_1\right) = \mathrm{Tr}\left(A_1^{-1} \nabla_1 \nabla_1^\top\right) \quad \text{(cyclic property of the trace)}$$
$$= \mathrm{Tr}\left(A_1^{-1} A_1^2\right) = \mathrm{Tr}(A_1). \quad (A_1 = (\nabla_1 \nabla_1^\top)^{1/2})$$
Suppose the claim holds for $T - 1$, $\sum_{k=1}^{T-1} \|\nabla_k\|^2_{A_k^{-1}} \le 2 \mathrm{Tr}(A_{T-1})$. We show that it also holds for $T$. Using the definition of the preconditioner and the cyclic property of the trace,
$$\sum_{k=1}^T \|\nabla_k\|^2_{A_k^{-1}} \le 2 \mathrm{Tr}(A_{T-1}) + \|\nabla_T\|^2_{A_T^{-1}} \quad \text{(induction hypothesis)}$$
$$= 2 \mathrm{Tr}\left((A_T^2 - \nabla_T \nabla_T^\top)^{1/2}\right) + \mathrm{Tr}\left(A_T^{-1} \nabla_T \nabla_T^\top\right). \quad \text{(AdaGrad update)}$$
We then use the fact that for any $X \succeq Y \succeq 0$ (Duchi et al., 2011, Lemma 8),
$$2 \mathrm{Tr}\left((X - Y)^{1/2}\right) + \mathrm{Tr}\left(X^{-1/2} Y\right) \le 2 \mathrm{Tr}\left(X^{1/2}\right).$$
As $X = A_T^2 \succeq Y = \nabla_T \nabla_T^\top \succeq 0$, we can apply the above inequality and the induction goes through. For the trace bound, recall that $A_T = G_T^{1/2}$, where $G_T = \sum_{k=1}^T \nabla f_{i_k}(w_k) \nabla f_{i_k}(w_k)^\top$.
We use Jensen's inequality,
$$\mathrm{Tr}(A_T) = \mathrm{Tr}\left(G_T^{1/2}\right) = \sum_{j=1}^d \sqrt{\lambda_j(G_T)} = d \left(\frac{1}{d} \sum_{j=1}^d \sqrt{\lambda_j(G_T)}\right) \le d \sqrt{\frac{1}{d} \sum_{j=1}^d \lambda_j(G_T)} = \sqrt{d \, \mathrm{Tr}(G_T)}.$$
To finish the proof, we use the definition of $G_T$ and the linearity of the trace to get
$$\mathrm{Tr}(G_T) = \mathrm{Tr}\left(\sum_{k=1}^T \nabla_k \nabla_k^\top\right) = \sum_{k=1}^T \mathrm{Tr}\left(\nabla_k \nabla_k^\top\right) = \sum_{k=1}^T \|\nabla_k\|^2.$$
Lemma 5. If $x^2 \le a(x + b)$ for $a \ge 0$ and $b \ge 0$, then $x \le \frac{1}{2}\left(\sqrt{a^2 + 4ab} + a\right) \le a + \sqrt{ab}$.

Proof of Lemma 5. The starting point is the quadratic inequality $x^2 - ax - ab \le 0$. Letting $r_1 \le r_2$ be the roots of the quadratic, the inequality holds if $x \in [r_1, r_2]$. The upper bound is then given by using $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$,
$$x \le r_2 = \frac{a + \sqrt{a^2 + 4ab}}{2} \le \frac{a + \sqrt{a^2} + \sqrt{4ab}}{2} = a + \sqrt{ab}.$$
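Since Lemma 5 is used repeatedly in the rates above, it is worth sanity-checking numerically. The snippet below is our own check: it computes the positive root of $x^2 - ax - ab = 0$ and verifies Lemma 5's two claims over random non-negative $a, b$.

```python
import math
import random

def quadratic_bound(a, b):
    # Largest x satisfying x^2 <= a(x + b): the positive root of x^2 - ax - ab = 0.
    r2 = (a + math.sqrt(a * a + 4 * a * b)) / 2
    upper = a + math.sqrt(a * b)  # Lemma 5's simpler upper bound
    return r2, upper

random.seed(0)
for _ in range(1000):
    a, b = random.uniform(0, 10), random.uniform(0, 10)
    r2, upper = quadratic_bound(a, b)
    assert r2 <= upper + 1e-12            # the root is below the simplified bound
    assert r2 ** 2 <= a * (r2 + b) + 1e-9  # the root itself satisfies x^2 <= a(x + b)
```

The second inequality of the lemma is exactly the step $\sqrt{a^2 + 4ab} \le \sqrt{a^2} + \sqrt{4ab}$ applied inside the root formula.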

C.2 REGRET BOUND FOR ADAGRAD UNDER INTERPOLATION

In the online convex optimization framework, we consider a sequence of functions $\{f_k\}_{k=1}^T$, chosen potentially adversarially by the environment. The aim of the learner is to output a sequence of strategies $\{w_k\}_{k=1}^T$, choosing each $w_k$ before seeing the function $f_k$. After choosing $w_k$, the learner suffers the loss $f_k(w_k)$ and observes the corresponding gradient $\nabla f_k(w_k)$. They suffer an instantaneous regret $r_k = f_k(w_k) - f_k(w)$ compared to a fixed strategy $w$. The aim is to bound the cumulative regret $R(T) = \sum_{k=1}^T [f_k(w_k) - f_k(w^*)]$, where $w^* = \arg\min_w \sum_{k=1}^T f_k(w)$ is the best strategy if we had access to the entire sequence of functions in hindsight. Assuming the functions are convex but non-smooth, AdaGrad obtains an $O(\sqrt{T})$ regret bound (Duchi et al., 2011). For online convex optimization, the interpolation assumption implies that the learner's model is powerful enough to fit the entire sequence of functions. For large over-parameterized models like neural networks, where the number of parameters is of the order of millions, this is a reasonable assumption for large $T$. We first recall the AdaGrad update: at iteration $k$, the learner plays the strategy $w_k$, suffers the loss $f_k(w_k)$ and uses the gradient feedback $\nabla f_k(w_k)$ to update their strategy as
$$w_{k+1} = w_k - \eta A_k^{-1} \nabla f_k(w_k), \quad \text{where} \quad A_k = \left(\sum_{j=1}^k \nabla f_j(w_j) \nabla f_j(w_j)^\top\right)^{1/2}.$$
We now show that for smooth, convex functions under the interpolation assumption, AdaGrad with a constant step-size can achieve constant regret.

Theorem 6. For a sequence of $L_{\max}$-smooth, convex functions $f_k$, assuming the iterates remain bounded, i.e. $\|w_k - w^*\| \le D$ for all $k$, AdaGrad with a constant step-size $\eta$ achieves the regret bound
$$R(T) \le \frac{1}{2}\left(\frac{D^2}{\eta} + 2\eta\right)^2 d L_{\max} + \sqrt{\frac{1}{2}\left(\frac{D^2}{\eta} + 2\eta\right)^2 d L_{\max}} \, \sigma \sqrt{T},$$
where $\sigma^2$ is an upper bound on $f_k(w^*) - f_k^*$.
Observe that $\sigma^2$ measures the degree to which interpolation is violated: if $\sigma^2 > 0$, then $R(T) = O(\sqrt{T})$, matching the regret of Duchi et al. (2011); however, when interpolation is exactly satisfied, $\sigma^2 = 0$ and $R(T) = O(1)$.

Proof of Theorem 6. The proof follows that of Proposition 1, which is inspired by Levy et al. (2018). For convenience, we repeat the basic steps. Measuring distances to $w^*$ in the $\|\cdot\|_{A_k}$ norm,
$$\|w_{k+1} - w^*\|^2_{A_k} = \|w_k - w^*\|^2_{A_k} - 2 \eta \langle \nabla f_k(w_k), w_k - w^* \rangle + \eta^2 \|\nabla f_k(w_k)\|^2_{A_k^{-1}}.$$
Dividing by $2\eta$, reorganizing, and summing across iterations yields
$$\sum_{k=1}^T \langle \nabla f_k(w_k), w_k - w^* \rangle \le \sum_{k=1}^T \|w_k - w^*\|^2_{\frac{A_k - A_{k-1}}{2\eta}} + \frac{\eta}{2} \sum_{k=1}^T \|\nabla f_k(w_k)\|^2_{A_k^{-1}}.$$
By convexity of $f_k$, $\langle \nabla f_k(w_k), w_k - w^* \rangle \ge f_k(w_k) - f_k(w^*)$. Using the definition of regret and Lemmas 3 and 4 to bound the right-hand side by the trace of the last preconditioner,
$$R(T) \le \frac{D^2}{2\eta} \mathrm{Tr}(A_T) + \eta \mathrm{Tr}(A_T).$$
We now bound the trace of $A_T$ using Lemma 4 and individual smoothness,
$$\mathrm{Tr}(A_T) \le \sqrt{d \sum_{k=1}^T \|\nabla f_k(w_k)\|^2} \le \sqrt{2 d L_{\max} \sum_{k=1}^T \left(f_k(w_k) - f_k^*\right)} = \sqrt{2 d L_{\max} \sum_{k=1}^T \left(f_k(w_k) - f_k(w^*) + f_k(w^*) - f_k^*\right)} \le \sqrt{2 d L_{\max} \left(R(T) + \sigma^2 T\right)},$$
since $f_k(w^*) - f_k^* \le \sigma^2$. Plugging this back into the regret bound,
$$R(T) \le \left(\frac{D^2}{2\eta} + \eta\right) \sqrt{2 d L_{\max} \left(R(T) + \sigma^2 T\right)}.$$
Squaring both sides and denoting $a = \left(\frac{D^2}{2\eta} + \eta\right) \sqrt{2 d L_{\max}}$,
$$[R(T)]^2 \le a^2 \left(R(T) + \sigma^2 T\right).$$
The quadratic bound (Lemma 5), $x^2 \le \alpha(x + \beta) \implies x \le \alpha + \sqrt{\alpha \beta}$, with
$$x = R(T), \quad \alpha = \frac{1}{2}\left(\frac{D^2}{\eta} + 2\eta\right)^2 d L_{\max} = a^2, \quad \beta = \sigma^2 T,$$
yields the bound
$$R(T) \le \alpha + \sqrt{\alpha \beta} = \frac{1}{2}\left(\frac{D^2}{\eta} + 2\eta\right)^2 d L_{\max} + \sqrt{\frac{1}{2}\left(\frac{D^2}{\eta} + 2\eta\right)^2 d L_{\max}} \, \sigma \sqrt{T}.$$
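The constant-regret behavior under interpolation can be illustrated with a small simulation. This is our own sketch, not an experiment from the paper: we use diagonal AdaGrad (rather than the full-matrix update above) on a sequence of smooth convex losses that all share a common minimizer, so interpolation holds exactly; the dimensions and step-size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, eta = 20, 4000, 0.5
w_star = rng.standard_normal(d)

# The environment picks smooth convex losses that all share the minimizer w_star
# (interpolation: f_k(w_star) = f_k^* = 0 in every round).
X = rng.standard_normal((T, d))
def loss(w, k):
    return 0.5 * (X[k] @ (w - w_star)) ** 2
def grad(w, k):
    return (X[k] @ (w - w_star)) * X[k]

# Online diagonal AdaGrad with a constant step-size.
w, G = np.zeros(d), np.zeros(d)
regret = []  # cumulative regret R(t); note f_k(w_star) = 0
total = 0.0
for k in range(T):
    total += loss(w, k)   # suffer the loss before seeing the gradient feedback
    g = grad(w, k)
    G += g ** 2
    w -= eta * g / (np.sqrt(G) + 1e-10)
    regret.append(total)
```

Plotting `regret` shows it flattening out: most of the cumulative regret is incurred in the early rounds, consistent with $R(T) = O(1)$ when $\sigma^2 = 0$.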

C.3 WITH INTERPOLATION, WITHOUT CONSERVATIVE LINE-SEARCHES

In this section, we show that the conservative constraint $\eta_{k+1} \le \eta_k$ is not necessary if interpolation is satisfied. We give the proof for the Armijo line-search, which has better empirical performance but a worse theoretical dependence on the problem's constants. In practice, the quantity $a_{\min}$ appearing in the theorem below is bounded away from zero. A similar proof also works for the Lipschitz line-search.

Theorem 7 (AdaGrad with Armijo line-search under interpolation). Under the same assumptions as Proposition 1, but without non-increasing step-sizes, if interpolation is satisfied, AdaGrad with the Armijo line-search and uniform averaging converges at the rate
$$\mathbb{E}[f(\bar{w}_T) - f^*] \le \frac{\left(D^2 + 2 \eta_{\max}^2\right)^2 d L_{\max}}{2T} \max\left\{\frac{1}{\eta_{\max}}, \frac{L_{\max}}{a_{\min}}\right\}^2, \quad \text{where} \quad a_{\min} = \min_k \{\lambda_{\min}(A_k)\}.$$
Proof of Theorem 7. Following the proof of Proposition 1,
$$2 \sum_{k=1}^T \eta_k \langle \nabla f_{i_k}(w_k), w_k - w^* \rangle = \sum_{k=1}^T \left(\|w_k - w^*\|^2_{A_k} - \|w_{k+1} - w^*\|^2_{A_k}\right) + \sum_{k=1}^T \eta_k^2 \|\nabla f_{i_k}(w_k)\|^2_{A_k^{-1}}.$$
On the left-hand side, we use individual convexity and interpolation, which implies $f_{i_k}(w^*) = \min_w f_{i_k}(w)$, so we can bound $\eta_k$ by $\eta_{\min}$,
$$\eta_k \langle \nabla f_{i_k}(w_k), w_k - w^* \rangle \ge \eta_k \underbrace{\left(f_{i_k}(w_k) - f_{i_k}(w^*)\right)}_{\ge 0} \ge \eta_{\min} \left(f_{i_k}(w_k) - f_{i_k}(w^*)\right).$$
On the right-hand side, we apply the AdaGrad lemmas,
$$\sum_{k=1}^T \left(\|w_k - w^*\|^2_{A_k} - \|w_{k+1} - w^*\|^2_{A_k}\right) + \eta_{\max}^2 \sum_{k=1}^T \|\nabla f_{i_k}(w_k)\|^2_{A_k^{-1}} \le D^2 \mathrm{Tr}(A_T) + 2 \eta_{\max}^2 \mathrm{Tr}(A_T) \quad \text{(Lemmas 3 and 4)}$$
$$\le \left(D^2 + 2 \eta_{\max}^2\right) \sqrt{d \sum_{k=1}^T \|\nabla f_{i_k}(w_k)\|^2} \quad \text{(trace bound of Lemma 4)}$$
$$\le \left(D^2 + 2 \eta_{\max}^2\right) \sqrt{2 d L_{\max} \sum_{k=1}^T \left(f_{i_k}(w_k) - f_{i_k}(w^*)\right)}. \quad \text{(individual smoothness and interpolation)}$$
Defining $a = \frac{1}{2 \eta_{\min}} \left(D^2 + 2 \eta_{\max}^2\right) \sqrt{2 d L_{\max}}$ and combining the previous inequalities yields
$$\sum_{k=1}^T \left(f_{i_k}(w_k) - f_{i_k}(w^*)\right) \le a \sqrt{\sum_{k=1}^T \left(f_{i_k}(w_k) - f_{i_k}(w^*)\right)}.$$
Taking expectations and applying Jensen's inequality yields
$$\sum_{k=1}^T \mathbb{E}[f(w_k) - f(w^*)] \le a \sqrt{\sum_{k=1}^T \mathbb{E}[f(w_k) - f(w^*)]}.$$
Squaring both sides, dividing by $\sum_{k=1}^T \mathbb{E}[f(w_k) - f(w^*)]$, then dividing by $T$ and applying Jensen's inequality,
$$\mathbb{E}[f(\bar{w}_T) - f(w^*)] \le \frac{a^2}{T} = \frac{\left(D^2 + 2 \eta_{\max}^2\right)^2 d L_{\max}}{2 \eta_{\min}^2 T}.$$
Using the Armijo line-search guarantee (Lemma 1) with $c = 1/2$ and a maximum step-size $\eta_{\max}$,
$$\eta_{\min} = \min\left\{\eta_{\max}, \frac{a_{\min}}{L_{\max}}\right\}, \quad \text{where} \quad a_{\min} = \min_k \{\lambda_{\min}(A_k)\},$$
giving the rate
$$\mathbb{E}[f(\bar{w}_T) - f(w^*)] \le \frac{\left(D^2 + 2 \eta_{\max}^2\right)^2 d L_{\max}}{2T} \max\left\{\frac{1}{\eta_{\max}}, \frac{L_{\max}}{a_{\min}}\right\}^2.$$

D PROOFS FOR AMSGRAD AND NON-DECREASING PRECONDITIONERS WITHOUT MOMENTUM

We now give the proofs for AMSGrad and general bounded, non-decreasing preconditioners in the smooth setting, using a constant step-size (Theorem 8) and the Armijo line-search (Theorem 4). As in Appendix C, we prove a general proposition and specialize it for each of the theorems;

Proposition 2. In addition to the assumptions of Theorem 1, assume that (iv) the preconditioners are non-decreasing and have (v) bounded eigenvalues in the $[a_{\min}, a_{\max}]$ range. If the step-sizes are constrained to lie in the range $[\eta_{\min}, \eta_{\max}]$ and satisfy
$$\eta_k \|\nabla f_{i_k}(w_k)\|^2_{A_k^{-1}} \le M \left(f_{i_k}(w_k) - f_{i_k}^*\right), \quad \text{for some } M < 2, \tag{7}$$
then uniform averaging $\bar{w}_T = \frac{1}{T} \sum_{k=1}^T w_k$ leads to the rate
$$\mathbb{E}[f(\bar{w}_T) - f^*] \le \frac{1}{T} \frac{D^2 d a_{\max}}{(2 - M) \eta_{\min}} + \left(\frac{2}{2 - M} \frac{\eta_{\max}}{\eta_{\min}} - 1\right) \sigma^2.$$
Theorem 8. Under the assumptions of Theorem 1, and assuming (iv) non-decreasing preconditioners and (v) bounded eigenvalues in the $[a_{\min}, a_{\max}]$ interval, AMSGrad with no momentum, constant step-size $\eta = \frac{a_{\min}}{2 L_{\max}}$ and uniform averaging converges at a rate
$$\mathbb{E}[f(\bar{w}_T) - f^*] \le \frac{2 D^2 d \, a_{\max} L_{\max}}{a_{\min} T} + \sigma^2.$$
Proof of Theorem 8. Using the bounded preconditioner and individual smoothness, we have
$$\|\nabla f_{i_k}(w_k)\|^2_{A_k^{-1}} \le \frac{1}{a_{\min}} \|\nabla f_{i_k}(w_k)\|^2 \le \frac{2 L_{\max}}{a_{\min}} \left(f_{i_k}(w_k) - f_{i_k}^*\right).$$
A constant step-size $\eta_{\max} = \eta_{\min} = \frac{a_{\min}}{2 L_{\max}}$ satisfies the step-size assumption (Eq. (7)) with $M = 1$, and
$$\frac{1}{T} \frac{D^2 d a_{\max}}{(2 - M) \eta_{\min}} + \left(\frac{2}{2 - M} \frac{\eta_{\max}}{\eta_{\min}} - 1\right) \sigma^2 = \frac{1}{T} \frac{2 L_{\max} D^2 d a_{\max}}{a_{\min}} + \sigma^2.$$
We restate Theorem 4;

Theorem 4. Under the same assumptions as Theorem 1, AMSGrad with zero momentum, the Armijo line-search with $c = 3/4$, a step-size upper bound $\eta_{\max}$ and uniform averaging converges at a rate
$$\mathbb{E}[f(\bar{w}_T) - f^*] \le \left(\frac{3 D^2 d \, a_{\max}}{2T} + 3 \eta_{\max} \sigma^2\right) \max\left\{\frac{1}{\eta_{\max}}, \frac{2 L_{\max}}{a_{\min}}\right\}.$$
Proof of Theorem 4. For the Armijo line-search, Lemma 1 guarantees that
$$\eta_k \|\nabla f_{i_k}(w_k)\|^2_{A_k^{-1}} \le \frac{1}{c} \left(f_{i_k}(w_k) - f_{i_k}^*\right), \quad \text{and} \quad \min\left\{\eta_{\max}, \frac{2 \lambda_{\min}(A_k) (1 - c)}{L_{\max}}\right\} \le \eta_k \le \eta_{\max}.$$
Selecting $c = 3/4$ gives $M = 4/3$ and $\eta_{\min} = \min\left\{\eta_{\max}, \frac{a_{\min}}{2 L_{\max}}\right\}$, so
$$\frac{1}{T} \frac{D^2 d a_{\max}}{(2 - 4/3) \eta_{\min}} + \left(\frac{2}{2 - 4/3} \frac{\eta_{\max}}{\eta_{\min}} - 1\right) \sigma^2 = \frac{1}{T} \frac{3 D^2 d a_{\max}}{2 \eta_{\min}} + \left(\frac{3 \eta_{\max}}{\eta_{\min}} - 1\right) \sigma^2 \le \left(\frac{3 D^2 d a_{\max}}{2T} + 3 \eta_{\max} \sigma^2\right) \max\left\{\frac{1}{\eta_{\max}}, \frac{2 L_{\max}}{a_{\min}}\right\}.$$
Theorem 9. Under the assumptions of Theorem 1, and assuming (iv) non-decreasing preconditioners and (v) bounded eigenvalues in the $[a_{\min}, a_{\max}]$ interval, AMSGrad with no momentum, Armijo SPS with $c = 3/4$ and uniform averaging converges at a rate
$$\mathbb{E}[f(\bar{w}_T) - f^*] \le \left(\frac{3 D^2 d \, a_{\max}}{2T} + 3 \eta_{\max} \sigma^2\right) \max\left\{\frac{1}{\eta_{\max}}, \frac{3 L_{\max}}{2 a_{\min}}\right\}.$$
Proof of Theorem 9. For Armijo SPS, Lemma 2 guarantees that
$$\eta_k \|\nabla f_{i_k}(w_k)\|^2_{A_k^{-1}} \le \frac{1}{c} \left(f_{i_k}(w_k) - f_{i_k}^*\right), \quad \text{and} \quad \min\left\{\eta_{\max}, \frac{a_{\min}}{2 c L_{\max}}\right\} \le \eta_k \le \eta_{\max}.$$
Selecting $c = 3/4$ gives $M = 4/3$ and $\eta_{\min} = \min\left\{\eta_{\max}, \frac{2 a_{\min}}{3 L_{\max}}\right\}$, so
$$\frac{1}{T} \frac{D^2 d a_{\max}}{(2 - 4/3) \eta_{\min}} + \left(\frac{2}{2 - 4/3} \frac{\eta_{\max}}{\eta_{\min}} - 1\right) \sigma^2 = \frac{1}{T} \frac{3 D^2 d a_{\max}}{2 \eta_{\min}} + \left(\frac{3 \eta_{\max}}{\eta_{\min}} - 1\right) \sigma^2 \le \left(\frac{3 D^2 d a_{\max}}{2T} + 3 \eta_{\max} \sigma^2\right) \max\left\{\frac{1}{\eta_{\max}}, \frac{3 L_{\max}}{2 a_{\min}}\right\}.$$
Before diving into the proof of Proposition 2, we prove the following lemma to handle terms of the form $\eta_k (f_{i_k}(w_k) - f_{i_k}(w^*))$. If $\eta_k$ depends on the function sampled at the current iteration, $f_{i_k}$, as in the case of the line-search, we cannot take expectations directly, as the terms are not independent. Lemma 6 bounds $\eta_k (f_{i_k}(w_k) - f_{i_k}(w^*))$ in terms of the range $[\eta_{\min}, \eta_{\max}]$;

Lemma 6. If $0 \le \eta_{\min} \le \eta \le \eta_{\max}$ and the minimum value of $f_i$ is $f_i^*$, then
$$\eta \left(f_i(w) - f_i(w^*)\right) \ge \eta_{\min} \left(f_i(w) - f_i(w^*)\right) - (\eta_{\max} - \eta_{\min}) \left(f_i(w^*) - f_i^*\right).$$
Proof of Lemma 6. By adding and subtracting $f_i^*$, the minimum value of $f_i$, we obtain a non-negative and a non-positive term multiplied by $\eta$.
We can use the bounds $\eta \ge \eta_{\min}$ and $\eta \le \eta_{\max}$ separately;
$$\eta \left[f_i(w) - f_i(w^*)\right] = \eta \big[\underbrace{f_i(w) - f_i^*}_{\ge 0} + \underbrace{f_i^* - f_i(w^*)}_{\le 0}\big] \ge \eta_{\min} \left[f_i(w) - f_i^*\right] + \eta_{\max} \left[f_i^* - f_i(w^*)\right].$$
Adding and subtracting $\eta_{\min} f_i(w^*)$ finishes the proof,
$$= \eta_{\min} \left[f_i(w) - f_i(w^*) + f_i(w^*) - f_i^*\right] + \eta_{\max} \left[f_i^* - f_i(w^*)\right] = \eta_{\min} \left[f_i(w) - f_i(w^*)\right] + (\eta_{\max} - \eta_{\min}) \left[f_i^* - f_i(w^*)\right].$$
Proof of Proposition 2. We start with the update rule, measuring distances to $w^*$ in the $\|\cdot\|_{A_k}$ norm,
$$\|w_{k+1} - w^*\|^2_{A_k} = \|w_k - w^*\|^2_{A_k} - 2 \eta_k \langle \nabla f_{i_k}(w_k), w_k - w^* \rangle + \eta_k^2 \|\nabla f_{i_k}(w_k)\|^2_{A_k^{-1}}. \tag{8}$$
To bound the right-hand side, we use the assumption on the step-sizes (Eq. (7)) and individual convexity,
$$-2 \eta_k \langle \nabla f_{i_k}(w_k), w_k - w^* \rangle + \eta_k^2 \|\nabla f_{i_k}(w_k)\|^2_{A_k^{-1}}$$
$$\le -2 \eta_k \langle \nabla f_{i_k}(w_k), w_k - w^* \rangle + M \eta_k \left(f_{i_k}(w_k) - f_{i_k}^*\right) \quad \text{(step-size assumption, Eq. (7))}$$
$$\le -2 \eta_k \left[f_{i_k}(w_k) - f_{i_k}(w^*)\right] + M \eta_k \left(f_{i_k}(w_k) - f_{i_k}^*\right) \quad \text{(individual convexity)}$$
$$= -2 \eta_k \left[f_{i_k}(w_k) - f_{i_k}(w^*)\right] + M \eta_k \left(f_{i_k}(w_k) - f_{i_k}(w^*) + f_{i_k}(w^*) - f_{i_k}^*\right) \quad (\pm f_{i_k}(w^*))$$
$$\le -(2 - M) \eta_k \left[f_{i_k}(w_k) - f_{i_k}(w^*)\right] + M \eta_{\max} \left(f_{i_k}(w^*) - f_{i_k}^*\right). \quad (\eta_k \le \eta_{\max})$$
Plugging the inequality back into Eq. (8) and reorganizing the terms yields
$$(2 - M) \eta_k \left[f_{i_k}(w_k) - f_{i_k}(w^*)\right] \le \|w_k - w^*\|^2_{A_k} - \|w_{k+1} - w^*\|^2_{A_k} + M \eta_{\max} \left(f_{i_k}(w^*) - f_{i_k}^*\right). \tag{9}$$
Using Lemma 6, we have that
$$(2 - M) \eta_k \left[f_{i_k}(w_k) - f_{i_k}(w^*)\right] \ge (2 - M) \eta_{\min} \left(f_{i_k}(w_k) - f_{i_k}(w^*)\right) - (2 - M)(\eta_{\max} - \eta_{\min}) \left(f_{i_k}(w^*) - f_{i_k}^*\right).$$
Using this inequality in Eq. (9) and moving the terms depending on $f_{i_k}(w^*) - f_{i_k}^*$ to the right-hand side,
$$(2 - M) \eta_{\min} \left(f_{i_k}(w_k) - f_{i_k}(w^*)\right) \le \|w_k - w^*\|^2_{A_k} - \|w_{k+1} - w^*\|^2_{A_k} + \left(2 \eta_{\max} - (2 - M) \eta_{\min}\right) \left(f_{i_k}(w^*) - f_{i_k}^*\right).$$
Taking expectations and summing across iterations yields
$$(2 - M) \eta_{\min} \sum_{k=1}^T \mathbb{E}\left[f_{i_k}(w_k) - f_{i_k}(w^*)\right] \le \mathbb{E}\left[\sum_{k=1}^T \|w_k - w^*\|^2_{A_k} - \|w_{k+1} - w^*\|^2_{A_k}\right] + \left(2 \eta_{\max} - (2 - M) \eta_{\min}\right) T \sigma^2.$$
Using Lemma 3 to telescope the distances and using the bounded preconditioner,
$$\sum_{k=1}^T \|w_k - w^*\|^2_{A_k} - \|w_{k+1} - w^*\|^2_{A_k} \le \sum_{k=1}^T \|w_k - w^*\|^2_{A_k - A_{k-1}} \le D^2 \mathrm{Tr}(A_T) \le D^2 d \, a_{\max},$$
which guarantees that
$$(2 - M) \eta_{\min} \sum_{k=1}^T \mathbb{E}[f(w_k) - f(w^*)] \le D^2 d a_{\max} + \left(2 \eta_{\max} - (2 - M) \eta_{\min}\right) T \sigma^2.$$
Dividing by $T (2 - M) \eta_{\min}$ and using Jensen's inequality finishes the proof, giving the rate for the averaged iterate,
$$\mathbb{E}[f(\bar{w}_T) - f(w^*)] \le \frac{1}{T} \frac{D^2 d a_{\max}}{(2 - M) \eta_{\min}} + \left(\frac{2}{2 - M} \frac{\eta_{\max}}{\eta_{\min}} - 1\right) \sigma^2.$$
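Lemma 6 is a purely algebraic inequality and is easy to sanity-check numerically. The snippet below is our own check over random instances; the helper name `lemma6_gap` is ours.

```python
import random

def lemma6_gap(eta, eta_min, eta_max, fi_w, fi_wstar, fi_star):
    """Return LHS - RHS of Lemma 6; non-negative when the lemma holds."""
    lhs = eta * (fi_w - fi_wstar)
    rhs = eta_min * (fi_w - fi_wstar) - (eta_max - eta_min) * (fi_wstar - fi_star)
    return lhs - rhs

random.seed(1)
for _ in range(10000):
    eta_min = random.uniform(0.0, 1.0)
    eta_max = eta_min + random.uniform(0.0, 1.0)
    eta = random.uniform(eta_min, eta_max)
    fi_star = random.uniform(-5.0, 5.0)            # minimum value of f_i
    fi_wstar = fi_star + random.uniform(0.0, 5.0)  # f_i(w*) >= f_i^*
    fi_w = fi_star + random.uniform(0.0, 5.0)      # any function value >= f_i^*
    assert lemma6_gap(eta, eta_min, eta_max, fi_w, fi_wstar, fi_star) >= -1e-12
```

The check mirrors the proof: the gap is non-negative precisely because $f_i(w) - f_i^* \ge 0$ is multiplied by at least $\eta_{\min}$ and $f_i^* - f_i(w^*) \le 0$ by at most $\eta_{\max}$.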

E AMSGRAD WITH MOMENTUM

We first show the relation between AMSGrad momentum and heavy-ball momentum, and then present the proofs for AMSGrad momentum in E.2 and heavy-ball momentum in E.3.

E.1 RELATION BETWEEN THE AMSGRAD UPDATE AND PRECONDITIONED SGD WITH HEAVY-BALL MOMENTUM

Recall that the AMSGrad update is given as
$$w_{k+1} = w_k - \eta_k A_k^{-1} m_k; \qquad m_k = \beta m_{k-1} + (1 - \beta) \nabla f_{i_k}(w_k).$$
Expanding,
$$w_{k+1} = w_k - \eta_k A_k^{-1} \left(\beta m_{k-1} + (1 - \beta) \nabla f_{i_k}(w_k)\right) = w_k - \eta_k (1 - \beta) A_k^{-1} \nabla f_{i_k}(w_k) - \eta_k \beta A_k^{-1} m_{k-1}.$$
From the update at iteration $k - 1$,
$$w_k = w_{k-1} - \eta_{k-1} A_{k-1}^{-1} m_{k-1} \implies -m_{k-1} = \frac{1}{\eta_{k-1}} A_{k-1} (w_k - w_{k-1}).$$
From the above relations,
$$w_{k+1} = w_k - \eta_k (1 - \beta) A_k^{-1} \nabla f_{i_k}(w_k) + \beta \frac{\eta_k}{\eta_{k-1}} A_k^{-1} A_{k-1} (w_k - w_{k-1}),$$
which has the same form as the update with heavy-ball momentum,
$$w_{k+1} = w_k - \eta_k A_k^{-1} \nabla f_{i_k}(w_k) + \gamma (w_k - w_{k-1}).$$
The two updates are equivalent up to constants, except for the key difference that for AMSGrad, the momentum term $(w_k - w_{k-1})$ is further preconditioned by $A_k^{-1} A_{k-1}$.
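The algebraic equivalence above can be verified numerically by running both forms side by side. This is our own sketch with arbitrary stand-in gradients and fixed diagonal preconditioners (and a constant step-size, so $\eta_k / \eta_{k-1} = 1$); it is not an implementation of AMSGrad itself, only of the two equivalent ways of writing its update.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, eta, beta = 5, 50, 0.1, 0.9
grads = rng.standard_normal((T, d))              # stand-in stochastic gradients
# Fixed non-decreasing diagonal preconditioners A_0 <= A_1 <= ...
A = np.cumsum(rng.random((T, d)), axis=0) + 1.0

# Form 1: EMA-momentum update, w_{k+1} = w_k - eta * A_k^{-1} m_k.
w1, m, ws1 = np.zeros(d), np.zeros(d), []
for k in range(T):
    m = beta * m + (1 - beta) * grads[k]
    w1 = w1 - eta * m / A[k]
    ws1.append(w1.copy())

# Form 2: expanded heavy-ball form, with the momentum term (w_k - w_{k-1})
# preconditioned by A_k^{-1} A_{k-1}; eta_k / eta_{k-1} = 1 for a constant step.
w2, w_prev, ws2 = np.zeros(d), np.zeros(d), []
for k in range(T):
    step = -eta * (1 - beta) * grads[k] / A[k]
    if k > 0:
        step = step + beta * (A[k - 1] / A[k]) * (w2 - w_prev)
    w2, w_prev = w2 + step, w2
    ws2.append(w2.copy())
```

The two trajectories coincide exactly, confirming that the only difference from plain heavy-ball momentum is the extra $A_k^{-1} A_{k-1}$ factor on the momentum term.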

E.2 PROOFS FOR AMSGRAD WITH MOMENTUM

We now give the proofs for AMSGrad with the update
$$w_{k+1} = w_k - \eta_k A_k^{-1} m_k; \qquad m_k = \beta m_{k-1} + (1 - \beta) \nabla f_{i_k}(w_k).$$
We analyze it in the smooth setting using a constant step-size (Theorem 3), the conservative Armijo SPS (Theorem 5) and the conservative Armijo SLS (Theorem 10). As before, we abstract the common elements into a general proposition and specialize it for each of the theorems.

Proposition 3. In addition to the assumptions of Theorem 1, assume that (iv) the preconditioners are non-decreasing and have (v) bounded eigenvalues in the $[a_{\min}, a_{\max}]$ range. If the step-sizes are lower-bounded and non-increasing, $\eta_{\min} \le \eta_k \le \eta_{k-1}$, and satisfy
$$\eta_k \|\nabla f_{i_k}(w_k)\|^2_{A_k^{-1}} \le M \left(f_{i_k}(w_k) - f_{i_k}^*\right), \quad \text{for some } M < 2 \frac{1 - \beta}{1 + \beta}, \tag{10}$$
then uniform averaging $\bar{w}_T = \frac{1}{T} \sum_{k=1}^T w_k$ leads to the rate
$$\mathbb{E}[f(\bar{w}_T) - f^*] \le \frac{1 + \beta}{1 - \beta} \left(2 - \frac{1 + \beta}{1 - \beta} M\right)^{-1} \left[\frac{D^2 d a_{\max}}{\eta_{\min} T} + M \sigma^2\right].$$
We first show how the convergence rate of each step-size method can be derived from Proposition 3.

Theorem 3. Under the same assumptions as Theorem 1, and assuming (iv) non-decreasing preconditioners and (v) bounded eigenvalues in the $[a_{\min}, a_{\max}]$ interval, where $\kappa = a_{\max}/a_{\min}$, AMSGrad with $\beta \in [0, 1)$, constant step-size $\eta = \frac{1 - \beta}{1 + \beta} \frac{a_{\min}}{2 L_{\max}}$ and uniform averaging converges at a rate
$$\mathbb{E}[f(\bar{w}_T) - f^*] \le \left(\frac{1 + \beta}{1 - \beta}\right)^2 \left(\frac{2 L_{\max} D^2 d \kappa}{T} + \sigma^2\right).$$
Proof of Theorem 3. Using the bounded preconditioner and individual smoothness, we have
$$\eta \|\nabla f_{i_k}(w_k)\|^2_{A_k^{-1}} \le \frac{\eta}{a_{\min}} \|\nabla f_{i_k}(w_k)\|^2 \le \eta \frac{2 L_{\max}}{a_{\min}} \left(f_{i_k}(w_k) - f_{i_k}^*\right).$$
The constant step-size $\eta = \frac{1 - \beta}{1 + \beta} \frac{a_{\min}}{2 L_{\max}}$ satisfies the requirement of Proposition 3 (Eq. (10)) with $M = \frac{1 - \beta}{1 + \beta}$. The convergence is then
$$\mathbb{E}[f(\bar{w}_T) - f(w^*)] \le \frac{1 + \beta}{1 - \beta} \left(2 - \frac{1 + \beta}{1 - \beta} M\right)^{-1} \left[\frac{D^2 d a_{\max}}{\eta_{\min} T} + M \sigma^2\right] = \frac{1 + \beta}{1 - \beta} \left[\frac{D^2 d a_{\max}}{\frac{1 - \beta}{1 + \beta} \frac{a_{\min}}{2 L_{\max}} T} + \frac{1 - \beta}{1 + \beta} \sigma^2\right] \le \left(\frac{1 + \beta}{1 - \beta}\right)^2 \left(\frac{2 L_{\max} D^2 d \kappa}{T} + \sigma^2\right),$$
with $\kappa = a_{\max}/a_{\min}$. Theorem 5.
Under the same assumptions as Theorem 1, and assuming (iv) non-decreasing preconditioners and (v) bounded eigenvalues in the $[a_{\min}, a_{\max}]$ interval with $\kappa = a_{\max}/a_{\min}$, AMSGrad with $\beta \in [0, 1)$, conservative Armijo SPS with $c = \frac{1 + \beta}{1 - \beta}$ and uniform averaging converges at a rate
$$\mathbb{E}[f(\bar{w}_T) - f^*] \le \left(\frac{1 + \beta}{1 - \beta}\right)^2 \left(\frac{2 L_{\max} D^2 d \kappa}{T} + \sigma^2\right).$$
Proof of Theorem 5. For Armijo SPS, Lemma 2 guarantees that
$$\eta_k \|\nabla f_{i_k}(w_k)\|^2_{A_k^{-1}} \le \frac{1}{c} \left(f_{i_k}(w_k) - f_{i_k}^*\right), \qquad \eta_k \ge \frac{a_{\min}}{2 c L_{\max}}.$$
Setting $c = \frac{1 + \beta}{1 - \beta}$ ensures that $M = 1/c$ satisfies the requirement of Proposition 3 and that $\eta_{\min} \ge \frac{1 - \beta}{1 + \beta} \frac{a_{\min}}{2 L_{\max}}$. Plugging these values into Proposition 3 completes the proof.

Theorem 10. Under the assumptions of Theorem 1, and assuming (iv) non-decreasing preconditioners and (v) bounded eigenvalues in the $[a_{\min}, a_{\max}]$ interval, AMSGrad with momentum parameter $\beta \in [0, 1/5)$, conservative Armijo SLS with $c = \frac{2}{3} \frac{1 + \beta}{1 - \beta}$ and uniform averaging converges at a rate
$$\mathbb{E}[f(\bar{w}_T) - f^*] \le 3 \frac{1 + \beta}{1 - 5\beta} \frac{L_{\max} D^2 d \kappa}{T} + 3 \sigma^2.$$
Proof of Theorem 10. For Armijo SLS, Lemma 1 guarantees that
$$\eta_k \|\nabla f_{i_k}(w_k)\|^2_{A_k^{-1}} \le \frac{1}{c} \left(f_{i_k}(w_k) - f_{i_k}^*\right), \qquad \eta_k \ge \frac{2 (1 - c) a_{\min}}{L_{\max}}.$$
The line-search parameter $c$ is restricted to $(0, 1)$ and relates to the parameter $M$ of Proposition 3 (Eq. (10)) through $M = 1/c$. The combined requirements on $M$ are then $1 < M < 2 \frac{1 - \beta}{1 + \beta}$, which is only feasible if $\beta < 1/3$. To leave room to satisfy the constraints, let $\beta < 1/5$. Setting $\frac{1}{c} = M = \frac{3}{2} \frac{1 - \beta}{1 + \beta}$ satisfies the constraints and the requirement of Proposition 3, and
$$\mathbb{E}[f(\bar{w}_T) - f(w^*)] \le \frac{1 + \beta}{1 - \beta} \left(2 - \frac{1 + \beta}{1 - \beta} M\right)^{-1} \left[\frac{D^2 d a_{\max}}{\eta_{\min} T} + M \sigma^2\right] = \frac{1 + \beta}{1 - \beta} \left(2 - \frac{3}{2}\right)^{-1} \left[\frac{L_{\max}}{2 (1 - c) a_{\min}} \frac{D^2 d a_{\max}}{T} + \frac{3}{2} \frac{1 - \beta}{1 + \beta} \sigma^2\right]$$
$$= \frac{1 + \beta}{1 - \beta} \frac{L_{\max} D^2 d \kappa}{(1 - c) T} + 3 \sigma^2 = 3 \frac{1 + \beta}{1 - 5\beta} \frac{L_{\max} D^2 d \kappa}{T} + 3 \sigma^2,$$
where the last step substitutes
$$1 - c = 1 - \frac{2}{3} \frac{1 + \beta}{1 - \beta} = \frac{3(1 - \beta) - 2(1 + \beta)}{3(1 - \beta)} = \frac{1}{3} \frac{1 - 5\beta}{1 - \beta}.$$
Before diving into the proof of Proposition 3, we prove the following lemma. Lemma 7.
For any vectors $a, b, c, d$, if $a = b + c$, then
$$\|a - d\|^2 = \|b - d\|^2 - \|a - b\|^2 + 2\langle c, a - d\rangle.$$

Proof.
$$\|a - d\|^2 = \|b + c - d\|^2 = \|b - d\|^2 + 2\langle c, b - d\rangle + \|c\|^2.$$
Since $c = a - b$,
$$\begin{aligned}
&= \|b - d\|^2 + 2\langle a - b, b - d\rangle + \|a - b\|^2 \\
&= \|b - d\|^2 + 2\langle a - b, b - a + a - d\rangle + \|a - b\|^2 \\
&= \|b - d\|^2 + 2\langle a - b, b - a\rangle + 2\langle a - b, a - d\rangle + \|a - b\|^2 \\
&= \|b - d\|^2 - 2\|a - b\|^2 + 2\langle a - b, a - d\rangle + \|a - b\|^2 \\
&= \|b - d\|^2 - \|a - b\|^2 + 2\langle c, a - d\rangle.
\end{aligned}$$

We now move to the proof of the main proposition. Our proof follows the structure of Reddi et al. (2018) and Alacaoglu et al. (2020).

Proof of Proposition 3. To reduce clutter, let $P_k = A_k/\eta_k$. Using the update, we have the expansion
$$w_{k+1} - w^* = w_k - P_k^{-1} m_k - w^* = w_k - (1-\beta) P_k^{-1} \nabla f_{i_k}(w_k) - \beta P_k^{-1} m_{k-1} - w^*.$$
Measuring distances in the $\|\cdot\|_{P_k}$-norm, where $\|x\|^2_{P_k} = \langle x, P_k x\rangle$,
$$\|w_{k+1} - w^*\|^2_{P_k} = \|w_k - w^*\|^2_{P_k} - 2(1-\beta)\langle w_k - w^*, \nabla f_{i_k}(w_k)\rangle - 2\beta\langle w_k - w^*, m_{k-1}\rangle + \|m_k\|^2_{P_k^{-1}}.$$
We separate the distance to $w^*$ from the momentum in the second inner product, using the update and Lemma 7 with $a = c = P_{k-1}^{1/2}(w_k - w^*)$, $b = 0$, $d = P_{k-1}^{1/2}(w_{k-1} - w^*)$:
$$\begin{aligned}
-2\langle m_{k-1}, w_k - w^*\rangle &= -2\langle P_{k-1}(w_{k-1} - w_k), w_k - w^*\rangle \\
&= \|w_k - w_{k-1}\|^2_{P_{k-1}} + \|w_k - w^*\|^2_{P_{k-1}} - \|w_{k-1} - w^*\|^2_{P_{k-1}} \\
&= \|m_{k-1}\|^2_{P_{k-1}^{-1}} + \|w_k - w^*\|^2_{P_{k-1}} - \|w_{k-1} - w^*\|^2_{P_{k-1}} \\
&\le \|m_{k-1}\|^2_{P_{k-1}^{-1}} + \|w_k - w^*\|^2_{P_k} - \|w_{k-1} - w^*\|^2_{P_{k-1}},
\end{aligned}$$
where the last inequality uses the fact that $\eta_k \le \eta_{k-1}$ and $A_k \succeq A_{k-1}$, which imply $P_k \succeq P_{k-1}$ and thus $\|w_k - w^*\|^2_{P_{k-1}} \le \|w_k - w^*\|^2_{P_k}$. Plugging this inequality in and grouping terms yields
$$2(1-\beta)\langle w_k - w^*, \nabla f_{i_k}(w_k)\rangle \le \left[\|w_k - w^*\|^2_{P_k} - \|w_{k+1} - w^*\|^2_{P_k}\right] + \beta\left[\|w_k - w^*\|^2_{P_k} - \|w_{k-1} - w^*\|^2_{P_{k-1}}\right] + \beta\|m_{k-1}\|^2_{P_{k-1}^{-1}} + \|m_k\|^2_{P_k^{-1}}.$$
By convexity, the inner product on the left-hand side is bounded by $\langle w_k - w^*, \nabla f_{i_k}(w_k)\rangle \ge f_{i_k}(w_k) - f_{i_k}(w^*)$. The first two brackets on the right-hand side will telescope when we sum over iterations, so we only need to treat the norms of the momentum terms.
We introduce a free parameter $\delta \ge 0$, used only for the analysis, and expand
$$\beta\|m_{k-1}\|^2_{P_{k-1}^{-1}} + \|m_k\|^2_{P_k^{-1}} = \beta\|m_{k-1}\|^2_{P_{k-1}^{-1}} + (1+\delta)\|m_k\|^2_{P_k^{-1}} - \delta\|m_k\|^2_{P_k^{-1}}.$$
To bound $\|m_k\|^2_{P_k^{-1}}$, we expand it by its update and use Young's inequality to get
$$\|m_k\|^2_{P_k^{-1}} = \|\beta m_{k-1} + (1-\beta)\nabla f_{i_k}(w_k)\|^2_{P_k^{-1}} \le (1+\epsilon)\beta^2\|m_{k-1}\|^2_{P_k^{-1}} + (1+1/\epsilon)(1-\beta)^2\|\nabla f_{i_k}(w_k)\|^2_{P_k^{-1}},$$
where $\epsilon > 0$ is also a free parameter, introduced to control the trade-off in the bound. Plugging this bound into the momentum terms, we get
$$\beta\|m_{k-1}\|^2_{P_{k-1}^{-1}} + \|m_k\|^2_{P_k^{-1}} \le \beta\|m_{k-1}\|^2_{P_{k-1}^{-1}} + (1+\epsilon)(1+\delta)\beta^2\|m_{k-1}\|^2_{P_k^{-1}} - \delta\|m_k\|^2_{P_k^{-1}} + (1+1/\epsilon)(1+\delta)(1-\beta)^2\|\nabla f_{i_k}(w_k)\|^2_{P_k^{-1}}.$$
As $P_k^{-1} \preceq P_{k-1}^{-1}$, we have $\|m_{k-1}\|^2_{P_k^{-1}} \le \|m_{k-1}\|^2_{P_{k-1}^{-1}}$, which implies
$$\le \left(\beta + (1+\epsilon)(1+\delta)\beta^2\right)\|m_{k-1}\|^2_{P_{k-1}^{-1}} - \delta\|m_k\|^2_{P_k^{-1}} + (1+1/\epsilon)(1+\delta)(1-\beta)^2\|\nabla f_{i_k}(w_k)\|^2_{P_k^{-1}}.$$
To get a telescoping sum, we set $\delta$ equal to $\beta + (1+\epsilon)(1+\delta)\beta^2$, which is satisfied if $\delta = \frac{\beta + (1+\epsilon)\beta^2}{1 - (1+\epsilon)\beta^2}$, and $\delta > 0$ holds if $\beta < 1/\sqrt{1+\epsilon}$. We now plug the inequality
$$\beta\|m_{k-1}\|^2_{P_{k-1}^{-1}} + \|m_k\|^2_{P_k^{-1}} \le \delta\left[\|m_{k-1}\|^2_{P_{k-1}^{-1}} - \|m_k\|^2_{P_k^{-1}}\right] + (1+1/\epsilon)(1+\delta)(1-\beta)^2\|\nabla f_{i_k}(w_k)\|^2_{P_k^{-1}}$$
back into the previous expression to get
$$2(1-\beta)\left(f_{i_k}(w_k) - f_{i_k}(w^*)\right) \le \left[\|w_k - w^*\|^2_{P_k} - \|w_{k+1} - w^*\|^2_{P_k}\right] + \beta\left[\|w_k - w^*\|^2_{P_k} - \|w_{k-1} - w^*\|^2_{P_{k-1}}\right] + \delta\left[\|m_{k-1}\|^2_{P_{k-1}^{-1}} - \|m_k\|^2_{P_k^{-1}}\right] + (1+1/\epsilon)(1+\delta)(1-\beta)^2\|\nabla f_{i_k}(w_k)\|^2_{P_k^{-1}}.$$
All terms now telescope except the gradient norm, which we bound using the step-size assumption:
$$\|\nabla f_{i_k}(w_k)\|^2_{P_k^{-1}} = \eta_k\|\nabla f_{i_k}(w_k)\|^2_{A_k^{-1}} \le M\left(f_{i_k}(w_k) - f^*_{i_k}\right) = M\left(f_{i_k}(w_k) - f_{i_k}(w^*)\right) + M\left(f_{i_k}(w^*) - f^*_{i_k}\right).$$
This gives the expression
$$\alpha\left(f_{i_k}(w_k) - f_{i_k}(w^*)\right) \le \left[\|w_k - w^*\|^2_{P_k} - \|w_{k+1} - w^*\|^2_{P_k}\right] + \beta\left[\|w_k - w^*\|^2_{P_k} - \|w_{k-1} - w^*\|^2_{P_{k-1}}\right] + \delta\left[\|m_{k-1}\|^2_{P_{k-1}^{-1}} - \|m_k\|^2_{P_k^{-1}}\right] + (1+1/\epsilon)(1+\delta)(1-\beta)^2 M\left(f_{i_k}(w^*) - f^*_{i_k}\right),$$
with $\alpha = 2(1-\beta) - (1+1/\epsilon)(1+\delta)(1-\beta)^2 M$.
Summing over all iterations, the individual terms are bounded using the bounded-iterates assumption and Lemma 3:
$$\sum_{k=1}^T \left[\|w_k - w^*\|^2_{P_k} - \|w_{k+1} - w^*\|^2_{P_k}\right] \le D^2\,\mathrm{Tr}(P_T) \le \frac{D^2}{\eta_{\min}}\,\mathrm{Tr}(A_T),$$
$$\beta\sum_{k=1}^T \left[\|w_k - w^*\|^2_{P_k} - \|w_{k-1} - w^*\|^2_{P_{k-1}}\right] \le \beta\|w_T - w^*\|^2_{P_T} \le \beta\frac{D^2}{\eta_{\min}}\,\mathrm{Tr}(A_T),$$
$$\delta\sum_{k=1}^T \left[\|m_{k-1}\|^2_{P_{k-1}^{-1}} - \|m_k\|^2_{P_k^{-1}}\right] \le \delta\|m_0\|^2_{P_0} = 0.$$
Using the boundedness of the preconditioners gives $\mathrm{Tr}(A_T) \le d\,a_{\max}$ and the total bound
$$\alpha\sum_{k=1}^T\left(f_{i_k}(w_k) - f_{i_k}(w^*)\right) \le (1+\beta)\frac{D^2 d\,a_{\max}}{\eta_{\min}} + (1+1/\epsilon)(1+\delta)(1-\beta)^2 M \sum_{k=1}^T\left(f_{i_k}(w^*) - f^*_{i_k}\right).$$
Taking expectations,
$$\alpha\sum_{k=1}^T \mathbb{E}[f(w_k) - f(w^*)] \le (1+\beta)\frac{D^2 d\,a_{\max}}{\eta_{\min}} + (1+1/\epsilon)(1+\delta)(1-\beta)^2 M \sigma^2 T.$$
It remains to expand $\alpha$ and simplify the constants. We had defined
$$\alpha = 2(1-\beta) - (1+1/\epsilon)(1+\delta)(1-\beta)^2 M > 0, \qquad \delta = \frac{\beta + (1+\epsilon)\beta^2}{1-(1+\epsilon)\beta^2} > 0,$$
where $\epsilon > 0$ is a free parameter. This puts the requirement on $\beta$ that $\beta < 1/\sqrt{1+\epsilon}$. To simplify the bounds, we set $\beta = 1/(1+\epsilon)$, i.e. $\epsilon = 1/\beta - 1$, which gives the substitutions
$$1 + \epsilon = \frac{1}{\beta}, \qquad 1 + \frac{1}{\epsilon} = \frac{1}{1-\beta}, \qquad \delta = \frac{2\beta}{1-\beta}, \qquad 1 + \delta = \frac{1+\beta}{1-\beta}.$$
Plugging these into the rate gives
$$\alpha\sum_{k=1}^T \mathbb{E}[f(w_k) - f(w^*)] \le (1+\beta)\frac{D^2 d\,a_{\max}}{\eta_{\min}} + (1+\beta) M \sigma^2 T,$$
while plugging them into $\alpha$ gives
$$\alpha = 2(1-\beta) - (1+1/\epsilon)(1+\delta)(1-\beta)^2 M = (1-\beta)\left(2 - \frac{1+\beta}{1-\beta} M\right),$$
which is positive if $M < 2\frac{1-\beta}{1+\beta}$. Dividing by $\alpha T$, using Jensen's inequality and averaging finishes the proof, with the rate
$$\mathbb{E}[f(\bar{w}_T) - f(w^*)] \le \frac{1+\beta}{1-\beta}\left(2 - \frac{1+\beta}{1-\beta} M\right)^{-1}\left[\frac{D^2 d\,a_{\max}}{\eta_{\min} T} + M\sigma^2\right].$$
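The two elementary facts driving this proof, the expansion of Lemma 7 and the Young's-inequality bound on the momentum norm, can be sanity-checked numerically. The following is a small illustrative script (not part of the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Lemma 7: if a = b + c, then ||a - d||^2 = ||b - d||^2 - ||a - b||^2 + 2<c, a - d>.
b, c, d = rng.standard_normal((3, 5))
a = b + c
lemma_lhs = np.sum((a - d) ** 2)
lemma_rhs = np.sum((b - d) ** 2) - np.sum((a - b) ** 2) + 2 * np.dot(c, a - d)
assert np.isclose(lemma_lhs, lemma_rhs)

# Young's inequality bound used on the momentum term:
# ||beta*m + (1-beta)*g||^2 <= (1+eps)*beta^2*||m||^2 + (1+1/eps)*(1-beta)^2*||g||^2.
m, g = rng.standard_normal((2, 5))
beta, eps = 0.9, 0.5
lhs = np.sum((beta * m + (1 - beta) * g) ** 2)
rhs = (1 + eps) * beta**2 * np.sum(m**2) + (1 + 1 / eps) * (1 - beta) ** 2 * np.sum(g**2)
assert lhs <= rhs
```

Both assertions hold for any vectors, since they are algebraic identities/inequalities rather than properties of the optimization problem.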

E.3 PROOFS FOR AMSGRAD WITH HEAVY BALL MOMENTUM

We now give the proofs for AMSGrad with heavy-ball momentum, with the update
$$w_{k+1} = w_k - \eta_k A_k^{-1}\nabla f_{i_k}(w_k) + \gamma(w_k - w_{k-1}).$$
We analyze it in the smooth setting using a constant step-size (Theorem 11), conservative Armijo SPS (Theorem 12) and conservative Armijo SLS (Theorem 13). As before, we abstract the common elements into a general proposition and specialize it for each theorem.

Proposition 4. In addition to the assumptions of Theorem 1, assume that (iv) the preconditioners are non-decreasing and have (v) eigenvalues bounded in the $[a_{\min}, a_{\max}]$ range. If the step-sizes are lower-bounded and non-increasing, $\eta_{\min} \le \eta_k \le \eta_{k-1}$, and satisfy
$$\eta_k\|\nabla f_{i_k}(w_k)\|^2_{A_k^{-1}} \le M\left(f_{i_k}(w_k) - f^*_{i_k}\right)$$
for some $M < 2 - 2\gamma$, then AMSGrad with heavy-ball momentum with parameter $\gamma < 1$ and uniform averaging $\bar{w}_T = \frac{1}{T}\sum_{k=1}^T w_k$ leads to the rate
$$\mathbb{E}[f(\bar{w}_T) - f^*] \le \frac{1}{2 - 2\gamma - M}\left[\frac{1}{T}\left(\frac{2(1+\gamma^2) D^2 a_{\max} d}{\eta_{\min}} + 2\gamma\left[f(w_0) - f(w^*)\right]\right) + M\sigma^2\right].$$

We first show how the convergence rate for each step-size method can be derived from Proposition 4.

Theorem 11. Under the assumptions of Theorem 1 and assuming (iv) non-decreasing preconditioners and (v) eigenvalues bounded in the $[a_{\min}, a_{\max}]$ range, AMSGrad with heavy-ball momentum with parameter $\gamma \in [0, 1)$, constant step-size $\eta = \frac{2 a_{\min}(1-\gamma)}{3 L_{\max}}$ and uniform averaging converges at the rate
$$\mathbb{E}[f(\bar{w}_T) - f^*] \le \frac{1}{T}\left[\frac{9}{2}\,\frac{1+\gamma^2}{(1-\gamma)^2}\, L_{\max} D^2 \kappa d + \frac{3\gamma}{1-\gamma}\left[f(w_0) - f(w^*)\right]\right] + 2\sigma^2.$$

Proof of Theorem 11. Using the bounded-preconditioner and individual-smoothness assumptions, we have
$$\eta\|\nabla f_{i_k}(w_k)\|^2_{A_k^{-1}} \le \frac{\eta}{a_{\min}}\|\nabla f_{i_k}(w_k)\|^2 \le \eta\,\frac{2 L_{\max}}{a_{\min}}\left(f_{i_k}(w_k) - f^*_{i_k}\right).$$
A constant step-size $\eta = \frac{2 a_{\min}(1-\gamma)}{3 L_{\max}}$ means the requirement of Proposition 4 is satisfied with $M = \frac{4}{3}(1-\gamma)$. Plugging $2 - 2\gamma - M = \frac{2}{3}(1-\gamma)$ into Proposition 4 finishes the proof.

Theorem 12.
Under the assumptions of Theorem 1 and assuming (iv) non-decreasing preconditioners and (v) eigenvalues bounded in the $[a_{\min}, a_{\max}]$ interval, AMSGrad with heavy-ball momentum with parameter $\gamma \in [0, 1)$, conservative Armijo SPS with $c = \frac{3}{4(1-\gamma)}$ and uniform averaging converges at the rate
$$\mathbb{E}[f(\bar{w}_T) - f^*] \le \frac{1}{T}\left[\frac{9}{2}\,\frac{1+\gamma^2}{(1-\gamma)^2}\, L_{\max} D^2 \kappa d + \frac{3\gamma}{1-\gamma}\left[f(w_0) - f(w^*)\right]\right] + 2\sigma^2.$$

Proof of Theorem 12. For Armijo SPS, Lemma 2 guarantees that
$$\eta_k\|\nabla f_{i_k}(w_k)\|^2_{A_k^{-1}} \le \frac{1}{c}\left(f_{i_k}(w_k) - f^*_{i_k}\right), \qquad \frac{a_{\min}}{2 c L_{\max}} \le \eta_k.$$
Selecting $c = \frac{3}{4(1-\gamma)}$ gives $M = \frac{4}{3}(1-\gamma) \le 2(1-\gamma)$, and the requirements of Proposition 4 are satisfied. The minimum step-size is then $\eta_{\min} = \frac{a_{\min}}{2 c L_{\max}} = \frac{2 a_{\min}(1-\gamma)}{3 L_{\max}}$, so $\eta_{\min}$ and $M$ are the same as in the constant step-size case (Theorem 11) and the same rate applies.

Theorem 13. Under the assumptions of Theorem 1 and assuming (iv) non-decreasing preconditioners and (v) eigenvalues bounded in the $[a_{\min}, a_{\max}]$ interval, AMSGrad with heavy-ball momentum with parameter $\gamma \in [0, 1/4)$, conservative Armijo SLS with $c = \frac{3}{4(1-\gamma)}$ and uniform averaging converges at the rate
$$\mathbb{E}[f(\bar{w}_T) - f^*] \le \frac{1}{T}\left[6\,\frac{1+\gamma^2}{1-4\gamma}\, L_{\max} D^2 \kappa d + \frac{3\gamma}{1-\gamma}\left[f(w_0) - f(w^*)\right]\right] + 2\sigma^2.$$

Proof of Theorem 13. Selecting $c = \frac{3}{4(1-\gamma)}$ is feasible if $\gamma < 1/4$, as then $c < 1$. The Armijo SLS (Lemma 1) then guarantees that
$$\eta_k\|\nabla f_{i_k}(w_k)\|^2_{A_k^{-1}} \le \frac{1}{c}\left(f_{i_k}(w_k) - f^*_{i_k}\right), \qquad \frac{2(1-c)\,a_{\min}}{L_{\max}} \le \eta_k,$$
which satisfies the requirements of Proposition 4 with $M = \frac{4}{3}(1-\gamma)$. Plugging $M$ into the rate yields
$$\mathbb{E}[f(\bar{w}_T) - f(w^*)] \le \frac{1}{T}\left[3\,\frac{1+\gamma^2}{1-\gamma}\,\frac{D^2 a_{\max} d}{\eta_{\min}} + \frac{3\gamma}{1-\gamma}\left[f(w_0) - f(w^*)\right]\right] + 2\sigma^2.$$
With $c = \frac{3}{4(1-\gamma)}$, we have $\eta_{\min} \ge \frac{2(1-c)\,a_{\min}}{L_{\max}} = \frac{2 a_{\min}}{L_{\max}}\,\frac{1-4\gamma}{4(1-\gamma)}$. Plugging this into the bound above yields
$$\mathbb{E}[f(\bar{w}_T) - f(w^*)] \le \frac{1}{T}\left[6\,\frac{1+\gamma^2}{1-4\gamma}\, L_{\max} D^2 \kappa d + \frac{3\gamma}{1-\gamma}\left[f(w_0) - f(w^*)\right]\right] + 2\sigma^2.$$

We now move to the proof of the main proposition. Our proof follows the structure of Ghadimi et al. (2015) and Sebbouh et al. (2020).

Proof of Proposition 4.
Recall the update for AMSGrad with heavy-ball momentum,
$$w_{k+1} = w_k - \eta_k A_k^{-1}\nabla f_{i_k}(w_k) + \gamma(w_k - w_{k-1}).$$
The proof idea is to analyze the distance from $w^*$ to $w_k$ plus a momentum term,
$$\|\delta_k\|^2 = \|w_k + m_k - w^*\|^2_{A_k}, \qquad \text{where } m_k = \frac{\gamma}{1-\gamma}(w_k - w_{k-1}),$$
by considering the momentum update (Eq. 12) as a preconditioned step on the joint iterates $w_k + m_k$:
$$w_{k+1} + m_{k+1} = w_k + m_k - \frac{\eta_k}{1-\gamma} A_k^{-1}\nabla f_{i_k}(w_k). \qquad (14)$$
Let us verify Eq. (14). First, expressing $w_{k+1} + m_{k+1}$ as a weighted difference of $w_{k+1}$ and $w_k$,
$$w_{k+1} + m_{k+1} = w_{k+1} + \frac{\gamma}{1-\gamma}(w_{k+1} - w_k) = \frac{1}{1-\gamma} w_{k+1} - \frac{\gamma}{1-\gamma} w_k.$$
Expanding $w_{k+1}$ with the update rule then gives
$$\begin{aligned}
&= \frac{1}{1-\gamma}\left(w_k - \eta_k A_k^{-1}\nabla f_{i_k}(w_k) + \gamma(w_k - w_{k-1})\right) - \frac{\gamma}{1-\gamma} w_k \\
&= \frac{1}{1-\gamma}\left(w_k - \eta_k A_k^{-1}\nabla f_{i_k}(w_k) - \gamma w_{k-1}\right) \\
&= \frac{1}{1-\gamma} w_k - \frac{\gamma}{1-\gamma} w_{k-1} - \frac{\eta_k}{1-\gamma} A_k^{-1}\nabla f_{i_k}(w_k),
\end{aligned}$$
which can be re-written as $w_k + m_k - \frac{\eta_k}{1-\gamma} A_k^{-1}\nabla f_{i_k}(w_k)$. The analysis of the method then follows similar steps to the analysis without momentum. Using Eq. (14), we have the recurrence
$$\|\delta_{k+1}\|^2_{A_k} = \|w_{k+1} + m_{k+1} - w^*\|^2_{A_k} = \left\|w_k + m_k - \frac{\eta_k}{1-\gamma} A_k^{-1}\nabla f_{i_k}(w_k) - w^*\right\|^2_{A_k} = \|\delta_k\|^2_{A_k} - \frac{2\eta_k}{1-\gamma}\langle\nabla f_{i_k}(w_k), w_k + m_k - w^*\rangle + \frac{\eta_k^2}{(1-\gamma)^2}\|\nabla f_{i_k}(w_k)\|^2_{A_k^{-1}}. \qquad (15)$$
To bound the inner product, we use individual convexity to relate it to the optimality gap:
$$\begin{aligned}
\langle\nabla f_{i_k}(w_k), w_k + m_k - w^*\rangle &= \langle\nabla f_{i_k}(w_k), w_k - w^*\rangle + \frac{\gamma}{1-\gamma}\langle\nabla f_{i_k}(w_k), w_k - w_{k-1}\rangle \\
&\ge f_{i_k}(w_k) - f_{i_k}(w^*) + \frac{\gamma}{1-\gamma}\left[f_{i_k}(w_k) - f_{i_k}(w_{k-1})\right] \\
&= \frac{1}{1-\gamma}\left[f_{i_k}(w_k) - f_{i_k}(w^*)\right] - \frac{\gamma}{1-\gamma}\left[f_{i_k}(w_{k-1}) - f_{i_k}(w^*)\right].
\end{aligned}$$
To bound the gradient norm, we use the step-size assumption
$$\eta_k\|\nabla f_{i_k}(w_k)\|^2_{A_k^{-1}} \le M\left[f_{i_k}(w_k) - f^*_{i_k}\right] = M\left[f_{i_k}(w_k) - f_{i_k}(w^*)\right] + M\left[f_{i_k}(w^*) - f^*_{i_k}\right].$$
For simplicity of notation, define the shortcuts $h_k(w) = f_{i_k}(w) - f_{i_k}(w^*)$ and $\sigma^2_k = f_{i_k}(w^*) - f^*_{i_k}$. Plugging these two inequalities into the recursion of Eq.
(15) gives
$$\|\delta_{k+1}\|^2_{A_k} \le \|\delta_k\|^2_{A_k} - \frac{\eta_k}{(1-\gamma)^2}(2 - M)\, h_k(w_k) + \frac{2\eta_k\gamma}{(1-\gamma)^2}\, h_k(w_{k-1}) + \frac{M\eta_k}{(1-\gamma)^2}\sigma^2_k.$$
We can now divide by $\eta_k/(1-\gamma)^2$ and reorganize the inequality as
$$(2 - M)\, h_k(w_k) - 2\gamma\, h_k(w_{k-1}) \le \frac{(1-\gamma)^2}{\eta_k}\left[\|\delta_k\|^2_{A_k} - \|\delta_{k+1}\|^2_{A_k}\right] + M\sigma^2_k.$$
Taking the average over all iterations, the inequality yields
$$\frac{1}{T}\sum_{k=1}^T\left[(2-M)\, h_k(w_k) - 2\gamma\, h_k(w_{k-1})\right] \le \frac{1}{T}\sum_{k=1}^T\left[\frac{(1-\gamma)^2}{\eta_k}\left(\|\delta_k\|^2_{A_k} - \|\delta_{k+1}\|^2_{A_k}\right) + M\sigma^2_k\right].$$
To bound the right-hand side, under the assumption that the iterates are bounded, $\|w_k - w^*\| \le D$, we use Young's inequality to get a bound on $\|\delta_k\|^2_2$:
$$\|\delta_k\|^2_2 = \|w_k + m_k - w^*\|^2_2 = \left\|\frac{1}{1-\gamma}(w_k - w^*) - \frac{\gamma}{1-\gamma}(w_{k-1} - w^*)\right\|^2_2 \le \frac{2}{(1-\gamma)^2}\left[\|w_k - w^*\|^2_2 + \gamma^2\|w_{k-1} - w^*\|^2_2\right] \le \frac{2(1+\gamma^2)}{(1-\gamma)^2} D^2 =: \Delta^2.$$
Given the upper bound $\|\delta_k\|_2 \le \Delta$, a reorganization of the sum lets us apply Lemma 3:
$$\sum_{k=1}^T \frac{1}{\eta_k}\left[\|\delta_k\|^2_{A_k} - \|\delta_{k+1}\|^2_{A_k}\right] = \sum_{k=1}^T \|\delta_k\|^2_{\frac{1}{\eta_k}A_k} - \sum_{k=2}^{T+1} \|\delta_k\|^2_{\frac{1}{\eta_{k-1}}A_{k-1}} \le \sum_{k=1}^T \|\delta_k\|^2_{\frac{1}{\eta_k}A_k - \frac{1}{\eta_{k-1}}A_{k-1}} + \|\delta_1\|^2_{\frac{1}{\eta_0}A_0} \le \Delta^2\,\frac{a_{\max} d}{\eta_{\min}},$$
where the last step uses the convention $A_0 = 0$ and applies Lemma 3 to $\delta_k$ instead of $w_k - w^*$. Plugging this inequality in, we get the simpler bound on the right-hand side
$$\frac{1}{T}\sum_{k=1}^T\left[(2-M)\, h_k(w_k) - 2\gamma\, h_k(w_{k-1})\right] \le \frac{2(1+\gamma^2) D^2 a_{\max} d}{T\,\eta_{\min}} + \frac{M}{T}\sum_{k=1}^T \sigma^2_k.$$
Now that the step-size is bounded deterministically, we can take the expectation on both sides to get
$$\frac{1}{T}\,\mathbb{E}\left[\sum_{k=1}^T (2-M)\, h(w_k) - 2\gamma\, h(w_{k-1})\right] \le \frac{2(1+\gamma^2) D^2 a_{\max} d}{T\,\eta_{\min}} + M\sigma^2,$$
where $h(w) = f(w) - f^*$ and $\sigma^2 = \mathbb{E}\left[f_{i_k}(w^*) - f^*_{i_k}\right]$. To simplify the left-hand side, we change the weights on the optimality gaps to get a telescoping sum:
$$\begin{aligned}
\sum_{k=1}^T\left[(2-M)\, h(w_k) - 2\gamma\, h(w_{k-1})\right] &= \sum_{k=1}^T\left[(2 - 2\gamma - M)\, h(w_k) + 2\gamma\, h(w_k) - 2\gamma\, h(w_{k-1})\right] \\
&= (2-2\gamma-M)\sum_{k=1}^T h(w_k) + 2\gamma\left(h(w_T) - h(w_0)\right) \\
&\ge (2-2\gamma-M)\sum_{k=1}^T h(w_k) - 2\gamma\, h(w_0).
\end{aligned}$$
The last inequality uses $h(w_T) \ge 0$.
Moving the initial optimality gap to the right-hand side, we get
$$\frac{2-2\gamma-M}{T}\,\mathbb{E}\left[\sum_{k=1}^T h(w_k)\right] \le \frac{1}{T}\left[\frac{2(1+\gamma^2) D^2 a_{\max} d}{\eta_{\min}} + 2\gamma\, h(w_0)\right] + M\sigma^2.$$
Assuming $2 - 2\gamma - M > 0$ and dividing,
$$\frac{1}{T}\,\mathbb{E}\left[\sum_{k=1}^T h(w_k)\right] \le \frac{1}{2-2\gamma-M}\left[\frac{1}{T}\left(\frac{2(1+\gamma^2) D^2 a_{\max} d}{\eta_{\min}} + 2\gamma\, h(w_0)\right) + M\sigma^2\right].$$
Using Jensen's inequality and averaging the iterates finishes the proof.
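The joint-iterate reformulation underlying this proof (Eq. 14) is easy to check numerically. The following is a small NumPy sketch using a fixed positive diagonal preconditioner (illustrative only; the full AMSGrad preconditioner update is not reproduced here):

```python
import numpy as np

# Check that the heavy-ball update is a preconditioned step on the joint
# iterates (Eq. 14): with m_k = gamma/(1-gamma) * (w_k - w_{k-1}),
#   w_{k+1} + m_{k+1} = w_k + m_k - eta/(1-gamma) * A^{-1} grad.
rng = np.random.default_rng(2)
d = 4
w_prev, w, grad = rng.standard_normal((3, d))
A_diag = 1.0 + rng.random(d)          # positive diagonal preconditioner
eta, gamma = 0.1, 0.8

# Heavy-ball update: w_{k+1} = w_k - eta * A^{-1} grad + gamma * (w_k - w_{k-1}).
w_next = w - eta * grad / A_diag + gamma * (w - w_prev)

m = gamma / (1 - gamma) * (w - w_prev)
m_next = gamma / (1 - gamma) * (w_next - w)
lhs = w_next + m_next
rhs = w + m - eta / (1 - gamma) * grad / A_diag
assert np.allclose(lhs, rhs)
```

The identity holds exactly for any iterates, gradient, and positive preconditioner, as the algebraic verification above shows.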

F EXPERIMENTAL DETAILS

Our proposed adaptive gradient methods with SLS and SPS step-sizes are presented in Algorithms 1 and 3. We now make a few additional remarks on their practical use.

Algorithm 1 Adaptive methods with SLS$(f, \text{precond}, \beta, \text{conservative}, \text{mode}, w_0, \eta_{\max}, b, c \in (0,1), \gamma < 1)$
1: for $k = 0, \dots, T-1$ do
2:   $i_k \leftarrow$ sample mini-batch of size $b$
3:   $A_k \leftarrow \text{precond}(k)$  {Form the preconditioner.}
4:   if mode == Lipschitz then
5:     $p_k \leftarrow \nabla f_{i_k}(w_k)$
6:   else if mode == Armijo then
7:     $p_k \leftarrow A_k^{-1} \nabla f_{i_k}(w_k)$
8:   end if
...: (SLS) backtrack while $f_{i_k}(w_k - \eta_k p_k) > f_{i_k}(w_k) - c\,\eta_k \langle \nabla f_{i_k}(w_k), p_k\rangle$
...: (SPS) $\eta_k \leftarrow \min\left(\frac{f_{i_k}(w_k) - f^*_{i_k}}{c\,\langle \nabla f_{i_k}(w_k), p_k\rangle},\, \eta_B\right)$
19: $m_k \leftarrow \beta m_{k-1} + (1-\beta)\nabla f_{i_k}(w_k)$
20: $w_{k+1} \leftarrow w_k - \eta_k A_k^{-1} m_k$
21: end for
22: return $w_T$

As suggested by Vaswani et al. (2019b), the standard backtracking search can sometimes result in step-sizes that are too small, while taking bigger steps can yield faster convergence. We therefore adopt their strategies for resetting the initial step-size at every iteration (Algorithm 2). In particular, reset option 0 corresponds to starting every backtracking line-search from the step-size used at the previous iteration. Since backtracking never increases the step-size, this option automatically satisfies the "conservative step-size" constraint for the Lipschitz line-search. For the Armijo line-search, we use the heuristic from Vaswani et al. (2019b) corresponding to reset option 1. This option begins every backtracking search with a slightly larger step-size than the one used at the previous iteration (larger by a factor of $\gamma^{b/n}$, with $\gamma = 2$ throughout our experiments), and works consistently well across our experiments. Although we do not have theoretical guarantees for Armijo SLS with general preconditioners such as Adam, our experimental results indicate that this is a promising combination that also performs well in practice.

On the other hand, rather than being too conservative, the step-sizes produced by SPS can vary wildly between successive iterations, making convergence unstable. Loizou et al. (2020) suggested a smoothing procedure that limits the growth of the SPS step-size from one iteration to the next. We use this strategy in our experiments with $\tau = 2^{b/n}$ and show that both SPS and Armijo SPS work well.
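The step-size reset heuristics described above can be sketched as follows. This is a minimal illustration of reset options 0 and 1; the function name is ours, and the truncated option 2 from the listing is omitted:

```python
def reset_step_size(eta_prev, eta_max, k, b, n, gamma=2.0, opt=1):
    """Initial step-size for the backtracking search at iteration k.

    opt=0: start from the previous step-size (conservative; since backtracking
           never increases the step-size, the sequence is non-increasing).
    opt=1: start from a slightly larger step-size, eta_prev * gamma**(b/n),
           the Armijo heuristic of Vaswani et al. (2019b) with batch size b
           and dataset size n.
    """
    if k == 0:
        return eta_max
    if opt == 0:
        return eta_prev
    if opt == 1:
        return eta_prev * gamma ** (b / n)
    raise ValueError("unknown reset option")
```

With a typical batch fraction b/n of a few thousandths, option 1 grows the step-size by well under 1% per iteration, so the line-search only has to undo a small increase.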
For the convex experiments, for both SLS and SPS, we set $c = 0.5$ as suggested by the theory. For the non-convex experiments, we observe that all values of $c \in [0.1, 0.5]$ result in reasonably good performance, but we use the values suggested by Vaswani et al. (2019b) and Loizou et al. (2020), i.e. $c = 0.1$ for all adaptive methods using SLS and $c = 0.2$ for methods using SPS.
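To make the conservative Armijo SPS step-size used throughout the theorems concrete, here is a minimal NumPy sketch for a diagonal preconditioner. The function name and the diagonal simplification are illustrative, not the paper's implementation:

```python
import numpy as np

def conservative_armijo_sps(grad, A_diag, loss, loss_star, c, eta_prev):
    """Conservative Armijo SPS step-size for a diagonal preconditioner A_k.

    Returns eta_k = min( (f_ik(w_k) - f*_ik) / (c * ||grad||^2_{A^{-1}}), eta_prev ),
    so that eta_k * ||grad||^2_{A^{-1}} <= (1/c) * (f_ik(w_k) - f*_ik)
    holds and the step-sizes are non-increasing ("conservative").
    """
    grad_norm_sq = np.sum(grad ** 2 / A_diag)     # ||grad||^2 in the A^{-1} norm
    eta = (loss - loss_star) / (c * grad_norm_sq)
    return min(eta, eta_prev)
```

Taking the minimum with the previous step-size enforces the non-increasing step-size condition required by Propositions 3 and 4, while the Polyak-style numerator adapts the step to the current mini-batch suboptimality.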



A similar result also appears in the course notes of Orabona (2019). The difference between the exact and backtracking line-search is minimal: the bounds only change by a constant that depends on the backtracking parameter. This corresponds to the largest allowable step-size in the theorem without momentum. Unfortunately, the values of c suggested by the analysis incorporating momentum are too conservative.
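The backtracking Armijo line-search referred to above can be sketched as follows. This is a minimal illustration along a (possibly preconditioned) descent direction; the function name, shrink factor, and iteration cap are our assumptions:

```python
import numpy as np

def backtracking_armijo(f_ik, w, grad, direction, eta0, c=0.5, shrink=0.5, max_iter=50):
    """Backtracking Armijo line-search along `direction`.

    Shrinks eta geometrically until the Armijo condition holds:
        f_ik(w - eta * direction) <= f_ik(w) - c * eta * <grad, direction>.
    """
    eta = eta0
    fw = f_ik(w)
    slope = np.dot(grad, direction)
    for _ in range(max_iter):
        if f_ik(w - eta * direction) <= fw - c * eta * slope:
            break
        eta *= shrink
    return eta
```

On a quadratic $f(w) = \tfrac12\|w\|^2$ with $c = 1/2$ and the gradient direction, the Armijo condition accepts any $\eta \le 1$, so the search simply halves the initial step until it drops below 1.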



Figure 1: Synthetic experiments showing the impact of the step-size on the performance of AdaGrad and AMSGrad with varying step-sizes, including the PyTorch default, and of the SLS variants.

Figure 2: Comparing optimizers for multi-class classification with deep networks. Training loss (top) and validation accuracy (bottom) for CIFAR-10, CIFAR-100 and Tiny ImageNet.

Figure 3: Sketch of the line-search inequalities.

...: $m_k \leftarrow \beta m_{k-1} + (1-\beta)\nabla f_{i_k}(w_k)$
22: $w_{k+1} \leftarrow w_k - \eta_k A_k^{-1} m_k$
23: end for
24: return $w_T$

Algorithm 2 reset$(\eta, \eta_{\max}, k, b, n, \gamma, \text{opt})$
1: if $k = 0$ then
2:   return $\eta_{\max}$
3: else if opt = 0 then
4:   $\eta \leftarrow \eta$
5: else if opt = 1 then
6:   $\eta \leftarrow \eta \cdot \gamma^{b/n}$
7: else if opt = 2 then
8:

Figure 5: Runtime (in seconds/epoch) for optimization methods for multi-class classification using the deep network models in Fig. 2. Although the runtime per epoch is larger for the SLS/SPS variants, they require fewer epochs to reach the maximum test accuracy (Figure 2). This justifies the moderate increase in wall-clock time.

Figure 9: Comparison of optimization methods on convex objectives: binary classification on LIBSVM datasets using RBF kernel mappings. The kernel bandwidths are chosen by cross-validation following the protocol in Vaswani et al. (2019b). All line-search methods use $c = 1/2$ and the procedure described in Appendix F. The other methods use their default parameters. We observe the superior convergence of the SLS variants and the poor performance of the baselines.

Figure 10: Comparison of optimization methods for deep matrix factorization. Methods use the same hyper-parameter settings as above, and we examine the effect of over-parameterization on the problem $\min_{W_1, W_2} \mathbb{E}_{x \sim \mathcal{N}(0, I)} \|W_2 W_1 x - Ax\|^2$ (Vaswani et al., 2019b; Rolinek & Martius, 2018). We choose $A \in \mathbb{R}^{10 \times 6}$ with condition number $\kappa(A) = 10^{10}$ and control the over-parameterization via the rank $k$ (equal to 1, 4, 10) of $W_1 \in \mathbb{R}^{k \times 6}$ and $W_2 \in \mathbb{R}^{10 \times k}$. We also compare against the true model. In each case, we use a fixed dataset of 1000 samples. We observe that as the over-parameterization increases, the performance of all methods improves, with the methods equipped with SLS performing best.

Figure 11: Ablation study comparing variants of the basic optimizers for multi-class classification with deep networks. Training loss (top) and validation accuracy (bottom) for CIFAR-10, CIFAR-100 and Tiny ImageNet. We consider AdaGrad with AMSGrad-like momentum and do not find improvements in performance. We also benchmark the performance of AMSGrad without momentum, and observe that incorporating AMSGrad momentum does improve performance, whereas heavy-ball momentum has a minor, sometimes detrimental effect. We use SLS and Adam as benchmarks to study the effects of incorporating preconditioning vs. step-size adaptation.




Algorithm 3 Adaptive methods with SPS$(f, [f^*_i]_{i=1}^n, \text{precond}, \beta, \text{conservative}, \text{mode}, w_0, \eta_{\max}, b, c)$
1: for $k = 0, \dots, T-1$ do

G ADDITIONAL EXPERIMENTAL RESULTS

In this section, we present additional experimental results showing the effect of the step-size for adaptive gradient methods on a synthetic dataset (Fig. 4). We show the wall-clock times for the optimization methods (Fig. 5). We show the variation in the step-size for the SLS methods when training deep networks on both the CIFAR (Fig. 6) and ImageNet (Fig. 7) datasets. We evaluate these methods on easy non-convex objectives: classification on MNIST (Fig. 8) and deep matrix factorization (Fig. 10). We use deep matrix factorization to examine the effect of over-parameterization on the performance of the optimization methods, and check the methods' performance when minimizing convex objectives associated with binary classification using RBF kernels in Fig. 9. Finally, in Fig. 11, we quantify the gains of incorporating momentum in AMSGrad by comparing against the performance of AMSGrad without momentum.

Figure 6: Comparing optimization methods on image classification tasks using ResNet and DenseNet models on the CIFAR-10/100 datasets. For the SLS/SPS variants, refer to the experimental details in Appendix F. For Adam, we did a grid search and use the best step-size. We use the default hyper-parameters for the other baselines. We observe the consistently good performance of AdaGrad and AMSGrad with Armijo SLS. We also show the variation in the step-size and observe a cyclic pattern (Loshchilov & Hutter, 2017): an initial warm-up in the learning rate followed by a decrease or saturation to a small step-size (Goyal et al., 2017).

