ADAPTIVE GRADIENT METHODS CONVERGE FASTER WITH OVER-PARAMETERIZATION (AND YOU CAN DO A LINE-SEARCH)

Abstract

Adaptive gradient methods are typically used for training over-parameterized models capable of exactly fitting the data; we thus study their convergence in this interpolation setting. Under an interpolation assumption, we prove that AMSGrad with a constant step-size and momentum can converge to the minimizer at the faster O(1/T) rate for smooth, convex functions. Furthermore, in this setting, we show that AdaGrad can achieve an O(1) regret in the online convex optimization framework. When interpolation is only approximately satisfied, we show that constant step-size AMSGrad converges to a neighbourhood of the solution. On the other hand, we prove that AdaGrad is robust to the violation of interpolation and converges to the minimizer at the optimal rate. However, we demonstrate that even for simple, convex problems satisfying interpolation, the empirical performance of these methods heavily depends on the step-size and requires tuning. We alleviate this problem by using the stochastic line-search (SLS) and stochastic Polyak step-size (SPS) techniques to help these methods adapt to the function's local smoothness. Using these techniques, we prove that AdaGrad and AMSGrad do not require knowledge of problem-dependent constants and retain the convergence guarantees of their constant step-size counterparts. Experimentally, we show that these techniques improve the convergence and generalization performance across tasks, from binary classification with kernel mappings to classification with deep neural networks.

1. INTRODUCTION

Adaptive gradient methods such as AdaGrad (Duchi et al., 2011), RMSProp (Tieleman & Hinton, 2012), AdaDelta (Zeiler, 2012), Adam (Kingma & Ba, 2015), and AMSGrad (Reddi et al., 2018) are popular optimizers for training deep neural networks (Goodfellow et al., 2016). These methods scale well and exhibit good performance across problems, making them the default choice for many machine learning applications. Theoretically, these methods are usually studied in the non-smooth, online convex optimization setting (Duchi et al., 2011; Reddi et al., 2018), with recent extensions to the strongly-convex (Mukkamala & Hein, 2017; Wang et al., 2020; Xie et al., 2020) and non-convex settings (Li & Orabona, 2019; Ward et al., 2019; Zhou et al., 2018; Chen et al., 2019; Wu et al., 2019; Défossez et al., 2020; Staib et al., 2019). An online-to-batch reduction gives guarantees similar to those of stochastic gradient descent (SGD) in the offline setting (Cesa-Bianchi et al., 2004; Hazan & Kale, 2014; Levy et al., 2018). However, there are several discrepancies between the theory and application of these methods. Although the theory advocates using decreasing step-sizes for Adam, AMSGrad, and their variants (Kingma & Ba, 2015; Reddi et al., 2018), a constant step-size is typically used in practice (Paszke et al., 2019). Similarly, the standard analysis of these methods assumes a decreasing momentum parameter; in practice, however, the momentum is fixed. On the other hand, AdaGrad (Duchi et al., 2011) has been shown to be "universal" in that it attains the best known convergence rates in both the stochastic smooth and non-smooth settings (Levy et al., 2018), but its empirical performance is rather disappointing when training deep models (Kingma & Ba, 2015). Improving this empirical performance was indeed the main motivation behind Adam and the other methods (Tieleman & Hinton, 2012; Zeiler, 2012) that followed AdaGrad.
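As a concrete reference for the updates discussed above, here is a minimal sketch of the diagonal AdaGrad update with a constant step-size, run on a toy quadratic; the variable names, hyperparameter values, and toy problem are illustrative and not taken from the paper.

```python
import numpy as np

def adagrad_step(w, grad, accum, eta=0.5, eps=1e-8):
    """One diagonal AdaGrad update with a constant step-size eta.

    accum is the running sum of squared gradients; the effective
    per-coordinate step-size eta / sqrt(accum) shrinks only as much
    as the gradients themselves accumulate.
    """
    accum = accum + grad ** 2
    w = w - eta * grad / (np.sqrt(accum) + eps)
    return w, accum

# Toy quadratic satisfying interpolation: the gradient vanishes
# at the minimizer w* = (1, 2), so accum stays bounded and the
# iterates converge with a constant eta.
A = np.diag([2.0, 3.0])
w_star = np.array([1.0, 2.0])
w, accum = np.zeros(2), np.zeros(2)
for _ in range(2000):
    grad = A @ (w - w_star)  # gradient of 0.5 * (w - w*)^T A (w - w*)
    w, accum = adagrad_step(w, grad, accum)
```

Because the problem satisfies interpolation, the accumulated squared gradients converge to a finite value and the effective step-size does not vanish, which is the mechanism behind the faster rates discussed in this paper.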
Although these methods have better empirical performance, they are not guaranteed to converge to the solution with a constant step-size and momentum parameter. Another inconsistency is that although the standard theoretical results are for non-smooth functions, these methods are also extensively used in the easier, smooth setting. More importantly, adaptive gradient methods are generally used to train highly expressive, over-parameterized models (Zhang et al., 2017; Liang & Rakhlin, 2018) capable of interpolating the data. However, the standard theoretical analyses do not take advantage of these additional properties. On the other hand, a line of recent work (Schmidt & Le Roux, 2013; Jain et al., 2018; Ma et al., 2018; Liu & Belkin, 2020; Cevher & Vũ, 2019; Vaswani et al., 2019a;b; Wu et al., 2019; Loizou et al., 2020) focuses on the convergence of SGD in this interpolation setting. In the standard finite-sum case, interpolation implies that all the functions in the sum are minimized at the same solution. Under this additional assumption, these works show that SGD with a constant step-size converges to the minimizer at a faster rate for both convex and non-convex smooth functions. In this work, we aim to resolve some of these discrepancies between the theory and practice of adaptive gradient methods. To theoretically analyze these methods, we consider a simple setting: smooth, convex functions under interpolation. Using the intuition gained from the theory, we propose better techniques to adaptively set the step-size for these methods, dramatically improving their empirical performance when training over-parameterized models.
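In the finite-sum setting, the interpolation property referenced above can be written as follows; this is a standard formulation, and the notation here is ours rather than the paper's.

```latex
% Finite-sum objective over n component functions
f(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w), \qquad w^{\ast} \in \arg\min_{w} f(w).

% Interpolation: every component function is minimized at the shared
% solution, so all stochastic gradients vanish there:
\nabla f_i(w^{\ast}) = 0 \quad \text{for all } i \in \{1, \dots, n\}.
```

For an over-parameterized model that fits every training point exactly, each per-sample loss attains its minimum at the same parameters, which is why this condition is natural in that regime.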

1.1. BACKGROUND AND CONTRIBUTIONS

Constant step-size. We focus on the theoretical convergence of two adaptive gradient methods: AdaGrad and AMSGrad. For smooth, convex functions, Levy et al. (2018) prove that AdaGrad with a constant step-size adapts to the smoothness and the gradient noise, resulting in an O(1/T + ζ/√T) convergence rate, where T is the number of iterations and ζ² is a global bound on the variance of the stochastic gradients. This convergence rate matches that of SGD in the same setting (Moulines & Bach, 2011). In Section 3, we show that constant step-size AdaGrad also adapts to interpolation and prove an O(1/T + σ/√T) rate, where σ measures the extent to which interpolation is violated. In the over-parameterized setting, σ² can be much smaller than ζ² (Zhang & Zhou, 2019), implying faster convergence. When interpolation is exactly satisfied, σ² = 0 and we obtain an O(1/T) rate, while ζ² can still be large. In the online convex optimization framework, for smooth functions, we show that the regret of AdaGrad improves from O(√T) to O(1) when interpolation is satisfied, while retaining its O(√T)-regret guarantee in the general setting (Appendix C.2). Assuming its corresponding preconditioner remains bounded, we show that AMSGrad with a constant step-size and a constant momentum parameter also converges at an O(1/T) rate under interpolation (Section 4). However, unlike AdaGrad, it requires specific step-sizes that depend on the problem's smoothness. More generally, constant step-size AMSGrad converges to a neighbourhood of the solution, attaining an O(1/T + σ²) rate, which matches the rate of constant step-size SGD in the same setting (Schmidt & Le Roux, 2013; Vaswani et al., 2019a). When training over-parameterized models, this result provides some justification for the faster (O(1/T) vs. O(1/√T)) convergence of the AMSGrad variant typically used in practice.

Adaptive step-size.
Although AdaGrad converges at the same asymptotic rate for any step-size (up to constants), it is unclear how to choose this step-size without manually trying different values. Similarly, AMSGrad is sensitive to the step-size, converging only for a specific range in both theory and practice. In Section 5, we experimentally show that, even for simple convex problems, the step-size has a large impact on the empirical performance of AdaGrad and AMSGrad. To overcome this limitation, we use recent methods (Vaswani et al., 2019a; Loizou et al., 2020) that automatically set the step-size for SGD. These works use stochastic variants of the classical Armijo line-search (Armijo, 1966) or the Polyak step-size (Polyak, 1963) in the interpolation setting. We combine these techniques with adaptive gradient methods and show that a variant of the stochastic line-search (SLS) enables AdaGrad to adapt to the smoothness of the underlying function, resulting in faster empirical convergence while retaining its favourable convergence properties (Section 3). Similarly, AMSGrad with variants of SLS and SPS can match the convergence rate of its constant step-size counterpart, but without knowledge of the underlying smoothness properties (Section 4).

Experimental results. Finally, in Section 5, we benchmark our results against SGD variants with SLS (Vaswani et al., 2019b), SPS (Loizou et al., 2020), tuned Adam, and its recently proposed variants (Luo et al., 2019; Liu et al., 2020). We demonstrate that the proposed techniques for setting the step-size improve the empirical performance of adaptive gradient methods. These improvements are consistent across tasks, ranging from binary classification with a kernel mapping to multi-class classification using standard deep neural network architectures.
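To make the line-search mechanism concrete, here is a minimal sketch of one SGD step with a backtracking stochastic Armijo line-search; the function names, default constants, and toy objective are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def sls_step(w, batch_loss, batch_grad, eta_max=1.0, c=0.5, beta=0.8,
             max_backtracks=100):
    """One SGD step with a stochastic Armijo line-search (SLS-style).

    batch_loss and batch_grad evaluate the loss/gradient of the *same*
    sampled mini-batch throughout the backtracking loop. Starting from
    eta_max, eta is shrunk by beta until the Armijo condition holds:
        batch_loss(w - eta * g) <= batch_loss(w) - c * eta * ||g||^2
    """
    g = batch_grad(w)
    fw = batch_loss(w)
    eta = eta_max
    for _ in range(max_backtracks):
        if batch_loss(w - eta * g) <= fw - c * eta * np.dot(g, g):
            break
        eta *= beta
    return w - eta * g, eta

# Usage on a smooth toy objective 0.5 * ||w - 1||^2 (L = 1): with c = 0.5,
# the Armijo condition accepts eta = 1 and the step lands on the minimizer.
loss = lambda w: 0.5 * np.sum((w - 1.0) ** 2)
grad = lambda w: w - 1.0
w, eta = sls_step(np.zeros(3), loss, grad)
```

The design point is that the accepted step-size automatically scales with the local smoothness of the sampled function, which is what lets the adaptive methods studied here avoid knowledge of problem-dependent constants.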

