PROVABLE BENEFIT OF ADAPTIVITY IN ADAM

Abstract

Adaptive Moment Estimation (Adam) has been observed to converge faster than stochastic gradient descent (SGD) in practice. However, this advantage has not been theoretically characterized: the existing convergence rates of Adam are no better than those of SGD. We attribute this mismatch between theory and practice to a commonly used assumption: that the smoothness is globally upper bounded by some constant L (the L-smooth condition). Specifically, compared to SGD, Adam adaptively chooses a learning rate better suited to the local smoothness. This advantage becomes prominent when the local smoothness varies drastically across the domain, which, however, is hidden under the L-smooth condition. In this paper, we analyze the convergence of Adam under the (L0, L1)-smooth condition, which allows the local smoothness to grow with the gradient norm. This condition has been empirically verified to be more realistic for deep neural networks (Zhang et al., 2019a) than the L-smooth condition. Under the (L0, L1)-smooth condition, we establish the convergence of Adam with practical hyperparameters. As such, we argue that Adam can adapt to the local smoothness, justifying Adam's benefit of adaptivity. In contrast, SGD can be arbitrarily slow under this condition. Hence, we theoretically characterize the advantage of Adam over SGD.

1. INTRODUCTION

Machine learning tasks are often formulated as solving the following finite-sum problem:

min_{w ∈ R^d} f(w) = Σ_{i=0}^{n-1} f_i(w),   (1)

where {f_i(w)}_{i=0}^{n-1} is lower bounded, n denotes the number of samples or mini-batches, and w denotes the trainable parameters. Among various gradient-based optimizers, Stochastic Gradient Descent (SGD) is a simple and popular method for solving Eq. (1). However, adaptive gradient methods, including Adaptive Moment Estimation (Adam) (Kingma & Ba, 2014), have recently been observed to outperform SGD in modern deep learning tasks, including GANs (Brock et al., 2018), BERT (Kenton & Toutanova, 2019), GPT (Brown et al., 2020), and ViT (Dosovitskiy et al., 2020). For instance, as reported in Figure 1(a), SGD converges much more slowly than Adam during the training of Transformers. Similar phenomena are also reported in BERT training (Zhang et al., 2019b). The empirical success of Adam comes from its special update rules. First, it uses the heavy-ball momentum mechanism, controlled by a hyperparameter β1. Second, it uses an adaptive learning rate strategy. In particular, the learning rate of Adam contains an exponential moving average of past squared gradients, weighted by β2. Larger β1 and β2 bring more gradient information from historical steps into the update. The update rule of Adam is given in Eq. (2) (presented later in Section 3). Despite its practical success, the theoretical understanding of Adam is limited. For instance, the existing convergence rates of Adam are no better than those of SGD (Zhang et al., 2022; Shi et al., 2021; Défossez et al., 2020; Zou et al., 2019; De et al., 2018; Guo et al., 2021). As such, there is a mismatch between Adam's superior empirical performance and its theoretical understanding. To close the gap between theory and practice, we revisit the existing analyses for Adam.
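To make the two mechanisms concrete, the following is a minimal NumPy sketch of the standard Adam update from Kingma & Ba (2014); the exact form used in our analysis is Eq. (2) in Section 3, and the function name and default hyperparameters here are illustrative.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: heavy-ball momentum (beta1) plus an adaptive
    learning rate built from an exponential moving average of squared
    gradients (beta2). t is the 1-indexed step count."""
    m = beta1 * m + (1 - beta1) * grad           # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad**2        # second moment (squared grads)
    m_hat = m / (1 - beta1**t)                   # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-coordinate adaptive step
    return w, m, v
```

Note that the effective step size lr / (sqrt(v_hat) + eps) shrinks where recent gradients have been large, which is the adaptivity this paper analyzes.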
Current setups for the analysis of Adam fail to model real-world applications: all existing convergence analyses of Adam are based on the L-smooth condition, i.e., the Lipschitz coefficient of the gradient is globally bounded. However, it has recently been observed that the L-smooth condition does not hold in many deep learning tasks (Zhang et al., 2019a; Crawshaw et al., 2022). This gap in setting can obscure Adam's superiority: different local Lipschitz coefficients of the gradient (local smoothness) may require different optimal learning rates in terms of convergence. However, the learning rate of SGD is ignorant of the local smoothness along the training trajectory. If the local smoothness does not change sharply along the trajectory, the optimal learning rate does not change much, and SGD can work well by selecting a learning rate through grid search. However, when the local smoothness varies dramatically, a learning rate that fits some points well may fit other points along the trajectory arbitrarily badly, which indicates that SGD may converge arbitrarily slowly (detailed discussions can be found in [Theorem 4, (Zhang et al., 2019a)] and Section 4.3 of this paper). In contrast, Adam adapts its update according to local information and does not suffer from this issue. Following the above methodology, we analyze the convergence of Adam under the (L0, L1)-smooth condition, which assumes the local smoothness (the spectral norm of the local Hessian) to be upper bounded by L1 · (local gradient norm) + L0 (Assumption 1). As the gradient norm can be unbounded, the (L0, L1)-smooth condition allows the local smoothness to grow rapidly with the gradient norm. Moreover, it has been demonstrated (Zhang et al., 2019a; 2020a; Crawshaw et al., 2022) that for many practical neural networks, the (L0, L1)-smooth condition more closely characterizes the optimization landscape along the optimization trajectories (as illustrated in Figure 1(b)).
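As a rough illustration of how the condition can be probed empirically, the sketch below estimates the local smoothness between consecutive iterates by a finite difference of gradients (in the spirit of the measurements in Zhang et al. (2019a), though not their exact protocol) and tests the bound L1 · ∥∇f∥ + L0; the helper names are ours.

```python
import numpy as np

def local_smoothness_estimate(grad_f, w, w_next):
    """Finite-difference estimate of the local gradient Lipschitz constant
    between two consecutive iterates: ||g(w') - g(w)|| / ||w' - w||."""
    g, g_next = grad_f(w), grad_f(w_next)
    return np.linalg.norm(g_next - g) / (np.linalg.norm(w_next - w) + 1e-12)

def satisfies_L0_L1(grad_f, trajectory, L0, L1):
    """Check the (L0, L1)-smooth bound
    local smoothness <= L1 * ||grad|| + L0 along a list of iterates."""
    for w, w_next in zip(trajectory[:-1], trajectory[1:]):
        smooth = local_smoothness_estimate(grad_f, w, w_next)
        if smooth > L1 * np.linalg.norm(grad_f(w)) + L0:
            return False
    return True
```

For a quadratic, the estimate is constant and any L0 above that constant suffices with L1 = 0; the condition only becomes strictly weaker than L-smoothness on losses whose curvature grows with the gradient norm.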
Under the (L0, L1)-smooth condition, we successfully establish the convergence of Adam. In contrast, under the same assumption, SGD is proved to be arbitrarily slow (Zhang et al., 2019a) or even to diverge (see Section 4.3). Therefore, our theory demonstrates that Adam can provably converge faster than SGD. The main contribution of this paper is summarized as follows. We derive the first convergence result of Adam under the more realistic (L0, L1)-smooth condition (Theorem 1). First, our convergence result is established under the mildest assumptions so far:

• The (L0, L1)-smooth condition is strictly weaker than the L-smooth condition. More importantly, the (L0, L1)-smooth condition is observed to hold in practical deep learning tasks. Relaxing the smoothness condition is important for characterizing the advantage of Adam over SGD.

• Our result does not require the bounded-gradient assumption (i.e., ∥∇f(x)∥ ≤ C). Removing this condition is necessary, as otherwise the (L0, L1)-smooth condition degenerates to the L-smooth condition. Our result does not need other strong assumptions, such as a bounded adaptor or a large ε (see Eq. (2)), either.

Furthermore, the conclusions of our convergence result are among the strongest:

• Our convergence result holds for every possible trajectory. This is much stronger than the common results of "convergence in expectation" and is technically challenging.

• In our convergence results, the setting of the hyperparameters (β1, β2) is close to practice. Specifically, our result holds for any β1 and any β2 close to 1, which matches practical settings (for example, 0.9 and 0.999 in deep learning libraries).



Figure 1: For a Transformer (Vaswani et al., 2017) on the WMT 2014 dataset, we plot (a) the training loss of SGD and Adam and (b) the gradient norm vs. the local smoothness along the training trajectory. The blue line represents log(local smoothness) = log(gradient norm) + 1.4. It can be observed that all (log(gradient norm), log(local smoothness)) points lie below this line, and thus the training process obeys the (0, e^1.4)-smooth condition.
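The final claim in the caption follows by exponentiating the linear fit: for any point on or below the blue line,

```latex
\log \|\nabla^2 f(w)\| \;\le\; \log \|\nabla f(w)\| + 1.4
\;\Longrightarrow\;
\|\nabla^2 f(w)\| \;\le\; e^{1.4}\, \|\nabla f(w)\|,
```

which is exactly the (L0, L1)-smooth bound with L0 = 0 and L1 = e^1.4 ≈ 4.06.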

