PROVABLE BENEFIT OF ADAPTIVITY IN ADAM

Abstract

Adaptive Moment Estimation (Adam) has been observed to converge faster than stochastic gradient descent (SGD) in practice. However, this advantage has not been theoretically characterized: the existing convergence rates of Adam are no better than those of SGD. We attribute this mismatch between theory and practice to a commonly used assumption: that the smoothness is globally upper bounded by some constant $L$ (the $L$-smooth condition). Specifically, compared to SGD, Adam adaptively chooses a learning rate better suited to the local smoothness. This advantage becomes prominent when the local smoothness varies drastically across the domain, which, however, is hidden under the $L$-smooth condition. In this paper, we analyze the convergence of Adam under the $(L_0, L_1)$-smooth condition, which allows the local smoothness to grow with the gradient norm. This condition has been empirically verified to be more realistic for deep neural networks (Zhang et al., 2019a) than the $L$-smooth condition. Under the $(L_0, L_1)$-smooth condition, we establish convergence for Adam with practical hyperparameters. As such, we argue that Adam can adapt to the local smoothness, justifying Adam's benefit of adaptivity. In contrast, SGD can be arbitrarily slow under this condition. Hence, we theoretically characterize the advantage of Adam over SGD.
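For reference, the $(L_0, L_1)$-smooth condition of Zhang et al. (2019a) mentioned above can be stated as follows; it allows the local smoothness to grow linearly with the gradient norm, with the standard $L$-smooth condition recovered as the special case $L_1 = 0$:

```latex
% (L0, L1)-smoothness (Zhang et al., 2019a): for all w,
\[
  \left\| \nabla^2 f(w) \right\| \;\le\; L_0 + L_1 \left\| \nabla f(w) \right\|.
\]
% The classical L-smooth condition is the special case L_1 = 0,
% i.e., \|\nabla^2 f(w)\| \le L everywhere.
```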

1. INTRODUCTION

Machine learning tasks are often formulated as solving the following finite-sum problem:

$$\min_{w \in \mathbb{R}^d} f(w) = \sum_{i=0}^{n-1} f_i(w), \qquad (1)$$

where $\{f_i(w)\}_{i=0}^{n-1}$ is lower bounded, $n$ denotes the number of samples or mini-batches, and $w$ denotes the trainable parameters. Among various gradient-based optimizers, Stochastic Gradient Descent (SGD) is a simple and popular method for solving Eq. (1). However, adaptive gradient methods, including Adaptive Moment Estimation (Adam) (Kingma & Ba, 2014), have recently been observed to outperform SGD on modern deep learning tasks including GANs (Brock et al., 2018), BERTs (Kenton & Toutanova, 2019), GPTs (Brown et al., 2020), and ViTs (Dosovitskiy et al., 2020). For instance, as reported in Figure 1(a), SGD converges much slower than Adam during the training of Transformers. Similar phenomena are also reported in BERT training (Zhang et al., 2019b).

The empirical success of Adam comes from its special update rules. First, it uses the heavy-ball momentum mechanism controlled by a hyperparameter $\beta_1$. Second, it uses an adaptive learning-rate strategy: the learning rate of Adam contains an exponential moving average of past squared gradients, weighted by $\beta_2$. Larger $\beta_1$ and $\beta_2$ bring more gradient information from historical steps into the update. The update rule of Adam is given in Eq. (2) (presented later in Section 3).

Despite its practical success, the theoretical understanding of Adam is limited. For instance, the existing convergence rates of Adam are no better than that of SGD (Zhang et al., 2022; Shi et al., 2021; Défossez et al., 2020; Zou et al., 2019; De et al., 2018; Guo et al., 2021). As such, there is a mismatch between Adam's superior empirical performance and its theoretical understanding. To close the gap between theory and practice, we revisit the existing analyses of Adam. Current setups for the analysis of Adam fail to model real-world applications: all existing convergence analyses of Adam are based on the $L$-smooth condition, i.e., the Lipschitz coefficient of the gradient is
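To make Adam's two mechanisms concrete, heavy-ball momentum (controlled by $\beta_1$) and the exponential moving average of squared gradients (controlled by $\beta_2$), the following Python sketch implements the familiar form of the Adam update on a toy finite-sum problem. The quadratic objective, step count, and learning rate here are illustrative assumptions, not taken from the paper (the paper's exact update rule, Eq. (2), appears in Section 3).

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update in its familiar form (Kingma & Ba, 2014).

    m: exponential moving average of gradients (heavy-ball momentum, beta1).
    v: exponential moving average of squared gradients (adaptive step, beta2).
    t: 1-indexed step counter, used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad           # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # coordinate-wise adaptive step
    return w, m, v

# Toy finite-sum objective: f(w) = sum_i 0.5 * (w - a_i)^2, whose minimizer
# is the mean of the a_i; the stochastic gradient for index i is (w - a_i).
rng = np.random.default_rng(0)
a = rng.normal(size=8)
w = np.array(5.0)
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 2001):
    i = rng.integers(len(a))
    w, m, v = adam_step(w, w - a[i], m, v, t, lr=0.05)
```

Note that the denominator $\sqrt{\hat v} + \epsilon$ makes the effective per-coordinate step size depend on the history of squared gradients; this is the adaptivity the paper argues lets Adam track local smoothness.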

