NOISE IS NOT THE MAIN FACTOR BEHIND THE GAP BETWEEN SGD AND ADAM ON TRANSFORMERS, BUT SIGN DESCENT MIGHT BE

Abstract

The success of the Adam optimizer on a wide array of architectures has made it the default in settings where stochastic gradient descent (SGD) performs poorly. However, our theoretical understanding of this discrepancy is lagging, preventing the development of significant improvements on either algorithm. Recent work advances the hypothesis that Adam and other heuristics like gradient clipping outperform SGD on language tasks because the distribution of the error induced by sampling has heavy tails. This suggests that Adam outperforms SGD because it uses a more robust gradient estimate. We evaluate this hypothesis by varying the batch size, up to the entire dataset, to control for stochasticity. We present evidence that stochasticity and heavy-tailed noise are not major factors in the performance gap between SGD and Adam. Rather, Adam performs better as the batch size increases, while SGD is less effective at taking advantage of the reduction in noise. This raises the question of why Adam outperforms SGD in the full-batch setting. Through further investigation of simpler variants of SGD, we find that the behavior of Adam with large batches is similar to sign descent with momentum.

1. INTRODUCTION

Adam (Kingma and Ba, 2015) and its derivatives have been so successful in training deep learning models that they have become the default optimizer for some architectures. Adam often outperforms stochastic gradient descent (SGD) by such a margin that SGD is considered incapable of training certain models, to the point of being omitted from performance comparisons (e.g. Liu et al., 2020; Anil et al., 2019). Despite this success, we still do not understand why Adam works, much less why it can outperform SGD by such a wide margin. We have made progress understanding why it should not, as in the work of Reddi et al. (2018), who pointed out that Adam does not converge even on convex problems, but this does not answer why Adam outperforms SGD.

The limited effectiveness of standard theory. We usually analyse optimization algorithms under assumptions such as convexity and Lipschitz continuity of the function or gradient (e.g. Nesterov, 2018, Chapters 2-3). Many works have focused on improving the analysis of Adam and its variants under those same assumptions. But these assumptions are only models of how losses behave. They do not convey the complexity of the optimization process in complex architectures, and such analyses are limited to showing that Adam does not do much worse than gradient descent (Défossez et al., 2022; Alacaoglu et al., 2020). Analyses in online learning also struggle to illuminate the gap. The assumption that the gradients come from an adversary requires decreasing step-sizes (e.g. Hazan, 2022, Thm 3.1), which decrease too quickly to perform well in practice. Our theoretical understanding is thus still limited in that it cannot describe the empirical behavior we observe: that Adam outperforms SGD in many settings.
As a result, there is a sentiment in the community that the success of these heuristics need not be due to robust theoretical underpinnings, but rather to social dynamics and a co-evolution of deep learning architectures and optimization heuristics (see for example Orabona, 2020). These "adaptive" algorithms might actually be adapted to the types of problems on which they outperform SGD. But this suggests that they are leveraging some problem structure that our current theory and theory-derived algorithms are missing. Understanding this structure may be key to developing better practical algorithms.

The heavy-tailed assumption. Recent works have proposed alternative assumptions to model the behavior of optimizers on neural networks. One such assumption comes from J. Zhang et al. (2020b), who hypothesize that the gap in performance might arise from a heavy-tailed distribution of the error induced by stochasticity. They notice a larger and more consistent gap in the performance of SGD and Adam on language models than on image models, which coincides with a heavier tail in the distribution of the stochastic gradient error.¹ We reproduce their observation in Figure 1. The proposed mechanism is that heavy-tailed errors have a larger impact on SGD than on Adam or gradient clipping, as the introduction of bias in the estimation of the gradient reduces its variance. This hypothesis suggests that a path to designing better algorithms is to improve robustness to heavy-tailed noise. For example, Gorbunov et al. (2020) combine acceleration and gradient clipping, while Srinivasan et al. (2021) leverage estimators tailored to heavy-tailed noise.
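The tail behavior described above can be probed numerically. The following is a minimal numpy sketch, on a toy least-squares problem of our own construction (the model, batch size, and tail-ratio statistic are illustrative choices, not taken from the paper): it compares the distribution of minibatch gradient error norms when the labels carry Gaussian noise versus heavy-tailed (Student-t) noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, batch = 10_000, 20, 32
X = rng.normal(size=(n, d))
w = rng.normal(size=d)

def grad_error_norms(y, draws=2000):
    # Per-example gradients of 0.5*(x_i.w - y_i)^2 at w; the full gradient
    # is their mean, and each minibatch gradient is a mean over `batch` rows.
    per_example = (X @ w - y)[:, None] * X
    full = per_example.mean(axis=0)
    return np.array([
        np.linalg.norm(per_example[rng.integers(0, n, size=batch)].mean(axis=0) - full)
        for _ in range(draws)
    ])

light = grad_error_norms(X @ w + rng.normal(size=n))              # Gaussian noise
heavy = grad_error_norms(X @ w + rng.standard_t(df=1.5, size=n))  # heavy-tailed noise

def tail_ratio(errs):
    # Crude heavy-tail indicator: how far the extreme quantile sits above the bulk.
    return np.quantile(errs, 0.999) / np.median(errs)

print(tail_ratio(light), tail_ratio(heavy))
```

Under this setup, the heavy-tailed labels produce a much larger tail ratio: a few minibatches contain outlier examples whose per-example gradients dominate the average, which is the mechanism by which a biased-but-clipped estimate can have lower variance than the plain minibatch mean.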
Inconsistency with large batch results. J. Zhang et al. (2020b) note a correlation between heavy-tailedness and cases where Adam outperforms SGD, and give a mechanism linking heavy-tailed errors to this gap. However, the type of noise is not the only difference between image and language tasks. There is limited empirical evidence that stochasticity is the root cause of this gap. In fact, there are reasons to believe noise might not be a major contributor. For example, the lack of variance in the behavior of the optimizers on language tasks in Figure 1 suggests they are less sensitive to noise than on image tasks. Moreover, we would expect the gap to diminish as the noise is reduced by increasing the batch size. However, methods such as LAMB (You et al., 2020) or plain Adam find success in large batch settings (Nado et al., 2021), suggesting a competitive advantage even with reduced noise. Studies of batch size scaling also find that Adam scales better with batch size (G. Zhang et al., 2019).

Alternative explanations. These empirical results cast doubt on the idea that robustness to heavy-tailed noise is the primary factor behind the performance improvement of Adam over SGD. Hypotheses based on deterministic properties might provide better descriptions of the cause of this gap; we discuss them in more detail in Sections 4 and 5. One such interpretation is the view of Adam as a variant of sign descent, which was a motivation for RMSprop (Tieleman and Hinton, 2012), as studied by Balles and Hennig (2018) and Bernstein et al. (2018). However, it remains unclear whether the performance of Adam can be explained by its similarities to a simpler algorithm, or if the additional changes needed to obtain Adam are necessary for good performance.
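The connection between Adam and sign descent mentioned above can be made concrete with the update rules themselves. The sketch below (function names and hyperparameter defaults are ours, not the paper's) writes out the standard Adam step and a sign-descent-with-momentum step, and checks the textbook limiting case: with β₁ = β₂ = 0 and ε → 0, Adam's step is -η·g/|g|, i.e. exactly sign descent.

```python
import numpy as np

def adam_step(g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Standard Adam update (Kingma and Ba, 2015), with bias correction.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    return -lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def sign_momentum_step(g, m, lr=1e-3, b1=0.9):
    # Sign descent with momentum: keep only the direction of the
    # momentum buffer, discarding the gradient magnitude per coordinate.
    m = b1 * m + (1 - b1) * g
    return -lr * np.sign(m), m

# Limiting case: b1 = b2 = 0 and eps = 0 turns Adam into plain sign descent,
# since the step becomes -lr * g / sqrt(g**2) = -lr * sign(g).
g = np.array([0.3, -2.0, 1e-4])
step, _, _ = adam_step(g, m=np.zeros(3), v=np.zeros(3), t=1, b1=0.0, b2=0.0, eps=0.0)
assert np.allclose(step, -1e-3 * np.sign(g))
```

With nonzero β₁ and β₂ the correspondence is no longer exact, which is precisely the open question raised above: whether the remaining differences between Adam and sign descent with momentum matter for performance.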



¹ The distribution of errors between stochastic gradients ĝ and full gradients g, ∥ĝ - g∥, is well approximated by a Gaussian for image models but closer to an α-stable distribution for language models.



Figure 1: The heavy-tail hypothesis: the gap between SGD and Adam is caused by a heavier tail in the distribution of the stochastic gradient error. The performance gap between SGD and Adam is larger and more consistent for transformers on text data (right: PTB, Wikitext2, SQuAD) than for CNNs on image data (left: MNIST, CIFAR-10), which coincides with a heavier tail in the distribution of the stochastic gradient error. J. Zhang et al. (2020b) hypothesize that heavier tails might be the cause of this gap. Top: Distribution of errors in stochastic gradients at initialization (∥ĝ - g∥, where ĝ is a stochastic gradient and g is the full gradient) compared against a Gaussian (QQ-plot). Bottom: SGD and Adam with and without momentum (+m/-m) with small batch sizes.

