UNDERSTANDING THE GENERALIZATION OF ADAM IN LEARNING NEURAL NETWORKS WITH PROPER REGULARIZATION

Abstract

Adaptive gradient methods such as Adam have gained increasing popularity in deep learning optimization. However, it has been observed that in many deep learning applications, such as image classification, Adam can converge to a different solution with a worse test error compared to (stochastic) gradient descent, even with fine-tuned regularization. In this paper, we provide a theoretical explanation for this phenomenon: we show that in the nonconvex setting of learning over-parameterized two-layer convolutional neural networks starting from the same random initialization, for a class of data distributions (inspired by image data), Adam and gradient descent (GD) can converge to different global solutions of the training objective with provably different generalization errors, even with weight decay regularization. In contrast, we show that if the training objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam and GD, will converge to the same solution if the training is successful. This suggests that the generalization gap between Adam and SGD in the presence of weight decay regularization is closely tied to the nonconvex landscape of deep learning optimization, and cannot be captured by the recent neural tangent kernel (NTK) based analyses.

1. INTRODUCTION

Adaptive gradient methods (Duchi et al., 2011; Hinton et al., 2012; Kingma & Ba, 2015; Reddi et al., 2018) such as Adam are very popular optimizers for training deep neural networks. By adjusting the learning rate coordinate-wise based on historical gradient information, they are able to automatically choose appropriate learning rates and achieve fast convergence in training. Because of this advantage, Adam and its variants are widely used in deep learning. Despite their fast convergence, adaptive gradient methods have been observed to achieve worse generalization performance than gradient descent and stochastic gradient descent (SGD) (Wilson et al., 2017; Luo et al., 2019; Chen et al., 2020; Zhou et al., 2020). Several recent works provided theoretical explanations of this generalization gap between Adam and GD by showing that Adam and GD have different implicit biases. Wilson et al. (2017); Agarwal et al. (2019) considered a setting of linear regression, and showed that Adam can fail when learning an over-parameterized linear model on certain specifically designed data, while SGD can learn the linear model and achieve zero test error. This example in linear regression offers valuable insights into the difference between SGD and Adam. However, there is a gap between these theoretical results and practical observations, since they consider a convex optimization setting, in which the difference between Adam and SGD is no longer observed once weight decay regularization is added. In fact, as we will show in this paper (Theorem 4.2), regularization can successfully correct the different implicit biases and push different algorithms to find the same solution, since the regularized training loss function of a convex model becomes strongly convex and therefore has a unique global optimum. For this reason, we argue that the example in the convex setting cannot fully capture the differences between GD and Adam for training neural networks. More recently, Zhou et al. (2020) studied the expected escaping time of Adam and SGD from a local basin, and utilized this to explain the difference between SGD and Adam.
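For reference, the per-coordinate adaptivity discussed above can be sketched as follows. This is a minimal implementation of the standard Adam update rule (Kingma & Ba, 2015); the variable names and default hyperparameters are the conventional ones, not quantities defined in this paper.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: the effective learning rate is set per coordinate
    using exponential moving averages of the gradient and its square."""
    m = beta1 * m + (1 - beta1) * grad           # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment (per-coordinate scale)
    m_hat = m / (1 - beta1 ** t)                 # bias corrections for zero init
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # coordinate-wise normalized step
    return w, m, v
```

Because the step is normalized by `sqrt(v_hat)`, every coordinate moves at roughly the rate `lr` regardless of its gradient magnitude; this sign-gradient-like behavior is central to the feature learning versus noise memorization distinction analyzed later.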
However, their results do not take the neural network architecture into consideration, and do not provide an analysis of test errors either. In this paper, we aim at answering the following question:

Why is there a generalization gap between Adam and gradient descent in learning neural networks, even with weight decay regularization?

Specifically, we study Adam and GD for training neural networks with weight decay regularization on an image-like data model, and demonstrate the different behaviors of Adam and GD based on the notion of feature learning/noise memorization decomposition. Inspired by the experimental observation in Figure 1, where Adam tends to overfit the noise component of the data, we consider a model where the data are generated as a combination of feature and noise patches, and analyze the convergence and generalization of Adam and GD for training a two-layer convolutional neural network (CNN). The contributions of this paper are summarized as follows.

• We establish global convergence guarantees for Adam and GD with weight decay regularization. We show that, starting from the same random initialization, Adam and GD can both train a two-layer convolutional neural network to achieve zero training error after polynomially many iterations, despite the nonconvex optimization landscape.

• We further show that GD and Adam in fact converge to different global solutions with different generalization performance: on the considered image-like data model, GD achieves nearly zero test error, while the generalization performance of the model found by Adam is no better than a random guess. In particular, we show that this gap is due to the different training behaviors of Adam and GD: Adam is more likely to fit the dense noise and outputs a model largely contributed by the noise patches, whereas GD prefers to fit the training data using their feature patches and finds a solution mainly composed of the true features.
• We also show that in convex settings with weight decay regularization, Adam and gradient descent converge to the same solution and therefore exhibit no difference in test error. This suggests that the difference between Adam and GD cannot be fully explained by linear models or by neural networks trained in the "almost convex" neural tangent kernel (NTK) regime (Jacot et al., 2018; Allen-Zhu et al., 2019b; Du et al., 2019a; Zou et al., 2019). It also demonstrates that the inferior generalization performance of Adam is closely tied to the nonconvex landscape of deep learning optimization, and cannot be fixed by adding regularization.
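The convex claim above can be illustrated with a small numerical sketch (a hypothetical toy setup, not an experiment from this paper): on an ℓ2-regularized linear regression objective, which is strongly convex and hence has a unique minimizer, GD and Adam recover the same solution. The data, learning rates, and iteration counts below are illustrative choices.

```python
import numpy as np

def grad_ridge(w, X, y, lam):
    """Gradient of 0.5*||Xw - y||^2 / n + 0.5*lam*||w||^2 (weight decay term)."""
    n = X.shape[0]
    return X.T @ (X @ w - y) / n + lam * w

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = rng.normal(size=50)
lam = 0.1

# Gradient descent on the regularized objective.
w_gd = np.zeros(5)
for _ in range(5000):
    w_gd -= 0.1 * grad_ridge(w_gd, X, y, lam)

# Adam on the same regularized objective.
w_ad = np.zeros(5); m = np.zeros(5); v = np.zeros(5)
for t in range(1, 20001):
    g = grad_ridge(w_ad, X, y, lam)
    m = 0.9 * m + 0.1 * g
    v = 0.999 * v + 0.001 * g**2
    m_hat = m / (1 - 0.9**t)
    v_hat = v / (1 - 0.999**t)
    w_ad -= 0.005 * m_hat / (np.sqrt(v_hat) + 1e-8)

# The unique minimizer in closed form: (X^T X / n + lam I)^{-1} X^T y / n.
w_star = np.linalg.solve(X.T @ X / 50 + lam * np.eye(5), X.T @ y / 50)
```

Both iterates end up at (approximately) the closed-form minimizer `w_star`, consistent with the strong convexity argument: with weight decay, the implicit bias of the optimizer no longer determines which solution is found.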

2. RELATED WORK

In this section, we discuss the works that are closely related to our paper. Generalization gap between Adam and SGD. The worse generalization of Adam compared with SGD has also been observed in several recent works and has motivated new variants of neural network training algorithms. Keskar & Socher (2017) proposed to switch between Adam and SGD to achieve better generalization. Merity et al. (2018) proposed a variant of the averaged stochastic gradient method to achieve good generalization performance for LSTM language models. Luo et al. (2019) proposed to use dynamic bounds on learning rates to achieve a smooth transition from adaptive methods to SGD and thereby improve generalization. Our theoretical results for GD and Adam can also provide theoretical insights into these empirically motivated algorithm designs.






