EARLY STOPPING IN DEEP NETWORKS: DOUBLE DESCENT AND HOW TO ELIMINATE IT

Abstract

Over-parameterized models, such as large deep networks, often exhibit a double descent phenomenon, where, as a function of model size, the error first decreases, then increases, and finally decreases again. This intriguing double descent behavior also occurs as a function of training epochs and has been conjectured to arise because the training epochs control the model complexity. In this paper, we show that such epoch-wise double descent occurs for a different reason: it is caused by a superposition of two or more bias-variance tradeoffs that arise because different parts of the network are learned at different epochs, and mitigating this by properly scaling the stepsizes can significantly improve the early stopping performance. We show this analytically for i) linear regression, where differently scaled features give rise to a superposition of bias-variance tradeoffs, and for ii) a wide two-layer neural network, where the first and second layers govern the bias-variance tradeoffs. Inspired by this theory, we study two standard convolutional networks empirically and show that eliminating epoch-wise double descent by adjusting the stepsizes of different layers improves the early stopping performance.

1. INTRODUCTION

Most machine learning algorithms learn a function that predicts a label from features. This function lies in a hypothesis class, such as a neural network parameterized by its weights. Learning amounts to fitting the parameters of the function by minimizing an empirical risk over the training examples. The goal is to learn a function that performs well on new examples, which are assumed to come from the same distribution as the training examples. Classical machine learning theory says that the test error or risk as a function of the size of the hypothesis class is U-shaped: a small hypothesis class is not sufficiently expressive to have small error, and a large one leads to overfitting to spurious patterns in the data. The superposition of those two sources of error, typically referred to as bias and variance, yields the classical U-shaped curve. However, increasing the model size beyond the number of training examples can decrease the error again. This phenomenon, dubbed "double descent" by Belkin et al. (2019), was observed as early as 1995 by Opper (1995), and is relevant today because most modern machine learning models, in particular deep neural networks, operate in the over-parameterized regime, where the error often decreases again as a function of model size, and where the model is sufficiently expressive to describe any data, even noise.

Interestingly, this double descent behavior also occurs as a function of training time, as observed by Nakkiran et al. (2020a) and as illustrated in Figure 1. The left panel of Figure 1 shows that as a function of training epochs, the test error first decreases, increases, and then decreases again. It is important to understand this so-called epoch-wise double descent behavior to determine the early stopping time that gives the best performance. Early stopping, or other regularization techniques, are critical for learning from noisy labels (Arpit et al., 2017; Yilmaz & Heckel, 2020). Epoch-wise double descent has been conjectured to arise because the training time controls the model complexity, based on classical results on early stopping as regularization (Yao et al., 2007; Raskutti et al., 2014; Bühlmann & Yu, 2003).
Specifically, limiting the number of gradient descent iterations ensures that the function's parameters lie in a ball around the initial parameters. While this conjecture might be true for certain problem setups, it is not consistent with our empirical observations for the 5-layer CNN studied by Nakkiran et al. (2020a): the empirically measured overall bias in Figure 1 increases for some iterations, whereas an increasing model size would imply that it decreases (see Appendix B.2 for details on this experiment).

In this paper, we show empirically and theoretically that epoch-wise double descent, at least in the setups in which we observed it, arises for a different reason: it is explained by a superposition of bias-variance tradeoffs, as illustrated for a toy regression example in the right panel of Figure 1. If the risk can be decomposed into two U-shaped bias-variance tradeoffs with minima at different epochs/iterations, then the overall risk/test error has a double descent behavior. We also note that epoch-wise double descent is not a phenomenon tied to over-parameterization: both under- and over-parameterized models can exhibit epoch-wise double descent, as we show in this paper.

1.1 CONTRIBUTIONS

The goal of this paper is to understand the epoch-wise double descent behavior. Our main finding is that epoch-wise double descent can be explained as a superposition of bias-variance tradeoffs, and arises naturally in some standard neural networks because parts of the network are learned faster than others. Our contributions are as follows:

First, we consider a linear regression model and theoretically characterize the risk of early stopped least squares. We show that if features have different scales, then the risk of the early stopped least-squares estimate, as a function of the early stopping time, is a superposition of bias-variance tradeoffs, which yields a double-descent-like curve (see Figure 1, right panel).
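The superposition mechanism can be sketched numerically. The following is a minimal illustration, not the paper's exact setup: we assume an orthogonal design in which each feature group contributes an analytic per-group risk, with a squared bias that decays geometrically under gradient descent and a variance that grows toward its stationary level. All parameter values below (eigenvalues, signal energies, variance levels) are hypothetical choices made only to make the effect visible.

```python
import numpy as np

# Analytic test risk of early-stopped gradient descent on least squares,
# assuming an orthogonal design. For a feature group with eigenvalue lam,
# a fraction r = (1 - eta*lam)**t of the signal remains unfit after t steps.
def group_risk(t, lam, theta2, var_inf, eta):
    r = (1.0 - eta * lam) ** t
    bias2 = theta2 * r**2              # squared bias: decays with t
    var = var_inf * (1.0 - r) ** 2     # variance: grows toward var_inf
    return bias2 + var

eta = 0.05
# hypothetical values: a large-scale (fast) and a small-scale (slow) group
fast = dict(lam=1.0, theta2=2.0, var_inf=2.0)
slow = dict(lam=0.01, theta2=2.0, var_inf=0.04)

ts = np.arange(0, 5001)
risk = group_risk(ts, eta=eta, **fast) + group_risk(ts, eta=eta, **slow)
# The fast group's U-shaped tradeoff bottoms out after ~15 steps, while the
# slow group is still essentially unfit; the risk then rises as the fast
# group overfits, and descends again once the slow group is learned:
# an epoch-wise double descent curve.
```

The non-monotonicity comes purely from the two groups being learned on different time scales; neither group's individual tradeoff is double-descent-shaped.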
Second, we theoretically characterize the early stopped risk of a two-layer neural network and show that it is upper bounded by a curve consisting of overlapping bias-variance tradeoffs that are governed by the initializations and stepsizes of the two layers. The initialization scales and stepsizes of the weights in the first and second layers determine whether double descent occurs. We provide numerical examples showing how epoch-wise double descent occurs when training such a two-layer network on data, and how it can be eliminated by scaling the stepsizes of the layers accordingly.

Third, we study a standard 5-layer convolutional network as well as ResNet-18 empirically. For the 5-layer convolutional network we find, similarly as for the two-layer model, that epoch-wise double descent occurs because the convolutional layers (representation layers) are learned faster than the final, fully connected layer. Similarly, for ResNet-18, we find that later layers are learned faster than early layers, which again results in double descent. In both cases, epoch-wise double descent can be eliminated by adjusting the stepsizes of the different coefficients or layers.

In summary, we provide new examples of when epoch-wise double descent occurs, as well as analytical results explaining it theoretically. Our theory is constructive in that it suggests a simple and effective mitigation strategy: scaling the stepsizes appropriately. We also argue that epoch-wise double descent should be eliminated by adjusting the stepsizes and/or the initialization, because this often translates into better overall performance.
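The mitigation strategy can be sketched in the same simplified setting. This is a hedged toy illustration, assuming an orthogonal design with two feature groups and hypothetical parameter values: with a single global stepsize the two groups are learned on different time scales and the risk has two descents, whereas scaling each group's stepsize inversely to its eigenvalue aligns the two tradeoffs into a single U-shape with a better minimum.

```python
import numpy as np

# Total analytic risk over feature groups, each with its own stepsize eta.
# Per group: squared bias theta2*r**2 decays, variance var_inf*(1-r)**2
# grows, where r = (1 - eta*lam)**t is the unfit fraction after t steps.
def risk_curve(etas, lams, theta2s, var_infs, ts):
    total = np.zeros_like(ts, dtype=float)
    for eta, lam, theta2, var_inf in zip(etas, lams, theta2s, var_infs):
        r = (1.0 - eta * lam) ** ts
        total += theta2 * r**2 + var_inf * (1.0 - r) ** 2
    return total

ts = np.arange(0, 5001)
lams, theta2s, var_infs = [1.0, 0.01], [2.0, 2.0], [2.0, 0.04]

# single global stepsize: the groups converge at very different speeds,
# producing an epoch-wise double descent curve
single = risk_curve([0.05, 0.05], lams, theta2s, var_infs, ts)

# stepsizes scaled as 1/lam: both groups share the residual factor
# (1 - 0.05)**t, so the risk is one U-shaped bias-variance tradeoff
scaled = risk_curve([0.05, 5.0], lams, theta2s, var_infs, ts)
```

In this toy setting, equalizing the effective stepsizes eta*lam not only removes the intermediate bump but also lowers the best achievable early-stopped risk, mirroring the paper's observation that eliminating epoch-wise double descent improves early stopping performance.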



Figure 1: Left: The test and train error curves of an over-parameterized 5-layer convolutional network trained on the CIFAR-10 training set with 20% random label noise. As observed by Nakkiran et al. (2020a), the performance shows a double descent behavior. Right: As we show here, the risk of a regression problem can be decomposed as the sum of two bias-variance tradeoffs. In both examples, stopping the training at the epoch where the test error attains its minimum is critical for performance.

