EARLY STOPPING BY GRADIENT DISPARITY

Abstract

Validation-based early-stopping methods are among the most popular techniques used to avoid over-training deep neural networks. They require setting aside a reliable, unbiased validation set, which can be expensive in applications offering limited amounts of data. In this paper, we propose to use gradient disparity, which we define as the ℓ2-norm distance between the gradient vectors of two batches drawn from the training set. It derives from a probabilistic upper bound on the difference between the classification errors over a given batch when the network is trained on this batch and when the network is trained on another batch of points sampled from the same dataset. We empirically show that gradient disparity is a very promising early-stopping criterion when data is limited, because it uses all the training samples during training. Furthermore, we show in a wide range of experimental settings that gradient disparity is not only strongly related to the usual generalization error between the training and test sets, but that it is also much more informative about the level of label noise.

1. INTRODUCTION

Early stopping is a commonly used regularization technique to avoid under/over-fitting deep neural networks trained with iterative methods, such as gradient descent (Prechelt, 1998; Yao et al., 2007; Gu et al., 2018). To have an unbiased proxy on the generalization error, early stopping requires a separate, accurately labeled validation set. However, labeled data collection is an expensive and time-consuming process that might require domain expertise (Roh et al., 2019). Moreover, deep learning is becoming popular for new and critical applications for which there is simply not enough available data. Hence, it is advantageous to have a signal of overfitting that does not require a validation set, so that all the available data can be used for training the model.

Figure 1: An illustration of the penalty term R_2, where the y-axis is the loss and the x-axis indicates the parameters of the model. L_S1 and L_S2 are the average losses over batches S_1 and S_2, respectively. w^(t) is the parameter at iteration t, and w_i^(t+1) is the parameter at iteration t+1 if batch S_i was selected for the update step at iteration t, with i ∈ {1, 2}.

Let S_1 and S_2 be two batches of points sampled from the available (training) dataset. Suppose that S_1 is selected for an iteration (step) of stochastic gradient descent (SGD), which then updates the parameter vector to w_1. The average loss over S_1 is in principle reduced, given a sufficiently small learning rate. However, the average loss over the other batch S_2, i.e., L_S2(h_w1), is not as likely to be reduced. It will remain on average larger than the loss that would have been computed over S_2 if it was S_2 instead of S_1 that had been selected for this iteration, i.e., L_S2(h_w2). The difference is the penalty R_2 that we pay for choosing S_1 over S_2 (and similarly, R_1 is the penalty that we would pay for choosing S_2 over S_1).
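To make the penalty concrete, here is a minimal numerical sketch with a hypothetical one-parameter least-squares model (a toy example of our own, not the networks studied in the paper): it takes one SGD step on S_1 versus one on S_2 and evaluates R_2 = L_S2(h_w1) − L_S2(h_w2).

```python
# Toy illustration of the generalization penalty R_2 with a hypothetical
# one-parameter least-squares model h_w(x) = w * x (assumption, not the
# paper's setup).
def grad(w, xs, ys):
    """Gradient of the mean squared loss L_S(w) = mean((w*x - y)^2) at w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def loss(w, xs, ys):
    """Average loss over a batch."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Two batches drawn from the same (hypothetical) dataset.
S1 = ([1.0, 2.0], [2.0, 4.0])   # consistent with w* = 2
S2 = ([1.0, 3.0], [3.0, 9.0])   # consistent with w* = 3

w, lr = 0.0, 0.05
w1 = w - lr * grad(w, *S1)      # parameter if S1 is used for the SGD step
w2 = w - lr * grad(w, *S2)      # parameter if S2 is used instead

# Penalty paid on S2 for having chosen S1: R_2 = L_S2(h_w1) - L_S2(h_w2)
R2 = loss(w1, *S2) - loss(w2, *S2)
```

Because the two batches pull the parameter toward different minima, the step on S_1 helps S_2 less than a step on S_2 itself would have, so R_2 comes out positive.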
R_2 is illustrated in Figure 1 for a hypothetical non-convex loss as a function of a one-dimensional parameter. The expected penalty measures how much, in an iteration, a model updated on one batch (S_1) is able to generalize on average to another batch (S_2) from the dataset. Hence, we call R the generalization penalty. We establish a probabilistic upper bound on the sum of the expected penalties E[R_1] + E[R_2] by adapting the PAC-Bayesian framework (McAllester, 1999a;b; 2003), given a pair of batches S_1 and S_2 sampled from the dataset (Theorem 1). Interestingly, under some mild assumptions, this upper bound is essentially a simple expression driven by ‖g_1 − g_2‖_2, where g_1 and g_2 are the gradient vectors over the two batches S_1 and S_2, respectively. We call this quantity gradient disparity: it measures how a small gradient step on one batch negatively affects the performance on another one. Gradient disparity is simple to use and computationally tractable during the course of training. Our experiments on state-of-the-art configurations suggest a very strong link between gradient disparity and generalization error; we therefore propose gradient disparity as an effective early stopping criterion. It is particularly useful when the available dataset has limited labeled data, because it does not require splitting the available dataset into training and validation sets, so that all the available data can be used during training, unlike for instance k-fold cross validation. We observe that using gradient disparity, instead of an unbiased validation set, results in at least a 1% predictive performance improvement for critical applications with limited and very costly data, such as the MRNet dataset, a small image-classification dataset used for detecting knee injuries (Table 1).
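As a concrete (toy) illustration of the quantity ‖g_1 − g_2‖_2, the sketch below computes the gradient disparity between two batches for a hypothetical linear least-squares model; the batch data and the model are our own assumptions, not the paper's experimental setup.

```python
import numpy as np

def batch_gradient(w, X, y):
    """Mean squared-loss gradient over one batch for a linear model h_w(x) = Xw."""
    return X.T @ (X @ w - y) / len(y)

# Two hypothetical batches of two samples each (toy data, not the paper's).
X1, y1 = np.eye(2), np.array([1.0, 1.0])
X2, y2 = np.eye(2), np.array([3.0, 1.0])

w = np.zeros(2)
g1 = batch_gradient(w, X1, y1)
g2 = batch_gradient(w, X2, y2)

# Gradient disparity: the l2 distance between the two batch gradients.
disparity = np.linalg.norm(g1 - g2)
```

In practice g_1 and g_2 would be the flattened gradients of the network loss over two mini-batches at the current parameter vector, but the ‖g_1 − g_2‖_2 computation is the same.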
Table 1: The loss and area under the receiver operating characteristic curve (AUC score) on the MRNet test set (Bien et al., 2018), comparing 5-fold cross validation (5-fold CV) and gradient disparity (GD) when both are used as early stopping criteria for detecting the presence of abnormalities, ACL tears, and meniscal tears from the sagittal plane MRI scans. The corresponding curves during training are shown in Figure 10. The results of early stopping are given both when the metric has increased for 5 epochs from the beginning of training and (in parentheses) when the metric has increased for 5 consecutive epochs.

Moreover, when the available dataset contains noisy labels, the validation set is no longer a reliable predictor of the clean test set (see e.g., Figure 9 (a) (left)), whereas gradient disparity correctly predicts the performance on the test set and again can be used as a promising early-stopping criterion. Furthermore, we observe that gradient disparity is a better indicator of the label noise level than the generalization error, especially at early stages of training. Similarly to the generalization error, gradient disparity decreases with the training set size, and it increases with the batch size.

Paper Outline. In Section 2, we formally define the generalization penalty. In Section 3, we give the upper bound on the generalization penalty. In Section 4, we introduce the gradient disparity metric. In Section 5, we present experiments that support gradient disparity as an early stopping criterion. In Section 6, we assess gradient disparity as a generalization metric. Finally, in Section 7, we further discuss the observations and compare gradient disparity to related work. A detailed comparison to related work is deferred to Appendix H. For our experiments, we consider four image classification datasets: MNIST, CIFAR-10, CIFAR-100 and MRNet, and a wide range of neural network architectures: ResNet, VGG, AlexNet and fully connected neural networks.
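The consecutive-epochs stopping rule described above can be sketched as follows. `early_stop_epoch` is a hypothetical helper name of our own; the monitored values would in practice be per-epoch gradient disparity (or validation loss, for the 5-fold CV baseline).

```python
def early_stop_epoch(metric_per_epoch, patience=5):
    """Return the first epoch at which the monitored metric (e.g. gradient
    disparity) has increased for `patience` consecutive epochs, or None if
    the stopping rule is never triggered."""
    streak = 0
    for t in range(1, len(metric_per_epoch)):
        # Extend the streak on an increase, otherwise reset it.
        streak = streak + 1 if metric_per_epoch[t] > metric_per_epoch[t - 1] else 0
        if streak >= patience:
            return t
    return None
```

The looser variant reported in Table 1 (metric increased for 5 epochs counted from the beginning of training, not necessarily consecutive) would replace the streak reset with a running count of increases.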

2. GENERALIZATION PENALTY

Consider a classification task with input x ∈ X := R^n and ground-truth label y ∈ {1, 2, …, k}, where k is the number of classes. Let h_w ∈ H, with h_w : X → Y := R^k, be a predictor (classifier) parameterized by the parameter vector w ∈ R^d, and let ℓ(·, ·) be the 0-1 loss function

ℓ(h_w(x), y) = 1[ h_w(x)[y] < max_{j ≠ y} h_w(x)[j] ]

for all h_w ∈ H and (x, y) ∈ X × {1, 2, …, k}. The expected loss and the empirical loss over the training set S = {(x_i, y_i)}_{i=1}^m of size m are respectively defined as

L(h_w) = E_{(x,y) ~ D} [ ℓ(h_w(x), y) ]   and   L_S(h_w) = (1/m) ∑_{i=1}^m ℓ(h_w(x_i), y_i),

where D denotes the underlying data distribution.
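Under the definitions above, the empirical 0-1 loss can be computed directly from the network outputs. The sketch below uses hypothetical scores for k = 3 classes; the helper name and the data are our own assumptions.

```python
import numpy as np

def empirical_01_loss(scores, labels):
    """Empirical 0-1 loss L_S(h_w): fraction of samples whose true-class score
    is below the best competing class score."""
    m = len(labels)
    errors = 0
    for i in range(m):
        true_score = scores[i, labels[i]]
        other_scores = np.delete(scores[i], labels[i])  # scores of classes j != y
        if true_score < other_scores.max():
            errors += 1
    return errors / m

# Hypothetical outputs h_w(x) for three samples and k = 3 classes.
scores = np.array([[2.0, 1.0, 0.0],
                   [0.0, 5.0, 1.0],
                   [1.0, 2.0, 3.0]])
labels = [0, 1, 0]
L_S = empirical_01_loss(scores, labels)
```

Here the first two samples are classified correctly and the third is not, so the empirical loss is 1/3.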

