EARLY STOPPING BY GRADIENT DISPARITY

Abstract

Validation-based early-stopping methods are among the most popular techniques used to avoid over-training deep neural networks. They require setting aside a reliable, unbiased validation set, which can be expensive in applications offering limited amounts of data. In this paper, we propose to use gradient disparity, which we define as the ℓ2-norm distance between the gradient vectors of two batches drawn from the training set. It stems from a probabilistic upper bound on the difference between the classification errors over a given batch, when the network is trained on this batch and when the network is trained on another batch of points sampled from the same dataset. We empirically show that gradient disparity is a very promising early-stopping criterion when data is limited, because it uses all the training samples during training. Furthermore, we show in a wide range of experimental settings that gradient disparity is not only strongly related to the usual generalization error between the training and test sets, but that it is also much more informative about the level of label noise.
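To make the definition concrete, the following is a minimal sketch of gradient disparity for a toy linear least-squares model. The model, data, and batch construction are illustrative assumptions, not the architectures studied in the paper; only the quantity computed, the ℓ2 distance between the average gradient vectors of two training batches, follows the definition above.

```python
import numpy as np

def batch_gradient(w, X, y):
    """Average gradient of the squared loss 0.5*(Xw - y)^2 over one batch."""
    residual = X @ w - y
    return X.T @ residual / len(y)

def gradient_disparity(w, batch1, batch2):
    """l2-norm distance between the gradient vectors of two batches."""
    g1 = batch_gradient(w, *batch1)
    g2 = batch_gradient(w, *batch2)
    return np.linalg.norm(g1 - g2)

# Hypothetical data: 64 points, 5 features, split into two batches.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=64)
w = np.zeros(5)
S1 = (X[:32], y[:32])
S2 = (X[32:], y[32:])

print(gradient_disparity(w, S1, S2))
```

Tracked over training iterations, a rising value of this quantity signals that updates on one batch stop agreeing with the other batches, which is what the paper proposes to use as an early-stopping signal.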

1. INTRODUCTION

Early stopping is a commonly used regularization technique to avoid under/over-fitting deep neural networks trained with iterative methods, such as gradient descent (Prechelt, 1998; Yao et al., 2007; Gu et al., 2018). To have an unbiased proxy on the generalization error, early stopping requires a separate, accurately labeled validation set. However, labeled data collection is an expensive and time-consuming process that might require domain expertise (Roh et al., 2019). Moreover, deep learning is becoming popular for new and critical applications for which there is simply not enough available data. Hence, it is advantageous to have a signal of overfitting that does not require a validation set, so that all the available data can be used for training the model.

[Figure 1: An illustration of the penalty term R_2, where the y-axis is the loss and the x-axis indicates the parameters of the model. L_{S_1} and L_{S_2} are the average losses over batches S_1 and S_2, respectively. w^(t) is the parameter at iteration t, and w^(t+1)_i is the parameter at iteration t+1 if batch S_i was selected for the update step at iteration t, with i ∈ {1, 2}.]

Let S_1 and S_2 be two batches of points sampled from the available (training) dataset. Suppose that S_1 is selected for an iteration (step) of stochastic gradient descent (SGD), which then updates the parameter vector to w_1. The average loss over S_1 is in principle reduced, given a sufficiently small learning rate. However, the average loss over the other batch S_2 (i.e., L_{S_2}(h_{w_1})) is not as likely to be reduced. It will on average remain larger than the loss that would have been obtained over S_2 if S_2, instead of S_1, had been selected for this iteration (i.e., L_{S_2}(h_{w_2})). The difference is the penalty R_2 that we pay for choosing S_1 over S_2 (and similarly, R_1 is the penalty that we would pay for choosing S_2 over S_1).
R_2 is illustrated in Figure 1 for a hypothetical non-convex loss as a function of a one-dimensional parameter. The expected penalty measures how much, in an iteration, a model updated on one batch (S_1) is able to generalize on average to another batch (S_2) from the dataset. Hence, we call R the generalization penalty.
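The penalty described above can be computed directly in a toy setting. The sketch below uses a linear least-squares model with hypothetical data and learning rate (none of which come from the paper): it takes one SGD step on S_1 and, counterfactually, one step on S_2 from the same starting point, and compares the resulting losses on S_2, giving R_2 = L_{S_2}(h_{w_1}) - L_{S_2}(h_{w_2}).

```python
import numpy as np

def loss(w, X, y):
    """Average squared loss over one batch."""
    return 0.5 * np.mean((X @ w - y) ** 2)

def grad(w, X, y):
    """Average gradient of the squared loss over one batch."""
    return X.T @ (X @ w - y) / len(y)

# Hypothetical data split into two batches S1 and S2.
rng = np.random.default_rng(1)
X = rng.normal(size=(64, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=64)
S1 = (X[:32], y[:32])
S2 = (X[32:], y[32:])

eta = 0.1           # illustrative learning rate
w = np.zeros(5)     # parameter w^(t) before the update
w1 = w - eta * grad(w, *S1)   # w^(t+1) if S1 is selected
w2 = w - eta * grad(w, *S2)   # w^(t+1) if S2 is selected

# R_2: extra loss on S2 incurred by having updated on S1 instead of S2.
R2 = loss(w1, *S2) - loss(w2, *S2)
print(R2)
```

When the two batch gradients are closely aligned, R_2 is small; when they disagree, the step on S_1 helps S_2 little (or hurts it), and R_2 grows, which is the intuition connecting the penalty to gradient disparity.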

