EARLY STOPPING BY GRADIENT DISPARITY

Abstract

Validation-based early stopping is one of the most popular techniques used to avoid over-training deep neural networks. It requires setting aside a reliable, unbiased validation set, which can be expensive in applications that offer limited amounts of data. In this paper, we propose to use gradient disparity, which we define as the ℓ2 norm distance between the gradient vectors of two batches drawn from the training set. It derives from a probabilistic upper bound on the difference between the classification errors over a given batch, when the network is trained on this batch and when the network is trained on another batch of points sampled from the same dataset. We empirically show that gradient disparity is a very promising early-stopping criterion when data is limited, because it uses all the training samples during training. Furthermore, we show in a wide range of experimental settings that gradient disparity is not only strongly related to the usual generalization error between the training and test sets, but that it is also much more informative about the level of label noise.

1. INTRODUCTION

Early stopping is a commonly used regularization technique to avoid under- or over-fitting deep neural networks trained with iterative methods, such as gradient descent (Prechelt, 1998; Yao et al., 2007; Gu et al., 2018). To have an unbiased proxy for the generalization error, early stopping requires a separate, accurately labeled validation set. However, labeled data collection is an expensive and time-consuming process that might require domain expertise (Roh et al., 2019). Moreover, deep learning is becoming popular for new and critical applications for which there is simply not enough available data. Hence, it is advantageous to have a signal of overfitting that does not require a validation set, so that all the available data can be used for training the model.

Let S_1 and S_2 be two batches of points sampled from the available (training) dataset. Suppose that S_1 is selected for an iteration (step) of stochastic gradient descent (SGD), which then updates the parameter vector to w_1. The average loss over S_1 is in principle reduced, given a sufficiently small learning rate. However, the average loss over the other batch S_2, i.e., L_{S_2}(h_{w_1}), is not as likely to be reduced. It will on average remain larger than the loss computed over S_2 if it had been S_2, instead of S_1, that was selected for this iteration, i.e., L_{S_2}(h_{w_2}). The difference is the penalty R_2 that we pay for choosing S_1 over S_2 (and similarly, R_1 is the penalty that we would pay for choosing S_2 over S_1). R_2 is illustrated in Figure 1 for a hypothetical non-convex loss as a function of a one-dimensional parameter.

[Figure 1: the y-axis is the loss and the x-axis indicates the parameters of the model. L_{S_1} and L_{S_2} are the average losses over batches S_1 and S_2, respectively. w^(t) is the parameter at iteration t, and w^(t+1)_i is the parameter at iteration t + 1 if batch S_i was selected for the update step at iteration t, with i ∈ {1, 2}.]
The expected penalty measures how much, in an iteration, a model updated on one batch (S_1) is able to generalize on average to another batch (S_2) from the dataset. Hence, we call R the generalization penalty. We establish a probabilistic upper bound on the sum of the expected penalties E[R_1] + E[R_2] by adapting the PAC-Bayesian framework (McAllester, 1999a;b; 2003), given a pair of batches S_1 and S_2 sampled from the dataset (Theorem 1). Interestingly, under some mild assumptions, this upper bound is essentially a simple expression driven by ‖g_1 − g_2‖_2, where g_1 and g_2 are the gradient vectors over the two batches S_1 and S_2, respectively. We call this gradient disparity: it measures how a small gradient step on one batch negatively affects the performance on another one. Gradient disparity is simple to use and computationally tractable during the course of training. Our experiments on state-of-the-art configurations suggest a very strong link between gradient disparity and generalization error; we propose gradient disparity as an effective early-stopping criterion. Gradient disparity is particularly useful when the available dataset has limited labeled data, because it does not require splitting the available dataset into training and validation sets, so that all the available data can be used during training, unlike for instance k-fold cross-validation. We observe that using gradient disparity, instead of an unbiased validation set, results in at least a 1% predictive performance improvement for critical applications with limited and very costly data, such as the MRNet dataset (Bien et al., 2018), a small image-classification dataset used for detecting knee injuries (Table 1).

[Table 1: comparison of 5-fold cross-validation (5-fold CV) and gradient disparity (GD), when both are used as early-stopping criteria for detecting the presence of abnormality, ACL tears, and meniscal tears from the sagittal-plane MRI scans of the MRNet dataset.]
The corresponding curves during training are shown in Figure 10. The results of early stopping are given both when the metric has increased for 5 epochs from the beginning of training and, in parentheses, when the metric has increased for 5 consecutive epochs. Moreover, when the available dataset contains noisy labels, the validation set is no longer a reliable predictor of the clean test set (see e.g., Figure 9 (a) (left)), whereas gradient disparity correctly predicts the performance on the test set and again can be used as a promising early-stopping criterion. Furthermore, we observe that gradient disparity is a better indicator of the label noise level than the generalization error, especially at early stages of training. Similarly to the generalization error, it decreases with the training set size and increases with the batch size.

Paper Outline. In Section 2, we formally define the generalization penalty. In Section 3, we give the upper bound on the generalization penalty. In Section 4, we introduce the gradient disparity metric. In Section 5, we present experiments that support gradient disparity as an early stopping criterion. In Section 6, we assess gradient disparity as a generalization metric. Finally, in Section 7, we further discuss the observations and compare gradient disparity to related work. A detailed comparison to related work is deferred to Appendix H. For our experiments, we consider four image classification datasets (MNIST, CIFAR-10, CIFAR-100 and MRNet) and a wide range of neural network architectures (ResNet, VGG, AlexNet and fully connected neural networks).

2. GENERALIZATION PENALTY

Consider a classification task with input x ∈ X := R^n and ground-truth label y ∈ {1, 2, ..., k}, where k is the number of classes. Let h_w ∈ H : X → Y := R^k be a predictor (classifier) parameterized by the parameter vector w ∈ R^d, and let l(·,·) be the 0-1 loss function

l(h_w(x), y) = 1[ h_w(x)[y] < max_{j ≠ y} h_w(x)[j] ]

for all h_w ∈ H and (x, y) ∈ X × {1, 2, ..., k}. The expected loss and the empirical loss over the training set S of size m are respectively defined as

L(h_w) = E_{(x,y)∼D}[ l(h_w(x), y) ]  and  L_S(h_w) = (1/m) Σ_{i=1}^{m} l(h_w(x_i), y_i),   (1)

where D is the probability distribution of the data points and (x_i, y_i) are i.i.d. samples drawn from S ∼ D^m. L_S(h_w) is also called the training classification error. Similar to the notation used in (Dziugaite & Roy, 2017), distributions on the hypothesis space H are simply distributions on the underlying parameterization. With some abuse of notation, ∇L_{S_i} refers to the gradient with respect to the surrogate differentiable loss function, which in our experiments is the cross entropy.

In a mini-batch gradient descent (SGD) setting, consider two batches of points, denoted by S_1 and S_2, with respectively m_1 and m_2 samples, where m_1 + m_2 ≤ m. The average loss functions over these two sets of samples are L_{S_1}(h_w) and L_{S_2}(h_w), respectively. Let w = w^(t) be the parameter vector at the beginning of an iteration t. If S_1 is selected for the next iteration, w gets updated to w_1 = w^(t+1) with

w_1 = w − γ ∇L_{S_1}(h_w),   (2)

where γ is the learning rate. Conversely, if S_2 had been selected instead of S_1, the updated parameter vector at the end of this iteration would have been w_2 = w − γ ∇L_{S_2}(h_w). The generalization penalty on batch S_2 is therefore defined as

R_2 = L_{S_2}(h_{w_1}) − L_{S_2}(h_{w_2}),

which is the gap between the loss over S_2, L_{S_2}(h_{w_1}), and its target value, L_{S_2}(h_{w_2}), at the end of iteration t.
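As an illustration of these definitions, the following minimal NumPy sketch computes the 0-1 loss and the empirical loss L_S(h_w) for a toy linear scorer; the helper names `zero_one_loss` and `empirical_loss` and all sizes are ours, chosen for illustration only.

```python
import numpy as np

def zero_one_loss(scores, y):
    """0-1 loss: 1 if the score of the true class y is below the best other class."""
    return float(scores[y] < np.max(np.delete(scores, y)))

def empirical_loss(score_fn, X, Y):
    """L_S(h_w): average 0-1 loss over a sample set S = {(x_i, y_i)}."""
    return np.mean([zero_one_loss(score_fn(x), y) for x, y in zip(X, Y)])

# Toy linear scorer h_w(x) = W x with k = 3 classes and n = 4 input features.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
X = rng.normal(size=(20, 4))
Y = rng.integers(0, 3, size=20)

err = empirical_loss(lambda x: W @ x, X, Y)  # training classification error
print(err)
```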
When selecting S_1 for the parameter update, Equation (2) makes a step towards learning the input-output relations of batch S_1. If this negatively affects the performance on batch S_2, R_2 will be large: the model is learning data structures that are unique to S_1 and that do not appear in S_2. Because S_1 and S_2 are batches of points sampled from the same distribution D, they have data structures in common. If, throughout the learning process, we consistently observe that, in each update step, the model learns structures unique to only one batch, then it is very likely that the model is memorizing the labels instead of learning the common data structures. This is captured by the generalization penalty R.
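The penalty can be made concrete with a small sketch. The hedged NumPy example below uses a logistic-regression model and its differentiable surrogate loss (as a stand-in for a network trained with cross-entropy) to compute R_2 after one SGD step; the helper name `loss_and_grad` and all hyperparameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_and_grad(w, X, y):
    """Average logistic (surrogate) loss over a batch and its gradient."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    grad = X.T @ (p - y) / len(y)
    return loss, grad

# Two batches S1, S2 drawn from the same synthetic distribution.
d, m = 10, 32
w_true = rng.normal(size=d)
X1, X2 = rng.normal(size=(m, d)), rng.normal(size=(m, d))
y1 = (X1 @ w_true > 0).astype(float)
y2 = (X2 @ w_true > 0).astype(float)

w = rng.normal(size=d) * 0.1   # parameter vector at the start of iteration t
gamma = 0.5                    # learning rate

_, g1 = loss_and_grad(w, X1, y1)
_, g2 = loss_and_grad(w, X2, y2)
w1 = w - gamma * g1            # update if S1 is selected (Equation (2))
w2 = w - gamma * g2            # update if S2 had been selected instead

# Generalization penalty on S2: loss over S2 after stepping on S1,
# minus its target value after stepping on S2 itself.
R2 = loss_and_grad(w1, X2, y2)[0] - loss_and_grad(w2, X2, y2)[0]
print(f"R2 = {R2:.4f}")        # typically non-negative for a small enough step
```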

3. BOUND ON THE GENERALIZATION PENALTY

We adapt the PAC-Bayesian framework (McAllester, 1999a;b) to account for the trajectory of the learning algorithm: for each learning iteration, we define a prior and two possible posteriors, depending on the choice of the batch selection. Let w ∼ P be an initial parameter vector that follows a prior distribution P, which is an F_t-measurable function, where F_t denotes the filtration of the available information at the beginning of iteration t. Let h_{w_1}, h_{w_2} be the two learned single predictors, at the end of iteration t, from S_1 and S_2, respectively. In this framework, for i ∈ {1, 2}, each predictor h_{w_i} is randomized and becomes h_{ν_i} with ν_i = w_i + u_i, where u_i is a random variable whose distribution might depend on S_i. Let Q_i be the distribution of ν_i, which is a distribution over the predictor space H that depends on S_i via w_i and possibly u_i. Let G_i be a σ-field such that σ(S_i) ∪ F_t ⊂ G_i and such that the posterior distribution Q_i is G_i-measurable, for i ∈ {1, 2}. We further assume that the random variable ν_1 ∼ Q_1 is statistically independent from the draw of the batch S_2 and, vice versa, that ν_2 ∼ Q_2 is independent from the batch S_1, i.e., G_1 ⊥⊥ σ(S_2) and G_2 ⊥⊥ σ(S_1).

Theorem 1. For any δ ∈ (0, 1], with probability at least 1 − δ over the sampling of sets S_1 and S_2, the sum of the expected penalties conditional on S_1 and S_2, respectively, satisfies

E[R_1] + E[R_2] ≤ √( (2 KL(Q_2||Q_1) + 2 ln(2m_2/δ)) / (m_2 − 2) ) + √( (2 KL(Q_1||Q_2) + 2 ln(2m_1/δ)) / (m_1 − 2) ).   (3)

Theorem 1, whose proof is given in Appendix B, shows why generalization penalties are better suited than the usual generalization errors to our setting, where the two batches S_1 and S_2 are both drawn from the training set S. After an iteration, the network learns a posterior distribution Q_1 on its parameters from S_1, yielding the parameter vector ν_1 ∼ Q_1.
The expected generalization error at that time is defined as GE_1 = E_{ν_1∼Q_1}[L(h_{ν_1})] − E_{ν_1∼Q_1}[L_{S_1}(h_{ν_1})]. In practice, L(h_{ν_1}) is estimated by the test loss over a batch of unseen data, which is independent from ν_1 ∼ Q_1. If S_2 is this batch, then GE_1 ≈ E_{ν_1∼Q_1}[L_{S_2}(h_{ν_1})] − E_{ν_1∼Q_1}[L_{S_1}(h_{ν_1})]. However, this estimate requires setting S_2 aside from S, not only during that step but also during all the previous steps, because otherwise the model h_{ν_1} would not be independent from S_2, making the estimate of GE_1 biased. Therefore S_2 must be sampled from the validation set and cannot be used during training. In contrast, Theorem 1 is valid even if the trained model h_{ν_1} depends on the samples within the batch S_2. Therefore, the bound on the sum of the (expected) generalization penalties no longer requires setting S_2 aside from S in previous iterations; all data previously reserved for validation can now be used for training. This is what makes these penalties appealing for measuring generalization, especially when the available dataset is limited and/or noisy, as we will see in Section 5. Theorem 1 remains valid if batch S_2 is sampled from the validation set, in which case it can be compared with known generalization error (GE) bounds, as then h_{ν_1} does not depend on samples of S_2. Similarly to GE_1, let GE_2 be the generalization error when S_2 is the training set, while S_1 is the test set. By adding GE_1 and GE_2 we obtain E[R_1] + E[R_2]; hence Theorem 1 also upper bounds an estimate of GE_1 + GE_2. We could have obtained another upper bound by directly applying the bounds from (McAllester, 2003; Neyshabur et al., 2017b), which gives

E[R_1] + E[R_2] ≤ 2 √( (2 KL(Q_2||P) + 2 ln(2m_2/δ)) / (m_2 − 1) ) + 2 √( (2 KL(Q_1||P) + 2 ln(2m_1/δ)) / (m_1 − 1) ).   (4)
The main difference between Equations (3) and (4) is that the former involves only the difference between the two posterior distributions, Q_1 and Q_2, whereas the latter requires the difference between the prior distribution P and each of the posterior distributions Q_1 and Q_2. Besides the minor difference of the multiplicative factor 2, the upper bound in Equation (3) is non-vacuous for a larger class of distributions than the upper bound in Equation (4): when Q_1 and Q_2 are close to each other but not to P, the upper bound in Equation (3) is much tighter than the one in Equation (4). Moreover, in the next section, we show that, under reasonable assumptions, the upper bound in Equation (3) boils down to a very tractable generalization metric that we call gradient disparity.
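To see numerically why the posterior-to-posterior bound can be much tighter than the posterior-to-prior one, the sketch below evaluates both bounds in a regime where the two posteriors are close to each other but far from the prior. The KL values and batch sizes are invented for illustration only, and the resulting numbers are merely indicative of relative tightness.

```python
import math

def bound_eq3(kl12, kl21, m1, m2, delta=0.05):
    """Posterior-to-posterior bound of Equation (3)."""
    return (math.sqrt((2 * kl21 + 2 * math.log(2 * m2 / delta)) / (m2 - 2))
            + math.sqrt((2 * kl12 + 2 * math.log(2 * m1 / delta)) / (m1 - 2)))

def bound_eq4(kl1p, kl2p, m1, m2, delta=0.05):
    """Posterior-to-prior bound of Equation (4)."""
    return (2 * math.sqrt((2 * kl2p + 2 * math.log(2 * m2 / delta)) / (m2 - 1))
            + 2 * math.sqrt((2 * kl1p + 2 * math.log(2 * m1 / delta)) / (m1 - 1)))

# Hypothetical regime: small KL between the posteriors Q1 and Q2,
# but large KL between each posterior and the prior P.
m1 = m2 = 128
b3 = bound_eq3(kl12=0.5, kl21=0.5, m1=m1, m2=m2)
b4 = bound_eq4(kl1p=50.0, kl2p=50.0, m1=m1, m2=m2)
print(f"Equation (3): {b3:.3f}   Equation (4): {b4:.3f}")  # (3) is much smaller
```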

4. GRADIENT DISPARITY

The randomness modeled by u_i, conditioned on the current batch S_i, comes from (i) the parameter vector w at the beginning of the iteration, which itself results from the random parameter initialization and the stochasticity of the parameter updates up to that iteration, and (ii) the gradient vector ∇L_{S_i}, which may also be random because of possible additional randomness in the network structure, due for instance to dropout (Srivastava et al., 2014). A common assumption made in the literature is that the random perturbation u_i follows a normal distribution (Bellido & Fiesler, 1993; Neyshabur et al., 2017b). The upper bound in Theorem 1 takes a particularly simple form if we assume that, for i ∈ {1, 2}, the random perturbations u_i are zero-mean i.i.d. normal (u_i ∼ N(0, σ²I)), and that w_i is fixed, as in the setting of (Dziugaite & Roy, 2017). KL(Q_1||Q_2) is then the KL-divergence between two multivariate normal distributions.

Let us denote ∇L_{S_1}(h_w) by g_1 ∈ R^d and ∇L_{S_2}(h_w) by g_2 ∈ R^d. As w_i = w − γ g_i for i ∈ {1, 2}, the KL-divergence between Q_1 = N(w_1, σ²I) and Q_2 = N(w_2, σ²I) (Lemma 1 in Appendix A) is simply

KL(Q_1||Q_2) = (γ² / (2σ²)) ‖g_1 − g_2‖₂² = KL(Q_2||Q_1),

which shows that, keeping a constant step size γ and assuming the same variance σ² for the random perturbations in all the steps of the training, the bound in Theorem 1 is driven by ‖g_1 − g_2‖₂. This indicates that the smaller the ℓ2 distance between the gradient vectors, the lower the upper bound on the generalization penalty, and therefore the closer the performance of a model trained on one batch is to that of a model trained on another batch. For two batches of points S_i and S_j, with gradient vectors g_i and g_j, respectively, we define the gradient disparity (GD) between S_i and S_j as

D_{i,j} = ‖g_i − g_j‖₂.   (6)

Gradient disparity is empirically tractable, and provides a probabilistic guarantee on the sum of the generalization penalties of S_i and S_j, modulo the Gaussianity assumptions made in this section. Gradient disparity can be computed within batches of the training or the validation set. As discussed in Section 3, we focus on the first case, to have a generalization metric that does not require validation data and that provides an early stopping criterion with all the available data used for training. We focus on the vanilla SGD optimizer. In Appendix G, we extend the analysis to other adaptive optimizers: SGD with momentum (Qian, 1999), Adagrad (Duchi et al., 2011), Adadelta (Zeiler, 2012), and Adam (Kingma & Ba, 2014).

Footnote 2 (from Section 3): More formally, GE_1 = E_{ν_1∼Q_1}[L_{S_2}(h_{ν_1})] − E_{ν_1∼Q_1}[L_{S_1}(h_{ν_1})] + ζ, with ζ = E_{ν_1∼Q_1}[L(h_{ν_1})] − E_{ν_1∼Q_1}[L_{S_2}(h_{ν_1})], where from Hoeffding's bound (Theorem 2 in Appendix A) P(|ζ| ≥ t) ≤ exp(−2 m_2 t²), and GE_1 is approximated by the first term.
In all these optimizers, we observe that gradient disparity (Equation (6)) appears in KL(Q_1||Q_2) along with other factors that depend on a decaying average of past gradient vectors. Experimental results support the use of gradient disparity as an early stopping metric for these popular optimizers as well (see Figure 22 in Appendix G). Computing the gradient disparity averaged over B batches requires all B gradient vectors at each iteration, which is computationally expensive if B is large. We approximate it by computing it over only a much smaller subset of the batches, of size s ≪ B:

D̄ = (1 / (s(s − 1))) Σ_{i=1}^{s} Σ_{j=1, j≠i}^{s} D_{i,j}.

In the experiments presented in this paper, s = 5; we observed that such a small subset is already sufficient (see Appendix C.2 for an experimental comparison of different values of s). We also present, in Appendix F, an alternative approach that relies on the distribution of gradient disparity instead of its average, to obtain a finer-grained signal of overfitting. As training progresses, the gradient magnitude starts to decrease, and therefore the value of gradient disparity might decrease not because the distance between two gradient vectors is decreasing, but because their magnitudes are decreasing. Hence, in order to compare different stages of training, we re-scale the loss values within each batch before computing the gradient disparities in D̄ (refer to Appendix C.1 for more details). Moreover, we consider the mean square error (MSE) as the choice of surrogate loss in Appendix C.3, and we observe that gradient disparity is positively correlated with the MSE test loss as well.
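Under the notation above, the averaged disparity D̄ can be sketched as follows. This minimal NumPy example uses a logistic model as a stand-in for a network and omits the per-batch loss re-scaling of Appendix C.1; the helper name `grad` and all sizes are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

def grad(w, X, y):
    """Gradient of the average logistic surrogate loss over one batch."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y) / len(y)

# s small batches from a synthetic dataset (s = 5, as in the paper).
d, batch_size, s = 10, 32, 5
w_true = rng.normal(size=d)
batches = []
for _ in range(s):
    X = rng.normal(size=(batch_size, d))
    batches.append((X, (X @ w_true > 0).astype(float)))

w = rng.normal(size=d) * 0.1
grads = [grad(w, X, y) for X, y in batches]

# D̄ = (1/(s(s-1))) * sum_{i != j} ||g_i - g_j||_2; each unordered pair
# contributes twice to the ordered double sum, hence the factor 2.
pair_sum = sum(np.linalg.norm(gi - gj) for gi, gj in combinations(grads, 2))
D_bar = 2 * pair_sum / (s * (s - 1))
print(f"average gradient disparity: {D_bar:.4f}")
```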

5. GRADIENT DISPARITY AS AN EARLY STOPPING CRITERION

Comparison to k-fold Cross Validation. Early stopping is a popular technique used in practice to avoid overfitting (Prechelt, 1998; Yao et al., 2007; Gu et al., 2018): the optimization is stopped when the performance of the model on a validation set starts to diverge from its performance on the training set. Early stopping is of particular interest in the presence of label noise, because the model first learns the samples with correct labels and only later the corrupted samples (Li et al., 2019). To emphasize the particular applications of gradient disparity, we compare it to k-fold cross-validation in two settings: (i) when the available dataset is limited and (ii) when the available dataset has corrupted labels. We simulate the limited-data scenario by using a small subset of three image-classification benchmark datasets (MNIST, CIFAR-10 and CIFAR-100), and the noisy-label scenario by using a corrupted version of these datasets. We also evaluate gradient disparity on a medical dataset with limited available data (the MRNet dataset), for which we use the entire dataset for training.

(i) We compare gradient disparity with k-fold cross-validation (Stone, 1974) used as an early stopping criterion in Table 2 (top), when there is limited labeled data. We observe that gradient disparity performs well as an early stopping criterion, especially when the data is complex (CIFAR-100 is more complex than CIFAR-10). As it uses every available sample, instead of only a (1 − 1/k) fraction of the dataset, it results in a better performance on the final unseen (test) data (see also Table 4 and Figure 8 in Appendix D). In many real-world applications, collecting (labeled) data is very costly; in some medical applications, it requires the high costs of patient data collection and of medical staff expertise.
For an example of such an application, we consider the MRNet dataset (Bien et al., 2018), which contains a limited number of MRI scans, to study the presence of abnormality, ACL tears and meniscal tears in knee injuries. We observe that using gradient disparity instead of a validation set results in over 1% improvement (on average over all three tasks) in the test AUC score, and therefore in additional correct detections for more than one patient in each task (see Table 1 and Figure 10 in Appendix D).

(ii) When the labels of the available data are noisy, the validation set is no longer a reliable estimate of the unseen set (this can be clearly observed in Figure 9 (left column)). Nevertheless, and although it is computed over the noisy training set, gradient disparity reflects the performance on the test set quite well (Figure 9 (middle left column)). As a result (see Table 2 (bottom)), gradient disparity performs better than k-fold cross-validation as an early stopping criterion. The same applies to two other datasets (see Table 6 and the corresponding figure in the appendix).

6. GRADIENT DISPARITY AS A GENERALIZATION METRIC

In this section, we demonstrate that factors that improve or degrade the generalization performance of a model (e.g., label noise level, training set size and batch size) often have a strikingly similar effect on the value of gradient disparity.

Label Noise Level. Deep neural networks trained with the SGD algorithm achieve excellent generalization performance (Hardt et al., 2015), while achieving zero training error on randomly labeled data in classification tasks (Zhang et al., 2016). Understanding what distinguishes a model trained on correct labels from one trained on randomly labeled data is still an evolving area of research. We conjecture that, as the label noise level increases, the gradient vectors diverge more.
Hence, when the network is trained with correctly labeled samples, the gradient disparity is low, whereas when it is trained with corrupted samples, the gradient disparity is high. The experimental results support this conjecture in a wide range of settings and show that gradient disparity is indeed very sensitive to the label noise level (see also Figures 12, 15, 19 and 20).

Batch Size. In practice, the test error increases with the batch size (Figure 4 (left)). We observe that gradient disparity also increases with the batch size (Figure 4 (right)). This observation is counter-intuitive, because one might expect gradient vectors to become more similar when they are averaged over a larger batch. This might explain the decrease in gradient disparity from batch size 256 to 512 for the VGG-19 network. Observe also that gradient disparity correctly predicts that VGG-19 generalizes better than ResNet-34 on this dataset. Gradient disparity matches the ranking of test errors for different networks trained with different batch sizes, as long as the batch sizes are not too large (see also Figure 18 in Appendix E).

Width. In practice, the test error has been observed to decrease with the network width. We observe that gradient disparity (normalized with respect to the number of parameters) also decreases with network width, for ResNet, VGG and fully connected neural networks trained on the CIFAR-10 dataset (see Figure 14 in Appendix E.2).
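The stopping rule used in the experiments (stop once the monitored metric has increased for 5 epochs, either consecutively or counted from the beginning of training) can be sketched as a small patience loop; the function name `early_stop_epoch` and the gradient-disparity trace below are hypothetical.

```python
def early_stop_epoch(metric_per_epoch, patience=5, consecutive=True):
    """Return the epoch index at which to stop, or None if the rule never fires.

    Stops once the monitored metric (e.g., average gradient disparity) has
    increased for `patience` epochs: consecutively if `consecutive` is True,
    otherwise counted from the beginning of training. These are the two
    variants reported in the paper's tables.
    """
    increases = 0
    for epoch in range(1, len(metric_per_epoch)):
        if metric_per_epoch[epoch] > metric_per_epoch[epoch - 1]:
            increases += 1
        elif consecutive:
            increases = 0  # a decrease resets the consecutive counter
        if increases >= patience:
            return epoch
    return None

# Hypothetical trace: disparity decreases, then rises as overfitting kicks in.
gd_trace = [5.0, 4.2, 3.6, 3.3, 3.1, 3.2, 3.4, 3.7, 4.1, 4.6, 5.2]
print(early_stop_epoch(gd_trace))  # → 9 (after 5 consecutive increases)
```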

7. DISCUSSION AND RELATED WORK

Finding a practical metric that completely captures the generalization properties of deep neural networks, and in particular indicates the level of randomness in the labels and decreases with the size of the training set, is still an active research direction (Dziugaite & Roy, 2017; Neyshabur et al., 2017a; Nagarajan & Kolter, 2019). A very recent line of work assesses the similarity between the gradient updates of two batches (samples) in the training set. The coherent gradient hypothesis (Chatterjee, 2020) states that the gradient is stronger in directions where similar examples exist, and that the parameter update is biased towards these directions. He & Su (2020) presents the local elasticity phenomenon, which measures how the prediction over one sample changes as the network is updated on another sample. The generalization penalty introduced in our work measures how the prediction over one sample (batch) changes when the network is updated on that same sample instead of being updated on another sample, which can implicitly signal overfitting within the training set. Tracking generalization by measuring the similarity between gradient vectors is particularly beneficial, as it is empirically tractable during training and does not require access to unseen data. Sankararaman et al. (2019) proposes gradient confusion, a bound on the inner product of two gradient vectors, and shows that the larger the gradient confusion, the slower the convergence. Gradient interference (when the inner product of gradient vectors is negative) has been studied in multi-task learning, reinforcement learning and temporal-difference learning (Riemer et al., 2018; Liu et al., 2019; Bengio et al., 2020). Yin et al. (2017) studies the relation between gradient diversity, which measures the dissimilarity between gradient vectors, and the convergence performance of distributed SGD algorithms. Fort et al.
(2019) proposes a metric called stiffness, the cosine similarity between two gradient vectors, and shows empirically that it is related to generalization. Fu et al. (2020) studies the cosine similarity between two gradient vectors for natural language processing tasks. Mehta et al. (2020) measures the alignment between the gradient vectors within the same class (denoted by Ω_c), and studies the relation between Ω_c and generalization as the scale of initialization is increased. Another interesting line of work is the study of the variance of gradients in deep learning settings. Negrea et al. (2019) derives mutual-information generalization error bounds for stochastic gradient Langevin dynamics (SGLD) as a function of the sum (over the iterations) of squared gradient incoherences, which is closely related to the variance of gradients. Two-sample gradient incoherences also appear in Haghifam et al. (2020), where they are taken between a training sample and a "ghost" sample that is not used during training and is therefore taken from a validation set (unlike gradient disparity). The upper bounds in Negrea et al. (2019); Haghifam et al. (2020) are not intended to be used as early stopping criteria and are cumulative bounds that increase with the number of iterations. As shown in Appendix G, gradient disparity can be used as an early stopping criterion not only for SGD with additive noise (such as SGLD), but also for other adaptive optimizers. Jastrzebski et al. (2020) studies the effect of the learning rate on the variance of gradients and hypothesizes that gradient variance counter-intuitively increases with the batch size, which is consistent with our observations. However, Qian & Klabjan (2020) shows that the variance of gradients is a decreasing function of the batch size. Jastrzebski et al. (2020); Qian & Klabjan (2020) mention the connection between the variance of gradients and generalization as a promising future direction.
Our study shows that the variance of gradients used as an early stopping criterion outperforms k-fold cross-validation (see Table 8). Mahsereci et al. (2017) proposes an early stopping criterion called the evidence-based (EB) criterion that, similarly to gradient disparity, eliminates the need for a held-out validation set. The EB-criterion is negatively related to the signal-to-noise ratio (SNR) of the gradient vectors. Liu et al. (2020) also proposes a relation between the gradient SNR (called GSNR) and the one-step generalization error, under the assumption that both the training and test sets are large, whereas gradient disparity targets limited datasets. Nevertheless, we have compared gradient disparity to these metrics (namely, EB, GSNR, gradient inner product, sign of the gradient inner product, variance of gradients, cosine similarity, and Ω_c) in Appendix H. In Table 8, we observe that gradient disparity and the variance of gradients, used as early stopping criteria, are the only metrics that consistently outperform k-fold cross-validation and that are more informative of the level of label noise than the other metrics. We observe, however, that the correlation between gradient disparity and the test loss is in general larger than the correlation between the variance of gradients and the test loss (Table 9). A common drawback of metrics based on the similarity between two gradient vectors, including gradient disparity, is that they are not informative when the gradient vectors are very small. In practice, however, we observe (see for instance Figure 13) that the time at which the test and training losses start to diverge, which is when overfitting kicks in, not only coincides with the time at which gradient disparity increases, but also occurs well before the training loss becomes infinitesimal. Hence, this drawback is unlikely to cause a problem for gradient disparity when it is used as an early stopping criterion.
Nevertheless, Theorem 1 trivially holds when the gradient values are infinitesimal.

Conclusion.

In this work, we propose gradient disparity, the ℓ2 norm of the difference between the gradient vectors of pairs of batches in the training set. Our empirical results on state-of-the-art configurations show a strong link between gradient disparity and generalization error. Gradient disparity, similarly to the test error, increases with the label noise level, decreases with the size of the training set, and increases with the batch size. We therefore suggest gradient disparity as a promising early stopping criterion that does not require access to a validation set, particularly when the available dataset is limited or noisy.

A ADDITIONAL THEOREM

Hoeffding's bound is used in the proof of Theorem 1, and Lemma 1 is used in Section 4.

Theorem 2 (Hoeffding's Bound). Let Z_1, ..., Z_n be independent bounded random variables on [a, b] (i.e., Z_i ∈ [a, b] for all 1 ≤ i ≤ n, with −∞ < a ≤ b < ∞). Then

P( (1/n) Σ_{i=1}^{n} (Z_i − E[Z_i]) ≥ t ) ≤ exp( −2nt² / (b − a)² )

and

P( (1/n) Σ_{i=1}^{n} (Z_i − E[Z_i]) ≤ −t ) ≤ exp( −2nt² / (b − a)² )

for all t ≥ 0.

Lemma 1. If N_1 = N(μ_1, Σ_1) and N_2 = N(μ_2, Σ_2) are two multivariate normal distributions in R^d, where Σ_1 and Σ_2 are positive definite, then

KL(N_1||N_2) = (1/2) [ tr(Σ_2^{−1} Σ_1) − d + (μ_2 − μ_1)^T Σ_2^{−1} (μ_2 − μ_1) + ln( det Σ_2 / det Σ_1 ) ].
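As a quick numerical sanity check of Lemma 1, the sketch below implements the KL formula and verifies that, in the isotropic case Q_i = N(w − γ g_i, σ²I) of Section 4, it reduces to (γ²/(2σ²)) ‖g_1 − g_2‖₂²; the helper name `kl_mvn`, the dimensions and the constants are arbitrary choices for illustration.

```python
import numpy as np

def kl_mvn(mu1, cov1, mu2, cov2):
    """KL(N1 || N2) between two multivariate normals, following Lemma 1."""
    d = len(mu1)
    inv2 = np.linalg.inv(cov2)
    diff = mu2 - mu1
    return 0.5 * (np.trace(inv2 @ cov1) - d
                  + diff @ inv2 @ diff
                  + np.log(np.linalg.det(cov2) / np.linalg.det(cov1)))

# Isotropic special case of Section 4: Q_i = N(w - gamma * g_i, sigma^2 I).
rng = np.random.default_rng(2)
d, gamma, sigma = 8, 0.1, 0.05
w = rng.normal(size=d)
g1, g2 = rng.normal(size=d), rng.normal(size=d)
cov = sigma**2 * np.eye(d)

kl = kl_mvn(w - gamma * g1, cov, w - gamma * g2, cov)
closed_form = gamma**2 / (2 * sigma**2) * np.linalg.norm(g1 - g2)**2
# KL(Q1||Q2) = (gamma^2 / (2 sigma^2)) * ||g1 - g2||_2^2
assert np.isclose(kl, closed_form)
```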

B PROOF OF THEOREM 1

Proof. We compute the upper bound in Equation (3) using a similar approach as in McAllester (2003). The main challenge in the proof is the definition of a function X_{S_2} of the variables and parameters of the problem, which can then be bounded using similar techniques as in McAllester (2003). S_1 is a batch of points (of size m_1) that is randomly drawn from the available set S at the beginning of iteration t, and S_2 is a batch of points (of size m_2) that is randomly drawn from the remaining set S \ S_1. Hence, S_1 and S_2 are drawn from the set S without replacement (S_1 ∩ S_2 = ∅). Similarly to the setting of Negrea et al. (2019); Dziugaite et al. (2020), as the random selection of the indices of S_1 and S_2 is independent from the dataset S, we have σ(S_1) ⊥⊥ σ(S_2), and as a result, G_1 ⊥⊥ σ(S_2) and G_2 ⊥⊥ σ(S_1). Recall that ν_i is the random parameter vector at the end of iteration t that depends on S_i, for i ∈ {1, 2}. For a given sample set S_i, denote the conditional probability distribution of ν_i by Q_{S_i}. For ease of notation, we write Q_{S_i} as Q_i.

Let us denote

$$\Delta(h_{\nu_1}, h_{\nu_2}) \triangleq \left(L_{S_2}(h_{\nu_1}) - L(h_{\nu_1})\right) - \left(L_{S_2}(h_{\nu_2}) - L(h_{\nu_2})\right),$$
and
$$X_{S_2} \triangleq \sup_{Q_1, Q_2}\; \left(\frac{m_2}{2}-1\right)\mathbb{E}_{\nu_1\sim Q_1}\mathbb{E}_{\nu_2\sim Q_2}\!\left[\left(\Delta(h_{\nu_1}, h_{\nu_2})\right)^2\right] - \mathrm{KL}(Q_2\|Q_1). \tag{8}$$
Note that $X_{S_2}$ is a random function of the batch $S_2$. Expanding the KL-divergence, we find that
$$\begin{aligned}
\left(\frac{m_2}{2}-1\right)&\mathbb{E}_{\nu_1\sim Q_1}\mathbb{E}_{\nu_2\sim Q_2}\!\left[(\Delta(h_{\nu_1},h_{\nu_2}))^2\right] - \mathrm{KL}(Q_2\|Q_1) \\
&= \mathbb{E}_{\nu_1\sim Q_1}\!\left[\left(\frac{m_2}{2}-1\right)\mathbb{E}_{\nu_2\sim Q_2}\!\left[(\Delta(h_{\nu_1},h_{\nu_2}))^2\right] + \mathbb{E}_{\nu_2\sim Q_2}\!\left[\ln\frac{Q_1(\nu_2)}{Q_2(\nu_2)}\right]\right] \\
&\le \mathbb{E}_{\nu_1\sim Q_1}\!\left[\ln \mathbb{E}_{\nu_2\sim Q_2}\!\left[e^{\left(\frac{m_2}{2}-1\right)(\Delta(h_{\nu_1},h_{\nu_2}))^2}\,\frac{Q_1(\nu_2)}{Q_2(\nu_2)}\right]\right] \\
&= \mathbb{E}_{\nu_1\sim Q_1}\!\left[\ln \mathbb{E}_{\nu_1'\sim Q_1}\!\left[e^{\left(\frac{m_2}{2}-1\right)\left(\Delta(h_{\nu_1},h_{\nu_1'})\right)^2}\right]\right],
\end{aligned}$$
where the inequality follows from Jensen's inequality, as the logarithm is a concave function. Therefore, again by applying Jensen's inequality,
$$X_{S_2} \le \ln \mathbb{E}_{\nu_1\sim Q_1}\mathbb{E}_{\nu_1'\sim Q_1}\!\left[e^{\left(\frac{m_2}{2}-1\right)\left(\Delta(h_{\nu_1},h_{\nu_1'})\right)^2}\right]. \tag{9}$$
Taking expectations over $S_2$, we have that
$$\mathbb{E}_{S_2}\!\left[e^{X_{S_2}}\right] \le \mathbb{E}_{S_2}\mathbb{E}_{\nu_1\sim Q_1}\mathbb{E}_{\nu_1'\sim Q_1}\!\left[e^{\left(\frac{m_2}{2}-1\right)\left(\Delta(h_{\nu_1},h_{\nu_1'})\right)^2}\right] = \mathbb{E}_{\nu_1\sim Q_1}\mathbb{E}_{\nu_1'\sim Q_1}\mathbb{E}_{S_2}\!\left[e^{\left(\frac{m_2}{2}-1\right)\left(\Delta(h_{\nu_1},h_{\nu_1'})\right)^2}\right],$$
where the change in the order of the expectations follows from the independence of the draw of the set $S_2$ from $\nu_1 \sim Q_1$ and $\nu_1' \sim Q_1$, i.e., $Q_1$ is $\mathcal{G}_1$-measurable and $\mathcal{G}_1 \perp\!\!\!\perp \sigma(S_2)$. Now let $Z_i \triangleq l(h_{\nu_1}(x_i), y_i) - l(h_{\nu_1'}(x_i), y_i)$ for all $1 \le i \le m_2$. Clearly $Z_i \in [-1, 1]$, and because of Equation (1) and of the definition of $\Delta$ in Equation (7),
$$\Delta(h_{\nu_1}, h_{\nu_1'}) = \frac{1}{m_2}\sum_{i=1}^{m_2}\left(Z_i - \mathbb{E}[Z_i]\right).$$
Hoeffding's bound (Theorem 2) therefore implies that for any $t \ge 0$,
$$\mathbb{P}_{S_2}\!\left(\left|\Delta(h_{\nu_1}, h_{\nu_1'})\right| \ge t\right) \le 2e^{-\frac{m_2}{2}t^2}. \tag{10}$$
Denoting by $p(\Delta)$ the probability density function of $|\Delta(h_{\nu_1}, h_{\nu_1'})|$, inequality (10) implies that for any $t \ge 0$,
$$\int_t^{\infty} p(\Delta)\, d\Delta \le 2e^{-\frac{m_2}{2}t^2}. \tag{11}$$
The density $p(\Delta)$ that maximizes $\int_0^{\infty} e^{\left(\frac{m_2}{2}-1\right)\Delta^2} p(\Delta)\, d\Delta$ (the term in the first expectation of the upper bound of Equation (9)) is the density achieving equality in (11), which is $p(\Delta) = 2m_2\Delta\, e^{-\frac{m_2}{2}\Delta^2}$. As a result,
$$\mathbb{E}_{S_2}\!\left[e^{\left(\frac{m_2}{2}-1\right)\Delta^2}\right] \le \int_0^{\infty} e^{\left(\frac{m_2}{2}-1\right)\Delta^2}\, 2m_2\Delta\, e^{-\frac{m_2}{2}\Delta^2}\, d\Delta = \int_0^{\infty} 2m_2\Delta\, e^{-\Delta^2}\, d\Delta = m_2,$$
and consequently, inequality (9) becomes $\mathbb{E}_{S_2}\!\left[e^{X_{S_2}}\right] \le m_2$.
Applying Markov's inequality to $X_{S_2}$, we have therefore that for any $0 < \delta \le 1$,
$$\mathbb{P}_{S_2}\!\left(X_{S_2} \ge \ln\frac{2m_2}{\delta}\right) = \mathbb{P}_{S_2}\!\left(e^{X_{S_2}} \ge \frac{2m_2}{\delta}\right) \le \frac{\delta}{2m_2}\,\mathbb{E}_{S_2}\!\left[e^{X_{S_2}}\right] \le \frac{\delta}{2}.$$
Replacing $X_{S_2}$ by its expression defined in Equation (8), the previous inequality shows that with probability at least $1-\delta/2$,
$$\left(\frac{m_2}{2}-1\right)\mathbb{E}_{\nu_1\sim Q_1}\mathbb{E}_{\nu_2\sim Q_2}\!\left[(\Delta(h_{\nu_1},h_{\nu_2}))^2\right] - \mathrm{KL}(Q_2\|Q_1) \le \ln\frac{2m_2}{\delta}.$$
Using Jensen's inequality and the convexity of $(\Delta(h_{\nu_1},h_{\nu_2}))^2$, and assuming that $m_2 > 2$, we therefore have that with probability at least $1-\delta/2$,
$$\left(\mathbb{E}_{\nu_1\sim Q_1}\mathbb{E}_{\nu_2\sim Q_2}\!\left[\Delta(h_{\nu_1},h_{\nu_2})\right]\right)^2 \le \mathbb{E}_{\nu_1\sim Q_1}\mathbb{E}_{\nu_2\sim Q_2}\!\left[(\Delta(h_{\nu_1},h_{\nu_2}))^2\right] \le \frac{\mathrm{KL}(Q_2\|Q_1) + \ln\frac{2m_2}{\delta}}{\frac{m_2}{2}-1}.$$
Replacing $\Delta(h_{\nu_1},h_{\nu_2})$ by its expression in Equation (7) in the above inequality yields that, with probability at least $1-\delta/2$ over the choice of the sample set $S_2$,
$$\mathbb{E}_{\nu_1\sim Q_1}\!\left[L_{S_2}(h_{\nu_1}) - L(h_{\nu_1})\right] \le \mathbb{E}_{\nu_2\sim Q_2}\!\left[L_{S_2}(h_{\nu_2}) - L(h_{\nu_2})\right] + \sqrt{\frac{2\,\mathrm{KL}(Q_2\|Q_1) + 2\ln\frac{2m_2}{\delta}}{m_2-2}}. \tag{12}$$
Similar computations with $S_1$ and $S_2$ switched, and considering that $m_1 > 2$, yield that with probability at least $1-\delta/2$ over the choice of the sample set $S_1$,
$$\mathbb{E}_{\nu_2\sim Q_2}\!\left[L_{S_1}(h_{\nu_2}) - L(h_{\nu_2})\right] \le \mathbb{E}_{\nu_1\sim Q_1}\!\left[L_{S_1}(h_{\nu_1}) - L(h_{\nu_1})\right] + \sqrt{\frac{2\,\mathrm{KL}(Q_1\|Q_2) + 2\ln\frac{2m_1}{\delta}}{m_1-2}}. \tag{13}$$
The events in Equations (12) and (13) jointly hold with probability at least $1-\delta$ over the choice of the sample sets $S_1$ and $S_2$ (using the union bound and De Morgan's law), and by adding the two inequalities we therefore have
$$\begin{aligned}
\mathbb{E}_{\nu_1\sim Q_1}\!\left[L_{S_2}(h_{\nu_1})\right] + \mathbb{E}_{\nu_2\sim Q_2}\!\left[L_{S_1}(h_{\nu_2})\right] \le\;& \mathbb{E}_{\nu_2\sim Q_2}\!\left[L_{S_2}(h_{\nu_2})\right] + \mathbb{E}_{\nu_1\sim Q_1}\!\left[L_{S_1}(h_{\nu_1})\right] \\
&+ \sqrt{\frac{2\,\mathrm{KL}(Q_2\|Q_1) + 2\ln\frac{2m_2}{\delta}}{m_2-2}} + \sqrt{\frac{2\,\mathrm{KL}(Q_1\|Q_2) + 2\ln\frac{2m_1}{\delta}}{m_1-2}},
\end{aligned}$$
which concludes the proof.

C COMMON EXPERIMENTAL DETAILS

The training objective in our experiments is to minimize the cross-entropy loss, and both the cross-entropy loss and the error percentage are reported. The training error is computed using Equation (1) over the training set. The empirical test error also follows Equation (1), but is computed over the test set. The generalization loss (respectively, error) is the difference between the test and training cross-entropy losses (resp., classification errors). The batch size in our experiments is 128 unless otherwise stated; the SGD learning rate is $\gamma = 0.01$ and no momentum is used (unless otherwise stated). All the experiments took at most a few hours on one Nvidia Titan X Maxwell GPU. All the values reported throughout the paper are averages over at least 5 runs. To present results throughout training, both epochs and iterations are used on the x-axes of the figures: an epoch is the time spent to pass through the entire dataset, and an iteration is the time spent to pass through one batch of the dataset. Thus, each epoch has $B$ iterations, where $B$ is the number of batches. The convolutional neural network configurations we use are AlexNet (Krizhevsky et al., 2012), VGG (Simonyan & Zisserman, 2014) and ResNet (He et al., 2016). In the experiments with varying width, we use a scaling factor to change the number of channels of the convolutional layers and the number of hidden units of the fully connected layers; the default configuration has scaling factor 1. In the experiments with randomly labeled training sets, we modify the dataset similarly to Chatterjee (2020): for a fraction of the training samples, equal to the amount of noise (0%, 25%, 50%, 75% or 100%), we choose the labels uniformly at random. For a classification dataset with $k$ classes, if the label noise is 25%, then on average a $75\% + 25\% \cdot 1/k$ fraction of the training points still has the correct label.
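The label corruption procedure described above can be sketched as follows (an illustrative NumPy sketch under our reading of the setup; `corrupt_labels` is a hypothetical helper, not the paper's code):

```python
import numpy as np

def corrupt_labels(labels, noise_fraction, num_classes, seed=0):
    """Replace a `noise_fraction` of the labels with uniformly random classes.

    Since a random draw can coincide with the true class, an expected
    (1 - p) + p/k fraction of the points keeps its correct label when the
    noise level is p and there are k classes.
    """
    rng = np.random.default_rng(seed)
    labels = np.array(labels, copy=True)
    n = len(labels)
    # Pick the subset of points to corrupt, without replacement.
    idx = rng.choice(n, size=int(noise_fraction * n), replace=False)
    labels[idx] = rng.integers(0, num_classes, size=len(idx))
    return labels
```

Because the replacement draws uniformly over all $k$ classes, a corrupted point keeps its true label with probability $1/k$, which yields the $75\% + 25\% \cdot 1/k$ figure quoted above.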

C.1 RE-SCALING THE LOSS

Let us track the evolution of gradient disparity (Equation (6)) during training. As training progresses, the training losses of all the batches start to decrease once they get selected for the parameter update. The value of gradient disparity might therefore decrease, not because the distance between the two gradient vectors is decreasing, but because the magnitude of each gradient vector is itself decreasing. To avoid this, a re-scaling or normalization of the loss is needed to compare gradient disparity at different stages of training. Note that this re-scaling or normalization does not affect the training algorithm, only the computation of gradient disparity. We consider both a re-scaling and a normalization. The re-scaling of the loss values is given by
$$\tilde L_{S_j} = \frac{1}{m_j}\sum_{i=1}^{m_j}\frac{l_i}{\mathrm{std}_i(l_i)},$$
where, with some abuse of notation, $l_i$ is the cross-entropy loss of data point $i$ in the batch $S_j$. The normalization of the loss values is given by
$$\bar L_{S_j} = \frac{1}{m_j}\sum_{i=1}^{m_j}\frac{l_i - \min_i(l_i)}{\max_i(l_i) - \min_i(l_i)}.$$
We experimentally compare these two ways of computing gradient disparity in Figure 5. Both the re-scaled and the normalized losses might become unbounded if, within a batch, the loss values are very close to each other; in our experiments, however, we do not observe gradient disparity becoming unbounded with either method. In the presence of outliers, re-scaling is more reliable than normalizing, because with normalization the non-outlier data might end up in a very small sub-interval of $[0, 1]$. This might explain the mismatch between the normalized gradient disparity and the generalization loss at the end of training in Figure 5. Therefore, in all experiments presented in the paper, we re-scale the loss values before computing the gradient disparity.

C.2 THE HYPER-PARAMETER s

In this section, we briefly study the choice of the size $s$ of the subset of batches used to compute the average gradient disparity
$$\bar D = \frac{1}{s(s-1)}\sum_{i=1}^{s}\sum_{\substack{j=1 \\ j\neq i}}^{s} D_{i,j}.$$
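Returning to the re-scaling and normalization defined in Section C.1 above, the two batch-level statistics can be sketched as follows (illustrative NumPy code, not the paper's implementation; `losses` stands for a hypothetical array of per-sample cross-entropy losses of one batch):

```python
import numpy as np

def rescaled_loss(losses):
    """Divide each per-sample loss by the within-batch standard deviation,
    then average (the re-scaling of Section C.1)."""
    return np.mean(losses / np.std(losses))

def normalized_loss(losses):
    """Map per-sample losses to [0, 1] via within-batch min-max scaling,
    then average (the normalization of Section C.1)."""
    lo, hi = np.min(losses), np.max(losses)
    return np.mean((losses - lo) / (hi - lo))
```

Dividing by the standard deviation makes the statistic invariant to a uniform shrinking of the batch losses, which is precisely why it allows gradient disparity to be compared across training stages.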
Figure 6 shows the gradient disparity averaged over $s$ batches. When $s = 2$, gradient disparity is the $\ell_2$ norm distance between the gradients of two randomly selected batches and has quite a high variance. Although higher values of $s$ yield lower variance, they are also computationally more expensive (refer to Appendix D for more details). We find $s = 5$ sufficient to track overfitting; in all the experiments presented in this paper, we use $s = 5$.
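To make the averaging concrete, here is an illustrative NumPy sketch (not the paper's code) that computes $\bar D$ from $s$ flattened per-batch gradient vectors, assumed given:

```python
import numpy as np

def average_gradient_disparity(grads):
    """Mean l2 distance over all ordered pairs of per-batch gradient vectors.

    grads: array of shape (s, p), one flattened gradient vector per batch.
    Since the distance is symmetric, this equals the average over unordered
    pairs, but we follow the s(s-1) ordered-pair normalization of the text.
    """
    s = len(grads)
    total = 0.0
    for i in range(s):
        for j in range(s):
            if i != j:
                total += np.linalg.norm(grads[i] - grads[j])
    return total / (s * (s - 1))
```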

C.3 THE SURROGATE LOSS FUNCTION

It has been shown that cross entropy is better suited to computer-vision classification tasks than the mean square error (Kline & Berardi, 2005; Hui & Belkin, 2020). Hence, we choose the cross-entropy criterion for all our experiments, to avoid possible pitfalls of the mean square error such as not tracking the confidence of the predictor. Soudry et al. (2018) argue that, when using cross entropy, the magnitude of the network parameters increases as training proceeds. This could potentially affect the value of gradient disparity. We therefore compute the magnitude of the network parameters over iterations in various settings, and observe that this increase is very small, both at the end of training and, more importantly, at the time when gradient disparity signals overfitting (denoted by GD epoch in Table 3). It is therefore unlikely that the increase in the magnitude of the network parameters affects the value of gradient disparity. Furthermore, we examine gradient disparity for models trained with the mean square error instead of the cross-entropy criterion. We observe a high correlation between gradient disparity and the test error/loss (Figure 7), which is consistent with the results obtained with the cross-entropy criterion. The applicability of gradient disparity as a generalization metric is therefore not limited to settings with the cross-entropy criterion.

D k-FOLD CROSS VALIDATION

k-fold cross validation is done by splitting the available dataset into k sets, training on k - 1 of them and validating on the remaining one. This is repeated k times, so that every set is used once as the validation set. We can adopt two different early stopping approaches: stop the optimization either (i) when the validation loss (respectively, gradient disparity) has increased m = 5 times since the beginning of training (marked by the gray vertical bar in Figures 8 and 9), or (ii) when the validation loss (resp., gradient disparity) has increased for 5 consecutive epochs (indicated by the magenta vertical bar in Figures 8 and 9). When the metric has low variation and a sharp increase, the two coincide (for instance, Figure 8 (b) (middle left)). In our experiments, we observe that gradient disparity appears to be less sensitive than k-fold cross validation to the choice between approaches (i) and (ii). Moreover, in Table 5, we study different values of m, which is usually referred to as the patience among practitioners. Early stopping should optimally occur at a minimum valley of the test loss/error curves throughout training, or when the generalization loss/error starts to increase. In the experiments where such a minimum of the test loss curve exists, we compare gradient disparity to the test loss; otherwise, we compare gradient disparity to the generalization loss. For k-fold cross validation, we compare the validation loss to the test loss, because the validation loss is expected to predict the test loss. Note that in the experiments with noisy labels, all the available data contains corrupted samples; hence both the validation loss and gradient disparity are computed over sets that may contain corrupted samples. Limited Data.
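The two stopping rules (i) and (ii) can be sketched with a single hypothetical helper (illustrative Python, not the paper's code; `metric_history` holds one value of the validation loss or of gradient disparity per epoch):

```python
def stop_epoch(metric_history, patience=5, consecutive=False):
    """First epoch index at which training should stop, or None.

    consecutive=False: stop after the metric has increased `patience` times
    in total since the beginning of training (approach (i)).
    consecutive=True: stop after `patience` increases in a row (approach (ii)).
    """
    increases = 0
    for t in range(1, len(metric_history)):
        if metric_history[t] > metric_history[t - 1]:
            increases += 1
        elif consecutive:
            increases = 0  # a decrease breaks the streak in approach (ii)
        if increases >= patience:
            return t
    return None
```

On a metric with low variation and a sharp increase, the two rules return the same epoch; on a fluctuating metric, rule (i) stops earlier or rule (ii) may never trigger.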
We present the results for the limited data scenario in Figure 8 and Table 4 for the MNIST, CIFAR-10 and CIFAR-100 datasets, where we simulate the limited data scenario by using a small subset of the training set. For the CIFAR-100 experiment (Figure 8 (a) and Table 4 (top row)), we observe (from the left figure) that the validation loss predicts the test loss quite well. We observe (from the middle left figure) that gradient disparity also predicts the test loss quite well. The main difference between the two settings, however, is that when using cross validation, 1/k of the data is set aside for validation and 1 - 1/k of the data is used for training, whereas when using gradient disparity, all the data is used for training. Hence, the test losses in the leftmost and middle left figures differ. The difference between the test accuracies (respectively, test losses) obtained in each setting is visible in the rightmost figure (resp., middle right figure). We observe an improvement of over 3% in the test accuracy when using gradient disparity as an early stopping criterion. This improvement is consistent for the MNIST and CIFAR-10 datasets (Figures 8 (b) and (c) and Table 4). We conclude that, in the absence of label noise, both k-fold cross validation and gradient disparity predict the optimal early stopping moment, but the final test loss/error is much lower for the model trained on all the available data (as when gradient disparity is used) than for the model trained on a (1 - 1/k) portion of the data (as in k-fold cross validation). To further test on a dataset that is itself limited, a medical application with limited labeled data is studied empirically later in this section (Appendix D.1); the same conclusion holds for this dataset. Noisy Labeled Data. The results for datasets with noisy labels are presented in Figure 9 and Table 6 for the MNIST, CIFAR-10 and CIFAR-100 datasets.
We observe (from Figure 9 (a) (left)) that, for the CIFAR-100 experiment, the validation loss no longer predicts the test loss. Nevertheless, although gradient disparity is computed on a training set that contains corrupted samples, it predicts the test loss quite well (Figure 9 (a) (middle left)). As a result, there is a 2% improvement in the final test accuracy (and a 9% improvement in top-5 accuracy) when using gradient disparity instead of a validation set as an early stopping criterion (Table 6 (top two rows)). This is also consistent across the other configurations and datasets (Figure 9 and Table 6). We conclude that, in the presence of label noise, k-fold cross validation no longer predicts the test loss and fails as an early stopping criterion, unlike gradient disparity. Computational Cost. Denote by $t_1$, $t_2$, $t_3$ and $t_4$ the times, in seconds, to compute one gradient vector, to compute the $\ell_2$ norm distance between two gradient vectors, to take one update step of the network parameters, and to evaluate one batch (compute its validation loss and error), respectively. Then, one epoch of $k$-fold cross validation takes
$$T^{\mathrm{CV}}_{\mathrm{epoch}} = k\left(\frac{k-1}{k}\,B\,(t_1+t_3) + \frac{B}{k}\,t_4\right)$$
seconds, where each of the $k$ folds trains on $\frac{k-1}{k}B$ batches and evaluates on the remaining $\frac{B}{k}$ batches.

D.1 MRNET DATASET

So far, we have presented the improvement of gradient disparity over cross validation for limited subsets of the MNIST, CIFAR-10 and CIFAR-100 datasets. In this sub-section, we present the results for a dataset that is by itself limited. We consider the MRNet dataset. To perform cross validation, we split the set used for training in Bien et al. (2018) into a first subset used for training in our experiments, and a second subset used as the validation set. Note that, in this dataset, because the number of slices changes from one case to another, it is not possible to stack the data into batches; hence the batch size is 1, which may explain the fluctuations of the validation loss and of gradient disparity in this setting. We use the SGD optimizer with learning rate $10^{-4}$ to train the model. Each task in this dataset is a binary classification with an unbalanced set of samples, hence we report the area under the receiver operating characteristic curve (AUC score). The results for three tasks, detecting ACL tears, meniscal tears and abnormality, are shown in Figure 10 and Table 7. We observe that both the validation loss, despite a small bias, and gradient disparity predict the generalization loss quite well. Yet, when using gradient disparity, the final test AUC score is higher (Figure 10 (right)).

Table 5: The test accuracies achieved by using k-fold cross validation (CV) and by using gradient disparity (GD) as early stopping criteria, for different patience values. For a given patience value m, the training is stopped after m increases in the value of the validation loss for k-fold CV (top rows) and of GD (bottom rows). Throughout the paper, we have chosen m = 5 as the default patience value for all methods, without optimizing it even for GD. However, in this table, we observe that even if we tune the patience value for k-fold CV and for GD separately (indicated in bold), GD still outperforms k-fold CV.
As mentioned earlier, for this dataset, both the validation loss and gradient disparity vary a lot. Hence, in Table 7, we present the results of early stopping both when the metric has increased 5 times since the beginning of training and (in parentheses) when the metric has increased for 5 consecutive epochs. We conclude that, with both approaches, using gradient disparity as an early stopping criterion results in more than 1% improvement in the test AUC score. Because the test set used in Bien et al. (2018) is not publicly available, it is not possible to compare our predictive results with theirs. Nevertheless, a baseline may be the results presented in https://github.com/ahmedbesbes/mrnet, which report a test AUC score of 88.5% for the task of detecting ACL tears, whereas we observe a higher test AUC score for this task. With further tuning, and by combining the predictions found on the two other MRI planes of each patient (axial and coronal), our final prediction results could be improved even further.

Table caption: The above results are obtained by stopping the optimization when the metric (either validation loss or gradient disparity) has increased five times since the beginning of training. The last row in each setting, which we call plug-in, refers to plugging in the epoch suggested by 10-fold CV and reporting the test loss and accuracy at that epoch for a network trained on the entire set. In all these settings, using GD still results in a higher test accuracy, and the advantage of GD over 10-fold CV is therefore a better characterization of overfitting.

Table 7: The loss and AUC score on the test set, comparing 5-fold cross validation to gradient disparity, both as early stopping criteria, for the MRNet dataset on three different tasks using the sagittal-plane MRI scans. Note that an unassisted general radiologist achieves on average 92%, 84% and 89% accuracy for detecting ACL tears, meniscal tears and abnormality, respectively (Bien et al., 2018).
The corresponding curves during training are presented in Figure 10.

To investigate the relation between the average gradient disparity $\bar D$ and generalization, we compare two sets of experiments. The first exhibits clear overfitting, whereas in the second the model generalizes quite well. For gradient disparity to be a useful generalization metric, not only should it take very different values in these two sets of experiments, but it should also be well aligned with the generalization error. This is indeed what the experiments show. In the first set of experiments, the network is a ResNet-18 (He et al., 2016) trained on the CIFAR-10 dataset (Figure 11 (top)). Around iteration 500 (indicated by a thick gray vertical bar in the figures), the training and test losses (and errors) start to diverge, and the test loss reaches its minimum. This should be the early stopping point, as the model is starting to overfit. Interestingly, around the same time (indicated in Figures 11 (b) and (c)), we observe a sharp increase in $\bar D$. The second set of experiments is an AlexNet (Krizhevsky et al., 2012) trained on the MNIST dataset (Figure 11 (bottom)). This model generalizes quite well on this dataset. We observe that, throughout training, the test curves are even below the training curves, because the dropout regularization technique (Srivastava et al., 2014) is applied during training but not during testing. The generalization loss/error is almost zero until around iteration 1100 (indicated in the figure by the gray vertical bar), when overfitting starts and the generalization error becomes non-zero.
At approximately the same time, the average gradient disparity (Figures 11 (e) and (f)) starts to increase slightly, but much more slowly than in Figures 11 (b) and (c). In both experiments, we observe that as overfitting starts, the gradient vectors start to diverge, resulting in a larger gradient disparity. These observations suggest that gradient disparity is an effective early stopping criterion, as it is well aligned with the generalization error.

E.1 MNIST EXPERIMENTS

Figure 12 shows the results for a 4-layer fully connected neural network trained on the entire MNIST training set. Figures 12 (e) and (f) show the generalization losses. We observe that, at the early stages of training, the generalization losses do not distinguish between different label noise levels, whereas gradient disparity (Figures 12 (g) and (h)) does so from the beginning. At the middle stages of training, we observe that, surprisingly in this setting, the network with 0% label noise has a higher generalization loss than the networks trained with 25%, 50% and 75% label noise, and this is also captured by the average gradient disparity. The final gradient disparity values of the networks trained with higher label noise levels are also larger. For the network trained with 0% label noise, we present more detailed results in Figure 13 and observe again that gradient disparity is well aligned with the generalization loss/error. In this experiment, the early stopping time suggested by gradient disparity is epoch 9, which is exactly when the training and test losses/errors start to diverge, signaling the start of overfitting. We observe in Figure 14 that both the normalized gradient disparity and the test error decrease with the network width (the scale is a hyper-parameter used to change both the number of channels and the number of hidden units in each configuration).

E.2 CIFAR-10 EXPERIMENTS

Figure 15 shows the results for a 4-layer fully connected neural network trained on the entire CIFAR-10 training set. We observe that gradient disparity reflects the test error quite well at the early stages of training. At the later stages of training, the ranking of the gradient disparity values for the different label noise levels matches the ranking of the generalization losses and errors. In all experiments, the average gradient disparity is indeed very informative about the test error.
Figure 16 shows the effect of adding data augmentation on both the test error and gradient disparity. Figure 17 shows the test error and gradient disparity for networks trained with different training set sizes. In Figure 18, we observe that, as discussed in Section 6, gradient disparity, like the test error, increases with the batch size for moderately large batch sizes; as expected, when the batch size is very large (512 for the CIFAR-10 experiment and 256 for the CIFAR-100 experiment), gradient disparity starts to decrease, because the gradient vectors are averaged over a large batch. Note that even with such large batch sizes, gradient disparity correctly detects the early stopping time, but its value can no longer be compared to the values found with other batch sizes.

E.3 CIFAR-100 EXPERIMENTS

Figure 19 shows the results for a ResNet-18 trained on the CIFAR-100 training set. Clearly, the model is not able to learn the complexity of the CIFAR-100 dataset: it has 99% error for the network with 0% label noise, as if it had not learned anything about the dataset and were just making a random guess (because there are 100 classes, random guessing gives 99% error on average). We observe from Figure 19 (f) that, as training progresses, the network overfits more and the generalization error increases. Although the test error is high (above 90%), very surprisingly in this example, the networks with higher label noise levels have lower test losses and errors (Figures 19 (b) and (d)). Quite interestingly, gradient disparity (Figure 19 (g)) captures this surprising trend as well.

F NUMBER OF BATCHES WITH LOW GRADIENT DISPARITY

In this paper, the upper bound in Theorem 1 is tracked by computing the average gradient disparity $\bar D$ over a subset of batches. In this section, to gain a finer-grained signal of overfitting during training, we track the distribution of this bound by studying the distribution of the gradient disparity, i.e., $\mathbb{P}(D_{i,j} < \zeta)$ for some threshold $\zeta$. We count the number of pairs of batches in the training set, denoted by $T_\zeta$, whose gradient disparity is below the given threshold $\zeta$. For these pairs of batches, the upper bound in Theorem 1, with probability at least $1-\delta$ and for two batches of the same size $m$, is below $2\sqrt{\left(\gamma^2\zeta^2/\sigma^2 + 2\ln(2m/\delta)\right)/(m-2)}$. The lower $T_\zeta$ is, the higher the average upper bound, hence the more likely overfitting becomes. As before, for the sake of computational tractability, instead of going through all possible pairs, we only compare $s(s-1)$ ordered pairs of batches ($s = 5$ in our experiments), so $0 \le T_\zeta \le s(s-1)$. We empirically estimate $\mathbb{P}(D_{i,j} < \zeta)$ over $s$ batches by $T_\zeta/(s(s-1))$. In our experiments, we compute $T_\zeta$ for $\zeta \in \{10, 20\}$. We show that, as an early stopping criterion, $T_\zeta$ is sometimes (slightly) more accurate than $\bar D$.
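The counting of low-disparity pairs can be sketched as follows (illustrative NumPy code, not the paper's implementation, over the same $s$ flattened per-batch gradient vectors used for $\bar D$):

```python
import numpy as np

def count_low_disparity_pairs(grads, zeta):
    """T_zeta: number of ordered pairs (i, j), i != j, whose gradient
    disparity ||g_i - g_j||_2 is below the threshold zeta."""
    s = len(grads)
    return sum(
        1
        for i in range(s)
        for j in range(s)
        if i != j and np.linalg.norm(grads[i] - grads[j]) < zeta
    )
```

The empirical estimate of $\mathbb{P}(D_{i,j} < \zeta)$ is then `count_low_disparity_pairs(grads, zeta) / (s * (s - 1))`.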

G BEYOND SGD

In the following, we discuss how the analysis of Section 4 can be extended to other optimizers (refer to Ruder (2016) for an overview of popular optimizers).

G.1 SGD WITH MOMENTUM

The momentum method (Qian, 1999) is a variation of SGD that adds a fraction of the update vector of the previous step to the current update vector to accelerate SGD:
$$\upsilon^{(t+1)} = \eta\,\upsilon^{(t)} + \gamma g^{(t)}, \qquad w^{(t+1)} = w^{(t)} - \upsilon^{(t+1)},$$
where $g^{(t)}$ is either $g_1$ or $g_2$, depending on whether batch $S_1$ or $S_2$ is selected for the current update step. As $\upsilon^{(t)}$ remains the same for either choice, the KL-divergence between $Q_1$ and $Q_2$ for SGD with momentum is the same as in Equation (5).
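A minimal sketch of one momentum update (illustrative NumPy code; `momentum_step` is a hypothetical helper) makes the argument concrete: the velocity carried over from iteration $t$ is the same whichever batch is selected, so the two candidate parameter vectors differ by $\gamma(g_1 - g_2)$, exactly as for plain SGD:

```python
import numpy as np

def momentum_step(w, v, g, lr=0.01, eta=0.9):
    """One SGD-with-momentum update: v <- eta*v + lr*g, w <- w - v."""
    v_new = eta * v + lr * g
    return w - v_new, v_new

# For the same carried-over velocity v, the two candidate updates satisfy
# w1 - w2 = lr * (g2 - g1): the momentum term cancels out.
```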

G.2 ADAGRAD

Adagrad (Duchi et al., 2011) performs update steps with a different learning rate for each individual parameter. Denoting by $d$ each coordinate of the parameter vector $w$, one update step of the Adagrad algorithm is
$$w^{(t+1)}_d = w^{(t)}_d - \frac{\gamma}{\sqrt{G^{(t)}_{dd}} + \epsilon}\, g^{(t)}_d, \tag{14}$$
where the vector $g^{(t)}$ is either $g_1$ or $g_2$ depending on the batch selected for the current update step, $G^{(t)}_{dd}$ is the accumulated squared norm of the gradients of coordinate $d$ up until iteration $t$, and $\epsilon$ is a small positive constant that avoids a possible division by 0. Hence, for Adagrad, Equation (5) is replaced by
$$\mathrm{KL}(Q_1\|Q_2) = \frac{\gamma^2}{2\sigma^2}\left\|\frac{1}{\sqrt{G^{(t)}}+\epsilon}\odot(g_1-g_2)\right\|_2^2 \le \frac{\gamma^2}{2\sigma^2}\left\|\frac{1}{\sqrt{G^{(t)}}+\epsilon}\right\|_2^2\,\|g_1-g_2\|_2^2, \tag{15}$$
where $\odot$ denotes the element-wise product of two vectors, and the square root and division are also taken element-wise. To compare the upper bound in Theorem 1 from one iteration to the next (as needed to determine the early stopping moment in Section 5), gradient disparity is not the only factor in Equation (15) that evolves over time: $G^{(t)}$ is an increasing function of $t$. However, after a few iterations, when the gradients become small, this value becomes approximately constant (the initial gradient values dominate the sum in $G^{(t)}$). The right-hand side of Equation (15) then varies mostly as a function of gradient disparity, and gradient disparity therefore approximately tracks the generalization penalty upper bound.
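One Adagrad-style update can be sketched as follows (illustrative NumPy code; following the text, the accumulator up to iteration $t$ is used in the denominator, although some implementations include the current gradient before stepping):

```python
import numpy as np

def adagrad_step(w, G, g, lr=0.01, eps=1e-8):
    """One Adagrad-style update with per-coordinate accumulator G of
    squared gradients: each coordinate gets its own effective learning rate
    lr / (sqrt(G_dd) + eps)."""
    w_new = w - lr * g / (np.sqrt(G) + eps)
    G_new = G + g ** 2  # accumulate after the step, per the text's convention
    return w_new, G_new
```

Because the denominator only grows, the per-coordinate effective learning rate shrinks over time, which is why the first factor in Equation (15) eventually stabilizes.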

G.3 ADADELTA AND RMSPROP

Adadelta (Zeiler, 2012) is an extension of Adagrad that computes a decaying average of the past squared gradients instead of their accumulated squared norm. $G^{(t)}_{dd}$ in Equation (14) is then replaced by $\upsilon^{(t+1)}_d$, where
$$\upsilon^{(t+1)}_d = \eta\,\upsilon^{(t)}_d + (1-\eta)\,(g^{(t)}_d)^2.$$
As training proceeds, the gradient magnitude decreases; moreover, $\eta$ is usually close to 1, so the dominant term in $\upsilon^{(t+1)}_d$ is $\eta\,\upsilon^{(t)}_d$. If we therefore approximate
$$\upsilon^{(t+1)}_1 = \eta\,\upsilon^{(t)} + (1-\eta)(g_1)^2 \approx \eta\,\upsilon^{(t)} + (1-\eta)(g_2)^2 = \upsilon^{(t+1)}_2$$
(squares taken element-wise), then for Adadelta we have
$$\mathrm{KL}(Q_1\|Q_2) \le \frac{\gamma^2}{2\sigma^2}\left\|\frac{1}{\sqrt{\upsilon^{(t+1)}}+\epsilon}\right\|_2^2\,\|g_1-g_2\|_2^2, \tag{16}$$
where again the square root and division are taken element-wise. The denominator in Equation (16) is smaller than the denominator in Equation (15). In both equations, the first non-constant factor in the upper bound of $\mathrm{KL}(Q_1\|Q_2)$ decreases as a function of $t$, and therefore an increase in the value of $\mathrm{KL}(Q_1\|Q_2)$ must be accounted for by an increase in the value of gradient disparity. Moreover, as training proceeds, gradient magnitudes decrease and the first factor in the upper bounds of Equations (15) and (16) becomes closer to a constant. Therefore, an upper bound on the generalization penalties can be tracked by gradient disparity. The update rule of RMSProp is very similar to that of Adadelta, and the same conclusions hold.

G.4 ADAM

Adam (Kingma & Ba, 2014) combines Adadelta and momentum by storing exponentially decaying averages of both the past gradients and the past squared gradients:
$$m^{(t+1)} = \beta_1 m^{(t)} + (1-\beta_1)\,g^{(t)}, \qquad \upsilon^{(t+1)} = \beta_2 \upsilon^{(t)} + (1-\beta_2)\,(g^{(t)})^2,$$
$$\hat m^{(t+1)} = \frac{m^{(t+1)}}{1-(\beta_1)^t}, \qquad \hat\upsilon^{(t+1)} = \frac{\upsilon^{(t+1)}}{1-(\beta_2)^t}, \qquad w^{(t+1)} = w^{(t)} - \frac{\gamma}{\sqrt{\hat\upsilon^{(t+1)}}+\epsilon}\,\hat m^{(t+1)}.$$
All the operations in the above equations are done element-wise. As $\beta_2$ is usually very close to 1 (around 0.999), and as the squared gradients of the current update step are much smaller than the accumulated values of the previous steps, we approximate $\upsilon^{(t+1)}_1 = \beta_2\upsilon^{(t)} + (1-\beta_2)(g_1)^2 \approx \beta_2\upsilon^{(t)} + (1-\beta_2)(g_2)^2 = \upsilon^{(t+1)}_2$ (squares taken element-wise). Hence, Equation (5) becomes
$$\mathrm{KL}(Q_1\|Q_2) \le \frac{\gamma^2}{2\sigma^2}\left(\frac{1-\beta_1}{1-(\beta_1)^t}\right)^2\left\|\frac{1}{\sqrt{\hat\upsilon^{(t+1)}}+\epsilon}\right\|_2^2\,\|g_1-g_2\|_2^2. \tag{17}$$
The first non-constant factor in the equation above decreases with $t$ (because $\beta_1 < 1$). However, it is not clear how the second factor varies as training proceeds. Therefore, unlike for the previous optimizers, it is more hazardous to claim that the factors other than gradient disparity in Equation (17) become constant as training proceeds. Hence, tracking only gradient disparity may be insufficient for the Adam optimizer. This is empirically investigated in the next sub-section.
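One Adam update can be sketched as follows (illustrative NumPy code with the usual default hyper-parameters; `adam_step` is a hypothetical helper, not the paper's code):

```python
import numpy as np

def adam_step(w, m, v, g, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first and second moment estimates.

    t is the 1-based iteration counter used in the bias corrections.
    """
    m_new = beta1 * m + (1 - beta1) * g          # decaying average of gradients
    v_new = beta2 * v + (1 - beta2) * g ** 2     # decaying average of squared gradients
    m_hat = m_new / (1 - beta1 ** t)             # bias correction
    v_hat = v_new / (1 - beta2 ** t)
    w_new = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w_new, m_new, v_new
```

At $t = 1$ with zero-initialized moments, the bias corrections make $\hat m = g$ and $\hat\upsilon = g^2$, so each coordinate moves by roughly $\gamma$ in the direction of $-\mathrm{sign}(g)$, illustrating how the moment factors, and not only the gradient difference, shape the update.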

G.5 EXPERIMENTS

Figure 22 shows the gradient disparity and test loss curves during training for adaptive optimizers. The epoch of the fifth increase in the value of the test loss and of gradient disparity is given in the caption of each experiment. We observe that the two suggested stopping epochs (the one suggested by gradient disparity (GD) and the one suggested by the test loss) are extremely close to each other, except in Figure 22 (c), where the fifth increase of gradient disparity occurs much later than the fifth increase of the test loss. However, in this experiment, stopping according to GD rather than the test loss improves the test accuracy by 23%, because the test loss fluctuates much more than gradient disparity. As an early stopping criterion, the increase in gradient disparity coincides with the increase in the test loss in all our experiments presented in Figure 22. In Figure 22 (h), for the Adam optimizer, we observe that after around 20 epochs the value of gradient disparity starts to decrease, whereas the test loss continues to increase. This mismatch might be due to the effect of the other factors that appear in Equation (17). Nevertheless, even in this experiment, the increases in the test loss and in gradient disparity coincide, and hence gradient disparity correctly detects the early stopping time. These experiments are a first indication that gradient disparity can be used as an early stopping criterion for optimizers other than SGD.

Table 8: The test errors (TE) and test losses (TL) achieved by using various metrics as early stopping criteria. In the leftmost column, the minimum values of TE and TL over all iterations are reported (which are not accessible during training). The results of 5-fold cross validation are reported in the rightmost column and serve as a baseline.
For each experiment, we underline the metrics that result in a better performance than 5-fold cross-validation. We observe that gradient disparity (GD) and the variance of gradients (Var), unlike the other metrics, consistently outperform k-fold cross-validation. In the rightmost column (No ES), we report the results without early stopping (ES) (training is continued until the training loss is below 0.01).

In Table 8, we compare gradient disparity (GD) to a number of metrics that were proposed either directly as early stopping criteria or as generalization metrics. For the metrics that were not originally proposed as early stopping criteria, we choose a stopping method similar to the one we use for gradient disparity. We consider two datasets (MNIST and CIFAR-10) and two levels of label noise (0% and 50%). The metrics computed in each setting are:

1. Gradient disparity (GD) (ours): we report the error and loss values at the time when the value of GD increases for the 5th time (from the beginning of training).
2. The EB-criterion (Mahsereci et al., 2017): we report the error and loss values when EB becomes positive.
3. Gradient signal-to-noise ratio (GSNR) (Liu et al., 2020): we report the error and loss values when the value of GSNR decreases for the 5th time.
4. Gradient inner product, g_i • g_j (Fort et al., 2019): we report the error and loss values when its value decreases for the 5th time.
5. Sign of the gradient inner product, sign(g_i • g_j) (Fort et al., 2019): we report the error and loss values when its value decreases for the 5th time.
6. Cosine similarity between gradient vectors, cos(g_i • g_j) (Fort et al., 2019): we report the error and loss values when its value decreases for the 5th time.
7. Variance of gradients (Var) (Negrea et al., 2019): we report the error and loss values when its value increases for the 5th time. The variance is computed over the same number of batches used to compute gradient disparity, in order to compare the metrics given the same computational budget.
8. Average gradient alignment within the class, Ω_c (Mehta et al., 2020): we report the error and loss values when its value decreases for the 5th time.

In the leftmost column of Table 8, we report the minimum values of the test error and the test loss over all the iterations, which do not necessarily coincide. For instance, in setting (c), the test error is minimized at iteration 196, whereas the test loss is minimized at iteration 126. In the rightmost column of Table 8, we report the values of the test error and the test loss when using 5-fold cross-validation, which serves as a baseline. It is interesting to observe that gradient disparity and the variance of gradients produce the exact same results when used as early stopping criteria (Table 8). Moreover, these two are the only metrics that consistently outperform k-fold cross-validation. However, in Section H.1, we observe that the correlation between gradient disparity and the test loss is in general larger than the correlation between the variance of gradients and the test loss. The EB-criterion, sign(g_i • g_j), and cos(g_i • g_j) perform quite well as early stopping criteria, although not as well as GD and Var; moreover, in Section H.2, we observe that these metrics are not informative of the label noise level.
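The stopping rule used for GD (and, with the comparison reversed, for the decreasing metrics) is simple to state in code. The sketch below is our own minimal illustration with hypothetical names, not the paper's implementation; it counts the epochs at which the monitored metric increases and signals a stop at the k-th increase (k = 5 above):

```python
import numpy as np

def gradient_disparity(g1, g2):
    """D = ||g1 - g2||_2 between the gradient vectors of two batches."""
    return float(np.linalg.norm(g1 - g2))

class NthIncreaseStopper:
    """Signal stopping at the k-th epoch where the monitored metric
    increases with respect to the previous epoch."""
    def __init__(self, k=5):
        self.k = k
        self.increases = 0
        self.prev = None

    def update(self, value):
        if self.prev is not None and value > self.prev:
            self.increases += 1
        self.prev = value
        return self.increases >= self.k   # True -> stop training
```

For metrics that decrease as overfitting sets in (e.g., GSNR or the cosine similarity), the same counter is applied to the negated values.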

H.1 GRADIENT DISPARITY VERSUS VARIANCE OF GRADIENTS

It has been shown experimentally that generalization is related to gradient alignment (Fort et al., 2019), and theoretically that it is related to the variance of gradients (Negrea et al., 2019). Gradient disparity can be viewed as bringing the two together. Indeed, one can check that
\[
\mathbb{E}\big[\mathcal{D}_{i,j}^2\big] = 2\sigma_g^2 + 2\mu_g^T \mu_g - 2\,\mathbb{E}\big[g_i^T g_j\big],
\]
given that \(\mu_g = \mathbb{E}[g_i] = \mathbb{E}[g_j]\) and \(\sigma_g^2 = \mathrm{tr}\,(\mathrm{Cov}[g_i]) = \mathrm{tr}\,(\mathrm{Cov}[g_j])\). This shows that the gradient variance \(\sigma_g^2\) and the gradient alignment \(g_i^T g_j\) both appear as components of gradient disparity. We conjecture that the dominant term in gradient disparity is the variance of gradients; hence, as early stopping criteria, these two metrics almost always signal overfitting simultaneously. This is indeed what our experiments show: the variance of gradients is also a very promising early stopping criterion (Table 8). However, because of the additional term in gradient disparity (the gradient inner product), gradient disparity emphasizes the alignment or misalignment of the gradient vectors. This could be the reason why gradient disparity in general outperforms the variance of gradients in tracking the value of the generalization loss: the positive correlation between gradient disparity and the test loss is often larger than the positive correlation between the variance of gradients and the test loss (Table 9).
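The decomposition above is straightforward to verify numerically. In the sketch below (our own check, not from the paper), the identity holds exactly for empirical moments computed from paired gradient samples, since it is an algebraic consequence of expanding ‖g_i − g_j‖²:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5
# Paired "gradient" samples g_i, g_j drawn from the same distribution.
gi = rng.normal(size=(n, d)) + 0.3
gj = rng.normal(size=(n, d)) + 0.3

# E[D^2] = E||g_i - g_j||^2, estimated over the pairs.
lhs = np.mean(np.sum((gi - gj) ** 2, axis=1))

# sigma_g^2 = tr Cov[g] and mu_g = E[g], estimated per side; ddof=0 makes
# E||g||^2 = tr Cov[g] + ||mu_g||^2 hold exactly for empirical moments.
var_i = np.trace(np.cov(gi, rowvar=False, ddof=0))
var_j = np.trace(np.cov(gj, rowvar=False, ddof=0))
mu_i, mu_j = gi.mean(axis=0), gj.mean(axis=0)
inner = np.mean(np.sum(gi * gj, axis=1))      # E[g_i^T g_j]

rhs = (var_i + var_j) + (mu_i @ mu_i + mu_j @ mu_j) - 2 * inner
print(lhs, rhs)  # the two agree up to floating-point error
```

When g_i and g_j share the same distribution, var_i + var_j and the two mean terms collapse to 2σ_g² + 2μ_g^T μ_g, matching the equation in the text.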

H.2 CAPTURING LABEL NOISE LEVEL

In this section, we highlight three metrics that, even though they perform relatively well as early stopping criteria, fail to account for the level of label noise, contrary to gradient disparity.

• The sign of the gradient inner product, sign(g_i • g_j), should be inversely related to the test loss: it should decrease when overfitting increases. However, we observe that the value of sign(g_i • g_j) is larger for the setting with the higher label noise level; it incorrectly identifies the setting with the higher label noise level as the setting with the better generalization performance (see Figure 23).
• The EB-criterion should be larger for settings with more overfitting. In most stages of training, the EB-criterion does not distinguish between settings with different label noise levels, contrary to gradient disparity (see Figure 23). At the end of training, the EB-criterion even mistakenly signals the setting with the higher label noise level as the setting with the better generalization performance.
• The cosine similarity between gradient vectors, cos(g_i • g_j), should decrease when overfitting increases, and therefore with the level of label noise in the training data. But cos(g_i • g_j) appears not to be sensitive to the label noise level, and in some cases (Figure 24 (a)) it even increases with the noise level. Gradient disparity is much more informative of the label noise level than cosine similarity, and the correlation between gradient disparity and the test error is larger than the correlation between cosine similarity and the test accuracy (see Figure 24).
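For concreteness, two of the competing metrics and gradient disparity can all be computed from the same pair of batch-gradient vectors (the EB-criterion additionally requires variance estimates over individual sample gradients and is omitted here). The sketch below is our own illustration; as in the text, cos(g_i • g_j) denotes the cosine similarity between the two gradients:

```python
import numpy as np

def pairwise_metrics(g1, g2):
    """Metrics of Section H.2 for two batch-gradient vectors."""
    inner = float(g1 @ g2)
    return {
        "D": float(np.linalg.norm(g1 - g2)),        # gradient disparity
        "sign": float(np.sign(inner)),              # sign(g_i . g_j)
        "cos": inner / (np.linalg.norm(g1) * np.linalg.norm(g2)),
    }

# Aligned gradients: low disparity, positive sign, cosine close to 1.
print(pairwise_metrics(np.array([1.0, 1.0]), np.array([1.0, 0.9])))
# Misaligned gradients: high disparity, negative sign, negative cosine.
print(pairwise_metrics(np.array([1.0, 1.0]), np.array([-1.0, -0.9])))
```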



Footnotes:

• Batches S1 and S2 are drawn without replacement, and the random selection of the indices of batches S1 and S2 is independent of the dataset S. Hence, similarly to Negrea et al. (2019) and Dziugaite et al. (2020), we have σ(S1) ⊥⊥ σ(S2).
• Note that in Figure 5, both the gradient disparity and the generalization loss increase from the very first epoch. If we used gradient disparity as an early stopping criterion, optimization would stop at epoch 5 and we would obtain a 0.36 drop in the test loss value, compared to the loss reached when the model achieves 0 training loss.
• In the setting of Figure 6, if gradient disparity is used as an early stopping criterion, optimization would stop at epoch 9 and we would obtain a 0.28 drop in the test loss value, compared to the loss reached when the model achieves 0 training loss.
• http://yann.lecun.com/exdb/mnist/
• Note that the normalization with respect to the number of parameters is different from the normalization mentioned in Section C.1, which is with respect to the loss values. The value of gradient disparity reported everywhere is the re-scaled gradient disparity; furthermore, whenever two different architectures are compared, the normalization with respect to dimensionality is also applied.
• https://www.cs.toronto.edu/~kriz/cifar.html
• https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf



Figure 1: An illustration of the penalty term R_2, where the y-axis is the loss and the x-axis indicates the parameters of the model. L_{S_1} and L_{S_2} are the average losses over batches S_1 and S_2, respectively. w^{(t)} is the parameter at iteration t, and w_i^{(t+1)} is the parameter at iteration t+1 if batch S_i was selected for the update step at iteration t, with i ∈ {1, 2}.

Figure 2: The error percentage and D during training with different amounts of randomness in the training labels for an AlexNet trained on a subset of 12.8 k points of the MNIST training dataset. Pearson's correlation coefficient between gradient disparity and test error (TE)/test loss (TL) over all the iterations and over all levels of randomness are ρ D,TE = 0.861 and ρ D,TL = 0.802. The generalization error (gap) is the difference between the train and test errors.

Figure 2 shows the test error for networks trained with different amounts of label noise. Interestingly, we observe that in this setting the test error of the network trained with 75% label noise remains relatively small, indicating the good resistance of the model against the memorization of corrupted samples. As suggested both by the test error (Figure 2 (a)) and by the average gradient disparity (Figure 2 (c)), there is no proper early stopping time for these experiments. The generalization error (Figure 2 (b)) remains close to zero regardless of the level of label noise, and hence fails to account for label noise. In contrast, the average gradient disparity is very sensitive to the label noise level at all stages of training, as shown in Figure 2 (c), as desired for a metric measuring generalization.

Figure 3: The test error (TE) and the average gradient disparity (D) for networks that are trained (until reaching a training loss value of 0.01) on training sets of different sizes. We observe a very strong positive correlation: ρ_{D,TE} = 0.984.

Training Set Size. The test error decreases with the size of the training set (Figure 3 (left)), and a reliable generalization metric should therefore reflect this property. Many previous metrics fail to do so, as shown by Neyshabur et al. (2017a) and Nagarajan & Kolter (2019). In contrast, the average gradient disparity clearly decreases with the size of the training set, as shown in Figure 3 (right) (see also Figure 17 in Appendix E).

Figure 4: The test error and the average gradient disparity for a ResNet-34 and a VGG-19 network trained on the CIFAR-10 dataset with different batch sizes. The correlations between D and the test error (TE) for ResNet-34, for VGG-19, and for both graphs combined are ρ_{D,TE} = 0.985, ρ_{D,TE} = 0.926, and ρ_{D,TE} = 0.893, respectively.

Figure 5: Normalizing versus re-scaling loss before computing average gradient disparity D for a VGG-11 trained on 12.8 k points of the CIFAR-10 dataset.

Figure 6: Average gradient disparity for different averaging parameter s for a ResNet-18 that has been trained on 12.8k points of the CIFAR-10 dataset.

The ratio of the magnitude of the network parameter vector at epoch t to the magnitude of the network parameter vector at epoch 0, for t ∈ {0, GD, 200}, where GD stands for the epoch when gradient disparity signals to stop the training.

Figure 7: Test error (TE), test loss (TL), and gradient disparity (D) for VGG-16 trained with different training set sizes to minimize the mean square error criterion on the CIFAR-10 dataset. The Pearson correlation coefficient between TE and D and between TL and D are ρ D,TE = 0.976 and ρ D,TL = 0.943, respectively.

Figure 8: Comparing 5-fold cross-validation (CV) with gradient disparity (GD) as an early stopping criterion when the available dataset is limited. (left) Validation loss versus test loss in 5-fold cross-validation. (middle left) Gradient disparity versus test and generalization losses. (middle right and right) Performance on the unseen (test) data for GD versus 5-fold CV. (a) The parameters are initialized with the Xavier technique with uniform distribution. (b, c) The parameters are initialized with the He technique with normal distribution. (c) The batch size is 32. The gray and magenta vertical bars indicate the epoch at which the metric (the validation loss or gradient disparity) has increased for 5 epochs from the beginning of training and for 5 consecutive epochs, respectively. In the middle-left plot of (b), these two bars coincide.

Figure 9: Comparing 10-fold cross-validation with gradient disparity as early stopping criteria when the available dataset is noisy. (left) Validation loss versus test loss in 10-fold cross-validation. (middle left) Gradient disparity versus test and generalization losses. (middle right and right) Performance on the unseen (test) data for GD versus 10-fold CV. (a) The parameters are initialized with the Xavier technique with uniform distribution. (b, c, and d) The parameters are initialized with the He technique with normal distribution.

(Figure panels: generalization loss for epoch < 80; D; D for epoch < 80.)

Figure 12: The cross entropy loss, error percentage, and average gradient disparity during training with different amounts of randomness in the training labels for a 4-layer fully connected neural network with 500 hidden units trained on the entire MNIST dataset. The parameter initialization is the He initialization with normal distribution.

Figure 14: Test error and normalized gradient disparity for networks trained on the CIFAR-10 dataset with different numbers of channels and hidden units, for convolutional neural networks (CNN) and fully connected neural networks (FCNN). The correlations between the normalized gradient disparity and the test loss (ρ_{D,TL}) and between the normalized gradient disparity and the test error (ρ_{D,TE}) are reported in the captions.

Figure 15: The cross entropy loss, error percentage, and average gradient disparity during training with different amounts of randomness in the training labels for a 4-layer fully connected neural network with 500 hidden units trained on the entire CIFAR-10 dataset. The parameter initialization is the Xavier initialization with uniform distribution. The training is stopped when the training loss gets below 0.01.

Figure 17: Test error and gradient disparity for networks that are trained with different training set sizes. The training is stopped when the training loss is below 0.01.

Figure 18: Test error and gradient disparity for networks that are trained with different batch sizes trained on 12.8 k points of the CIFAR-10 and CIFAR-100 datasets. The training is stopped when the training loss is below 0.01.

Figure 19: The cross entropy loss, error percentage, and average gradient disparity during training with different amounts of randomness in the training labels for a ResNet-18 trained on the CIFAR-100 training set. The parameter initialization is the Xavier initialization.

For instance, in Figure 20 (d), we highlight in gray the minima valley of the test error for the network trained with 25% noise, and we observe that it aligns better with the drop in T_ζ for ζ = 20 (light pink curve in Figure 20 (g)) than with the time at which D increases (Figure 20 (f)). Also, the slow increase in the generalization loss for the AlexNet (green curve in Figure 21 (a)) is captured by the drop in T_ζ. (Panels of Figure 20: (a) training loss, (b) test loss, (c) training error, (d) test error, (e) generalization error, (f) average gradient disparity D, (g) T_ζ for ζ = 20.)

Figure 20: The cross entropy loss, error percentage, average gradient disparity and T ζ for ζ = 20 during training with different amounts of randomness in the training labels for a ResNet-18 trained on a subset of 12.8 k points of the CIFAR-10 training set. We can observe that, in this setting, average gradient disparity distinguishes different label noise levels from the beginning of training, unlike generalization error.


Figure 21: The cross entropy loss, average gradient disparity D, and T ζ for ζ = 10 during training for an AlexNet trained on a subset of 12.8 k points of the MNIST training set.

Figure 22: (a-d) A VGG-19 configuration trained on 12.8 k training points of the CIFAR-10 dataset. (e-h) A VGG-11 configuration trained on 12.8 k points of the CIFAR-10 dataset. The training is stopped when the training loss gets below 0.01. The presented results are averages over 5 runs. The captions below each figure give the epochs at which the test loss and gradient disparity have each increased for 5 epochs from the beginning of training. The Pearson correlation coefficient ρ between gradient disparity and the test loss is presented in each figure.

Figure 23: Test loss, gradient disparity, EB-criterion Mahsereci et al. (2017), and sign(g i • g j ) for a ResNet-18 trained on the CIFAR-10 dataset, with 0% and 50% random labels. Gradient disparity, contrary to EB-criterion and sign(g i • g j ), clearly distinguishes the setting with real labels from the setting with random labels.

(See Table 9 in Appendix D.) The final test loss and accuracy when using gradient disparity (GD) and k-fold cross-validation (CV) as early stopping criteria, (top) when the available dataset is limited and (bottom) when the available data contains noisy labeled samples. To simulate a limited-data scenario, we take as training set a subset of 1280 samples of the CIFAR-100 dataset. The configurations in the top and bottom rows are ResNet-34 and ResNet-18, respectively. In both methods, the optimization is stopped when the metric (validation loss or GD) has increased for 5 epochs.

Performing one epoch of training the network and computing the gradient disparity takes t_GD ≈ 4 seconds, where B is the number of batches.

The loss and accuracy on the test set, comparing 5-fold cross-validation and gradient disparity as early stopping criteria when the available dataset is limited. The corresponding curves during training are presented in Figure 8. The above results are obtained by stopping the optimization when the metric (either the validation loss or gradient disparity) has increased for five epochs from the beginning of training.

We observe in Table 7 that stopping training after 5 consecutive increases in gradient disparity leads to 91.52% test accuracy. The loss and accuracy on the test set, comparing 10-fold cross-validation and gradient disparity as early stopping criteria when the available dataset is noisy. In all the experiments, 50% of the available data has random labels. The corresponding curves during training are presented in Figure




Table 9: Pearson's correlation coefficient between gradient disparity (D) and the test loss (TL) over the training iterations, compared to the correlation between the variance of gradients (Var) and the test loss.

