THE DYNAMIC OF CONSENSUS IN DEEP NETWORKS AND THE IDENTIFICATION OF NOISY LABELS

Anonymous authors
Paper under double-blind review

Abstract

Deep neural networks have incredible capacity and expressibility, and can seemingly memorize any training set. This introduces a problem when training in the presence of noisy labels, as the noisy examples cannot be distinguished from clean examples by the end of training. Recent research has dealt with this challenge by utilizing the fact that deep networks seem to memorize clean examples much earlier than noisy examples. Here we report a new empirical result: for each example, when looking at the time at which it is memorized by each model in an ensemble of networks, the diversity seen in noisy examples is much larger than in clean examples. We use this observation to develop a new method for noisy label filtration. The method is based on a statistic of the data, which captures the differences in ensemble learning dynamics between clean and noisy data. We test our method on three tasks: (i) noise amount estimation; (ii) noise filtration; (iii) supervised classification. We show that our method improves over existing baselines in all three tasks using a variety of datasets, noise models, and noise levels. Aside from its improved performance, our method has two other advantages. (i) Simplicity, which implies that no additional hyperparameters are introduced. (ii) Modularity: our method does not work in an end-to-end fashion, and can therefore be used to clean a dataset for any other future usage.

1. INTRODUCTION

Deep neural networks dominate the state of the art in an ever increasing list of application domains, but for the most part, this incredible success relies on very large datasets of annotated examples available for training. Unfortunately, large amounts of high-quality annotated data are hard and expensive to acquire, whereas cheap alternatives (obtained by way of crowd-sourcing or automatic labeling, for example) often introduce noisy labels into the training set. By now there is much empirical evidence that neural networks can memorize almost every training set, including ones with noisy and even random labels (Zhang et al., 2017), which in turn increases the generalization error of the model. As a result, the problems of identifying the existence of label noise and of separating noisy labels from clean ones are becoming more urgent, and therefore attract increasing attention. Henceforth, we will call the set of examples in the training data whose labels are correct "clean data", and the set of examples whose labels are incorrect "noisy data". While all labels can eventually be learned by deep models, it has been empirically shown that most noisy datapoints are learned by deep models late, after most of the clean data has already been learned (Arpit et al., 2017). Therefore many methods focus on the learning time of an example in order to classify it as noisy or clean, by looking at its loss (Pleiss et al., 2020; Arazo et al., 2019) or loss per epoch (Li et al., 2020) in a single model. However, these methods struggle to correctly classify clean and noisy datapoints that are learned at the same time, or worse, noisy datapoints that are learned early. Additionally, many of these methods work in an end-to-end manner, and thus neither provide noise level estimation nor deliver separate sets of clean and noisy data for novel future usages.
Our first contribution is a new empirical result regarding the learning dynamics of an ensemble of deep networks, showing that the dynamics is different when training with clean data vs. noisy data. The dynamics of clean data has been studied in (Hacohen et al., 2020; Pliushch et al., 2021), where it is reported that different deep models learn examples in the same order and at the same pace. This means that when training a few models and comparing their predictions, an (approximately) binary occurrence is seen at each epoch e: either all the networks correctly predict the example's label, or none of them does. This further implies that for the most part, the distribution of predictions across points is bimodal. Additionally, a variety of studies showed that the bias and variance of deep networks decrease as the networks' complexity grows (Nakkiran et al., 2021; Neal et al., 2018), providing additional evidence that different deep networks learn data simultaneously. In Section 3 we describe a new empirical result: when training an ensemble of deep models with noisy data, and in contrast to what happens when using clean data, different models learn different datapoints at different times (see Fig. 1). This empirical finding tells us that in an ensemble of networks, the learning dynamics of clean data and noisy data can be distinguished. When training such an ensemble with a mixture of clean and noisy data, the emerging dynamics reflects this observation, as well as the tendency of clean data to be learned faster as previously observed. In our second contribution, we use this result to develop a new algorithm for noise level estimation and noise filtration, which we call DisagreeNet (see Section 4).
Importantly, unlike most alternative methods, our algorithm is simple (it does not introduce any new hyperparameters), parallelizable, easy to integrate with any supervised or semi-supervised learning method and any loss function, and does not rely on prior knowledge of the noise amount. When used for noise filtration, our empirical study (see Section 5) shows the superiority of DisagreeNet as compared to the state of the art, using different datasets, different noise models and different noise levels. When used for supervised classification by way of pre-processing the training set prior to training a deep model, it provides a significant boost in performance, more so than alternative methods.

Relation to prior art

Work on the dynamics of learning in deep models has received increased attention in recent years (e.g., Nguyen et al., 2020; Hacohen et al., 2020; Baldock et al., 2021). Our work adds a new observation to this body of knowledge, which is seemingly unique to an ensemble of deep models (as against an ensemble of other commonly used classifiers). Thus, while there exist other methods that use ensembles to handle label noise (e.g., Sabzevari et al., 2018; Feng et al., 2020; Chai et al., 2021; de Moura et al., 2018), for the most part they cannot take advantage of this characteristic of deep models, and as a result are forced to use additional knowledge, typically the availability of a clean validation set and/or prior knowledge of the noise amount. Work on deep learning with noisy labels (see Song et al. (2022) for a recent survey) can be coarsely divided into two categories: general methods that use a modified loss or network architecture, and methods that focus on noise identification. The first group includes methods that aim to estimate the underlying noise transition matrix (Goldberger and Ben-Reuven, 2016; Patrini et al., 2017), employ a noise-robust loss (Ghosh et al., 2017; Zhang and Sabuncu, 2018; Wang et al., 2019; Xu et al., 2019), or achieve robustness to noise by way of regularization (Tanno et al., 2019; Jenni and Favaro, 2018). Methods in the second group, which is more in line with our approach, focus more directly on noise identification. Some methods assume that clean examples are usually learned faster than noisy examples (e.g., Liu et al., 2020). Others (Arazo et al., 2019; Li et al., 2020) generate soft labels by interpolating the given labels and the model's predictions during training.
Yet other methods (Jiang et al., 2018; Han et al., 2018; Malach and Shalev-Shwartz, 2017; Yu et al., 2019; Lee and Chung, 2019) , like our own, inspect an ensemble of networks, usually in order to transfer information between networks and thus avoid agreement bias. Notably, we also analyze the behavior of ensembles in order to identify the noisy examples, resembling (Pleiss et al., 2020; Nguyen et al., 2019; Lee and Chung, 2019) . But unlike these methods, which track the loss of the networks, we track the dynamics of the agreement between multiple networks over epochs. We then show that this statistic is more effective, and achieves superior results. Additionally (and not less importantly), unlike these works, we do not assume prior knowledge of the noise amount or the presence of a clean validation set, and do not introduce new hyper-parameters in our algorithm. Recently, the emphasis has somewhat shifted to the use of semi-supervised learning and contrastive learning (Li et al., 2020; Liu et al., 2020; Ortego et al., 2020; Wei et al., 2020; Yao et al., 2021; Zheltonozhskii et al., 2022; Li et al., 2022; Karim et al., 2022) . Semi-supervised learning is an effective paradigm for the prediction of missing labels. This paradigm is especially useful when the identification of noisy points cannot be done reliably, in which case it is advantageous to remove labels whose likelihood to be true is not negligible. The effectiveness of semi-supervised learning in providing reliable pseudo-labels for unlabeled points will compensate for the loss of clean labels. However, semi-supervised learning is not universally practical as it often relies on the extraction of effective representations based on unsupervised learning tasks, which typically introduces implicit priors (e.g., that contrastive loss is appropriate). In contrast, our goal is to reliably identify noisy points, to be subsequently removed. 
Thus, our method can be easily incorporated into any SOTA method which uses supervised or semi-supervised learning (with or without contrastive learning), and may provide benefit even when semi-supervised learning is not viable.

2. INTER-NETWORK AGREEMENT: DEFINITION AND SCORES

Measuring the similarity between deep models is not a trivial challenge, as modern deep neural networks are complex functions defined by a huge number of parameters, which are invariant to transformations hidden in the model's architecture. Here we measure the similarity between deep models in an ensemble by measuring inter-model prediction agreement at each datapoint. Accordingly, in Section 2.2 we describe scores that are based on the state of the networks at each epoch e, while in Section 2.3 we describe cumulative scores that integrate these states through many epochs. Practically (see Section 4), our proposed method relies on the cumulative scores, which are shown empirically to provide more accurate results in the noise filtration task. These scores promise added robustness, as it is no longer necessary to identify the epoch at which the score is to be evaluated.

[...] $y_i = g(l_i)$. The two most common models of label noise are termed symmetric noise and asymmetric noise (Patrini et al., 2017). In both cases it is assumed that some fixed percentage of the labels are corrupted by $g(l)$. With symmetric noise, $g(l)$ assigns any new label from the set $[C] \setminus \{l\}$ with equal probability. With asymmetric noise, $g(l)$ is a deterministic permutation function (see App. F for details). Note that the asymmetric noise model is considered much harder than the symmetric noise model.
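The two noise models can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function names, the exact way the corrupted fraction is sampled, and the cyclic permutation used for the asymmetric case are assumptions made here (the permutation actually used is specified in App. F, which is not reproduced in this section).

```python
import numpy as np

def symmetric_noise(labels, num_classes, noise_rate, rng):
    """Corrupt a fixed fraction of labels; each corrupted label is replaced
    by a different class chosen uniformly from the remaining C - 1 classes."""
    labels = np.asarray(labels)
    noisy = labels.copy()
    n_flip = int(round(noise_rate * len(labels)))
    flip_idx = rng.choice(len(labels), size=n_flip, replace=False)
    # An offset in 1..C-1 (mod C) never maps a label to itself and is
    # uniform over the other classes.
    offsets = rng.integers(1, num_classes, size=n_flip)
    noisy[flip_idx] = (labels[flip_idx] + offsets) % num_classes
    return noisy, flip_idx

def asymmetric_noise(labels, permutation, noise_rate, rng):
    """Corrupt a fixed fraction of labels with a deterministic class
    permutation g(l) = permutation[l]."""
    labels = np.asarray(labels)
    permutation = np.asarray(permutation)
    noisy = labels.copy()
    n_flip = int(round(noise_rate * len(labels)))
    flip_idx = rng.choice(len(labels), size=n_flip, replace=False)
    noisy[flip_idx] = permutation[labels[flip_idx]]
    return noisy, flip_idx

rng = np.random.default_rng(0)
C = 10
clean = rng.integers(0, C, size=1000)

noisy_sym, flipped_sym = symmetric_noise(clean, C, 0.2, rng)

# Hypothetical permutation for illustration: shift every class by one.
cyclic = (np.arange(C) + 1) % C
noisy_asym, flipped_asym = asymmetric_noise(clean, cyclic, 0.2, rng)
```

Because the hypothetical permutation has no fixed points, both functions change exactly the requested fraction of labels, which makes the true noise level of the synthetic dataset known by construction.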

2.2. PER-EPOCH AGREEMENT SCORE

Following Hacohen et al. (2020), we define the True Positive Agreement (TPA) score of ensemble $F^e(X)$ at each datapoint $(x, y)$ as

\[ TPA(x, y; F^e(X)) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}_{[f_i^e(x) = y]}. \]

The TPA score measures the average accuracy of the models in the ensemble, when seeing $x$, after each model has been trained for exactly $e$ epochs on $X$. Note that TPA measures the average accuracy of multiple models on one example, as opposed to the generalization error, which measures the average error of one model on multiple examples.
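As a concrete illustration, the TPA score at a single epoch can be computed directly from the hard predictions of the ensemble. The sketch below assumes the predictions are stored as an integer array of shape (N, M), with N models and M datapoints; the function name `tpa` and this array layout are choices made here for illustration.

```python
import numpy as np

def tpa(preds, labels):
    """True Positive Agreement at one epoch.

    preds:  (N, M) integer array, preds[i, m] is the class predicted by
            model i for datapoint m.
    labels: (M,) integer array of the given (possibly noisy) labels.
    Returns an (M,) array with the fraction of models that predict the
    given label at each datapoint.
    """
    preds = np.asarray(preds)
    labels = np.asarray(labels)
    return (preds == labels[None, :]).mean(axis=0)

# Three models, four datapoints.
preds = np.array([[0, 1, 2, 1],
                  [0, 1, 1, 0],
                  [0, 2, 2, 2]])
labels = np.array([0, 1, 2, 3])
scores = tpa(preds, labels)  # one score per datapoint
```

In this toy example the first datapoint is predicted correctly by all three models, the second and third by two of the three, and the last by none.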

2.3. CUMULATIVE SCORES

When inspecting the dynamics of the TPA score on clean data, we see that at the beginning the distribution of $\{TPA(x_i, y_i)\}$ is concentrated around 0, and then quickly shifts to 1 as training proceeds (see side panels in Fig. 2a). This implies that empirically, data is learned in a specific order by all models in the ensemble. To measure this phenomenon we use the Ensemble Learning Pace (ELP) score defined below, which essentially integrates the TPA score over all training epochs:

\[ ELP(x, y) = \frac{1}{E} \sum_{e \in [1,\ldots,E]} TPA(x, y; F^e(X)) \tag{1} \]

$ELP(x, y)$ captures both the time of learning by a single model, and its consistency across models. For example, if all the models learned the example early, the score would be high. It would be significantly lower if some of them learned it later than others (see pseudo-code in App. C). In our study we evaluated two additional cumulative scores of inter-model agreement:

1. Cumulative loss:

\[ CumLoss(x, y) = \frac{1}{NE} \sum_{i \in [1,\ldots,N],\; e \in [1,\ldots,E]} CE(f_i^e(x), y) \]

Above, CE denotes the cross-entropy function. This score is very similar to ELP, engaging the average of the cross-entropy loss instead of the accuracy indicator $\mathbb{1}_{[f_i^e(x) = y]}$.

2. Area under the margin: following Pleiss et al. (2020), the MeanMargin score is defined as follows:

\[ MeanMargin(x, y) = \frac{1}{NE} \sum_{i \in [1,\ldots,N],\; e \in [1,\ldots,E]} \left( [f_i^e(x)]_{y} - \max_{j \neq y} [f_i^e(x)]_{j} \right) \]

The MeanMargin score is the mean of the 'margin', the difference between the value of the ground-truth logit (before softmax) and the value of the otherwise maximal logit.
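The three cumulative scores can be sketched as follows. This is not the pseudo-code of App. C; it is a minimal NumPy illustration that assumes the per-epoch outputs of all models have been stored, with hard predictions in an array of shape (E, N, M) and pre-softmax logits in an array of shape (E, N, M, C). The function names and the storage layout are assumptions made for this example.

```python
import numpy as np

def elp(preds, labels):
    """Ensemble Learning Pace: TPA averaged over epochs.
    preds: (E, N, M) predicted classes; labels: (M,). Returns (M,)."""
    correct = np.asarray(preds) == np.asarray(labels)[None, None, :]
    tpa_per_epoch = correct.mean(axis=1)       # (E, M): average over models
    return tpa_per_epoch.mean(axis=0)          # (M,):  average over epochs

def _label_mask(logits, labels):
    """Boolean mask of shape (1, 1, M, C) selecting the given label's logit."""
    num_classes = logits.shape[-1]
    classes = np.arange(num_classes)[None, None, None, :]
    return classes == np.asarray(labels)[None, None, :, None]

def cum_loss(logits, labels):
    """Cross-entropy averaged over models and epochs.
    logits: (E, N, M, C) pre-softmax outputs. Returns (M,)."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    mask = _label_mask(logits, labels)
    ce = -np.where(mask, log_probs, 0.0).sum(axis=-1)        # (E, N, M)
    return ce.mean(axis=(0, 1))

def mean_margin(logits, labels):
    """Given-label logit minus the largest other logit, averaged over
    models and epochs. Returns (M,)."""
    logits = np.asarray(logits, dtype=float)
    mask = _label_mask(logits, labels)
    label_logit = np.where(mask, logits, 0.0).sum(axis=-1)
    other_max = np.where(mask, -np.inf, logits).max(axis=-1)
    return (label_logit - other_max).mean(axis=(0, 1))

# Toy data: E = 2 epochs, N = 2 models, M = 2 datapoints, C = 3 classes.
labels = np.array([0, 1])
preds = np.array([[[0, 1], [0, 0]],     # epoch 1: model 2 misses point 2
                  [[0, 1], [0, 1]]])    # epoch 2: everything is correct
elp_scores = elp(preds, labels)

logits = np.zeros((2, 2, 2, 3))
logits[..., 0, 0] = 2.0                 # point 1: label-class logit is 2
logits[..., 1, 1] = 2.0                 # point 2: label-class logit is 2
loss_scores = cum_loss(logits, labels)
margin_scores = mean_margin(logits, labels)
```

In the toy data, the second datapoint is learned one epoch later by one of the two models, so its ELP score is lower than that of the first datapoint, which both models predict correctly from the start.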

3. THE DYNAMICS OF AGREEMENT: NEW EMPIRICAL OBSERVATION

In this section we analyze, both theoretically and empirically, how measures of inter-network agreement may indicate the detrimental phenomenon of Overfit. Overfit is a condition that can occur during the training of deep neural networks. It is characterized by the co-occurring decrease of train error or loss and the increase of test error or loss. Recall that train loss is the quantity that is being continuously minimized during the training of deep models, while the test error is the quantity linked to generalization error. When these quantities change in opposite directions, training harms the final performance and thus early stopping is recommended. We begin by showing in Section 3.1 that in an ensemble of linear regression models, overfit and the agreement between models are negatively correlated. When this is the case, an epoch in which the agreement between networks reaches its maximal value is likely to indicate the beginning of overfit. Our next goal is to examine the relevance of this result to deep learning in practice. Yet inexplicably, at least as far as image datasets are concerned, overfit rarely occurs in practice when deep learning is used for image recognition. However, when label noise is introduced, significant overfit occurs. Capitalizing on this observation, we report in Section 3.3 that when overfit occurs in the independent training of an ensemble of deep networks, the agreement between the networks starts to decrease. The approach we describe in Section 4 is motivated by these results: Since it has been observed that noisy data are memorized later than clean data, we hypothesize that overfit occurs when the memorization of noisy labels becomes dominant. This suggests that measuring the dynamics of agreement between networks, which is correlated with overfit as shown below, can be effectively used for the identification of label noise.

3.1. OVERFIT AND AGREEMENT: THEORETICAL RESULT

Since deep learning models are not amenable to a rigorous theoretical analysis, and in order to gain computational insight into such general phenomena as overfit, simpler models are sometimes analyzed (e.g., Weinshall and Amir, 2020). Accordingly, in App. A we analyze the relation between overfit and inter-model agreement in an ensemble of linear regression models. Our analysis culminates in a theorem, which states that the agreement between linear regression models decreases when overfit occurs in all the models, namely, when the generalization error in all the models increases. Here is a brief sketch of the theorem's proof (see Appendix A):

• Disagreement is measured by the empirical variance over models of the error vector at each test point, averaged over the test examples.
• We prove the intuitive Lemma 1, stating the following: overfit occurs in a model iff the gradient step of the model, which is computed from the training set, is negatively correlated with a vector unknown to the learner, namely the gradient step defined by the test set.
• Using some asymptotic assumptions and a lengthy and technical derivation, we show that the disagreement is approximately the sum of the correlations between each network's gradient step and its "test gradient step". Then, it follows immediately from Lemma 1 that if overfit occurs in all the models then the aforementioned disagreement score increases.

3.2. MEASURING THE AGREEMENT BETWEEN MODELS

In order to obtain a score that captures the level of disagreement between networks, we inspect more closely the distribution of $TPA(x, y; F^e(X))$, defined in Section 2.2, over a sample of datapoints, and analyze its dynamics as training proceeds. First, note that if all of the models in ensemble $F^e(X)$ give identical predictions at each point, the TPA score would be either 0 (when all the networks predict a false label) or 1 (when all the networks predict the correct label). In this case, the TPA distribution is perfectly bimodal, with only two peaks at 0 and 1. If the predictions of the models at each point are independent with mean accuracy $p$, then it can be readily shown that TPA is approximately a (scaled) binomial random variable with a unimodal distribution around $p$. Empirically, Hacohen et al. (2020) showed that in ensembles of deep models trained on 'real' datasets as we use here, the TPA distribution is highly bimodal. Since commonly used measures of bimodality, such as the Pearson bimodality score, are ill-fitted for the discrete TPA distribution, we measure bimodality with the following Bimodal Index score:

\[ BI(e) = \frac{1}{M} \sum_{i=1}^{M} \mathbb{1}_{[TPA(x_i, y_i; F^e(X)) = 1]} + \frac{1}{M} \sum_{i=1}^{M} \mathbb{1}_{[TPA(x_i, y_i; F^e(X)) = 0]} \tag{2} \]

$BI(e)$ measures how many examples are either correctly classified by all $N$ models in the ensemble (TPA equal to 1) or incorrectly classified by all of them (TPA equal to 0), rewarding distributions where points are (roughly) equally divided between 0 and 1. Here we use this score to measure the agreement between networks at epoch $e$. When we plot the Bimodal Index (BI) of the TPA score as a function of the epochs (Fig. 2a), we often see two distinct phases. Initially (phase 1), BI is monotonically increasing, namely, both test accuracy and agreement are on the rise. We call it the 'learning' phase. Empirically, in this phase most of the clean examples are being learned (or memorized), as can also be seen in the left side panels of Fig. 2a (cf. Li et al., 2015).
At some point BI may start to decrease, followed by another possible ascent. This is phase 2, in which empirically the memorization of noisy examples dominates the learning (see the right side panels of Fig. 2a ). This fall and rise is explained by another set of empirical observations, that noisy labels are not being learned in the same order by an ensemble of networks (see App. B), which therefore predicts a decline in BI when noisy labels are being learned.
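To make the Bimodal Index concrete, the sketch below computes it from stored predictions: at each epoch it counts the fraction of datapoints on which all models are correct plus the fraction on which all models are wrong. The array layout (E, N, M) and the function names are assumptions made for this illustration, not part of the method's published code.

```python
import numpy as np

def bimodal_index(tpa_scores):
    """Fraction of datapoints whose TPA score is exactly 1 (all models
    correct) or exactly 0 (all models wrong)."""
    tpa_scores = np.asarray(tpa_scores, dtype=float)
    all_correct = np.isclose(tpa_scores, 1.0)
    all_wrong = np.isclose(tpa_scores, 0.0)
    return all_correct.mean() + all_wrong.mean()

def bi_per_epoch(preds, labels):
    """BI(e) for every epoch.
    preds: (E, N, M) predicted classes; labels: (M,). Returns (E,)."""
    correct = np.asarray(preds) == np.asarray(labels)[None, None, :]
    tpa_per_epoch = correct.mean(axis=1)          # (E, M)
    return np.array([bimodal_index(t) for t in tpa_per_epoch])

# Single-epoch example: 3 of the 5 points are unanimous.
bi_single = bimodal_index([1.0, 0.0, 0.5, 1.0, 2.0 / 3.0])

# Two epochs, two models, two datapoints.
labels = np.array([0, 1])
preds = np.array([[[0, 1], [0, 0]],    # epoch 1: models disagree on point 2
                  [[0, 1], [0, 1]]])   # epoch 2: full agreement
bi_curve = bi_per_epoch(preds, labels)
```

Plotting such a curve against the epoch index is one way to look for the rise and subsequent fall of BI described above; a drop in the curve corresponds to epochs where the models stop agreeing on a growing share of the datapoints.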

3.3. OVERFIT AND AGREEMENT: EMPIRICAL EVIDENCE

Earlier work investigating the dynamics of learning in deep networks suggests that examples with noisy labels are learned later (Krueger et al., 2017; Zhang et al., 2017; Arpit et al., 2017; Arora et al., 2019). Since the learning of noisy labels is unlikely to improve the model's test accuracy, we hypothesize that this may be correlated with the occurrence (or increase) of overfit. The theoretical result in Section 3.1 suggests that this may be correlated with a decrease in the agreement between networks. Our goal now is to test this prediction empirically. We next outline empirical evidence that this is indeed the case in actual deep models. In order to boost the strength of overfit, we adopt the scenario of recognition with label noise, where the occurrence of overfit is abundant. When overfit indeed occurs, our experiments show that if the test accuracy drops, then the agreement score BI also decreases (see example in Fig. 2b-bottom). This observation is confirmed with various noise models and different datasets. When overfit does not occur, the prediction is no longer observed (see example in Fig. 2b-top). These results suggest that a consistent drop in the BI index of some training set X can be used to estimate the occurrence of overfit, and possibly even the beginning of noisy label memorization.

[...] Aiming to achieve modular handling of noisy labels, we propose the following two-step approach:

1. Run DisagreeNet.
2. Run a SOTA supervised learning method using the filtered data.

In step 2 it is possible to invoke semi-supervised SOTA methods, using the noisy group as unsupervised data. However, given that semi-supervised learning typically involves additional assumptions (or prior knowledge) as well as high computational complexity (that restricts its applicability to smaller datasets), as discussed in Section 1, we do not consider this scenario here.

5. EMPIRICAL EVALUATION

We evaluate our method in the following scenarios and tasks: 1. Noise identification (Section 5.2), with two complementary sub-tasks: (i) estimate the noise level in the given dataset; (ii) identify the noisy examples. 2. Supervised classification (Section 5.3), after the removal of the noisy examples.

5.1. DATASET AND BASELINES (DETAILS ARE DEFERRED TO APP. F)

Datasets We evaluate our method on a few standard image classification datasets, including Cifar10 and Cifar100 (Krizhevsky et al., 2009) , Tiny imagenet (Le and Yang, 2015) , subsets of Imagenet (Deng et al., 2009) , Clothing1M (Xiao et al., 2015) and Animal10N (Song et al., 2019a) , see App. F for details. These datasets were used in earlier work to evaluate the success of noise estimation (Pleiss et al., 2020; Arazo et al., 2019; Li et al., 2020; Liu et al., 2020) .

Baselines and comparable methods

We report results with the following supervised learning methods for learning from noisy data: DY-BMM and DY-GMM (Arazo et al., 2019) , INCV (Chen et al., 2019) , AUM (Pleiss et al., 2020) , Bootstrap (Reed et al., 2014) , D2L (Ma et al., 2018) , MentorNet (Jiang et al., 2018) , FINE (Kim et al., 2021) . We also report the results of two absolute baselines: (i) Oracle, which trains the model on the clean dataset; (ii) Random, which trains the model after the removal of a random fraction of the whole data, equivalent to the noise level.

Other methods

The following methods use additional prior information, such as a clean validation set or known level of noise: Co-teaching (Han et al., 2018), O2U (Huang et al., 2019b), LEC (Lee and Chung, 2019) and SELFIE (Song et al., 2019b). Direct comparison does injustice to the previous group of methods, and is therefore deferred to App. H. Another group of methods is excluded from the comparison because they invoke semi-supervised or contrastive learning (e.g., Ortego et al., 2020; Li et al., 2020; Karim et al., 2022; Li et al., 2022; Wei et al., 2020; Yao et al., 2021), which is a different learning paradigm (see discussion of prior art in Section 1).

[...] (ii) Noise level estimation, shown in Figs. 4c-4d, showing good noise level estimation especially in the case of symmetric noise. We also compare DisagreeNet to MeanMargin and CumLoss, see Fig. 5.

[...] Not surprisingly, dealing with datasets that are presumed to include inherent label noise proved more difficult, and quite different, than dealing with synthetic noise. As claimed in (Ortego et al., 2020), non-malicious label noise does less damage to networks' generalization than random label noise: on Clothing1M, for example, hardly any overfit is seen during training, even though the data is believed to contain more than 35% noise. Still, here too, DisagreeNet achieves improved accuracy without access to a clean validation set or known noise level (see Table 1). In App. H, Table 8, we compare DisagreeNet to methods that do use such prior knowledge. Surprisingly, we see that DisagreeNet still achieves better results even without using any additional prior knowledge.

5.4. ABLATION STUDY

How many networks are needed? We report in Table 2 the F1 score for noisy label identification, using DisagreeNet with varying numbers of networks. The main boost in performance provided by the use of additional networks is seen when using DisagreeNet on hard noise scenarios, such as asymmetric noise, or with small amounts of noise. Finally, in Fig. 5 we compare DisagreeNet using ELP to DisagreeNet using the MeanMargin and CumLoss scores, as defined in Section 2.3. In symmetric noise scenarios all scores perform well, while in asymmetric noise scenarios the ELP score performs much better, as can be seen in Figs. 5b, 5d. Additional comparisons of the three scores are reported in Apps. E and G.

6. SUMMARY AND DISCUSSION

We presented a new empirical observation, that the variability in the predictions of an ensemble of deep networks is much larger when labels are noisy than it is when labels are clean. This observation is used as a basis for a new method for classification with noisy labels, addressing along the way the tasks of noise level estimation, noisy label identification, and classifier construction. Our method is easy to implement, and can be readily incorporated into existing methods for deep learning with label noise, including semi-supervised methods, to improve the outcome of the methods. Importantly, our method achieves this improvement without making additional assumptions, which are commonly made by alternative methods: (i) Noise level is expected to be unknown. (ii) There is no need for a clean validation set, which many other methods require, but which is very difficult to acquire when the training set is corrupted. (iii) Almost no additional hyperparameters are introduced.

[...]

• Let $\Delta(i, j)$ denote the cross gradient:

\[ \Delta(i, j) = e(i, j) X(j)^\top = w(i) \Sigma_{XX}(j) - \Sigma_{YX}(j) \;\Longrightarrow\; \Delta w(i) = \Delta(i, i) \tag{5} \]

After each GD step, the model and the error are updated as follows:

\[ \tilde{w}(i) = w(i) - \mu \Delta w(i), \qquad \tilde{e}(i, j) = \tilde{w}(i) X(j) - y(j) = e(i, j) - \mu \Delta(i, i) X(j) \]

We note that at step $s$ and $\forall i, j \in [Q]$, $w(i)$ is a random vector in $\mathbb{R}^d$, and $\tilde{e}(i, j)$ is a random vector in $\mathbb{R}^M$. If $j \in \{t\}$, then $\tilde{e}(i, j) = \tilde{e}(i, t)$ is a random vector in $\mathbb{R}^N$.

Test error random variable. Using the above notations, $\{e(i, t)\}_{i=1}^{Q}$ is a set of $Q$ test error vectors in $\mathbb{R}^N$, where the $n$-th component of the $i$-th vector, $e(i, t)_n$, captures the test error of model $i$ on test example $n$. In effect, it is a sample of size $Q$ from the random variable $e(*, t)_n$. This random variable captures the error over test point $n$ of a model computed from a random sample of size $M$. The empirical variance of this random variable will be used to estimate the agreement between the models.

Overfit.
Overfit occurs at step $s$ if

\[ \|\tilde{e}(i, t)\|_F^2 > \|e(i, t)\|_F^2 \tag{6} \]

Measuring inter-model agreement. In classification problems, bi-modality of the ELP score captures the agreement between a set of classifiers, all trained on the same training matrix $X(i) = X$. Since here we are analyzing a regression problem, we need a comparable score to measure agreement between the predictions of $Q$ linear functions. This measure is chosen to be the variance of the test error among models. Accordingly, we will measure disagreement by the empirical variance of the test error random variable $\tilde{e}(*, t)_n$, averaged over all test examples $n \in [N]$. More specifically, consider an ensemble of linear models $\{w(i)\}_{i=1}^{Q}$ trained on set $X$ to minimize (3) with $s$ gradient steps, where $i$ denotes the index of a network instance and $Q$ the number of network instances. Using the test error vectors of these models $e(i, t)$, we compute the empirical variance of each element $\mathrm{var}[e(*, t)_n]$, and sum over the test examples $n \in [N]$:

\[ \sum_{n=1}^{N} \sigma^2[e(*, t)_n] = \sum_{n=1}^{N} \frac{1}{2Q^2} \sum_{i=1}^{Q} \sum_{j=1}^{Q} |e(i, t)_n - e(j, t)_n|^2 = \frac{1}{2Q^2} \sum_{i=1}^{Q} \sum_{j=1}^{Q} \|e(i, t) - e(j, t)\|_F^2 \]

Definition 1 (Inter-model DisAgreement). The disagreement among a set of $Q$ linear models $\{w(i)\}_{i=1}^{Q}$ at step $s$ is defined as follows:

\[ DisAg(s) = \frac{1}{2Q^2} \sum_{i=1}^{Q} \sum_{j=1}^{Q} \|e(i, t) - e(j, t)\|_F^2 \tag{7} \]

A.2 OVERFIT AND INTER-NETWORK AGREEMENT

We first prove Lemma 1, which has the following intuitive interpretation: overfit occurs in model $i$ iff the gradient step of model $i$ (denoted $\Delta w(i)$), which is computed using the training set, is negatively correlated with the 'correct' gradient step, the one we would have obtained had we known the test set (this unattainable vector is denoted $\Delta(i, t)$).

Lemma 1. Assume that the learning rate $\mu$ is small enough so that we can neglect terms that are $O(\mu^2)$. Then in each gradient descent step $s$, overfit occurs iff the gradient step $\Delta w(i)$ of network $i$ is negatively correlated with the cross gradient $\Delta(i, t)$.

Proof. Starting from (6),

\[ \begin{aligned} \text{(overfit)} &\iff \|\tilde{e}(i, t)\|_F^2 > \|e(i, t)\|_F^2 \\ &\iff \|\tilde{e}(i, t)\|_F^2 - \|e(i, t)\|_F^2 = \|e(i, t) - \mu \Delta(i, i) X(t)\|_F^2 - \|e(i, t)\|_F^2 > 0 \\ &\iff -2\mu \Delta(i, i) X(t) e(i, t)^\top + O(\mu^2) > 0 \\ &\iff \Delta(i, i) \cdot \Delta(i, t) < 0 \\ &\iff \Delta w(i) \cdot \Delta(i, t) < 0 \end{aligned} \tag{8} \]

Lemma 2 claims that if the magnitude of the gradient step $\mu$ is small enough, then the operator norm of matrix $I - \mu \Sigma_{XX}$ is smaller than 1. The implication is that a geometric sum of this matrix converges, a technical result which will be used later.

Lemma 2. For any invertible covariance matrix $\Sigma_{XX}$ there exists $\hat{\mu} > 0$, such that $\mu < \hat{\mu} \implies \|I - \mu \Sigma_{XX}\| < 1$.

Proof. Since $\Sigma_{XX}$ is positive-definite, we can write $\Sigma_{XX} = U S U^\top$ for orthogonal matrix $U$ and the diagonal matrix of singular values $S = \mathrm{diag}\{s_i\}$. It follows that $I - \mu \Sigma_{XX} = U \, \mathrm{diag}\{1 - \mu s_i\} \, U^\top$, a matrix whose largest singular value is $1 - \mu s_d$. Since by assumption $s_d > 0$, the lemma follows.

Our last Lemma 3 claims that eventually, after sufficiently many gradient steps, the expected value of the solution approaches the closed-form solution of the vector that minimizes the loss.

Lemma 3. Assume that $\|I - \mu \Sigma_{XX}\| < 1$ and $\Sigma_{XX}$ is invertible. If the number of gradient steps $s$ is large enough so that $\|I - \mu \Sigma_{XX}\|^s$ can be neglected, then

\[ E[w_s] \approx \Sigma_{YX} \Sigma_{XX}^{-1} \tag{9} \]

Proof.
Starting from (4), we can show that

\[ w_s = w_0 (I - \mu \Sigma_{XX})^{s-1} + \mu \Sigma_{YX} \sum_{k=1}^{s-1} (I - \mu \Sigma_{XX})^{k-1} \]

Since $E(w_0) = 0$,

\[ E(w_s) = E(w_0)(I - \mu \Sigma_{XX})^{s-1} + \mu \Sigma_{YX} \sum_{k=1}^{s-1} (I - \mu \Sigma_{XX})^{k-1} = \mu \Sigma_{YX} \sum_{k=1}^{s-1} (I - \mu \Sigma_{XX})^{k-1} \]

Given the lemma's assumptions, this expression can be evaluated and simplified:

\[ \begin{aligned} E(w_s) &= \mu \Sigma_{YX} [I - (I - \mu \Sigma_{XX})]^{-1} [I - (I - \mu \Sigma_{XX})^{s-1}] \\ &= \Sigma_{YX} \Sigma_{XX}^{-1} - \Sigma_{YX} \Sigma_{XX}^{-1} (I - \mu \Sigma_{XX})^{s-1} \approx \Sigma_{YX} \Sigma_{XX}^{-1} \end{aligned} \tag{10} \]

From (7) it follows that a decrease in inter-model agreement at step $s$, which is implied by increased test variance among models, is indicated by the following inequality:

\[ C = DisAg(s) - DisAg(s-1) = \frac{1}{2Q^2} \sum_{i,j=1}^{Q} \|\tilde{e}(i, t) - \tilde{e}(j, t)\|_F^2 - \frac{1}{2Q^2} \sum_{i,j=1}^{Q} \|e(i, t) - e(j, t)\|_F^2 > 0 \tag{11} \]

Theorem. Assume that all models see the same training set, denoted as $X(i) = X \; \forall i \in [Q]$, and that the training data covariance matrix $\Sigma_{XX}$ is full rank. We make the following asymptotic assumptions, which are loosely phrased but can be rigorously defined with additional notations:

1. The learning rate $\mu$ is small enough so that $\|I - \mu \Sigma_{XX}\| < 1$ (from Lemma 2), and additionally we can neglect terms that are $O(\mu^2)$.
2. The number of gradient steps $s$ is large enough so that $\|I - \mu \Sigma_{XX}\|^s$ can be neglected.
3. The number of models $Q$ is large enough so that using the law of large numbers, we get $\frac{1}{Q} \sum_{i=1}^{Q} w(i) \approx E[w]$.

Finally, we assume that overfit occurs at time $s$ in all the models of the ensemble. In other words, at time $s$ the generalization error increases in every model. When these assumptions hold, the agreement between the models decreases.

Proof. (11) can be rearranged as follows:

\[ \begin{aligned} C &= \frac{1}{2Q^2} \sum_{i,j=1}^{Q} \|[e(i, t) - \mu \Delta(i, i) X(t)] - [e(j, t) - \mu \Delta(j, j) X(t)]\|_F^2 - \frac{1}{2Q^2} \sum_{i,j=1}^{Q} \|e(i, t) - e(j, t)\|_F^2 \\ &= \frac{1}{Q^2} \sum_{i,j=1}^{Q} -\mu [e(i, t) - e(j, t)] \cdot [\Delta(i, i) X(t) - \Delta(j, j) X(t)] + O(\mu^2) \\ &= \frac{\mu}{Q^2} \sum_{i,j=1}^{Q} \Big( [\Delta(i, i) \cdot \Delta(j, t) + \Delta(j, j) \cdot \Delta(i, t)] - [\Delta(i, i) \cdot \Delta(i, t) + \Delta(j, j) \cdot \Delta(j, t)] \Big) + O(\mu^2) \end{aligned} \]

where the last transition follows from $e(i, t) X(t)^\top = \Delta(i, t)$.
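Using assumption 1 (terms that are $O(\mu^2)$ can be neglected),

$$C = \mu(C' - C'') + O(\mu^2) \approx \mu(C' - C'')$$

where

$$C'' = \frac{1}{Q^2}\sum_{i,j=1}^{Q}[\Delta(i,i)\cdot\Delta(i,t) + \Delta(j,j)\cdot\Delta(j,t)] = \frac{2}{Q}\sum_{i=1}^{Q}\Delta(i,i)\cdot\Delta(i,t)$$

and

$$
\begin{aligned}
C' &= \frac{1}{Q^2}\sum_{i,j=1}^{Q}[\Delta(i,i)\cdot\Delta(j,t) + \Delta(j,j)\cdot\Delta(i,t)] \\
&= \Big[\frac{1}{Q}\sum_{i=1}^{Q}\Delta(i,i)\Big]\cdot\Big[\frac{1}{Q}\sum_{j=1}^{Q}\Delta(j,t)\Big] + \Big[\frac{1}{Q}\sum_{j=1}^{Q}\Delta(j,j)\Big]\cdot\Big[\frac{1}{Q}\sum_{i=1}^{Q}\Delta(i,t)\Big] \\
&= \Big[\frac{1}{Q}\sum_{i=1}^{Q}\Delta(i,i)\Big]\cdot\Big[\frac{2}{Q}\sum_{j=1}^{Q}\Delta(j,t)\Big]
\end{aligned}
\tag{14}
$$

Next, we prove that $C'$ is approximately 0. We first deduce from the assumption that all models see the same training set, together with assumption 3, that

$$\frac{1}{Q}\sum_{i=1}^{Q}\Delta(i,i) = \frac{1}{Q}\sum_{i=1}^{Q}\big[w(i)\Sigma_{XX}(i) - \Sigma_{YX}(i)\big] = \Big[\frac{1}{Q}\sum_{i=1}^{Q}w(i)\Big]\Sigma_{XX} - \Sigma_{YX} \approx E[w]\Sigma_{XX} - \Sigma_{YX}$$

From assumption 2 and Lemma 3, we have that $E[w] \approx \Sigma_{YX}\Sigma_{XX}^{-1}$. Thus

$$\frac{1}{Q}\sum_{i=1}^{Q}\Delta(i,i) \approx E[w]\Sigma_{XX} - \Sigma_{YX} \approx \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XX} - \Sigma_{YX} = 0$$

From this derivation and (14) we may conclude that $C' \approx 0$. Thus

$$C \approx -\mu C'' = -\mu\,\frac{2}{Q}\sum_{i=1}^{Q}\Delta(i,i)\cdot\Delta(i,t) \tag{15}$$

If overfit occurs at time $s$ in all the models of the ensemble, then by Lemma 1 every term $\Delta(i,i)\cdot\Delta(i,t)$ is negative, and therefore $C > 0$ from (15). From (11) we may conclude that the inter-model agreement decreases, which concludes the proof.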
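The statements of Lemma 2 and Lemma 3 can be checked numerically on a toy problem. The sketch below is only an illustration, not part of the analysis: it assumes the gradient step $\Delta(i,i) = w(i)\Sigma_{XX} - \Sigma_{YX}$ used in the proof above, and it assumes the normalization $\Sigma_{XX} = XX^\top/M$, $\Sigma_{YX} = yX^\top/M$, which is our choice. The dimensions, learning rate, and number of steps are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, Q = 5, 200, 50        # input dim, training-set size, ensemble size (illustrative)
mu, steps = 0.05, 2000      # learning rate and number of GD steps (illustrative)

# Shared training set: X is d x M (columns are examples), y is a row vector of length M.
X = rng.normal(size=(d, M))
w_true = rng.normal(size=d)
y = w_true @ X + 0.1 * rng.normal(size=M)

# Assumed normalization (not taken from the paper): divide by the number of examples.
sigma_xx = X @ X.T / M
sigma_yx = y @ X.T / M

# Lemma 2: for small enough mu, the operator norm of I - mu * Sigma_XX is below 1.
op_norm = np.linalg.norm(np.eye(d) - mu * sigma_xx, ord=2)

# Ensemble of Q models with zero-mean random initialization, one row per model.
W = rng.normal(size=(Q, d))
for _ in range(steps):
    # Gradient step Delta(i, i) = w(i) Sigma_XX - Sigma_YX, applied to all models at once.
    W -= mu * (W @ sigma_xx - sigma_yx)

# Lemma 3: the ensemble mean approaches the closed-form solution Sigma_YX Sigma_XX^{-1}.
w_closed = sigma_yx @ np.linalg.inv(sigma_xx)
max_gap = np.abs(W.mean(axis=0) - w_closed).max()
print(f"operator norm: {op_norm:.3f}, max gap to closed form: {max_gap:.2e}")
```

In this toy setting the operator norm is below 1 and the gap to the closed-form solution is negligible, in line with the two lemmas; the check says nothing about the deep, non-linear case.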

B NOISY LABELS AND INTER-MODEL AGREEMENT

Here we show empirical evidence that noisy labels are not learned in the same order by an ensemble of networks. To see this, we measure the distance between the TPA distribution, computed separately for clean examples and for noisy examples, and the binomial distribution, which is the

methods are not always stable and robust as a solution for label noise, and sometimes it is best to identify the noisy labels and retrain on the filtered dataset.

Alternative scores We evaluate the two alternative scores defined in Section 2.3, CumLoss and MeanMargin, in which case Step 2 of DisagreeNet is executed using one of them instead of the ELP score. Fig. 9 shows the Probability Distribution Function (PDF) of the three scores, revealing that ELP is more consistently bimodal (especially in the difficult asymmetric case), with modes (peaks) that appear more separable. This benefit translates to superior performance in the noise filtration task (Figs. 5b, 5d). In addition, we evaluated the importance of using solely the last epoch instead of the entire history in Fig. 10, which shows that using ELP is superior to using the TPA for filtration, even if one uses the best epoch for TPA filtration. We believe that this empirical observation of increased mode separation is due to a significant difference in the pace of change in agreement values during training between clean and noisy data, in contrast with the pace of change in smoother measures of confidence like Margin and Loss (see App. G). Note that with the easier symmetric noise we do not see this difference, and indeed the other scores exhibit two nicely separated modes, sometimes achieving even better results in noise

Additionally, in Table 7 we evaluate the confusion score in (Simsek et al., 2022), which is the mean (over the epochs) entropy of the mean (over multiple networks) logit of each class (used in the original paper for an out-of-distribution study).
We see that this score is also much inferior to ours in the binary classification task of noisy labels identification. 

F METHODOLOGY AND TECHNICAL DETAILS

Datasets We evaluated our method on a few standard image classification datasets, including Cifar10 and Cifar100 (Krizhevsky et al., 2009) and Tiny ImageNet (Le and Yang, 2015). Cifar10/100 consist of 60k 32 × 32 color images of 10 and 100 classes respectively. Tiny ImageNet consists of 100,000 images from 200 classes of ImageNet (Deng et al., 2009), downsampled to size 64 × 64. The Animal10N dataset contains 5 pairs of confusing animals with a total of 55,000 64 × 64 images. Clothing1M (Xiao et al., 2015) contains 1M clothing images in 14 classes. These datasets were used in earlier work to evaluate the success of noise estimation (Pleiss et al., 2020; Arazo et al., 2019; Li et al., 2020; Liu et al., 2020).

Baseline methods for comparison We evaluate our method in the context of two approaches designed to deal with label noise: methods that improve supervised learning by identifying noisy labels and removing or reducing their influence on training, and iterative methods that use semi-supervised algorithms in order to learn with noisy labels.

First approach: ⋄ DY-BMM and DY-GMM (Arazo et al., 2019)

Technical Details Unless stated otherwise, we used an SGD optimizer with 0.9 momentum and a learning rate of 0.01, weight decay of 5e-4, and batch size of 32. We used a cosine-annealing scheduler in all of our experiments and used standard augmentation (horizontal flips, random crops) during training. We inspected the effect of different hyperparameters in the ablation study. All of our experiments were conducted on the internal cluster of the Hebrew University, on GPU type Ampere A10.

G COMPARING AGREEMENT TO CONFIDENCE IN NOISE FILTRATION

While the learning time of an example has been shown to be effective for noise filtration, it fails to separate noisy and clean data that are learned more or less at the same time. To tackle this problem, one needs additional information, beyond the learning time of a single network. When using an ensemble, we can use the TPA score, or else the average probability assigned to the ground truth label (denoted the "correct" logit) by the networks. The latter score conveys the model's confidence in the ground truth label, and is used by our two alternative scores, CumLoss and MeanMargin.

Going beyond learning time, we propose to look at "how quickly" the agreement value rises from 0 to 1, denoted as the "slope" of the agreement. Since our empirical results indicate that the learning time of noisy data is much more varied, we expect a slower rise in agreement over noisy data as compared to clean data. In our experiments, ELP achieved superior results in noise filtration. We hypothesize that the difference in slope between clean and noisy data may underlie the superiority of ELP in noise filtration.

To check this hypothesis, we compare two scores computed at each data example: ELP and Logits Mean (denoted LM for simplicity). LM is defined as follows:

$$LM(x) = \frac{1}{kT}\sum_{i=1}^{k}\sum_{j=1}^{T}[p_{i,j}(x)]_y$$

where $k$ is the number of networks, $T$ is the number of epochs during training, $(x, y)$ is a data example and its assigned label, and $[p_{i,j}(x)]_y$ is the probability assigned by network $i$ in epoch $j$ to $y$ (the ground truth label).

In order to compare the pace of increase (slope) of ELP and LM, we conduct the following analysis: We select the two groups of clean and noisy data that are learned (roughly) at the same time by some net in the ensemble, and then compute the average agreement and "correct" logit functions as a function of epoch, separately for clean and noisy data.
We then compute the difference per epoch between the noisy and clean averages, separately for the agreement and for the "correct" logit, which we denote as ∆Agreement and ∆logit respectively. Note that ∆Agreement and ∆logit encode the difference in the slope between noisy and clean data, since they begin to rise at (roughly) the same time. Finally, we plot in Fig. 11 the difference between ∆Agreement and ∆logit, recalling that larger ∆ indicates stronger separation between the clean and noisy data.

Indeed, our analysis shows that with asymmetric noise, the difference between the agreement slope on clean and noisy data of the ELP score is consistently larger than the corresponding slope difference between the average logits on clean and noisy data. This, we believe, is the key reason why ELP outperforms LM in noise filtration. Note that this effect is much less pronounced when using the easier symmetric noise, and indeed, our empirical results show that ELP does not outperform LM significantly in this case.

To conclude, we believe that the signal encoded by the agreement values is stronger than the signal encoded in measures of confidence in the networks' prediction when true labels are concerned, which explains its capability to classify correctly even some hard-clean examples and easy-noisy examples as clean and noisy, respectively.
This, we believe, is a result of the polarization effect caused by the binary indicators inside TPA, which disregard misleading positive probabilities assigned to noisy labels even before they are learned by the networks.
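The two kinds of score compared in this appendix can be sketched as follows. This is a minimal illustration under stated assumptions: `probs` is a hypothetical array of shape (k, T, C) holding the class probabilities that each of k networks assigns to one example at each of T epochs; LM follows the definition given above; the per-epoch agreement is taken to be the fraction of networks whose top prediction equals the assigned label (the binary indicators mentioned above); and ELP is assumed here to be the mean of that agreement over all epochs, consistent with the description of ELP as using the entire history. The function names are ours.

```python
import numpy as np

def lm_score(probs, label):
    """Logits Mean: average probability of the assigned label over k networks and T epochs."""
    return probs[:, :, label].mean()

def agreement_per_epoch(probs, label):
    """Fraction of networks whose top prediction equals the assigned label, per epoch."""
    correct = probs.argmax(axis=2) == label   # shape (k, T), binary indicators
    return correct.mean(axis=0)               # shape (T,)

def elp_score(probs, label):
    """Assumed ELP: mean of the per-epoch agreement over all T epochs."""
    return agreement_per_epoch(probs, label).mean()

# Toy example: k=2 networks, T=3 epochs, C=2 classes, assigned label 1.
probs = np.array([
    [[0.60, 0.40], [0.45, 0.55], [0.10, 0.90]],   # network 0
    [[0.70, 0.30], [0.60, 0.40], [0.20, 0.80]],   # network 1
])
lm = lm_score(probs, label=1)
tpa = agreement_per_epoch(probs, label=1)
elp = elp_score(probs, label=1)
print(lm, tpa, elp)
```

In the toy example the probability of the assigned label is already positive in the first epoch, before any network predicts it, so LM registers a signal that the binary agreement indicators ignore.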

H COMPARING TO METHODS WITH DIFFERENT ASSUMPTIONS

Here we compare DisagreeNet to methods that assume a known noise level, O2U (Huang et al., 2019b) and LEC (Lee and Chung, 2019), using a 9-layered CNN for the training (with standard hyperparameters as detailed in App. F). Since the noise level is assumed known, we replace the estimation provided by DisagreeNet with the actual noise level. The results are summarized in 



Figure 1: With noisy labels models show higher disagreement. The noisy examples are not only learned at a later stage, but each model learns the example at its own different time.

Let $f^e: \mathbb{R}^d \to [0,1]^{|C|}$ denote a deep model, trained with Stochastic Gradient Descent (SGD) for $e$ epochs on training set $X = \{(x_i, y_i)\}_{i=1}^{M}$, where $x_i \in \mathbb{R}^d$ denotes a single example and $y_i \in [C]$ its corresponding label. Let $F^e(X) = \{f^e_1, \ldots, f^e_N\}$ denote an ensemble of $N$ such models, where each model $f^e_i$, $i \in [N]$, is initialized and trained independently on $X$ for $E$ epochs.

Noise model We analyze the training dynamics of an ensemble of models in the presence of label noise. Label noise is different from data noise (like image distortion or additive Gaussian noise). Here it is assumed that after the training set $X = \{(x_i, l_i)\}_{i=1}^{M}$ is sampled, the labels $\{l_i\}$ are corrupted by some noise function $g: [C] \to [C]$, and the training set becomes $X = \{(x_i, y_i)\}_{i=1}^{M}$
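As an illustration of a noise function $g$, the sketch below implements one common convention for symmetric label noise: each label is replaced, with probability `noise_rate`, by a different class chosen uniformly at random. This convention, the function name, and the parameter names are assumptions made for the example; the exact noise functions used in the experiments are not specified in this passage.

```python
import numpy as np

def symmetric_noise(labels, num_classes, noise_rate, seed=0):
    """Flip each label to a different, uniformly chosen class with probability noise_rate."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    flip = rng.random(labels.shape[0]) < noise_rate
    # An offset in [1, num_classes - 1] modulo num_classes always yields a different class.
    offsets = rng.integers(1, num_classes, size=labels.shape[0])
    noisy = np.where(flip, (labels + offsets) % num_classes, labels)
    return noisy, flip

# Toy usage: 1000 clean labels over 10 classes, 20% symmetric noise.
clean = np.arange(1000) % 10
noisy, flipped = symmetric_noise(clean, num_classes=10, noise_rate=0.2)
print("empirical noise level:", (noisy != clean).mean())
```

The returned mask marks which examples belong to the noisy data, which is the ground truth needed to score any noise filtration method on simulated noise.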

Figure 2: (a) Main panel: bimodality in an ensemble of 10 DenseNet networks, trained to classify Cifar10 with 20% symmetric noise. Side panels: TPA distribution in 6 epochs (blue -clean examples, orange -noisy ones). (b) Scatter plots of test accuracy vs train bimodality, measured by BI(e) as defined in (2), where changes in color from blue to yellow correspond with advancing epochs.

Figure 3: ELP distribution, shown separately for the clean data in blue and the noisy data in orange. Superimposed, in blue and orange lines, is the bi-modal BMM fit to the ELP total (not separated) distribution

Figure 4: (a)-(b) Noise identification: F1 score for noisy label identification task, using different noise levels (X-axis), with symmetric (a) and asymmetric (b) noise models. Results reflect 3 repetitions involving an ensemble of 10 DenseNets each. (c)-(d) Noise level estimation: different noise levels are evaluated (X-axis), with symmetric (c) and asymmetric (d) noise models (the 3 comparison baselines did not report this estimate).

Figure 5: (a)-(b): F1 score (Y-axis) for the noisy label identification task, using different noise levels (X-axis), with symmetric (a) and asymmetric (b) noise models. Results with 3 variants of DisagreeNet are shown, based on 3 scores: MeanMargin, ELP and CumLoss. (c)-(d): Error in noise level estimation (Y-axis) using different noise levels (X-axis), with symmetric (c) and asymmetric (d) noise models. As can be seen, ELP very significantly outperforms the other 2 scores when handling asymmetric noise.

Figure 6: (a)-(d): The empirical distribution of the agreement values over epochs (X-axis: epochs, Y -axis: agreement, color code: blue for low and red for high). Clearly, the distribution of noisy examples resembles the binomial distribution with matched expected value, while the clean examples distribution is far from binomial. (e) Wasserstein distance between the binomial distribution and the empirical agreement distribution over epochs.

Figure 9: Distribution of the CumLoss, MeanMargin and ELP scores during training. ELP remains bimodal even for hard noise models, where the other scores become unimodal.

Figure 10: F1 score over the epochs, when using the entire history (ELP) or just the last epoch (TPA), evaluated with prior knowledge

Figure 11: Top: X-axis is the learning time of the chosen clean and noisy data; Y-axis is the difference between ∆Agreement and ∆logit. We see that for most of the training, the difference is positive, implying that ELP provides stronger separation between these groups. Bottom: X-axis is the difference between ∆Agreement and ∆logit; Y-axis is the ratio between the amount of clean and noisy data. The color represents the learning time of the groups. These graphs show that while at the end of the training the difference between ∆Agreement and ∆logit is negative, implying that LM would be better at separating these groups, these are in fact very small sets of data, as most of the data is learned by some network at an earlier stage of the training.

Test accuracy (%), average and standard error, in the best epoch of retraining after filtration. Results of benchmark methods (see Section 5.1) are taken from (Pleiss et al., 2020) except FINE (Kim et al., 2021), which was re-implemented by us using the official code. The top and middle tables show CIFAR-10, CIFAR-100 and Tiny Imagenet, with simulated noise. The bottom table shows three 'real noise' datasets, and includes in addition results of noise level estimation (when applicable). The presumed noise level for these datasets is indicated in the top line following (Huang et al., 2019a; Song et al., 2019b).

DisagreeNet is used to remove noisy examples, after which we train a deep model from scratch using the remaining examples only. We report our main results using the Densenet architecture, and report results with other architectures in the ablation study. Table 1 summarizes the results for simulated symmetric and asymmetric noise on 5 datasets, and 3 repetitions. It also shows results on 2 real datasets, which are assumed (in previous work) to contain significant levels of 'real' label noise. Additional results are reported in App. H, including methods that require additional prior knowledge.

F1 score of DisagreeNet, using different numbers of models.

F1 score for Cifar100 with 2 levels of symmetric noise. Different ablation conditions are marked in columns: ResNet34 indicates a change of architecture, no Aug indicates that image augmentations are not used, and lr 0.01 indicates that no scheduler or learning rate drop are used during training.

Test accuracy after retraining on the filtered dataset when using DisagreeNet with different numbers of models.

Test accuracy (%), average and standard error, in the best epoch of retraining after filtration. Results of ELR and DivideMix were computed by us, using the official implementation for the methods, using same hyperparameters as ours.

Final accuracy results when changing the backbone architecture.

AUC (area under the curve) score for noisy labels identification using our ELP score and the confusion score from (Simsek et al., 2022).

estimate mixture models on the loss to separate noisy and clean examples.
⋄ INCV (Chen et al., 2019) iteratively filters out noisy examples by using cross-validation.
⋄ AUM (Pleiss et al., 2020) inserts corrupted examples to determine a filtration threshold, using the mean margin as a score.
⋄ Bootstrap (Reed et al., 2014) interpolates between the net predictions and the given label.
⋄ D2L (Ma et al., 2018) follows Bootstrap, and uses the examples' dimensional attributes for the interpolation.

Second approach:
⋄ SELF (Nguyen et al., 2019) iteratively uses an exponential moving average of a net's prediction over the epochs, compared to the ground truth labels, to filter noisy labels and retrain.
⋄ Meta learning (Li et al., 2019) uses a gradient-based technique to update the network's weights with noise tolerance.
⋄ DivideMix (Li et al., 2020) uses 2 networks to flag examples as noisy or clean with a two-component mixture, after which the SSL technique MixMatch (Berthelot et al., 2019) is used.
⋄ ELR (Liu et al., 2020) identifies early-learned examples, and uses them to regulate the learning process.
⋄ C2D (Zheltonozhskii et al., 2022) uses the same algorithm as ELR and DivideMix, and uses a pretrained net with unsupervised loss.

Table 8. We also compare DisagreeNet to other methods that use prior knowledge, where DisagreeNet does not use prior knowledge. The results are summarized in Table 9.

Test accuracy (%) comparison with methods that utilize prior knowledge, using a 9-layered CNN.

Under review as a conference paper at ICLR 2023

± 0.2 91.1 ± 0.1 83.9 ± 0.08 77.3 ± 0.2 71.8 ± 0.3 64.7 ± 0.3

Test accuracy (%) comparison with methods that utilize prior knowledge of the real noise level.

APPENDIX A OVERFIT AND INTER-MODEL CORRELATION

In this section we formally analyze the relation between two types of scores, which measure either overfit or inter-model agreement. Overfit is a condition that can occur during the training of deep neural networks. It is characterized by the co-occurring decrease of train error or loss, which is continuously minimized during the training of a deep model, and the increase of test error or loss, which is the ideal measure one would have liked to minimize and which determines the network's generalization error. An agreement score measures how similar the models are in their predictions.

We start by introducing the model and some notations in Section A.1. In Section A.2 we prove the main result (Prop. A.2): the occurrence of overfit at time s in all the models of the ensemble implies that the agreement between the models decreases.

A.1 MODEL AND NOTATIONS

Model. We analyze the agreement between an ensemble of $Q$ models, computed by solving the linear regression problem with Gradient Descent (GD) and random initialization. In this problem, the learner estimates a linear function $f(x): \mathbb{R}^d \to \mathbb{R}$, where $x \in \mathbb{R}^d$ denotes an input vector and $y \in \mathbb{R}$ the desired output. Given a training set of $M$ pairs $\{x_m, y_m\}_{m=1}^{M}$, let $X \in \mathbb{R}^{d \times M}$ denote the training input, a matrix whose $m$-th column is $x_m \in \mathbb{R}^d$, and let row vector $y \in \mathbb{R}^M$ denote the output vector whose $m$-th element is $y_m$. Let $N$ denote the size of the test set. When solving a linear regression problem, we seek a row vector $\hat{w} \in \mathbb{R}^d$ that satisfies

To solve (3) with GD, we perform at each iterative step $s \geq 1$ the following computation:

for some random initialization vector $w_0 \in \mathbb{R}^d$ where usually $E[w_0] = 0$, and learning rate $\mu$. Henceforth we omit the index $s$ when self evident from context.

As a final remark, when we use the notation $\|A\|$ below for some matrix $A$, differently from $\|A\|_F$, it denotes the operator norm of the symmetric matrix $A$, namely, its largest singular value.

Additional notations $w(i)$ is the model learned by network $i$, and $\Delta w(i)$ is the gradient step of $w(i)$, where

$e(i, j)$ denotes a function, which maps indices $i \in Q$, $j \in Q'$ to the cross error of model $i$ on data $j$: the classification error vector when using model $w(i)$ to estimate $y(j)$. Let

Note that in this notation, $e(i, t)$ is the classification error vector when using model $i$, which is trained on data $X(i)$, to estimate the desired outcome on the test data, $y(t)$. $\|e(i, t)\|_F$ is the test error of classifier $i$, an estimate of its generalization error.

E ABLATION STUDY

Table 3 summarizes experiments relating to architecture, scheduler, and augmentation usage. Table 4 provides additional support that our method can be used with only a few networks, as the test accuracy after retraining does not improve when using more than 2-3 networks. In Table 5 we compare our method to two competitive methods from recent years that use semi-supervised learning or similar ideas (Li et al., 2020; Liu et al., 2020). These methods have a different goal than ours (constructing a robust classifier, and not noisy labels identification), which is why we did not compare to them in the main text. However, we chose to include this comparison here, as it shows that semi-supervised

