DO WE ALWAYS NEED TO PENALIZE VARIANCE OF LOSSES FOR LEARNING WITH LABEL NOISE?

Anonymous

Abstract

Algorithms that minimize the averaged loss have been widely designed for dealing with noisy labels. Intuitively, when the training sample is finite, penalizing the variance of losses should improve the stability and generalization of these algorithms. Interestingly, we found that the variance of losses sometimes needs to be increased when learning with noisy labels. Specifically, increasing the variance of losses boosts the memorization effect and reduces the harmfulness of incorrect labels. Regularizers that increase the variance of losses are easy to design and can be plugged into many existing algorithms. Empirically, increasing the variance of losses with the proposed method improves the generalization ability of baselines on both synthetic and real-world datasets.

1. INTRODUCTION

Learning with noisy labels dates back to the 1980s (Angluin & Laird, 1988). It has recently drawn a lot of attention (Liu & Tao, 2015; Nguyen et al., 2019; Li et al., 2020; 2021) because large-scale datasets used in training modern deep learning models can easily contain label noise, e.g., ImageNet (Deng et al., 2009) and Clothing1M (Xiao et al., 2015). The reason is that it is expensive and sometimes infeasible to accurately annotate large-scale datasets. Meanwhile, many cheap but imperfect surrogates such as crowdsourcing and web crawling are widely used to build large-scale datasets. Training with such data can lead to poor generalization of modern deep learning models because they will overfit noisy labels (Han et al., 2018b; Zhang et al., 2021).

Generally, algorithms for learning with noisy labels can be divided into two categories: statistically inconsistent algorithms and statistically consistent algorithms. Methods in the first category are heuristic, such as selecting reliable examples to train the model (Han et al., 2018b; Malach & Shalev-Shwartz, 2017; Ren et al., 2018; Jiang et al., 2018), correcting labels (Ma et al., 2018; Kremer et al., 2018; Tanaka et al., 2018; Reed et al., 2014), and adding regularization (Han et al., 2018a; Guo et al., 2018; Veit et al., 2017; Liu et al., 2020). Those methods perform well empirically. However, the classifiers they learn from noisy data are not guaranteed to be statistically consistent, and the methods often need extensive hyper-parameter tuning on clean data. To address this problem, many researchers explore algorithms in the second category. Those algorithms aim to learn statistically consistent classifiers (Liu & Tao, 2015; Patrini et al., 2017; Liu et al., 2020; Xia et al., 2020). Specifically, their objective functions are designed so that minimizing the expected risk on the noisy domain is equivalent to minimizing the expected risk on the clean domain.
In practice, it is infeasible to calculate the expected risk. To approximate it, existing methods minimize the empirical risk, i.e., the averaged loss over the noisy training examples, which is an unbiased estimator of the expected risk (Xia et al., 2019; Li et al., 2021); their difference vanishes as the training sample size goes to infinity. However, when the number of examples is limited, the variance of the empirical risk could be high, which leads to a large estimation error, so penalizing the variance of losses appears natural. We report that penalizing the variance of losses is nevertheless not always helpful for learning with noisy labels. By contrast, in most cases, we need to increase the variance of losses, which will boost the memorization effect (Bai & Liu, 2021) and reduce the harmfulness of incorrect labels. This is because deep neural networks tend to learn easy and majority patterns first due to the memorization effect (Bai & Liu, 2021; Zhang et al., 2021). Incorrectly labeled data is in the minority and has a more complex relationship between instances and labels than correctly labeled data, so it is harder for neural networks to remember. Therefore, the losses of instances with incorrect labels are likely to be larger than those of instances with correct labels (Han et al., 2018b).

Figure 1: Training loss of instances with correct labels and of instances with incorrect labels (yellow solid lines) obtained by penalizing the variance of losses, employing the original loss, and increasing the variance of losses in (a)-(c), respectively. The dataset is CIFAR-10 with symmetry-flipping noise, and the noise rate is 0.2. The neural network ResNet-18 and the baseline Forward (Patrini et al., 2017) are employed. The transition matrix T is given and does not need to be estimated.
Penalizing the variance of losses could force the model to reduce the losses of instances with incorrect labels, because correctly labeled instances are in the majority and their losses are smaller. This makes it hard to distinguish correctly labeled from incorrectly labeled data and leads to performance degradation. In contrast, increasing the variance of losses could efficiently prevent large losses from decreasing, so the model may not overfit instances with incorrect labels. In Section 3, we further show that increasing the variance of losses can be seen as a weighting method that assigns small weights to the gradients of large losses and large weights to the gradients of small losses, which could reduce the effect of instances with incorrect labels when updating the model's parameters. More discussion of the memorization effect can be found in the Appendix.

As illustrated in Fig. 1, changing the variance of losses has little influence on the averaged training loss of instances with correct labels, but makes the averaged training loss of instances with incorrect labels very different. Specifically, penalizing the variance of losses makes the averaged training loss of instances with incorrect labels decrease fast, which encourages the model to overfit those instances. On the contrary, increasing the variance of losses prevents the averaged training loss of instances with incorrect labels from decreasing, as shown in Fig. 1c. Therefore, the memorization effect is boosted. As a result, the test accuracy is improved significantly by encouraging the variance of losses. From the empirical risk minimization perspective, we are encouraged to reduce the variance of losses to increase algorithmic stability. However, to handle label noise, as explained, we may need to boost the variance of losses. This implies that the label noise issue should be carefully considered when designing the loss variance part of learning algorithms.
We empirically report that the variance of losses should be boosted in most settings of learning with noisy labels studied in the literature. The rest of this paper is organized as follows. In Section 2, we introduce related work. In Section 3, we propose our method and its advantages. Experimental results on both synthetic and real-world datasets are provided in Section 4. Finally, we conclude the paper in Section 5.

2. RELATED WORK

Some methods reduce the side effects of noisy labels using heuristics. For example, many methods utilize the memorization effect to select reliable examples (Han et al., 2020; Yao et al., 2020a; Yu et al., 2019; Jiang et al., 2018) or to correct labels (Ma et al., 2018; Kremer et al., 2018; Tanaka et al., 2018; Reed et al., 2014). Those methods perform well empirically. However, most of them do not provide statistical guarantees for the classifiers learned on noisy data. Some methods treat incorrect labels as outliers and focus on designing bounded loss functions (Ghosh et al., 2017; Gong et al., 2018; Wang et al., 2019; Shu et al., 2020). For example, a symmetric cross-entropy loss has been proposed and proven to be asymptotically robust to label noise (Wang et al., 2019). These methods focus on the numerical properties of loss functions, and the designed loss functions can be proved to be noise-tolerant if the noise rate is not large.

The label noise transition matrix $T(x) \in [0, 1]^{C \times C}$ (Patrini et al., 2017; Liu & Tao, 2015; Li et al., 2021), where $C$ is the number of classes, has been widely employed to design statistically consistent classifiers (Liu & Tao, 2015; Patrini et al., 2017; Xia et al., 2020; Li et al., 2021). Let the clean class posterior be $P(Y|X = x) := [P(Y = 1|X = x), \ldots, P(Y = C|X = x)]^\top$. It can be inferred from the noisy class posterior $P(\tilde{Y}|X = x)$ and the transition matrix $T(x)$, where $T_{ij}(x) = P(\tilde{Y} = i|Y = j, X = x)$, i.e., $P(Y|X = x) = [T(x)]^{-1} P(\tilde{Y}|X = x)$. Then, the expected risk of a function $f(X, Y)$ modeling $P(Y|X)$ can be formulated as the expected risk of a function $g(X, \tilde{Y})$ modeling $P(\tilde{Y}|X)$ multiplied by the inverse of $T(X)$, i.e., $R(f(X, Y)) = R([T(X)]^{-1} g(X, \tilde{Y}))$. In practice, the expected risk $R([T(X)]^{-1} g(X, \tilde{Y}))$ cannot be calculated, so existing methods approximate it with the averaged loss over the noisy training examples.
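As a small numerical illustration of the relation between the clean and noisy class posteriors through the transition matrix, the following sketch uses a two-class problem. It assumes an instance-independent matrix (the same T for every x) with symmetric noise rate 0.2; the helper names `mat_vec` and `inverse_2x2` and all numbers are hypothetical and chosen only for illustration.

```python
def mat_vec(T, p):
    """Multiply a square matrix T (list of rows) by a vector p."""
    return [sum(T[i][j] * p[j] for j in range(len(p))) for i in range(len(T))]


def inverse_2x2(T):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = T
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("transition matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


# Column-stochastic convention from the text: T[i][j] = P(noisy = i | clean = j).
T = [[0.8, 0.2],
     [0.2, 0.8]]

clean_posterior = [0.9, 0.1]                     # assumed P(Y | X = x)
noisy_posterior = mat_vec(T, clean_posterior)    # P(noisy Y | X = x) = T P(Y | X = x)
recovered = mat_vec(inverse_2x2(T), noisy_posterior)  # T^{-1} applied to the noisy posterior

print(noisy_posterior, recovered)
```

Applying the inverse recovers the assumed clean posterior up to floating-point error, which is the identity that transition-matrix methods rely on.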
Normally, when the number of examples is limited, the variance of the losses or the empirical risk could be high, which could make the algorithm unstable and lead to a large estimation error. Different variance reduction methods have been developed in many fields. For example, Truncated Importance Sampling (Ionides, 2008) limits the maximum importance weight, which solves the problem of infinite variance and decreases the mean squared estimation error of the standard importance sampling. Anschel et al. (2017) proposed to stabilize training procedure and improve performance by reducing approximation error variance in target rewards. Achab et al. (2015) proposed to use stochastic gradient descent (SGD) with variance reduction for optimizing a finite average of smooth convex functions, and a linear rate of convergence under strong convexity is proved. Similarly, Allen-Zhu & Hazan (2016) proved that a fast convergence rate can be achieved by using variance reduction on the non-convex optimization problem. Although the definitions of variance are different, those works motivate us to explore the role of variance of losses in learning with noisy labels because it is natural to consider that penalizing the variance of losses will have some benefits similar to previous works.

3. ENHANCING VARIANCE OF LOSSES FOR LEARNING WITH NOISY LABEL

In this section, we propose our method, i.e., losses Variance Regularization for label-Noise Learning (VRNL). We reveal how the proposed method reduces the negative effects of incorrect labels. We also illustrate the advantage of our regularizer when working with existing methods.

3.1. METHODOLOGY

We show that the proposed method can efficiently prevent the model from learning incorrect labels. Intuitively, encouraging the variance of losses can prevent the losses of instances with incorrect labels from decreasing, and promote the reduction of the losses of instances with correct labels. Theoretically, the gradients of large-loss examples are assigned small weights, and the gradients of small-loss examples are assigned large weights. As a result, the model puts more trust in small-loss examples, and small-loss examples contribute more to the update of parameters, which could reduce the harmfulness of instances with incorrect labels.

To analyze the regularization effect of our method, we first define some notation. Let $\mathrm{Var}[\cdot]$ denote the variance of a distribution. For a random variable $X$, $\mathrm{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}^2[X]$. Let $C$ denote the number of classes. Let $f_\theta : \mathcal{X} \to \Delta^{C-1}$ be a mapping parameterized by $\theta$ (e.g., a neural network), where $\Delta^{C-1}$ denotes the probability simplex. Generally, the expected risk w.r.t. noisy data is formulated as $\mathbb{E}_{(X, \tilde{Y})}[\ell(f_\theta(X), \tilde{Y})]$, where $\ell(\cdot)$ is the loss function employed. The following analysis applies to all methods whose objective can be formulated as $\mathbb{E}_{(X, \tilde{Y})}[\ell(f_\theta(X), \tilde{Y})]$, including statistically consistent methods and cross-entropy loss-based methods.

We propose to add a variance regularizer to the losses. Specifically, the objective function of our method is

$$R_G(f_\theta) = \mathbb{E}_{(X, \tilde{Y})}[\ell(f_\theta(X), \tilde{Y})] - \alpha \mathrm{Var}_{(X, \tilde{Y})}[\ell(f_\theta(X), \tilde{Y})], \tag{1}$$

where $\mathrm{Var}_{(X, \tilde{Y})}[\ell(f_\theta(X), \tilde{Y})]$ is a regularization term, and $\alpha$ is an adjustable hyper-parameter that controls the strength of the regularization effect. To encourage the variance of losses, $\alpha$ is chosen to be positive. Usually, the strength of the regularization effect should not be too large, and $\alpha$ is much smaller than 1. A suitable $\alpha$ can be obtained by utilizing validation sets. More details and discussions can be found in the Appendix.
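To make the objective concrete, the following minimal sketch evaluates a batch estimate of the mean loss minus α times the variance of the losses. The function name `vrnl_objective` and the loss values are hypothetical, and the expectation and variance are replaced by their sample counterparts over a small list of per-example losses.

```python
def vrnl_objective(losses, alpha):
    """Sample mean of the losses minus alpha times their sample variance.

    The variance uses the identity Var[l] = E[l^2] - E[l]^2 from the text.
    """
    n = len(losses)
    mean = sum(losses) / n
    second_moment = sum(l * l for l in losses) / n
    variance = second_moment - mean ** 2
    return mean - alpha * variance


# Three small losses and one large loss (the large one plays the role of a
# likely mislabeled example).
losses = [0.1, 0.2, 0.3, 2.0]

plain = vrnl_objective(losses, alpha=0.0)        # ordinary averaged loss
boosted = vrnl_objective(losses, alpha=0.05)     # positive alpha rewards variance
penalized = vrnl_objective(losses, alpha=-0.05)  # negative alpha penalizes variance

print(plain, boosted, penalized)
```

With a positive α the objective is lower when the losses are spread out, so minimizing it does not push the large loss down as strongly as the plain average does.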
To examine how the regularizer influences the update of the parameter $\theta$, we first derive the gradient of the objective function with respect to $\theta$, i.e.,

$$\frac{\partial R_G(f_\theta)}{\partial \theta} = \frac{\partial \mathbb{E}_{(X, \tilde{Y})}[\ell(f_\theta(X), \tilde{Y})]}{\partial \theta} - \alpha \frac{\partial \mathrm{Var}_{(X, \tilde{Y})}[\ell(f_\theta(X), \tilde{Y})]}{\partial \theta} = \mathbb{E}_{(X, \tilde{Y})}\left[W \frac{\partial \ell(f_\theta(X), \tilde{Y})}{\partial \theta}\right],$$

where

$$W = 1 + 2\alpha\left(\mathbb{E}_{(X, \tilde{Y})}[\ell(f_\theta(X), \tilde{Y})] - \ell(f_\theta(X), \tilde{Y})\right).$$

For a specific example $(x, \tilde{y})$, the corresponding gradient is $w \frac{\partial \ell(f_\theta(x), \tilde{y})}{\partial \theta}$, where the weight is $w = 1 + 2\alpha\left(\mathbb{E}_{(X, \tilde{Y})}[\ell(f_\theta(X), \tilde{Y})] - \ell(f_\theta(x), \tilde{y})\right)$. As mentioned above, $\alpha$ is chosen to be small so that $w$ stays positive. The above equation shows that 1) if the loss of the example $(x, \tilde{y})$ is smaller than the expectation of the losses, $\mathbb{E}_{(X, \tilde{Y})}[\ell(f_\theta(X), \tilde{Y})] - \ell(f_\theta(x), \tilde{y})$ is positive, and the weight associated with its gradient is larger than 1. The example then contributes more to the update of the parameter $\theta$. 2) If the loss of the example $(x, \tilde{y})$ is larger than the expectation of the losses, $\mathbb{E}_{(X, \tilde{Y})}[\ell(f_\theta(X), \tilde{Y})] - \ell(f_\theta(x), \tilde{y})$ is negative, and the weight associated with its gradient is smaller than 1. The example then contributes less to the update of the parameter $\theta$.

Due to the memorization effect, deep neural networks tend to learn easy examples first and gradually learn hard examples (Han et al., 2018b; Arpit et al., 2017). In learning with noisy labels, large-loss examples are more likely to have incorrect labels and should not be trusted (Bai et al., 2021). With the proposed method, the gradients of examples with incorrect labels are assigned small weights. In this way, incorrectly labeled examples contribute less to the update of the parameter $\theta$, which prevents the model from overfitting incorrect labels. Additionally, compared with existing small-loss based methods, our method can sufficiently exploit the information contained in the whole training dataset.
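The weighting interpretation can be checked numerically. The sketch below computes the per-example weights for a small batch and compares them with a finite-difference derivative of the batch objective with respect to each individual loss. The function names, the loss values, and α = 0.05 are assumptions for illustration, and sample averages stand in for the expectations.

```python
def vrnl_objective(losses, alpha):
    """Sample mean of the losses minus alpha times their sample variance."""
    n = len(losses)
    mean = sum(losses) / n
    variance = sum(l * l for l in losses) / n - mean ** 2
    return mean - alpha * variance


def vrnl_weights(losses, alpha):
    """Per-example weights w_i = 1 + 2 * alpha * (mean loss - loss_i)."""
    mean = sum(losses) / len(losses)
    return [1.0 + 2.0 * alpha * (mean - l) for l in losses]


def numeric_grad(losses, alpha, i, eps=1e-6):
    """Central finite difference of the objective w.r.t. the i-th loss."""
    up, down = list(losses), list(losses)
    up[i] += eps
    down[i] -= eps
    return (vrnl_objective(up, alpha) - vrnl_objective(down, alpha)) / (2 * eps)


losses = [0.1, 0.2, 0.3, 2.0]
alpha = 0.05
weights = vrnl_weights(losses, alpha)

# The derivative of the objective w.r.t. loss_i equals w_i / n, so scaling the
# numerical derivative by n should reproduce the analytic weights.
n = len(losses)
recovered = [n * numeric_grad(losses, alpha, i) for i in range(n)]

print(weights)
print(recovered)
```

In this batch the three small-loss examples receive weights slightly above 1 and the large-loss example receives a weight below 1, matching cases 1) and 2) above.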
Existing methods for learning with noisy labels (Han et al., 2018b; Nguyen et al., 2019) usually divide the training sample into confident examples and unconfident examples based on the small-loss trick (Jiang et al., 2018; Malach & Shalev-Shwartz, 2017; Li et al., 2020). To be specific, the examples with large losses are treated as unconfident examples, and their labels are ignored. However, some of the unconfident examples are hard clean examples that contain useful information for learning noise-robust classifiers (Bai & Liu, 2021). In contrast, our method does not ignore unconfident examples but assigns them small weights, so that all the label information in the training dataset is exploited.

3.2. PRACTICAL IMPLEMENTATION

In practice, the expected risk $R_G(f_\theta)$ in Eq. 1 cannot be calculated, so the empirical risk is employed as an approximation. Let $n$ be the number of training examples. The empirical risk of our method is

$$\hat{R}(f_\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(f_\theta(x_i), \tilde{y}_i) - \alpha\left[\frac{1}{n}\sum_{i=1}^{n} \ell(f_\theta(x_i), \tilde{y}_i)^2 - \left(\frac{1}{n}\sum_{i=1}^{n} \ell(f_\theta(x_i), \tilde{y}_i)\right)^2\right].$$

We further give the specific forms and settings of our regularizer when it works with existing methods, i.e., Importance Reweighting (Liu & Tao, 2015), Forward (Patrini et al., 2017), and VolMinNet (Li et al., 2021). Empirically, our method improves their classification accuracy.

Work with Forward. Forward correction (Patrini et al., 2017) exploits the noise transition matrix $T$ to estimate the clean class posterior distribution. We use the same method as the original paper (Patrini et al., 2017) to estimate the transition matrix. The objective function obtained by combining our method with Forward is

$$\hat{R}_{\text{Forward}}(\theta, \hat{T}) = \frac{1}{n}\sum_{i=1}^{n} \ell_{CE}(\hat{T} f_\theta(x_i), \tilde{y}_i) - \alpha\left[\frac{1}{n}\sum_{i=1}^{n} \ell_{CE}(\hat{T} f_\theta(x_i), \tilde{y}_i)^2 - \left(\frac{1}{n}\sum_{i=1}^{n} \ell_{CE}(\hat{T} f_\theta(x_i), \tilde{y}_i)\right)^2\right],$$

where $\ell_{CE}$ is the cross-entropy loss, $\hat{T}$ is the estimated transition matrix, $f_\theta$ models the clean class-posterior distribution, and $\hat{T} f_\theta$ models the noisy class-posterior distribution.

Work with Importance Reweighting. Importance Reweighting uses the weighted empirical risk to estimate the empirical risk with respect to the clean class-posterior distribution (Liu & Tao, 2015). To calculate the weights, both the noisy class-posterior distribution and the clean class-posterior distribution need to be estimated. The objective function obtained by combining our method with Importance Reweighting is

$$\hat{R}_{IR}(f_\theta) = \frac{1}{n}\sum_{i=1}^{n} \beta_i \ell_{CE}(f_\theta(x_i), \tilde{y}_i) - \alpha\left[\frac{1}{n}\sum_{i=1}^{n} \beta_i^2 \ell_{CE}(f_\theta(x_i), \tilde{y}_i)^2 - \left(\frac{1}{n}\sum_{i=1}^{n} \beta_i \ell_{CE}(f_\theta(x_i), \tilde{y}_i)\right)^2\right],$$

where $\beta_i = \frac{P_D(y_i|x_i)}{P_{D_\rho}(\tilde{y}_i|x_i)}$, $D$ is the clean distribution, and $D_\rho$ is the noisy distribution. The gradient of $\hat{R}_{IR}$, written as a sum over the examples $(x_i, \tilde{y}_i)$, is

$$\nabla \hat{R}_{IR}(f_\theta, (x, \tilde{y})) = \frac{1}{n}\sum_{i=1}^{n} \hat{w}_i \left[\ell_{CE}(f_\theta(x_i), \tilde{y}_i) \frac{\partial \beta_i}{\partial \theta} + \beta_i \frac{\partial \ell_{CE}(f_\theta(x_i), \tilde{y}_i)}{\partial \theta}\right],$$

where $\hat{w}_i = 1 + 2\alpha\left(\frac{1}{n}\sum_{j=1}^{n} \beta_j \ell_{CE}(f_\theta(x_j), \tilde{y}_j) - \beta_i \ell_{CE}(f_\theta(x_i), \tilde{y}_i)\right)$. The term $\ell_{CE}(f_\theta(x_i), \tilde{y}_i) \frac{\partial \beta_i}{\partial \theta} + \beta_i \frac{\partial \ell_{CE}(f_\theta(x_i), \tilde{y}_i)}{\partial \theta}$ is the gradient of the original Importance Reweighting loss. When the label $\tilde{y}_i$ is incorrect, the reweighted loss $\beta_i \ell_{CE}(f_\theta(x_i), \tilde{y}_i)$ is usually larger than the averaged loss $\frac{1}{n}\sum_{j=1}^{n} \beta_j \ell_{CE}(f_\theta(x_j), \tilde{y}_j)$. Their difference is then negative, which makes the weight $\hat{w}_i$ small because the hyper-parameter $\alpha$ is positive. As a result, an instance with an incorrect label contributes little to the update of the parameter $\theta$, and the proposed method can prevent the model from memorizing incorrect labels.

In the implementation, early stopping is used to approximate the clean class-posterior distribution. Specifically, the model $f_\theta$ is trained on noisy data for 20 epochs, its output is fed to a softmax function, and the softmax output $g(x)$ is used to approximate the clean class-posterior distribution. The noise transition matrix $\hat{T}$ is estimated with the same approach as in Forward correction. The model $f_\theta$ is then further optimized with both the weighted empirical risk and the regularization on the variance of losses:

$$\hat{R}_{IR}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell_{CE}(f_\theta(x_i), \tilde{y}_i) \frac{g_{\tilde{y}}(x_i)}{(\hat{T} g)_{\tilde{y}}(x_i)} - \alpha \hat{\sigma}_\theta^2,$$

where

$$\hat{\sigma}_\theta^2 = \frac{1}{n}\sum_{i=1}^{n} \left(\ell_{CE}(f_\theta(x_i), \tilde{y}_i) \frac{g_{\tilde{y}}(x_i)}{(\hat{T} g)_{\tilde{y}}(x_i)}\right)^2 - \left(\frac{1}{n}\sum_{i=1}^{n} \ell_{CE}(f_\theta(x_i), \tilde{y}_i) \frac{g_{\tilde{y}}(x_i)}{(\hat{T} g)_{\tilde{y}}(x_i)}\right)^2.$$

Work with VolMinNet. VolMinNet is an end-to-end label-noise learning method that learns the transition matrix and the clean class-posterior distribution simultaneously (Li et al., 2021). It optimizes two objectives: 1) a trainable, diagonally dominant, column-stochastic matrix $\hat{T}$, by minimizing the log-determinant $\log\det(\hat{T})$; 2) the parameter $\theta$ of the model, by the cross-entropy loss between the noisy label and the probability predicted by the neural network. In experiments, our VRNL only regularizes the parameter $\theta$ by calculating the variance of the cross-entropy losses. The objective obtained by combining our method with VolMinNet is

$$\hat{R}_{\text{vol}}(\theta, \hat{T}) = \frac{1}{n}\sum_{i=1}^{n} \ell_{CE}(\hat{T} f_\theta(x_i), \tilde{y}_i) + \lambda \log\det(\hat{T}) - \alpha\left[\frac{1}{n}\sum_{i=1}^{n} \ell_{CE}(\hat{T} f_\theta(x_i), \tilde{y}_i)^2 - \left(\frac{1}{n}\sum_{i=1}^{n} \ell_{CE}(\hat{T} f_\theta(x_i), \tilde{y}_i)\right)^2\right].$$

[Figure 1, panels (a)-(c). Axes: loss versus epochs. Legend: training loss of instances with correct labels; training loss of instances with incorrect labels.]
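The Forward and Importance Reweighting variants of the empirical objective can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the model outputs are hard-coded probability vectors, the early-stopped posterior g is taken equal to the current model output, and all function names (`forward_losses`, `reweight_losses`, `variance_boosted_risk`) are hypothetical.

```python
import math


def mat_vec(T, p):
    """Multiply a square matrix T (list of rows) by a probability vector p."""
    return [sum(T[i][j] * p[j] for j in range(len(p))) for i in range(len(T))]


def cross_entropy(probs, label):
    """Cross-entropy of a predicted distribution against an integer label."""
    return -math.log(max(probs[label], 1e-12))


def variance_boosted_risk(losses, alpha):
    """Mean of the per-example losses minus alpha times their variance."""
    n = len(losses)
    mean = sum(losses) / n
    variance = sum(l * l for l in losses) / n - mean ** 2
    return mean - alpha * variance


def forward_losses(T_hat, f_probs, noisy_labels):
    """Forward: cross-entropy between T_hat f(x) and the noisy label."""
    return [cross_entropy(mat_vec(T_hat, f), y)
            for f, y in zip(f_probs, noisy_labels)]


def reweight_losses(T_hat, f_probs, g_probs, noisy_labels):
    """Importance Reweighting: beta_i = g_y(x_i) / (T_hat g)_y(x_i)."""
    losses = []
    for f, g, y in zip(f_probs, g_probs, noisy_labels):
        beta = g[y] / mat_vec(T_hat, g)[y]
        losses.append(beta * cross_entropy(f, y))
    return losses


# Assumed toy inputs: two classes, column-stochastic estimated matrix.
T_hat = [[0.8, 0.2],
         [0.2, 0.8]]
f_probs = [[0.9, 0.1], [0.2, 0.8], [0.95, 0.05]]  # model outputs f(x_i)
g_probs = f_probs                                  # stand-in for the early-stopped g(x_i)
noisy_labels = [0, 1, 1]                           # the third label disagrees with the model

fwd = forward_losses(T_hat, f_probs, noisy_labels)
rw = reweight_losses(T_hat, f_probs, g_probs, noisy_labels)
forward_vrnl = variance_boosted_risk(fwd, alpha=0.1)
reweight_vrnl = variance_boosted_risk(rw, alpha=0.1)

print(fwd, forward_vrnl)
print(rw, reweight_vrnl)
```

In both variants the regularizer is computed on the same per-example quantity that enters the average, so the only method-specific part is how each per-example loss is formed. The VolMinNet variant would add the log-determinant term on the transition matrix, which is omitted here.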

4. EXPERIMENTS

In this section, we first report the empirical results of VRNL and other baselines on both synthetic and real-world noisy datasets. We then examine the different effects of the proposed method on correctly and incorrectly labeled examples to verify the analysis in Sec. 3. Finally, we illustrate the robustness of the proposed method when the estimated transition matrix is biased.

Datasets. We verify the performance of the proposed method on manually corrupted versions of three datasets, i.e., MNIST (LeCun et al., 2010), CIFAR-10 (Krizhevsky et al., 2009) and CIFAR-100 (Krizhevsky et al., 2009), and on one real-world noisy dataset, i.e., Clothing1M (Xiao et al., 2015). We leave out 10% of the training data as a validation set. The experiments are repeated five times on the synthetic noisy datasets. Clothing1M (Xiao et al., 2015) contains 1M images with real-world noisy labels; it also contains 50k, 14k, and 10k images with clean labels for training, validation, and testing, respectively. Existing methods like Forward (Patrini et al., 2017) and T-Revision (Xia et al., 2019) use the 50k clean data to initialize the transition matrix and validate on the 14k clean data. We assume that the clean data is not accessible; therefore, the clean data is not used for training or validation. We leave out 10% of the examples from the 1M noisy data for validation.

Baselines. The baselines used in our experiments are: 1) CE, the standard cross-entropy loss; 2) Decoupling (Malach & Shalev-Shwartz, 2017), which trains two models at the same time and uses only the instances on which the two networks disagree to update the parameters; 3) MentorNet (Jiang et al., 2018), which pre-trains an extra model that is used to select clean examples for training the main model; 4) Co-teaching (Han et al., 2018b), which trains two networks simultaneously, where each network selects small-loss examples as trusted examples for its peer network; 5) Forward (Patrini et al., 2017), which estimates the transition matrix in advance and then uses it to approximate the clean class posteriors; 6) T-Revision (Xia et al., 2019), which fine-tunes the estimated transition matrix to improve the classification performance; 7) Dual T (Yao et al., 2020b), which improves the estimation of the transition matrix by introducing an intermediate class and factorizing the transition matrix into the product of two easy-to-estimate transition matrices; 8) VolMinNet (Li et al., 2021), an end-to-end label-noise learning method that learns the transition matrix and the classifier simultaneously; 9) Reweight (Liu & Tao, 2015), which uses the importance reweighting technique to estimate the expected risk on the clean domain from noisy data. Note that the aim of this paper is not to design a state-of-the-art noisy-label learning algorithm; we want to explore whether the variance of losses should always be penalized when learning with noisy labels. Therefore, we do not compare with methods that rely on semi-supervised learning, such as DivideMix (Li et al., 2020) and PES (Bai et al., 2021).

Noise Types. To generate a noisy dataset, we manually corrupt the labels of the training and validation sets according to a specified transition matrix $T$. Specifically, we conduct experiments on synthetic noisy datasets with three widely used types of noise: 1) symmetry flipping (Sym-$\epsilon$) (Patrini et al., 2017); 2) asymmetry flipping (Asym-$\epsilon$); 3) pair flipping (Pair-$\epsilon$) (Han et al., 2018b).

Network structure and optimization. We implement the proposed methods and the baselines using PyTorch 1.9.1 and train the models on a TITAN Xp. The model structure and optimizer are the same as in the state-of-the-art method (Li et al., 2021). Specifically, we use a LeNet-5 network (LeCun et al., 1998) for MNIST, a ResNet-18 network for CIFAR-10, a ResNet-32 network (He et al., 2016) for CIFAR-100, and a ResNet-50 pretrained on ImageNet for Clothing1M. SGD is used to train the neural network with batch size 128, momentum 0.9, weight decay $10^{-4}$, and an initial learning rate of $10^{-2}$. The model is trained for 80 epochs, and the learning rate is divided by 10 after the 30th and 60th epochs. For Forward and Reweight, we set the hyper-parameter $\alpha = 0.1$ on symmetry-flipping and asymmetry-flipping noise, and $\alpha = 0.01$ on pair-flipping noise. For VolMinNet, we set $\alpha = 0.005$ on MNIST and CIFAR-100 with pair-flipping noise. For the other experiments on synthetic noisy datasets, $\alpha = 0.05$ is employed. On Clothing1M, for Forward and Reweight, SGD with batch size 64, momentum 0.9, and weight decay $10^{-4}$ is used to train the model, and $\alpha$ is set to 0.1; for VolMinNet, SGD with batch size 64, momentum 0.9, and weight decay $10^{-3}$ is used, and $\alpha$ is set to 0.005. For Forward and Reweight, the transition matrix $T$ has to be estimated in advance. For the end-to-end method VolMinNet, the transition matrix $T$ and the classifier are learned simultaneously. To estimate the transition matrix, we follow the experimental settings described in the original papers (Patrini et al., 2017; Li et al., 2021). The parameters of the model used to estimate the transition matrix are used to initialize the weights of the classifier.
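The synthetic corruption described under Noise Types can be sketched as follows. The exact matrices are not spelled out in the text, so this sketch assumes the definitions commonly used in the cited works: symmetry flipping spreads the noise rate evenly over the other classes, and pair flipping moves it to the next class (cyclically). Asymmetry flipping is dataset-specific and is omitted. The function names are hypothetical.

```python
import random


def symmetric_T(num_classes, eps):
    """Assumed Sym-eps matrix: 1 - eps on the diagonal, eps / (C - 1) elsewhere."""
    off = eps / (num_classes - 1)
    return [[1.0 - eps if i == j else off for j in range(num_classes)]
            for i in range(num_classes)]


def pair_T(num_classes, eps):
    """Assumed Pair-eps matrix: class j keeps its label w.p. 1 - eps, else flips to j + 1."""
    T = [[0.0] * num_classes for _ in range(num_classes)]
    for j in range(num_classes):
        T[j][j] = 1.0 - eps
        T[(j + 1) % num_classes][j] = eps
    return T


def corrupt(labels, T, rng):
    """Sample a noisy label for each clean label y from column y of T.

    Uses the convention T[i][j] = P(noisy = i | clean = j).
    """
    num_classes = len(T)
    noisy = []
    for y in labels:
        column = [T[i][y] for i in range(num_classes)]
        noisy.append(rng.choices(range(num_classes), weights=column, k=1)[0])
    return noisy


rng = random.Random(0)
clean = [i % 10 for i in range(20)]
noisy_sym = corrupt(clean, symmetric_T(10, 0.2), rng)
noisy_pair = corrupt(clean, pair_T(10, 0.45), rng)
print(noisy_sym)
print(noisy_pair)
```

Because the columns are used as sampling distributions, the same `corrupt` function works for any column-stochastic transition matrix, including a hand-written asymmetric one.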

4.1. CLASSIFICATION ACCURACY EVALUATION

We embed VRNL into the label-noise learning methods Forward, Reweight and VolMinNet; the resulting methods are named Forward-VRNL, Reweight-VRNL and VolMinNet-VRNL, respectively. In Tab. 1, we report classification accuracies on datasets containing symmetry-flipping, asymmetry-flipping and pair-flipping noise. The results show that VRNL improves the classification accuracies of all the label-noise learning methods on different datasets and different types of noise. The performance of VolMinNet is usually better than that of Forward and Reweight, and the estimation error of the transition matrix used in Forward and Reweight is larger than in VolMinNet (Li et al., 2021). However, with VRNL, the performance of Forward and Reweight is comparable to that of VolMinNet, which suggests that VRNL is robust to a biased transition matrix. The improvements of VRNL under pair-flipping noise are smaller than under symmetry-flipping and asymmetry-flipping noise. More experiments analyzing the reason can be found in the Appendix. We also report the performance of VRNL under various noise rates. The results are visualized in Fig. 4. VRNL improves the performance of existing methods under both low and extreme noise rates. Detailed numbers are reported in Tab. 4 and Tab. 5 in the Appendix. In Tab. 2, we report the results on the real-world dataset Clothing1M. VRNL improves the generalization ability of the backbone methods, and VolMinNet-VRNL outperforms all other baselines.

4.2. THE INFLUENCE ON CLEAN AND NOISY CLASS POSTERIORS

We analyze the influence of increasing the variance of losses on the clean class posteriors and the noisy class posteriors in Fig. 2. Similarly, the results also hold for VolMinNet. Specifically, the variance of the noisy class posteriors in Fig. 3d increases compared with Fig. 3c, which could help VolMinNet better estimate the transition matrix. This is because our method encourages the diversity of noisy class posteriors, which makes the sufficiently scattered assumption easier to satisfy when the sample size is limited. Meanwhile, the empirical risk defined on clean training examples decreases after using our method, as shown in Fig. 3a and Fig. 3b. This means that the model classifies the examples better.

4.3. PERFORMANCE WITH THE BIASED TRANSITION MATRIX

In practice, the noise transition matrix is generally not given and has to be estimated. However, the estimated transition matrix could contain a large estimation error. One reason is that the transition matrix can be hard to estimate accurately when the sample size is limited (Yao et al., 2020b). Another reason is that the assumptions (Patrini et al., 2017; Li et al., 2021) used to identify the transition matrix may not hold. This motivates us to investigate the performance of our regularizer when the transition matrix is biased. To simulate the estimation error, we manually inject noise into the transition matrix, i.e., $T^{\rho} = T + \gamma|\Delta|$, where $\Delta \in \mathbb{R}^{C \times C}$ is sampled from a standard multivariate normal distribution and $\gamma \in [0.01, 0.15]$. We then normalize the columns of $T^{\rho}$ to sum to 1 by $T^{N}_{ij} = T^{\rho}_{ij} / \sum_{k=1}^{C} T^{\rho}_{ik}$. The estimation error $\epsilon_T$ of a transition matrix is calculated with the entrywise matrix norm, i.e., $\epsilon_T = \|T - T^{N}\|_{1,1} / \|T\|_{1,1}$. The biased transition matrix $T^{N}$ is used in Reweight, Reweight-VRNL, Forward and Forward-VRNL, respectively. The experimental results in Fig. 5 illustrate that our method is more robust to a biased transition matrix. Specifically, for most experiments and different levels of bias $\epsilon_T$, the test accuracies of Reweight-VRNL and Forward-VRNL are higher than those of Reweight and Forward. Additionally, the test accuracy of Reweight-VRNL drops much more slowly than that of Reweight as the bias $\epsilon_T$ increases.
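The perturbation procedure can be sketched as below. Two choices here are assumptions: the entries of Δ are drawn as independent standard normal values, and each column is normalized to sum to 1, following the wording of the text and the column-stochastic convention used for T (the index in the normalization formula could also be read as a row sum). The function names are hypothetical.

```python
import random


def perturb_transition(T, gamma, rng):
    """Return T + gamma * |Delta| with every column renormalized to sum to 1."""
    C = len(T)
    T_rho = [[T[i][j] + gamma * abs(rng.gauss(0.0, 1.0)) for j in range(C)]
             for i in range(C)]
    col_sums = [sum(T_rho[i][j] for i in range(C)) for j in range(C)]
    return [[T_rho[i][j] / col_sums[j] for j in range(C)] for i in range(C)]


def estimation_error(T, T_biased):
    """Entrywise (1,1)-norm of the difference, relative to the norm of T."""
    C = len(T)
    num = sum(abs(T[i][j] - T_biased[i][j]) for i in range(C) for j in range(C))
    den = sum(abs(T[i][j]) for i in range(C) for j in range(C))
    return num / den


T = [[0.8, 0.1, 0.1],
     [0.1, 0.8, 0.1],
     [0.1, 0.1, 0.8]]

rng = random.Random(0)
T_biased = perturb_transition(T, gamma=0.15, rng=rng)
print(estimation_error(T, T_biased))
```

Because the added noise is non-negative, the perturbation mostly moves probability mass from the diagonal to the off-diagonal entries, so larger values of γ give a larger estimation error on average.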

5. CONCLUSION

In this paper, we study whether we should always penalize the variance of losses when learning with noisy labels. Interestingly, we found that increasing the variance of losses could be helpful, as it can boost the memorization effect and reduce the harmfulness of incorrect labels. Theoretically, we show that increasing the variance of losses reduces the weights of the gradients of instances with incorrect labels, so these instances contribute little to the update of model parameters. We also propose a simple and effective method, VRNL, which can be easily integrated into existing label-noise learning methods to improve their robustness. The experimental results on both synthetic and real-world noisy datasets demonstrate that VRNL can dramatically improve the performance of existing label-noise learning methods. Empirically, we have shown that the proposed method can help models better learn clean class posteriors. We have also illustrated that VRNL can improve the classification performance of existing methods when the transition matrix is poorly estimated, which makes our method practically useful.

A THE GRADIENTS OF VRNL

The full derivation of the gradients of VRNL. RG(f θ ) ∂θ = ∂E (X, Ỹ ) [ℓ(f θ (X), Ỹ )] ∂θ -α ∂V ar (X, Ỹ ) [ℓ(f θ (X), Ỹ )] ∂θ = ∂E (X, Ỹ ) [ℓ(f θ (X), Ỹ )] ∂θ -α ∂E (X, Ỹ ) [ℓ 2 (f θ (X), Ỹ )] ∂θ - ∂E 2 (X, Ỹ ) [ℓ(f θ (X), Ỹ )] ∂θ = E (x,ỹ)∼Dρ [ ∂ℓ(f θ (X), Ỹ ) ∂θ ] -α E (x,ỹ)∼Dρ 2ℓ(f θ (X), Ỹ ) ∂ℓ(f θ (X), Ỹ ) ∂θ -2E (x,ỹ)∼Dρ [ℓ(f θ (X), Ỹ )]E (x,ỹ)∼Dρ ∂ℓ(f θ (X), Ỹ ) ∂θ = E (x,ỹ)∼Dρ 1 + 2αE (x,ỹ)∼Dρ [ℓ(f θ (X), Ỹ )] -2αℓ(f θ (X), Ỹ ) ∂ℓ(f θ (X), Ỹ ) ∂θ = E (X, Ỹ ) W ∂ℓ(f θ (X), Ỹ ) ∂θ , where Figure 6 : We increase the number of classes gradually, and the differences between the loss of instances with incorrect labels and the loss of instances with correct labels in symmetry-flipping noise are larger than the differences in pair-flipping noise. The differences are increasing gradually with the increase in the number of classes. W = 1 + 2α E (X, Ỹ ) [ℓ(f θ (X), Ỹ )] -ℓ(f θ (X), Ỹ ) . ( To investigate why the improvement of VRNL on pair-flipping noise is smaller than on symmetryflipping noise, we conduct a series of experiments. We train a ResNet-18 (He et al., 2016) Forward loss (Patrini et al., 2017) under the symmetry-flipping noise or pair flipping noise. The noise rate is 0.2. We reclassify 100 classes of CIFAR100 (Krizhevsky et al., 2009) , e.g., when the number of classes is 2, the classes of 0-49 are reclassified as 0, and the classes of 50-100 are reclassified as 1. The experiment results are shown in Fig. 6 . By comparing with the data containing pair flipping noise, we found that the memorization effect is stronger in the data containing symmetry-flipping noise. Then the improvement of VRNL on pair-flipping noise is smaller than symmetry-flipping noise because our method relies on the memorization effect. Specifically, the results show that the difference between the loss of instances with incorrect labels and the loss of instances with correct labels becomes larger in both symmetryflipping noise and pair-flipping noise. 
However, compared with the average difference under pair-flipping noise, the difference between the loss of instances with incorrect labels and the loss of instances with correct labels is larger under symmetry-flipping noise. This implies that under symmetry-flipping noise the memorization effect is strong, because it is much easier to memorize the easy examples with correct labels than the hard examples with incorrect labels. We think the memorization effect is strong under symmetry-flipping noise because, in a noisy class, the noisy examples come from different classes, so memorizing all of them is hard. By contrast, under pair-flipping noise, the noisy examples in a noisy class come from a single class, so it is easy for the learning model to find their common features and memorize them.
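The reliance on the memorization effect can be illustrated with the per-example gradient weight W = 1 + 2α(E[ℓ] − ℓ) from Appendix A. The following is a minimal NumPy sketch, not the authors' implementation: the function names, the loss values, the batch composition (8 correctly labeled and 2 mislabeled examples), and α = 0.1 are all hypothetical choices made only for illustration. It assumes the objective is the batch mean of the losses minus α times their (population) variance.

```python
import numpy as np

def vrnl_objective(losses, alpha):
    """Mean loss minus alpha times the (population) variance of the losses."""
    losses = np.asarray(losses, dtype=float)
    return losses.mean() - alpha * losses.var()

def vrnl_weights(losses, alpha):
    """Per-example weights W = 1 + 2*alpha*(mean(loss) - loss)."""
    losses = np.asarray(losses, dtype=float)
    return 1.0 + 2.0 * alpha * (losses.mean() - losses)

# Hypothetical per-example losses (illustration only, not measured values):
# the first 8 entries stand for correctly labeled examples,
# the last 2 entries stand for mislabeled examples.
correct = np.full(8, 0.5)
small_gap = np.concatenate([correct, np.full(2, 1.0)])  # weak memorization effect
large_gap = np.concatenate([correct, np.full(2, 3.0)])  # strong memorization effect

alpha = 0.1  # illustrative value
w_small = vrnl_weights(small_gap, alpha)
w_large = vrnl_weights(large_gap, alpha)

print("weights of mislabeled examples, small gap:", w_small[-2:])
print("weights of mislabeled examples, large gap:", w_large[-2:])
```

With these illustrative numbers, the mislabeled examples receive weight 0.92 when the loss gap is small and 0.6 when it is large, while the weights average to 1 in both cases. A larger loss gap between incorrectly and correctly labeled examples therefore translates into a stronger down-weighting of the incorrectly labeled ones.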

C EXPERIMENTS ON WEBVISION

We also conduct experiments on the WebVision 1.0 dataset (Li et al., 2017). Following previous work (Chen et al., 2019), we train models on the Google image subset and test them on the validation set. We first resize each image so that its shorter side is 320, then randomly crop a 299×299 patch. Random horizontal flipping is used. The network is Inception-ResNet v2 (Szegedy et al., 2017). To estimate the transition matrix for Forward and Reweight, we train the network for 20 epochs. We use SGD with momentum 0.9, learning rate 0.01, and weight decay $10^{-3}$. We then follow previous work (Liu & Tao, 2015; Patrini et al., 2017) and use the anchor point assumption to estimate the transition matrix. We train the classifier for 80 epochs; the learning rate is divided by 10 after 30 and 60 epochs. We start increasing the variance of losses at the 30th epoch. α is set to 0.05 for Forward-VRNL and VolMinNet-VRNL, and to 0.005 for Reweight-VRNL. The experiment results are provided in Tab. 3. The experiment results show that the proposed method can still improve the performance of Forward, Reweight and VolMinNet when the noise rate is large.
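The classifier-training schedule described above can be sketched as two small functions. This is only an illustration under stated assumptions: the function names are hypothetical, epochs are assumed to be counted from 0, the initial learning rate of 0.01 is assumed to apply to the classifier-training stage, and "at the 30th epoch" is read as epoch index 30.

```python
def learning_rate(epoch, base_lr=0.01, milestones=(30, 60), gamma=0.1):
    """Step schedule: multiply the learning rate by gamma after each milestone.

    base_lr=0.01 is an assumption for the classifier-training stage.
    """
    n_decays = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** n_decays

def vrnl_alpha(epoch, alpha=0.05, start_epoch=30):
    """Weight of the variance term: zero before start_epoch, alpha afterwards.

    alpha=0.05 is the value reported for Forward-VRNL and VolMinNet-VRNL;
    start_epoch=30 assumes zero-based epoch counting.
    """
    return alpha if epoch >= start_epoch else 0.0

# Print the schedule at a few epochs of an 80-epoch run.
for epoch in (0, 29, 30, 59, 60, 79):
    print(epoch, learning_rate(epoch), vrnl_alpha(epoch))
```

For Reweight-VRNL, the same sketch would be called with `alpha=0.005`.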

E EXPERIMENTS UNDER LITTLE NOISE

We also conduct experiments under a small amount of noise on CIFAR-10; the noise rate is 10% and the noise type is symmetry-flipping. The experiment results are shown in Table 5. They show that the proposed method can still improve the performance of Forward, Reweight and VolMinNet when the noise rate is small.

F EXPERIMENTS ON CLEAN DATASETS

We also conduct experiments on clean datasets using the standard cross-entropy loss with VRNL; α is set to 0.01. The experiment results are shown in Tab. 6. VRNL has little negative influence when the dataset is clean.

G SENSITIVITY ANALYSIS

We conduct a sensitivity analysis on one synthetic dataset, CIFAR-10 under symmetry-flipping noise with a noise rate of 50%. We increase α from 0.001 to 0.3. The experiment results are shown in Fig. 8. Overall, the curve is smooth, so VRNL is not sensitive to the choice of α. We also show the corresponding validation accuracy. As shown in Fig. 9, the validation accuracy follows the same trend as the test accuracy, even though the validation set is noisy. Therefore, the user can use the noisy validation set to determine the best α.

I EXPERIMENTS OF PROGRESSIVE EARLY STOPPING

We also conduct an experiment on Progressive Early Stopping with VRNL (PES-VRNL). The experiment results are shown in Tab. 7. VRNL can still boost the performance of models.

J WARMING UP α

The initial weights of VolMinNet are random (the weights of Forward and Reweight are acquired through early stopping, which is also used to estimate the transition matrix), so the loss of instances with incorrect labels might not be larger than that of instances with correct labels. Therefore, the influence of VRNL should



, where λ > 0 is an adjustable hyper-parameter; we set λ = 0.0001 in all experiments. The transition matrix T should be differentiable, diagonally dominant and column stochastic. Our method could help the state-of-the-art method VolMinNet (Li et al., 2021) to better estimate the transition matrix and the clean class-posterior distribution. Specifically, VolMinNet requires the clean class posteriors to be diverse, which is called the sufficiently scattered assumption (Li et al., 2021). By increasing the variance of the loss, the diversity of the estimated noisy class posteriors is encouraged, so the diversity of the estimated clean class posteriors is also encouraged. The transition matrix can then be better learned, which leads to a better estimate of the clean class-posterior distribution.



Figure 1: We visualize the averaged training loss of instances with correct labels (blue dashed lines) and instances with incorrect labels (yellow solid lines) obtained by penalizing the variance of losses, employing original loss, and increasing the variance of losses in (a)-(c), respectively. The dataset is CIFAR-10 with symmetry-flipping noise, and the noise rate is 0.2. The neural network ResNet-18 and the baseline Forward (Patrini et al., 2017) are employed. The transition matrix T is given and does not need to be estimated.

Figure 2: The change of losses with the increasing of training epochs for Reweighting. (a) and (b) illustrate CE losses of P (Y |X) without or with increasing variance of losses, respectively. (c) and (d) illustrate CE losses of P ( Ỹ |X) without or with increasing variance of losses, respectively.

Figure 4: Test accuracies of the models trained on CIFAR-10 with symmetry-flipping noise and increasing noise rate.

and Fig. 3, we visualize the change of cross-entropy losses for instances with clean labels and instances with noisy labels during the model training, respectively. The average loss and the standard deviation of mislabeled examples and correctly labeled examples are visualized separately for better illustration. The methods used are Reweight and VolMinNet. Comparing Reweight-VRNL with Reweight, the loss of noisy labels for mislabeled examples is larger, but the loss of noisy labels for correctly labeled examples is almost unchanged, as shown in Fig. 2c and Fig. 2d. This means that the proposed method prevents the model from memorizing incorrect labels and has little influence on learning correctly labeled examples. Comparing Fig. 2b with Fig. 2a, the loss of clean labels for mislabeled examples becomes smaller when our method is employed. This implies that our method helps learn the clean class posteriors of mislabeled examples.

Figure 7: The differences between the loss of instances with incorrect labels and the loss of instances with correct labels in symmetry-flipping noise and pair-flipping noise.

Figure 8: Sensitivity analysis for α.

Means and standard deviations (percentage) of classification accuracy. Results marked with "*" are the highest accuracy.

VRNL 99.02 ± 0.08 99.10 ± 0.08 * 90.86 ± 0.27 88.77 ± 0.51 * 70.18 ± 0.50 * 63.38 ± 1.72 *

Classification accuracy (percentage) on Clothing1M. Only noisy data are exploited for training and validation.

The experiment results on WebVision.

The test accuracy under Sym-80% noise.

The test accuracy under Sym-10% noise.

If the loss of incorrectly-labeled examples is larger than the loss of hard but correctly-labeled examples (e.g., the number of hard but correctly-labeled examples is more than the number of incorrectly-labeled examples), VRNL should not have a large negative impact on hard but correctly-labeled examples because it can still separate hard correctly-labeled examples from incorrectly-labeled examples. If the loss of incorrectly-labeled examples equals the loss of hard but correctly-labeled examples (i.e., these examples are entangled), it is hard to separate hard but correctly-labeled examples from

Means and standard deviations (percentage) of classification accuracy. Results marked with "*" are the highest accuracy.

            ± 0.03          92.23 ± 0.09    71.30 ± 0.16
CE-VRNL     99.13 ± 0.08    92.02 ± 0.17    71.40 ± 0.32

            Sym-20%         Sym-50%
PES         92.57 ± 0.22    87.78 ± 0.34
PES-VRNL    92.66 ± 0.09    87.85 ± 0.17

not be large at first. We conduct the experiment on CIFAR-10 using VolMinNet, with α increased over time. Specifically, we increase α linearly from 0 to 0.1, updating it every mini-batch, so that α peaks at 0.1 after 5 epochs. The experiment results are shown in Tab. 8. As can be seen in the experiments, the warm-up strategy increases the test accuracy compared with keeping α constant.

The experiment results imply that when the model uses VRNL, the variance of training losses will increase.
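The linear warm-up of α described in this section can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name and the `steps_per_epoch` value are hypothetical, and the sketch assumes that α is updated once per mini-batch and held at its peak value after the warm-up period.

```python
def warmup_alpha(step, steps_per_epoch, alpha_max=0.1, warmup_epochs=5):
    """Linearly increase alpha from 0 to alpha_max over warmup_epochs.

    `step` is the zero-based index of the current mini-batch, counted from
    the start of training. After the warm-up, alpha stays at alpha_max.
    """
    warmup_steps = warmup_epochs * steps_per_epoch
    if step >= warmup_steps:
        return alpha_max
    return alpha_max * step / warmup_steps

# Hypothetical number of mini-batches per epoch (depends on the batch size).
steps_per_epoch = 100
for step in (0, 125, 250, 500, 1000):
    print(step, warmup_alpha(step, steps_per_epoch))
```

The value returned for the current mini-batch would then be used as the coefficient of the variance term in the training objective for that update.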

L THE LOSSES AND VARIANCE OF LOSSES DURING TRAINING ON CLEAN AND NOISY DOMAIN

We train a Forward model on a clean dataset and on a noisy dataset separately; the noise type is symmetry-flipping, and the noise rate is 0.5.

M THE GRADIENTS OF REWEIGHT-VRNL

The full derivation of the gradients of Reweight-VRNL.

where) is a constant, which we denote by Êir; therefore:

