DO WE ALWAYS NEED TO PENALIZE VARIANCE OF LOSSES FOR LEARNING WITH LABEL NOISE?

Anonymous

Abstract

Algorithms that minimize the average loss over training examples have been widely designed for dealing with noisy labels. Intuitively, when the training sample is finite, penalizing the variance of losses should improve the stability and generalization of such algorithms. Interestingly, we find that the variance of losses sometimes needs to be increased for the problem of learning with noisy labels. Specifically, increasing the variance of losses boosts the memorization effect and reduces the harmfulness of incorrect labels. Regularizers that increase the variance of losses are easy to design and can be plugged into many existing algorithms. Empirically, the proposed method, which increases the variance of losses, improves the generalization ability of baselines on both synthetic and real-world datasets.

1. INTRODUCTION

Learning with noisy labels dates back to the 1980s (Angluin & Laird, 1988). It has recently drawn a lot of attention (Liu & Tao, 2015; Nguyen et al., 2019; Li et al., 2020; 2021) because the large-scale datasets used to train modern deep learning models, e.g., ImageNet (Deng et al., 2009) and Clothing1M (Xiao et al., 2015), can easily contain label noise. The reason is that it is expensive and sometimes infeasible to accurately annotate large-scale datasets; meanwhile, many cheap but imperfect surrogates, such as crowdsourcing and web crawling, are widely used to build them. Training with such data can lead to poor generalization of modern deep learning models because they overfit noisy labels (Han et al., 2018b; Zhang et al., 2021). Generally, algorithms for learning with noisy labels can be divided into two categories: statistically inconsistent algorithms and statistically consistent algorithms. Methods in the first category are heuristic, such as selecting reliable examples to train the model (Han et al., 2018b; Malach & Shalev-Shwartz, 2017; Ren et al., 2018; Jiang et al., 2018), correcting labels (Ma et al., 2018; Kremer et al., 2018; Tanaka et al., 2018; Reed et al., 2014), and adding regularization (Han et al., 2018a; Guo et al., 2018; Veit et al., 2017; Liu et al., 2020). These methods empirically perform well. However, the classifiers they learn from noisy data are not guaranteed to be statistically consistent, and they often need extensive hyper-parameter tuning on clean data. To address this problem, many researchers have explored algorithms in the second category, which aim to learn statistically consistent classifiers (Liu & Tao, 2015; Patrini et al., 2017; Liu et al., 2020; Xia et al., 2020). Specifically, their objective functions are designed so that minimizing the expected risk on the noisy domain is equivalent to minimizing the expected risk on the clean domain.
In practice, it is infeasible to compute the expected risk. To approximate it, existing methods minimize the empirical risk, i.e., the average loss over the noisy training examples, which is an unbiased estimator of the expected risk (Xia et al., 2019; Li et al., 2021); their difference vanishes as the training sample size goes to infinity. When the number of examples is limited, however, the variance of the empirical risk can be high, which leads to a large estimation error. Nevertheless, we report that penalizing the variance of losses is not always helpful for the problem of learning with noisy labels. By contrast, in most cases we need to increase the variance of losses, which boosts the memorization effect (Bai & Liu, 2021) and reduces the harmfulness of incorrect labels. The reason is that deep neural networks tend to learn easy and majority patterns first due to the memorization effect (Bai & Liu, 2021; Zhang et al., 2021). Incorrectly labeled data is in the minority and exhibits a more complex relationship between instances and labels than correctly labeled data, so it is harder for neural networks to memorize. Therefore, the losses of instances with incorrect labels are likely to be larger than those of instances with correct labels (Han et al., 2018b). Penalizing the variance of losses forces the model to reduce the losses of instances with incorrect labels, because the correct labels are in the majority and their losses are smaller; this makes it hard to distinguish correctly from incorrectly labeled data and leads to performance degradation. In contrast, increasing the variance of losses efficiently prevents large losses from decreasing, so the model is less likely to overfit instances with incorrect labels.
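To make the idea concrete, the following is a minimal sketch (not the paper's exact regularizer) of an objective that subtracts a variance term from the average loss, so that the spread of per-example losses is encouraged to grow rather than shrink. The function names `per_example_ce` and `variance_boosted_risk` and the coefficient `lam` are illustrative assumptions, not names from the paper.

```python
import numpy as np

def per_example_ce(logits, labels):
    # Softmax cross-entropy computed per example (no averaging),
    # with the usual max-subtraction for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]

def variance_boosted_risk(logits, labels, lam=0.1):
    # Hypothetical objective: mean loss MINUS lam * variance of losses.
    # Subtracting the variance term rewards a larger spread of
    # per-example losses, so large (likely noisy) losses are not
    # forced down toward the mean during training.
    losses = per_example_ce(logits, labels)
    return losses.mean() - lam * losses.var()
```

With `lam = 0` this reduces to the standard empirical risk; a positive `lam` lowers the objective whenever the losses become more spread out.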
In Section 3, we further show that increasing the variance of losses can be seen as a weighting method that assigns small weights to the gradients of large losses and large weights to the gradients of small losses, which reduces the effect of instances with incorrect labels when updating the model's parameters. More discussion of the memorization effect can be found in the Appendix. Intuitively, as illustrated in Fig. 1, changing the variance of losses has little influence on the averaged training loss of instances with correct labels, but makes the averaged training loss of instances with incorrect labels very different. Specifically, penalizing the variance of losses makes the averaged training loss of instances with incorrect labels decrease quickly, which encourages the model to overfit them. On the contrary, increasing the variance of losses prevents the averaged training loss of instances with incorrect labels from decreasing, as shown in Fig. 1c. The memorization effect is thus boosted, and as a result the test accuracy is improved significantly. From the empirical risk minimization perspective, we are encouraged to reduce the variance of losses to increase algorithmic stability. However, to handle label noise, as explained, we may need to boost the variance of losses. This implies that label noise should be carefully considered when designing the loss-variance part of learning algorithms. We empirically find that the variance of losses should be boosted in most settings of learning with noisy labels studied in the literature.

The rest of this paper is organized as follows. In Section 2, we introduce related work. In Section 3, we propose our method and discuss its advantages. Experimental results on both synthetic and real-world datasets are provided in Section 4. Finally, we conclude the paper in Section 5.
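The weighting interpretation can be checked directly. For an objective of the form R = mean(l) - lam * var(l) over per-example losses l_1, ..., l_n, the gradient with respect to l_i is (1/n) * (1 - 2*lam*(l_i - mean(l))), so above-average losses receive smaller weights. The helper below is a hypothetical sketch of this computation, not code from the paper.

```python
import numpy as np

def loss_gradient_weights(losses, lam=0.1):
    # For R = mean(l) - lam * var(l) (population variance), the
    # gradient w.r.t. each per-example loss l_i is
    #   dR/dl_i = (1/n) * (1 - 2*lam*(l_i - mean(l))).
    # Examples with above-average loss therefore get *smaller*
    # weights, damping updates from likely-noisy examples.
    n = len(losses)
    mu = losses.mean()
    return (1.0 / n) * (1.0 - 2.0 * lam * (losses - mu))
```

Note that the weights always sum to 1, since the deviations (l_i - mean(l)) sum to zero; the variance term only redistributes weight from large losses to small ones.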

2. RELATED WORK

Some methods reduce the side effects of noisy labels using heuristics. For example, many methods exploit the memorization effect to select reliable examples (Han et al., 2020; Yao et al., 2020a; Yu et al., 2019; Jiang et al., 2018) or to correct labels (Ma et al., 2018; Kremer et al., 2018; Tanaka et al., 2018; Reed et al., 2014). These methods empirically perform well. However, most of them do not provide statistical guarantees for the classifiers learned on noisy data. Other methods treat incorrect labels as outliers and focus on designing bounded loss functions (Ghosh et al., 2017; Gong et al., 2018; Wang et al., 2019; Shu et al., 2020). For example, a symmetric cross-entropy loss has been proposed and proven to be asymptotically robust to label noise (Wang et al., 2019). These methods focus on the numerical properties of loss functions, and the designed loss functions can be proven noise-tolerant when the noise rate is not large. The label noise transition matrix T(x) ∈ [0, 1]^{C×C} (Patrini et al., 2017; Liu & Tao, 2015; Li et al., 2021), where C is the number of classes, has been widely employed to design statistically consistent algorithms.



Figure 1: We visualize the averaged training loss of instances with correct labels (blue dashed lines) and instances with incorrect labels (yellow solid lines) obtained by penalizing the variance of losses, employing the original loss, and increasing the variance of losses in (a)-(c), respectively. The dataset is CIFAR-10 with symmetry-flipping noise, and the noise rate is 0.2. The neural network ResNet-18 and the baseline Forward (Patrini et al., 2017) are employed. The transition matrix T is given and does not need to be estimated.

