IMPROVING GROUP ROBUSTNESS UNDER NOISY LABELS USING PREDICTIVE UNCERTAINTY

Abstract

Standard empirical risk minimization (ERM) can underperform on certain minority groups (e.g., waterbirds on land or landbirds on water) due to spurious correlations between the input and its label. Several studies have improved the worst-group accuracy by focusing on high-loss samples, under the hypothesis that such high-loss samples are spurious-cue-free (SCF). However, these approaches can be problematic because, in real-world scenarios, high-loss samples may also be samples with noisy labels. To resolve this issue, we utilize the predictive uncertainty of a model to improve the worst-group accuracy under noisy labels. As motivation, we theoretically show that high-uncertainty samples are the SCF samples in the binary classification problem. This result implies that predictive uncertainty is an adequate indicator for identifying SCF samples in a noisy-label setting. Motivated by this, we propose a novel ENtropy-based Debiasing (END) framework that prevents models from learning spurious cues while remaining robust to noisy labels. In the END framework, we first train an identification model and use its predictive uncertainty to obtain the SCF samples from the training set. Then, another model is trained on the dataset augmented with an oversampled SCF set. Experimental results show that our END framework outperforms other strong baselines on several real-world benchmarks that contain both noisy labels and spurious cues.

1. INTRODUCTION

Standard Empirical Risk Minimization (ERM) can show high error on specific groups of data even though it achieves low test error on in-distribution datasets. One of the reasons for such degradation is the presence of spurious cues. A spurious cue is a feature that is highly correlated with the labels in certain training groups (and thus easy to learn) but not correlated with them in other groups at test time (Nagarajan et al., 2020; Wiles et al., 2022). The spurious cue is especially problematic when the model cannot classify the minority samples even though it correctly classifies the majority of the training samples by exploiting the spurious cue. In practice, deep neural networks tend to fit easy-to-learn simple statistical correlations such as spurious cues (Geirhos et al., 2020). This problem arises in real-world scenarios due to various factors such as observation bias and environmental factors (Beery et al., 2018; Wiles et al., 2022). For instance, an object detection model can predict an identical object differently simply because of differences in the background (Ribeiro et al., 2016; Dixon et al., 2018; Xiao et al., 2020). In a nutshell, spurious cues present in certain groups of data lead to low accuracy on other groups. Importance weighting (IW) is one of the classical techniques for resolving this problem, and several IW-related deep learning methods (Sagawa et al., 2019; 2020; Liu et al., 2021; Nam et al., 2020) have recently shown remarkable empirical success. The main idea of these IW-related methods is to train a model on data oversampled with hard (high-loss) samples. The assumption behind such approaches is that the high-loss samples are free from spurious cues, because such shortcut features reside mostly in the low-loss samples (Geirhos et al., 2020).
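The high-loss oversampling idea shared by these IW-related methods can be sketched as follows. This is an illustrative NumPy sketch; the function names and the hyperparameters `top_frac` and `lambda_up` are our own, not taken from any cited method:

```python
import numpy as np

def cross_entropy(probs, labels):
    # Per-sample cross-entropy given predicted class probabilities.
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12)

def upweight_high_loss(probs, labels, top_frac=0.2, lambda_up=4):
    # Treat the `top_frac` highest-loss samples as "hard" and repeat
    # them so they appear `lambda_up` times in the resampled index set.
    losses = cross_entropy(probs, labels)
    k = max(1, int(top_frac * len(labels)))
    hard = np.argsort(-losses)[:k]
    return np.concatenate([np.arange(len(labels)),
                           np.repeat(hard, lambda_up - 1)])
```

The returned index array is then used to draw an oversampled training set; under the assumption above, the repeated high-loss samples are the spurious-cue-free ones.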
For instance, Just-Train-Twice (JTT) trains a model on an oversampled training set containing the error set produced by an identification model. On the other hand, noisy labels are another cause of performance degradation in real-world scenarios. Noisy labels commonly occur in massive-scale human-annotated data, and in biology and chemistry data with inevitable observation noise (Lloyd et al., 2004; Ladbury & Arold, 2012; Zhang et al., 2016). In practice, the proportion of incorrectly labeled samples in real-world human-annotated image datasets can be up to 40% (Wei et al., 2021). Moreover, the presence of noisy labels can cause high-loss-based IW approaches to fail, since a large loss indicates not only that a sample may belong to a minority group but also that its label may be noisy (Ghosh et al., 2017). Indeed, we observed that even a relatively small noise ratio (10%) can impair high-loss-based methods on benchmarks with spurious cues, such as Waterbirds and CelebA, because these approaches concentrate on the noisy samples rather than on the minority groups affected by spurious cues. This observation motivates the principal question of this paper: how can we better select only spurious-cue-free (SCF) samples while excluding the noisy samples? As an answer, we propose predictive-uncertainty-based sampling as an oversampling criterion, which outperforms error-set-based sampling. Predictive uncertainty has been used to discover minority or unseen samples (Liang et al., 2017; Van Amersfoort et al., 2020); we utilize such uncertainty to detect the SCF samples. In practice, we train the identification model with a noise-robust loss within a Bayesian neural network framework to obtain reliable uncertainty estimates for the minority-group samples.
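Symmetric label noise of the kind used in such experiments can be injected as in the following sketch (our own illustrative helper, not code from any cited work): each corrupted label is flipped uniformly to one of the other classes.

```python
import numpy as np

def add_symmetric_noise(labels, num_classes, noise_ratio=0.1, seed=0):
    # Flip a `noise_ratio` fraction of labels, chosen at random,
    # uniformly to one of the *other* classes (symmetric noise).
    rng = np.random.default_rng(seed)
    noisy = np.array(labels).copy()
    flip_idx = rng.choice(len(noisy), size=int(noise_ratio * len(noisy)),
                          replace=False)
    for i in flip_idx:
        others = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(others)
    return noisy
```

Under this corruption, a mislabeled sample incurs a large loss just like a minority-group sample does, which is why loss magnitude alone cannot separate the two.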
By doing so, the proposed identification model properly identifies the SCF samples while preventing the noisy labels from being focused on. After training the identification model, similar to JTT, the debiased model is trained on a dataset oversampled with the SCF set. Our novel framework, ENtropy-based Debiasing (END), shows impressive worst-group accuracy on several benchmarks with various degrees of symmetric label noise. Furthermore, as a theoretical motivation, we demonstrate that the predictive uncertainty (entropy) is a proper indicator for identifying the SCF set regardless of the existence of noisy labels in a simple binary classification setting. To summarize, our key contributions are threefold: 1. We propose a novel predictive-uncertainty-based oversampling method that effectively selects the SCF samples while minimizing the selection of noisy samples. 2. We rigorously prove that predictive uncertainty is an appropriate indicator for identifying an SCF set in the presence of noisy labels, which supports the proposed method. 3. We propose additional model considerations for real-world applications in both classification and regression tasks. The overall framework shows superior worst-group accuracy compared to recent strong baselines on various benchmarks.
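The uncertainty-based selection step can be sketched minimally as follows. This is illustrative only: END additionally relies on a noise-robust loss and a Bayesian neural network for the identification model, both omitted here, and the names `top_frac` and `lambda_up` are our own placeholder hyperparameters.

```python
import numpy as np

def predictive_entropy(probs):
    # Shannon entropy of each predictive distribution; higher values
    # mark samples the identification model is more uncertain about.
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

def oversample_scf(probs, top_frac=0.2, lambda_up=4):
    # Take the `top_frac` most-uncertain samples as the SCF set and
    # augment the dataset index with `lambda_up - 1` extra copies of it.
    n = len(probs)
    ent = predictive_entropy(probs)
    scf = np.argsort(-ent)[: max(1, int(top_frac * n))]
    return scf, np.concatenate([np.arange(n), np.repeat(scf, lambda_up - 1)])
```

In contrast to loss-based selection, a confidently wrong prediction on a mislabeled sample has low entropy, so it is not pulled into the oversampled set.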

2. RELATED WORKS

Noisy label robustness: small loss samples. In this paper, we focus on two types of noisy-label robustness studies: (1) sample-re-weighting approaches and (2) robust-loss-function approaches. First, sample-re-weighting methods assign sample weights during training to achieve robustness against noisy labels (Han et al., 2018; Ren et al., 2018; Wei et al., 2020; Yao et al., 2021). Alternatively, robust-loss-function approaches design loss functions that implicitly focus on the clean labels (Reed et al., 2015; Zhang & Sabuncu, 2018; Thulasidasan et al., 2019; Ma et al., 2020). The common premise of both families is that low-loss samples are likely to be clean. For instance, Co-teaching uses two models, each of which selects clean samples for the other by choosing samples with small losses (Han et al., 2018). Similarly, Zhang & Sabuncu (2018) design the generalized cross entropy loss to place less emphasis on large-loss samples than the vanilla cross entropy does.

Group robustness: large loss samples. A model with group robustness should yield low test error regardless of the group-specific information of samples (e.g., groups defined by background images). Group robustness can be improved if the model does not focus on the spurious cues (e.g., the background). The common assumption of prior works on group robustness is that the large-loss samples are spurious-cue-free.
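For concreteness, the generalized cross entropy of Zhang & Sabuncu (2018) can be written as L_q(f(x), y) = (1 − f_y(x)^q) / q, which interpolates between cross entropy (q → 0) and a bounded MAE-like loss (q = 1). A NumPy sketch, with an illustrative choice of q:

```python
import numpy as np

def gce_loss(probs, labels, q=0.7):
    # Generalized cross entropy: L_q = (1 - p_y ** q) / q.
    # The loss is bounded above by 1/q, so mislabeled (high-loss)
    # samples contribute less than under the unbounded cross entropy.
    p_y = probs[np.arange(len(labels)), labels]
    return (1.0 - p_y ** q) / q
```

For a sample with predicted probability 0.1 on its (possibly noisy) label, vanilla cross entropy gives about 2.30, while GCE with q = 0.7 gives about 1.14 and can never exceed 1/0.7 ≈ 1.43.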

Sagawa et al. (2019) and Zhang et al. (2020) propose Distributionally Robust Optimization (DRO) methods that directly minimize the worst-group loss using group information of the training datasets given a priori. On the other hand, group-information-free approaches (Namkoong & Duchi, 2017; Arjovsky et al., 2019; Oren et al., 2019) have been proposed due to the non-negligible cost of group information. These approaches aim at achieving the

