IMPROVING GROUP ROBUSTNESS UNDER NOISY LABELS USING PREDICTIVE UNCERTAINTY

Abstract

The standard empirical risk minimization (ERM) can underperform on certain minority groups (e.g., waterbirds on land or landbirds on water) due to spurious correlations between the input and its label. Several studies have improved the worst-group accuracy by focusing on high-loss samples, under the hypothesis that such high-loss samples are spurious-cue-free (SCF) samples. However, these approaches can be problematic because, in real-world scenarios, high-loss samples may also be samples with noisy labels. To resolve this issue, we utilize the predictive uncertainty of a model to improve the worst-group accuracy under noisy labels. As motivation, we theoretically show that high-uncertainty samples are the SCF samples in a binary classification problem. This theoretical result implies that predictive uncertainty is an adequate indicator for identifying SCF samples in a noisy-label setting. Motivated by this, we propose a novel ENtropy-based Debiasing (END) framework that prevents models from learning the spurious cues while remaining robust to noisy labels. In the END framework, we first train an identification model and use its predictive uncertainty to obtain the SCF samples from the training set. Then, another model is trained on the dataset augmented with an oversampled SCF set. Experimental results show that our END framework outperforms other strong baselines on several real-world benchmarks that consider both noisy labels and spurious cues.

1. INTRODUCTION

Standard Empirical Risk Minimization (ERM) can show a high error on specific groups of data even though it achieves a low test error on in-distribution datasets. One of the reasons for such degradation is the presence of spurious cues. A spurious cue is a feature that is highly correlated with the labels in certain training groups (and is thus easy to learn) but not correlated in other groups of the test set (Nagarajan et al., 2020; Wiles et al., 2022). This is especially problematic when the model cannot classify the minority samples even though it correctly classifies the majority of the training samples using the spurious cue. In practice, deep neural networks tend to fit easy-to-learn simple statistical correlations such as spurious cues (Geirhos et al., 2020). This problem arises in real-world scenarios due to various factors such as observation bias and environmental factors (Beery et al., 2018; Wiles et al., 2022). For instance, an object detection model can predict an identical object differently simply because of differences in the background (Ribeiro et al., 2016; Dixon et al., 2018; Xiao et al., 2020). In a nutshell, spurious cues present in certain groups of data cause low accuracy on those groups.

Importance weighting (IW) is one of the classical techniques to resolve this problem. Recently, several deep learning methods related to IW (Sagawa et al., 2019; 2020; Liu et al., 2021; Nam et al., 2020) have shown remarkable empirical success. The main idea of these IW-related methods is to train a model on data oversampled with hard (high-loss) samples. The assumption behind such approaches is that the high-loss samples are free from the spurious cues, because shortcut features generally reside in the low-loss samples (Geirhos et al., 2020). For instance, Just-Train-Twice (JTT) trains a model on a training set augmented with the error set generated by an identification model.

On the other hand, noisy labels are another source of performance degradation in real-world scenarios. Noisy labels commonly occur in massive-scale human annotation data, and in biology and chemistry data with inevitable observation noise (Lloyd et al., 2004; Ladbury & Arold, 2012; Zhang et al., 2016). In practice, the proportion of incorrectly labeled samples in real-world human-annotated image datasets can be up to 40% (Wei et al., 2021). Moreover, the presence of noisy labels can lead to the failure of high-loss-based IW approaches, since a large loss value indicates not only that the sample may belong to a minority group but also that its label may be noisy (Ghosh et al., 2017). In practice, we observed that even a relatively small noise ratio (10%) can impair high-loss-based methods on benchmarks with spurious cues, such as Waterbirds and CelebA. This is because high-loss-based approaches tend to focus on the noisy samples rather than on the minority groups with spurious cues.

This observation motivates the principal question of this paper: how can we better select only spurious-cue-free (SCF) samples while excluding the noisy samples? As an answer, we propose predictive-uncertainty-based sampling as an oversampling criterion, which outperforms error-set-based sampling. Predictive uncertainty has been used to discover minority or unseen samples (Liang et al., 2017; Van Amersfoort et al., 2020).
We utilize such uncertainty to detect the SCF samples. In practice, we train the identification model via a noise-robust loss and the Bayesian neural network framework to obtain reliable uncertainty for the minority-group samples. By doing so, the proposed identification model is capable of properly identifying the SCF samples while preventing the model from focusing on the noisy labels. After training the identification model, similar to JTT, the debiased model is trained on the dataset augmented with the oversampled SCF set. Our novel framework, ENtropy-based Debiasing (END), shows an impressive worst-group accuracy on several benchmarks with various degrees of symmetric label noise. Furthermore, as a theoretical motivation, we demonstrate that the predictive uncertainty (entropy) is a proper indicator for identifying the SCF set regardless of the existence of noisy labels in a simple binary classification setting. To summarize, our key contributions are threefold:

1. We propose a novel predictive-uncertainty-based oversampling method that effectively selects the SCF samples while minimizing the selection of noisy samples.
2. We rigorously prove that the predictive uncertainty is an appropriate indicator for identifying an SCF set in the presence of noisy labels, which supports the proposed method.
3. We propose additional model considerations for real-world applications in both classification and regression tasks. The overall framework shows superior worst-group accuracy compared to recent strong baselines on various benchmarks.

2. RELATED WORKS

Noisy label robustness: small-loss samples. In this paper, we focus on two types of noisy-label robustness studies: (1) sample re-weighting approaches and (2) robust loss function approaches. First, the sample re-weighting methods assign sample weights during model training to achieve robustness against noisy labels (Han et al., 2018; Ren et al., 2018; Wei et al., 2020; Yao et al., 2021). Alternatively, the robust loss function approaches design loss functions that implicitly focus on the clean labels (Reed et al., 2015; Zhang & Sabuncu, 2018; Thulasidasan et al., 2019; Ma et al., 2020). The common premise of the sample re-weighting and robust loss function methods is that low-loss samples are likely to be clean. For instance, Co-teaching uses two models, each selecting small-loss samples as clean samples for the other (Han et al., 2018). Similarly, Zhang & Sabuncu (2018) design the generalized cross-entropy loss to place less emphasis on large-loss samples than the vanilla cross entropy.

Group robustness: large-loss samples. A model with group robustness should yield a low test error regardless of the group-specific information of the samples (e.g., groups defined by background images). Group robustness can be improved if the model does not focus on the spurious cues (e.g., the background). The common assumption of prior works on group robustness is that large-loss samples are spurious-cue-free. Sagawa et al. (2019); Zhang et al. (2020) propose Distributionally Robust Optimization (DRO) methods that directly minimize the worst-group loss using group information of the training dataset given a priori. On the other hand, group-information-free approaches (Namkoong & Duchi, 2017; Nam et al., 2020; Liu et al., 2021) improve the worst-group accuracy without group annotations, typically by emphasizing high-loss samples.

3. PROBLEM SETUP

We consider a supervised learning task with inputs $x^{(i)} \in \mathcal{X}$, corrupted labels $\hat{y}^{(i)} \in \mathcal{Y}$, and true labels $z^{(i)} \in \mathcal{Y}$. We let the latent attribute of the corresponding $x^{(i)}$ be $a^{(i)} \in \mathcal{A}$ (e.g., different backgrounds in Figure 1). We assume each triplet $(x^{(i)}, \hat{y}^{(i)}, z^{(i)})$ belongs to a corresponding group $g^{(i)} \in \mathcal{G}$ (e.g., the group defined by a background attribute and the true class $(z^{(i)}, a^{(i)})$ in Figure 1). We also denote the training dataset as $\mathcal{D} = \{(x^{(i)}, \hat{y}^{(i)})\}_{i=1}^{N}$, where each pair $(x^{(i)}, \hat{y}^{(i)})$ is sampled from a data distribution $\mathcal{D}^*$ on $\mathcal{X} \times \mathcal{Y}$. Importantly, an attribute $a^{(i)}$ can be spuriously correlated with the label on a certain dataset $\mathcal{D}$. For instance, in Figure 1, the label (cow/camel) can be highly correlated with the background feature (green pasture/desert), implying a false causal relationship. In this case, the background feature is a spurious cue.

Ideally, we aim to minimize the worst-group risk of a model $f_\theta: \mathcal{X} \to \mathbb{R}^c$ ($c$ is the number of classes), parameterized by $\theta$, under the unknown data distribution $\mathcal{D}^*$ and true labels $z$:

$$\theta^* = \arg\min_\theta \Big[\max_{g \in \mathcal{G}} R_{\mathcal{D}^*_g}(\theta)\Big], \qquad R_{\mathcal{D}^*_g}(\theta) = \mathbb{E}_{(x,z)\sim\mathcal{D}^*_g}\big[L(f_\theta(x), z)\big] \qquad (1)$$

where $\mathcal{D}^*_g$ is the data distribution of group $g$ and $L: \mathbb{R}^c \times \mathcal{Y} \to \mathbb{R}$ is a loss function. To achieve the above goal of improving the worst-group accuracy, the prediction of the model $f_\theta$ should not depend on the spurious cues. We instantiate this with the cow/camel classification dataset (Beery et al., 2018), which includes the background features as the spurious cue. When this relationship is abused, a model can easily classify the majority groups while failing to classify minority groups such as a cow in the desert.
Thus, the model has poor accuracy on the minority group, leading to a high worst-group error. To improve the worst-group accuracy, it would be ideal to directly optimize the model via Eq (1). However, in the real world, we assume that only the training samples $\mathcal{D}$ are given, with no information about the groups $g$, attributes $a$, or true labels $z$ during training. Therefore, ERM with the corrupted labels is the alternative for obtaining the parameter $\theta$:

$$\theta^*_{\mathcal{D}} = \arg\min_\theta R_{\mathcal{D}}(\theta), \qquad R_{\mathcal{D}}(\theta) = \frac{1}{N}\sum_{i=1}^{N} L\big(f_\theta(x^{(i)}), \hat{y}^{(i)}\big) \qquad (2)$$

Our goal in this paper is to resolve the problem caused by this unavoidable alternative, ERM. The problem lies in the poor accuracy on specific groups due to the model's reliance on the spurious cues (Sagawa et al., 2019). To address the poor worst-group accuracy, the common assumption in the literature has been that model training should focus more on the particular samples that are not classifiable using the spurious cues (e.g., oversampling the cow-in-the-desert samples) (Sagawa et al., 2019; Xu et al., 2020; Liu et al., 2021). In this paper, we call such samples Spurious-Cue-Free (SCF) samples. Given that we lack the information to determine which samples are SCF, the remaining question is: how can we identify which samples belong to an SCF set? Previous studies (Liu et al., 2021) hypothesize that the samples with large loss values correspond to the SCF set (Sec 2). These approaches, however, have limitations because large loss values can be attributed to both the SCF samples and the noisy samples. Going a step further, our primary strategy for identifying an SCF set is to obtain samples with high predictive uncertainty. By doing so, our approach allows a more careful selection of SCF samples while excluding noisy ones as much as possible. As theoretical support, we rigorously show in the following section that the predictive uncertainty is a proper indicator for identifying an SCF set in the presence of noisy labels.
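To make the evaluation target of Eq (1) concrete, the following minimal sketch computes the worst-group accuracy when group labels are available at evaluation time; the function and variable names are ours and purely illustrative, not part of the original framework.

```python
import numpy as np

def worst_group_accuracy(preds, labels, groups):
    """Return the minimum per-group accuracy.

    preds, labels, groups: 1-D integer arrays of equal length;
    `groups` encodes the (class, attribute) group of each sample.
    """
    accs = []
    for g in np.unique(groups):
        mask = groups == g
        accs.append((preds[mask] == labels[mask]).mean())
    return min(accs)

# usage: worst_group_accuracy(model_preds, true_labels, group_ids)
```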

4. GROUP ROBUSTNESS UNDER NOISY LABELS VIA UNCERTAINTY

In this section, we primarily show that the predictive uncertainty is a proper indicator for identifying the SCF samples under a noisy-label environment. In addition, we verify that using loss values as an oversampling metric can fail to distinguish SCF samples from noisy samples. To rigorously prove this, we theoretically analyze a binary classification problem that includes both spurious cues and label noise.

Problem setup: data distribution and model hypothesis. Consider a $d$-dimensional input $x = (x_1, \ldots, x_d)$ whose features and labels are binary: $x_i, y, z \in \{-1, 1\}$. The data generation procedure for the triplets $(x, y, z)$ is defined as follows. First, $z$ is uniformly sampled over $\{-1, 1\}$. Then, if the true label is positive ($z = 1$), $x \sim B_p$, where $B_p$ is a distribution of independent Bernoulli random samples and $p = (p_1, \ldots, p_d) \in [0, 1]^d$. Here, $p_i$ represents the probability that the $i$-th feature has the value 1. On the contrary, if $z = -1$, $x$ is sampled from a different distribution, $x \sim B_{p'}$, where $p' = (p'_1, \ldots, p'_d) \in [0, 1]^d$. Furthermore, with probability $\eta$ there is label noise, $y = -z$; otherwise, $y = z$. Finally, we consider a linear model with parameters $\beta = (\beta_0, \ldots, \beta_d)$, which are optimized via risk minimization over the joint distribution of the features and noisy labels $(x, y)$. This problem setup is inspired by the problem definitions of Nagarajan et al. (2020) and Sagawa et al. (2020), which represent the spurious features.

Next, the concept of a spurious cue in the defined classification task is illustrated using the cow and camel images shown in Figure 1. Assume that the $j$-th feature represents a background attribute ($x_j = a$), which is either a green pasture ($x_j = 1$) or a desert ($x_j = -1$). Suppose 98% of cows have a green pasture background (thus $p_j = 0.98$), while only 5% of camels have the green pasture background ($p'_j = 0.05$). In this case, the majority of the data can likely be classified using only the $x_j$ feature (e.g., only utilizing $\beta_j$). However, abusing this spurious feature can hinder the model from accurately classifying minority groups such as cows in a desert. Note that, in practice, there can be many spurious features that correspond to a given latent attribute ($a$). To quantify this spuriousness, we define the Spurious-Cue Score (SCS) function.

Definition 1 (Spurious cue score function). We define the spurious cue score function $\Psi_{p,p'}: \mathbb{R}^d \to \mathbb{R}$ with any function $s(\cdot, \cdot)$ satisfying the following:

$$\Psi_{p,p'}(x) = \sum_{i=1}^{d} s(p_i, p'_i)\, x_i, \qquad s(p_i, p'_i) > 0 \ \text{if } p_i > p'_i, \qquad s(p_i, p'_i) \le 0 \ \text{if } p_i \le p'_i \qquad (3)$$

Intuitively, if the SCS function value is low, the model will struggle to correctly predict the label of a sample $x$ by relying solely on features that are highly correlated with the label. A simple example of $s(\cdot, \cdot)$ is $s(p_j, p'_j) = p_j - p'_j$. In the cow/camel classification problem, the cow-in-the-desert sample has a lower $\Psi_{p,p'}$ value due to the term $s(p_j, p'_j)x_j = (0.98 - 0.05)(-1) = -0.93$, whereas the majority of cows contribute $0.93$. For the negative-class samples (camels), $\Psi_{p',p}$ is used instead. Importantly, the SCS function is determined only by the true labels ($z$) and the input features ($x$), not by the noisy labels ($y$). With Definition 1, the following result formalizes our goal: finding the SCF samples (those with low SCS function values) by obtaining the samples with high predictive uncertainty.
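As an illustration of the generative process and of Definition 1, the sketch below samples one triplet $(x, y, z)$ and evaluates the SCS with the simple choice $s(p_i, p'_i) = p_i - p'_i$ used in the example above; the function names are ours and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_triplet(p, p_prime, eta):
    """Draw one (x, y, z) from the binary-feature model described above.

    p, p_prime: arrays of Bernoulli parameters for the positive/negative class.
    eta:        label-noise probability.
    """
    d = len(p)
    z = rng.choice([-1, 1])
    probs = p if z == 1 else p_prime
    x = np.where(rng.random(d) < probs, 1, -1)   # x_i = 1 with probability probs[i]
    y = -z if rng.random() < eta else z          # flip the label with probability eta
    return x, y, z

def scs(x, p, p_prime):
    """Spurious-cue score with s(p_i, p'_i) = p_i - p'_i (Definition 1).

    For negative-class samples, call scs(x, p_prime, p) instead.
    """
    return np.sum((p - p_prime) * x)
```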
Theorem 1. Given any $0 < \epsilon < 1/2$, for a sufficiently small $\delta$, a sufficiently large $d$, and any $p, p' \in [\epsilon, 1-\epsilon]^d$ with $|p_i - p'_i| \ge \epsilon$ $(1 \le i \le d)$, the following holds for any $0 \le \eta < 1/2$:

$$P_{x,y,z}\big[\, R(x, z) \le \epsilon \mid F(H_{\beta^*}(x)) \ge 1-\delta \,\big] \ge 1-\epsilon, \qquad R(x, z) = \mathbf{1}_{z=1}\, F_{x\sim B_p}(\Psi_{p,p'}(x)) + \mathbf{1}_{z=-1}\, F_{x\sim B_{p'}}(\Psi_{p',p}(x)) \qquad (4)$$

where $\beta^*$ is the risk-minimization solution of linear regression on the distribution of $(x, y)$ and $H_{\beta^*}$ is the predictive entropy of the model with parameters $\beta^*$. $F$, $F_{x\sim B_p}$, and $F_{x\sim B_{p'}}$ are the cumulative distribution functions with respect to the data distribution, $B_p$, and $B_{p'}$, respectively.

Theorem 1 states that the highly uncertain samples ($F(H_{\beta^*}(x)) \ge 1-\delta$) are likely to have low SCS function values among the samples belonging to the same class ($R(x, z) \le \epsilon$). Thus, this statement implies that utilizing the uncertainty (predictive entropy) is useful for identifying the SCF samples. Additionally, it can be observed that the selectivity of the predictive entropy ($H_{\beta^*}$) is independent of the presence of label noise (whether $yz = -1$). In particular, it is worth noting that the probability that a sample with high predictive uncertainty has a noisy label is not greater than $\eta$, the original label-noise ratio. On the contrary, the probability of label noise among the samples with high loss exceeds $\eta$ (Theorem 2 in Appendix A.6). We empirically demonstrate that the proposed framework based on the predictive uncertainty outperforms the existing loss-value-based ones on real-world benchmarks with noisy labels (Sec 6). The formal form of this theorem and its proof are in Appendix A.

Entropy-based debiasing for neural networks trained via ERM. We have theoretically shown that utilizing the predictive uncertainty allows us to obtain the SCF samples under ideal conditions. However, there are two obstacles to applying this in real-world scenarios. First, an overparameterized neural network trained via ERM (with a finite number of samples) can perfectly memorize the label noise (Zhang et al., 2017). This memorization can cause samples with noisy labels to have a high entropy during training: as the model reduces the loss of the noisy samples after it has fitted the noise-free samples, the predictive entropy of those noisy samples increases (Xia et al., 2020). Second, since neural network models generally tend to be overconfident (Guo et al., 2017; Hein et al., 2019), the proposed framework could be unreliable in the entropy-based acquisition of the SCF samples. To summarize, utilizing the predictive uncertainty with neural networks imposes two requirements on the network: (1) robustness to label noise, i.e., the model should not memorize samples with noisy labels; and (2) reliable uncertainty, i.e., the model should not be overconfident when identifying the SCF samples. As an illustrative example, Figure 2 shows how well two different models identify the SCF samples. Model-A (proposed) is successful because it satisfies these two requirements. In contrast, Model-B fails due to its overconfident predictions (low predictive uncertainty) for the true SCF samples. In the following section, we describe the proposed framework and the training process designed to satisfy these two requirements.

5. ENTROPY BASED DEBIASING

Overview. The proposed ENtropy-based Debiasing (END) framework focuses on training with the samples that have high predictive uncertainty. The END framework consists of two models with identical architectures. First, the identification model uses its predictive uncertainty to identify the SCF samples. Second, the debiased model is trained on a new training set constructed by oversampling the SCF set. As a result, END achieves both group robustness and noisy-label robustness. In addition, our framework can be extended to regression problems, whereas the other baselines cannot.
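The following high-level sketch summarizes the two-stage pipeline described above. The callables `train_identification_model`, `select_scf`, and `train_erm` are placeholder names for the steps detailed in Sec 5.1 and 5.2, not the authors' released code.

```python
def end_framework(train_set, train_identification_model, select_scf, train_erm, k, tau):
    """Two-stage END pipeline (sketch): identify SCF samples, then retrain on the augmented set."""
    # Stage 1: identification model with a noise-robust loss and reliable uncertainty (Sec 5.1).
    ident_model = train_identification_model(train_set)
    scf_set = select_scf(ident_model, train_set, tau)     # high-predictive-entropy samples (Sec 5.2)

    # Stage 2: debiased model trained with plain ERM on D union D_SCF^k.
    augmented_set = list(train_set) + list(scf_set) * k   # oversample the SCF set k times
    return train_erm(augmented_set)
```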

5.1. IDENTIFICATION MODEL

Robustness to label noise and reliable uncertainty are the two major requirements for using the predictive uncertainty with neural networks. To achieve these goals, we design the proposed identification model with a loss function that is robust to noisy labels and an overconfidence regularizer. In addition, we employ a Bayesian Neural Network (BNN) to obtain reliable uncertainty. We also explain how to modify the proposed identification model to fit a regression task.

The noise-robust loss function and the overconfidence regularizer. For the loss function, we use the mean absolute error (MAE) loss instead of the typical cross-entropy loss because the noisy-label robustness of MAE has been well demonstrated in the literature (Ghosh et al., 2017; Zhang & Sabuncu, 2018). The MAE loss $L_{MAE}$ is defined as follows:

$$L_{MAE}(f_\theta(x), \hat{y}) = \|\hat{y} - \sigma(f_\theta(x))\|_1 = 2 - 2\sigma_{i^*}(f_\theta(x)) \qquad (5)$$

where $i^*$ is the index at which $\hat{y}_{i^*} = 1$ in the one-hot encoded $\hat{y}$ and $\sigma_i(\cdot)$ is the $i$-th value of the softmax function. Although a model trained with the MAE loss alone is generally noise-robust, it may occasionally be overconfident when predicting the SCF samples, meaning that the model produces low uncertainty in its predictions for those samples. As a result, the framework would exclude them from the SCF set. This problem is visually demonstrated by Model-B in Figure 2, which is trained with the MAE loss alone. To resolve this, we employ a confidence regularization, whose role is to prevent overconfident predictions for the SCF samples (Liang et al., 2018; Müller et al., 2019; Utama et al., 2020). Specifically, this confidence regularization (Pereyra et al., 2017) penalizes overconfident (low-entropy) predictions. The regularizer $R_{ent}$ is defined as follows:

$$R_{ent}(f_\theta(x)) = \sum_{i=1}^{c} \sigma_i(f_\theta(x)) \log\big(\sigma_i(f_\theta(x))\big) \qquad (6)$$

In practice, we empirically show that combining MAE with the confidence regularization yields a better set of SCF samples (Model-A in Figure 2), which enhances the worst-group accuracy (Sec 6.1 and 6.2). In addition, the ablation study (Appendix C.1) shows that the contribution of this combination is significant on the classification benchmarks.
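A minimal PyTorch-style sketch of the identification loss, combining Eq (5) and Eq (6); the regularization weight `lam` is an assumed hyperparameter (the synthetic experiment in Appendix B uses 1e-3), and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def identification_loss(logits, targets, lam=1e-3):
    """MAE loss (Eq 5) plus the confidence regularizer (Eq 6), averaged over the batch."""
    probs = F.softmax(logits, dim=-1)
    onehot = F.one_hot(targets, num_classes=logits.size(-1)).float()
    mae = (onehot - probs).abs().sum(dim=-1)                      # equals 2 - 2 * probs[target]
    neg_entropy = (probs * probs.clamp_min(1e-12).log()).sum(dim=-1)  # R_ent = sum_i p_i log p_i
    return (mae + lam * neg_entropy).mean()
```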

Bayesian neural network

We choose a BNN as the network architecture to ensure that the identification model's uncertainty is reliable. The identification model is trained using a widely used stochastic-gradient Markov chain Monte Carlo sampling algorithm, Stochastic Gradient Langevin Dynamics (SGLD) (Welling & Teh, 2011). SGLD updates the parameters $\theta_t$ with the batch $\{x^{(i)}, \hat{y}^{(i)}\}_{i=1}^{n}$ at step $t$ via the following equation:

$$\theta_{t+1} \leftarrow \theta_t - \frac{\epsilon_t}{2}\Big(-\nabla_\theta \log p(\theta_t) - \frac{N}{n}\sum_{i=1}^{n}\nabla_\theta \log p(\hat{y}^{(i)} \mid \theta_t, x^{(i)})\Big) + \rho_t \qquad (7)$$

where $\epsilon_t$ is the step size and $\rho_t \sim \mathcal{N}(0, \epsilon_t)$. The negative log-likelihood term ($-\log p(\hat{y}^{(i)} \mid \theta_t, x^{(i)})$) can be interpreted as a loss function. The prior term ($\log p(\theta_t)$) is equivalent to L2 regularization if we use a Gaussian prior over $\theta$. During the parameter updates, we take a snapshot of the parameters every $K$ steps. The final predictive value of the identification model is the empirical mean of the predictions, $\bar{f}(x) = \frac{1}{M}\sum_{j=1}^{M} f_{\theta^{(j)}}(x)$, where $M$ is the number of parameter snapshots and $\theta^{(j)}$ are the parameters at the $j$-th snapshot.

Extension to regression. Another benefit of the END framework is that it can easily be extended to a regression task with minor modifications. Since we utilize the BNN, the entropy can be calculated over a Gaussian distribution with the predictive mean and the variance of the SGLD weight samples, $\mathcal{N}(\bar{f}(x), \mathrm{Var}_j[f_{\theta^{(j)}}(x)])$ (Kendall & Gal, 2017). Instead of the classification MAE loss (Eq 5), the regression version of the identification model uses the common regression MAE loss, $L(x, y) = |f(x) - y|$, analogous to the classification task. Another change in the regression version is that the confidence regularization is no longer used because it is not defined for regression. The regression version of END also improves the worst-group performance in the regression task, as shown in Sec 6.3.
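A minimal sketch of one SGLD update following Eq (7), assuming a Gaussian prior (so the prior gradient reduces to weight decay). This illustrates the update rule only and is not the authors' implementation; names and signatures are ours.

```python
import torch

def sgld_step(params, batch_nll_sum, step_size, weight_decay, n_batch, n_total):
    """One SGLD update (Eq 7).

    batch_nll_sum: summed negative log-likelihood over the current mini-batch.
    weight_decay:  Gaussian-prior strength (L2 regularization).
    """
    grads = torch.autograd.grad(batch_nll_sum, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            # gradient of the (scaled) log posterior: prior term + rescaled likelihood term
            grad_log_post = -(weight_decay * p) - (n_total / n_batch) * g
            noise = torch.randn_like(p) * step_size ** 0.5   # rho_t ~ N(0, step_size)
            p.add_(0.5 * step_size * grad_log_post + noise)

# Snapshots of the parameters taken every K steps are averaged at prediction time.
```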

5.2. DEBIASED MODEL

Once the identification model is trained, we build a new training dataset by oversampling the SCF samples to train the debiased model. First, we assume that the predictive entropy of the identification model follows a Gamma distribution, because it is a common assumption that the variance (uncertainty) of a Gaussian distribution follows a Gamma distribution (Bernardo & Smith, 2009). Then, we obtain the SCF samples based on a p-value cut-off in the fitted Gamma distribution. Formally, the SCF set $\mathcal{D}^k_{SCF}$ obtained by the identification model is as follows:

$$\mathcal{D}^k_{SCF} = \underbrace{\mathcal{D}_\tau \cup \cdots \cup \mathcal{D}_\tau}_{k \text{ times}}, \qquad \mathcal{D}_\tau = \big\{(x^{(i)}, \hat{y}^{(i)}) \mid \Phi\big(H(\sigma(\bar{f}(x^{(i)}))); \alpha^*, \beta^*\big) > 1-\tau\big\} \qquad (8)$$

where $H(\cdot)$ is the entropy, $\Phi$ is the CDF of the Gamma distribution, $\tau$ is the p-value threshold, and $k$ is a hyperparameter representing the degree of oversampling. The parameters of the Gamma distribution, $\alpha^*$ and $\beta^*$, are fitted via the moment estimation method (Hansen, 1982). Finally, after acquiring the SCF set via Eq 8, the debiased model is trained via ERM on the new dataset $\mathcal{D} \cup \mathcal{D}^k_{SCF}$. Note that training the debiased model follows the conventional ERM procedure: it uses neither the confidence regularization nor the MAE loss. The final prediction of the END framework is the predictive value of the trained debiased model.
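A sketch of the SCF-selection rule in Eq (8): fit a Gamma distribution to the per-sample predictive entropies by the method of moments and keep samples above the $1-\tau$ quantile. The shape/scale parameterization and the names below are our assumptions for illustration.

```python
import numpy as np
from scipy import stats

def select_scf_indices(entropies, tau):
    """Return indices whose entropy lies in the upper tail of a fitted Gamma distribution (Eq 8).

    entropies: per-sample predictive entropies of the identification model (1-D array).
    tau:       p-value threshold; smaller tau keeps only the most uncertain samples.
    """
    # Method-of-moments fit of the Gamma distribution (shape/scale parameterization assumed).
    mean, var = entropies.mean(), entropies.var()
    alpha_hat = mean ** 2 / var        # shape
    beta_hat = var / mean              # scale
    cdf_vals = stats.gamma.cdf(entropies, a=alpha_hat, scale=beta_hat)
    return np.where(cdf_vals > 1.0 - tau)[0]
```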

6. EXPERIMENTS

6.1. SYNTHETIC DATASET EXPERIMENT

We begin with a 2-D classification synthetic dataset experiment to qualitatively substantiate the group and noisy-label robustness of the END framework. The dataset has two characteristics: (1) spurious features and (2) noisy labels. We first describe the two features of the dataset. One is the spurious feature, which is easy to learn, but exploiting it cannot classify the test set (Figure 3). The other is the invariant feature. This feature is hard to learn because we manually scale down its value, and only a few training samples' labels rely solely on it. An ideal model can correctly classify both the training and test sets only when using the invariant feature; therefore, the ideal decision boundary is the vertical line in Figure 3. Second, we assign random labels to the training samples with a probability of 20% to evaluate the model's robustness to noisy labels. The details of the dataset and the neural network model are in Appendix B.

The experimental results show that END correctly classifies both the majority and the minority groups (Figure 3(c)). We posit that this outperformance is due to the well-identified SCF samples (Model-A in Figure 2). On the other hand, ERM is insufficient to classify the minority group (samples in the dotted circles), although it perfectly fits the majority-group samples (Figure 3(a)). This poor performance of ERM is consistent with empirical studies on real-world datasets such as Waterbirds and CelebA (Liu et al., 2021). Notably, JTT learns a wrong decision boundary in favor of the minority group while completely overlooking the majority group, because JTT focuses too much on the noisy labels.

6.2. REAL-WORLD CLASSIFICATION DATASETS

In this subsection, we evaluate the END framework on two benchmark image datasets, CelebA and Waterbirds, which contain spurious cues (Wah et al., 2011; Liu et al., 2015). To evaluate both the group and noisy-label robustness, we add simple symmetric label noise (uniformly flipping the labels) to the datasets, as shown in Table 1. The results in Table 1 substantiate that, unlike the other baselines, END achieves both group and noisy-label robustness. The primary reason is that END employs the predictive entropy, which is unaffected by the label noise. Specifically, we observe that the worst-group accuracy of the END framework consistently outperforms the other models in the noisy cases (Table 1; >= 10% noise). Moreover, END also shows competitive worst-group accuracy in the noise-free case. On the other hand, as the noise rate increases, the performance of the group-robust baselines degrades much more severely. We attribute this degradation to the fact that these baselines focus on the large-loss samples, which are likely to have noisy labels, as discussed in Sec 4. The noisy-label-robust loss, GCE, improves over ERM with noisy labels, but its group robustness is insufficient. In addition, the ablation study in Appendix C.1 shows that (1) utilizing entropy is the major contributor to the group and noisy-label robustness, as Theorem 1 states, and (2) the cooperation between the noise-robust loss and the overconfidence regularizer plays an important role in its performance. Additionally, we qualitatively show that our identification model can identify the SCF samples while being robust to the noisy labels.
Concretely, we visualize the 2-D projection of the latent features (before the last linear layer) of the identification model on Waterbirds with 30% label noise. This experiment has two implications. First, the SCF set identified by our framework corresponds to the minority group (the true SCF samples). Specifically, the first row of Figure 4 (END) shows that both the minority group (red and green dots in the third column) and the SCF set are mainly located around the middle of the images. Quantitatively, up to 30% of the SCF set consists of minority-group samples, which is higher than their actual proportion (5%). Second, the contamination of the SCF set by noisily labeled samples is significantly mitigated by our framework. The first row of Figure 4 (END) shows that (1) the noisy labels are distributed almost identically over the space, and (2) the noisy labels do not severely overlap with the SCF set. This result substantiates that the proposed identification model (Sec 5.1) effectively identifies the SCF samples while including fewer noisy samples. In contrast, the baseline identification model (the second row) shows an overlap between the noisy samples and the SCF set. This is in line with our claim: to utilize the uncertainty with neural networks, the model should not memorize the noisy labels and should have reliable uncertainty.

6.3. REAL-WORLD REGRESSION DATASET

In this subsection, we conduct experiments on regression datasets with non-synthetic label noise to demonstrate the following: (1) the END framework can be extended to a regression problem; (2) END achieves group robustness under non-artificial label noise. In particular, we evaluate models on two drug-target affinity (DTA) benchmarks, Davis (Davis et al., 2011) and KIBA (Tang et al., 2014); see Appendix B for the details. The inputs of these datasets are drug molecules and protein sequences. The target value is a physical-experiment-derived affinity between the drug and the protein. We use the DeepDTA architecture (Öztürk et al., 2018) as the base architecture; see Appendix B for the details. Similar to the classification benchmarks, the DTA benchmarks have two characteristics. First, the datasets contain spurious correlations: a DTA model typically relies on a single modality (e.g., predicting the affinity by leveraging only the drug molecule while not considering the interaction with the target protein, or vice versa), which is inconsistent with physicochemical laws (Özçelik et al., 2021; Yang et al., 2022). To obtain the worst-group information of each benchmark, we group data samples by their distinct drug molecules and target proteins, respectively. Second, the target values are naturally noisy due to the different environments of data acquisition (Davis et al., 2011; Tang et al., 2014; 2018); thus, they can be seen as having non-synthetic label noise. Since LfF and JTT cannot be extended to regression problems, we propose an alternative baseline, "hard." Akin to JTT and END, the hard algorithm picks the top-K largest-loss samples after the first phase of training; then another model is trained on the training dataset oversampled with those hard samples. Table 2 shows that END outperforms the others in terms of the worst-group MSE metric. We posit that this improvement comes from the well-identified SCF set obtained via the proposed uncertainty-based approach. On the other hand, hard shows no improvement over ERM due to the oversampled noisy labels.

7. DISCUSSION

In this study, we present a new approach that significantly improves group robustness under label noise. We theoretically show that the predictive uncertainty is a proper criterion for identifying the SCF samples. Upon this foundation, we propose the END framework, which consists of two procedures: (1) obtaining the SCF set via the predictive uncertainty of a noise-robust model with reliable uncertainty, and (2) training the debiased model on the training set oversampled with the selected SCF samples. In practice, we empirically demonstrate that END achieves both group and noisy-label robustness. As future work, we discuss several potential areas of improvement. First, the END framework adopts simple approaches (the MAE loss and SGLD) for the identification model; future work can employ more advanced approaches for the identification model that (1) obtain reliable uncertainty and (2) prevent memorization of the noisy labels. Second, we only consider the total predictive uncertainty of the model in this study. However, the predictive uncertainty can be decomposed into two different types: aleatoric (uncertainty arising from data noise) and epistemic (uncertainty arising from the model parameters) (Kendall & Gal, 2017; Oh & Shin, 2022). We believe that decomposing the uncertainty and disregarding the aleatoric component could further improve the END framework.

Definition 3 (data distribution $D^\eta_{p,p'}$). Given $p = (p_1, \ldots, p_n), p' = (p'_1, \ldots, p'_n) \in [0, 1]^n$ and $0 \le \eta \le 1$, a sample $(X_1, \ldots, X_n, Y, Z) \sim D^\eta_{p,p'}$ is generated as follows:

1. Sample $Z$ uniformly from $\{-1, 1\}$.
2. If $Z = 1$, sample $X = (X_1, \ldots, X_n) \sim B_p$. Otherwise, if $Z = -1$, sample $X = (X_1, \ldots, X_n) \sim B_{p'}$.
3. With probability $1-\eta$, let $Y = Z$. Otherwise, let $Y = -Z$.
4. Output $(X_1, \ldots, X_n, Y, Z)$.

Note that the mathematical objects used here can be interpreted as follows:

• $X_i$: the $i$-th feature of the sample.
• $Z$: the true label of the sample, either positive (1) or negative (-1).
• $p_i$: the probability that the $i$-th feature has value 1 for a positive sample.
• $p'_i$: the probability that the $i$-th feature has value 1 for a negative sample.
• $\eta$: the probability of "label noise".
• $Y$: the post-noise label of the sample.

Proposition 1. Given $p = (p_1, \ldots, p_n), p' = (p'_1, \ldots, p'_n) \in [0, 1]^n$ and $0 \le \eta \le 1$, let $\beta^* = (\beta_0, \beta_1, \ldots, \beta_n)$ be the risk-minimizing linear solution of $D^\eta_{p,p'}$. Then we have, for some $k = k(p, p') > 0$,

$$\beta_i = k(1-2\eta)\,\frac{p_i - p'_i}{1 - \tfrac{1}{2}(2p_i-1)^2 - \tfrac{1}{2}(2p'_i-1)^2} \quad (i = 1, \ldots, n) \qquad (10)$$

and

$$\beta_0 + \sum_{i=1}^{n}(p_i + p'_i - 1)\beta_i = 0. \qquad (11)$$

Proof. All the expectations and probabilities that follow are with respect to $(X, Y, Z) \sim D^\eta_{p,p'}$. Let $\beta^* = \arg\min_\beta L(\beta) = \arg\min_\beta \mathbb{E}\big(\beta_0 + \sum_{i=1}^{n}\beta_i X_i - Y\big)^2$. Using

• $\mathbb{E}[X_i] = \tfrac{1}{2}\mathbb{E}[X_i \mid Z=1] + \tfrac{1}{2}\mathbb{E}[X_i \mid Z=-1] = p_i + p'_i - 1$
• $\mathbb{E}[X_i \mid Y=1] = (1-\eta)\mathbb{E}[X_i \mid Y=1, Z=1] + \eta\,\mathbb{E}[X_i \mid Y=1, Z=-1] = (1-\eta)(p_i - (1-p_i)) + \eta(p'_i - (1-p'_i)) = 2p_i(1-\eta) + 2p'_i\eta - 1$
• $\mathbb{E}[X_i \mid Y=-1] = (1-\eta)\mathbb{E}[X_i \mid Y=-1, Z=-1] + \eta\,\mathbb{E}[X_i \mid Y=-1, Z=1] = (1-\eta)(p'_i - (1-p'_i)) + \eta(p_i - (1-p_i)) = 2p_i\eta + 2p'_i(1-\eta) - 1$
• For $i \ne j$, $\mathbb{E}[X_i X_j] = \tfrac{1}{2}(\mathbb{E}[X_i X_j \mid Z=1] + \mathbb{E}[X_i X_j \mid Z=-1]) = \tfrac{1}{2}(\mathbb{E}[X_i \mid Z=1]\mathbb{E}[X_j \mid Z=1] + \mathbb{E}[X_i \mid Z=-1]\mathbb{E}[X_j \mid Z=-1]) = \tfrac{1}{2}\big((2p_i-1)(2p_j-1) + (2p'_i-1)(2p'_j-1)\big)$
• $P[Y=1] = P[Y=-1] = 1/2$, $\mathbb{E}[Y] = 0$
• $\mathbb{E}[X_i^2] = 1$,

we get

$$0 = \tfrac{1}{2}\frac{\partial}{\partial \beta_0} L(\beta)\Big|_{\beta=\beta^*} = \mathbb{E}\Big(\beta_0 + \sum_{i=1}^{n}\beta_i X_i - Y\Big) = \beta_0 + \sum_{i=1}^{n}\beta_i \mathbb{E}[X_i] = \beta_0 + \sum_{i=1}^{n}(p_i + p'_i - 1)\beta_i, \qquad (12)$$

which proves equation 11, and

$$0 = \tfrac{1}{2}\frac{\partial}{\partial \beta_i} L(\beta)\Big|_{\beta=\beta^*} = \mathbb{E}\,X_i\Big(\beta_0 + \sum_{j=1}^{n}\beta_j X_j - Y\Big) = \beta_0(p_i + p'_i - 1) + \beta_i + \tfrac{1}{2}\sum_{j\ne i}\beta_j\big((2p_i-1)(2p_j-1) + (2p'_i-1)(2p'_j-1)\big) - (1-2\eta)(p_i - p'_i),$$

that is,

$$\beta_0(p_i + p'_i - 1) + \beta_i + \tfrac{1}{2}\sum_{j\ne i}\beta_j\big((2p_i-1)(2p_j-1) + (2p'_i-1)(2p'_j-1)\big) = (1-2\eta)(p_i - p'_i). \qquad (13)$$

Letting $w = \sum_{i=1}^{n}(2p_i-1)\beta_i$ and $w' = \sum_{i=1}^{n}(2p'_i-1)\beta_i$, we get

$$\beta_0 = -\frac{w + w'}{2} \qquad (14)$$

from equation 12 and

$$\beta_0(p_i + p'_i - 1) + (p_i - \tfrac{1}{2})w + (p'_i - \tfrac{1}{2})w' + \big(1 - \tfrac{1}{2}(2p_i-1)^2 - \tfrac{1}{2}(2p'_i-1)^2\big)\beta_i = (1-2\eta)(p_i - p'_i) \qquad (15)$$

from equation 13. Plugging equation 14 into equation 15, we get

$$\frac{w - w'}{2}p_i - \frac{w - w'}{2}p'_i + \big(1 - \tfrac{1}{2}(2p_i-1)^2 - \tfrac{1}{2}(2p'_i-1)^2\big)\beta_i = (1-2\eta)(p_i - p'_i),$$

hence

$$\beta_i = k'\,\frac{p_i - p'_i}{1 - \tfrac{1}{2}(2p_i-1)^2 - \tfrac{1}{2}(2p'_i-1)^2}, \qquad (16)$$

where

$$k' = 1 - 2\eta - \frac{w - w'}{2}. \qquad (17)$$

Plugging equation 16 back into equation 17 using the definitions of $w$ and $w'$, we get

$$k' = \frac{1 - 2\eta}{1 + \sum_{i=1}^{n}\frac{(p_i - p'_i)^2}{1 - \frac{1}{2}(2p_i-1)^2 - \frac{1}{2}(2p'_i-1)^2}},$$

so we conclude

$$\beta_i = k(1-2\eta)\,\frac{p_i - p'_i}{1 - \tfrac{1}{2}(2p_i-1)^2 - \tfrac{1}{2}(2p'_i-1)^2}, \qquad \text{where } k = k(p, p') = \frac{1}{1 + \sum_{i=1}^{n}\frac{(p_i - p'_i)^2}{1 - \frac{1}{2}(2p_i-1)^2 - \frac{1}{2}(2p'_i-1)^2}} > 0.$$

A.3 BASIC LEMMAS

Definition 4 (cumulative distribution function $F$). Let $D$ be a distribution, and let $f$ be a real function on the domain of $D$. Then for $y \in \mathbb{R}$, we define $F_{f(D)}(y) = P_{X\sim D}[f(X) \le y]$. We treat "$F(f(X))$" as a shorthand for $F_{f(D)}(f(X))$ when the definition of $D$ is clear from the context.

Definition 5 (anti-concentration of a discrete random variable). Let $W$ be a discrete random variable. We write $AC(W) = \max_w P[W = w]$.

We make use of the following estimates of $P[F(W) \le a]$ and $P[F(W) \ge 1-a]$ for discrete random variables $W$ throughout our proof:

Lemma 1. For any discrete random variable $W$ and any $a \in [0, 1]$, we have

$$a - AC(W) < P[F(W) \le a] \le a \qquad (18)$$

and

$$a - AC(W) < P[F(W) \ge 1-a] \le a. \qquad (19)$$

Upper bounds on $AC(W)$ are called anti-concentration inequalities in the literature (Krishnapur, 2016). One such inequality is the Littlewood-Offord inequality, and a version of it that encompasses our case was proved in Juškevičius & Kurauskas (2019):

Lemma 2. Let $W = \sum_{i=1}^{n}\beta_i X_i$, where $\beta_i \ne 0$ and $X = (X_1, \ldots, X_n) \sim B_p$ for $p \in [\epsilon, 1-\epsilon]^n$. Then we have $AC(W) \le \frac{C}{\sqrt{n}}$ for some $C = C(\epsilon) > 0$.

Proof. This is a direct consequence of Corollary 2 in Juškevičius & Kurauskas (2019).

An application of Lemma 1 is the following.

Lemma 3. Let $W$ be a discrete random variable. Let $\alpha = P[W < 0]$ and $\gamma = AC(W)$. For any $\delta > 0$, we have

$$P[F(|W|) \le \delta \text{ and } F(W) > \delta] \le 2\alpha + \gamma \quad \text{and} \quad P[F(W) \le \delta \text{ and } F(|W|) > \delta] \le \alpha.$$

In particular, for any event $E$ and any $\delta > 2\gamma$, we have

$$P[E \mid F(|W|) \le \delta] \ge P[E \mid F(W) \le \delta]\cdot\frac{1}{1 + (2\alpha+\gamma)/(\delta-\gamma)} - \frac{\alpha}{\delta-\gamma}.$$

A.4 THE ENTROPY THEOREM

Definition 6 (entropy function). For $\beta = (\beta_0, \ldots, \beta_n) \in \mathbb{R}^{n+1}$, we define the entropy function $H_\beta: \mathbb{R}^n \to \mathbb{R}$ as follows:

$$H_\beta(x_1, \ldots, x_n) = H\Bigg(\frac{\exp(\beta_0 + \sum_{i=1}^{n}\beta_i x_i)}{1 + \exp(\beta_0 + \sum_{i=1}^{n}\beta_i x_i)}\Bigg), \qquad \text{where } H(p) = -p\log p - (1-p)\log(1-p).$$

Note that $H_\beta(x_1, \ldots, x_n)$ can be expressed as $f(|\beta_0 + \sum_{i=1}^{n}\beta_i x_i|)$ for some monotonically decreasing function $f$.

Definition 7 (sign-respecting function). Let $A^+ = \{(p, p') \in [0,1]\times[0,1] : p > p'\}$ and $A^- = \{(p, p') \in [0,1]\times[0,1] : p < p'\}$. A function $\Lambda: ([0,1]\times[0,1])\setminus\{(x, x) \mid x \in [0,1]\} = A^+ \cup A^- \to \mathbb{R}$ will be called sign-respecting if it is continuous on each of $A^+$ and $A^-$, positive on $A^+$ and negative on $A^-$. For example, $f(p, p') = \frac{p - p'}{|p - p'|}$ is sign-respecting.

Definition 8 (spurious cue score function). Given a sign-respecting function $\Lambda$, under the distribution defined in Definition 3, we define the spurious cue score function $\Psi^\Lambda_{p,p'}: \mathbb{R}^n \to \mathbb{R}$ as follows:

$$\Psi^\Lambda_{p,p'}(X) = \sum_{i=1}^{n}\Lambda(p_i, p'_i)X_i.$$

Roughly speaking, $\Psi^\Lambda_{p,p'}$ measures how easy a given positive sample $X$ is in terms of the number of label-compatible features. Here, a feature $X_i$ of a positive sample is label-compatible if either $X_i = 1$ and $p_i > p'_i$ (i.e., positive samples have a higher probability of having $X_i = 1$) or $X_i = -1$ and $p'_i > p_i$ (i.e., negative samples have a higher probability of having $X_i = 1$). Here, the "number" is weighted in terms of $\Lambda(p_i, p'_i)$. If we want to measure a similar quantity for a negative sample, we can use $\Psi^\Lambda_{p',p}$ instead.

Theorem 1 (formal). Let $\Lambda$ be a sign-respecting function.
Given $0 < \epsilon < 1/2$, there exists $\delta_0 = \delta_0(\Lambda, \epsilon) > 0$ such that for any $0 < \delta \le \delta_0$, for sufficiently large $n$ ($n \ge N$ for some $N = N(\Lambda, \epsilon, \delta)$), the following holds. For any combination of

• $0 \le \eta < 1/2$,
• $p = (p_1, \ldots, p_n), p' = (p'_1, \ldots, p'_n) \in [\epsilon, 1-\epsilon]^n$ with $|p_i - p'_i| \ge \epsilon$ $(1 \le i \le n)$,

when $\beta^* = (\beta_0, \beta_1, \ldots, \beta_n)$ is the risk-minimizing linear solution of $(X, Y)$ for $(X, Y, Z) \sim D^\eta_{p,p'}$, we have

$$P_{X,Y,Z\sim D^\eta_{p,p'}}\big[\, R^\Lambda_{p,p'}(X, Z) \le \epsilon \mid F(H_{\beta^*}(X)) \ge 1-\delta \,\big] \ge 1-\epsilon, \qquad (20)$$

where $R^\Lambda_{p,p'}$ is the spurious cue score rank function defined as

$$R^\Lambda_{p,p'}(X, Z) = \mathbf{1}_{Z=1}\, F_{\Psi^\Lambda_{p,p'}(B_p)}(\Psi^\Lambda_{p,p'}(X)) + \mathbf{1}_{Z=-1}\, F_{\Psi^\Lambda_{p',p}(B_{p'})}(\Psi^\Lambda_{p',p}(X)).$$

That is, $R^\Lambda_{p,p'}(X, Z)$ is the rank of the spurious cue score of $X$ within its ground-truth label $Z$. Intuitively, this means that under our dataset generation process (as described by Definition 3) with some minor conditions (those on $p$ and $p'$), a relatively high (top-$\delta$) entropy w.r.t. $\beta^*$ implies a relatively small (bottom-$\epsilon$ within the ground-truth label $Z$) spurious cue score with high probability (probability at least $1-\epsilon$).

A.5 PROOF OF THEOREM 1

A.5.1 A REDUCTION

We reduce Theorem 1 to the following variant.

Lemma 5. Let $\Lambda$ be a sign-respecting function. Given $0 < \epsilon < 1/2$, there exists $\delta_0 = \delta_0(\Lambda, \epsilon) > 0$ such that for any $0 < \delta_{\min} \le \delta_0$, for sufficiently large $n$ ($n \ge N$ for some $N = N(\Lambda, \epsilon, \delta_{\min})$), the following holds. For any combination of

• $0 \le \eta < 1/2$,
• $p = (p_1, \ldots, p_n), p' = (p'_1, \ldots, p'_n) \in [\epsilon, 1-\epsilon]^n$ with $|p_i - p'_i| \ge \epsilon$ $(1 \le i \le n)$,
• $\delta \in [\delta_{\min}, \delta_0]$,

when $\beta^* = (\beta_0, \beta_1, \ldots, \beta_n)$ is the risk-minimizing linear solution of $(X, Y)$ for $(X, Y, Z) \sim D^\eta_{p,p'}$,

$$P_{X\sim B_p}\big[\, F(\Psi^\Lambda_{p,p'}(X)) \le \epsilon \mid F(H_{\beta^*}(X)) \ge 1-\delta \,\big] \ge 1-\epsilon \qquad (21)$$

and

$$P_{X\sim B_{p'}}\big[\, F(\Psi^\Lambda_{p',p}(X)) \le \epsilon \mid F(H_{\beta^*}(X)) \ge 1-\delta \,\big] \ge 1-\epsilon, \qquad (22)$$

where $H_{\beta^*}$ is the entropy function (Definition 6), $\Psi^\Lambda_{p,p'}$ and $\Psi^\Lambda_{p',p}$ are the spurious cue score functions (Definition 8), and $F$ are the cumulative distribution functions (Definition 4) related to them with respect to $B_p$ or $B_{p'}$.

The key differences from Theorem 1 are (1) the introduction of $\delta_{\min}$ and (2) that equation 20 has been separated into equation 21 and equation 22. Before presenting the proof of Lemma 5, we show how to derive Theorem 1 from Lemma 5.

Proof of Theorem 1 assuming Lemma 5. Let $\Lambda$ be a sign-respecting function, and let $0 < \epsilon < 1/2$. Let $\delta^*_0 = \delta^{Lemma 5}_0(\Lambda, \tfrac{\epsilon}{2})$ and take $\delta_0 = \frac{\delta^*_0}{3}$. Given $\delta$ with $0 < \delta < \delta_0$, let $\delta_{\min} = \frac{\delta\epsilon}{2}$ and take

$$N = \max\Big(N^{Lemma 5}(\Lambda, \tfrac{\epsilon}{2}, \delta_{\min}),\ \big(\tfrac{6C}{\delta^*_0}\big)^2,\ \big(\tfrac{4C}{\delta}\big)^2\Big),$$

where $C = C(\epsilon)$ is from Lemma 2. To show that $N$ satisfies the theorem statement, let $n \ge N$. Since $\delta_{\min} \le \delta^*_0$, by Lemma 5 we have

$$\forall\, \delta_{\min} \le \delta' \le \delta^*_0,\quad P_{X\sim B_p}\big[\, F(\Psi^\Lambda_{p,p'}(X)) \le \tfrac{\epsilon}{2} \mid F(H_{\beta^*}(X)) \ge 1-\delta' \,\big] \ge 1-\tfrac{\epsilon}{2} \qquad (23)$$

and

$$\forall\, \delta_{\min} \le \delta' \le \delta^*_0,\quad P_{X\sim B_{p'}}\big[\, F(\Psi^\Lambda_{p',p}(X)) \le \tfrac{\epsilon}{2} \mid F(H_{\beta^*}(X)) \ge 1-\delta' \,\big] \ge 1-\tfrac{\epsilon}{2}. \qquad (24)$$

Our goal is to show that

$$\text{Goal:}\quad P_{X,Y,Z\sim D^\eta_{p,p'}}\big[\, R^\Lambda_{p,p'}(X, Z) \le \epsilon \mid F(H_{\beta^*}(X)) \ge 1-\delta \,\big] \ge 1-\epsilon, \qquad (25)$$

where $R^\Lambda_{p,p'}(X, Z) = \mathbf{1}_{Z=1} F_{\Psi^\Lambda_{p,p'}(B_p)}(\Psi^\Lambda_{p,p'}(X)) + \mathbf{1}_{Z=-1} F_{\Psi^\Lambda_{p',p}(B_{p'})}(\Psi^\Lambda_{p',p}(X))$. To simplify the equations that follow, let us make the following definitions:

• $a^* = \min\{a \in \mathbb{R} \mid F_{H_{\beta^*}(D^\eta_{p,p'})}(a) \ge 1-\delta\}$
• $p = P_{X,Y,Z\sim D^\eta_{p,p'}}[Z = 1 \mid H_{\beta^*}(X) \ge a^*]$
• $\delta_1 = 1 - F_{H_{\beta^*}(B_p)}(a^*)$
• $\delta_2 = 1 - F_{H_{\beta^*}(B_{p'})}(a^*)$

We have the following (in)equalities:

1.

$$P_{X\sim B_p}[F(H_{\beta^*}(X)) \ge 1-\delta_1] = 2p\, P_{X,Y,Z\sim D^\eta_{p,p'}}[F(H_{\beta^*}(X)) \ge 1-\delta] \qquad (26)$$

and

$$P_{X\sim B_{p'}}[F(H_{\beta^*}(X)) \ge 1-\delta_2] = 2(1-p)\, P_{X,Y,Z\sim D^\eta_{p,p'}}[F(H_{\beta^*}(X)) \ge 1-\delta]. \qquad (27)$$

Proof. We have

$$P_{X\sim B_p}[F(H_{\beta^*}(X)) \ge 1-\delta_1] = P_{X\sim B_p}[H_{\beta^*}(X) \ge a^*] = P_{X,Y,Z\sim D^\eta_{p,p'}}[H_{\beta^*}(X) \ge a^* \mid Z = 1]$$
$$= P_{X,Y,Z\sim D^\eta_{p,p'}}[Z = 1 \mid H_{\beta^*}(X) \ge a^*]\cdot\frac{P_{X,Y,Z\sim D^\eta_{p,p'}}[H_{\beta^*}(X) \ge a^*]}{P_{X,Y,Z\sim D^\eta_{p,p'}}[Z = 1]} = 2p\,P_{X,Y,Z\sim D^\eta_{p,p'}}[H_{\beta^*}(X) \ge a^*]$$
$$= 2p\,P_{X,Y,Z\sim D^\eta_{p,p'}}[F(H_{\beta^*}(X)) \ge 1-\delta],$$

which proves equation 26. Equation 27 can be proved similarly.

2.

$$\delta_1, \delta_2 \le \delta^*_0 \qquad (28)$$

Proof. We have

$$\delta_1 - AC_{X\sim B_p}(H_{\beta^*}(X)) \le P_{X\sim B_p}[F(H_{\beta^*}(X)) \ge 1-\delta_1] \quad \text{(by Lemma 1)}$$
$$= 2p\,P_{X,Y,Z\sim D^\eta_{p,p'}}[F(H_{\beta^*}(X)) \ge 1-\delta] \quad \text{(by equation 26)} \ \le\ 2p\delta \quad \text{(by Lemma 1)} \ \le\ 2\delta \ \le\ 2\delta_0 \ \le\ \tfrac{2}{3}\delta^*_0.$$

Since

$$AC_{X\sim B_p}(H_{\beta^*}(X)) = AC_{X\sim B_p}\Big(\Big|\beta_0 + \sum_{i=1}^{n}\beta^*_i X_i\Big|\Big) \le 2\,AC_{X\sim B_p}\Big(\sum_{i=1}^{n}\beta^*_i X_i\Big) \le \frac{2C}{\sqrt{n}} \quad \text{(by Lemma 2)} \ \le\ \frac{\delta^*_0}{3} \quad \text{(since } n \ge (\tfrac{6C}{\delta^*_0})^2\text{)},$$

this proves $\delta_1 \le \delta^*_0$. $\delta_2 \le \delta^*_0$ can be proved similarly.

3.

$$\delta_1 \ge p\delta, \qquad \delta_2 \ge (1-p)\delta \qquad (29)$$

Proof. We have

$$\delta_1 \ge P_{X\sim B_p}[F(H_{\beta^*}(X)) \ge 1-\delta_1] \quad \text{(by Lemma 1)} \ =\ 2p\,P_{X,Y,Z\sim D^\eta_{p,p'}}[F(H_{\beta^*}(X)) \ge 1-\delta] \quad \text{(by equation 26)}$$
$$\ge 2p\big(\delta - AC_{X,Y,Z\sim D^\eta_{p,p'}}(H_{\beta^*}(X))\big) \quad \text{(by Lemma 1)} \ \ge\ 2p\Big(\delta - \tfrac{1}{2}\big(AC_{X\sim B_p}(H_{\beta^*}(X)) + AC_{X\sim B_{p'}}(H_{\beta^*}(X))\big)\Big)$$
$$\ge 2p\Big(\delta - \Big(AC_{X\sim B_p}\Big(\sum_{i=1}^{n}\beta^*_i X_i\Big) + AC_{X\sim B_{p'}}\Big(\sum_{i=1}^{n}\beta^*_i X_i\Big)\Big)\Big)$$
$$\ge 2p\Big(\delta - \frac{2C}{\sqrt{n}}\Big) \quad \text{(by applying Lemma 2 twice)} \ \ge\ p\delta \quad \text{(since } n \ge (\tfrac{4C}{\delta})^2\text{)},$$

which proves $\delta_1 \ge p\delta$. $\delta_2 \ge (1-p)\delta$ can be proved similarly.

Using these inequalities, we can show equation 25 as follows:

$$P_{X,Y,Z\sim D^\eta_{p,p'}}[\, R^\Lambda_{p,p'}(X,Z) \le \epsilon \mid F(H_{\beta^*}(X)) \ge 1-\delta\,] \ \ge\ P_{X,Y,Z\sim D^\eta_{p,p'}}[\, R^\Lambda_{p,p'}(X,Z) \le \tfrac{\epsilon}{2} \mid F(H_{\beta^*}(X)) \ge 1-\delta\,]$$
$$= P_{X,Y,Z\sim D^\eta_{p,p'}}[\, R^\Lambda_{p,p'}(X,Z) \le \tfrac{\epsilon}{2} \mid H_{\beta^*}(X) \ge a^*\,] \quad \text{(by the definition of } a^*\text{)}$$
$$= p\cdot P_{X,Y,Z\sim D^\eta_{p,p'}}[\, F_{\Psi_{p,p'}(B_p)}(\Psi_{p,p'}(X)) \le \tfrac{\epsilon}{2} \mid H_{\beta^*}(X) \ge a^*, Z = 1\,] + (1-p)\cdot P_{X,Y,Z\sim D^\eta_{p,p'}}[\, F_{\Psi_{p',p}(B_{p'})}(\Psi_{p',p}(X)) \le \tfrac{\epsilon}{2} \mid H_{\beta^*}(X) \ge a^*, Z = -1\,]$$
$$= p\cdot P_{X\sim B_p}[\, F_{\Psi_{p,p'}(B_p)}(\Psi_{p,p'}(X)) \le \tfrac{\epsilon}{2} \mid F(H_{\beta^*}(X)) \ge 1-\delta_1\,] + (1-p)\cdot P_{X\sim B_{p'}}[\, F_{\Psi_{p',p}(B_{p'})}(\Psi_{p',p}(X)) \le \tfrac{\epsilon}{2} \mid F(H_{\beta^*}(X)) \ge 1-\delta_2\,]$$
$$\ge (1-\tfrac{\epsilon}{2})\big(p\,\mathbf{1}_{\delta_1\ge\delta_{\min}} + (1-p)\,\mathbf{1}_{\delta_2\ge\delta_{\min}}\big) \quad \text{(by equation 23, equation 24 and equation 28)}$$
$$\ge (1-\tfrac{\epsilon}{2})\big(p\,\mathbf{1}_{p\ge\delta_{\min}/\delta} + (1-p)\,\mathbf{1}_{1-p\ge\delta_{\min}/\delta}\big) \quad \text{(by equation 29)}$$
$$\ge (1-\tfrac{\epsilon}{2})(1-\delta_{\min}/\delta) \quad \text{(case analysis based on the magnitude of } p\text{)} \ =\ (1-\tfrac{\epsilon}{2})^2 \ \ge\ 1-\epsilon.$$

A.5.2 PROOF OF LEMMA 5

Now, let us prove Lemma 5. The proof relies on the calculation results of Proposition 1 and the following lemma.

Lemma 6. Given $0 < \epsilon < 1/2$ and $M \ge 1$, there exists $\delta_0 = \delta_0(\epsilon, M) > 0$ such that for any $0 < \delta_{\min} \le \delta_0$, for sufficiently large $n$ (i.e., $n \ge N$ for some $N = N(\epsilon, M, \delta_{\min})$), the following holds. For any combination of

• $p = (p_1, \ldots, p_n) \in [\epsilon, 1-\epsilon]^n$,
• $a_1, \ldots, a_n, b_1, \ldots, b_n \in [\tfrac{1}{M}, M]$,
• $\delta \in [\delta_{\min}, \delta_0]$,

we have

$$P_{X\sim B_p}\Big[F\Big(\sum_{i=1}^{n} a_i X_i\Big) \le \epsilon \ \Big|\ F\Big(\sum_{i=1}^{n} b_i X_i\Big) \le \delta\Big] \ge 1-\epsilon.$$

Before presenting the proof of this lemma, we show how to conclude Lemma 5 from it. Let $0 < \epsilon < 1/2$ and $M \ge 1$. It is enough to find $\delta_0$ that satisfies the theorem statement where only equation 21 is considered but not equation 22, since a symmetric argument accounting for equation 22 can then be made and we can take the minimum of the two $\delta_0$. Let

$$\beta(a, b) = \frac{a - b}{1 - \tfrac{1}{2}(2a-1)^2 - \tfrac{1}{2}(2b-1)^2}.$$

Take $M = M(\Lambda, \epsilon)$ such that $\tfrac{1}{M} \le |\Lambda(a, b)|, |\beta(a, b)| \le M$ on $I_\epsilon = \{(a, b) \in [\epsilon, 1-\epsilon]\times[\epsilon, 1-\epsilon] : |a - b| \ge \epsilon\}$. Let $\delta_0 = \delta^{Lemma 6}_0(\epsilon/2, M)$, where $\delta^{Lemma 6}_0(\epsilon, M)$ stands for the $\delta_0$ found by using Lemma 6. Suppose $0 < \delta_{\min} \le \delta_0$. Suppose $\delta \in [\delta_{\min}, \delta_0]$, $p$, $p'$ and $\eta$ have been chosen, and $\beta^* = (\beta_0, \ldots, \beta_n)$ has been determined accordingly. Then our goal is to show that

$$\text{Goal:}\quad P_{X\sim B_p}\big[\, F(\Psi^\Lambda_{p,p'}(X)) \le \epsilon \mid F(H_{\beta^*}(X)) \ge 1-\delta \,\big] \ge 1-\epsilon$$

for sufficiently large $n$.
Letting $X'_i = \frac{p_i - p'_i}{|p_i - p'_i|}X_i$ $(i = 1, \ldots, n)$, $\alpha = P_{X\sim B_p}[\beta_0 + \sum_{i=1}^{n}\beta_i X_i < 0]$ and $\gamma = AC(\beta_0 + \sum_{i=1}^{n}\beta_i X_i)$, we get

$$P_{X\sim B_p}[\, F(\Psi^\Lambda_{p,p'}(X)) \le \epsilon \mid F(H_{\beta^*}(X)) \ge 1-\delta\,] = P_{X\sim B_p}\Big[\, F(\Psi^\Lambda_{p,p'}(X)) \le \epsilon \ \Big|\ F\Big(\Big|\beta_0 + \sum_{i=1}^{n}\beta_i X_i\Big|\Big) \le \delta\,\Big]$$
$$\ge -\frac{\alpha}{\delta-\gamma} + \frac{1}{1 + (2\alpha+\gamma)/(\delta-\gamma)}\cdot P_{X\sim B_p}\Big[\, F(\Psi^\Lambda_{p,p'}(X)) \le \epsilon \ \Big|\ F\Big(\beta_0 + \sum_{i=1}^{n}\beta_i X_i\Big) \le \delta\,\Big] \quad \text{(by Lemma 3)}$$
$$= -\frac{\alpha}{\delta-\gamma} + \frac{1}{1 + (2\alpha+\gamma)/(\delta-\gamma)}\cdot P_{X\sim B_p}\Big[\, F(\Psi^\Lambda_{p,p'}(X)) \le \epsilon \ \Big|\ F\Big(\sum_{i=1}^{n}\beta_i X_i\Big) \le \delta\,\Big]$$
$$= -\frac{\alpha}{\delta-\gamma} + \frac{1}{1 + (2\alpha+\gamma)/(\delta-\gamma)}\cdot P_{X\sim B_p}\Big[\, F\Big(\sum_{i=1}^{n}\tfrac{p_i-p'_i}{|p_i-p'_i|}\Lambda(p_i, p'_i)X'_i\Big) \le \epsilon \ \Big|\ F\Big(\sum_{i=1}^{n}\tfrac{p_i-p'_i}{|p_i-p'_i|}\beta(p_i, p'_i)X'_i\Big) \le \delta\,\Big] \quad \text{(by equation 10)}$$
$$\ge -\frac{\alpha}{\delta-\gamma} + \frac{1}{1 + (2\alpha+\gamma)/(\delta-\gamma)}\cdot(1-\epsilon/2) \quad \text{(by Lemma 6, provided that } n \ge N^{Lemma 6}(\epsilon/2, M, \delta_{\min})\text{)}$$
$$\ge -\frac{\alpha}{\delta_{\min}-\gamma} + \frac{1}{1 + (2\alpha+\gamma)/(\delta_{\min}-\gamma)}\cdot(1-\epsilon/2).$$

In the second-to-last inequality, we could apply Lemma 6 provided that $n \ge N^{Lemma 6}(\epsilon/2, M, \delta_{\min})$ because

1. $\delta \in [\delta_{\min}, \delta^{Lemma 6}_0(\epsilon/2, M)]$;
2. $\tfrac{p_i-p'_i}{|p_i-p'_i|}\Lambda(p_i, p'_i) > 0$ (since $\Lambda$ is sign-respecting) and $\tfrac{1}{M} \le |\Lambda(p_i, p'_i)| \le M$;
3. $\tfrac{p_i-p'_i}{|p_i-p'_i|}\beta(p_i, p'_i) > 0$ and $\tfrac{1}{M} \le |\beta(p_i, p'_i)| \le M$;
4. $X' = (X'_1, \ldots, X'_n) \sim B_{p''}$, where $p''_i = p_i$ if $p_i > p'_i$ and $p''_i = 1-p_i$ if $p_i < p'_i$, so $p'' \in [\epsilon, 1-\epsilon]^n$.

We may assume that $\Sigma$ is invertible, because $\Sigma$ has the form

$$\Sigma = \begin{pmatrix} 1 & \kappa \\ \kappa & 1 \end{pmatrix}, \qquad \kappa = \frac{\sum_{i=1}^{n} a_i b_i \sigma(p_i)^2}{\sqrt{\sum_{i=1}^{n} a_i^2 \sigma(p_i)^2}\sqrt{\sum_{i=1}^{n} b_i^2 \sigma(p_i)^2}},$$

and the conclusion of the theorem is trivial when $\kappa = 1$ (since then $(a_1, \ldots, a_n) \sim (b_1, \ldots, b_n)$). The inverse is

$$\Sigma^{-1} = \frac{1}{1-\kappa^2}\begin{pmatrix} 1 & -\kappa \\ -\kappa & 1 \end{pmatrix}.$$

Therefore, applying Lemma 4, we get, for $Z \sim \mathcal{N}(0, \Sigma)$,

$$|P[S \in U] - P[Z \in U]| \le C\gamma \qquad (37)$$

for any convex $U \subseteq \mathbb{R}^2$, where $C$ is an absolute constant and $\gamma = \sum_{i=1}^{n}\mathbb{E}[\|\Sigma^{-1}V_i\|_2^3]$. $\kappa$ has the following lower bound:

$$\kappa = \frac{\sum_{i=1}^{n} a_i b_i \sigma(p_i)^2}{\sqrt{\sum_{i=1}^{n} a_i b_i (\tfrac{a_i}{b_i})\sigma(p_i)^2}\sqrt{\sum_{i=1}^{n} a_i b_i (\tfrac{b_i}{a_i})\sigma(p_i)^2}} \ge \frac{1}{\sqrt{\max_i \tfrac{a_i}{b_i}}\sqrt{\max_i \tfrac{b_i}{a_i}}} \ge M^{-2}. \qquad (39)$$

On the other hand, an upper bound comes from our "restriction":

$$\text{bootstrapping restriction:}\quad \kappa \le 0.99. \qquad (40)$$

Note that the case where $\kappa > 0.99$ should be, at least intuitively, easier to prove, since the two events in equation 33 become more correlated in that case. We assume this upper bound temporarily to make sure $\gamma$ is properly bounded from above. From equation 36 and equation 40, it can easily be deduced that $\gamma \le \frac{C'}{\sqrt{n}}$ for some $C' = C'(\epsilon, M)$. Now we are going to prove equation 33 in light of equation 37. To do so, note that the probability density function of $Z$ can be written as

$$f_Z(z_1, z_2) = \frac{1}{2\pi\sqrt{\det\Sigma}}\exp\Big(-\frac{1}{2}\begin{pmatrix} z_1 & z_2\end{pmatrix}\Sigma^{-1}\begin{pmatrix} z_1 \\ z_2\end{pmatrix}\Big) = \frac{1}{2\pi\sqrt{1-\kappa^2}}\exp\Big(-\frac{1}{2(1-\kappa^2)}(z_1^2 - 2\kappa z_1 z_2 + z_2^2)\Big).$$

Also, note that from this we can calculate

$$\frac{\int_{-\infty}^{\alpha} f_Z(z_1, z_2)\, dz_1}{\int_{-\infty}^{\infty} f_Z(z_1, z_2)\, dz_1} = \Phi\Big(\frac{\alpha - \kappa z_2}{\sqrt{1-\kappa^2}}\Big). \qquad (41)$$
For sufficiently large $n$, we have

$$P_{X\sim B_p}\Big[F\Big(\sum_{i=1}^{n} a_i X_i\Big) \le \epsilon \ \Big|\ F\Big(\sum_{i=1}^{n} b_i X_i\Big) \le \delta\Big] = P_X[F(S_1) \le \epsilon \mid F(S_2) \le \delta] \ \ge\ P_X[F(S_1) \le \epsilon \text{ and } F(S_2) \le \delta]\,/\,\delta \quad \text{(by Lemma 1)}$$
$$= P_X\big[P_{X'}[S'_1 \le S_1] \le \epsilon \text{ and } P_{X'}[S'_2 \le S_2] \le \delta\big]\,/\,\delta$$
$$\ge P_X\big[P_{Z'}[Z'_1 \le S_1] \le \epsilon - C\gamma \text{ and } P_{Z'}[Z'_2 \le S_2] \le \delta - C\gamma\big]\,/\,\delta \quad \text{(by equation 37)}$$
$$= P_X\big[S_1 \le \Phi^{-1}(\epsilon - C\gamma) \text{ and } S_2 \le \Phi^{-1}(\delta - C\gamma)\big]\,/\,\delta \ \ge\ P_X\big[S_1 \le \alpha \text{ and } S_2 \le \Phi^{-1}(\delta - C\gamma)\big]\,/\,\delta$$
$$\ge -C\gamma/\delta + P_Z\big[Z_1 \le \alpha \text{ and } Z_2 \le \Phi^{-1}(\delta - C\gamma)\big]\,/\,\delta \quad \text{(by equation 37)}$$
$$= -C\gamma/\delta + (1 - C\gamma/\delta)\cdot\frac{\int_{-\infty}^{\Phi^{-1}(\delta-C\gamma)}\big(\int_{-\infty}^{\alpha} f_Z(z_1, z_2)\,dz_1\big)dz_2}{\int_{-\infty}^{\Phi^{-1}(\delta-C\gamma)}\big(\int_{-\infty}^{\infty} f_Z(z_1, z_2)\,dz_1\big)dz_2}$$
$$\ge -C\gamma/\delta + (1 - C\gamma/\delta)\cdot\inf_{z_2 \le \Phi^{-1}(\delta-C\gamma)}\frac{\int_{-\infty}^{\alpha} f_Z(z_1, z_2)\,dz_1}{\int_{-\infty}^{\infty} f_Z(z_1, z_2)\,dz_1}$$
$$= -C\gamma/\delta + (1 - C\gamma/\delta)\cdot\inf_{z_2 \le \Phi^{-1}(\delta-C\gamma)}\Phi\Big(\frac{\alpha - \kappa z_2}{\sqrt{1-\kappa^2}}\Big) \quad \text{(by equation 41)}$$
$$= -C\gamma/\delta + (1 - C\gamma/\delta)\cdot\Phi\Big(\frac{\alpha - \kappa\Phi^{-1}(\delta - C\gamma)}{\sqrt{1-\kappa^2}}\Big).$$

Now, let us further estimate the last term:

$$\Phi\Big(\frac{\alpha - \kappa\Phi^{-1}(\delta - C\gamma)}{\sqrt{1-\kappa^2}}\Big) \ge \Phi\Big(\frac{\alpha - \kappa\Phi^{-1}(\delta_0)}{\sqrt{1-\kappa^2}}\Big) = \Phi\Big(\frac{\alpha - \kappa M^2(\alpha - \Phi^{-1}(1-\epsilon/2))}{\sqrt{1-\kappa^2}}\Big)$$
$$\ge \Phi\Big(\frac{\alpha - (\alpha - \Phi^{-1}(1-\epsilon/2))}{\sqrt{1-\kappa^2}}\Big) \quad \text{($\alpha - \Phi^{-1}(1-\epsilon/2)$ is negative + equation 39)} \ \ge\ 1 - \epsilon/2.$$

Therefore, continuing the previous chain of inequalities,

$$P_{X\sim B_p}\Big[F\Big(\sum_{i=1}^{n} a_i X_i\Big) \le \epsilon \ \Big|\ F\Big(\sum_{i=1}^{n} b_i X_i\Big) \le \delta\Big] \ge -C\gamma/\delta + (1 - C\gamma/\delta)(1-\epsilon/2)$$
$$\ge -C\delta^{-1}C'/\sqrt{n} + (1 - C\delta^{-1}C'/\sqrt{n})(1-\epsilon/2) \ \ge\ -C\delta_{\min}^{-1}C'/\sqrt{n} + (1 - C\delta_{\min}^{-1}C'/\sqrt{n})(1-\epsilon/2) \ \ge\ 1-\epsilon,$$

when $n \ge N$ for some $N$ that depends only on $\epsilon$, $M$ and $\delta_{\min}$. This finishes the proof under the restriction of equation 40.

Turning to the general case, let $0 < \epsilon < 1/2$ and $M \ge 1$. We take $\delta_0 = \min(\delta^{bootstrap}_0(\epsilon, M), \delta^{bootstrap}_0(\epsilon/2, 2M))$. We claim that $\delta_0$ satisfies the theorem statement. Let $\delta_{\min} \in (0, \delta_0]$, $\delta \in [\delta_{\min}, \delta_0]$, $p = (p_1, \ldots, p_n) \in [\epsilon, 1-\epsilon]^n$ and $\tfrac{1}{M} \le a_1, \ldots, a_n \le M$. Take $N = \max(N^{bootstrap}(\epsilon, M, \delta_{\min}), N^{bootstrap}(\epsilon/2, 2M, \delta_{\min}))$. We may assume that

$$\kappa = \frac{\sum_{i=1}^{n} a_i b_i \sigma(p_i)^2}{\sqrt{\sum_{i=1}^{n} a_i^2 \sigma(p_i)^2}\sqrt{\sum_{i=1}^{n} b_i^2 \sigma(p_i)^2}} > 0.99,$$

since otherwise we can rely on $\delta \in [\delta_{\min}, \delta^{bootstrap}_0(\epsilon, M)]$ to conclude. We will construct $\tfrac{1}{2M} \le a'_1, \ldots, a'_n \le 2M$ and $\tfrac{1}{2M} \le a''_1, \ldots, a''_n \le 2M$ such that

$$\kappa' = \frac{\sum_{i=1}^{n} a'_i b_i \sigma(p_i)^2}{\sqrt{\sum_{i=1}^{n} a'^2_i \sigma(p_i)^2}\sqrt{\sum_{i=1}^{n} b_i^2 \sigma(p_i)^2}} \le 0.99, \qquad \kappa'' = \frac{\sum_{i=1}^{n} a''_i b_i \sigma(p_i)^2}{\sqrt{\sum_{i=1}^{n} a''^2_i \sigma(p_i)^2}\sqrt{\sum_{i=1}^{n} b_i^2 \sigma(p_i)^2}} \le 0.99,$$

and $3a_i = a'_i + a''_i$ $(1 \le i \le n)$. Then, since $\delta \in [\delta_{\min}, \delta^{bootstrap}_0(\epsilon/2, 2M)]$, we get

$$p' = P\Big[F\Big(\sum_{i=1}^{n} a'_i X_i\Big) \le \epsilon/2 \ \Big|\ F\Big(\sum_{i=1}^{n} b_i X_i\Big) \le \delta\Big] \ge 1-\epsilon/2 \quad \text{and} \quad p'' = P\Big[F\Big(\sum_{i=1}^{n} a''_i X_i\Big) \le \epsilon/2 \ \Big|\ F\Big(\sum_{i=1}^{n} b_i X_i\Big) \le \delta\Big] \ge 1-\epsilon/2$$

for $n \ge N$.
Then we can finish the proof as follows:

$$P\Big[F\Big(\sum_{i=1}^{n} a_i X_i\Big) \le \epsilon \ \Big|\ F\Big(\sum_{i=1}^{n} b_i X_i\Big) \le \delta\Big] = P\Big[F\Big(\sum_{i=1}^{n} 3a_i X_i\Big) \le \epsilon \ \Big|\ F\Big(\sum_{i=1}^{n} b_i X_i\Big) \le \delta\Big] = P\Big[F\Big(\sum_{i=1}^{n} (a'_i + a''_i) X_i\Big) \le \epsilon \ \Big|\ F\Big(\sum_{i=1}^{n} b_i X_i\Big) \le \delta\Big]$$
$$\ge P\Big[F\Big(\sum_{i=1}^{n} a'_i X_i\Big) + F\Big(\sum_{i=1}^{n} a''_i X_i\Big) \le \epsilon \ \Big|\ F\Big(\sum_{i=1}^{n} b_i X_i\Big) \le \delta\Big] \ge P\Big[F\Big(\sum_{i=1}^{n} a'_i X_i\Big) \le \tfrac{\epsilon}{2} \text{ and } F\Big(\sum_{i=1}^{n} a''_i X_i\Big) \le \tfrac{\epsilon}{2} \ \Big|\ F\Big(\sum_{i=1}^{n} b_i X_i\Big) \le \delta\Big]$$
$$\ge p' + p'' - 1 \ge (1-\tfrac{\epsilon}{2}) + (1-\tfrac{\epsilon}{2}) - 1 = 1-\epsilon.$$

In the middle, we relied on the inequality $F(Y_1 + Y_2) \le F(Y_1) + F(Y_2)$, where $Y_1 = \sum_{i=1}^{n} a'_i X_i$ and $Y_2 = \sum_{i=1}^{n} a''_i X_i$. Choosing a set $S \subseteq \{1, \ldots, n\}$ such that

$$\sum_{i\in S} a_i b_i \sigma(p_i)^2 \simeq \tfrac{1}{2}\sum_{i=1}^{n} a_i b_i \sigma(p_i)^2 \qquad (45)$$

and

$$\sum_{i\in S} a_i^2 \sigma(p_i)^2 \simeq \tfrac{1}{2}\sum_{i=1}^{n} a_i^2 \sigma(p_i)^2, \qquad (46)$$

we can let

$$a'_i = \begin{cases} 2a_i & (i \in S) \\ a_i & (i \in S^c) \end{cases} \qquad \text{and} \qquad a''_i = \begin{cases} a_i & (i \in S) \\ 2a_i & (i \in S^c) \end{cases},$$

which leads to

$$\kappa' = \frac{\sum_{i=1}^{n} a'_i b_i \sigma(p_i)^2}{\sqrt{\sum_{i=1}^{n} a'^2_i \sigma(p_i)^2}\sqrt{\sum_{i=1}^{n} b_i^2 \sigma(p_i)^2}} = \frac{2\sum_{i\in S} a_i b_i \sigma(p_i)^2 + \sum_{i\in S^c} a_i b_i \sigma(p_i)^2}{\sqrt{4\sum_{i\in S} a_i^2 \sigma(p_i)^2 + \sum_{i\in S^c} a_i^2 \sigma(p_i)^2}\sqrt{\sum_{i=1}^{n} b_i^2 \sigma(p_i)^2}} \sim \frac{(2\cdot\tfrac{1}{2} + \tfrac{1}{2})\sum_{i=1}^{n} a_i b_i \sigma(p_i)^2}{\sqrt{(4\cdot\tfrac{1}{2} + \tfrac{1}{2})\sum_{i=1}^{n} a_i^2 \sigma(p_i)^2}\sqrt{\sum_{i=1}^{n} b_i^2 \sigma(p_i)^2}} = \frac{1.5}{\sqrt{2.5}}\kappa \le \frac{1.5}{\sqrt{2.5}} < 0.99.$$

An $S$ that satisfies equation 45 and equation 46 can be found, for example, by a probabilistic argument (define a random set and show that it satisfies the desired properties with high probability using Hoeffding's concentration inequality).

A.6 THE ERROR THEOREM

Definition 9. A function $\mathrm{Err}: \mathbb{R} \times \{-1, 1\} \to \mathbb{R}$ will be called a good error function if it satisfies the following properties:

1. For any $(\hat{y}_1, y_1), (\hat{y}_2, y_2) \in \mathbb{R} \times \{-1, 1\}$, $\hat{y}_1 y_1 > 0,\ \hat{y}_2 y_2 \le 0 \Rightarrow \mathrm{Err}(\hat{y}_1, y_1) < \mathrm{Err}(\hat{y}_2, y_2)$.
2. For any $\hat{y}_1, \hat{y}_2 \in \mathbb{R}$ and $y \in \{-1, 1\}$, $\hat{y}_1 \ne \hat{y}_2 \Rightarrow \mathrm{Err}(\hat{y}_1, y) \ne \mathrm{Err}(\hat{y}_2, y)$.

That is, (1) predictions that have the same sign as the target label always have lower error values than those that do not, and (2) different predictions on inputs that have the same target label result in distinguishable error values.

Theorem 2. Let $\mathrm{Err}$ be a good error function. Given $0 < \epsilon < 1/2$ and $0 \le \eta < 1/2$, there exists $C = C(\epsilon, \eta) > 0$ such that for any $0 < \theta \le 1$, for sufficiently large $n$, the following holds. For any combination of $p = (p_1, \ldots, p_n)$ and $p' = (p'_1, \ldots, p'_n)$ in $[\epsilon, 1-\epsilon]^n$ with $|p_i - p'_i| \ge \epsilon$ $(1 \le i \le n)$, when $\beta^* = (\beta_0, \beta_1, \ldots, \beta_n)$ is the risk-minimizing linear solution of $(X, Y)$ for $(X, Y, Z) \sim D^\eta_{p,p'}$, we have

$$P_{X,Y,Z\sim D^\eta_{p,p'}}\big[\, YZ = -1 \mid F(\mathrm{Err}(\hat{Y}, Y)) \ge 1-\theta \,\big] \ge \min\Big(1, \frac{\eta}{\theta}\Big) - \frac{C}{\sqrt{n}},$$

where $\hat{Y} = \beta_0 + \sum_{i=1}^{n}\beta_i X_i$.

A.7 PROOF OF THEOREM 2

In proving Theorem 2, we tackle the different ranges of $\theta$, (1) $\theta > \eta$, (2) $\theta < \eta$, and (3) $\theta = \eta$, separately. In all cases, we need the following facts:

• When $\xi = P[\hat{Y}Z \le 0]$, we have
$$\xi \le \exp(-C_1 n) \qquad (47)$$
for some $C_1 = C_1(\epsilon)$.
Proof. Use Hoeffding's inequality in conjunction with equation 11 and equation 10 in the same manner as in equations 31 and 32.
• When $\eta_0 = P[\hat{Y}Y \le 0]$, we have
$$|\eta_0 - \eta| \le \xi. \qquad (48)$$
Proof. $|\eta_0 - \eta| = \big|P[\hat{Y}Y \le 0] - P[YZ = -1]\big| \le P[(\hat{Y}Y \le 0 \wedge YZ = 1) \vee (\hat{Y}Y > 0 \wedge YZ = -1)]$ (symmetric difference) $\le P[\hat{Y}Z \le 0]$ (logical implication) $= \xi$.
• When $\gamma = AC(\mathrm{Err}(\hat{Y}, Y))$ (see Lemma 1), we have
$$\gamma \le \frac{C_2}{\sqrt{n}} \qquad (49)$$
for some $C_2 = C_2(\epsilon)$.
Proof. $\gamma = \max_{x\in\mathbb{R}} P[\mathrm{Err}(\hat{Y}, Y) = x] \le \max_{x\in\mathbb{R}}\big(P[\mathrm{Err}(\hat{Y}, -1) = x] + P[\mathrm{Err}(\hat{Y}, 1) = x]\big) \le AC(\mathrm{Err}(\hat{Y}, -1)) + AC(\mathrm{Err}(\hat{Y}, 1)) \le AC(\hat{Y}) + AC(\hat{Y})$ (since $\mathrm{Err}$ is "good") $= 2\,AC\big(\sum_{i=1}^{n}\beta_i X_i\big) \le \frac{C_2}{\sqrt{n}}$ (equation 10, Lemma 2).

A.7.1 THE CASE θ > η

Assume $\theta > \eta$. From equation 48, equation 47 and equation 49, we see that

$$\theta \ge \eta_0 + \gamma \qquad (50)$$

when $n$ is sufficiently large.
First, we argue the following claim: $\hat{Y}Y \le 0 \Rightarrow F(\mathrm{Err}(\hat{Y}, Y)) \ge 1-\theta$. Suppose otherwise. Then it happens with nonzero probability that $\hat{Y}Y \le 0$ and $F(\mathrm{Err}(\hat{Y}, Y)) < 1-\theta$; that is,

$$\exists (\hat{y}_1, y_1),\ \hat{y}_1 y_1 \le 0 \ \wedge\ F(\mathrm{Err}(\hat{y}_1, y_1)) < 1-\theta. \qquad (52)$$

On the other hand, we have

$$P[F(\mathrm{Err}(\hat{Y}, Y)) \ge 1-\theta \wedge \hat{Y}Y > 0] \ge P[F(\mathrm{Err}(\hat{Y}, Y)) \ge 1-\theta] - \eta_0 > \theta - \gamma - \eta_0 \quad \text{(by Lemma 1)} \ \ge\ 0 \quad \text{(by equation 50)}.$$

Since the event $F(\mathrm{Err}(\hat{Y}, Y)) \ge 1-\theta \wedge \hat{Y}Y > 0$ occurs with nonzero probability, we have

$$\exists (\hat{y}_2, y_2),\ \hat{y}_2 y_2 > 0 \ \wedge\ F(\mathrm{Err}(\hat{y}_2, y_2)) \ge 1-\theta. \qquad (53)$$

From equation 52 and equation 53, since $\hat{y}_1 y_1 \le 0$, $\hat{y}_2 y_2 > 0$ but $\mathrm{Err}(\hat{y}_1, y_1) < \mathrm{Err}(\hat{y}_2, y_2)$, we find a contradiction with the definition of a good error function. Now, we have

$$P[YZ = -1 \mid F(\mathrm{Err}(\hat{Y}, Y)) \ge 1-\theta] = \frac{P[YZ = -1 \wedge F(\mathrm{Err}(\hat{Y}, Y)) \ge 1-\theta]}{P[F(\mathrm{Err}(\hat{Y}, Y)) \ge 1-\theta]} \ge P[YZ = -1 \wedge \hat{Y}Y \le 0].$$

Now, we have

$$P[YZ = -1 \mid F(\mathrm{Err}(\hat{Y}, Y)) \ge 1-\theta] = \frac{P[YZ = -1 \wedge F(\mathrm{Err}(\hat{Y}, Y)) \ge 1-\theta]}{P[F(\mathrm{Err}(\hat{Y}, Y)) \ge 1-\theta]}$$
$$\ge \frac{P[\hat{Y}Z > 0 \wedge F(\mathrm{Err}(\hat{Y}, Y)) \ge 1-\theta]}{P[F(\mathrm{Err}(\hat{Y}, Y)) \ge 1-\theta]} \quad \text{(equation 55, } \hat{Y}Y \le 0 \wedge \hat{Y}Z > 0 \Rightarrow YZ \le 0 \Rightarrow YZ = -1\text{)}$$
$$\ge 1 - \frac{\xi}{P[F(\mathrm{Err}(\hat{Y}, Y)) \ge 1-\theta]} \ \ge\ 1 - \frac{\xi}{\theta - \gamma} \quad \text{(by Lemma 1)} \ \ge\ 1 - \frac{1}{\theta/2}\exp(-C_1 n) \quad \text{(by equation 47, equation 49)},$$

when $n$ is sufficiently large.

A.7.3 THE CASE θ = η

The remaining case is when $\theta = \eta$. In this case, we rely on an approximation to the case $\theta < \eta_0$ (note that the proof for the previous case $\theta < \eta$ relied on $\theta < \eta_0$ holding when $n$ is sufficiently large). Let $\eta_1 = \min(\eta, (1 - \exp(-n))\eta_0)$. We get
\[
\begin{aligned}
P[YZ = -1 \,|\, F(\mathrm{Err}(\hat{Y}, Y)) \ge 1 - \eta]
&\ge P[YZ = -1 \,|\, F(\mathrm{Err}(\hat{Y}, Y)) \ge 1 - \eta_1] \cdot \frac{P[F(\mathrm{Err}(\hat{Y}, Y)) \ge 1 - \eta_1]}{P[F(\mathrm{Err}(\hat{Y}, Y)) \ge 1 - \eta]} \\
&\ge \Big(1 - \frac{2}{\eta_1}\exp(-C_1 n)\Big) \cdot \frac{P[F(\mathrm{Err}(\hat{Y}, Y)) \ge 1 - \eta_1]}{P[F(\mathrm{Err}(\hat{Y}, Y)) \ge 1 - \eta]} \ \text{(applying the lower bound from the previous case)} \\
&\ge \Big(1 - \frac{2}{\eta_1}\exp(-C_1 n)\Big) \cdot \frac{\eta_1 - \gamma}{\eta} \ \text{(Lemma 1)} \\
&\ge \Big(1 - \frac{4}{\eta}\exp(-C_1 n)\Big) \cdot \frac{(1 - \exp(-n))(\eta - \exp(-C_1 n)) - \frac{C_2}{\sqrt{n}}}{\eta} \ \text{(equation 48, equation 47, equation 49)} \\
&\ge 1 - \frac{2C_2/\eta}{\sqrt{n}} ,
\end{aligned}
\]
when $n$ is sufficiently large.

B EXPERIMENTAL DETAILS

B.1 SYNTHETIC DATASET

Dataset details

The 2-D synthetic classification dataset consists of majority groups (which can be classified with the spurious feature) and minority groups (which cannot be classified with the spurious feature). The training set consists of 1,000 majority-group samples and five minority-group samples. Specifically, majority-group samples are drawn from the multivariate Gaussian distributions N([5.0, 5.0], 1.3I) and N([-5.0, -5.0], 1.3I) for the positive and negative classes, respectively. In contrast, minority-group samples are drawn from N([5.0, -5.0], 1.3I) and N([-5.0, 5.0], 1.3I). Notably, before feeding the training samples into the model, we manually scale down the second feature (the invariant feature) by a factor of 10. This downscaling leads to relatively insignificant gradients for the invariant feature, which therefore becomes hard to learn.
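For reference, the following is a minimal sketch of how such a dataset could be generated. The exact sampling code is not given here, so the interpretation of 1.3I as a covariance and the class assignment of the minority clusters are assumptions, and all names are illustrative.

```python
import numpy as np

def make_synthetic(n_major=1000, n_minor=5, var=1.3, downscale=0.1, seed=0):
    """Sketch of the 2-D synthetic dataset.

    Assumptions: 1.3I is a covariance matrix, the second feature is the invariant
    one, and minority samples keep their class's invariant feature while the
    spurious (first) feature is flipped.
    """
    rng = np.random.default_rng(seed)
    std = np.sqrt(var)

    def gauss(mean, n):
        return rng.normal(loc=mean, scale=std, size=(n, 2))

    # Majority groups: spurious and invariant features agree with the label.
    x_pos_maj = gauss([5.0, 5.0], n_major // 2)    # label +1
    x_neg_maj = gauss([-5.0, -5.0], n_major // 2)  # label -1
    # Minority (SCF) groups: the spurious feature contradicts the label.
    x_pos_min = gauss([-5.0, 5.0], n_minor)        # label +1
    x_neg_min = gauss([5.0, -5.0], n_minor)        # label -1

    x = np.concatenate([x_pos_maj, x_neg_maj, x_pos_min, x_neg_min])
    y = np.concatenate([
        np.ones(len(x_pos_maj)), -np.ones(len(x_neg_maj)),
        np.ones(len(x_pos_min)), -np.ones(len(x_neg_min)),
    ])
    x[:, 1] *= downscale  # scale down the invariant feature so it is hard to learn
    return x, y
```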

Model details

We use the simplest neural network architecture for this experiment: a fully connected network with a single hidden layer and ReLU activation. The hidden layer has 50,000 neurons. For training, we use SGD for all evaluated models. For ERM, we use a learning rate of 1e-4 and a weight decay of 1e-5. For END's identification model, we use a learning rate of 5e-4, a weight decay of 1e-5, a confidence-regularization coefficient of 1e-3, and a p-value threshold of 1e-2; the SGLD weight sample is saved every 3,000 iterations. For JTT and END's debiased model, we use a learning rate of 1e-5 and a weight decay of 5e-4, for both the identification model and the final model. The hyperparameters for JTT and END's debiased model are determined based on the argument of Sagawa et al. (2019) and Liu et al. (2021) that importance-weighting approaches should use relatively lower learning rates and higher weight decays. Every evaluated method is trained for 100,000 iterations with batch size 32, except for the identification models of END and JTT, which are trained for 30,000 iterations. Additionally, we add the oversampled set (the error set for JTT, the SCF set for END) 30 times.
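For concreteness, a minimal sketch of the network and the optimizer settings described above; the output dimension, the training loop, and the SGLD sampler are not specified here, so those parts are illustrative.

```python
import torch
import torch.nn as nn

class OneHiddenMLP(nn.Module):
    """Single-hidden-layer ReLU network used in the synthetic experiment."""
    def __init__(self, in_dim=2, hidden=50_000, out_dim=2):  # out_dim assumed
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

# ERM: higher learning rate, lower weight decay.
erm_model = OneHiddenMLP()
erm_opt = torch.optim.SGD(erm_model.parameters(), lr=1e-4, weight_decay=1e-5)

# JTT / END debiased model: lower learning rate, higher weight decay.
debiased_model = OneHiddenMLP()
debiased_opt = torch.optim.SGD(debiased_model.parameters(), lr=1e-5, weight_decay=5e-4)
```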

B.2 GROUP ROBUSTNESS BENCHMARK DATASET

Dataset details We use two image classification datasets, CelebA and Waterbirds. Both have been used to evaluate group robustness (worst-group accuracy). Each dataset has group information (attribute, label), and the attribute serves as the spurious cue. For instance, the target class of the CelebA dataset is hair color, but a model can predict the class by abusing the spurious cue "gender". This abuse degrades the accuracy on a certain group; if the model is group-robust, the degradation is insignificant. Here, we describe the input features, the group information, and the target classes of each dataset: • Waterbirds (Wah et al., 2011): The input data are images of birds and their backgrounds. The target classes are "waterbird" and "landbird". The dataset has group information (background, label): "waterbird with water/land background" or "landbird with water/land background". In the training dataset, most waterbird images have a water background, and vice versa; only 5% of the training dataset has a contradictory background (e.g., a landbird on a water background). We consider 4 groups: [waterbird, water background], [landbird, water background], [waterbird, land background], and [landbird, land background]. • CelebA (Liu et al., 2015): The input data are images of celebrities. Following Sagawa et al. (2019) and Liu et al. (2021), the target class is hair color, "blond" or "not blond", and the spurious attribute is gender, "male" or "female". The label and the spurious attribute are spuriously correlated as in the Waterbirds case (e.g., most of the "male" group has the "not blond" label). We consider 4 groups: [blond, female], [not blond, female], [blond, male], and [not blond, male]. Moreover, we add symmetric label noise to the two datasets: we randomly flip the target label of each sample with probability 0%, 10%, 20%, or 30%.
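A minimal sketch of the symmetric noise injection described above (the helper name and seeding are illustrative):

```python
import numpy as np

def add_symmetric_noise(labels, noise_rate, num_classes=2, seed=0):
    """Flip each label to a different class, uniformly at random, with probability noise_rate."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    flip = rng.random(len(labels)) < noise_rate
    # Draw a replacement class that differs from the original label.
    offsets = rng.integers(1, num_classes, size=len(labels))
    labels[flip] = (labels[flip] + offsets[flip]) % num_classes
    return labels
```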

Model details

In our experiments on the benchmark datasets, we follow the experimental setup of Liu et al. (2021) for LfF, ERM, and JTT. Specifically, we use the same model architecture for all evaluated methods: a ResNet50 (He et al., 2016) pre-trained on ImageNet. For the baselines' hyperparameters, we follow the prior studies (Nam et al., 2020; Liu et al., 2021). Moreover, as in Liu et al. (2021), model selection for all methods is based on the worst-group accuracy on the validation dataset.

For the Waterbirds dataset, all evaluated models are trained for up to 300 epochs with batch size 64, except for the identification models of END and JTT, which are trained for 50 epochs. SGD optimizers with momentum 0.9 are used, except for LfF. ERM uses a learning rate of 1e-3 and an L2 regularization of 1e-4. JTT and END's debiased model use a learning rate of 1e-5 and an L2 regularization of 1.0, and the oversampled set is added 50 times for JTT (error set) and 100 times for END (SCF set). LfF uses the Adam optimizer with a learning rate of 1e-4 and an L2 regularization of 1e-4; the q of LfF is 1e-3. END's identification model uses a learning rate of 1e-3, an L2 regularization of 1e-4, and a confidence-regularization coefficient of 3e-1; we save the SGLD weight samples every 5 epochs.

For the CelebA dataset, the models are trained for up to 50 epochs with batch size 64, except for the identification models of END and JTT, which are trained for 5 and 1 epochs, respectively. As for Waterbirds, SGD optimizers with momentum 0.9 are used, except for LfF. ERM uses a learning rate of 1e-4 and an L2 regularization of 1e-4. JTT uses a learning rate of 1e-5 and an L2 regularization of 1e-1, and the oversampled set (error set of JTT, SCF set of END) is added 50 times. LfF uses the Adam optimizer with a learning rate of 1e-4 and an L2 regularization of 1e-4; the q of LfF is 1e-3. END's identification model uses a learning rate of 5e-2, an L2 regularization of 1e-4, and a confidence-regularization coefficient of 1e-3; we save the SGLD weight samples every epoch.
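As a rough sketch of how the SCF set can be selected from the identification model's SGLD snapshots and then oversampled: the Gamma-fit p-value thresholding and oversampling counts follow the description above, but the helper names and the exact averaging scheme are our simplification.

```python
import numpy as np
from scipy.stats import gamma
from torch.utils.data import ConcatDataset

def select_scf_indices(probs_per_snapshot, p_value_threshold=1e-2):
    """probs_per_snapshot: array of shape (num_sgld_snapshots, num_samples, num_classes).

    Averages the SGLD snapshot predictions, computes the predictive entropy per
    sample, fits a Gamma distribution to the entropies, and keeps samples whose
    entropy lies in the upper tail (high uncertainty => candidate SCF samples).
    """
    mean_probs = probs_per_snapshot.mean(axis=0)
    entropy = -(mean_probs * np.log(mean_probs + 1e-12)).sum(axis=1)
    a, loc, scale = gamma.fit(entropy)
    p_values = gamma.sf(entropy, a, loc=loc, scale=scale)  # right-tail probability
    return np.where(p_values < p_value_threshold)[0]

def oversample(train_set, scf_subset, times=100):
    """Augment the training set with the SCF subset repeated `times` times (e.g., 100 for Waterbirds)."""
    return ConcatDataset([train_set] + [scf_subset] * times)
```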

B.3 DRUG-TARGET AFFINITY REGRESSION DATASET

DTA task The Drug-Target Affinity (DTA) regression task is an important task for early-stage drug discovery. We use two well-known benchmark datasets: Davis (Davis et al., 2011) and KIBA (Tang et al., 2014). The inputs are the Simplified Molecular Input Line-Entry System (SMILES) sequence of the drug molecule and the amino-acid sequence of the target protein. Thus, as in other sequence-based deep learning tasks, the input data are one-hot encoded sequences. The target value is the real-valued drug-target affinity. The datasets have 5 different folds, as suggested in Öztürk et al. (2018); the results in Table 2 are therefore the average and standard deviation over 5 trials.

• Davis: The Davis dataset consists of clinically relevant kinase inhibitor ligands and their affinity values (the dissociation constant $K_d$). As in Öztürk et al. (2018), we rescale the target affinity value into log space, $-\log_{10}(K_d/10^9)$. The dataset consists of 68 drugs and 442 target protein sequences, a total of 30,056 affinity values. An additional detail of the Davis dataset is that the target affinity value is set to 5 if the affinity is lower than 5 (Davis et al., 2011); this could act as a noisy label that does not represent the true affinity value.
• KIBA: The KIBA dataset includes kinase protein sequences and SMILES molecule sequences, similar to Davis, but uses its own affinity score, the KIBA score. The dataset has 2,111 compounds and 229 proteins, a total of 118,254 affinity values.

DeepDTA model We use the well-known, simple but effective DeepDTA architecture (Öztürk et al., 2018). DeepDTA consists of one-dimensional convolution layers that encode the two sequences and fully connected layers that output the affinity value from the concatenated latent features. Figure 5 illustrates the DeepDTA architecture.

Details For all evaluated models, we use a learning rate of 1e-3, a weight decay of 1e-4, a dropout probability of 0.1, and a batch size of 256, except that the identification model uses a learning rate of 1e-1. We train the models for 200 epochs; the identification model is trained for 200 and 100 epochs on Davis and KIBA, respectively, and we save the SGLD samples every 20 and 10 epochs for Davis and KIBA, respectively. The baseline "hard" picks up the top-500 high-loss samples for oversampling.

C.1 ABLATION STUDY

In this subsection, we conduct experiments on CelebA and Waterbirds with 30% label noise to verify the contributions of END's components (Table 3). Specifically, we train models via the END framework with its variants: training the identification model without the MAE loss, without the confidence regularization, or without both. This experiment has several implications. First, utilizing uncertainty is a major factor in improving worst-group accuracy compared to exploiting the error (loss). In particular, every END variant improves the worst-group accuracy over JTT by at least 0.4. This supports one of our main claims: utilizing uncertainty is a proper approach to improving group robustness under noisy labels. Second, the cooperation between the noise-robust loss and the overconfidence regularization contributes significantly to group robustness in the presence of label noise. In practice, the combination of both shows outstanding worst-group accuracy with noisy labels, as shown in Tables 2 and 3. In contrast, without the overconfidence regularization (END w/o reg), utilizing only the noise-robust MAE loss does not show an outstanding improvement over the proposed method (END). We interpret this degradation as being due to the overconfident uncertainty for the minority group, as Model-B in Figure 2 shows. In addition, without the noise-robust loss (END w/o reg and MAE), the worst-group accuracy is degraded since the identification model memorizes the noisy labels, as stated in the last two paragraphs of Section 4. Hence, we conclude that the cooperation between the noise-robust MAE loss and the overconfidence regularization is the key component in the classification tasks, as we argued.
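For reference, a minimal sketch of how the identification model's objective could combine the noise-robust MAE loss with an overconfidence regularizer. The exact form of the regularizer $R_{\mathrm{ent}}$ is not spelled out here, so the entropy term below is an assumption; the coefficient corresponds to the confidence-regularization values reported in Appendix B.

```python
import torch
import torch.nn.functional as F

def identification_loss(logits, targets, reg_coef=0.3):
    """Sketch: noise-robust MAE loss plus an (assumed) entropy-based overconfidence regularizer."""
    probs = F.softmax(logits, dim=1)
    one_hot = F.one_hot(targets, num_classes=logits.size(1)).float()
    mae = (probs - one_hot).abs().sum(dim=1).mean()            # L_MAE, robust to noisy labels
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=1).mean()
    return mae - reg_coef * entropy                            # higher entropy is rewarded (less overconfidence)
```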

C.2 QUALITATIVE EXPERIMENT

We conduct a qualitative experiment on the Waterbirds dataset to examine the characteristics of the SCF set. Specifically, we obtain two separate groups of images: (1) the most uncertain images, whose uncertainty is estimated by the identification model (Figure 7); and (2) randomly chosen images (Figure 8). We observe several abnormal properties of the uncertain (SCF) images. For instance, an uncertain image may have "land background" as its group label but actually show a water background, or vice versa. In addition, several images have ambiguous backgrounds: it is hard to determine whether the background is water or land. In contrast, the randomly chosen images have relatively unambiguous backgrounds. We believe that the ambiguous backgrounds of the SCF set force the model to focus on the image of the bird, the true cause of the label.



Note the similarity between the definition of $D_\tau$ and the condition $F(H_{\beta^*}(x)) \ge 1 - \delta$ in Theorem 1.



Figure 1: An example of the minority groups and their attributes in the cow-camel classification problem.

Figure 2: 2-D classification results of identification models on synthetic data. The blue/red represent classes. The deeper the background, the more confident the prediction. The dots and translucent stars represent training and test data, respectively. Here, the true SCF samples are in dotted circles (they cannot be classified via the spurious feature). The two classification results ((a) and (b)) show that Model-A (proposed) identifies the SCF set well (large overlap between the yellow circles in (b) and the true SCF samples), while Model-B (without regularization) fails due to its overconfident uncertainty on the true SCF samples.

Figure 3: 2-D classification results on the synthetic data. The blue/red represent the classes. The dots and translucent stars represent training and test data, respectively. The minority (SCF) groups are in the dotted circles. The ideal decision boundary is the vertical purple dotted lines. The decision boundary of END is the closest to the ideal boundary.

Figure 4: 2-D PCA projection of the latent features of the identification model (first row) and its variant (w/o $L_{\mathrm{MAE}}$ and $R_{\mathrm{ent}}$) (second row). Each dot represents a training sample. (First column) Green dots are noisy labels and blue dots are clean labels; (second column) yellow dots are the obtained SCF samples; (third column) each color represents the corresponding group.

In this experiment, we use two kinds of baselines: group-robust and noise-robust baselines. The group-robust baselines include JTT (Liu et al., 2021), ERM, and Learning-from-Failure (LfF) (Nam et al., 2020). There is one noise-robust baseline, ERM with GCE (Generalized Cross Entropy) (Zhang & Sabuncu, 2018). We use the identical model architecture, ResNet50 (He et al., 2016), for all baselines and END. For JTT, LfF, and ERM, we follow the identical experimental setup as presented by Liu et al. (2021). Details of the experimental setup are in Appendix B.

It is enough to prove equation 18; equation 19 can then be shown from equation 18 by letting $W = -W$. Let $w^* = \max\{w \in \mathbb{R} \mid F(w) \le a\}$. We have
\[
P[F(W) \le a] = P[W \le w^*] = F(w^*) .
\]
Since $F(w^*) \le a$ by the definition of $w^*$, we have the inequality on the right-hand side. To show the inequality on the left-hand side, suppose $F(w^*) \le a - AC(W)$. Let $w^{**}$ be the smallest real number in the range of $W$ bigger than $w^*$. By the definition of $w^*$, we have $F(w^{**}) > a$. Then,
\[
AC(W) \ge P[W = w^{**}] = P[W \le w^{**}] - P[W \le w^*] = F(w^{**}) - F(w^*) > a - (a - AC(W)) = AC(W) ,
\]
which is a contradiction.

Figure 5: The schema of the DTA task with DeepDTA architecture.

Figure 6: 2-D classification results of the identification model on synthetic data. (a) The prediction results; the colors represent the classes, and the dots and translucent stars represent training and test data, respectively. (b) The histogram of the predictive entropy and the corresponding Gamma distribution. (c) The detected SCF samples.

Figure 7: The images of the waterbirds dataset with high-predictive uncertainty (Top-12).

Table 1: Average accuracy (ACC) and worst-group accuracy (WG Acc) evaluated on the Waterbirds and CelebA datasets with varying noise levels. END consistently outperforms other baselines, especially on the noisy datasets.

Table 2: Evaluation results on the DTA datasets.

Table 3: Worst-group accuracy (WG Acc) and average accuracy (ACC) of END with different loss functions and regularization.

APPENDIX A FORMAL THEOREMS AND PROOFS

In this section, we formally state and prove our main theorem and an additional theorem.
• In A.1, we briefly summarize our conventions on notation.
• In A.2, we formalize the concept of "risk minimization" on our toy dataset and provide the solution.
• In A.3, we list some basic definitions and lemmas that will be used in our theorem statements and proofs.
• In A.4, we restate our main theorem formally.
• In A.5, we prove our main theorem.
• In A.6, we state the additional theorem.
• In A.7, we prove the additional theorem.

A.1 NOTATIONS

Vectors are written in bold, while (one-dimensional) numbers are not. Random variables are always written in upper case, while plain values are written in lower case, except for constants. For example, X is a random vector, while w is a plain real number. Note that this is slightly different from the notation in the main paper, where random variables are written in lower case. X ∼ D reads as "X follows the distribution D" or "X is sampled from D". P stands for probability, and E stands for expectation.

A.2 RISK MINIMIZATION ON THE TOY DATASET

Definition 1 (Risk-minimizing linear solution of a distribution). Let $n \in \mathbb{N}$, and let $D$ be a distribution on $\mathbb{R}^{n+1}$. We call

Definition 3 (The toy dataset). We define $D^{\eta}_{p, p'}$ on $\mathbb{R}^{n+2}$ by its sampling procedure:
1. Sample $Z \sim B_{0.5}$ uniformly.

Proof. We have, where the last equality is valid because if $W \ge 0$, then

Similarly, we have

We can show the remaining statement as follows:

Also, we need a quantitative version of the multi-dimensional central limit theorem:

If the covariance matrix $\Sigma = \Sigma(S)$ is invertible, for $Z \sim N(0, \Sigma)$, we have, where $C$ is an absolute constant and

Now, it is enough to show that the last quantity is $\ge 1 - \epsilon$ when $n \ge N'$ for some $N' = N'(\epsilon, \delta_{\min})$. (Then $N = \max(N', N_{\text{Lemma 6}}(\epsilon/2, M, \delta_{\min}))$ will satisfy the theorem statement.) Since $\gamma \to 0$ as $n \to \infty$ at a convergence rate that depends only on $\epsilon$, it remains to show that the same is true for $\alpha$. Using equation 11 and equation 10, we get

Therefore, by Hoeffding's concentration inequality, since

, finishing the proof.

A.5.3 PROOF OF LEMMA 6

Now we will prove Lemma 6. We will make a bootstrapping argument composed of two stages: (1) proving the theorem under a certain restriction, and (2) proving the general case using the "restricted theorem".

First, let us prove the restricted theorem; the restriction will be clarified later on. Let $0 < \epsilon < 1/2$ and $M \ge 1$. Define $\alpha = \Phi^{-1}(\epsilon/2)$, where $\Phi$ is the cumulative distribution function of the standard normal distribution. We argue that $\delta_0$ satisfies the theorem statement. To show that, take an arbitrary $\delta_{\min} \in (0,$

We have to show that

and

From equation 56 and equation 57, since $\hat{y}_1 y_1 > 0$ and $\hat{y}_2 y_2 \le 0$ but $\mathrm{Err}(\hat{y}_2, y_2) < \mathrm{Err}(\hat{y}_1, y_1)$, we find a contradiction with the definition of a good error function.

