MITIGATING DATASET BIAS BY USING PER-SAMPLE GRADIENT

Abstract

The performance of deep neural networks is strongly influenced by how the training dataset is constructed. In particular, when attributes with a strong correlation with the target attribute are present, the trained model can make unintended prejudgments and show significant inference errors (i.e., the dataset bias problem). Various methods have been proposed to mitigate dataset bias, and their emphasis is on weakly correlated samples, called bias-conflicting samples. Early methods rely on explicit bias labels provided by humans, but such labels are costly to obtain. Recently, several studies have sought to reduce human intervention by utilizing output-space values of neural networks, such as feature space, logits, loss, or accuracy. However, these output-space values may be insufficient for the model to understand the bias attributes well. In this study, we propose a gradient-based debiasing algorithm called Per-sample Gradient-based Debiasing (PGD). PGD comprises three steps: (1) training a model with uniform batch sampling, (2) setting the importance of each sample in proportion to the norm of its gradient, and (3) training the model using importance batch sampling with the probabilities obtained in step (2). Compared with existing baselines on various datasets, the proposed method achieves state-of-the-art accuracy for the classification task. Furthermore, we provide a theoretical account of how PGD mitigates dataset bias.

1. INTRODUCTION

Dataset bias (Torralba & Efros, 2011; Shrestha et al., 2021) is a training-data problem that arises when unintended, easier-to-learn attributes (i.e., bias attributes) with a high correlation to the target attribute are present (Shah et al., 2020; Ahmed et al., 2020). The model can then infer outputs by focusing on the bias features, which can lead to test-time failures. For example, most "camel" images include a "desert background," and this unintended correlation can provide a false shortcut for answering "camel" on the basis of the "desert." In (Nam et al., 2020; Lee et al., 2021), samples with a strong correlation (like the aforementioned desert/camel) are called "bias-aligned samples," while samples with a weak correlation (like "camel on the grass" images) are termed "bias-conflicting samples." To reduce dataset bias, initial studies (Kim et al., 2019; McDuff et al., 2019; Singh et al., 2020; Li & Vasconcelos, 2019) frequently assumed that labels for the bias attributes are provided, but such additional labels require expensive human effort. Alternatively, the bias type, such as "background," is assumed in (Lee et al., 2019; Geirhos et al., 2018; Bahng et al., 2020; Cadene et al., 2019; Clark et al., 2019). However, assuming such bias knowledge from humans is still unreasonable, since even humans cannot predict the type of bias that may exist in a large dataset (Schäfer, 2016). Data for deep learning is typically collected by web crawling without thorough consideration of the dataset bias problem. Recent studies (Le Bras et al., 2020; Nam et al., 2020; Kim et al., 2021; Lee et al., 2021; Seo et al., 2022; Zhang et al., 2022b) have replaced human intervention with DNN outputs, identifying bias-conflicting samples with empirical metrics computed on the output space (e.g., training loss and accuracy). For example, Nam et al.
(2020) suggested a "relative difficulty" based on per-sample training loss and treated a sample with high "relative difficulty" as a bias-conflicting sample. Most previous research has focused on the output space, such as the feature space (penultimate-layer output) (Lee et al., 2021; Kim et al., 2021; Bahng et al., 2020; Seo et al., 2022; Zhang et al., 2022b), loss (Nam et al., 2020), and accuracy (Le Bras et al., 2020; Liu et al., 2021). However, this limited output space can restrict how precisely the data is described. Recently, as an alternative, the model parameter space (e.g., gradients (Huang et al., 2021; Killamsetty et al., 2021b; Mirzasoleiman et al., 2020)) has been used to obtain higher performance gains than output-space approaches on various target tasks. For example, Huang et al. (2021) used the gradient norm to detect out-of-distribution samples and showed that the gradient of the final FC layer (∈ R^{h×c}) captures joint information between the feature and the softmax output, where h and c are the dimensions of the feature and output vectors, respectively. Since the per-sample gradient (∈ R^{h×c}) constitutes high-dimensional information, it is much more informative than output-space quantities such as the logits (∈ R^c) and features (∈ R^h). However, no existing approach tackles the dataset bias problem using a gradient-norm-based metric. In this paper, we present a resampling method based on the per-sample gradient norm to mitigate dataset bias. Furthermore, we theoretically justify that gradient-norm-based resampling is an effective debiasing approach. Our key contributions can be summarized as follows: • We propose Per-sample Gradient-norm based Debiasing (PGD), a simple and efficient gradient-norm-based debiasing method.
PGD is motivated by prior research (Mirzasoleiman et al., 2020; Huang et al., 2021; Killamsetty et al., 2021b) demonstrating that gradients are effective at finding rare samples, which also makes them applicable to finding bias-conflicting samples in the dataset bias problem (see Section 3 and Appendix E). • PGD outperforms other debiasing methods on various benchmarks, such as colored MNIST (CMNIST), multi-bias MNIST (MB-MNIST), corrupted CIFAR (CCIFAR), biased action recognition (BAR), biased FFHQ (BFFHQ), CelebA, and CivilComments-WILDS. In particular, for the colored MNIST case, the proposed method yields unbiased test accuracies 35.94% and 2.32% higher than the vanilla and best baseline methods, respectively. (See Section 4) • We provide theoretical evidence for the superiority of PGD. To this end, we first explain that minimizing the trace of the inverse Fisher information is a good objective for mitigating dataset bias. In particular, PGD, which resamples based on the gradient norm computed by the biased model, is a possible optimizer for this objective. (See Section 5)

2. DATASET BIAS PROBLEM

Classification model. We first describe the conventional supervised learning setting. Consider the classification problem where a training dataset D_n = {(x_i, y_i)}_{i=1}^n, with input image x_i and corresponding label y_i, is given. Assuming there are c ∈ N \ {1} classes, y_i is assigned to one element of the set C = {1, ..., c}. Note that we focus on a situation where the dataset D_n does not contain noisy samples, such as noisy labels or out-of-distribution samples (e.g., SVHN samples when the task is CIFAR-10). When input x_i is given, f(y_i | x_i, θ) denotes the softmax output of the classifier for label y_i, derived from the model parameter θ ∈ R^d. The cross-entropy (CE) loss L_CE is frequently used to train the classifier; when the label is one-hot encoded, it is defined as L_CE(x_i, y_i; θ) = -log f(y_i | x_i, θ).

Figure 1: Target and bias attribute: digit shape and color.

Dataset bias. Suppose that a training set D_n comprises images as shown in Figure 1, and that the objective is to classify the digits. Each image can be described by a set of attributes (e.g., the first image in Figure 1 can be described by {digit 0, red, thin, ...}). The purpose of training the classifier is to find a model parameter θ that correctly predicts the target attribute (e.g., digit). Notably, the target attributes are also interpreted as classes. However, we focus on the case where another attribute that is strongly correlated with the target exists; we call such attributes bias attributes. For example, in Figure 1, the bias attribute is color. Furthermore, samples whose bias attributes are highly correlated with the target attributes are called bias-aligned (top three rows of Figure 1). Conversely, weakly correlated samples are called bias-conflicting (see the bottom row of Figure 1). Therefore, our main scope is training datasets that contain samples whose bias and target attributes are misaligned.
According to (Nam et al., 2020), dataset bias is problematic when the bias attributes are easier to learn than the target attributes, as the trained model may prioritize the bias attributes over the target attributes. For example, a model trained on the images in Figure 1 can output class 4 when the (orange, 0) image (e.g., the bottom-left image) is given, due to the wrong priority on color, which is an easier-to-learn attribute (Nam et al., 2020).

3. PGD: PER-SAMPLE GRADIENT-NORM-BASED DEBIASING

Algorithm 1: Per-sample Gradient-norm based Debiasing (PGD)
/** STEP 1: Train the biased model f_b **/
2: for t = 1, 2, ..., T_b do
3:   Construct a mini-batch B_t = {(x_i, y_i)}_{i=1}^B ~ U.
4:   Update θ_t as: θ_{t-1} - (η/B) ∇_θ Σ_{(x,y)∈B_t} L_GCE(A(x), y; θ_{t-1}, α)
5: end for
/** STEP 2: Calculate h **/
6: Calculate h(x_i, y_i) for all (x_i, y_i) ∈ D_n using (1).
/** STEP 3: Train f_d based on h **/
7: for t = 1, 2, ..., T_d do
8:   Construct a mini-batch B'_t = {(x_i, y_i)}_{i=1}^B ~ h.
9:   Update θ_{T_b+t} as: θ_{T_b+t-1} - (η/B) ∇_θ Σ_{(x,y)∈B'_t} L_CE(A(x), y; θ_{T_b+t-1})
10: end for

In this section, we propose a novel debiasing algorithm, coined PGD. PGD consists of two models, a biased model f_b and a debiased model f_d, with parameters θ_b and θ_d, respectively. The two models are trained sequentially. Obtaining the final debiased model f_d involves three steps: (1) train the biased model, (2) compute the sampling probability of each sample, and (3) train the debiased model. These steps are described in Algorithm 1.

Step 1: Training the biased model. In the first step, the biased model is trained on mini-batches sampled from a uniform distribution U, similar to conventional SGD-based training, with data augmentation A. The role of the biased model is twofold: it detects which samples are bias-conflicting and calculates how much they should be highlighted. Therefore, we should make the biased model uncertain when it faces bias-conflicting samples.
To this end, the biased model f_b is trained with the generalized cross-entropy (GCE) loss L_GCE to amplify its bias, motivated by (Nam et al., 2020). For an input image x and the corresponding true class y, L_GCE is defined as L_GCE(x, y; θ, α) = (1 - f(y|x, θ)^α) / α. Note that α ∈ (0, 1] is a hyperparameter that controls the degree of emphasis on easy-to-learn samples, namely bias-aligned samples. When α → 0, the GCE loss L_GCE reduces to the conventional CE loss L_CE. We set α = 0.7, as done by Zhang & Sabuncu (2018), Nam et al. (2020), and Lee et al. (2021).

Step 2: Compute the gradient-based sampling probability. In the second step, the sampling probability of each sample is computed from the trained biased model. Since rare samples have large gradient norms compared with usual samples under the biased model (Hsu et al., 2020), the sampling probability of each sample is set proportional to its gradient norm so that bias-conflicting samples are over-sampled. Before computing the sampling probability, the per-sample gradient with respect to L_CE is obtained from the biased model for all (x_i, y_i) ∈ D_n. We propose the following sampling probability h(x_i, y_i), proportional to the gradient norm:

h(x_i, y_i) = ‖∇_θ L_CE(x_i, y_i; θ_b)‖_s^r / Σ_{(x_j, y_j) ∈ D_n} ‖∇_θ L_CE(x_j, y_j; θ_b)‖_s^r,   (1)

where ‖·‖_s^r denotes the r-th power of the L_s norm, and θ_b is the result of Step 1. Since h(x_i, y_i) is a sampling probability on D_n, it is a normalized gradient norm. Note that computing the gradient for all samples requires substantial computing resources and memory. Therefore, we extract only the gradient of the final FC-layer parameters. This is a frequently used technique for reducing computational complexity (Ash et al., 2019; Mirzasoleiman et al., 2020; Killamsetty et al., 2021b;a; 2020).
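As a concrete illustration of Step 1's loss, the GCE definition above can be sketched in a few lines of NumPy. This is a minimal sketch (the function names are ours, not from the paper's code):

```python
import numpy as np

def softmax(z):
    """Row-wise softmax over logits of shape (n, c)."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def generalized_ce(logits, targets, alpha=0.7):
    """L_GCE(x, y; theta, alpha) = (1 - f(y|x, theta)**alpha) / alpha.

    As alpha -> 0 this recovers the ordinary CE loss -log f(y|x, theta);
    larger alpha emphasizes easy-to-learn (bias-aligned) samples.
    """
    p_true = softmax(logits)[np.arange(len(targets)), targets]
    return np.mean((1.0 - p_true ** alpha) / alpha)
```

Setting `alpha` close to zero numerically reproduces the CE loss, matching the limit stated above.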
In other words, instead of h(x_i, y_i), we empirically utilize

ĥ(x_i, y_i) = ‖∇_{θ_fc} L_CE(x_i, y_i; θ_b)‖_s^r / Σ_{(x_j, y_j) ∈ D_n} ‖∇_{θ_fc} L_CE(x_j, y_j; θ_b)‖_s^r,

where θ_fc denotes the parameters of the final FC layer. We use r = 1 and s = 2 (i.e., the L_2 norm) by default and present ablation studies on various r and s in Section 4.

Step 3: Training the debiased model. Finally, the debiased model f_d is trained using mini-batches sampled with the probability h(x_i, y_i) obtained in Step 2. Note that, as described in Algorithm 1, the debiased model inherits the model parameters θ_{T_b} of the biased model. However, Lee et al. (2021) argued that merely oversampling bias-conflicting samples does not successfully debias, and that this unsatisfactory result stems from a lack of data diversity (i.e., data augmentation techniques are required). Hence, we apply simple randomized augmentation operations A, such as random rotation and random color jitter, when oversampling the bias-conflicting samples.
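Putting Step 2 together: because only the final FC-layer gradient is extracted, the per-sample CE gradient has the closed form (softmax − one-hot) ⊗ feature, so the norms can be computed without per-sample backward passes. The sketch below (all variable names are ours; the features, weights, and labels are random stand-ins) computes ĥ with r = 1, s = 2 and draws an importance-sampled batch as in Step 3:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fc_grad_norms(features, W, b, labels):
    """Per-sample L2 norm of the final FC-layer CE gradient (the r=1, s=2 case).

    For cross-entropy, grad_W L = (p - e_y) f^T and grad_b L = (p - e_y),
    so no backward pass through the backbone is needed.
    """
    p = softmax(features @ W.T + b)
    p[np.arange(len(labels)), labels] -= 1.0       # p - one_hot(y)
    d = np.linalg.norm(p, axis=1)                  # ||p - e_y||_2
    # ||(p - e_y) f^T||_F = ||p - e_y|| * ||f||; include the bias gradient too.
    return d * np.sqrt((features ** 2).sum(axis=1) + 1.0)

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))                   # penultimate-layer features (stand-ins)
W, b = rng.normal(size=(3, 16)), np.zeros(3)       # final FC layer (hypothetical)
y = rng.integers(0, 3, size=8)

g = fc_grad_norms(feats, W, b, y)
h_hat = g / g.sum()                                # sampling probabilities (Step 2)
batch_idx = rng.choice(len(h_hat), size=4, p=h_hat)  # importance sampling (Step 3)
```

Samples with large final-layer gradient norms (typically bias-conflicting ones) receive proportionally higher sampling probability.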

4. EXPERIMENTS

In this section, we demonstrate the effectiveness of PGD on multiple benchmarks compared with previously proposed baselines. Detailed analyses not included in this section (e.g., training time, an unbiased case study, easier-to-learn target attributes, sampling probability analysis, and reweighting with PGD) are provided in Appendix E.

4.1. BENCHMARKS

To precisely examine the debiasing performance of PGD, we used the Colored MNIST, Multi-bias MNIST, and Corrupted CIFAR datasets as synthetic datasets, which simulate situations in which the model learns bias attributes first. The real-world BFFHQ, BAR, CelebA, and CivilComments-WILDS datasets are used to observe situations in which general algorithms perform poorly due to bias attributes. Note that BFFHQ and BAR are biased using human prior knowledge, while CelebA and CivilComments-WILDS are naturally biased datasets. A detailed explanation of each benchmark is presented in Appendix A.

Colored MNIST (CMNIST).

CMNIST is a modified version of the MNIST dataset (LeCun et al., 2010), where color is the bias attribute and digit serves as the target. We randomly selected ten colors to be injected into the digits. Evaluation was conducted for various ratios ρ ∈ {0.5%, 1%, 5%}, where ρ denotes the portion of bias-conflicting samples. Note that CMNIST has only one bias attribute: color.
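A minimal sketch of how such a colored sample could be generated (the palette, foreground-masking rule, and function name are our assumptions, not the paper's exact construction):

```python
import numpy as np

rng = np.random.default_rng(0)
PALETTE = rng.integers(50, 256, size=(10, 3))     # one (assumed) RGB color per digit class

def colorize(digit_img, label, rho=0.005):
    """Create one Colored-MNIST-style sample from a grayscale digit.

    With probability 1 - rho the injected color is the class-aligned one
    (bias-aligned); otherwise a random other color is used (bias-conflicting).
    """
    if rng.random() < rho:                        # bias-conflicting sample
        color_id = int(rng.choice([c for c in range(10) if c != label]))
    else:                                         # bias-aligned sample
        color_id = label
    rgb = np.zeros((*digit_img.shape, 3), dtype=np.uint8)
    rgb[digit_img > 0] = PALETTE[color_id]        # paint foreground pixels only
    return rgb, color_id
```

With ρ = 0.5%, roughly one sample in two hundred receives a mismatched color, matching the smallest ratio used above.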

Multi-bias MNIST (MB-MNIST).

The authors of (Shrestha et al., 2021) stated that CMNIST is too simple to examine the applicability of debiasing algorithms to complex bias cases. However, the dataset that Shrestha et al. (2021) generated is also not complex, since it uses only simple artificial patterns (e.g., straight lines and triangles) rather than a real-world pattern dataset (e.g., MNIST). Therefore, we generated MB-MNIST, a benchmark that reflects the real world better than (Shrestha et al., 2021). MB-MNIST consists of eight attributes: digit (LeCun et al., 2010), alphabet (Cohen et al., 2017), fashion object (Xiao et al., 2017), Japanese character (Clanuwat et al., 2018), digit color, alphabet color, fashion object color, and Japanese character color. Among the eight attributes, the target attribute is digit shape, and the others are bias attributes. To construct MB-MNIST, we follow the CMNIST protocol, which generates bias by aligning two different attributes (i.e., digit and color) with probability (1 - ρ). The MB-MNIST dataset is made by independently aligning the digit and each of the seven other attributes with probability (1 - ρ). Note that the rarest sample, which conflicts with all seven bias attributes, is generated with probability ρ^7. When ρ is set as in the CMNIST case, it is too low to generate sufficient misaligned samples. Therefore, we use ρ ∈ {10%, 20%, 30%} to ensure trainability. Corrupted CIFAR (CCIFAR). CIFAR10 (Krizhevsky et al., 2009) comprises ten different objects, such as airplanes and cars. Corrupted CIFAR is biased with ten different types of texture bias (e.g., frost and brightness). The dataset was constructed by following the design protocol of (Hendrycks & Dietterich, 2019), and the ratios ρ ∈ {0.5%, 1%, 5%} are used. Biased action recognition (BAR). The biased action recognition dataset was derived from (Nam et al., 2020). It comprises six action classes (e.g., climbing and diving), and each class is biased with respect to place.
For example, diving-class pictures are usually taken underwater, while a few images are taken in a diving pool. Biased FFHQ (BFFHQ). The BFFHQ dataset was constructed from the facial dataset FFHQ (Karras et al., 2019). It was first proposed in (Kim et al., 2021) and was used in (Lee et al., 2021). It comprises two gender classes, and each class is biased with respect to age. For example, most female images are young, while most male images are old. This benchmark uses ρ = 0.5%. CelebA. CelebA (Liu et al., 2015) is a common real-world face classification dataset. The goal is to classify the hair color ("blond" or "not blond") of celebrities, which has a spurious correlation with the gender ("male" or "female") attribute: almost all blond images are female. We report the average accuracy and the worst-group accuracy on the test dataset. CivilComments-WILDS. CivilComments-WILDS (Borkan et al., 2019) is a dataset for classifying whether an online comment is toxic or non-toxic. Mentions of certain demographic identities (male, female, White, Black, LGBTQ, Muslim, Christian, and other religion) are spuriously correlated with the label.
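As a side note on the MB-MNIST construction above, the ρ^7 rarity argument is easy to quantify; the training-set size n = 60,000 is our illustrative assumption:

```python
# Expected number of MB-MNIST samples that conflict with all seven bias
# attributes simultaneously (probability rho**7), assuming n = 60,000 images.
n = 60_000
for rho in (0.005, 0.01, 0.05, 0.10, 0.20, 0.30):
    print(f"rho={rho:5.3f}  expected fully-conflicting samples: {n * rho**7:10.2e}")
```

Even at ρ = 30% only about a dozen such fully conflicting samples are expected, while at CMNIST-level ratios the expected count is essentially zero, which illustrates why larger ρ is needed for trainability.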

4.2. IMPLEMENTATION.

Baselines. We select baselines with official code available from the respective authors among debiasing methods that require no prior knowledge of the bias. Our baselines comprise six methods across the various tasks: the vanilla network, LfF (Nam et al., 2020), JTT (Liu et al., 2021), Disen (Lee et al., 2021), EIIL (Creager et al., 2021), and CNC (Zhang et al., 2022b). Implementation details. We use three types of networks: two simple convolutional networks (SimConv-1 and SimConv-2) and ResNet18 (He et al., 2016). The network implementation is described in Appendix B. Colored MNIST is trained with the SGD optimizer, batch size 128, learning rate 0.02, weight decay 0.001, momentum 0.9, learning rate decay 0.1 every 40 epochs, 100 training epochs, and GCE parameter α = 0.7. Multi-bias MNIST also uses the SGD optimizer, with batch size 32, learning rate 0.01, weight decay 0.0001, momentum 0.9, and learning rate decay 0.1 with decay step 40; it runs for 100 epochs with GCE parameter 0.7. For Corrupted CIFAR and BFFHQ, we use ResNet18 as the backbone network with exactly the same settings as Disen (Lee et al., 2021). For CelebA, we follow the experimental setting of (Zhang et al., 2022b), which uses ResNet50 as the backbone network. For CivilComments-WILDS, we use exactly the same hyperparameters as (Liu et al., 2021) and a pretrained BERT. To reduce the computational cost of extracting per-sample gradients, we use only the final fully connected layer, similar to (Ash et al., 2019; Mirzasoleiman et al., 2020; Killamsetty et al., 2021b;a; 2020). Except for CivilComments-WILDS and CelebA, we utilize data augmentation such as color jitter, random resized crop, and random rotation. See Appendix B for more details.

4.3. RESULTS AND EMPIRICAL ANALYSIS

Accuracy results. In Table 1, we present comparisons of the image classification accuracy on the unbiased test sets. The proposed method outperforms the baseline methods on all benchmarks and at all ratios. For example, our model performs 35.94% better than the vanilla model on the colored MNIST benchmark with ratio ρ = 0.5%. In the same setting, PGD performs 2.32% better than Disen. As pointed out in (Shrestha et al., 2021), colored MNIST is too simple to evaluate debiasing performance on the basis of baseline performance. In the Multi-bias MNIST case, the other models fail to obtain high unbiased test results even when the ratio is high (e.g., 10%). In this complex setting, PGD shows superior performance over the other methods. For example, its performance is 36.15% and 35.63% higher than that of the vanilla model and Disen, respectively, at a ratio of 10%. Similar to the results on the bias-feature-injected benchmarks, as shown in Table 2 and Table 3, PGD shows competitive performance among all debiasing algorithms on the raw image benchmarks (BAR, BFFHQ, and CelebA). For example, on the BFFHQ benchmark, the accuracy of PGD is 1.43% better than that of Disen. As shown in Table 3, PGD also outperforms the other baselines on CivilComments-WILDS, a much more realistic NLP task. Therefore, we believe PGD also works well with transformers and is applicable to the real world.

Figure 2: Average PGD results for various norms, {L_1, L_2, L_2^2, L_∞}, on the feature-injected benchmarks; panel (c) shows Corrupted CIFAR. The error bars represent the standard deviation of three independent trials.

Unbiased test accuracy for various norms. Since the gradient norm admits various configurations (e.g., the order of the norm), we report four configurations of gradient norms. As shown in Figure 2, all norms achieve significant unbiased test performance. Among them, the squared-L_2 case shows lower unbiased performance than the other cases.
Therefore, we recommend using the first power of the L_1, L_2, or L_∞ norm in PGD to overcome the dataset bias problem. This differs from the findings of (Huang et al., 2021), which suggested that the L_1 norm is the best choice for out-of-distribution detection. Ablation study. Table 4 shows the importance of each module in our method: the generalized cross-entropy and data augmentation modules. We set the ratio ρ to 0.5% and 10% for CMNIST and MB-MNIST, respectively. We observe that GCE is more important than data augmentation for CMNIST, whereas data augmentation contributes more than GCE for MB-MNIST. In all cases, using both modules outperforms the other configurations.
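The four norm configurations compared in this ablation can be expressed with a single helper (a sketch; the function name is ours):

```python
import numpy as np

def norm_rs(grad, r=1, s=2):
    """Compute ||grad||_s ** r, the quantity normalized to obtain h(x_i, y_i).

    (r, s) = (1, 2) is PGD's default L2 norm; (2, 2) is the squared-L2
    variant; s = np.inf gives the max (L-infinity) norm.
    """
    flat = np.abs(np.ravel(grad))
    base = flat.max() if np.isinf(s) else (flat ** s).sum() ** (1.0 / s)
    return base ** r
```

For a gradient with entries (3, 4), this yields 7, 5, 25, and 4 for the L_1, L_2, L_2^2, and L_∞ configurations, respectively.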

5. MATHEMATICAL UNDERSTANDING OF PGD

This section provides a theoretical analysis of per-sample gradient-norm-based debiasing. We first briefly review the maximum likelihood estimator (MLE) and Fisher information (FI), the ingredients of this section. We then interpret the debiasing problem as a min-max problem and deduce that solving this min-max problem can be recast as minimizing the trace of the inverse FI. Since handling the trace of the inverse FI is difficult owing to the inverse computation, we build intuition by relaxing it to a one-dimensional toy example. In the end, we conclude that the gradient-norm-based resampling method is an attempt to solve the dataset bias problem.

5.1. PRELIMINARY

Training and test joint distributions. The general joint distribution P(x, y|θ) is assumed to factor into the parameterized conditional distribution f(y|x, θ) and the marginal distribution P(x), which is independent of the model parameter θ, i.e., P(x, y|θ) = P(x) f(y|x, θ). We refer to the model f(y|x, θ⋆) that produces the exact correct answer as the oracle model, and to its parameter θ⋆ as the oracle parameter. The training dataset D_n = {(x_i, y_i)}_{i=1}^n is sampled from p(x) f(y|x, θ⋆), where the training and test marginal distributions are denoted by p(x) and q(x), respectively. Here, we assume that both marginal distributions are defined on the marginal distribution space M = {P(x) | ∫_{x∈X} P(x) dx = 1}, where X denotes the input data space; i.e., p(x), q(x) ∈ M.

The space H of sampling probabilities h. When the training dataset D_n is given, we denote the sampling probability by h(x), defined on the probability space H:

H = {h(x) | Σ_{(x_i, y_i) ∈ D_n} h(x_i) = 1, h(x_i) ≥ 0 ∀(x_i, y_i) ∈ D_n}.

Maximum likelihood estimator (MLE). When h(x) is the sampling probability, we define the MLE θ̂_{h(x), D_n} as follows:

θ̂_{h(x), D_n} = arg min_θ - Σ_{(x_i, y_i) ∈ D_n} h(x_i) log f(y_i | x_i, θ).   (2)

Note that the MLE θ̂_{h(x), D_n} depends on both the sampling probability h(x) and the dataset D_n.

Fisher information (FI). FI, denoted by I_{P(x)}(θ), is an information measure of samples from a given distribution P(x, y|θ). It is defined as follows:

I_{P(x)}(θ) = E_{(x,y) ∼ P(x) f(y|x, θ)} [∇_θ log f(y|x, θ) ∇_θ^⊤ log f(y|x, θ)].   (3)

FI provides a guideline for understanding the test cross-entropy loss of the MLE θ̂_{U(x), D_n}. When the training set is sampled from p(x) f(y|x, θ⋆) and the test samples are generated from q(x) f(y|x, θ⋆), we can bound the test loss of the MLE θ̂_{U(x), D_n} using FI as follows. Theorem 1.
Suppose Assumption 1 in Appendix F and Assumption 2 in Appendix G hold. Then, for sufficiently large n = |D_n|, the following holds with high probability:

E_{(x,y) ∼ q(x) f(y|x, θ⋆)} E_{D_n ∼ p(x) f(y|x, θ⋆)} [-log f(y | x, θ̂_{U(x), D_n})] ≤ (1/2n) Tr[I_{p(x)}(θ̂_{U(x), D_n})^{-1}] Tr[I_{q(x)}(θ⋆)].   (4)

Here is the proof sketch: the left-hand side of (4) converges to a term related to the Fisher information ratio (FIR), Tr[I_{p(x)}(θ⋆)^{-1} I_{q(x)}(θ⋆)]. The FIR can then be decomposed into two trace terms with respect to the training and test marginal distributions p(x) and q(x).

Empirical Fisher information (EFI). In practice, the exact FI (3) cannot be computed, since we do not know the exact data generation distribution P(x) f(y|x, θ). For practical reasons, the empirical Fisher information (EFI) is commonly used (Jastrzębski et al., 2017; Chaudhari et al., 2019) to reduce the computational cost of gathering gradients for all possible classes when x is given. In the present study, we use a slightly generalized EFI that involves the sampling probability h(x) ∈ H as follows:

Î_{h(x)}(θ) = Σ_{(x_i, y_i) ∈ D_n} h(x_i) ∇_θ log f(y_i | x_i, θ) ∇_θ^⊤ log f(y_i | x_i, θ).   (5)

Note that the conventional EFI corresponds to the case where h(x) is uniform. EFI provides a guideline for understanding the test cross-entropy loss of the MLE θ̂_{h(x), D_n}.
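As a concrete illustration, the generalized EFI in (5) and the trace-of-inverse quantity it feeds into can be sketched numerically; the per-sample gradients below are random stand-ins for ∇_θ log f(y_i|x_i, θ):

```python
import numpy as np

def empirical_fisher(G, h):
    """Generalized EFI (5): I_h(theta) = sum_i h_i * g_i g_i^T,
    where g_i = grad_theta log f(y_i | x_i, theta) is row i of G."""
    return (G * h[:, None]).T @ G

rng = np.random.default_rng(0)
G = rng.normal(size=(200, 5))            # per-sample gradients (stand-ins), d = 5
h_uniform = np.full(200, 1.0 / 200)      # conventional EFI uses uniform h
I_hat = empirical_fisher(G, h_uniform)

# Tr[I_h^{-1}]: the quantity minimized over h in the debiasing objective below.
objective = np.trace(np.linalg.inv(I_hat))
```

With uniform h this reduces to the conventional EFI, (1/n) Σ_i g_i g_i^T.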

5.2. UNDERSTANDING DATASET BIAS PROBLEM VIA MIN-MAX PROBLEM

Debiasing formulation from the perspective of a min-max problem. We formulate the dataset bias problem as described in Definition 1.

Definition 1. When the training dataset D_n ∼ p(x) f(y|x, θ⋆) is given, the debiasing objective is

min_{h(x) ∈ H} max_{q(x) ∈ M} E_{(x,y) ∼ q(x) f(y|x, θ⋆)} [-log f(y | x, θ̂_{h(x), D_n})].   (6)

Equation (6) is a min-max problem, a type of robust optimization. Similar formulations for solving the dataset bias problem can be found in (Arjovsky et al., 2019; Bao et al., 2021; Zhang et al., 2022a). However, those works assume that the training data is divided into several groups, and the model minimizes the worst inference error over reweighted group datasets. In contrast, the objective in (6) minimizes the worst-case test loss without explicit data groups, where the test distribution can be arbitrary. The meaning of Definition 1 is that we train the model θ̂_{h(x), D_n} so that the loss on the worst-case test distribution (max_{q(x)}) is minimized by controlling the sampling probability h(x) (min_{h(x)}). Note that since we cannot control the given training dataset D_n or the test marginal distribution q(x), the only controllable term is the sampling probability h(x). Therefore, from Theorem 1 and the EFI, we design a practical objective function for the dataset bias problem as follows:

min_{h(x) ∈ H} Tr[Î_{h(x)}(θ̂_{h(x), D_n})^{-1}].   (7)

5.3. MEANING OF PGD IN TERMS OF (7).

In this section, we present an analysis of PGD with respect to (7). Directly solving (7) is difficult because computing the trace of an inverse matrix is computationally expensive. Therefore, we build intuition for (7) in a one-dimensional toy scenario.

One-dimensional example. We assume that D_n comprises two sets, M and m, such that the elements of each set share the same loss function. For example, the loss functions of the elements in sets M and m are (1/2)(θ + a)^2 and (1/2)(θ - a)^2 for a given constant a, respectively. We also assume that each sample of M and m has the set-dependent probability mass h_M(x) and h_m(x), respectively. With these settings, our objective is to determine h⋆(x) = arg min_{h(x) ∈ H} Tr[Î_{h(x)}(θ̂_{h(x), D_n})^{-1}]. Thanks to the model's simplicity, we can find h⋆(x) in closed form with respect to the gradients at θ̂_{U(x), D_n} for each set, i.e., g_M(θ̂_{U(x), D_n}) and g_m(θ̂_{U(x), D_n}).

Theorem 2. Under the above setting, the solution of (h⋆_M(x), h⋆_m(x)) = arg min_{h(x) ∈ H} Tr[Î_{h(x)}(θ̂_{h(x), D_n})^{-1}] is:

h⋆_M(x) = |g_M(θ̂_{U(x), D_n})| / Z,   h⋆_m(x) = |g_m(θ̂_{U(x), D_n})| / Z,

where Z is the normalization constant. The proof of Theorem 2 is provided in Appendix H. Note that h⋆_M(x) and h⋆_m(x) are computed using the biased model trained with batches sampled from the uniform distribution U(x); this matches the second step of PGD.

PGD tries to minimize (7). Theorem 2 implies that (7) can be minimized by sampling each example in proportion to its gradient norm. Because the basis of PGD is oversampling based on the gradient norm of the biased model, we can deduce that PGD strives to satisfy (7). Furthermore, we empirically show that PGD reduces the trace of the inverse EFI in the high-dimensional case, as evident in Figure 3.

Figure 3: Tr[Î_U(θ̂_{h, D_n})^{-1}] of vanilla training versus PGD across bias-conflict ratios ρ: (a) Colored MNIST; (b) Multi-bias MNIST.
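A quick numerical check of the one-dimensional example (the grid search and all names are ours): the total sampling mass that gradient-norm resampling places on set M coincides with the brute-force minimizer of Tr[Î_h(θ̂_h)^{-1}].

```python
import numpy as np

a, n, N = 1.0, 100, 90          # N samples in the majority set M, n - N in m

def tr_inv_efi(t):
    """Tr[I_h^{-1}] when total sampling mass t is on set M.

    With losses (1/2)(theta + a)^2 on M and (1/2)(theta - a)^2 on m,
    the weighted MLE is theta_hat = a * (1 - 2t), and the scalar EFI is
    t * g_M(theta_hat)^2 + (1 - t) * g_m(theta_hat)^2.
    """
    theta = a * (1 - 2 * t)
    return 1.0 / (t * (theta + a) ** 2 + (1 - t) * (theta - a) ** 2)

# Brute-force minimizer over the mass placed on M.
ts = np.linspace(0.01, 0.99, 9801)
t_star = ts[np.argmin([tr_inv_efi(t) for t in ts])]

# PGD's choice: per-sample mass proportional to |gradient| at the model
# trained with uniform sampling (Step 1 of PGD).
theta_u = a * (1 - 2 * N / n)                  # MLE under uniform sampling
gM, gm = abs(theta_u + a), abs(theta_u - a)    # per-sample gradient magnitudes
t_pgd = N * gM / (N * gM + (n - N) * gm)       # resulting total mass on M

print(t_star, t_pgd)   # both ≈ 0.5
```

Gradient-norm resampling equalizes the total mass between the majority and minority sets, which is exactly where the trace of the inverse EFI is minimized in this toy model.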

6. RELATED WORK

Debiasing with bias labels. In (Goyal et al., 2017; 2020), a debiased dataset was generated using human labor. Various studies (Alvi et al., 2018; Kim et al., 2019; McDuff et al., 2019; Singh et al., 2020; Teney et al., 2021) have attempted to reduce dataset bias using explicit bias labels. Some of these studies (Alvi et al., 2018; Kim et al., 2019; McDuff et al., 2019; Singh et al., 2020; Li et al., 2018; Li & Vasconcelos, 2019) used per-sample bias labels to reduce the influence of the bias attributes when classifying target labels. Furthermore, Tartaglione et al. (2021) proposed the EnD regularizer, which entangles target-correlated features and disentangles biased attributes. Singh et al. (2020) proposed a new overlap loss defined by a class activation map (CAM); the overlap loss reduces the overlap between the CAM outputs for the bias labels and the target labels. The authors of (Li & Vasconcelos, 2019; Li et al., 2018) employed bias labels to detect bias-conflicting samples and oversampled them to debias. In (Liu et al., 2021), a reweighting method based on per-sample accuracy was proposed; Liu et al. (2021) used bias labels in the validation dataset to tune hyperparameters. On the other hand, there has also been a focus on fairness with respect to each attribute (Hardt et al., 2016; Woodworth et al., 2017; Pleiss et al., 2017; Agarwal et al., 2018), with the goal of preventing bias attributes from affecting the final decision of the trained model. Debiasing with bias context. In contrast to studies assuming explicit bias labels, a few studies (Geirhos et al., 2018; Wang et al., 2018; Lee et al., 2019; Bahng et al., 2020; Cadene et al., 2019; Clark et al., 2019) assumed that the bias context is known. In (Geirhos et al., 2018; Wang et al., 2018; Lee et al., 2019), debiasing was performed by directly modifying the known bias context.
In particular, the authors of (Geirhos et al., 2018) empirically showed that CNNs trained on ImageNet (Deng et al., 2009) were biased toward image texture, and they generated Stylized ImageNet to mitigate the texture bias, while Lee et al. (2019) and Wang et al. (2018) inserted a filter in front of the models so that the influence of image backgrounds and colors could be removed. Meanwhile, some studies (Bahng et al., 2020; Clark et al., 2019; Cadene et al., 2019) mitigated bias by reweighting bias-conflicting samples: Bahng et al. (2020) used specific types of CNNs, such as BagNet (Brendel & Bethge, 2018), to capture the texture bias, which was then reduced using the Hilbert-Schmidt independence criterion (HSIC). In the visual question answering (VQA) task, Clark et al. (2019) and Cadene et al. (2019) conducted debiasing using an entropy regularizer or the sigmoid output of a biased model, exploiting the fact that the biased model was biased toward the question. Debiasing without human supervision. Owing to the impractical assumption that bias information is given, recent studies have aimed to mitigate bias without human supervision (Le Bras et al., 2020; Nam et al., 2020; Darlow et al., 2020; Kim et al., 2021; Lee et al., 2021). Le Bras et al. (2020) identified bias-conflicting samples by sorting the average accuracy over multiple train-test iterations and performed debiasing by training on the samples with low average accuracy. In (Ahmed et al., 2020), each class is divided into two clusters based on the IRMv1 penalty (Arjovsky et al., 2019) using the trained biased model, and the debiased model is trained so that the outputs of the two clusters become similar. Furthermore, Kim et al. (2021) used a Swap Auto-Encoder (Park et al., 2020) to generate bias-conflicting samples, and Darlow et al. (2020) proposed modifying the latent representation with an auto-encoder to generate bias-conflicting samples. Lee et al. (2021) and Nam et al.
(2020) proposed debiasing algorithms based on weighted training with a relative difficulty score, which is measured by the per-sample training loss. Specifically, Lee et al. (2021) used feature-mixing techniques to enrich the feature information of the dataset. Seo et al. (2022) and Sohoni et al. (2020) proposed unsupervised-clustering-based debiasing methods. Recently, a contrastive-learning-based method (Zhang et al., 2022b) and a self-supervised-learning-based method (Kim et al., 2022) have been proposed. On the other hand, there have been studies (Li & Xu, 2021; Lang et al., 2021; Krishnakumar et al., 2021) that identify the bias attributes of the training dataset without human supervision.

7. CONCLUSION

We propose a gradient-norm-based dataset oversampling method for mitigating the dataset bias problem. The main intuition of this work is that gradients contain abundant information about each sample. Since bias-conflicting samples are more difficult to learn than bias-aligned samples, the bias-conflicting samples have a higher gradient norm than the others. Through various experiments and ablation studies, we demonstrate the effectiveness of our gradient-norm-based oversampling method, called PGD. Furthermore, we formulate the dataset bias problem as a min-max problem and show theoretically that it can be relaxed to minimizing the trace of the inverse Fisher information. We provide empirical and theoretical evidence that PGD approximately solves this trace-minimization problem. Despite these successful outcomes and analyses, two directions remain for future work: relaxing the approximations, such as the toy example, used to understand PGD, and handling cases where the given training dataset is corrupted, such as with noisy labels. We hope that this study will help researchers better understand the dataset bias problem.
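The three-step procedure summarized above (train on uniform sampling, weight each sample by its gradient norm, resample by those weights) can be sketched as follows. This is a minimal illustration with precomputed gradient norms standing in for Step 1's biased model, not the authors' implementation:

```python
import numpy as np

def pgd_sampling_probs(grad_norms):
    """Step 2 of PGD: importance of each sample proportional to its gradient norm."""
    g = np.asarray(grad_norms, dtype=float)
    return g / g.sum()

# Toy illustration: the bias-conflicting sample (large gradient norm) is oversampled.
rng = np.random.default_rng(0)
grad_norms = np.array([0.1, 0.1, 0.1, 0.9])   # last sample: bias-conflicting
probs = pgd_sampling_probs(grad_norms)
batch = rng.choice(4, size=10_000, p=probs)    # Step 3: importance-batch sampling
```

Under these weights the bias-conflicting sample is drawn roughly 75% of the time, so Step 3 sees a far more balanced stream of bias-aligned and bias-conflicting samples than uniform sampling would provide.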

Appendix: Mitigating Dataset Bias by Using Per-sample Gradient

Due to the page constraint, this supplementary material includes additional results and theoretical proofs that are not in the main manuscript. Section A demonstrates how the datasets are created. Section B.1 contains implementation details such as hyperparameters, computing resources, and a brief explanation of the baselines. Sections C and D include case studies and empirical evidence for PGD. Section E presents additional experimental results. In Section F, we provide a notation summary and the assumptions for the theoretical analysis. Sections G and H include the proofs of Theorem 1 and Theorem 2, respectively, along with the required lemmas.

A BENCHMARKS AND BASELINES

We explain the datasets utilized in Section 4. In short, we build the MNIST variants from scratch, while the others (BAR, Corrupted CIFAR, and BFFHQ) are downloaded from online repositories.

A.1 CONTROLLED BIASED BENCHMARKS

Colored MNIST (CMNIST) The MNIST dataset (LeCun et al., 2010) is composed of single-channel grayscale handwritten digit images of size 28 × 28. We inject color into these gray images to give them two main attributes: color and digit shape. This benchmark comes from related works (Nam et al., 2020; Kim et al., 2021; Lee et al., 2021; Bahng et al., 2020). At the beginning of the generation, ten RGB colors {C_i}_{i∈[10]}, C_i ∈ R^3, are sampled uniformly. Given a constant ρ, the ratio of bias-conflicting samples, each sample (x, y) is colored by the following steps: (1) Choose bias-conflicting or bias-aligned: draw u ∼ U(0, 1) and assign the sample to the bias-conflicting set when u < ρ, otherwise to the bias-aligned set. In the experiments, we use ρ ∈ {0.5%, 1%, 5%}. (2) Coloring: note that each C_i ∈ R^3 (i ∈ [10]) is the bias-aligned color vector for digit i. For a bias-aligned image with digit y, color the digit with c ∼ N(C_y, σI_{3×3}). For a bias-conflicting image with digit y, first uniformly sample C_{U_y} ∈ {C_i}_{i∈[10]\{y}}, and color the digit with c ∼ N(C_{U_y}, σI_{3×3}). In the experiments, we set σ to 0.0001. We use 55,000 samples for training, 5,000 samples for validation (i.e., 10%), and 10,000 samples for testing. Note that the test samples are unbiased, which corresponds to ρ = 90%.

Multi-bias MNIST Multi-bias MNIST has images of size 56 × 56. This dataset aims to test the case where there are multiple bias attributes.
To accomplish this, we inject a total of seven bias attributes: digit color, fashion object, fashion-object color, Japanese character, Japanese-character color, English character, and English-character color, with digit shape serving as the target attribute. We inject each bias independently into each sample, as in the CMNIST case (i.e., sampling and injecting bias). We also set ρ = 90% for all bias attributes to generate an unbiased test set. As with CMNIST, we use 55,000 samples for training, 5,000 samples for validation, and 10,000 samples for testing.

Corrupted CIFAR This dataset was generated by injecting corruption filters into the CIFAR10 dataset (Krizhevsky et al., 2009); the benchmark is inspired by (Nam et al., 2020; Lee et al., 2021). Here, the target attribute and the bias attribute are the object and the corruption type, respectively; examples of corruption are {Snow, Frost, Fog, Brightness, Contrast, Spatter, Elastic, JPEG, Pixelate, Saturate}. We downloaded this benchmark from the official code repository of Disen (Lee et al., 2021). This dataset contains 45,000 training, 5,000 validation, and 10,000 test images. As with the prior datasets, the test dataset is composed of unbiased samples (i.e., ρ = 90%).

Biased FFHQ The BFFHQ benchmark was used in (Lee et al., 2021; Kim et al., 2021). The (target, bias) attribute pairs for bias-aligned samples are (Female, Young) and (Male, Old), where "Young" refers to people aged 10 to 29 and "Old" refers to people aged 40 to 59. The bias-conflicting samples are (Female, Old) and (Male, Young). The numbers of training, validation, and test samples are 19,200, 1,000, and 1,000, respectively.
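The CMNIST coloring steps of Section A.1 can be sketched as below. The palette, σ, and ρ values follow the description above, but the helper names and the uniform stand-in for a digit image are our own illustration, not the released generation script:

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES, SIGMA, RHO = 10, 1e-4, 0.01          # sigma and rho as in Appendix A.1
palette = rng.uniform(size=(NUM_CLASSES, 3))      # ten sampled RGB colors {C_i}

def colorize(gray_digit, y):
    """Color one 28x28 grayscale digit following steps (1)-(2) of Appendix A.1."""
    if rng.uniform() < RHO:                        # (1) bias-conflicting with prob. rho
        others = [i for i in range(NUM_CLASSES) if i != y]
        base = palette[rng.choice(others)]         # uniformly sampled C_{U_y}
    else:                                          # otherwise bias-aligned
        base = palette[y]
    c = rng.normal(base, SIGMA)                    # c ~ N(C, sigma * I_3)
    return gray_digit[..., None] * c               # broadcast color over digit pixels

img = colorize(rng.uniform(size=(28, 28)), y=3)    # stand-in for an MNIST digit
```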

A.2 REAL-WORLD BENCHMARKS

CelebA CelebA (Liu et al., 2015) is a common real-world face classification dataset in which each image has 40 attributes. The goal is to classify the hair color ("blond" or "not blond") of celebrities, which has a spurious correlation with the gender ("male" or "female") attribute. In fact, only 6% of blond-hair images are male; therefore, ERM shows poor performance on the bias-conflicting samples. We report the average accuracy and the worst-group accuracy on the test dataset.

CivilComments-WILDS CivilComments-WILDS (Borkan et al., 2019) is a dataset for classifying whether an online comment is toxic or non-toxic. Each sentence is a real online comment curated on the Civil Comments platform, a commenting plug-in for independent news sites. Mentions of certain demographic identities (male, female, White, Black, LGBTQ, Muslim, Christian, and other religions) cause a spurious correlation with the label. Table 5 indicates the portion of toxic comments for each demographic identity.

Identity              Male   Female  White  Black  LGBTQ  Muslim  Christian  Other religions
Portion (%) of toxic  14.9   13.7    28.0   31.4   26.9   22.4    9.1        15.3

Table 5: For each demographic identity, the portion of toxic comments in CivilComments-WILDS.
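The average and worst-group accuracies reported for these benchmarks can be computed as below; a minimal sketch with hypothetical labels and group ids, not the evaluation code of any baseline:

```python
import numpy as np

def group_accuracies(y_true, y_pred, groups):
    """Average and worst-group accuracy, as reported for CelebA and CivilComments."""
    correct = np.asarray(y_true) == np.asarray(y_pred)
    per_group = {g: correct[groups == g].mean() for g in np.unique(groups)}
    return correct.mean(), min(per_group.values()), per_group

# Hypothetical predictions: group 0 could be, e.g., (blond, male) samples.
y_true = np.array([1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1])
groups = np.array([0, 0, 1, 1, 1, 1])
avg, worst, _ = group_accuracies(y_true, y_pred, groups)
```

The worst-group number is what exposes dataset bias: a model can reach high average accuracy while the smallest (bias-conflicting) group lags far behind.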

A.3 BASELINES

In this section, we briefly describe the baselines: LfF, JTT, Disen, GEORGE, BPA, CNC, and EIIL. Please refer to each paper for a detailed explanation, as we only summarize the algorithms.

(1) LfF (Nam et al., 2020) trains a debiased model by weighting the bias-conflicting samples based on the "relative difficulty", computed from the two loss values of a biased model and the debiased model. To amplify the bias-conflicting samples, the authors employ the generalized cross-entropy (GCE) loss with parameter α = 0.7. We implement the LfF algorithm following the code officially offered by the authors. The loss function proposed in this work is as follows:

L_LfF = W(z) L_CE(C_d(z), y) + λ L_GCE(C_b(z), y),
W(z) = L_CE(C_b(z), y) / ( L_CE(C_b(z), y) + L_CE(C_d(z), y) ).

Note that W(z) is the relative difficulty and GCE is the generalized cross-entropy. Here, z denotes the feature, i.e., the output of the penultimate layer, and C_b and C_d are the fully connected layers of the biased and debiased models, respectively.

(2) JTT (Liu et al., 2021) aims to debias by splitting the dataset into correctly and incorrectly learned samples. To do so, JTT trains a biased model f_b first and splits the given training dataset as follows:

D_error-set = {(x, y) s.t. y_given ≠ argmax_c f_b(x)[c]}.

The final debiased model is then trained by oversampling D_error-set λ_up times. We set λ_up = 1/ρ for all experiments. We reproduce the results utilizing the official code offered by the authors. The main strength of PGD compared with JTT is that PGD does not need the hyperparameter λ_up.

(3) Disen (Lee et al., 2021) aims to debias by generating abundant features through feature mixing between samples. To do so, the authors train the biased and debiased models by aggregating features from both networks. This work also utilizes the "relative difficulty" proposed in (Nam et al., 2020). We reproduced the results utilizing the official code offered by the authors.
The loss function proposed in this work is as follows:

L_total = L_dis + λ_swap L_swap, where
L_swap = W(z) L_CE(C_d(z_swap), y) + λ_swap_b L_GCE(C_b(z_swap), ỹ),
L_dis = W(z) L_CE(C_d(z), y) + λ_dis L_GCE(C_b(z), y),
W(z) = L_CE(C_b(z), y) / ( L_CE(C_b(z), y) + L_CE(C_d(z), y) ).

Except for the swapped feature z_swap, all terms are identical to those in the LfF explanation.

(4) GEORGE (Sohoni et al., 2020) aims to debias by measuring and mitigating hidden stratification without requiring access to subclass labels. Assume there are n data points x_1, ..., x_n ∈ χ with associated superclass (target) labels y_1, ..., y_n ∈ {1, ..., C}, and that each data point x_i is associated with a latent (unobserved) subclass label z_i. GEORGE consists of three steps. First, the authors train a biased model using ERM. Next, to estimate approximate subclass (latent) labels, they apply UMAP dimensionality reduction (McInnes et al., 2018) to the ERM model's features of the given training dataset; GEORGE then clusters the reduced representations of each superclass into K clusters, where K is chosen automatically (the original paper contains a detailed description of the clustering process). Lastly, to improve performance on these estimated subclasses, GEORGE minimizes the maximum per-cluster average loss (i.e., over (x, y) ∼ P̂_z) by using the clusters as groups in the G-DRO objective (Sagawa et al., 2019). The objective proposed in this work is as follows:

minimize_{L, f_θ} max_{1 ≤ z ≤ K} E_{(x,y)∼P̂_z}[ l(L ∘ f_θ(x), y) ],

where f_θ and L are the parameterized feature extractor and classifier, respectively.

(5) BPA (Seo et al., 2022) also performs clustering-based debiasing with per-cluster reweighting. Concretely, for an iteration number T, a momentum-based weight is computed from the history set H_T, which is defined as:

H_T = { E_{(x,y)∼P̂_k}[ l((x, y); θ_t) ] / N_k | 1 ≤ t ≤ T },

where N_k is the number of data points belonging to the k-th cluster.
(6) CNC (Zhang et al., 2022b) aims to debias by learning representations such that samples in the same class are close while samples in different groups are far apart. CNC is composed of two steps: (1) inferring pseudo group labels and (2) supervised contrastive learning. First, an ERM-based model f is trained, and the pseudo prediction ŷ is obtained by the standard argmax over the final-layer outputs of f. Next, CNC trains the debiased model with supervised contrastive learning using the pseudo prediction ŷ. The detailed process of contrastive learning for each iteration is as follows:

• From the selected batch, sample one anchor data point (x, y).
• Construct the set of positive samples {(x+_m, y+_m)} from the batch, satisfying y+_m = y and ŷ+_m ≠ ŷ.
• Similarly, construct the set of negative samples {(x-_n, y-_n)} from the batch, satisfying y-_n ≠ y and ŷ-_n = ŷ.
• Without loss of generality, assume the cardinalities of the positive and negative sets are M and N, respectively.
• Update the weights based on the gradient of the loss function L(f_θ; x, y), detailed below:

L(f_θ; x, y) = λ L_supcon(x, {x+_m}^M_{m=1}, {x-_n}^N_{n=1}; f_enc) + (1 - λ) L_cross(f_θ; x, y).

Here, λ ∈ [0, 1] is a hyperparameter and L_cross(f_θ; x, y) is the average cross-entropy loss over x, the M positives, and the N negatives. Moreover, f_enc is the feature-extractor part of f_θ, and L_supcon(x, {x+_m}^M_{m=1}, {x-_n}^N_{n=1}; f_enc) is formulated as:

- (1/M) Σ^M_{r=1} log [ exp(f_enc(x)^⊤ f_enc(x+_r)/τ) / ( Σ^M_{m=1} exp(f_enc(x)^⊤ f_enc(x+_m)/τ) + Σ^N_{n=1} exp(f_enc(x)^⊤ f_enc(x-_n)/τ) ) ].

(7) EIIL (Creager et al., 2021) infers environment partitions that maximally violate the invariance principle of a reference (biased) model, and then trains the final model with an invariant-learning objective over the inferred environments.
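As an illustration of the JTT split-and-oversample step described in (2) above, the following sketch builds the error set from a biased model's predictions and repeats it λ_up times. The function name and toy arrays are ours; the official JTT code differs in detail:

```python
import numpy as np

def jtt_oversample(X, y, y_pred_biased, lam_up):
    """JTT: collect samples the biased model misclassifies (D_error-set)
    and repeat each of them lam_up times in the debiased training set."""
    error = y != y_pred_biased
    idx = np.concatenate([np.arange(len(y)),                           # every sample once ...
                          np.repeat(np.where(error)[0], lam_up - 1)])  # ... errors lam_up times
    return X[idx], y[idx]

X = np.arange(5)[:, None]                  # five toy samples
y = np.array([0, 1, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1, 1])         # the biased model misclassifies sample 2
X_up, y_up = jtt_oversample(X, y, y_pred, lam_up=3)
```

With λ_up = 3, the single misclassified sample appears three times in the upsampled set, mirroring how JTT amplifies presumed bias-conflicting samples.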

B EXPERIMENT DETAILS AND ADDITIONAL ANALYSIS

B.1 SETTINGS

This section discusses how our experiments were set up, including architectures, image processing, and implementation details.

Architecture. For the colored MNIST, we use a simple convolutional network consisting of three CNN layers with kernel size 4 and channel sizes {8, 32, 64}, with average pooling at the end of each layer; batch normalization and dropout are used for regularization. Similarly, for the multi-bias MNIST, we use four CNN layers with kernel sizes {7, 7, 5, 3} and channel sizes {8, 32, 64, 128}, respectively. For corrupted CIFAR, BAR, and BFFHQ, we utilize ResNet-18 as provided by the open-source library torchvision. For CelebA, we follow the experimental setting of CNC (Zhang et al., 2022b), which uses ResNet-50 as the backbone network. For CivilComments-WILDS, we use a pretrained BERT backbone and exactly the same hyperparameters as in (Zhang et al., 2022b).

Implementation details. We implement the baselines referring to their official repositories. The differences from the baseline codes are the network architecture for CMNIST and the usage of data augmentation; we use the same architecture for CMNIST and the same data augmentation for all algorithms for a fair comparison. Except for JTT, all hyperparameters for CCIFAR and BFFHQ follow the previously reported parameters in the repositories. We grid-search for the other cases, the MNIST variants and BAR. We set the only hyperparameter of PGD, α = 0.7, as proposed by the original paper (Zhang & Sabuncu, 2018). A summary of the hyperparameters that we used is reported in Table 6.

Computational cost. Debiasing algorithms require an additional computational cost. To evaluate the computational cost of PGD, we report the training time in Table 11. We conduct this experiment using the colored MNIST with ρ = 0.5%. As in the top of Table 11, we report the training time of four methods: vanilla, LfF, Disen, and PGD.
Here, PGD requires a longer training time. This is because there is no module for computing per-sample gradients in a batch manner. At the bottom of Table 11, we report part-by-part costs to see which parts consume the most time. Note that Steps 1, 2, and 3 represent training the biased model, computing the per-sample gradient norm, and training the debiased model, respectively.

We also examine whether PGD fails when an unbiased dataset is given. To verify this, we report two types of additional results: (1) unbiased CMNIST (i.e., ρ = 90%) and (2) a conventional public dataset (i.e., CIFAR10). We follow the experimental setting of CMNIST for the unbiased CMNIST case. For CIFAR10, we train ResNet18 (He et al., 2016) with the SGD optimizer, momentum 0.9, weight decay 5e-4, learning rate 0.1, and a cosine-annealing learning-rate schedule. As shown in Table 12, PGD does not suffer significant performance degradation on unbiased CMNIST. Furthermore, it performs better than the vanilla model on the CIFAR10 dataset. This means that the change PGD makes to the training distribution does not cause significant performance degradation; in other words, PGD works well regardless of whether the training dataset is biased or unbiased.
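The cost of Step 2 noted above comes from computing one gradient per sample. A minimal sketch for a linear softmax model, with an explicit Python loop that mirrors why the step is slow without a batched per-sample-gradient module (the model and data here are hypothetical):

```python
import numpy as np

def per_sample_grad_norms(W, X, y):
    """Per-sample gradient norms of the cross-entropy loss for a linear softmax model.
    grad_i = (softmax(W x_i) - onehot(y_i)) x_i^T, computed one sample at a time."""
    norms = []
    for x_i, y_i in zip(X, y):
        logits = W @ x_i
        p = np.exp(logits - logits.max())
        p /= p.sum()
        p[y_i] -= 1.0                       # dL/dlogits for cross-entropy
        grad_W = np.outer(p, x_i)           # per-sample gradient w.r.t. W
        norms.append(np.linalg.norm(grad_W))
    return np.array(norms)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(8, 4)), rng.integers(0, 3, size=8)
norms = per_sample_grad_norms(rng.normal(size=(3, 4)), X, y)
```

In a deep-learning framework, vectorized per-sample gradient transforms can remove this loop, which is exactly the missing module the timing discussion refers to.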

D EMPIRICAL EVIDENCE OF PGD

As in Appendix C, in this section we also use the existing CMNIST setting (e.g., data augmentation and hyperparameters).

D.1 CORRELATION BETWEEN GRADIENT NORM AND BIAS-ALIGNMENT OF THE CMNIST

To check whether the per-sample gradient norm efficiently separates the bias-conflicting samples from the bias-aligned samples, we plot the gradient norm distributions of the colored MNIST (CMNIST). For comparison, we normalize each per-sample gradient norm as ‖∇_θ L_CE(x_i, y_i; θ_b)‖ / max_{(x_j,y_j)∈D_n} ‖∇_θ L_CE(x_j, y_j; θ_b)‖. As shown in Figure 9, the bias-aligned samples have lower gradient norms (blue bars) than the bias-conflicting samples (red bars).
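The max-normalization and the histogram binning used for these plots can be sketched as follows; the norm values here are hypothetical:

```python
import numpy as np

def normalize_grad_norms(norms):
    """Normalize per-sample gradient norms by their maximum, as in Section D.1."""
    norms = np.asarray(norms, dtype=float)
    return norms / norms.max()

norms = np.array([0.2, 0.5, 1.0, 4.0])         # hypothetical ||grad|| values
scaled = normalize_grad_norms(norms)           # all values fall in [0, 1]
edges = np.linspace(0.0, 1.0, 11)              # bins [0.0, 0.1), ..., [0.9, 1.0]
counts, _ = np.histogram(scaled, bins=edges)
```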

D.2 PGD DOES NOT LEARN ONLY THE SECOND-EASIEST FEATURE

We provide the results of the following experimental setting: the target feature is color and the bias feature is digit shape, i.e., the task is to classify the color, not the digit shape. As an example, when one of the target classes is red, this class is aligned with one of the digits (e.g., "0"); in other words, the bias-aligned samples in this class are (Red, "0"), and the bias-conflicting samples are (Red, "1"), (Red, "2"), ..., (Red, "9"). Note that, as shown in LfF (Nam et al., 2020), color is empirically known to be easier to learn than digit shape; thus, this scenario reflects the concern of whether PGD merely targets the second-easiest feature (digit shape). If this concern were correct, PGD would fail in this color-target MNIST scenario, since the model would learn the digit shape. However, as shown in the table below, vanilla, PGD, and LfF all perform well in this case. We can also support this result with the distribution of the normalized gradient norms: the numbers in Table 14 are the numbers of data items belonging to each bin. We can check that there are no bias-conflicting samples whose gradient norm is significantly larger than those of the bias-aligned samples. In other words, PGD does not force the debiased model to learn the digit shape (i.e., the second-easiest feature) in this scenario, and it achieves performance similar to vanilla.

For an in-depth analysis, we provide the results of 25 tests on CMNIST in Figure 10. We compare with Disen, which shows the best performance among the baselines other than PGD. As shown in Figure 10, very few cases overlap. We conduct a t-test for a more rigorous comparison: under the alternative hypothesis that PGD is superior to Disen, the p-values are 0.01572, 0.01239, and 0.29370, respectively. In other words, the more severe the bias, the more the superiority of PGD stands out.
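The comparison above relies on a one-sided two-sample t-test. A minimal sketch of the underlying Welch statistic, with hypothetical accuracy values standing in for the 25 reported runs:

```python
import math

def welch_t(a, b):
    """Welch's t statistic for the one-sided hypothesis mean(a) > mean(b)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # unbiased sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

# Hypothetical accuracies, NOT the reported experimental numbers.
pgd = [0.95, 0.94, 0.96, 0.95, 0.93]
disen = [0.91, 0.92, 0.90, 0.93, 0.91]
t = welch_t(pgd, disen)   # large positive t supports "PGD superior to Disen"
```

A one-sided p-value then follows from the tail of the t distribution with the Welch-Satterthwaite degrees of freedom (e.g., via scipy.stats), which is omitted here.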
Table 15 compares algorithms on CelebA†; the results of the comparison algorithms are those reported in (Seo et al., 2022), and the best worst-group accuracy is indicated in bold. Table 14 bins the normalized gradient norms ‖∇_θ L_CE(x_i, y_i; θ_b)‖ / max_{(x_j,y_j)∈D_n} ‖∇_θ L_CE(x_j, y_j; θ_b)‖ ∈ [0, 1], extracted from the biased model θ_b (trained in Step 1 of Algorithm 1 of Section 3), into intervals [0.0, 0.1), [0.1, 0.2), ..., [0.9, 1.0].

We reported the results of CelebA in Table 3 of Section 4, following the settings of (Zhang et al., 2022b). For comparison with more diverse algorithms, we further report the CelebA results according to the settings of (Seo et al., 2022). Note that (Zhang et al., 2022b) and (Seo et al., 2022) used different models, ResNet50 and ResNet18, respectively. As in Table 15, PGD shows competitive performance among the baselines.

F BACKGROUNDS FOR THEORETICAL ANALYSIS

F.1 NOTATIONS SUMMARY

For convenience, we describe the notations used in Section 5 and Appendices F, G, and H.

• I_{P(x)}(θ): Fisher information, E_{(x,y)∼P(x)f(y|x,θ)}[∇_θ log f(y|x, θ) ∇_θ^⊤ log f(y|x, θ)].
• Î_{h(x)}(θ): empirical Fisher information, Σ^n_{i=1} h(x_i) ∇_θ log f(y_i|x_i, θ) ∇_θ^⊤ log f(y_i|x_i, θ).
• H: set of all possible sampling probabilities h(x) on D_n, {h(x) | Σ_{(x_i,y_i)∈D_n} h(x_i) = 1 and h(x_i) ≥ 0 ∀(x_i, y_i) ∈ D_n}.
• M: set of all possible marginals P(x) on the input space X, {P(x) | ∫_{x∈X} P(x) dx = 1}.
• W: set of all possible pairs (x, y_true(x)).
• supp(P(x, y|θ)): support set of P(x, y|θ), {(x, y) ∈ X × {1, ..., c} | P(x, y|θ) ≠ 0}.
• h⋆_m(x): optimal sampling probability of the samples in the set m.

F.2 MAIN ASSUMPTION

Here, we organize the assumptions used in the proofs of the theorems. These are standard when analyzing models through Fisher information and are motivated by (Sourati et al., 2016).

Assumption 1. (A0) (Factorization): The joint distribution P(x, y|θ) is factorized into the conditional distribution f(y|x, θ) and the marginal distribution P(x), which does not depend on the model parameter θ; that is, P(x, y|θ) = P(x)f(y|x, θ). Thus, the joint distribution is determined by the model parameter θ and the marginal distribution P(x), which is given by the task that we want to solve. Without loss of generality, we refer to the joint distribution by the name of its marginal distribution.

(A1). (Identifiability):

The CDF P_θ (whose density is given by P(x, y|θ)) is identifiable over parameters: for every pair of distinct parameter vectors θ_1 and θ_2 in Ω, P_{θ_1} and P_{θ_2} are also distinct. That is, ∀ θ_1 ≠ θ_2 ∈ Ω, ∃ A ⊆ X × {1, ..., c} s.t. P_{θ_1}(A) ≠ P_{θ_2}(A), where X, {1, ..., c}, and Ω are the input, label, and model-parameter spaces, respectively.

The maximum likelihood estimate depends on (1) the given training dataset D_n and (2) the adjustment of the sampling probability h(x). If h(x) is the uniform distribution U(x), then the result of empirical risk minimization (ERM) is θ̂_{U(x),D_n}.

F.3.2 FISHER INFORMATION (FI)

General definition of FI. Fisher information (FI), denoted by I_{P(x)}(θ), is a measure of the sample information from a given distribution P(x, y|θ) ≜ P(x)f(y|x, θ). It is defined as the expected value of the outer product of the score function ∇_θ log P(x, y|θ) with itself, evaluated at some θ ∈ Ω:

I_{P(x)}(θ) ≜ E_{(x,y)∼P(x,y|θ)}[∇_θ log P(x, y|θ) ∇_θ^⊤ log P(x, y|θ)].   (6)

Extended versions of FI. Here, we summarize extended versions of FI, which can be derived under some assumptions; these variants are utilized in the proofs of the theorems.

• (Hessian version) Under the differentiability condition (A6) of Assumption 1 in Appendix F, FI can be written in terms of the Hessian matrix of the log-likelihood:

I_{P(x)}(θ) = -E_{(x,y)∼P(x,y|θ)}[∇²_θ log P(x, y|θ)].   (7)

• (Model decomposition version) Under the factorization condition (A0) of Assumption 1 in Appendix F, (6) and (7) can be transformed as follows:

I_{P(x)}(θ) = E_{(x,y)∼P(x)f(y|x,θ)}[∇_θ log f(y|x, θ) ∇_θ^⊤ log f(y|x, θ)]   (8)
            = -E_{(x,y)∼P(x)f(y|x,θ)}[∇²_θ log f(y|x, θ)].   (9)

Specifically, (8) and (9) can be unfolded as follows:

I_{P(x)}(θ) = ∫_{x∈X} P(x) Σ^c_{y=1} f(y|x, θ) ∇_θ log f(y|x, θ) ∇_θ^⊤ log f(y|x, θ) dx   (10)
            = -∫_{x∈X} P(x) Σ^c_{y=1} f(y|x, θ) ∇²_θ log f(y|x, θ) dx.

From now on, we define I_{p(x)}(θ) and I_{q(x)}(θ) as the FI derived from the training and test marginals, respectively.
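As a sanity check on definition (8), the following sketch estimates the FI of a hypothetical one-parameter logistic model f(y=1|x, θ) = sigmoid(θx) by Monte Carlo and compares it against the closed form E_x[p(1-p)x²]; all names here are our own:

```python
import numpy as np

def fisher_info_mc(theta, xs, n_draws=100_000, seed=0):
    """Monte Carlo estimate of (8): E[score^2] under P(x) f(y|x, theta),
    for a scalar-parameter logistic model f(y=1|x, theta) = sigmoid(theta * x)."""
    rng = np.random.default_rng(seed)
    x = rng.choice(xs, size=n_draws)            # x ~ P(x), here uniform over xs
    p = 1.0 / (1.0 + np.exp(-theta * x))
    y = rng.uniform(size=n_draws) < p           # y ~ f(y|x, theta)
    score = (y - p) * x                         # d/dtheta log f(y|x, theta)
    return np.mean(score ** 2)

xs = np.array([-1.0, 0.5, 2.0])
sig = lambda z: 1.0 / (1.0 + np.exp(-z))
closed_form = np.mean([sig(0.3 * x) * (1 - sig(0.3 * x)) * x**2 for x in xs])
mc = fisher_info_mc(0.3, xs)
```

The two estimates agree, illustrating that the outer-product form (8) and the Hessian form (9) of the FI coincide for this model.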

F.3.3 EMPIRICAL FISHER INFORMATION (EFI)

When the training dataset D_n is given, we denote the sampling probability as h(x), which is defined on the probability space H:

H = {h(x) | Σ_{(x_i,y_i)∈D_n} h(x_i) = 1, h(x_i) ≥ 0 ∀(x_i, y_i) ∈ D_n}.   (11)

Practically, the training dataset D_n is given as deterministic. Therefore, (8) can be refined as the empirical Fisher information (EFI). This reformulation is frequently utilized, e.g., in (Jastrzębski et al., 2017; Chaudhari et al., 2019), to reduce the computational complexity of gathering gradients for all possible classes (i.e., the expectation with respect to f(y|x, θ) as in (8); refer to the Σ^c_{y=1} term of (10)). Different from the prior EFI, which is defined for the case where h(x) is uniform, U(x), we generalize the definition of EFI in terms of h(x) ∈ H as follows:

Î_{h(x)}(θ) := E_{h(x)}[∇_θ log f(y|x, θ) ∇_θ^⊤ log f(y|x, θ)]
           (a) := Σ_{(x_i,y_i)∈D_n} h(x_i) ∇_θ log f(y_i|x_i, θ) ∇_θ^⊤ log f(y_i|x_i, θ).

Note that (a) holds owing to (11).

F.3.4 STOCHASTIC ORDER NOTATIONS o_P AND O_P

For a set of random variables X_n and a corresponding set of constants a_n, the notation X_n = o_p(a_n) means that X_n/a_n converges to zero in probability as n approaches an appropriate limit. It is equivalent to X_n/a_n = o_p(1), where X_n = o_p(1) is defined as: lim_{n→∞} P(|X_n| ≥ ϵ) = 0 ∀ ϵ > 0. The notation X_n = O_p(a_n) means that X_n/a_n is stochastically bounded, i.e., ∀ ϵ > 0, ∃ finite M > 0 and N > 0 s.t. P(|X_n/a_n| > M) < ϵ for any n > N.

Lemma 7. Suppose (A0) to (A8) of Assumption 1 in Appendix F hold and the data point (x, y_true(x)) is such that ∇²_θ log p(x, y_true(x)|θ⋆) is non-singular. Then the asymptotic distribution of the log-likelihood ratio is a mixture of first-order chi-square distributions, and the convergence rate is one.
More specifically:

n log [ p(x, y_true(x)|θ⋆) / p(x, y_true(x)|θ̂_{U(x),D_n}) ] →_D (1/2) Σ^d_{i=1} λ_i(x, y_true(x)) · χ²_1,

where {λ_i(x, y_true(x))}^d_{i=1} are the eigenvalues of I_{p(x)}(θ⋆)^{-1/2} ( -∇²_θ log p(x, y_true(x)|θ⋆) ) I_{p(x)}(θ⋆)^{-1/2}.

Proof. The proof is based on the Taylor expansion theorem. Recall that we deal with data (x, y_true(x)) for which ∇²_θ log p(x, y_true(x)|θ⋆) is non-singular. From the property √n(θ̂_{U(x),D_n} - θ⋆) →_D N(0, I_{p(x)}(θ⋆)^{-1}) derived in Lemma 3, one concludes that √n ‖θ̂_{U(x),D_n} - θ⋆‖_2 = O_p(1) and therefore ‖θ̂_{U(x),D_n} - θ⋆‖_2 = O_p(1/√n) by Lemma 4. Thus, by Lemma 5,

log p(x, y_true(x)|θ̂_{U(x),D_n}) = log p(x, y_true(x)|θ⋆) + (θ̂_{U(x),D_n} - θ⋆)^⊤ ∇_θ log p(x, y_true(x)|θ⋆) + (1/2)(θ̂_{U(x),D_n} - θ⋆)^⊤ ∇²_θ log p(x, y_true(x)|θ⋆)(θ̂_{U(x),D_n} - θ⋆) + o_p(1/n)

holds. By Lemma 3 and the property ∇_θ log p(x, y_true(x)|θ⋆) = 0 derived from Lemma 6, we obtain

n [ log p(x, y_true(x)|θ⋆) - log p(x, y_true(x)|θ̂_{U(x),D_n}) ]
= -(1/2) √n(θ̂_{U(x),D_n} - θ⋆)^⊤ ∇²_θ log p(x, y_true(x)|θ⋆) √n(θ̂_{U(x),D_n} - θ⋆) + o_p(1)
→_D (1/2) N(0, I_{p(x)}(θ⋆)^{-1})^⊤ ( -∇²_θ log p(x, y_true(x)|θ⋆) ) N(0, I_{p(x)}(θ⋆)^{-1})
= (1/2) N(0, I_d)^⊤ ( -I_{p(x)}(θ⋆)^{-1/2} ∇²_θ log p(x, y_true(x)|θ⋆) I_{p(x)}(θ⋆)^{-1/2} ) N(0, I_d).

Define Γ(x, y_true(x)) ≜ -I_{p(x)}(θ⋆)^{-1/2} ∇²_θ log p(x, y_true(x)|θ⋆) I_{p(x)}(θ⋆)^{-1/2} and rewrite the right-hand side element-wise (Footnote 13) as

(1/2) N(0, I_d)^⊤ Γ(x, y_true(x)) N(0, I_d) = (1/2) Σ^d_{i=1} λ_i(x, y_true(x)) · N(0, 1)² = (1/2) Σ^d_{i=1} λ_i(x, y_true(x)) · χ²_1,

where {λ_i(x, y_true(x))}^d_{i=1} are the eigenvalues of Γ(x, y_true(x)). Thus, the desired property is obtained.

G.2 MAIN LEMMA

In this section, we derive the main lemma, which expresses the test cross-entropy loss in terms of the Fisher information ratio (FIR) (Sourati et al., 2016).

Footnote 13: Suppose Γ = UΣU^⊤ with Σ = diag(λ_1, ..., λ_d). Then N(0, I_d)^⊤ U ∼ N(0, U U^⊤) = N(0, I_d). Thus, N(0, I_d)^⊤ Γ N(0, I_d) = N(0, I_d)^⊤ Σ N(0, I_d) = Σ^d_{i=1} λ_i N(0, 1)².

G.2.1 MAIN LEMMA STATEMENT AND PROOF

Lemma 8 (FIR in the expected test cross-entropy loss with MLE). Suppose Assumption 1 in Appendix F holds. Then

lim_{n→∞} n E_{(x,y)∼q(x)f(y|x,θ⋆)} E_{D_n∼p(x)f(y|x,θ⋆)}[ -log f(y|x, θ̂_{U(x),D_n}) ] = (1/2) Tr( I_{p(x)}(θ⋆)^{-1} I_{q(x)}(θ⋆) ).

Proof. We prove Lemma 8 in two steps. First, we show that the expected cross-entropy loss can be rewritten in terms of the log-likelihood ratio. Then, we prove that the expected log-likelihood ratio can be asymptotically understood as the FIR.

Step 1: Log-likelihood ratio. We show that the expected log-likelihood ratio equals the expected test cross-entropy loss:

E_{(x,y)∼q(x)f(y|x,θ⋆)} E_{D_n∼p(x)f(y|x,θ⋆)}[ log ( p(x, y|θ⋆) / p(x, y|θ̂_{U(x),D_n}) ) ]
= E_{(x,y)∼q(x)f(y|x,θ⋆)} E_{D_n∼p(x)f(y|x,θ⋆)}[ log ( f(y|x, θ⋆) / f(y|x, θ̂_{U(x),D_n}) ) ]   (15)
= E_{(x,y)∼q(x)f(y|x,θ⋆)} E_{D_n∼p(x)f(y|x,θ⋆)}[ log ( f(y|x, θ⋆) / f(y|x, θ̂_{U(x),D_n}) ) 1_{Supp(q(x,y|θ⋆))} ]   (16)
= E_{(x,y)∼q(x)f(y|x,θ⋆)} E_{D_n∼p(x)f(y|x,θ⋆)}[ -log f(y|x, θ̂_{U(x),D_n}) 1_{Supp(q(x,y|θ⋆))} ]   (17)
= E_{(x,y)∼q(x)f(y|x,θ⋆)} E_{D_n∼p(x)f(y|x,θ⋆)}[ -log f(y|x, θ̂_{U(x),D_n}) ].

(15) and (16) hold by (A0) of Assumption 1 in Appendix F. At (17), we use the properties (i) Supp(q(x, y|θ⋆)) ⊆ W and (ii) f(y|x, θ⋆) = 1 ∀(x, y) ∈ W, which are derived from (A3) of Assumption 1 in Appendix F.

Step 2: FIR. Here, we show that the expected test loss of the MLE can be understood as the FIR. By (A0) of Assumption 1 in Appendix F,

{(x, y) ∈ Supp(q(x, y|θ⋆)) | ∇²_θ log q(x, y|θ⋆) is singular} = {(x, y) ∈ Supp(q(x, y|θ⋆)) | ∇²_θ log p(x, y|θ⋆) is singular}   (18)

holds trivially. By (18) and (A9) of Assumption 1 in Appendix F, Supp(q(x, y|θ⋆)) can be replaced by

S ≜ Supp(q(x, y|θ⋆)) \ {(x, y) ∈ W | ∇²_θ log p(x, y|θ⋆) is singular}   (19)

when calculating the expectation.
We can then obtain the result of Lemma 8 as follows:

lim_{n→∞} n E_{(x,y)∼q(x)f(y|x,θ⋆)} E_{D_n∼p(x)f(y|x,θ⋆)}[ -log f(y|x, θ̂_{U(x),D_n}) ]
= lim_{n→∞} n E_{(x,y)∼q(x)f(y|x,θ⋆)} E_{D_n∼p(x)f(y|x,θ⋆)}[ log ( p(x, y|θ⋆)/p(x, y|θ̂_{U(x),D_n}) ) 1_{Supp(q(x,y|θ⋆))} ]   (20)
= lim_{n→∞} n E_{(x,y)∼q(x)f(y|x,θ⋆)} E_{D_n∼p(x)f(y|x,θ⋆)}[ log ( p(x, y|θ⋆)/p(x, y|θ̂_{U(x),D_n}) ) 1_S ]   (21)
= E_{(x,y)∼q(x)f(y|x,θ⋆)}[ lim_{n→∞} E_{D_n∼p(x)f(y|x,θ⋆)}[ n log ( p(x, y|θ⋆)/p(x, y|θ̂_{U(x),D_n}) ) ] 1_S ]   (22)
= E_{(x,y_true(x))∼q(x)f(y|x,θ⋆)}[ (1/2) Σ^d_{i=1} λ_i(x, y_true(x)) ]   (23)–(25)
= (1/2) E_{(x,y)∼q(x)f(y|x,θ⋆)}[ Tr( I_{p(x)}(θ⋆)^{-1/2} ( -∇²_θ log p(x, y|θ⋆) ) I_{p(x)}(θ⋆)^{-1/2} ) ]   (26)
= (1/2) Tr( I_{p(x)}(θ⋆)^{-1/2} E_{(x,y)∼q(x)f(y|x,θ⋆)}[ -∇²_θ log q(x, y|θ⋆) ] I_{p(x)}(θ⋆)^{-1/2} )   (27)
= (1/2) Tr( I_{p(x)}(θ⋆)^{-1/2} I_{q(x)}(θ⋆) I_{p(x)}(θ⋆)^{-1/2} )
= (1/2) Tr( I_{p(x)}(θ⋆)^{-1} I_{q(x)}(θ⋆) ).   (28)

(20) holds from Step 1, and (21) follows from (19). (22) and (26) hold because (x, y) is sampled from q(x)f(y|x, θ⋆), so y = y_true(x) on the support. From (23) to (25), the result of Lemma 7 is used, together with E[χ²_1] = 1 and the fact that the sum of the eigenvalues of Γ(x, y_true(x)) equals its trace. (27) is obtained thanks to (A0), which gives ∇²_θ log p(x, y|θ⋆) = ∇²_θ log q(x, y|θ⋆). Lastly, (28) holds because the trace is invariant under cyclic permutations of matrix products. The final term, Tr( I_{p(x)}(θ⋆)^{-1} I_{q(x)}(θ⋆) ), is known as the Fisher information ratio (FIR) because it reduces to a ratio in the scalar case.

G.3 THEOREM 1

In this section, we finally prove Theorem 1. To do so, we additionally adopt the assumptions of (Sourati et al., 2016).

G.3.1 ADDITIONAL ASSUMPTION

Assumption 2. We assume that there exist four positive constants L_1, L_2, L_3, L_4 ≥ 0 such that the following properties hold ∀ x ∈ X, y ∈ {1, ..., c}, and θ ∈ Ω:

• I(θ, x) = -∇²_θ log f(y|x, θ) is independent of the class label y.
• ∇_θ log f(y|x, θ⋆)^⊤ I_{q(x)}(θ⋆)^{-1} ∇_θ log f(y|x, θ⋆) ≤ L_1.
• ‖I_{q(x)}(θ⋆)^{-1/2} I(θ⋆, x) I_{q(x)}(θ⋆)^{-1/2}‖ ≤ L_2.
• ‖I_{q(x)}(θ⋆)^{-1/2} (I(θ′, x) - I(θ″, x)) I_{q(x)}(θ⋆)^{-1/2}‖ ≤ L_3 (θ′ - θ″)^⊤ I_{q(x)}(θ⋆) (θ′ - θ″).
• -L_4 ‖θ - θ⋆‖ I(θ⋆, x) ⪯ I(θ, x) - I(θ⋆, x) ⪯ L_4 ‖θ - θ⋆‖ I(θ⋆, x).

G.3.2 REPLACING θ⋆ BY θ̂_{U(x),D_n}

Lemma 9. Suppose Assumption 1 in Appendix F and Assumption 2 in Appendix G hold. Then, with high probability:

Tr( I_{p(x)}(θ⋆)^{-1} ) = lim_{n→∞} Tr( I_{p(x)}(θ̂_{U(x),D_n})^{-1} ).   (29)

Proof. It is shown in the proof of Lemma 2 in (Chaudhuri et al., 2015) that under Assumption 2, the following inequalities hold with probability 1 - δ(n):

(β(n) - 1)/β(n) · I(θ⋆, x) ⪯ I(θ̂_{U(x),D_n}, x) ⪯ (β(n) + 1)/β(n) · I(θ⋆, x),   (30)

where β(n) and 1 - δ(n) grow with n, the size of the training set D_n. Because I(θ, x) is independent of the class label y, I_{P(x)}(θ) = E_{x∼P(x)}[I(θ, x)] holds for any marginal distribution P(x). Taking the expectation of (30) with respect to the marginals p(x) and q(x) gives:

(β(n) - 1)/β(n) · I_{p(x)}(θ⋆) ⪯ I_{p(x)}(θ̂_{U(x),D_n}) ⪯ (β(n) + 1)/β(n) · I_{p(x)}(θ⋆),   (31)
(β(n) - 1)/β(n) · I_{q(x)}(θ⋆) ⪯ I_{q(x)}(θ̂_{U(x),D_n}) ⪯ (β(n) + 1)/β(n) · I_{q(x)}(θ⋆).   (32)

Since I_{p(x)}(θ⋆) and I_{p(x)}(θ̂_{U(x),D_n}) are assumed to be positive definite, we can write (31) in terms of the inverted matrices:

β(n)/(β(n) + 1) · I_{p(x)}(θ⋆)^{-1} ⪯ I_{p(x)}(θ̂_{U(x),D_n})^{-1} ⪯ β(n)/(β(n) - 1) · I_{p(x)}(θ⋆)^{-1}.

G.4 STATEMENT AND PROOF OF THEOREM 1

Theorem 1. Suppose Assumption 1 in Appendix F and Assumption 2 in Appendix G hold; then, for sufficiently large n, the following holds with high probability:

\[\mathbb{E}_{(x,y)\sim q(x)f(y|x,\theta^\star)}\,\mathbb{E}_{D_n\sim p(x)f(y|x,\theta^\star)}\left[-\log f(y|x,\hat{\theta}_{U(x),D_n})\right] \leq \frac{1}{2n}\,\mathrm{Tr}\left[I_{p(x)}(\hat{\theta}_{U(x),D_n})^{-1}\right]\mathrm{Tr}\left[I_{q(x)}(\theta^\star)\right]. \tag{36}\]

Proof. By (A8) of Assumption 1 in Appendix F, I_{p(x)}(θ⋆)^{-1} and I_{q(x)}(θ⋆) are positive definite matrices; thus, Tr[I_{p(x)}(θ⋆)^{-1} I_{q(x)}(θ⋆)] ≤ Tr[I_{p(x)}(θ⋆)^{-1}] Tr[I_{q(x)}(θ⋆)] holds. Combining this with the results of Lemmas 8 and 9, (36) holds with high probability. It is worth noting that Theorem 1 states that the upper bound on the test loss of the MLE θ̂_{U(x),D_n} can be minimized by lowering Tr[I_{p(x)}(θ̂_{U(x),D_n})^{-1}] through the training marginal p(x), the only tractable and controllable quantity.
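The trace inequality used in the proof, Tr[AB] ≤ Tr[A] Tr[B] for positive definite A and B (applied here with A = I_{p(x)}(θ⋆)^{-1}), can be sanity-checked numerically. The following sketch (not part of the paper's code) tests it on randomly generated positive definite matrices:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_pd(d):
    """Generate a random symmetric positive definite d x d matrix."""
    M = rng.standard_normal((d, d))
    # M M^T is positive semi-definite; the diagonal shift makes it strictly PD.
    return M @ M.T + d * np.eye(d)

for _ in range(100):
    A, B = random_pd(5), random_pd(5)
    lhs = np.trace(np.linalg.inv(A) @ B)
    rhs = np.trace(np.linalg.inv(A)) * np.trace(B)
    assert lhs <= rhs + 1e-9  # Tr[A^{-1} B] <= Tr[A^{-1}] Tr[B]
print("trace inequality held on all random trials")
```

The inequality follows from Tr[XY] ≤ λ_max(X) Tr[Y] ≤ Tr[X] Tr[Y] for positive semi-definite X and Y.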

H THEOREM 2

In this section, we introduce the motivation for gradient norm-based importance sampling. To show why it matters, we introduce the debiasing objective problem for a given D_n under a sampling probability h(x) and, because the general problem is difficult, show how to solve it in a toy example.

H.1 PRACTICAL OBJECTIVE FUNCTION FOR THE DATASET BIAS PROBLEM

Recall that the right-hand side of (36) is controlled by the training and test marginals p(x) and q(x). Since we can only control the training dataset D_n, not p(x) and q(x), we can design a practical objective function for the dataset bias problem by using the empirical Fisher information (EFI) and Theorem 1 as follows:

\[\min_{h(x)\in H} \mathrm{Tr}\left[\hat{I}_{h(x)}(\hat{\theta}_{h(x),D_n})^{-1}\right], \tag{37}\]

where Î_{h(x)}(θ) is the empirical Fisher information matrix, defined as

\[\hat{I}_{h(x)}(\theta) = \sum_{i=1}^{n} h(x_i)\,\nabla_\theta \log f(y_i|x_i,\theta)\,\nabla_\theta^{\top} \log f(y_i|x_i,\theta).\]

Here, h(x) denotes the sampling probability on D_n, which is the only controllable term. Because the general problem is difficult, we analyze (37) in a toy example.
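As an illustration of how the EFI and the objective (37) can be evaluated in practice, the sketch below builds Î_h(θ) from per-sample gradients of a toy logistic-regression model; the model, data, and function names here are illustrative stand-ins, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy logistic model: f(y=1|x, theta) = sigmoid(theta^T x).
X = rng.standard_normal((200, 3))
theta = rng.standard_normal(3)
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ theta))).astype(float)

def per_sample_grads(theta, X, y):
    """Gradient of log f(y_i|x_i, theta), one row per sample."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    return (y - p)[:, None] * X

def efi(theta, X, y, h):
    """Empirical Fisher information under sampling probability h (h sums to 1)."""
    G = per_sample_grads(theta, X, y)
    # Weighted sum of per-sample outer products g_i g_i^T.
    return (h[:, None, None] * G[:, :, None] * G[:, None, :]).sum(axis=0)

h_uniform = np.full(len(X), 1.0 / len(X))
objective = np.trace(np.linalg.inv(efi(theta, X, y, h_uniform)))
print(f"Tr[EFI^-1] under uniform sampling: {objective:.3f}")
```

Minimizing this trace over h on the probability simplex is exactly (37); PGD approximates the minimizer by setting h in proportion to the per-sample gradient norm.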

H.2 ONE-DIMENSIONAL TOY EXAMPLE SETTING

For simplicity, we assume that D n comprises sets M and m, and the samples in each set share the same loss function and the same probability mass. The details are as follows: 



Footnotes:
- Note that bias-alignment cannot always be strictly divided in practice; for ease of explanation, we use the notation bias-conflicting/bias-aligned.
- In the case of JTT (Liu et al., 2021), although the authors used bias labels for the validation dataset (specifically, for bias-conflicting samples), we tune the hyperparameters using a part of the biased training dataset for a fair comparison. Considering that JTT does not show a significant performance gain in our results, this is consistent with the existing finding that the validation dataset is important for JTT, as described in (Idrissi et al., 2022).
- Lee et al. (2021) only reported the bias-conflicting case for BFFHQ, but we report the unbiased test result.
- Note that for simplicity, we abuse the notation h(x, y) used in Section 3 as h(x); this is exactly the same for a given dataset D_n.
- Code repositories used: https://github.com/alinlab/BAR, https://github.com/kakaoenterprise/Learning-Debiased-Disentangled, https://github.com/alinlab/LfF, https://github.com/clovaai/rebias, https://github.com/anniesch/jtt
- We say that a function f : X → Y is of C^p(X), for an integer p > 0, if its derivatives up to the p-th order exist and are continuous at all points of X.
- For any two positive definite matrices A and B, Tr[AB] ≤ Tr[A] Tr[B] is satisfied.



The MLE θ̂_{h(x),D_n} is a variable controlled by two factors: (1) a change in the training dataset D_n and (2) the adjustment of the sampling probability h(x). If h(x) is the uniform distribution U(x), then θ̂_{U(x),D_n} is the outcome of empirical risk minimization (ERM).

Finally, we show that the term Tr[I_{p(x)}(θ⋆)^{-1}], which is defined at the oracle model parameter, can be replaced with Tr[I_{p(x)}(θ̂_{U(x),D_n})^{-1}]. The proof of Theorem 1 is in Appendix D. Note that Theorem 1 means that the upper bound on the test loss of the MLE θ̂_{U(x),D_n} can be minimized by reducing Tr[I_{p(x)}(θ̂_{U(x),D_n})^{-1}].

ĥ(x) = |g(θ̂_{U(x),D_n})|/Z, where Z = |M||g_M(θ̂_{U(x),D_n})| + |m||g_m(θ̂_{U(x),D_n})|, and |M| and |m| denote the cardinalities of M and m, respectively.

Figure 3: Target objective Tr[ Îh ( θh(x),Dn ) -1 ]. PGD : h(x) = ĥ(x), and vanilla: h(x) = U (x).

have designed DNNs with a shared feature extractor and multiple classifiers. In contrast to the shared-feature-extractor methods, McDuff et al. (2019) and Ramaswamy et al. (2021) fabricated a classifier and conditional generative adversarial networks, yielding test samples to determine whether the classifier was biased. Furthermore, Singh et al. (

Figure 4: Colored MNIST: the single bias attribute is color, and the target attribute is digit shape. The top three rows show bias-aligned samples, and the bottom row shows bias-conflicting samples.

Figure 5: Multi-bias MNIST: multiple color and object biases, with digit shape as the target. The top three rows show bias-aligned samples, and the bottom row shows bias-conflicting samples.

Figure 6: Corrupted CIFAR: corruption is the bias attribute, and the object class is the target attribute. The top three rows show bias-aligned samples, and the bottom row shows bias-conflicting samples.

Figure 7: Biased Action Recognition: the bias attribute is background, and the target attribute is action. The top two rows show bias-aligned samples, and the bottom row shows bias-conflicting samples.

Figure 8: BFFHQ (biased FFHQ): the target attribute is gender, and the bias attribute is age. The top two rows show bias-aligned samples, and the bottom row shows bias-conflicting samples.

aims to debias by using feature clustering and cluster reweighting. It consists of three steps. First, a biased model is trained with ERM. Next, all training samples are clustered into K clusters based on the features of the biased model, where K is a hyperparameter. Here, h(x, y; θ) ∈ K = {1, ..., K} denotes the cluster-mapping function of a datum (x, y) derived from the biased model with parameter θ. In the last step, BPA computes a proper importance weight w_k for the k-th cluster, where k ∈ K, and the final debiasing objective is to minimize the weighted empirical risk:

\[\underset{\theta}{\mathrm{minimize}}\ \mathbb{E}_{(x,y)\sim P}\left[w_{h(x,y;\hat{\theta})}(\theta)\, l(x,y;\theta)\right],\]
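A rough sketch of this clustering-and-reweighting pattern is given below; the minimal k-means routine, the synthetic features, and the inverse-frequency weight rule are simplifications chosen for illustration (BPA's actual weight w_k is derived differently, from cluster statistics):

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(feats, K, iters=20):
    """Minimal k-means on (here: synthetic) penultimate-layer features."""
    # Deterministic, spread-out initialization across the index range.
    centers = feats[np.linspace(0, len(feats) - 1, K).astype(int)].astype(float)
    for _ in range(iters):
        assign = np.argmin(((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for k in range(K):
            if np.any(assign == k):
                centers[k] = feats[assign == k].mean(axis=0)
    return assign

# Illustrative features: one large (bias-aligned-like) and one small cluster.
feats = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(5, 1, (10, 2))])
assign = kmeans(feats, K=2)

# One simple reweighting choice: up-weight small clusters (inverse frequency),
# normalized so the per-sample weights average to 1 over the dataset.
sizes = np.bincount(assign, minlength=2)
w_cluster = 1.0 / sizes
w_sample = w_cluster[assign]
w_sample *= len(feats) / w_sample.sum()
print("cluster sizes:", sizes)
```

The weighted empirical risk is then the mean of `w_sample[i] * loss_i` over the dataset.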

(1) Step 2 (computing the per-sample gradient norm and the sampling probability) takes 4.3% of the training time. (2) Resampling based on the modified sampling probability h(x) requires an additional cost of 1m 27s, as seen from the difference between the computation times of Step 3 and Step 1.
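The resampling step itself is mechanically simple. The sketch below, with purely illustrative gradient norms (not measured values), contrasts the two ways of up-weighting that are compared in this paper: drawing batches from the modified sampling probability h(x) versus reweighting the loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative per-sample gradient norms: many small (bias-aligned-like)
# values and a few large (bias-conflicting-like) values.
grad_norms = np.concatenate([rng.uniform(0.0, 0.1, 95), rng.uniform(1.0, 2.0, 5)])
probs = grad_norms / grad_norms.sum()  # PGD-style sampling probability

# (a) Resampling: draw minibatch indices according to probs.
batch = rng.choice(len(probs), size=32, replace=True, p=probs)

# (b) Reweighting: keep uniform batches but scale sample i's loss by n * probs[i],
# so both schemes give each sample the same expected contribution.
weights = len(probs) * probs

rare = np.arange(95, 100)  # indices of the large-gradient samples
print("rare-sample share in resampled batch:", np.isin(batch, rare).mean())
```

Both schemes up-weight the rare large-gradient samples; the ablation in this paper finds resampling slightly more effective in practice.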

Figure 9: Histogram of per-sample gradient norm.

Figure 10: Histogram of unbiased test accuracy among 25 trials for each.


Applying n → ∞ to (34) completes the proof; note that β(n) is proportional to n.

^14: I_{P(x)}(θ) = E_{(x,y)∼P(x,y|θ)}[-∇²_θ log f(y|x, θ)] = E_{x∼P(x)} E_{y∼f(y|x,θ)}[-∇²_θ log f(y|x, θ)] = E_{x∼P(x)} E_{y∼f(y|x,θ)}[I(θ, x)] = E_{x∼P(x)}[I(θ, x)].
^15: For any two positive definite matrices A and B, A ⪰ B ⇒ A^{-1} ⪯ B^{-1}.
^16: If A ⪯ B, then Tr[A] ≤ Tr[B]. (∵) A ⪯ B ⇒ B - A ⪰ O; writing B - A = UΣU^⊤ with U = [u_1| ⋯ |u_d], we have Tr(B - A) = Σ_{i=1}^d u_i^⊤(B - A)u_i ≥ 0 because B - A is positive semi-definite; hence Tr(A) ≤ Tr(B).

By the results of Lemmas 8 and 9 in Appendix G,

\begin{align}
\lim_{n\to\infty} n\,\mathbb{E}_{(x,y)\sim q(x)f(y|x,\theta^\star)}\,\mathbb{E}_{D_n\sim p(x)f(y|x,\theta^\star)}\left[-\log f(y|x,\hat{\theta}_{U(x),D_n})\right]
&= \frac{1}{2}\,\mathrm{Tr}\left[I_{p(x)}(\theta^\star)^{-1} I_{q(x)}(\theta^\star)\right] \nonumber\\
&\leq \frac{1}{2}\,\mathrm{Tr}\left[I_{p(x)}(\theta^\star)^{-1}\right]\mathrm{Tr}\left[I_{q(x)}(\theta^\star)\right] \nonumber\\
&= \lim_{n\to\infty}\frac{1}{2}\,\mathrm{Tr}\left[I_{p(x)}(\hat{\theta}_{U(x),D_n})^{-1}\right]\mathrm{Tr}\left[I_{q(x)}(\theta^\star)\right]. \nonumber
\end{align}

That is, h⋆_M(x) = |m|/(2|M|·|m|) = 1/(2|M|) and h⋆_m(x) = |M|/(2|M|·|m|) = 1/(2|m|). This result is related to the trained model θ̂_{U(x),D_n}, where U_M(x) = U_m(x) = 1/(|M|+|m|). At θ̂_{U(x),D_n}, |M| g_M(θ̂_{U(x),D_n}) + |m| g_m(θ̂_{U(x),D_n}) = 0 is satisfied, which is equivalent to |M| : |m| = |g_m(θ̂_{U(x),D_n})| : |g_M(θ̂_{U(x),D_n})|. Thus, it is consistent with our intuition that setting the sampling probabilities h for sets M and m in proportion to |g_M(θ̂_{U(x),D_n})| and |g_m(θ̂_{U(x),D_n})| helps minimize the trace of the inverse empirical Fisher information.
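This toy example can be checked numerically. The sketch below implements the weighted-ERM solution and the objective Tr[Î_h(θ̂_h)^{-1}] from the one-dimensional setting above, and compares uniform sampling against sampling proportional to the per-sample gradient norm:

```python
# Toy setting of Appendix H: |M| bias-aligned samples with loss (theta + a)^2 / 2
# and |m| bias-conflicting samples with loss (theta - a)^2 / 2.
M, m, a = 90, 10, 1.0

def theta_hat(h_M, h_m):
    """Weighted-ERM solution for sampling masses (h_M, h_m) with M*h_M + m*h_m = 1."""
    return a * (m * h_m - M * h_M)

def objective(h_M, h_m):
    """Tr[EFI(theta_hat)^{-1}]; in one dimension the EFI is a scalar."""
    t = theta_hat(h_M, h_m)
    efi = M * h_M * (t + a) ** 2 + m * h_m * (t - a) ** 2
    return 1.0 / efi

# Uniform sampling (ERM).
u = 1.0 / (M + m)
erm_obj = objective(u, u)

# PGD: sampling mass proportional to the per-sample gradient norm at theta_hat_U.
t_u = theta_hat(u, u)
gM, gm = abs(t_u + a), abs(t_u - a)  # per-sample gradient norms on M and m
Z = M * gM + m * gm
pgd_obj = objective(gM / Z, gm / Z)

print(f"ERM objective: {erm_obj:.3f}, PGD objective: {pgd_obj:.3f}")
```

Under the gradient-norm sampling, each set receives total mass 1/2, θ̂ moves to 0, and the objective drops, matching the closed-form solution h⋆ above.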

Average test accuracy and standard deviation (three runs) for experiments with the MNIST variants under various bias conflict ratios. The best accuracy is indicated in bold for each case.

Average test accuracy and standard deviation (three runs) for experiments with the raw image benchmarks BAR and BFFHQ. The best accuracy is indicated in bold; when the best performance is tied, it is underlined.

Average and worst-group test accuracy on the raw image benchmark CelebA and the raw NLP task CivilComments-WILDS. The results of the comparison algorithms are those reported in (Zhang et al., 2022b). The best worst-group accuracy is indicated in bold.

Ablation studies on GCE and data augmentation (✓ for the applied case).

proposes a novel invariant learning framework that does not require prior knowledge of the environment. EIIL is composed of three steps: (i) training based on ERM, (ii)

Image processing. We train and evaluate with a fixed image size: colored MNIST (28 × 28), multi-bias MNIST (56 × 56), corrupted CIFAR (32 × 32), and the remaining datasets (224 × 224). For CMNIST, MBMNIST, CCIFAR, BAR, and BFFHQ, we use random resized crop, random rotation, and color jitter to avoid overfitting. We normalize with mean (0.4914, 0.4822, 0.4465) and standard deviation (0.2023, 0.1994, 0.2010) for the CCIFAR, BAR, and BFFHQ cases.

Implementation. For Table 1 and Table 2 reported in Section 4, we reproduce all experimental results

Hyperparameter details. For Table 3 reported in Section 4, we follow the implementation settings for CelebA and CivilComments-WILDS suggested by Seo et al. (2022) and Liu et al. (2021), respectively. A summary of the hyperparameters we used is reported in Table 6. We conduct our experiments mainly on a single Titan XP GPU in all cases. As shown in Table 10, we can draw the following two conclusions: (1) the loss at the last epoch of the first stage of a two-stage approach is not suitable for resampling, and (2) replacing the relative-difficulty metric of LfF with the gradient norm shows that the gradient norm discriminates bias-conflicting samples better than the loss.

Computation cost

Results on unbiased CMNIST and natural CIFAR10 cases.

Digit target MNIST vs Color target MNIST

Number of samples at each bin: Color target MNIST (ρ = 0.5%)

Average and worst test accuracy with CelebA setting of

Notation Table


ACKNOWLEDGEMENT

This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) [No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST), 10%] and [No. 2022-0-00641, XVoice: Multi-Modal Voice Meta Learning, 90%].

C CASE STUDIES ON PGD

In this section, we analyze PGD in several ways. Most analyses are based on the CMNIST dataset, and the experimental setting is the same as before unless otherwise noted. For example, all experiments use the same data augmentations: color jitter, random resized crop, and random rotation.

Table 7: Ablation study on the GCE parameter α.

The only hyperparameter used in PGD is the GCE parameter α. We set this value to 0.7 following the protocol of LfF (Nam et al., 2020). However, we need to compare various choices of α. To analyze this, we run PGD with various α and report the performance in Table 7. As shown in Table 7, the debiased model performs best when the GCE parameter is 0.9. This is because the biased model then focuses fully on the bias feature rather than the target feature, which can be seen from the unbiased test accuracy of the biased model at the bottom of Table 7.

To support our algorithm design, we provide further experimental analysis, i.e., resampling versus reweighting. Reweighting (Nam et al., 2020; Lee et al., 2021) and resampling (Liu et al., 2021) are the two main techniques for debiasing by up-weighting bias-conflicting samples. PGD is an algorithm that modifies the sampling probability by using the per-sample gradient norm. To check whether PGD also works with reweighting, we examine the results of PGD with reweighting on the colored MNIST dataset and report them in Table 9. We compute the weight for each sample as follows: As shown in Table 9, PGD with resampling slightly outperforms PGD with reweighting. As argued in (An et al., 2020), this gain can be explained by the argument that resampling is more stable and performs better than reweighting.

Published as a conference paper at ICLR 2023

(A2). The joint distribution P_θ has common support for all θ ∈ Ω.

(A5). (Test joint): Let q(x) denote the test marginal without dependence on the parameter. The unseen test pairs are distributed according to the test/true joint distribution of the form q(x, y|θ⋆) ≜ q(x)f(y|x, θ⋆), because we do not consider the existence of mismatched-label data in the test task.

(A8). (Invertibility):

The arbitrary Fisher information matrix I_{P(x)}(θ) is positive definite and therefore invertible for all θ ∈ Ω. In contrast to (Sourati et al., 2016), we modify (A3) so that the oracle model always outputs a hard label, i.e., f(y_true(x)|x, θ⋆) = 1, and we add (A9), which is not numbered but is noted in the statements of Theorem 3 and Theorem 11 in (Sourati et al., 2016).

F.3 PRELIMINARIES

We organize the two types of background knowledge needed for the subsequent analysis: the maximum likelihood estimator (MLE) and Fisher information (FI).

F.3.1 MAXIMUM LIKELIHOOD ESTIMATOR (MLE)

In this section, we derive the maximum likelihood estimator for a classification problem with sampling probability h(x). Unless otherwise specified, the training set D_n = {(x_i, y_i)}_{i=1}^n is sampled from p(x, y|θ⋆). For a given probability mass function (PMF) h(x) on D_n, we define the MLE θ̂_{h(x),D_n} as the maximizer of the h-weighted log-likelihood:

\[\hat{\theta}_{h(x),D_n} = \underset{\theta\in\Omega}{\arg\max}\ \sum_{i=1}^{n} h(x_i)\log f(y_i|x_i,\theta).\]

In (3) and (4), (A0) and (A4) of Assumption 1 in Appendix F were used, respectively. It is worth noting that the MLE θ̂_{h(x),D_n} is a variable influenced by two factors: (1) a change in the training dataset D_n and (2) the adjustment of the sampling probability h(x).
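As a minimal concrete case (our illustrative example, not one from the paper's experiments), consider a categorical model with no input dependence; the h-weighted MLE is then simply the h-weighted label frequency, and uniform h recovers the ordinary ERM/MLE:

```python
import numpy as np

def weighted_mle_categorical(y, h, c):
    """MLE of class probabilities maximizing sum_i h(x_i) * log f(y_i | theta).

    For a categorical model over c classes, the maximizer is the
    h-weighted label frequency.
    """
    theta = np.zeros(c)
    for label, mass in zip(y, h):
        theta[label] += mass
    return theta / theta.sum()

y = np.array([0, 0, 0, 1])
uniform = np.full(4, 0.25)
print(weighted_mle_categorical(y, uniform, 2))        # -> [0.75 0.25]

# Up-weighting the rare label shifts the MLE toward it.
upweight_rare = np.array([1/6, 1/6, 1/6, 1/2])
print(weighted_mle_categorical(y, upweight_rare, 2))  # -> [0.5 0.5]
```

This mirrors the two factors noted above: changing D_n changes the labels y, while changing h redistributes the mass over them.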

G THEOREM 1

In this section, we present the sub-lemmas required for the proof of Lemma 8, the main ingredient of the proof of Theorem 1.

G.1 SUB-LEMMAS

Lemma 1 ((Lehmann & Casella, 1998), Theorem 5.1). Let →^P denote convergence in probability. If (A0) to (A7) of Assumption 1 in Appendix F hold, then there exists a sequence of MLE solutions {θ̂_{U(x),D_n}}_{n∈N} such that θ̂_{U(x),D_n} →^P θ⋆ as n → ∞, where θ⋆ is the 'true' unknown parameter of the distribution of the samples.

Proof. We refer to (Lehmann & Casella, 1998) for the detailed proof.

Lemma 2 ((Lehmann & Casella, 1998), Theorem 5.1). Let {θ̂_{U(x),D_n}}_{n∈N} be the MLE based on the training dataset D_n. If (A0) to (A8) of Assumption 1 in Appendix F hold, then the MLE θ̂_{U(x),D_n} has a zero-mean normal asymptotic distribution with covariance equal to the inverse Fisher information matrix and convergence rate 1/2:

\[\sqrt{n}\left(\hat{\theta}_{U(x),D_n}-\theta^\star\right) \xrightarrow{D} \mathcal{N}\left(0,\ I_{p(x)}(\theta^\star)^{-1}\right),\]

where →^D represents convergence in distribution.

Proof. We refer to (Lehmann & Casella, 1998) for the detailed proof, based on Lemma 1.

Lemma 3 ((Wasserman, 2004), Theorem 9.18). Under (A0) to (A8) of Assumption 1 in Appendix F, we get

\[\sqrt{n}\, I_{p(x)}(\hat{\theta}_{U(x),D_n})^{\frac{1}{2}}\left(\hat{\theta}_{U(x),D_n}-\theta^\star\right) \xrightarrow{D} \mathcal{N}(0,\ I),\]

where →^D represents convergence in distribution.

Proof. We refer to (Wasserman, 2004) for the detailed proof, based on Lemma 2.

Proof. We refer to (Serfling, 1980) for the detailed proof.

Lemma 5 ((Sourati et al., 2016), Theorem 27). Let {θ_n} be a sequence of random vectors in a convex and compact set Ω ⊆ R^d and let θ⋆ ∈ Ω be a constant vector such that ∥θ_n -

Proof. We refer to (Serfling, 1980) for the detailed proof.

Lemma 6. If (A0) and (A3) of Assumption 1 in Appendix F hold, then ∇_θ log P(x, y_true(x)|θ⋆) = 0⃗ for any joint distribution P(x, y|θ⋆).

Proof. The first equality uses (A0) of Assumption 1 in Appendix F, and the second equality uses (A3).

• For a given a ∈ R, at the model parameter θ ∈ R, the loss functions (1/2)(θ + a)² and (1/2)(θ - a)² arise for all data in M and m, respectively.
• θ̂_{h,D_n} denotes the model trained from an arbitrary PMF h(x) ∈ H, which is constrained to have two degrees of freedom, (h_M(x), h_m(x)).
• Concretely, each sample of M and m has probability mass h_M(x) and h_m(x), respectively.
• In this setting, our objective can be written as finding h⋆(x) = argmin_{h(x)∈H} Tr[Î_{h(x)}(θ̂_{h(x),D_n})^{-1}], which is equivalent to finding (h⋆_M(x), h⋆_m(x)).

H.3 STATEMENT AND PROOF OF THEOREM 2

In this section, we introduce the motivation for gradient norm-based importance sampling in the toy example setting.

Theorem 2. Under the above setting, the solution (h⋆_M(x), h⋆_m(x)) satisfies |M|·h⋆_M(x) = |m|·h⋆_m(x) = 1/2.

Proof. Since the gradient is scalar in the toy setting, Î_{h(x)}(θ̂_{h(x),D_n}) is also scalar and coincides with its unique eigenvalue. Thus, our problem reduces to choosing h_M(x) and h_m(x). Because of the toy setting, three constraints hold for arbitrary θ ∈ [-a, a] and h(x) ∈ H. (39) is maximized when |M|·h_M(x) = 1/2, which implies |m|·h_m(x) = 1/2. Thus, h⋆_M(x) = 1/(2|M|) and h⋆_m(x) = 1/(2|m|).

