DR-FAIRNESS: DYNAMIC DATA RATIO ADJUSTMENT FOR FAIR TRAINING ON REAL AND GENERATED DATA

Anonymous

Abstract

Fair visual recognition has become critical for preventing demographic disparity. A major cause of model unfairness is the imbalanced representation of different groups in training data. Recently, several works aim to alleviate this issue using generated data. However, these approaches often use generated data to obtain similar amounts of data across groups, which is not optimal for achieving high fairness due to different learning difficulties and generated data qualities across groups. To address this issue, we propose a novel adaptive sampling approach that leverages both real and generated data for fairness. We design a bilevel optimization that finds the optimal data sampling ratios among groups and between real and generated data while training a model. The ratios are dynamically adjusted considering both the model's accuracy as well as its fairness. To efficiently solve our non-convex bilevel optimization, we propose a simple approximation to the solution given by the implicit function theorem. Extensive experiments show that our framework achieves state-of-the-art fairness and accuracy on the CelebA and ImageNet People Subtree datasets. We also observe that our method adaptively relies less on the generated data when it has poor quality.

1. INTRODUCTION

Model fairness in visual recognition is becoming essential to prevent discriminatory predictions over demographics. Recently, numerous unfairness issues have been reported (Wang et al., 2020; Najibi, 2020), and several fair image classification approaches have been proposed that do not discriminate against specific groups defined by attributes such as gender, age, or skin color (Ramaswamy et al., 2021; Roh et al., 2021). With the rapid progress in deep generative learning (Karras et al., 2020; Dhariwal & Nichol, 2021), there is a new research direction to improve fairness by augmenting training data with generated data. Recent breakthroughs in generative learning make generated data practical enough to use in real-world applications (OpenAI, 2022), and many high-quality pre-trained generative models are now open to the public (Rombach et al., 2022), which obviates the need to train such models ourselves. Thus, generated data is increasingly used to improve model performance, including fairness.

From a fairness perspective, generated data complements real data by making it more diverse. For example, if a specific group's data is collected from a limited data source that does not cover the full data distribution, that group may be discriminated against in model training due to the bias (Mehrabi et al., 2021). In this case, generated data can be used to supplement that underrepresented group.

However, most fair training approaches that use generated data simply generate similar amounts of data across groups (Ramaswamy et al., 2021; Choi et al., 2020), which may not be optimal for improving group fairness measures such as equalized odds (Hardt et al., 2016) and demographic parity (Feldman et al., 2015). Such suboptimality could originate from 1) the learning difficulty differences across groups and 2) the potential bias (typically in the form of missing modes) and quality issues in the generated data, which can hurt the accuracy and fairness of the model under training.
Therefore, it is essential to find the right mix of generated and real data for the best accuracy and fairness. In this paper, we harness the potential of both real and generated data via adaptive sampling to improve group fairness while minimizing accuracy degradation. To this end, we design a new sampling approach called Dr-Fairness (Dynamic Data Ratio Adjustment for Fairness) that adaptively adjusts data ratios among groups and between real and generated data over iterations, as in Figure 1a.

Figure 1b (caption): Compared to the original model, the 1:1 ratio baseline (Ramaswamy et al., 2021) does not significantly improve group fairness, measured through equalized odds (EO) disparity. FairBatch (Roh et al., 2021) shows high fairness by adaptively selecting real data only, but loses accuracy. In comparison, Dr-Fairness (ours) achieves high fairness while not sacrificing accuracy.

Table 1: Properties of Dr-Fairness and two representative methods.

Method | Uses generated data | Adaptive group ratios | Optimizes real/generated ratio | Uses accuracy for ratio updates
1:1 ratio | ✓ | ✗ | ✗ | ✗
FairBatch | ✗ | ✓ | ✗ | ✗
Dr-Fairness | ✓ | ✓ | ✓ | ✓

In Table 1, we compare the unique properties of Dr-Fairness against two representative methods: 1) an equal ratio baseline (1:1 ratio) (Ramaswamy et al., 2021) that uses generated data and 2) a fairness-aware adaptive sampling baseline (FairBatch) (Roh et al., 2021) that finds the optimal group ratio for fairness using only real data. We can see that Dr-Fairness subsumes the two baselines and improves on them by also optimizing the ratio between real and generated data and utilizing accuracy for ratio updates.

To perform adaptive sampling systematically, we design a novel bilevel optimization problem along with an efficient algorithm for solving it. Our bilevel optimization consists of 1) an outer optimization that adjusts data sampling ratios considering both fairness and accuracy and 2) an inner optimization that minimizes the standard empirical risk on both real and generated data, given the current sampling ratios.
Although various exact algorithms have been proposed to solve bilevel optimizations (Maclaurin et al., 2015), they often scale poorly in our scenario with large models and data. We thus propose an approximate algorithm that uses the implicit function theorem (Krantz & Parks, 2002) and identity-matrix approximation (Luketina et al., 2016) to efficiently compute the gradient of our bilevel optimization. Specifically, instead of computing the expensive inverse Hessian matrix, we approximate it with the identity matrix.

Experiments on CelebA (Liu et al., 2015) and ImageNet People Subtree (Yang et al., 2020) show that our approach achieves state-of-the-art fairness and accuracy. For instance, Figure 1b highlights our results on CelebA, where our framework largely outperforms both FairBatch, which only uses real data, and the 1:1 ratio baseline; see Sec. 3 for comparisons using more baselines and other fairness metrics, which show consistent results. On the ImageNet People Subtree classification problem, which represents a large-scale real-world scenario, we achieve better accuracies than the best baseline, with an absolute improvement of 5-9%, while obtaining similar fairness scores. We also observe that our framework adaptively relies less on the generated data when it has poor quality.

Summary of Contributions:

(1) We propose Dr-Fairness, a novel adaptive sampling framework for fair training that enjoys the potential of both real and generated data. (2) To perform adaptive sampling systematically, we formulate a bilevel optimization to train fair and accurate models on real and generated data. (3) We also design an approximate algorithm based on the implicit function theorem and identity-matrix approximation to efficiently solve our optimization. (4) We perform extensive experiments on CelebA and ImageNet People Subtree to show that Dr-Fairness achieves the state-of-the-art accuracy and fairness. (5) Finally, we believe that our work reveals the importance of using generated data together with real data to improve model fairness.

2. FRAMEWORK

In this section, we first formulate a bilevel optimization problem for optimizing sampling ratios for real and generated data. We then design a new algorithm that efficiently solves the optimization problem. Throughout this paper, we use the following notations and fairness definitions.

Notations

Let $x \in \mathcal{X}$ be the input feature, and let $y \in \mathcal{Y}$ and $\hat{y} \in \mathcal{Y}$ be the true label and the predicted label, respectively. Let $z \in \mathcal{Z}$ be a sensitive group attribute, e.g., gender, age, or skin color. Let $m$ be the total number of training samples, and $m_{y,z}$ be the number of samples in the set $\{i : y_i = y, z_i = z\}$ with label $y$ and group label $z$. Similarly, $m_{y,\star} := |\{i : y_i = y\}|$ and $m_{\star,z} := |\{i : z_i = z\}|$. Let $w$ be the model weights; the overall empirical risk is given by $L(w) = \frac{1}{m} \sum_{i} \ell(y_i, \hat{y}_i)$, where $\ell(\cdot)$ represents the loss function. Let $L_{y,z}(w)$ be the empirical risk over samples in the set $\{i : y_i = y, z_i = z\}$, i.e., $L_{y,z}(w) := \frac{1}{m_{y,z}} \sum_{i : y_i = y, z_i = z} \ell(y_i, \hat{y}_i)$. Finally, let $L^{\text{real}}(\cdot)$ and $L^{\text{gen}}(\cdot)$ be the empirical risks on real data and generated data, respectively.

Fairness Definitions

For the method design, we focus on two prominent group fairness definitions: equalized odds (EO) (Hardt et al., 2016) and demographic parity (DP) (Feldman et al., 2015). In the expressions below, upright symbols denote the corresponding random variables. EO is satisfied when the accuracies conditioned on the true label are the same for different groups, i.e., $\Pr(\hat{\mathrm{y}} = y \mid \mathrm{y} = y, \mathrm{z} = z_1) = \Pr(\hat{\mathrm{y}} = y \mid \mathrm{y} = y, \mathrm{z} = z_2)$, $\forall y \in \mathcal{Y}$, $z_1, z_2 \in \mathcal{Z}$. DP is satisfied when the positive prediction rates are the same for the groups, i.e., $\Pr(\hat{\mathrm{y}} = 1 \mid \mathrm{z} = z_1) = \Pr(\hat{\mathrm{y}} = 1 \mid \mathrm{z} = z_2)$, $\forall z_1, z_2 \in \mathcal{Z}$, where DP is designed for binary classification (i.e., $y \in \{0, 1\}$) with a favorable label class (e.g., "approval" in a loan decision).
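To make the notation concrete, the following is a minimal sketch of how the group-conditional risks $L_{y,z}$ and counts $m_{y,z}$ could be computed from per-sample losses. The function name `group_risks`, the integer encoding of labels and groups, and the toy numbers are assumptions made for illustration only.

```python
import numpy as np

def group_risks(losses, y, z, num_labels, num_groups):
    """Return (risks, counts), both of shape (num_labels, num_groups).

    risks[y, z] is the mean loss over samples with label y and group z
    (NaN for empty cells); counts[y, z] is the number of such samples.
    Labels and groups are assumed to be integer-encoded from 0.
    """
    losses = np.asarray(losses, dtype=float)
    y = np.asarray(y)
    z = np.asarray(z)
    sums = np.zeros((num_labels, num_groups))
    counts = np.zeros((num_labels, num_groups))
    np.add.at(sums, (y, z), losses)   # accumulate losses per (y, z) cell
    np.add.at(counts, (y, z), 1.0)    # count samples per (y, z) cell
    with np.errstate(invalid="ignore", divide="ignore"):
        risks = np.where(counts > 0, sums / counts, np.nan)
    return risks, counts

# Toy data: four samples, two labels, two groups.
toy_losses = [0.2, 0.4, 1.0, 0.6]
toy_y = [0, 0, 1, 1]
toy_z = [0, 0, 0, 1]
risks, counts = group_risks(toy_losses, toy_y, toy_z, num_labels=2, num_groups=2)

# The overall risk L(w) is the count-weighted average of the group risks.
overall = np.nansum(counts / counts.sum() * risks)
print(risks, overall)
```

Weighting each $L_{y,z}$ by $m_{y,z}/m$ recovers the overall empirical risk, which is the form the accuracy term takes later in the outer objective.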

Generated Data

In general, any synthetic data, including data from deep generative models, can be considered generated data for fair training. Here, the key role of the generated data in algorithmic fairness is supporting the limited subset of the real data. Also, we implicitly assume that the domains of the generated data and real data are the same, but the distributions of the two data can be different. For example, if the real data represents human faces, then we assume the generated data also contains human faces. However, generated data may have a fairer distribution than real data. In this paper, we assume we can get group-specific generated data by using conditional image generation techniques (Nie et al., 2021; Dhariwal & Nichol, 2021) -see details in Secs. 3 and B.3.

2.1. BILEVEL OPTIMIZATION FOR FAIRNESS WITH REAL AND GENERATED DATA

To design an adaptive sampling strategy on real and generated data, we first formulate a bilevel optimization for training fair and accurate models. The bilevel optimization consists of inner and outer objectives: 1) we maintain the standard empirical risk minimization (ERM) in the inner problem, and 2) we capture the desired fairness properties in the outer problem. The bilevel formulation allows us to support prominent group fairness metrics and utilize generated data for fairness. Moreover, through this formulation, we can achieve the desired fairness properties while keeping the standard model training process without re-configuring the model architecture or loss functions. We discuss more advantages of using bilevel optimization compared to other problem formulation methods like distributionally robust optimization (Sinha et al., 2017) in Sec. A.6.

We now explain how our optimization improves group fairness and accuracy together by using both real and generated data. The outer objective aims to find the optimal data ratios among sensitive groups and between real and generated data to minimize the fairness and accuracy losses on the real data distribution. Given the current data ratios, the inner objective runs a weighted ERM with both real and generated data. We can support various prominent group fairness metrics by modifying the outer objective and the constraints. As an illustration, we state our bilevel optimization w.r.t. EO as follows (see the DP version in Sec. A.1):

$$\min_{\lambda, \mu} \; \max_{y \in \mathcal{Y},\, z_1, z_2 \in \mathcal{Z}} \left\{ \left| L^{\text{real}}_{y,z_1}(w(\lambda, \mu)) - L^{\text{real}}_{y,z_2}(w(\lambda, \mu)) \right| \right\} + k \sum_{y \in \mathcal{Y},\, z \in \mathcal{Z}} \frac{m_{y,z}}{m} L^{\text{real}}_{y,z}(w(\lambda, \mu)),$$

$$w(\lambda, \mu) = \arg\min_{w} \sum_{y \in \mathcal{Y},\, z \in \mathcal{Z}} \frac{m_{y,\star}}{m} \lambda_{y,z} \left\{ \mu_{y,z} L^{\text{real}}_{y,z}(w) + (1 - \mu_{y,z}) L^{\text{gen}}_{y,z}(w) \right\},$$

$$\text{s.t.} \quad \lambda \in [0, 1], \quad \mu \in [0, 1], \quad \sum_{z \in \mathcal{Z}} \lambda_{y,z} = 1, \; \forall y \in \mathcal{Y},$$

where $\lambda_{y,z}$ is the ratio for group $z$ in class $y$, $\mu_{y,z}$ is the ratio for real data in the $(y, z)$-class, and $\lambda$ and $\mu$ are the sets of all $\lambda_{y,z}$ and $\mu_{y,z}$, respectively.
In the outer objective, the first term is the fairness loss and the second term is the accuracy loss. The hyperparameter $k$ tunes the relative importance of the two losses. We note that the $\lambda_{y,z}$ and $\mu_{y,z}$ values are the data ratios within one mini-batch. Thus, among all samples in the real and generated data, our framework serves mini-batches according to $\lambda_{y,z}$ and $\mu_{y,z}$. Here, we capture the EO disparity as the maximum of the loss differences between groups within the same label (i.e., $\max |L^{\text{real}}_{y,z_1}(w) - L^{\text{real}}_{y,z_2}(w)|$); see the details in Sec. A.2. Through the above formulation, the amount of generated data is automatically adjusted to augment the real data. One problem with only using real data is that the model's accuracy can degrade when minority groups are over-sampled for better fairness. Generating data for these groups can prevent the model from overfitting and lessen the accuracy degradation.
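The two roles of the ratios can be illustrated with a small sketch: evaluating the EO-style outer objective from a table of per-(y, z) real-data risks, and turning given $\lambda$ and $\mu$ into the expected composition of one mini-batch. The function names, the toy numbers, and the expectation-based (rather than sampled) batch composition are assumptions for illustration, not the implementation used in the experiments.

```python
import numpy as np

def outer_objective(real_risks, counts, k):
    """Fairness loss plus k times accuracy loss, for risks of shape (labels, groups)."""
    real_risks = np.asarray(real_risks, dtype=float)
    counts = np.asarray(counts, dtype=float)
    # max over (z1, z2) of |L_{y,z1} - L_{y,z2}| is (max - min) within label y;
    # the outer max then runs over labels.
    fairness = np.max(real_risks.max(axis=1) - real_risks.min(axis=1))
    # sum over (y, z) of (m_{y,z} / m) * L_{y,z}
    accuracy = np.sum(counts / counts.sum() * real_risks)
    return fairness + k * accuracy

def batch_composition(lam, mu, label_counts, batch_size):
    """Expected numbers of real and generated samples per (y, z) cell.

    lam[y, z] sums to 1 over z for each y; mu[y, z] is the real-data share.
    """
    lam = np.asarray(lam, dtype=float)
    mu = np.asarray(mu, dtype=float)
    label_frac = np.asarray(label_counts, dtype=float) / np.sum(label_counts)  # m_{y,*} / m
    cell = batch_size * label_frac[:, None] * lam
    return cell * mu, cell * (1.0 - mu)

# Toy example with two labels and two groups.
objective = outer_objective([[0.2, 0.5], [0.4, 0.3]], [[30, 10], [20, 40]], k=2.0)
real_n, gen_n = batch_composition(
    lam=[[0.5, 0.5], [0.25, 0.75]],
    mu=[[1.0, 0.6], [0.8, 0.5]],
    label_counts=[40, 60],
    batch_size=100,
)
print(objective)
print(real_n)
print(gen_n)
```

In this sketch a cell with $\mu_{y,z} = 1$ draws no generated samples at all, which mirrors how the formulation can fall back to real data for a given group.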

2.2. ALGORITHM

We now design our algorithm to solve the above bilevel optimization. In this section, we first describe how to efficiently approximate our optimization by utilizing the implicit function theorem (Krantz & Parks, 2002) and adapting identity-matrix approximation (Luketina et al., 2016) in a fairness setting. We then introduce the overall training procedure, and show the validity of our approximate algorithm on synthetic data.

Algorithm Design

Solving bilevel optimization is known to be challenging (Liu et al., 2021), especially when the objectives are non-convex as in our problem. Thus, we resort to stochastic gradient descent to find the optimal parameters of the bilevel optimization gradually. To obtain the gradients, we first convert our optimization into the unconstrained version:

$$\min_{\lambda, \mu} \; \max_{y \in \mathcal{Y},\, z_1, z_2 \in \mathcal{Z}} \left\{ \left| L^{\text{real}}_{y,z_1}(w(\lambda, \mu)) - L^{\text{real}}_{y,z_2}(w(\lambda, \mu)) \right| \right\} + k \sum_{y \in \mathcal{Y},\, z \in \mathcal{Z}} \frac{m_{y,z}}{m} L^{\text{real}}_{y,z}(w(\lambda, \mu)),$$

$$w(\lambda, \mu) = \arg\min_{w} \sum_{y \in \mathcal{Y},\, z \in \mathcal{Z}} \frac{m_{y,\star}}{m} \sigma_y(\lambda_{y,z}) \left\{ S(\mu_{y,z}) L^{\text{real}}_{y,z}(w) + (1 - S(\mu_{y,z})) L^{\text{gen}}_{y,z}(w) \right\},$$

where $\sigma_y(\lambda_{y,z}) := \exp(\lambda_{y,z}) / \sum_{z_i} \exp(\lambda_{y,z_i})$ (i.e., the softmax function), and $S(\mu_{y,z}) := 1 / (1 + \exp(-\mu_{y,z}))$ (i.e., the sigmoid function). Denoting the outer and inner objectives as $f_{\text{outer}}(\lambda, \mu, w(\lambda, \mu))$ and $f_{\text{inner}}(\lambda, \mu, w)$ respectively, the inner optimization $w(\lambda, \mu) = \arg\min_{w} f_{\text{inner}}(\lambda, \mu, w)$ can be solved efficiently using SGD-like algorithms. The main question is how to solve the outer optimization. We can state the gradient of $f_{\text{outer}}$ w.r.t. $\lambda$ as follows:

$$\frac{d f_{\text{outer}}}{d \lambda} = \underbrace{\frac{\partial f_{\text{outer}}}{\partial \lambda}}_{\text{Term A}} + \underbrace{\frac{\partial f_{\text{outer}}}{\partial w(\lambda, \mu)}}_{\text{Term B}} \times \underbrace{\frac{\partial w(\lambda, \mu)}{\partial \lambda}}_{\text{Term C}}, \tag{1}$$

where Term A and Term B are the direct gradients w.r.t. $\lambda$ and $w(\lambda, \mu)$, respectively, and Term C is the best-response Jacobian. Note that $w(\lambda, \mu)$ is the best response of the model weights. In Eq. 1, the best-response Jacobian is hard to compute directly.
Although various algorithms have been proposed to explicitly find the best-response Jacobian, most of them require propagating the entire history of the gradients (Maclaurin et al., 2015), which is very time-consuming. Instead, we implicitly measure the best-response Jacobian using the implicit function theorem (Krantz & Parks, 2002). This approach does not need to investigate the entire gradient history (Rajeswaran et al., 2019; Lorraine et al., 2020) and builds on the assumption that the inner optimization has converged to a local minimum, i.e., $\partial f_{\text{inner}} / \partial w = 0$. Using this assumption, we can convert the best-response Jacobian into the multiplication of two matrices (see more details in Corollary 2 of Sec. A.3):

$$\frac{d f_{\text{outer}}}{d \lambda} = \frac{\partial f_{\text{outer}}}{\partial \lambda} + \frac{\partial f_{\text{outer}}}{\partial w(\lambda, \mu)} \times \left( - \left[ \frac{\partial^2 f_{\text{inner}}}{\partial w \, \partial w} \right]^{-1} \times \frac{\partial^2 f_{\text{inner}}}{\partial w \, \partial \lambda} \right). \tag{2}$$

However, obtaining the inverse Hessian (i.e., $[\partial^2 f_{\text{inner}} / \partial w \, \partial w]^{-1}$) in Eq. 2 is also computationally expensive. Thus, we consider the identity matrix approximation (Luketina et al., 2016; Finn et al., 2017; Geng et al., 2021) that replaces the inverse Hessian with the identity matrix. Despite its simplicity, such an approximation may be valid for neural networks with normalization layers that make the Hessian matrix diagonally dominant (e.g., BatchNorm), and in practice, it often performs on par with other approximation methods in various applications (Raiko et al., 2012; Pedregosa, 2016; Liu et al., 2018; Wilder et al., 2019; Fung et al., 2022). Given this, we can rewrite Eq. 2 simply as:

$$\frac{d f_{\text{outer}}}{d \lambda} \approx \frac{\partial f_{\text{outer}}}{\partial \lambda} - \frac{\partial f_{\text{outer}}}{\partial w(\lambda, \mu)} \times \frac{\partial^2 f_{\text{inner}}}{\partial w \, \partial \lambda}, \tag{3}$$

where the second term on the right-hand side is efficiently computed via a vector-Jacobian product (Paszke et al., 2017). Similarly, we can also approximate the gradient of $f_{\text{outer}}$ w.r.t. $\mu$.

Overall Training Process

We now describe the overall training process in Algo. 1. We first initialize the model parameters and the data ratios, and for each iteration, we then get a minibatch from Dr-Fairness (Algo. 2).
In Algo. 2, we first update the data ratios among groups ($\lambda$) and between real and generated data ($\mu$) by calculating $d f_{\text{outer}} / d \lambda$ and $d f_{\text{outer}} / d \mu$ as in Eq. 3. We then draw a minibatch according to $\sigma_y(\lambda)$ and $S(\mu)$. Note that the batch sampling with $\sigma_y(\lambda)$ and $S(\mu)$ provides an unbiased estimator of the weighted ERM in our inner optimization (Roh et al., 2021). Finally, we update the model parameters $w$ with the given minibatch. Here we can optionally use an exponential moving average (EMA) of the model parameters $w$ to improve training stability.

Validity of Our Algorithm

We empirically verify how close the solutions from our approximation strategy are to the optimal ones. To this end, we follow the synthetic binary setting in Roh et al. (2021), where FairBatch has a theoretical guarantee to find the optimal group ratios, and compare the optimized group ratios of Dr-Fairness and FairBatch; see details on the setup in Sec. A.4. Note that in this synthetic setting, we set the fairness metric to equal opportunity (i.e., a relaxed version of EO that focuses on the positive label) and only use the real data to optimize the group ratios $\lambda$, as FairBatch cannot handle $\mu$ for the generated data. Ideally, if the approximation error in our method is small, Dr-Fairness should obtain the same group ratios and performance as FairBatch. Figure 2 shows that our algorithm converges to group ratios similar to those of FairBatch, although the key ideas of the two algorithms for updating the group ratios are very different. Also, the two algorithms have similar fairness scores (0.012 equal opportunity disparity for both). These results imply that our approximations are good enough to find reasonable solutions, which is consistent with the observations in other applications (Lorraine et al., 2020; Luketina et al., 2016).
In the next section, we will show that Dr-Fairness achieves much higher accuracy with similar or better fairness than FairBatch on real-world datasets, as our algorithm scales to multiple groups and labels and is capable of harnessing the potential of both real and generated data. We note that although FairBatch is known to have theoretical guarantees, they only apply to limited settings (e.g., binary groups and labels), so there is room to improve fairness in other settings. We also verify that, in the above setting, the identity matrix approximation yields almost the same group ratios as the exact inverse Hessian computation. See more details in Sec. A.5.
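The ratio update in Eq. 3 can be illustrated on a toy problem. The sketch below assumes two groups whose losses are quadratics $L_g(w) = \frac{1}{2}\|w - c_g\|^2$ with hypothetical optima `C`, a softmax-weighted inner objective, and an outer objective of the same EO-plus-accuracy form; all names and constants are illustrative. In this toy case the inner Hessian is exactly the identity, so the identity-matrix approximation happens to be exact, which is not true for neural networks in general. Finite differences stand in for the vector-Jacobian product that an autodiff framework would provide.

```python
import numpy as np

C = np.array([[0.0, 0.0], [2.0, 1.0]])  # hypothetical per-group optima c_g
K = 0.5                                  # weight k on the accuracy term

def softmax(lam):
    e = np.exp(lam - lam.max())
    return e / e.sum()

def group_losses(w):
    return 0.5 * np.sum((w - C) ** 2, axis=1)

def inner_grad_w(lam, w):
    # d f_inner / d w for f_inner = sum_g softmax(lam)_g * L_g(w)
    return softmax(lam) @ (w - C)

def solve_inner(lam):
    # Closed-form minimizer of the toy inner problem (the best response w(lam)).
    return softmax(lam) @ C

def outer(w):
    losses = group_losses(w)
    return abs(losses[0] - losses[1]) + K * losses.mean()

def outer_grad_w(w):
    losses = group_losses(w)
    fairness_part = np.sign(losses[0] - losses[1]) * (C[1] - C[0])
    accuracy_part = K * (w - C.mean(axis=0))
    return fairness_part + accuracy_part

def mixed_partial(lam, w, eps=1e-6):
    # d^2 f_inner / (dw dlam) by central differences; shape (dim_w, n_groups).
    cols = []
    for j in range(lam.size):
        d = np.zeros_like(lam)
        d[j] = eps
        cols.append((inner_grad_w(lam + d, w) - inner_grad_w(lam - d, w)) / (2 * eps))
    return np.stack(cols, axis=1)

def approx_hypergrad(lam):
    # Eq. 3 with Term A = 0 (the toy outer objective has no direct dependence on lam)
    # and the inverse Hessian replaced by the identity.
    w = solve_inner(lam)
    return -outer_grad_w(w) @ mixed_partial(lam, w)

def finite_diff_hypergrad(lam, eps=1e-6):
    # Reference: differentiate outer(w(lam)) directly through the closed-form solution.
    grad = np.zeros_like(lam)
    for j in range(lam.size):
        d = np.zeros_like(lam)
        d[j] = eps
        grad[j] = (outer(solve_inner(lam + d)) - outer(solve_inner(lam - d))) / (2 * eps)
    return grad

lam = np.array([0.3, -0.2])
g = approx_hypergrad(lam)
print("approximate:", g)
print("reference:  ", finite_diff_hypergrad(lam))
print("outer before/after one step:",
      outer(solve_inner(lam)), outer(solve_inner(lam - 0.1 * g)))
```

A gradient step on the unconstrained $\lambda$ shifts the softmax weights toward the group with the higher loss, which is the behavior the sampling ratios are meant to exhibit.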

3. EXPERIMENTS

We perform various experiments to evaluate our algorithm. We repeat all experiments with three random seeds and measure all performances on a separate test set -see more information, including hyperparameter details of the algorithms, in Sec. B.1. We use ResNet50 (He et al., 2016) and the Adam optimizer (Kingma & Ba, 2015) -see results on ViT (Dosovitskiy et al., 2021) in Sec. B.9. Datasets We utilize two real-world datasets: 1) CelebA (Liu et al., 2015) to compare our algorithms with baselines and perform various analyses, and 2) ImageNet People Subtree (Yang et al., 2020) to further observe the algorithm performances on a large-scale real-world scenario. Note that we are using large datasets instead of the traditional smaller tabular benchmarks for fairness because our goal is to make Dr-Fairness work in large-scale real-world applications. [CelebA] Contains celebrity images, where each image has 40 attributes (e.g., gender, age, and smiling). We choose group and label attributes that are less subjective and traditionally considered for fairness. The group attributes are gender (male and female) and age (young and old). The label attributes are age, haircolor (black, blond, and others), and smiling (smiling and not-smiling). Note that age can be used as either the group or label attribute. The sizes of the training, validation, and test sets are 160k, 20k, and 20k, respectively. [ImageNet People Subtree] Contains 284 label classes and 3 group attributes: gender (male, female, and unsure), skin color (light, medium, and dark), and age (child, adult, middle, and retired). We first filter out classes that are vague, duplicates, or too small with few samples, which leaves us with 112 classes -see details in Sec. B.2. These classes contain about 111k samples, but only 10% of them have group attribute annotations. We split the group-labeled data into 40%, 20%, and 40% for training, validation, and testing, respectively.

Data generation

We create the generated datasets using state-of-the-art generative models that conditionally synthesize images for each (y, z)-class. For CelebA, we use a StyleGAN-based controllable generation method called LACE (Nie et al., 2021). For ImageNet People Subtree, we fine-tune a diffusion model (Dhariwal & Nichol, 2021) pre-trained on ImageNet (Deng et al., 2009), and use classifier guidance (Song et al., 2020; Dhariwal & Nichol, 2021) to sample images in each (y, z)-class. Note that the controllable generation for ImageNet People Subtree is more challenging due to its large number of (y, z)-classes and labeling noise. Thus, the resulting generated data has lower quality than the generated data in CelebA. More details on data generation are in Sec. B.3.

Baselines

We compare our algorithm with three types of baselines: 1) a vanilla (non-fair) baseline, 2) fair pre-processing baselines, and 3) fair in-processing baselines. For fair pre-processing training, we consider three baselines: simple sampling, pair-augmenting (PairAug) (Ramaswamy et al., 2021), and pair-augmenting with our generated data (PairAug*). For simple sampling, we over- and under-sample the real data to ensure an equal ratio among groups. PairAug is a fair augmentation technique that uses generation methods to synthesize balanced images for groups to reduce the correlation between the group and label attributes. For a fair comparison, we also implement an extension of PairAug (denoted by PairAug*), which uses the same balancing ratio as in Ramaswamy et al. (2021), but uses our generated data. For fair in-processing training, we consider three baselines: fairness constraint (Zafar et al., 2017a;b), domain independence (Wang et al., 2020), and FairBatch (Roh et al., 2021). Fairness constraint adds a fairness penalty term to the loss function to reduce unfairness. Domain independence trains a separate classifier for each group to reduce the correlation between the group and label attributes.
At inference, one can ensemble the outputs of the trained classifiers to get the final predictions. FairBatch adaptively adjusts batch ratios among groups to improve fairness using only real data.

Metrics

We focus on two accuracy metrics and three fairness metrics. [Accuracy] We measure the standard accuracy over all samples and the balanced accuracy that averages y-class-wise accuracies. [Fairness] We focus on equalized odds (EO) (Hardt et al., 2016), demographic parity (DP) (Feldman et al., 2015), and bias amplification (Zhao et al., 2017). For EO and DP, we measure the disparities (i.e., unfairness) among groups:

$$\text{EO disp.} = \max_{z \in \mathcal{Z},\, y \in \mathcal{Y}} \left| \Pr(\hat{\mathrm{y}} = y \mid \mathrm{z} = z, \mathrm{y} = y) - \Pr(\hat{\mathrm{y}} = y \mid \mathrm{y} = y) \right|,$$

$$\text{DP disp.} = \max_{z \in \mathcal{Z}} \left| \Pr(\hat{\mathrm{y}} = 1 \mid \mathrm{z} = z) - \Pr(\hat{\mathrm{y}} = 1) \right|.$$

Together with either EO or DP, we measure bias amplification to see how much the data bias is amplified in the model:

$$\text{Bias amp.} = \max_{y \in \mathcal{Y}} \; \Pr(\mathrm{z} = z^{*} \mid \hat{\mathrm{y}} = y) - \Pr(\mathrm{z} = z^{*} \mid \mathrm{y} = y), \quad \text{where } z^{*} := \arg\max_{z' \in \mathcal{Z}} \Pr(\mathrm{z} = z' \mid \hat{\mathrm{y}} = y).$$

Here, a good performance is indicated by high accuracy values and low EO disp., DP disp., and bias amp. values.
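The three fairness metrics above can be computed directly from label, prediction, and group arrays. The following is a minimal sketch under the assumption that labels and groups are integer-encoded and that label 1 is the favorable class for DP; the function names and toy arrays are ours.

```python
import numpy as np

def eo_disparity(y_true, y_pred, z):
    """max over (y, z) of |Pr(pred = y | z, y) - Pr(pred = y | y)|."""
    y_true, y_pred, z = map(np.asarray, (y_true, y_pred, z))
    worst = 0.0
    for y in np.unique(y_true):
        in_label = y_true == y
        overall = np.mean(y_pred[in_label] == y)
        for g in np.unique(z[in_label]):
            cell = in_label & (z == g)
            worst = max(worst, abs(np.mean(y_pred[cell] == y) - overall))
    return worst

def dp_disparity(y_pred, z):
    """max over z of |Pr(pred = 1 | z) - Pr(pred = 1)|."""
    y_pred, z = map(np.asarray, (y_pred, z))
    overall = np.mean(y_pred == 1)
    return max(abs(np.mean(y_pred[z == g] == 1) - overall) for g in np.unique(z))

def bias_amplification(y_true, y_pred, z):
    """max over y of Pr(z = z* | pred = y) - Pr(z = z* | true = y),
    where z* is the most frequent group among samples predicted as y."""
    y_true, y_pred, z = map(np.asarray, (y_true, y_pred, z))
    groups = np.unique(z)
    worst = -np.inf
    for y in np.unique(y_true):
        predicted = y_pred == y
        if not predicted.any():
            continue  # label y is never predicted, so z* is undefined
        shares = np.array([np.mean(z[predicted] == g) for g in groups])
        top = groups[np.argmax(shares)]
        worst = max(worst, shares.max() - np.mean(z[y_true == y] == top))
    return worst

# Toy example with two labels and two groups.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
z      = [0, 0, 1, 1, 0, 0, 1, 1]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(eo_disparity(y_true, y_pred, z))
print(dp_disparity(y_pred, z))
print(bias_amplification(y_true, y_pred, z))
```

Note that in this toy example the EO disparity is considerably larger than the DP disparity, which illustrates that the two criteria can rank the same predictions differently.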

3.1. CELEBA EXPERIMENTS

We evaluate Dr-Fairness on CelebA by comparing it with baselines (Sec. 3.1.1) and analyzing the impact of its hyperparameters (Sec. 3.1.2), components (Sec. 3.1.3), and generated data (Sec. 3.1.4).

3.1.1. ACCURACY AND FAIRNESS

The fair pre-processing baselines (rows 2-4 of Table 2) improve the fairness performances compared to the original non-fair baseline, but still perform worse (i.e., higher EO disp. and higher bias amp.) than the fair in-processing baselines and Dr-Fairness. Thus, simply equalizing the data ratio among groups may not be enough to achieve high group fairness. Note that it is not straightforward to get the generated data from the original PairAug work in the non-binary label setting, so we are not able to report the numbers (e.g., the right columns of Table 2). However, we expect that the results would be similar to PairAug*, as observed in the binary setting. Additionally, FairnessGAN (Sattigeri et al., 2019) is another previous method that aims to generate fair images, but this method has been reported to show worse fairness and accuracy performances than PairAug; see Sec. B.4 for a detailed comparison.

The fair in-processing baselines (rows 5-7) improve fairness (especially EO), but tend to sacrifice accuracy because they only utilize real data. Among these baselines, the higher the fairness, the larger the decrease in accuracy. For example, FairBatch adaptively adjusts the group ratio on real data to improve fairness, but we observe that some small-sized groups end up being oversampled, which is detrimental to the accuracy on the test set. In comparison, Dr-Fairness achieves high fairness performances while even improving accuracies by adaptively finding optimal data ratios among groups and between real and generated data. There are two takeaways: 1) we can find a better group ratio than the 1:1 ratio for fairness, and 2) an optimal combination of real and generated data can mitigate the accuracy degradation of fair training.

3.1.2. HYPERPARAMETER ANALYSIS

We now evaluate Dr-Fairness by varying its main hyperparameter k used in the bilevel optimization. A larger k puts more weight on the accuracy loss than the fairness loss. Figures 3a and 3b show the accuracy and fairness of Dr-Fairness during the training with the different k values. As expected, increasing k (say k = 50) results in higher accuracy and lower fairness. By varying k, we can also compare the accuracy-fairness tradeoff curves of Dr-Fairness and FairBatch in Figure 3c . Dr-Fairness shows a better tradeoff, which is consistent with the results in Sec. 3.1.1.

3.1.3. ABLATION STUDY

We perform an ablation study on our framework to evaluate the impact of each component in the optimization on fairness and accuracy. For fairness, we conduct two ablations: F1) remove both the fairness loss in the outer objective and $\lambda_{y,z}$ in the inner objective, and F2) only remove the fairness loss in the outer objective. For accuracy, we consider three ablations: A1) remove the accuracy loss in the outer objective, as well as $\mu_{y,z}$ and the generated data loss in the inner objective, A2) remove $\mu_{y,z}$ and the generated data loss in the inner objective, and A3) only remove the accuracy loss in the outer objective. We note that A2 also represents how Dr-Fairness works with only real data when we cannot utilize generated data. Through this sequence of ablations, we observe that each part of our algorithm gradually improves the fairness and accuracy performances. In Table 3, the fairness ablations (in rows 1-2) show worse fairness as the fairness loss and $\lambda_{y,z}$ are discarded, and the accuracy ablations (in rows 3-5) demonstrate lower accuracy and balanced accuracy as some of the accuracy loss, generated data, and $\mu_{y,z}$ are removed. We thus conclude that all components in our bilevel optimization contribute to the overall performances.

3.1.4. GENERATED DATA OF DIFFERENT QUALITIES

We analyze the robustness of our framework against the generated data quality as shown in Table 4. We vary the quality of generated data by adding random Gaussian noise to the original images. Interestingly, when the generated data quality decreases (i.e., adding more noise to the images), Dr-Fairness automatically reduces the usage of the generated data, as shown in the second column of Table 4. With this automatic adjustment, Dr-Fairness shows robust performances in the last three columns.
When the generated data is fully replaced with Gaussian noise (i.e., severe noise), the accuracy and fairness performances become worse than the clean setting as expected, but the fairness score is still much better than the non-fair baseline by reasonably sacrificing the accuracy. These results show that Dr-Fairness is effective even with low-quality generated data.
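The kind of quality degradation described above can be sketched as follows, assuming images are stored as float arrays scaled to [0, 1]. The function names, the noise levels, and the clipping step are hypothetical; the exact noise schedule used for Table 4 is not specified in this section.

```python
import numpy as np

def add_gaussian_noise(images, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma, then clip to [0, 1]."""
    noisy = images + rng.normal(0.0, sigma, size=images.shape)
    return np.clip(noisy, 0.0, 1.0)

def replace_with_noise(images, rng):
    """Severe case: discard the image content and keep only clipped Gaussian noise."""
    return np.clip(rng.normal(0.5, 0.5, size=images.shape), 0.0, 1.0)

rng = np.random.default_rng(0)
generated = rng.uniform(0.0, 1.0, size=(4, 8, 8, 3))  # stand-in for generated images

# Hypothetical noise levels from mild to strong.
for sigma in (0.0, 0.1, 0.5):
    degraded = add_gaussian_noise(generated, sigma, rng)
    print(sigma, float(np.mean(np.abs(degraded - generated))))

severe = replace_with_noise(generated, rng)
print(severe.shape)
```

Feeding progressively noisier copies of the generated set into training is one way to probe whether the learned real-data share $S(\mu_{y,z})$ rises as the generated data becomes less useful.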

3.2. IMAGENET PEOPLE SUBTREE EXPERIMENTS

We finally perform experiments on ImageNet People Subtree, which represents a large-scale real-world scenario. As only 10% of the data has group annotations, following Zhao et al. (2021), we first pre-train a non-fair model on the entire training set with y labels and then fine-tune the pre-trained model to improve fairness on the small set with group labels. Tables 5 (below) and 10 (in Sec. B.8) show the performances of the algorithms on four group scenarios: gender, skin color, age, and all combinations of them. The overall results are consistent with the CelebA experiments, where we can see Dr-Fairness outperforms the baselines in accuracy, fairness, or both. Specifically, our algorithm shows the best or second-best performance on EO disparity and bias amplification in almost all group settings while obtaining better classification accuracies compared to the baselines with similar fairness scores. For example, we obtain classification accuracies better than FairBatch, with an absolute improvement of 5-9%, while achieving similar fairness scores.

As ImageNet People Subtree shows a more complicated real-world scenario than CelebA, we have two additional observations. First, when we train the baselines w.r.t. EO, the bias amplification metric occasionally gets worse compared to the original model (e.g., Dom. Indep. on gender and FairBatch on skin color). This result shows that improving EO, which aims to minimize the label-specific accuracy gap between groups, does not necessarily reduce the bias in the model relative to the data. In addition, as domain independence trains a separate classifier for each group, we suspect that the final model may have undesirable results (e.g., worse bias amp.) if some of the classifiers fail. Second, this dataset contains a large number of (y, z)-classes, many of which are extremely small.
Here the controllable data generation becomes challenging where the generated labels may be noisy, which negatively affects the fair training as well. Nonetheless, Dr-Fairness still shows a clear improvement in fairness compared to the baselines, and we believe more data with clean labels could further improve its performance. The above experiments use ResNet50 (He et al., 2016) as the model backbone. In Sec. B.9, we also conduct experiments using ViT (Dosovitskiy et al., 2021) and observe similar results. The results imply our method is applicable to different network architectures. 

4. RELATED WORK

As model fairness becomes indispensable for Trustworthy AI, numerous works have been recently proposed to better measure fairness and design fairness-aware algorithms (Narayanan, 2018). Among various fairness definitions, we focus on group fairness measures (Hardt et al., 2016; Feldman et al., 2015), which are widely studied in the fairness literature; see representative works in Sec. C. Unfortunately, most of the existing algorithms are not designed to handle a large number of groups or labels, and our contribution is to support such large-scale scenarios for real-world applications.

Among the previous techniques, FairBatch (Roh et al., 2021) is the most relevant to our work, as it finds the optimal group ratio for fairness on real data and shows state-of-the-art fairness performances on various tabular datasets, including COMPAS (Angwin et al., 2016) and AdultIncome (Kohavi, 1996). However, FairBatch may suffer from accuracy degradation due to oversampling very small groups, especially in vision datasets. In particular, FairBatch cannot utilize accuracy-based objectives and generated data, which may result in a worse accuracy-fairness tradeoff. Also, the theoretical guarantees of FairBatch do not apply in our setting because the outer objective of our bilevel optimization problem is non-convex. In contrast, Dr-Fairness can minimize the accuracy degradation of fair training by optimally utilizing both real and generated data based on the fairness and accuracy objectives.

There is an emerging line of research for fairness in visual recognition (Najibi, 2020; Wang et al., 2020) where using generated data is critical. Many visual recognition tasks involve multiple classes of varying sizes, and only using real data is often insufficient to improve fairness.
In response, several works have proposed new algorithms to create a balanced dataset by augmenting the biased real dataset with well-controlled generated data (Sattigeri et al., 2019; Choi et al., 2020; Ramaswamy et al., 2021). However, simply balancing the data sizes is not enough to achieve sufficiently high group fairness, as the learning difficulty and generated data quality can differ across groups. Although a recent work (Zietlow et al., 2022) suggests an adaptive data augmentation that generates more data for worse-performing groups, it uses heuristics to adjust data ratios without proper optimization and thus has limited fairness performance. In comparison, Dr-Fairness solves a novel bilevel optimization problem to find optimal data ratios and thus obtains both high fairness and accuracy. In addition, there are other related studies on fair data reweighing (Li & Liu, 2022; Jiang & Nachum, 2020; Krasanakis et al., 2018), fair augmentation (Chuang & Mroueh, 2021), and fair representations (Shui et al., 2022). Compared to our work, these studies only use real data or do not scale to large datasets (see a detailed discussion in Sec. C).

5. CONCLUSION

We proposed a novel adaptive sampling approach called Dr-Fairness that utilizes both real and generated data for fairness. To perform adaptive sampling systematically, we first formulated a bilevel optimization, where the goal is to find the optimal data ratios among sensitive groups and between real and generated data to achieve high group fairness while minimizing accuracy degradation. To solve the bilevel optimization problem, we then designed an efficient approximate algorithm based on the implicit function theorem and identity-matrix approximation. Extensive experiments on the CelebA and ImageNet People Subtree datasets showed that Dr-Fairness achieves state-of-the-art fairness and accuracy performances. We believe Dr-Fairness opens up new opportunities for effectively using generated data in large-scale real-world scenarios.

ETHICS STATEMENT

We believe our work can positively impact society by reducing discrimination in AI applications. In particular, our framework shows that generated data can compensate for unfairness issues in real data (e.g., size bias and lack of diversity) to help obtain better accuracy and fairness results that would not have been possible otherwise. As a result, real-world applications have a better chance of ensuring fairness without sacrificing accuracy unnecessarily. We do note that choosing an appropriate fairness metric for each application is essential, as a poor choice may lead to unintended discrimination. Thus, one needs to carefully choose the target fairness metrics based on the social context in each application. Also, in terms of privacy, we did not involve human subjects or use any direct personal identifiers in the experiments, except for the human images in the publicly available benchmark datasets.

REPRODUCIBILITY STATEMENT

We provide implementation and experimental details (e.g., libraries, hyperparameters, data preprocessing, and data generation) in Sec. 3 and Sec. B to enable the reproduction of our results. To help with the reproducibility of results in this paper, we will make our source code publicly available in the future.

A APPENDIX - OPTIMIZATION AND ALGORITHM

s.t. $\lambda \in [0, 1]$, $\mu \in [0, 1]$, $\sum_{z \in Z} \lambda_{y,z} = 1, \forall y \in Y$, where $Y = \{0, 1\}$. Note that DP is designed for binary classification. For designing the fairness loss, we are inspired by Roh et al. (2021), which gives a hint on formulating the DP loss in bilevel optimization. Intuitively, the fractions in the fairness loss make the model reduce the disparity of each prediction ratio across groups, without considering the sizes of the true label classes. This strategy can be a sufficient condition for DP, as the goal of DP is to achieve the same positive prediction ratio among groups (see more details in Roh et al. (2021)).

A.2 CAPTURING EQUALIZED ODDS WITH THE LOSS CONSTRAINT

Continuing from Sec. 2.1, we explain how the loss-based constraints can capture equalized odds. We recall our notation: the empirical risk over samples in the set $(\mathrm{y} = y, \mathrm{z} = z)$ is $L_{y,z}(w) := \frac{1}{m_{y,z}} \sum_{i: y_i = y, z_i = z} \ell(y_i, \hat{y}_i)$, where $\ell(y_i, \hat{y}_i)$ is the loss function. Here, when the loss function $\ell(y_i, \hat{y}_i)$ is the 1/0-loss (i.e., $\ell(y_i, \hat{y}_i) = \mathbb{1}(y_i \neq \hat{y}_i)$, where $\mathbb{1}(\cdot)$ is an indicator function), the loss-based constraint can exactly express the equalized odds disparity. Specifically, $L_{y,z}(w)$ with the 1/0-loss is the empirical error rate in each $(y, z)$-class, i.e., one minus the probability of correct predictions, $1 - \Pr(\hat{\mathrm{y}} = y \mid \mathrm{y} = y, \mathrm{z} = z)$. Since the constant cancels when taking differences, our fairness loss constraint (i.e., $\max_{y \in Y, z_1, z_2 \in Z} \{|L^{\text{real}}_{y,z_1}(w(\lambda, \mu)) - L^{\text{real}}_{y,z_2}(w(\lambda, \mu))|\}$) becomes the equalized odds metric, which describes the class-conditioned accuracy disparity among groups (i.e., $\max_{y \in Y, z_1, z_2 \in Z} |\Pr(\hat{\mathrm{y}} = y \mid \mathrm{y} = y, \mathrm{z} = z_1) - \Pr(\hat{\mathrm{y}} = y \mid \mathrm{y} = y, \mathrm{z} = z_2)|$). In practice, we can also use other loss functions like cross-entropy loss instead of the 1/0-loss, as other loss functions have been empirically verified as reasonable proxies for capturing group fairness metrics (Roh et al., 2021; Shen et al., 2022).
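As a small illustration of the loss-based constraint, the sketch below computes the group-wise 1/0-loss and the resulting equalized odds disparity on a toy set of predictions. The function names and the toy arrays are hypothetical and only serve to make the definition concrete; they are not taken from our implementation.

```python
from collections import defaultdict
from itertools import combinations


def group_losses(y_true, y_pred, z):
    """Mean 1/0-loss L_{y,z} for every (label, group) cell that has samples."""
    errors = defaultdict(list)
    for yt, yp, zg in zip(y_true, y_pred, z):
        errors[(yt, zg)].append(1.0 if yt != yp else 0.0)
    return {cell: sum(v) / len(v) for cell, v in errors.items()}


def eo_disparity(y_true, y_pred, z):
    """max over y, z1, z2 of |L_{y,z1} - L_{y,z2}| under the 1/0-loss."""
    losses = group_losses(y_true, y_pred, z)
    worst = 0.0
    for (y1, z1), (y2, z2) in combinations(losses, 2):
        if y1 == y2:  # only compare groups that share the same true label
            worst = max(worst, abs(losses[(y1, z1)] - losses[(y2, z2)]))
    return worst


# Toy data (hypothetical): group z=1 is misclassified more often than z=0.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
z = [0, 0, 1, 1, 0, 0, 1, 1]
disparity = eo_disparity(y_true, y_pred, z)
```

In this toy case both label classes have error rate 0 for z = 0 and 0.5 for z = 1, so the disparity is 0.5. Replacing the indicator with a differentiable loss such as cross-entropy gives the proxy used in practice.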

A.3 USING THE IMPLICIT FUNCTION THEOREM

Continuing from Sec. 2.2, we describe how we convert the best-response Jacobian in Eq. 1 using the implicit function theorem. We note that among various methods of solving bilevel optimization problems, the implicit function theorem significantly improves the algorithm efficiency with theoretical evidence. Here, we first state the original implicit function theorem:

Theorem 1. (Implicit Function Theorem, stated in Krantz & Parks (2002); de Oliveira (2014)) Let $F: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^m$ be a continuously differentiable function, where the input of $F$ is $(x, y) \in \mathbb{R}^n \times \mathbb{R}^m$. Assume there is an input point $(a, b)$ that satisfies $F(a, b) = 0$, and $\frac{\partial F(a, b)}{\partial y}$ (i.e., the Jacobian matrix) is invertible. Then, there exist open sets $U \subset \mathbb{R}^n$ and $V \subset \mathbb{R}^m$ that contain $a$ and $b$, respectively, and satisfy the following:
• There is a unique continuously differentiable function $G$, where $G(a) = b$ and $F(x, G(x)) = 0$ for all $x \in U$.
• We have the Jacobian matrix of partial derivatives of $G$ in $U$ as follows: $\frac{\partial G(x)}{\partial x} = -\left[\frac{\partial F(x, G(x))}{\partial y}\right]^{-1} \left[\frac{\partial F(x, G(x))}{\partial x}\right]$.

We now apply the above theorem in our setting. To get the best-response Jacobian w.r.t. $\lambda$, we consider $\frac{\partial f_{\text{inner}}(\lambda, w)}{\partial w}$ as $F(x, y)$ and $w(\lambda)$ as $G(x)$. Note that when accessing the gradient w.r.t. $\lambda$, we can ignore $\mu$ without loss of generality, and vice versa. Then, we can rewrite Theorem 1 for our scenario as follows:

Corollary 2. (Implicit Function Theorem in our setting) Let $\frac{\partial f_{\text{inner}}}{\partial w}: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^m$ be a continuously differentiable function, where the input of $\frac{\partial f_{\text{inner}}}{\partial w}$ is $(\lambda, w) \in \mathbb{R}^n \times \mathbb{R}^m$. Assume there is an input point $(a, b)$ that satisfies $\frac{\partial f_{\text{inner}}(a, b)}{\partial w} = 0$, and $\frac{\partial^2 f_{\text{inner}}(a, b)}{\partial w \partial w}$ (i.e., the Jacobian matrix) is invertible. Then, we have the Jacobian matrix of partial derivatives of $w(\lambda)$ as follows: $\frac{\partial w(\lambda)}{\partial \lambda} = -\left[\frac{\partial^2 f_{\text{inner}}(\lambda, w)}{\partial w \partial w}\right]^{-1} \times \frac{\partial^2 f_{\text{inner}}(\lambda, w)}{\partial w \partial \lambda}$.
Thus, with the assumption that the inner optimization has converged to a local minimum, i.e., $\frac{\partial f_{\text{inner}}}{\partial w} = 0$, we can convert the best-response Jacobian in Eq. 1 to the multiplication of two matrices in Eq. 2. Similarly, we can convert the best-response Jacobian w.r.t. $\mu$ by setting $F(x, y)$ to $\frac{\partial f_{\text{inner}}(\mu, w)}{\partial w}$ and $G(x)$ to $w(\mu)$.
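To make the corollary concrete, the following sketch checks the implicit-function-theorem Jacobian against a finite-difference estimate on a one-dimensional toy problem. The quadratic inner objective, its constants, and all function names are assumptions chosen for illustration; they are not the inner objective of Dr-Fairness.

```python
# Toy inner objective (hypothetical):
#   f_inner(lam, w) = lam * H1 * (w - A)^2 / 2 + (1 - lam) * H2 * (w - B)^2 / 2
A, B, H1, H2 = 1.0, -2.0, 3.0, 0.5


def inner_grad(lam, w):
    """d f_inner / d w, which must vanish at the best response w(lam)."""
    return lam * H1 * (w - A) + (1.0 - lam) * H2 * (w - B)


def best_response(lam):
    """Closed-form minimizer w(lam) of the quadratic inner objective."""
    hess = lam * H1 + (1.0 - lam) * H2
    return (lam * H1 * A + (1.0 - lam) * H2 * B) / hess


def ift_jacobian(lam):
    """-(d2 f_inner / dw dw)^-1 * (d2 f_inner / dw dlam), evaluated at w(lam)."""
    w = best_response(lam)
    hess = lam * H1 + (1.0 - lam) * H2      # second derivative w.r.t. w
    mixed = H1 * (w - A) - H2 * (w - B)     # mixed derivative w.r.t. w and lam
    return -mixed / hess


lam, eps = 0.3, 1e-6
jac = ift_jacobian(lam)
finite_diff = (best_response(lam + eps) - best_response(lam - eps)) / (2 * eps)
```

Here the two estimates of the best-response Jacobian agree up to numerical error, which is what the corollary predicts when the inner problem is solved exactly and its Hessian is invertible.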

A.4 SETTING FOR THE VALIDITY CHECK

Continuing from Sec. 2.2, we describe the synthetic binary setting in Roh et al. (2021), which is used to empirically verify how close the solutions from our approximation strategy are to the optimal ones. For generating the synthetic dataset, we use a method in Zafar et al. (2017a), which produces two input attributes $(x_1, x_2)$, one binary label attribute $y$, and one binary group attribute $z$. We draw each sample $(x_1, x_2, y)$ from Gaussian distributions and make $z$ follow a biased distribution. In detail, we generate each sample $(x_1, x_2, y)$ from two Gaussian distributions: $(x_1, x_2) \mid y = 0 \sim \mathcal{N}([-2; -2], [10, 1; 1, 3])$ and $(x_1, x_2) \mid y = 1 \sim \mathcal{N}([2; 2], [5, 1; 1, 5])$. Then, we make $z$ follow a biased distribution: $\Pr(z = 1) = \Pr((x'_1, x'_2) \mid y = 1) / [\Pr((x'_1, x'_2) \mid y = 0) + \Pr((x'_1, x'_2) \mid y = 1)]$, where $(x'_1, x'_2) = (x_1 \cos(\pi/4) - x_2 \sin(\pi/4), x_1 \sin(\pi/4) + x_2 \cos(\pi/4))$. This synthetic dataset contains training, validation, and test sets with 2k, 1k, and 1k samples, respectively. In this experiment, we use logistic regression models for all algorithms, as in Roh et al. (2021).
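The generation procedure above can be sketched as follows. This is a minimal standard-library version under stated assumptions: labels are drawn with equal probability (the text does not state the label prior), the random seed is arbitrary, and all function names are hypothetical.

```python
import math
import random

MEANS = {0: (-2.0, -2.0), 1: (2.0, 2.0)}
COVS = {0: ((10.0, 1.0), (1.0, 3.0)), 1: ((5.0, 1.0), (1.0, 5.0))}


def sample_gaussian_2d(mean, cov, rng):
    """Draw one sample from N(mean, cov) using a 2x2 Cholesky factor."""
    (a, b), (_, d) = cov
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(d - l21 * l21)
    u1, u2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (mean[0] + l11 * u1, mean[1] + l21 * u1 + l22 * u2)


def density_2d(x, mean, cov):
    """Probability density of N(mean, cov) at the 2-D point x."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def make_sample(rng):
    """Return ((x1, x2), y, z, Pr(z = 1)) for one synthetic sample."""
    y = rng.randint(0, 1)  # assumption: balanced label prior
    x = sample_gaussian_2d(MEANS[y], COVS[y], rng)
    c, s = math.cos(math.pi / 4), math.sin(math.pi / 4)
    x_rot = (x[0] * c - x[1] * s, x[0] * s + x[1] * c)  # rotate by pi/4
    p0 = density_2d(x_rot, MEANS[0], COVS[0])
    p1 = density_2d(x_rot, MEANS[1], COVS[1])
    p_z1 = p1 / (p0 + p1)
    z = 1 if rng.random() < p_z1 else 0
    return x, y, z, p_z1


rng = random.Random(0)
train = [make_sample(rng) for _ in range(2000)]
valid = [make_sample(rng) for _ in range(1000)]
test = [make_sample(rng) for _ in range(1000)]
```

Because the group attribute depends on a rotated copy of the features, z is correlated with y without being identical to it, which is what makes the setting useful for checking fairness algorithms.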

A.5 COMPARISON WITH EXACT INVERSE HESSIAN COMPUTATION

Continuing from Sec. 2.2, we compare our identity matrix-based approximation results with the exact inverse Hessian computation results. We use the same setting described in Sec. A.4, where it is tractable to compute the exact inverse Hessian. Figure 4 shows the group ratios of Dr-Fairness (with the identity matrix approximation) and Dr-Fairness with the exact inverse Hessian computation. We can see that Dr-Fairness, which uses the identity matrix approximation to estimate the inverse Hessian in Eq. 2, converges to group ratios similar to those obtained with the exact inverse Hessian. This implies that our solution is close to the exact solution despite the method's simplicity. Another observation is that the data ratios in Figure 4b converge within fewer iterations than those in Figure 4a. We note that although fewer iterations are required when computing the exact inverse Hessian, training becomes much slower as the number of parameters increases. Thus, the approximation used in Dr-Fairness can be a reasonable solution to estimate the inverse Hessian, which is usually intractable for large models and data.
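The comparison can be mimicked on a one-dimensional toy bilevel problem, sketched below. The quadratic inner and outer objectives, the constants, the step size, and the function names are all assumptions made for illustration; this is not the synthetic setting of Sec. A.4. In one dimension the Hessian is a positive scalar, so the identity approximation only rescales the hypergradient; with many parameters the direction can also change, which this sketch does not capture.

```python
# Toy bilevel problem (hypothetical):
#   inner: f_inner(lam, w) = lam * H1 * (w - A)^2 / 2 + (1 - lam) * H2 * (w - B)^2 / 2
#   outer: f_outer(w) = (w - TARGET)^2 / 2
A, B, H1, H2, TARGET = 1.0, -2.0, 3.0, 0.5, 0.16


def best_response(lam):
    """Closed-form minimizer w(lam) of the inner objective."""
    return (lam * H1 * A + (1.0 - lam) * H2 * B) / (lam * H1 + (1.0 - lam) * H2)


def hypergradient(lam, exact):
    """d f_outer / d lam via the implicit function theorem."""
    w = best_response(lam)
    hess = lam * H1 + (1.0 - lam) * H2       # d2 f_inner / dw dw
    mixed = H1 * (w - A) - H2 * (w - B)      # d2 f_inner / dw dlam
    inv_hess = 1.0 / hess if exact else 1.0  # identity approximation uses 1
    return (w - TARGET) * (-inv_hess * mixed)


def optimize_ratio(exact, lam=0.9, lr=0.05, steps=500):
    """Projected gradient descent on the ratio lam in [0, 1]."""
    for _ in range(steps):
        lam = min(1.0, max(0.0, lam - lr * hypergradient(lam, exact)))
    return lam


lam_exact = optimize_ratio(exact=True)
lam_identity = optimize_ratio(exact=False)
```

Both runs end at approximately the same ratio in this toy problem, because the approximate hypergradient vanishes at the same point as the exact one.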

A.6 COMPARISON WITH OTHER PROBLEM FORMULATION METHODS

Continuing from Sec. 2.1, we discuss the advantages of using bilevel optimization compared to other problem formulation methods, especially distributionally robust optimization (DRO) (Sinha et al., 2017). DRO is one of the prominent problem formulation methods in machine learning, which can solve a target objective in a min-max formulation, but we believe that our bilevel formulation is more suitable for handling both real and generated data while improving group fairness. In our scenario, real data and generated data play very different roles in the bilevel objectives, and such roles are difficult to capture via a DRO formulation.
• In detail, at test-time evaluation, we only care about the fairness and accuracy losses (in our outer objective) on the real data distribution. Therefore, the empirical risk for generated data ($L^{\text{gen}}$) only appears in the inner objective for the model parameter update.
• Directly applying DRO to the empirical risks of real and generated data does not lead to the same effect as our bilevel objective because this ignores the EO-based loss in our outer objective, and more importantly, it is unclear what the benefit of optimizing the real/generated data sampling ratio to maximize the empirical risks would be. If we consider an example where we have a high loss on generated data and a low loss on real data, then DRO should increase the sampling ratio for the generated data to increase the overall loss. However, the high loss on generated data could be the result of the low quality of generated data, and increasing its sampling ratio might instead hurt the performance and fairness. On the other hand, as shown in Sec. 3.1.4, Dr-Fairness would decrease the ratio of generated data when its quality is low.
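The second point can be illustrated with a toy max-player update. The sketch below uses a generic exponentiated-gradient ascent step on the source weights, chosen only for illustration; it is not the specific procedure of Sinha et al. (2017), and the loss values and step size are hypothetical and held fixed for simplicity.

```python
import math


def dro_weight_step(weights, losses, step_size=0.5):
    """One exponentiated-gradient ascent step that maximizes the weighted loss."""
    scaled = [w * math.exp(step_size * loss) for w, loss in zip(weights, losses)]
    total = sum(scaled)
    return [s / total for s in scaled]


# Order: [real, generated]. The generated data has a high loss, e.g., due to low quality.
weights = [0.5, 0.5]
losses = [0.2, 1.5]
history = [weights]
for _ in range(5):
    weights = dro_weight_step(weights, losses)
    history.append(weights)
```

In this toy run the weight on the generated data grows at every step, even though its high loss may simply reflect poor quality, which is the behavior our bilevel formulation is designed to avoid.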

B APPENDIX - EXPERIMENTS

B.1 EXPERIMENTAL SETTINGS

Continuing from Sec. 3, we provide detailed information on the experimental settings. We use PyTorch for all experiments and utilize the pre-trained ResNet50 (He et al., 2016) provided by the PyTorch library (i.e., torchvision (Marcel & Rodriguez, 2010)). We replace the last fully-connected layer of each model so that its output dimension matches the number of label classes. When training, we update all model parameters in the pre-trained model. The batch size of all experiments is 128. We set the learning rate for updating model parameters to 0.0001. For the data ratio ($\lambda$ and $\mu$) updates in Dr-Fairness, we use the Adam optimizer and set the learning rate for the ratio update to 0.005 in all experiments. When calculating the gradients w.r.t. model parameters or data ratios in Eq. 3, we use the autograd functionality in PyTorch. We apply the exponential moving average when updating the model parameters in Dr-Fairness. To prevent overfitting, we use the validation set when measuring the fairness and accuracy losses in the outer objective of our bilevel optimization. Similarly, we use the validation set in other baselines if they require the computation of additional (fairness) losses in the algorithm.

Hyperparameters. For Dr-Fairness, we choose $k$ from a candidate set {0.1, 1, 10, 20} to have the best fairness score while minimizing the accuracy degradation in the validation set. We set the learning rates for $\lambda$ and $\mu$ to 0.005. We initialize $\lambda$ to the original $(y, z)$-ratios in the real data. We initialize $\mu$ to 0.5 for CelebA (i.e., we start with 50% real and 50% generated data) and 0.99 for ImageNet People Subtree (i.e., 99% real and 1% generated data). We use a higher (conservative) $\mu$ initially for ImageNet People Subtree because its generated data has lower quality than that for CelebA. For all baselines, we choose the hyperparameters that show the best fairness while minimizing the accuracy degradation in the validation set.
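As an illustration of the initialization described above, the sketch below computes initial group ratios from label and group annotations. It assumes the ratios are normalized within each label so that they sum to one over groups, matching the constraint on $\lambda$ in Sec. A; the function name and the toy annotations are hypothetical.

```python
from collections import Counter


def init_group_ratios(labels, groups):
    """Return lam[(y, z)] = (#samples with label y and group z) / (#samples with label y)."""
    cell_counts = Counter(zip(labels, groups))
    label_counts = Counter(labels)
    return {(y, z): n / label_counts[y] for (y, z), n in cell_counts.items()}


# Hypothetical toy annotations, not taken from CelebA or ImageNet People Subtree.
labels = [0, 0, 0, 1, 1, 1, 1, 1]
groups = [0, 0, 1, 0, 1, 1, 1, 1]
lam_init = init_group_ratios(labels, groups)

# Initial real-data ratios stated in the hyperparameter description.
mu_init = {"CelebA": 0.5, "ImageNet People Subtree": 0.99}
```

With these toy annotations, label 0 starts at ratios 2/3 and 1/3 across the two groups, and label 1 starts at 0.2 and 0.8.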

B.2 FILTERING LABEL CLASSES IN IMAGENET PEOPLE SUBTREE

Continuing from Sec. 3, we explain how we filter the label classes in the ImageNet People Subtree dataset. Initially, the dataset contains 284 label classes. Among them, we filter out classes that are vague, duplicates of other classes, or too small. First, we filter out vague classes like "ex-president" and "junior", which are hard to classify even for human annotators. To decide whether each class is vague or not, we perform internal crowdsourcing. For each class, we gather 3 expert decisions and take a majority vote. Also, we remove classes that are conceptually duplicates of others. Finally, we set the allowed minimum sample size to 50 and ignore the classes with fewer than 50 samples. As a result, 112 classes are used in our experiments.
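The three filtering rules can be sketched as a single predicate. The record layout, the function name, and all class names other than "ex-president" are hypothetical placeholders; the vote values and sample counts are made up for illustration.

```python
def keep_class(vague_votes, is_duplicate, num_samples, min_samples=50):
    """Keep a class only if it is not vague (majority vote), not a duplicate, and large enough."""
    is_vague = sum(vague_votes) * 2 > len(vague_votes)
    return (not is_vague) and (not is_duplicate) and num_samples >= min_samples


# Hypothetical records: three expert votes on vagueness, a duplicate flag, and a sample count.
candidates = {
    "ex-president": {"vague_votes": [True, True, False], "is_duplicate": False, "num_samples": 300},
    "class_a": {"vague_votes": [False, False, True], "is_duplicate": False, "num_samples": 120},
    "class_b": {"vague_votes": [False, False, False], "is_duplicate": True, "num_samples": 500},
    "class_c": {"vague_votes": [False, False, False], "is_duplicate": False, "num_samples": 49},
}
kept = [name for name, info in candidates.items() if keep_class(**info)]
```

Only `class_a` survives here: the others fail the vagueness vote, the duplicate check, or the minimum size of 50 samples.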

B.3 DATA GENERATION

Continuing from Sec. 3, we explain the details of data generation. For experiments on CelebA, we use LACE (Nie et al., 2021) to generate data. LACE is a controllable generation method that uses an energy-based model (EBM) in the latent space of a pre-trained generative model such as StyleGAN2 (Karras et al., 2020). We consider StyleGAN2 pre-trained on the CelebA-HQ dataset as our base generative model. In LACE, we first need to train the latent classifiers in the w-space of StyleGAN2, each of which corresponds to an energy function for an individual attribute in the EBM formulation (see Eq. (4) in Nie et al. (2021)). Since we mainly focus on five attributes (i.e., age, gender, smile, glasses, and haircolor) in the CelebA experiments, we end up with five latent classifiers. Next, for each combination of attribute values (e.g., age='young', gender='female', smile='true', glasses='true', and haircolor='black'), we use the ordinary differential equation (ODE) sampler in the latent space to sample the corresponding images. We repeat the above sampling process until we cover all the combinations of attribute values. For experiments on ImageNet People Subtree, we use a guided diffusion model, ADM (Dhariwal & Nichol, 2021), with classifier guidance (Song et al., 2020; Dhariwal & Nichol, 2021) to generate data. Since no ADM checkpoints pre-trained on ImageNet People Subtree exist, we first fine-tune the ImageNet-pretrained ADM on ImageNet People Subtree, where the ADM model that we use conditions on the 112 labels. For classifier guidance, we also need to first train three time-dependent attribute classifiers, each corresponding to a demographic attribute (i.e., gender, skin color, and age), on noisy images produced by the diffusion process. In particular, we fine-tune the noisy image classifier (also pre-trained on ImageNet) with three new prediction heads on 10% of the annotated data.
Next, for each combination of label and attribute values, we pass the label value as the input of the conditional ADM and use the classifier guidance (i.e., the guidance from the fine-tuned noisy image classifier in Eq. (10) of Dhariwal & Nichol (2021)) with the scale s = 15. Similarly, we repeat the above sampling process until we cover all the combinations of label and attribute values. Here we describe the number of generated samples in each dataset. In CelebA, we consider 5 attributes for the controllable generation: gender (male and female), age (young and old), smile (true and false), glasses (true and false), and haircolor (black, blond, and others). Thus, these 5 attributes yield 48 class combinations (i.e., $2^4 \times 3$). We generate a total of 96k samples, where there are 2k samples for each attribute combination (e.g., 2k samples for (age='young', gender='female', smile='true', glasses='true', and haircolor='black')). In ImageNet People Subtree, there are 112 label classes and 3 group attributes: gender (male, female, and unsure), skin color (light, medium, and dark), and age (child, adult, middle, and retired). Thus, we have 4,032 combinations (i.e., 112 × 3 × 3 × 4) for the controllable generation. We generate 32 samples for each combination, which results in about 129k samples in total. Figures 5 and 6 show examples of the generated images. We note that the controllable generation for ImageNet People Subtree is more challenging due to its large number of (y, z)-classes and label noise. Thus, the resulting generated data is noisier than the generated data in CelebA.
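The combination counts above can be checked by enumerating the attribute values directly. The dictionary and variable names below are hypothetical; the attribute values and per-combination sample counts are the ones stated in the text.

```python
from itertools import product

# CelebA: five controllable attributes, 2k generated samples per combination.
celeba_attributes = {
    "gender": ["male", "female"],
    "age": ["young", "old"],
    "smile": ["true", "false"],
    "glasses": ["true", "false"],
    "haircolor": ["black", "blond", "others"],
}
celeba_combos = list(product(*celeba_attributes.values()))
celeba_total = len(celeba_combos) * 2000

# ImageNet People Subtree: 112 labels times three group attributes, 32 samples per combination.
imagenet_groups = {
    "gender": ["male", "female", "unsure"],
    "skin color": ["light", "medium", "dark"],
    "age": ["child", "adult", "middle", "retired"],
}
num_labels = 112
imagenet_combos = num_labels * len(list(product(*imagenet_groups.values())))
imagenet_total = imagenet_combos * 32
```

This gives 48 combinations and 96,000 samples for CelebA, and 4,032 combinations and 129,024 samples (about 129k) for ImageNet People Subtree.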

B.4 COMPARISON BETWEEN FAIRNESSGAN AND PAIRAUG

Continuing from Sec. 3.1.1, we compare FairnessGAN (Sattigeri et al., 2019) and PairAug (Ramaswamy et al., 2021). Table 6 shows the accuracy and fairness performances of the two algorithms on CelebA, where they consider the setting of binary label attribute y (attractive) and binary group attribute z (gender). We show the numbers that are reported in Ramaswamy et al. (2021). PairAug shows better accuracy and equalized odds performances than FairnessGAN.

In the accuracy-unfairness graphs of Figure 7, being on the lower right indicates higher accuracy and fairness and is thus desirable. In CelebA, Dr-Fairness achieves both better accuracy and fairness performances compared to all the baselines. In ImageNet People Subtree, 1) the baselines Simple Sampling, PairAug, and Dom. Indep. show strictly worse performances than Dr-Fairness, 2) FairBatch achieves higher fairness than ours, but the accuracy degradation is severe, and 3) the non-fair baseline shows high accuracy, but much worse fairness. Thus, we can conclude that Dr-Fairness achieves the best accuracy-fairness tradeoffs in both datasets. We now explain when Dr-Fairness can improve both fairness and accuracy compared to other fairness baselines. We believe this phenomenon is related to the optimal accuracy-fairness tradeoff, which is known to be determined by the data distribution (Menon & Williamson, 2018). When the performance of a fair algorithm lies on the optimal accuracy-fairness tradeoff, any other algorithm can only achieve either better fairness or better accuracy, but cannot improve both. However, when the fairness algorithms do not achieve the optimal accuracy-fairness tradeoff in the given data, there is an opportunity to improve the model's performances toward the optimal tradeoff line. For example, we observe that Dr-Fairness improves both accuracy and fairness compared to other baselines in CelebA, implying that the baselines do not achieve the optimal accuracy-fairness tradeoff in the first place.
The main approaches for satisfying group fairness are: 1) fix the training data to mitigate bias (Kamiran & Calders, 2011; Zemel et al., 2013), 2) modify the training process to prevent the model from learning bias (Zafar et al., 2017a; b; Zhang et al., 2018a; Agarwal et al., 2018; Roh et al., 2020; 2021), or 3) alter the outputs of the trained model to achieve fairness metrics (Hardt et al., 2016). As discussed in Sec. 4, most of these algorithms are not designed to handle a large number of groups or labels, and our contribution is to support such large-scale scenarios for real-world applications. In addition, there are other related studies on 1) fair data reweighing (Li & Liu, 2022; Jiang & Nachum, 2020; Krasanakis et al., 2018), 2) fair augmentation (Chuang & Mroueh, 2021), and 3) fair representations (Shui et al., 2022).
• The data reweighing techniques (Li & Liu, 2022; Jiang & Nachum, 2020; Krasanakis et al., 2018)
• There is another interesting work called FairMixup (Chuang & Mroueh, 2021), which augments training data for fairness using mixup methods (Zhang et al., 2018b). However, the key difference from ours is that FairMixup only augments the data within the original training data distribution. FairMixup also cannot dynamically adjust sampling ratios from different groups explicitly. In contrast, Dr-Fairness can utilize any additional data, which is not limited to the original training (real) data distribution, and also find the optimal sampling ratios among groups and between real and generated data.
• Recently, a fair representation paper (Shui et al., 2022) uses the bilevel optimization formulation with the implicit function theorem and shows promising results. Although this work also uses a bilevel formulation, it targets a different problem from ours, where the goal is to map the input feature $X$ into the latent variable $X'$ for fairness.
In comparison, we use the bilevel formulation to adjust the ratio among groups and between real and generated data to improve fairness. Specifically, we design the inner objective by explicitly separating the group-wise terms and real/generated data terms to adequately apply the data weights. We note that this inner structure is different from Shui et al. (2022), where they apply the common outputs of the outer objective to all terms in the inner objective that only considers real data. In addition, when approximating the inverse Hessian matrix resulting from the implicit function theorem (e.g., Eq. 2), Shui et al. (2022) use the conjugate gradient (CG) method (Rajeswaran et al., 2019), whereas we utilize the identity-matrix approximation (Luketina et al., 2016; Geng et al., 2021). We note that the identity-matrix approximation is known to be much more efficient and may achieve on-par or sometimes better performances compared to the CG method (Lorraine et al., 2020). Although not our immediate focus, there are other important research lines for fairness: 1) fulfilling other fairness definitions (e.g., individual fairness (Dwork et al., 2012) and causal fairness (Kusner et al., 2017)) and 2) handling noisy or missing group labels (Hashimoto et al., 2018; Celis et al., 2021). We believe that supporting these aspects can be promising future directions.



Figure 1: (a) Our framework iteratively updates the data ratios among groups and between real and generated data based on the fairness and accuracy of the intermediate model. (b) Performances on CelebA, using gender as the group attribute and age as the label attribute. Compared to the original model, the 1:1 ratio baseline (Ramaswamy et al., 2021) does not significantly improve group fairness, measured through equalized odds (EO) disparity. FairBatch (Roh et al., 2021) shows high fairness by adaptively selecting real data only, but loses accuracy. In comparison, Dr-Fairness (ours) achieves high fairness, while not sacrificing accuracy.

Comparison of group ratios λ from Dr-Fairness and FairBatch. Both converge to similar ratios.

Figure 3: The performances of Dr-Fairness by varying the hyperparameter k when (y, z) = (haircolor, (gender, age)). The first two graphs show the performance changes during the training, and the last graph shows the accuracy-fairness tradeoff curves. Compared to FairBatch, Dr-Fairness shows a better tradeoff.

(a) Group ratios in Dr-Fairness. (b) Group ratios with the exact inverse Hessian.

Figure 4: Comparison of group ratios λ from Dr-Fairness (with the identity matrix approximation) and Dr-Fairness with the exact inverse Hessian computation. Both converge to similar ratios.

(a) Tradeoffs in CelebA when (y, z) = (haircolor, (gender, age)). (b) Tradeoffs in ImageNet People Subtree when z = (gender, skin color, age).

Figure 7: Accuracy-unfairness graphs to visualize the algorithm performances on the CelebA and ImageNet People Subtree datasets. Being on the lower right is desirable (high accuracy and fairness).

The data reweighing techniques (Li & Liu, 2022; Jiang & Nachum, 2020; Krasanakis et al., 2018) are indeed relevant to our data sampling framework, as they keep finding data weights to improve group fairness. Compared to our work, Jiang & Nachum (2020) and Krasanakis et al. (2018) require multiple re-trainings of the model, and Li & Liu (2022) uses additional assumptions, including the loss function being twice differentiable and strictly convex in the model parameters. Therefore, applying these reweighing techniques when training models on large-scale data may lead to significant training times due to multiple re-trainings or performance degradation due to violations of the assumptions. In comparison, Dr-Fairness works well in large-scale scenarios as in our experiments.

Functionality comparison of algorithms.

Algorithm 1: Model Training with Dr-Fairness
Input: real data (x_real, y_real, z_real), generated data (x_gen, y_gen, z_real)
  d_real, d_gen ← (x_train, y_train, z_real), (x_gen, y_gen, z_real)
  w ← initial model parameters
  λ, µ ← initialize sampling ratio logits
  Get current sampling ratios σ_y(λ), S(µ)
  for each iteration do
    minibatch = Dr-Fairness(w, d_real, d_gen, λ, µ)
    Update w according to the minibatch (optionally with exponential moving average (EMA))
Output: model parameters w

Algorithm 2: Dr-Fairness
Input: model parameters w, data d_real and d_gen, group ratio λ, real data ratio µ
  Calculate f_outer and f_inner according to w, d_real, d_gen, λ, and µ

Table 2 shows the accuracy and fairness performances of different algorithms on CelebA when training w.r.t. EO (see results of training w.r.t. DP in Sec. B.5). Here we consider two scenarios: 1) binary setting of y (age) & z (gender) and 2) non-binary setting of y (haircolor) & z (gender, age). In Sec. B.6, we show similar results for the experiments on other group and label combinations. Also, in Sec. B.7, we visually demonstrate the accuracy-fairness tradeoffs of the algorithms.

Performances on the CelebA test set when training w.r.t. EO on two scenarios: binary y (age) & z (gender) and non-binary y (haircolor) & z (gender, age). We compare Dr-Fairness with three types of baselines: 1) non-fair baseline, 2) fair pre-processing baselines: Simple Sampling, PairAug, and PairAug*, and 3) fair in-processing baselines: Fair. Const., Dom. Indep., and FairBatch. Note that PairAug and Fair. Const. cannot be trivially extended to the non-binary label setting, so we only show their results in the first column.

Ablation study on CelebA, where we consider the setting of non-binary y (haircolor) and binary z (gender). We mark noticeable performance degradations with underlines.

Analysis of the behavior of Dr-Fairness when the quality of generated data changes. We add different amounts of Gaussian noise to the original (clean) generated data. In the severe case, we fully replace the generated data with the noise. We set y to haircolor (non-binary) and z to gender (binary).

Performances on the ImageNet People Subtree test set when training w.r.t. equalized odds (EO) for either age or all combinations of groups. Other settings are identical to Table 2.

Additional comparisons between baselines using generated data. We compare PairAug and FairnessGAN, where the results are from Ramaswamy et al. (2021). The baselines use the attractive attribute as the label and gender attribute as the group.

Performances on the CelebA test set w.r.t. demographic parity (DP) on the binary y (age) and z (gender) scenario. Other settings are identical to Table 2. We mark the best and second best performances among the fairness algorithms with bold and underline, respectively.

Performances on the ImageNet People Subtree test set w.r.t. equalized odds (EO). We use DeiT-S (Touvron et al., 2021), a variant of ViT (Dosovitskiy et al., 2021), for all algorithms. We mark the best and second best performances among the fairness algorithms with bold and underline, respectively. Other settings are identical to Table 2.


Table 7 shows the accuracy and fairness performances of the algorithms. Similar to the results in Table 2, Dr-Fairness shows higher fairness than the pre-processing (1:1 data ratio) baselines (i.e., simple sampling, PairAug, and PairAug*) and higher accuracy than the in-processing baselines (i.e., fairness constraint, domain independence, and FairBatch). Here, as the pre-processing baselines are not explicitly designed for DP, their fairness performances in terms of DP are sometimes worse than the original ResNet50 results. Since DP aims to ensure the same positive prediction rates without considering the true labels, the training sometimes needs to overfit on the positive labels for specific groups to improve DP. Thus, simply equalizing the data ratio among groups may not be enough to improve DP compared to the EO case.

B.6 OTHER RESULTS ON CELEBA WITH DIFFERENT SETTINGS

Continuing from Sec. 3.1.1, we perform experiments on the CelebA dataset with different group and label combinations. Tables 8 and 9 show the accuracy and fairness performances on the following two settings: (y, z) = (smiling, gender) and (y, z) = (haircolor, gender). In both cases, we observe results consistent with those in Sec. 3.1.1, where Dr-Fairness achieves high fairness (esp. EO) while not sacrificing accuracy. We note that in these two settings, the bias amplification values of the non-fair baseline are already very small (i.e., good enough), so the fair algorithms may not further improve the bias amplification.

B.7 ACCURACY-FAIRNESS TRADEOFFS

Continuing from Sec. 3, we visually demonstrate the accuracy-fairness tradeoffs of Dr-Fairness and the baselines on the CelebA and ImageNet People Subtree datasets. Figure 7 shows the accuracy and unfairness performances of the baselines and Dr-Fairness. Here, being on the lower-right indicates

