RE-WEIGHTING BASED GROUP FAIRNESS REGULARIZATION VIA CLASSWISE ROBUST OPTIMIZATION

Abstract

Many existing group fairness-aware training methods aim to achieve group fairness either by re-weighting underrepresented groups based on certain rules or by using weakly approximated surrogates for the fairness metrics as regularization terms in the objective. Although each of these learning schemes has its own strength, in terms of applicability or performance, respectively, it is difficult for any method in either category to be considered a gold standard, since their successes are typically limited to specific cases. To that end, we propose a principled method, dubbed FairDRO, which unifies the two learning schemes by incorporating a well-justified group fairness metric into the training objective using a classwise distributionally robust optimization (DRO) framework. We then develop an iterative optimization algorithm that minimizes the resulting objective by automatically producing the correct re-weights for each group. Our experiments show that FairDRO is scalable and easily adaptable to diverse applications, and consistently achieves state-of-the-art performance on several benchmark datasets in terms of the accuracy-fairness trade-off, compared to recent strong baselines.

1. INTRODUCTION

Machine learning algorithms are increasingly used in various decision-making applications that have societal impact, e.g., crime assessment (Julia Angwin & Kirchner, 2016), credit estimation (Khandani et al., 2010), facial recognition (Buolamwini & Gebru, 2018; Wang et al., 2019), automated filtering in social media (Fan et al., 2021), AI-assisted hiring (Nguyen & Gatica-Perez, 2016) and law enforcement (Garvie, 2016). A critical issue in such applications is the potential discrepancy of model performance, e.g., accuracy, across different sensitive groups (e.g., race or gender) (Buolamwini & Gebru, 2018), which is easily observed in models trained with vanilla empirical risk minimization (ERM) (Valiant, 1984) when the training data has unwanted bias. To address such issues, fairness-aware learning has recently drawn attention in the AI research community. One of the objectives of fairness-aware learning is to achieve group fairness, which focuses on the statistical parity of the model prediction across sensitive groups. The so-called in-processing methods typically employ additional machinery for achieving group fairness during training. Depending on the type of machinery used, recent in-processing methods can be divided into two categories (Caton & Haas, 2020): regularization based methods and re-weighting based methods. Regularization based methods incorporate fairness-promoting terms into their loss functions. They can often achieve good performance by balancing accuracy and fairness, but can be applied only to certain types of model architectures or tasks, such as DNNs (e.g., MFD (Jung et al., 2021) or FairHSIC (Quadrianto et al., 2019)) or binary classification tasks (e.g., Cov (Zafar et al., 2017b)). On the other hand, re-weighting based methods are more flexible and can be applied to a wider range of models and tasks by adopting the simpler strategy of giving higher weights to samples from underrepresented groups.
However, most of them (e.g., LBC (Jiang & Nachum, 2020), RW (Kamiran & Calders, 2012), and FairBatch (Roh et al., 2020)) lack sound theoretical justification for enforcing group fairness and may perform poorly on some benchmark datasets. In this paper, we devise a new in-processing method, dubbed Fairness-aware Distributionally Robust Optimization (FairDRO), which takes advantage of both regularization and re-weighting based methods. The core of our method is to unify the two learning categories: namely, FairDRO incorporates a well-justified group fairness metric in the training objective as a regularizer, and optimizes the resulting objective function using a re-weighting based learning method. More specifically, we first show that a group fairness metric, Difference of Conditional Accuracy (DCA) (Berk et al., 2021), which is a natural extension of Equal Opportunity (Hardt et al., 2016) to the multi-class, multi-group label setting, is equivalent (up to a constant) to the average (over the classes) of the square roots of the variances of groupwise 0-1 losses. We then employ the Group DRO formulation, which uses the $\chi^2$-divergence ball including quasi-probabilities as the uncertainty set, for each class separately to convert the DCA (or variance) regularized group-balanced empirical risk minimization (ERM) into a more tractable minimax optimization. The inner maximizer in the converted optimization problem is then used as re-weights for the samples in each group, making a unified connection between the re-weighting and regularization based fairness-aware learning methods. Lastly, we develop an efficient iterative optimization algorithm, which automatically produces the correct (sometimes even negative) re-weights during the optimization process, in a more principled way than other re-weighting based methods.
In our experiments, we empirically show that FairDRO is scalable and easily adaptable to diverse application scenarios, including tabular (Julia Angwin & Kirchner, 2016; Dua et al., 2017), vision (Zhang et al., 2017), and natural language text (Koh et al., 2021) datasets. We compare with several representative in-processing baselines that apply either re-weighting schemes or surrogate regularizations for group fairness, and show that FairDRO consistently achieves state-of-the-art performance on all datasets in terms of the accuracy-fairness trade-off, thanks to leveraging the benefits of both kinds of fairness-aware learning methods.

2.1. IN-PROCESSING METHODS FOR GROUP FAIRNESS

Regularization based methods add penalty terms to their objective functions to promote group fairness. Due to the non-differentiability of the desired group fairness metrics, they use weaker surrogate regularization terms; e.g., Cov (Zafar et al., 2017b), Rényi (Baharlouei et al., 2020) and FairHSIC (Quadrianto et al., 2019) employ a covariance approximation, the Rényi correlation (Rényi, 1959) and the Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al., 2005) between the group label and the model as fairness constraints, respectively. Jung et al. (2021) devised MFD, which uses Maximum Mean Discrepancy (MMD) (Gretton et al., 2005) regularization for distilling knowledge from a teacher model and promoting group fairness at the same time. Although these methods can achieve high performance when the hyperparameters for the strengths of the regularization terms are well-tuned, they are sensitive to the choice of model architecture and task setting because they employ surrogate regularization terms; see Sec. 5 and Appendix C.2 for more details. Meanwhile, some other works (Agarwal et al., 2018; Cotter et al., 2019) used an equivalent constrained optimization framework to enforce group fairness instead of the regularization formulation. Namely, they consider a minimax problem of the Lagrangian function of the given constrained optimization problem and seek a saddle point through alternating updates of the model parameters and the Lagrangian variables. By doing so, they can successfully control the degree of fairness while maximizing the accuracy. However, their alternating optimization algorithms incur heavy computational costs due to the repeated full training of the model. Furthermore, we empirically observed that they fail to find a feasible solution when applied to complex tasks in the vision or NLP domains. A more detailed discussion is in Appendix B.
As alternative in-processing methods, several re-weighting based methods (Kamiran & Calders, 2012; Jiang & Nachum, 2020; Roh et al., 2020; Agarwal et al., 2018) have also been proposed to address group fairness. Kamiran & Calders (2012) proposed a re-weighting scheme (RW) based on the ratio of the number of data points in each group. More recently, Label Bias Correction (LBC) (Jiang & Nachum, 2020) and FairBatch (Roh et al., 2020) have been developed, which adaptively adjust weights and mini-batch sizes, respectively, based on the average loss per group. Perhaps the most similar work to ours is Agarwal et al. (2018), which demonstrates that a Lagrangian formulation of a fairness-constrained optimization problem can be reduced to a cost-sensitive classification problem and solved through a re-weighting learning scheme with relabeling of class labels. However, their method is limited to binary classification tasks, whereas our FairDRO, using the DRO framework, can be applied to more complex tasks with non-binary class labels.

2.2. DISTRIBUTIONALLY ROBUST OPTIMIZATION (DRO)

Distributionally robust optimization (DRO) (Ben-Tal et al., 2009) is a general robust optimization framework that was originally used for learning a model that is robust to potential test distribution shift from the training set, by optimizing for the worst-case distribution in an uncertainty set. The DRO framework is now widely adopted in other applications, including algorithmic fairness.

DRO in algorithmic fairness

Numerous works in the algorithmic fairness literature have recently utilized DRO frameworks in various settings. For example, a line of works (Jiang et al., 2020; Taskesen et al., 2020; Wang et al., 2021; Mandal et al., 2020) embeds DRO frameworks into the fairness-constrained optimization problem to make a fair model robust to distribution shifts of a test dataset. Another related line of works (Wang et al., 2020; Hashimoto et al., 2018) proposed DRO formulations for achieving fairness in more challenging settings such as learning with noisy group labels or sequential learning. Despite their effectiveness, neither of these lines of works uses DRO as a direct tool for achieving group fairness. Similarly to ours, Rezaei et al. (2020) and Yurochkin et al. (2019) adopt DRO frameworks for the purpose of achieving fairness. However, the key difference is that they either require the group labels at test time or target individual fairness. On the other hand, our FairDRO directly aims to achieve group fairness while not requiring group labels at test time. A more detailed discussion is in Appendix B.

DRO in other applications

Applications of DRO are actively studied in other contexts, e.g., image classification (Sagawa et al., 2020), multilingual machine translation (Oren et al., 2019; Zhou et al., 2021), long-tailed classification (Samuel & Chechik, 2021) or out-of-distribution generalization (Krueger et al., 2021; Xie et al., 2020). For example, Sagawa et al. (2020) proposed Group DRO, which constructs the uncertainty set of joint distributions over the class-group label pairs to withstand the spurious correlations between the class and group labels. Although these works are effective in their applications, the notion of statistical parity across groups (i.e., group fairness) has not been explicitly pursued as in our FairDRO.

3. NOTATIONS AND FAIRNESS CRITERION

Notations. We consider a classification problem where each sample consists of an input $x \in \mathcal{X}$, a class label $y \in \mathcal{Y}$ and a group label $a \in \mathcal{A}$. The group label is defined by sensitive attributes such as gender or race. Given a training dataset $\mathcal{D}$ with $N$ samples (i.e., $\mathcal{D} = \{(x_i, y_i, a_i)\}_{i=1}^{N}$), we aim to find a fair classifier $\theta_{\mathrm{FAIR}} : \mathcal{X} \to \mathcal{Y}$ satisfying the given fairness criterion while maintaining high classification accuracy. Moreover, given a loss function $\ell(\theta, (x, y)) : \Theta \times (\mathcal{X} \times \mathcal{Y}) \to \mathbb{R}$, we denote $L(\theta, \mathcal{D}) \triangleq \frac{1}{|\mathcal{D}|} \sum_{i=1}^{|\mathcal{D}|} \ell(\theta, (x_i, y_i))$. Additionally, $\mathcal{D}_a$ and $\mathcal{D}_a^y$ denote the subsets of $\mathcal{D}$ that are confined to the samples with group label $a$ or class-group label pair $(y, a)$, respectively.

Fairness criterion

The notion of fairness may depend on the point of view on how discrimination is defined (Hardt et al., 2016; Dwork et al., 2012; Chouldechova, 2017; Berk et al., 2021). For example, one may argue that a classifier is fair if its prediction does not depend on $a$ (Demographic Parity), while another may insist that the prediction should be conditionally independent of $a$ given the true class (Equalized Odds). In our work, we focus on Equalized Conditional Accuracy (ECA) (Berk et al., 2021), which naturally covers the multi-class, multi-group label setting. A classifier $\theta$ satisfies ECA if, for each given class, the accuracies of all groups are the same, i.e., $\forall a, a' \in \mathcal{A},\ y \in \mathcal{Y}$, $P(\hat{Y} = y \mid A = a, Y = y) = P(\hat{Y} = y \mid A = a', Y = y)$, where $\hat{Y}$ is the prediction of $\theta$. We measure the degree of unfairness with the Difference of Conditional Accuracy (DCA), denoted by $\Delta_{\mathrm{DCA}}$, by taking the average over $y$ of the maximum accuracy gaps among groups:

$$\Delta_{\mathrm{DCA}} \triangleq \frac{1}{|\mathcal{Y}|} \sum_{y \in \mathcal{Y}} \Delta_y, \qquad \Delta_y \triangleq \max_{a, a'} \left| P(\hat{Y} = y \mid A = a, Y = y) - P(\hat{Y} = y \mid A = a', Y = y) \right|. \tag{1}$$

Remark: We note that ECA has close connections with two popular group fairness criteria, Equalized Odds (EO) and Equal Opportunity (EOpp) (Hardt et al., 2016). EO requires conditional independence between $\hat{Y}$ and $A$ given $Y$, and hence we can easily observe that ECA is equivalent to EO in the binary classification setting. Furthermore, we argue that ECA is a natural extension of EOpp to the multi-class setting; namely, ECA requires equality of the true positive rate (TPR) among groups for each class label, which resembles the argument for justifying EOpp.
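To make the definition of DCA concrete, the following is a minimal NumPy sketch of the empirical version of the metric, computed from arrays of true labels, predictions and group labels. The function names (`conditional_accuracies`, `delta_dca`) and the toy arrays are illustrative assumptions, not part of the original method description; empty class-group cells are simply skipped, which is also an assumption.

```python
import numpy as np

def conditional_accuracies(y_true, y_pred, group, classes, groups):
    """acc[i, j]: fraction of samples with class classes[i] in group groups[j]
    that are predicted as classes[i] (NaN if the cell is empty)."""
    acc = np.full((len(classes), len(groups)), np.nan)
    for i, y in enumerate(classes):
        for j, a in enumerate(groups):
            mask = (y_true == y) & (group == a)
            if mask.any():
                acc[i, j] = np.mean(y_pred[mask] == y)
    return acc

def delta_dca(y_true, y_pred, group):
    """Empirical DCA: average over classes of the largest accuracy gap among groups."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    classes, groups = np.unique(y_true), np.unique(group)
    acc = conditional_accuracies(y_true, y_pred, group, classes, groups)
    gaps = np.nanmax(acc, axis=1) - np.nanmin(acc, axis=1)  # per-class max gap
    return float(np.mean(gaps))

# Toy example (hypothetical data): class 0 has accuracies 1.0 and 0.5 across the
# two groups, class 1 has accuracy 1.0 for both, so DCA = (0.5 + 0.0) / 2.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
group = np.array([0, 0, 1, 1, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1])
dca_example = delta_dca(y_true, y_pred, group)
```

Since the maximum over pairs of an absolute difference equals the maximum minus the minimum, the per-class gap is computed without enumerating all group pairs.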

4. FAIRNESS-AWARE DRO (FAIRDRO)

Our goal is to unify the regularization and re-weighting based methods by incorporating the DCA (1) in the objective function for learning. To that end, we first show in Sec. 4.1 that the empirical DCA is equivalent to the average (over the classes) of the square roots of the variances of groupwise 0-1 losses. Then, we utilize the previous result (Xie et al., 2020), which connects variance regularization with Group DRO, to obtain our training objective.

Proposition 1. Assume $\ell(\theta) = \mathbb{1}\{\theta(x) \neq y\}$. Then, the following inequalities hold for all $y \in \mathcal{Y}$:

$$\sqrt{2\,\mathrm{Var}\big(\{L(\theta, \mathcal{D}_a^y)\}_{a \in \mathcal{A}}\big)} \;\le\; \hat{\Delta}_y \;\le\; \sqrt{2|\mathcal{A}|^2\,\mathrm{Var}\big(\{L(\theta, \mathcal{D}_a^y)\}_{a \in \mathcal{A}}\big)}, \tag{2}$$

in which $\hat{\Delta}_y$ is the empirical version of $\Delta_y$ and $\mathrm{Var}(\{x_i\}_{i=1}^{d}) \triangleq \frac{1}{d} \sum_i \big(x_i - \frac{1}{d} \sum_i x_i\big)^2$ is the variance of the $d$ numbers $\{x_i\}_{i=1}^{d}$. Thus, $\hat{\Delta}_y$ of a classifier is 0 if and only if $\mathrm{Var}\big(\{L(\theta, \mathcal{D}_a^y)\}_{a \in \mathcal{A}}\big) = 0$.

The proof is in Appendix A. By taking the average over $y$ in (2), Proposition 1 implies that the empirical DCA is equivalent to the average (over $y$) of the square roots of the variances of the groupwise 0-1 losses, up to a constant. Building on this equivalence, we are now able to utilize the Group DRO formulation given in the next subsection to make a connection between DCA regularization and robust optimization.

Group DRO (Sagawa et al., 2020) minimizes the worst-case weighted sum of the groupwise losses,

$$\min_{\theta} \max_{q \in \Delta^{|\mathcal{A}|}} \sum_{a \in \mathcal{A}} q_a L(\theta, \mathcal{D}_a), \tag{3}$$

in which $\Delta^{|\mathcal{A}|}$ is the $|\mathcal{A}|$-simplex. The applications of Group DRO are actively studied in various contexts, e.g., image classification (Sagawa et al., 2020) or machine translation (Zhou et al., 2021). In particular, Xie et al. (2020) proposed Risk Variance Penalization (RVP), an OOD generalization method, using a variance regularization with respect to the risks of each domain (a formulation first introduced by V-REx (Krueger et al., 2021)). The authors showed the connection between the variance regularization and Group DRO when the uncertainty set is the following $\chi^2$-divergence ball with radius $\rho > 0$:

$$Q_\rho \triangleq \Big\{ q \in \mathbb{R}^{|\mathcal{A}|} : \sum_{a \in \mathcal{A}} q_a = 1,\ D_\phi\big(q \,\big\|\, \tfrac{1}{|\mathcal{A}|}\mathbf{1}\big) \le \rho \Big\}, \tag{4}$$

in which $D_\phi(p \| q) \triangleq \sum_i q_i\, \phi(p_i / q_i)$ with $\phi(t) = (t - 1)^2$, and $\frac{1}{|\mathcal{A}|}\mathbf{1} \in \mathbb{R}^{|\mathcal{A}|}$ denotes the uniform distribution.
Note that (4) does not have the nonnegativity constraint on $q$, and hence, depending on $\rho$, $Q_\rho$ can also include quasi-probabilities, which may have negative components. Namely, Xie et al. (2020) show the following lemma: the inner maximization of Group DRO with the uncertainty set $Q_\rho$ is equivalent to the group-balanced empirical loss plus a group-wise variance term as a regularizer.

Lemma 1 (Xie et al., 2020, Proposition 1). For any finite loss $\ell(\theta, (x, y))$ and $Q_\rho$, we have

$$\max_{q \in Q_\rho} \sum_{a \in \mathcal{A}} q_a L(\theta, \mathcal{D}_a) = \frac{1}{|\mathcal{A}|} \sum_{a} L(\theta, \mathcal{D}_a) + \sqrt{\rho\,\mathrm{Var}\big(\{L(\theta, \mathcal{D}_a)\}_{a \in \mathcal{A}}\big)}. \tag{5}$$

To be self-contained, the proof of the lemma is given in Appendix A.
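Lemma 1 can be sanity-checked numerically. The sketch below, which is only an illustration and not part of the original analysis, compares the closed-form right-hand side of (5) with a random search over the boundary of $Q_\rho$; the function names, the sample size and the example losses are assumptions made for this example. The search parameterizes $q = \frac{1}{|\mathcal{A}|}\mathbf{1} + v$ with $\sum_a v_a = 0$ and $\|v\|_2^2 = \rho / |\mathcal{A}|$, which is exactly the constraint $D_\phi = \rho$.

```python
import numpy as np

def worst_case_closed_form(losses, rho):
    """Right-hand side of (5): group-balanced loss plus sqrt(rho * variance)."""
    losses = np.asarray(losses, dtype=float)
    return float(losses.mean() + np.sqrt(rho * losses.var()))  # var uses 1/d

def worst_case_by_search(losses, rho, n_samples=20000, seed=0):
    """Random search over the boundary of the chi-square ball Q_rho.
    Each candidate is q = 1/k + v with sum(v) = 0 and ||v||^2 = rho / k."""
    losses = np.asarray(losses, dtype=float)
    k = losses.size
    rng = np.random.default_rng(seed)
    v = rng.normal(size=(n_samples, k))
    v -= v.mean(axis=1, keepdims=True)                      # enforce sum(v) = 0
    v *= np.sqrt(rho / k) / np.linalg.norm(v, axis=1, keepdims=True)
    q = 1.0 / k + v                                         # may contain negatives
    return float(np.max(q @ losses))

example_losses = [0.1, 0.3, 0.5]   # hypothetical groupwise losses
example_rho = 0.5
closed = worst_case_closed_form(example_losses, example_rho)
searched = worst_case_by_search(example_losses, example_rho)
```

Because a linear objective over a ball is maximized on its boundary, searching only the boundary suffices; the searched value can approach but never exceed the closed form.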

4.3. TRAINING OBJECTIVE OF FAIRDRO

To make the connection between Proposition 1 and Lemma 1 and realize the DCA regularization for fairness-aware training, we propose to apply Group DRO separately for each class $y$. Namely, we define our FairDRO as minimizing the average of the worst-case losses defined for each class $y$:

$$\theta_{\mathrm{FairDRO}} \triangleq \arg\min_{\theta} \frac{1}{|\mathcal{Y}|} \sum_{y \in \mathcal{Y}} \max_{q^y \in Q_\rho} \sum_{a \in \mathcal{A}} q_a^y L(\theta, \mathcal{D}_a^y), \tag{6}$$

in which $Q_\rho$ is as defined in (4), $q^y = \{q_a^y\}_{a \in \mathcal{A}}$ is a quasi-probability vector, and $\rho$ is a tunable hyperparameter. We stress that while applying Group DRO by treating each pair $(y, a)$ as a "group" has been considered in Sagawa et al. (2020), applying Group DRO separately for each class as in FairDRO has not been considered before. Now, we present our key result, which follows by combining Proposition 1 and Lemma 1.

Corollary 1. Assume $\ell(\theta) = \mathbb{1}\{\theta(x) \neq y\}$. Then, given a dataset $\mathcal{D}$, there is a corresponding positive constant $C_{\mathcal{D}}$ in the range of $[\sqrt{2}, \sqrt{2}|\mathcal{A}|]$ such that $\theta_{\mathrm{FairDRO}}$ also achieves

$$\min_{\theta} \frac{1}{|\mathcal{Y}||\mathcal{A}|} \sum_{(y, a)} L(\theta, \mathcal{D}_a^y) + \frac{\sqrt{\rho}}{C_{\mathcal{D}}}\, \hat{\Delta}_{\mathrm{DCA}}. \tag{7}$$

The corollary implies that when the 0-1 loss is used, FairDRO is a regularization based method, as described in Sec. 2.1, with the exact DCA serving as a regularizer. We emphasize that we do not solve the minimization in (7) directly. $\hat{\Delta}_{\mathrm{DCA}}$ can have high variance when estimated from a small mini-batch, particularly when the number of groups is large, and is hard to minimize due to its non-differentiability. Indeed, we empirically observe in Sec. 5.4 that directly solving (7), using an approximation of $\hat{\Delta}_{\mathrm{DCA}}$ with a differentiable cross-entropy loss, performs poorly. Instead, we solve the much more tractable minimax optimization in (6), where the maximization is a simple linear optimization and the minimization is over the re-weighted group losses. At this point, it is clear that while solving (6), FairDRO produces the re-weights $\{q_a^y\}$ for each $(y, a)$, which means it can also be categorized as a re-weighting based method.
The following proposition shows how the re-weights are characterized and controlled by the hyperparameter $\rho$.

Proposition 2. For $Q_\rho$ in (4) with $\rho > 0$, each element of the optimum $q^{y*}$ of the inner maximization of (6) can be obtained in closed form:

$$q_a^{y*} = \frac{1}{|\mathcal{A}|} + \sqrt{\frac{\rho}{|\mathcal{A}|}} \times \frac{L(\theta, \mathcal{D}_a^y) - \bar{L}^y}{\big\| \{L(\theta, \mathcal{D}_{a'}^y) - \bar{L}^y\}_{a' \in \mathcal{A}} \big\|_2}, \tag{8}$$

where $\bar{L}^y \triangleq \frac{1}{|\mathcal{A}|} \sum_{a' \in \mathcal{A}} L(\theta, \mathcal{D}_{a'}^y)$ is the average of the groupwise losses, and $q_a^{y*}$ is in the range of $\Big[\frac{1}{|\mathcal{A}|} - \frac{\sqrt{\rho(|\mathcal{A}| - 1)}}{|\mathcal{A}|},\ \frac{1}{|\mathcal{A}|} + \frac{\sqrt{\rho(|\mathcal{A}| - 1)}}{|\mathcal{A}|}\Big]$.

The proof is in Appendix A. Proposition 2 shows that some elements of $q^{y*}$ can be negative depending on the radius $\rho$ and the groupwise losses $L(\theta, \mathcal{D}_a^y)$. Moreover, it shows that the range of each element of the optimum $q^{y*}$ is determined by the radius $\rho$ and the number of groups $|\mathcal{A}|$, which means FairDRO controls the degree of group penalization by varying $\rho$. We will show in Sec. 5 that we can find a proper degree of penalization by tuning $\rho$, so that the accuracy-fairness trade-offs on various datasets are significantly improved. A crucial difference from typical re-weighting based methods is that our re-weights are obtained as the solution of an optimization problem, as opposed to those in Kamiran & Calders (2012) and Jiang & Nachum (2020), which are obtained heuristically from the ratio of the number of data points or the losses for each group. Also, we allow negative values for $q_a^y$ in $Q_\rho$, which enables penalizing the high-accuracy group more aggressively than typical re-weighting methods, which only use positive weights for each group $a$ and class $y$.
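The closed form of Proposition 2 is short enough to write down directly. The following NumPy sketch computes the re-weights for one class from a vector of groupwise losses; the function name `optimal_reweights`, the handling of the degenerate equal-loss case (returning uniform weights) and the example losses are assumptions made for illustration.

```python
import numpy as np

def optimal_reweights(group_losses, rho, eps=1e-12):
    """Closed-form inner maximizer for one class, following the form of (8)."""
    losses = np.asarray(group_losses, dtype=float)
    k = losses.size
    centered = losses - losses.mean()          # L(theta, D_a^y) minus group average
    norm = np.linalg.norm(centered)
    if norm < eps:
        # Assumption: when all group losses coincide, every feasible q attains
        # the same objective value, so we simply return the uniform weights.
        return np.full(k, 1.0 / k)
    return 1.0 / k + np.sqrt(rho / k) * centered / norm

def reweight_bounds(k, rho):
    """Range of each element of the optimum stated in Proposition 2."""
    half_width = np.sqrt(rho * (k - 1)) / k
    return 1.0 / k - half_width, 1.0 / k + half_width

# Hypothetical example with two groups: the low-loss (high-accuracy) group
# receives a negative weight once rho is large enough.
q_example = optimal_reweights([0.10, 0.40], rho=2.0)
lo_bound, hi_bound = reweight_bounds(2, 2.0)
```

With two groups the normalized deviation vector always has entries of equal magnitude, so both weights sit exactly at the ends of the stated range; with more groups they generally fall strictly inside it.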

4.4. AN EFFICIENT ITERATIVE OPTIMIZATION FOR FAIRDRO

A canonical way of solving the minimax optimization in DRO is to use a primal-dual method (Nemirovski et al., 2009), in which one alternates between a regular gradient descent on $\theta$ and an exponentiated gradient (EG) ascent (Kivinen & Warmuth, 1997) on $q$. However, the EG algorithm is designed for variables that lie in the probability simplex, since it is the special case of mirror descent in which the convex function for the Bregman divergence is the negative Shannon entropy. Thus, EG cannot be applied to solving (6), since our uncertainty set also allows quasi-probabilities for $q^y$. Hence, we employ the smoothed Iterated Best Response (IBR), a variant of the IBR recently used for solving a DRO problem in Zhou et al. (2021). With standard IBR for the inner maximization step, $q$ is updated to the closed-form solution of the maximization (i.e., (8)), instead of by a gradient ascent step on $q$. It is known that solving the minimax optimization using standard IBR has a convergence guarantee w.r.t. $\theta$ under some regularity assumptions (Zhou et al., 2021). However, this does not imply the convergence of $q^y$, and we empirically observed in Appendix D.6 that $q^{y*,t}$ oscillates unsteadily when the loss for each group fluctuates. To that end, we apply the learning rate schedule $\eta_q^t = 1 - \frac{t}{T}$, with $T$ being the total number of epochs, which results in a smooth update of $q^{y,t}$.
Namely, our overall update rules for $\theta$ and $q^y$ are

$$\theta^{t+1} \leftarrow \theta^{t} - \eta_\theta^t \nabla_\theta \Big[ \frac{1}{|\mathcal{Y}|} \sum_{y \in \mathcal{Y}} \sum_{a \in \mathcal{A}} q_a^{y,t} L(\theta^t, \mathcal{D}_a^y) \Big], \quad \text{(update } \theta\text{)} \tag{9}$$

$$q^{y,t+1} \leftarrow (1 - \eta_q^t)\, q^{y,t} + \eta_q^t\, q^{y*,t}, \quad \forall y \in \mathcal{Y}, \quad \text{(update } q^y\text{)} \tag{10}$$

in which $q^{y*,t} \triangleq \arg\max_{q^y \in Q_\rho} \sum_{a \in \mathcal{A}} q_a^y L(\theta^{t+1}, \mathcal{D}_a^y)$, and $\eta_\theta^t$ and $\eta_q^t$ denote the learning rates for $\theta$ and $q^y$, respectively. By applying the smoothed IBR in (10), we empirically observe improvements in group fairness and more stable convergence of $q^y$ (detailed results are given in Appendix D.6). We note that for (7) to hold, we use the 0-1 loss in (6), which makes our training objective non-differentiable with respect to $\theta$. Thus, in practice, we use the cross-entropy loss for the outer minimization (9) and the 0-1 loss for the inner maximization (10). We also note that $\theta$ is updated in a mini-batch manner, whereas $q$ is updated after computing the group losses over all training data points. Algorithm 1 summarizes the FairDRO optimization scheme.

Algorithm 1: FairDRO Iterative Optimization
Input: Dataset $\{\mathcal{D}_a^y\}_{y \in \mathcal{Y}, a \in \mathcal{A}}$, radius $\rho$, epoch $T$, iteration $I$, mini-batch size $B$
Randomly initialize $\theta^0$
Set $q^{y,0} \leftarrow \frac{1}{|\mathcal{A}|}\mathbf{1}$ for $y \in \mathcal{Y}$, in which $\mathbf{1} \in \mathbb{R}^{|\mathcal{A}|}$
for $t = 0$ to $T - 1$ do
    // θ-update
    for $i = 0$ to $I - 1$ do
        Sample a mini-batch $\{(x_j, y_j, a_j)\}_{j \in [B]}$, equally from $\mathcal{D}_a^y$ for each $(y, a) \in \mathcal{Y} \times \mathcal{A}$.
        Update $\theta^t$ with (9) using the mini-batch
    end
    // q-update
    Update $q^{y,t}$ for each $y \in \mathcal{Y}$ with (10)
end
Output: $\theta^T$, $\{q^{y,T}\}_{y \in \mathcal{Y}}$
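The alternation between the θ-update and the smoothed q-update can be sketched on a toy problem. The code below is a simplified illustration under several assumptions that differ from the setup described above: it uses a binary logistic-regression model, full-batch gradient steps instead of class-group balanced mini-batches, a constant learning rate, and synthetic data; the names `q_star` and `fairdro_toy` are hypothetical. As in the text, the cross-entropy loss drives the θ-update and the 0-1 loss drives the q-update.

```python
import numpy as np

def q_star(losses, rho):
    """Closed-form maximizer over the chi-square ball for one class (form of (8))."""
    k = losses.size
    centered = losses - losses.mean()
    norm = np.linalg.norm(centered)
    if norm < 1e-12:                 # equal losses: any feasible q is optimal
        return np.full(k, 1.0 / k)
    return 1.0 / k + np.sqrt(rho / k) * centered / norm

def fairdro_toy(X, y, a, rho=0.5, epochs=30, inner_steps=20, lr=0.5):
    """Toy smoothed-IBR loop for binary labels y in {0, 1} (illustrative only)."""
    classes, groups = np.unique(y), np.unique(a)
    w, b = np.zeros(X.shape[1]), 0.0
    q = np.full((classes.size, groups.size), 1.0 / groups.size)
    # cell[i][j] is the boolean mask of D_a^y; every cell is assumed non-empty.
    cell = [[(y == c) & (a == g) for g in groups] for c in classes]
    for t in range(epochs):
        # Per-sample weight q_a^y / (|Y| |D_a^y|), fixed during the theta-update.
        coef = np.zeros(y.size)
        for i in range(classes.size):
            for j in range(groups.size):
                coef[cell[i][j]] = q[i, j] / (classes.size * cell[i][j].sum())
        for _ in range(inner_steps):                    # theta-update
            logits = np.clip(X @ w + b, -30.0, 30.0)    # clip for numerical safety
            p = 1.0 / (1.0 + np.exp(-logits))
            grad_logit = coef * (p - y)                 # weighted cross-entropy gradient
            w -= lr * (X.T @ grad_logit)
            b -= lr * grad_logit.sum()
        pred = (X @ w + b > 0).astype(int)
        eta_q = 1.0 - t / epochs                        # schedule eta_q^t = 1 - t/T
        for i, c in enumerate(classes):                 # q-update with 0-1 losses
            losses = np.array([np.mean(pred[m] != c) for m in cell[i]])
            q[i] = (1.0 - eta_q) * q[i] + eta_q * q_star(losses, rho)
    return w, b, q

# Synthetic data (hypothetical): features shift with both the class and the group.
rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, n)
a = rng.integers(0, 2, n)
X = rng.normal(size=(n, 2)) + 1.5 * y[:, None] + 0.5 * a[:, None]
w, b, q = fairdro_toy(X, y, a)
```

Each row of `q` stays a quasi-probability vector summing to one, because the smoothed update is a convex combination of two vectors that each sum to one.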

5. EXPERIMENTAL RESULTS

In this section, we demonstrate the versatility of FairDRO by comparing it with various baseline methods on three modalities of datasets: (1) two tabular datasets, UCI Adult (Dua et al., 2017) and ProPublica COMPAS (Julia Angwin & Kirchner, 2016), (2) two vision datasets, UTKFace (Zhang et al., 2017) and CelebA (Liu et al., 2015) (in Appendix D.2), and (3) a natural language dataset, CivilComments-WILDS (Koh et al., 2021). We use logistic regression, ResNet18 (He et al., 2016), and a pre-trained BERT (Devlin et al., 2019; Wolf et al., 2019) as the classifier for each modality, respectively. For the evaluation, we plot the convex envelopes of the trade-offs between the accuracy and $\Delta_{\mathrm{DCA}}$ for varying hyperparameters, i.e., the Pareto frontiers. As a single-number evaluation metric, we also report in Appendix D.1 the best DCA attained while keeping at least 95% of the accuracy of the vanilla-trained model (Scratch). Since most datasets are severely skewed toward a specific group or class, we report the balanced accuracy over all combinations of groups and classes, as in Bellamy et al. (2018) (the issue is discussed further in Appendix D.7). Moreover, we give a thorough ablation study on our design choices in (6), namely, the classwise DRO formulation and the uncertainty set (4), in Sec. 5.4 and Appendix D.4, respectively. We further visualize the re-weights produced by FairDRO in Sec. 5.4. More implementation details are given in Appendix C.

Comparison methods. We compare FairDRO with vanilla training without any fairness constraint (Scratch) and nine in-processing fair-training methods: three re-weighting based methods (RW (Kamiran & Calders, 2012), LBC (Jiang & Nachum, 2020), FairBatch (Roh et al., 2020)), four regularization based methods (Cov (Zafar et al., 2017b), FairHSIC (Quadrianto et al., 2019), Rényi (Baharlouei et al., 2020), MFD (Jung et al., 2021)) and two methods employing constrained optimization (EGR (Agarwal et al., 2018), PL (Cotter et al., 2019)).
Since we consider a group-class balanced accuracy, we modified the objective functions of all baselines, including Scratch, such that they use the group-class balanced ERM loss rather than the standard ERM loss. We note that Cov, EGR and Rényi are task-dependent, and FairHSIC and MFD are model-dependent; namely, Cov and EGR are designed only for binary classification tasks, Rényi is for tasks with binary groups, and FairHSIC and MFD are for DNN-based classifiers. We also note that all baselines except RW were implemented to target EO. Considering that ECA is equivalent to EO only in binary classification, as mentioned in Section 3, we also report the difference of EO (DEO) for multi-class classification tasks in Appendix D.1. We provide more details on the implementations of the baselines in Appendix C.
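The group-class balanced accuracy used for evaluation can be illustrated with a few lines of NumPy. This is a minimal sketch under the assumption that the metric is the unweighted mean of the per-cell accuracies over all non-empty (class, group) pairs; the function name and the toy arrays are hypothetical.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred, group):
    """Unweighted mean of accuracies over all non-empty (class, group) cells."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    cell_accs = []
    for y in np.unique(y_true):
        for a in np.unique(group):
            mask = (y_true == y) & (group == a)
            if mask.any():
                cell_accs.append(np.mean(y_pred[mask] == y))
    return float(np.mean(cell_accs))

# Skewed toy data (hypothetical): the (class 0, group 0) cell dominates, so the
# plain accuracy hides the failure on the small (class 0, group 1) cell.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
group = np.array([0, 0, 0, 0, 0, 1, 0, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1])
plain_acc = float(np.mean(y_pred == y_true))
bal_acc = balanced_accuracy(y_true, y_pred, group)
```

In this toy case the plain accuracy is 7/8 while the balanced accuracy is 3/4, because the single misclassified sample makes up an entire cell.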

5.1. TABULAR DATASETS

Two tabular datasets, UCI Adult (Dua et al., 2017) (Adult) and ProPublica COMPAS (Julia Angwin & Kirchner, 2016) (COMPAS), are used for the benchmark. Both tasks are binary classification tasks (predicting whether the income of an individual exceeds $50K per year and whether a defendant re-offends, respectively) with binary sensitive groups ("Female" and "Male", and "Caucasian" and "Non-Caucasian", respectively). Following the pre-processing of Bellamy et al. (2018), the datasets include 45K and 5K rows, respectively. Fig. 1 shows the accuracy-fairness trade-offs for varying hyperparameters on Adult and COMPAS. We note that there is only a single score for RW because it has no controllable hyperparameter. In the figure, we observe that FairDRO lies closest to the top-right corner among all methods, showing the best trade-off, and even achieves the lowest level of DCA with only a slight loss of accuracy. This implies that FairDRO successfully finds better weights (i.e., the quasi-probability $q^y$) than other re-weighting based methods and employs a better regularizer than other regularization based methods. Furthermore, FairDRO can provide Pareto frontiers over a wider range than other re-weighting baselines by introducing negative weights.

5.2. VISION DATASETS

We also evaluate FairDRO on UTKFace (Zhang et al., 2017), a face dataset with multi-class and multi-group labels. UTKFace includes more than 20K images with age, gender and ethnicity labels. We divide the age labels into three classes ("0 to 19", "20 to 40", and "more than 40") and set them as the class labels. We also use four ethnic classes ("White", "Black", "Asian", and "Indian") as the group labels. The description and results for another vision dataset, CelebA (Liu et al., 2015), are given in Appendix D.2. From Fig. 2 (left), we confirm that FairDRO significantly improves the trade-off compared to most baselines. For MFD, we note that it provides a competitive Pareto frontier with higher accuracy than FairDRO, thanks to the knowledge distillation effect. However, MFD requires additional computation for training a teacher model, and its high performance is sensitive to the task domain (refer to the next result on the NLP task). We again observe that the re-weighting based baselines are ineffective in achieving fairness on UTKFace, while FairDRO achieves strong performance due to its systematic optimization of the group weights. We further note that the much lower accuracy of PL implies that its minimax optimization fails to produce good performance due to the difficulty of finding feasible saddle points. Since EO is not equivalent to ECA for non-binary target tasks, as noted above, we additionally report the difference of EO (DEO) for UTKFace in Appendix D.1 and show that FairDRO again outperforms the other baselines in DEO.

5.3. LANGUAGE DATASET

In order to show the versatility of FairDRO, we additionally consider a natural language dataset, CivilComments-WILDS (Koh et al., 2021). CivilComments is a large collection of comments on online articles taken from the Civil Comments platform. It comprises 450K comments, which are annotated with identity attributes by 10 crowdworkers and majority votes. The CivilComments task is to predict whether a given comment is toxic or not. We set ethnicity as the sensitive attribute and extract the subset of comments that contain identities w.r.t. ethnicity ("Black", "White", "Asian", "Latino" and "Others"). Note that the dataset has continuous values for the identities (between 0 and 1); hence, we binarized the continuous group values by setting the maximum to 1 and the others to 0, since the existing in-processing methods can only handle discrete group labels. The results are shown in Fig. 2 (right). Note that since a pre-trained BERT is fine-tuned for only a small number of epochs, the update of $q^y$ in FairDRO and of the weights in the re-weighting baselines is executed every 100 iterations instead of every epoch. We empirically observed that FairHSIC fails to converge due to exploding gradients. Consistent with the previous results, FairDRO achieves the best trade-off on CivilComments. Note that although MFD is very competitive on UTKFace, it is not effective on CivilComments.
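The binarization of continuous identity values described above can be sketched as a row-wise arg-max. The snippet is illustrative only: the function name, the example scores and the tie-breaking rule (the first maximum wins, which is the behaviour of `numpy.argmax`) are assumptions, since the text does not specify how ties are resolved.

```python
import numpy as np

def binarize_identities(scores):
    """Turn per-comment continuous identity scores (rows) into one-hot group
    labels by setting the largest entry of each row to 1 and the rest to 0."""
    scores = np.asarray(scores, dtype=float)
    onehot = np.zeros(scores.shape, dtype=int)
    # Assumption: ties are broken in favour of the first maximal column.
    onehot[np.arange(scores.shape[0]), scores.argmax(axis=1)] = 1
    return onehot

# Hypothetical scores for three comments over five ethnicity identities.
example_scores = [
    [0.8, 0.1, 0.0, 0.0, 0.1],
    [0.0, 0.2, 0.6, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.5, 0.2],
]
group_onehot = binarize_identities(example_scores)
group_index = group_onehot.argmax(axis=1)   # discrete group label per comment
```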

5.4. ANALYSIS

Ablation study. Fig. 3 shows the accuracy-fairness trade-off for each ablated version of FairDRO on each dataset. "FairDRO (w/o DRO)" directly solves (7) by approximating both terms in the objective with the cross-entropy loss; namely, the two terms are replaced with the group-class balanced average cross-entropy loss and the average (over $y$) of the maximum gaps of the groupwise cross-entropy losses. "RVP (w/ classwise)" is similar, but uses the average (over $y$) of the variances of the groupwise cross-entropy losses (as shown in (5)) as the regularization term instead of the direct approximation of $\hat{\Delta}_{\mathrm{DCA}}$. "FairDRO (w/o classwise)" defines only one uncertainty set over the $(y, a)$ pairs in (6), as an ablation of the classwise treatment. In Fig. 3, we note that RVP (w/ classwise) fails to converge on CivilComments due to exploding gradients. Our observations from the figures are as follows. First, we see that FairDRO achieves consistently better Pareto frontiers on all datasets than FairDRO (w/o DRO) and RVP (w/ classwise), which confirms that solving (7) by a re-weighting learning scheme via our classwise DRO formulation is essential for better performance and stable learning. Second, we clearly observe that FairDRO (w/o classwise) shows sub-optimal accuracy-DCA trade-offs (except on Adult), indicating that our classwise DRO formulation is critical for the best performance, since it makes the connection with the DCA regularization clearer. Further ablation studies on the choice of the uncertainty set are in Appendix D.4.

Visualization of $q^y$. We visualize the learning dynamics of the group weights ($q^y$), together with the training accuracies (Fig. 4 (a) and (b)) and test accuracies (Fig. 4 (c)), for Scratch and FairDRO on COMPAS. For simplicity, we report the result for $y = 1$ (the result for $y = 0$ is in Appendix D.5). In Fig. 4 (a), $q^1$ assigns the same value to each group, as Scratch employs the balanced ERM loss. At the top of the plot, Scratch shows a large discrepancy in accuracy between the two groups throughout training. On the other hand, Fig. 4 (b) shows that FairDRO adjusts $q^1$ by assigning low values to the high-accuracy group (Group 0) and high values to the low-accuracy group (Group 1). Note that $q^1$ initially fluctuates and even assigns negative values to the high-accuracy group, but eventually converges via the learning rate scheduling in (10). As a result, we clearly observe that both the training and test accuracy gaps between the groups smoothly become negligible as training continues. We believe this example clearly shows the effectiveness of introducing quasi-probabilities in $Q_\rho$, and of our FairDRO framework, in achieving small test accuracy gaps between the groups, i.e., a small $\Delta_{\mathrm{DCA}}$.

6. CONCLUDING REMARK

We proposed a theoretically well-justified in-processing method for achieving group fairness. To make a clear connection between re-weighting based and regularization based methods, FairDRO employs the classwise Group DRO framework with a $\chi^2$-divergence ball that includes quasi-probabilities as its uncertainty set. The FairDRO training objective then directly incorporates DCA as a regularizer and can be effectively solved by our proposed re-weighting optimization scheme. As a result, our DRO formulation gives FairDRO the strengths of both re-weighting based and regularization based methods. Indeed, our experiments showed that FairDRO is applicable to various settings and consistently achieves a superior accuracy-fairness trade-off on benchmark datasets.

APPENDIX

We provide additional materials in this document. We include the proofs of Lemma 1 and Propositions 1 and 2 in Appendix A, additional related work in Appendix B, and implementation details such as optimization, implementations of baselines, and hyperparameter search in Appendix C. Finally, we report additional experimental results in Appendix D.

A PROOFS

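Proof of Proposition 1. For brevity, we define $L^y \in \mathbb{R}^{|\mathcal{A}| \times |\mathcal{A}|}$ by

$$L^y_{i,j} \triangleq \mathcal{L}(\theta, \mathcal{D}^y_i) - \mathcal{L}(\theta, \mathcal{D}^y_j), \quad 1 \le i, j \le |\mathcal{A}|.$$

If we denote $\Lambda^y \in \mathbb{R}^{|\mathcal{A}|^2}$ as $\Lambda^y \triangleq (L^y_{1,1}, \ldots, L^y_{|\mathcal{A}|,1}, L^y_{1,2}, \ldots, L^y_{|\mathcal{A}|,2}, \ldots, L^y_{|\mathcal{A}|,|\mathcal{A}|})$, then we have the following chain of equalities:

$$\mathrm{Var}\big(\{\mathcal{L}(\theta, \mathcal{D}^y_a)\}_{a \in \mathcal{A}}\big) = \frac{1}{|\mathcal{A}|} \sum_{a} \mathcal{L}(\theta, \mathcal{D}^y_a)^2 - \Big( \frac{1}{|\mathcal{A}|} \sum_{a} \mathcal{L}(\theta, \mathcal{D}^y_a) \Big)^2 = \frac{1}{2|\mathcal{A}|^2} \sum_{a} \sum_{a'} \big( \mathcal{L}(\theta, \mathcal{D}^y_a) - \mathcal{L}(\theta, \mathcal{D}^y_{a'}) \big)^2 = \frac{1}{2|\mathcal{A}|^2} \|\Lambda^y\|_2^2.$$

Since we consider $\ell$ to be the zero-one loss function, we have the following:

$$\Delta^y \triangleq \max_{a,a'} \Big| \frac{1}{|\mathcal{D}^y_a|} \sum_{i=1}^{|\mathcal{D}^y_a|} \mathbb{1}\{\theta(x_i) \ne y\} - \frac{1}{|\mathcal{D}^y_{a'}|} \sum_{i=1}^{|\mathcal{D}^y_{a'}|} \mathbb{1}\{\theta(x_i) \ne y\} \Big| = \max_{a,a'} \big| \mathcal{L}(\theta, \mathcal{D}^y_a) - \mathcal{L}(\theta, \mathcal{D}^y_{a'}) \big| = \|\Lambda^y\|_\infty.$$

Since $\|\Lambda^y\|_\infty = \max_i |\Lambda^y_i| \le \sqrt{\sum_i (\Lambda^y_i)^2} = \|\Lambda^y\|_2$, we obtain the second inequality of (2). Note also that

$$\|\Lambda^y\|_2 = \sqrt{\sum_i (\Lambda^y_i)^2} \le \sqrt{\sum_i \|\Lambda^y\|_\infty^2} = \sqrt{|\mathcal{A}|^2 \|\Lambda^y\|_\infty^2} = |\mathcal{A}| \, \|\Lambda^y\|_\infty.$$

We thereby obtain the first inequality of (2). From (2), we obtain $\Delta^y = 0$ if $\mathrm{Var}(\{\mathcal{L}(\theta, \mathcal{D}^y_a)\}_{a \in \mathcal{A}}) = 0$ for all $y \in \mathcal{Y}$. Conversely, since the variance is always non-negative, (2) also gives the necessary condition that $\Delta^y = 0$ implies $\mathrm{Var}(\{\mathcal{L}(\theta, \mathcal{D}^y_a)\}_{a \in \mathcal{A}}) = 0$ for all $y \in \mathcal{Y}$.

Proof of Lemma 1. Suppose that a model parameter $\theta$ is given. Let $v^y \triangleq q^y - \frac{1}{|\mathcal{A}|} \mathbf{1}$, where $\mathbf{1} \in \mathbb{R}^{|\mathcal{A}|}$ is the all-ones vector, and write $\bar{\mathcal{L}}^y \triangleq \sum_{a' \in \mathcal{A}} \frac{1}{|\mathcal{A}|} \mathcal{L}(\theta, \mathcal{D}^y_{a'})$ for the average groupwise loss. From $\sum_{a \in \mathcal{A}} q^y_a = 1$, we have $\sum_{a \in \mathcal{A}} v^y_a = 0$. Considering

$$D_\phi\Big( q^y \,\Big\|\, \frac{1}{|\mathcal{A}|} \mathbf{1} \Big) = \sum_{a \in \mathcal{A}} \frac{1}{|\mathcal{A}|} \big( |\mathcal{A}| q^y_a - 1 \big)^2 = \sum_{a \in \mathcal{A}} |\mathcal{A}| \Big( q^y_a - \frac{1}{|\mathcal{A}|} \Big)^2 = |\mathcal{A}| \sum_{a \in \mathcal{A}} (v^y_a)^2,$$

the following problems are equivalent:

$$\max_{q^y \in Q_\rho} \sum_{a \in \mathcal{A}} q^y_a \mathcal{L}(\theta, \mathcal{D}^y_a) = \max_{v^y} \; \bar{\mathcal{L}}^y + \sum_{a \in \mathcal{A}} v^y_a \big( \mathcal{L}(\theta, \mathcal{D}^y_a) - \bar{\mathcal{L}}^y \big) \quad \text{subject to} \quad \sum_{a \in \mathcal{A}} v^y_a = 0, \;\; \|v^y\|_2^2 \le \frac{\rho}{|\mathcal{A}|}.$$

Also, we have the following inequalities:

$$\sum_{a \in \mathcal{A}} v^y_a \big( \mathcal{L}(\theta, \mathcal{D}^y_a) - \bar{\mathcal{L}}^y \big) \le \|v^y\|_2 \, \Big\| \big( \mathcal{L}(\theta, \mathcal{D}^y_{a}) - \bar{\mathcal{L}}^y \big)_{a \in \mathcal{A}} \Big\|_2 \le \sqrt{\rho \, \mathrm{Var}\big(\{\mathcal{L}(\theta, \mathcal{D}^y_a)\}_{a \in \mathcal{A}}\big)}. \tag{A.2}$$

Thus, Lemma 1 holds, with the maximum attained if and only if

$$q^y_a = \frac{1}{|\mathcal{A}|} + \sqrt{\frac{\rho}{|\mathcal{A}|}} \times \frac{\mathcal{L}(\theta, \mathcal{D}^y_a) - \bar{\mathcal{L}}^y}{\big\| \big( \mathcal{L}(\theta, \mathcal{D}^y_{a'}) - \bar{\mathcal{L}}^y \big)_{a' \in \mathcal{A}} \big\|_2}, \quad \text{for all } a \in \mathcal{A}.$$

Proof of Proposition 2. From the proof of Lemma 1, the equality in (5) holds if and only if

$$q^y_a = \frac{1}{|\mathcal{A}|} + \sqrt{\frac{\rho}{|\mathcal{A}|}} \times \frac{\mathcal{L}(\theta, \mathcal{D}^y_a) - \bar{\mathcal{L}}^y}{\big\| \big( \mathcal{L}(\theta, \mathcal{D}^y_{a'}) - \bar{\mathcal{L}}^y \big)_{a' \in \mathcal{A}} \big\|_2}, \quad \text{for all } a \in \mathcal{A}.$$

We now consider each element of the maximizer $q^{y*}$. For given $y \in \mathcal{Y}$, we define $\gamma^y \in \mathbb{R}^{|\mathcal{A}|}$ as follows:

$$\gamma^y_a \triangleq \frac{\mathcal{L}(\theta, \mathcal{D}^y_a) - \bar{\mathcal{L}}^y}{\big\| \big( \mathcal{L}(\theta, \mathcal{D}^y_{a'}) - \bar{\mathcal{L}}^y \big)_{a' \in \mathcal{A}} \big\|_2}, \quad \text{for any } a \in \mathcal{A}.$$

By symmetry, fix any $i \in \mathcal{A}$. Then, we have:

$$\sum_{a \in \mathcal{A}} \gamma^y_a = 0, \text{ and thus, } \gamma^y_i = -\sum_{a \in \mathcal{A} \setminus i} \gamma^y_a, \tag{A.3}$$

$$\sum_{a \in \mathcal{A}} (\gamma^y_a)^2 = 1, \text{ and thus, } (\gamma^y_i)^2 = 1 - \sum_{a \in \mathcal{A} \setminus i} (\gamma^y_a)^2. \tag{A.4}$$

By the Cauchy-Schwarz inequality,

$$|\gamma^y_i| = \big| (1, \ldots, 1) \cdot (\gamma^y_1, \ldots, \gamma^y_{i-1}, \gamma^y_{i+1}, \ldots, \gamma^y_{|\mathcal{A}|}) \big| \le \sqrt{|\mathcal{A}| - 1} \cdot \sqrt{(\gamma^y_1)^2 + \cdots + (\gamma^y_{i-1})^2 + (\gamma^y_{i+1})^2 + \cdots + (\gamma^y_{|\mathcal{A}|})^2} = \sqrt{|\mathcal{A}| - 1} \cdot \sqrt{1 - (\gamma^y_i)^2},$$

in which the first and last equalities follow from (A.3) and (A.4), respectively. From the above inequality, we have the following:

$$(\gamma^y_i)^2 \le (|\mathcal{A}| - 1) \cdot \big( 1 - (\gamma^y_i)^2 \big), \qquad |\mathcal{A}| (\gamma^y_i)^2 \le |\mathcal{A}| - 1, \qquad |\gamma^y_i| \le \sqrt{\frac{|\mathcal{A}| - 1}{|\mathcal{A}|}}. \tag{A.5}$$

Considering (A.3), the equality in (A.5) is attained if and only if

$$\gamma^y_i = \pm \sqrt{\frac{|\mathcal{A}| - 1}{|\mathcal{A}|}}, \qquad \gamma^y_a = \mp \frac{1}{\sqrt{|\mathcal{A}| (|\mathcal{A}| - 1)}}, \quad \forall a \in \mathcal{A} \setminus i.$$

Thus, we can derive the range described in Proposition 2.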
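The closed-form weights from the proof of Lemma 1 can be checked numerically. The following is a minimal sketch, not the authors' implementation: the groupwise losses and the value of ρ are hypothetical, and the scale factor is read as sqrt(ρ/|A|), which is what the constraint ‖v‖² ≤ ρ/|A| implies. It also checks the norm relations used in the proof of Proposition 1 and the element-wise bound from (A.5).

```python
import math

def group_mean(losses):
    return sum(losses) / len(losses)

def group_variance(losses):
    # Population variance over groups, matching Var({L_a}) in the proofs.
    mean = group_mean(losses)
    return sum((l - mean) ** 2 for l in losses) / len(losses)

def closed_form_weights(losses, rho):
    """q_a = 1/|A| + sqrt(rho/|A|) * (L_a - mean) / ||L - mean||_2."""
    n = len(losses)
    mean = group_mean(losses)
    centered = [l - mean for l in losses]
    norm = math.sqrt(sum(c * c for c in centered))
    if norm == 0.0:
        # All groups tie, so every feasible q gives the same value; use uniform.
        return [1.0 / n] * n
    scale = math.sqrt(rho / n)
    return [1.0 / n + scale * c / norm for c in centered]

# Hypothetical groupwise losses for one class y, and a hypothetical rho.
losses = [0.10, 0.25, 0.40]
rho = 1.0
n = len(losses)

q = closed_form_weights(losses, rho)
worst_case = sum(qa * la for qa, la in zip(q, losses))
closed_form = group_mean(losses) + math.sqrt(rho * group_variance(losses))

# Norm relations from the proof of Proposition 1:
# sqrt(2 Var) <= max gap <= |A| * sqrt(2 Var).
gap = max(losses) - min(losses)
lower = math.sqrt(2.0 * group_variance(losses))
upper = n * lower

# Element-wise bound implied by (A.5): |q_a - 1/|A|| <= sqrt(rho/|A|) * sqrt((|A|-1)/|A|).
radius = math.sqrt(rho / n) * math.sqrt((n - 1) / n)

print("weights:", q)
print("worst-case loss:", worst_case, "mean + sqrt(rho * Var):", closed_form)
```

With these inputs the lowest-loss group receives a negative weight, which is only possible because the uncertainty set admits quasi-probabilities.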

B ADDITIONAL RELATED WORKS

B.1 FAIRNESS-CONSTRAINED OPTIMIZATION

As mentioned in Sec. 2.1, some works (Agarwal et al., 2018; Cotter et al., 2019) employ a constrained optimization framework to enforce group fairness. They formulate a minimax problem over the Lagrangian of the given constrained optimization problem and find a saddle point by alternating the outer minimization step w.r.t. the model parameters and the inner maximization step w.r.t. the Lagrangian variables. When updating the model parameters, they employ a convex upper-bounding surrogate of the 0-1 loss, such as the hinge loss or the cross-entropy loss, to address the non-differentiability of the fairness constraints. Further, Cotter et al. (2019) employ a non-zero-sum formulation that uses the surrogates for updating the model parameters and the original 0-1 loss for updating the Lagrangian dual variables.

Although both Cotter et al. (2019) and our method solve minimax problems with 0-1 loss terms by using surrogate losses in the minimization step (w.r.t. the model parameters) and the original 0-1 loss in the maximization step, we emphasize that FairDRO addresses a different kind of minimax problem from that of Cotter et al. (2019). Namely, their approach solves a minimax problem involving both model parameters and Lagrangian variables, while we focus only on the minimization w.r.t. the model parameters with a fixed ρ (which corresponds to a fixed Lagrangian variable). Based on our theoretical result that this minimization problem can be formulated as another minimax problem within the DRO framework, FairDRO solves the resulting minimax problem with the non-zero-sum formulation.

Furthermore, we argue that our surrogates are more advantageous in terms of bounding the original objective function. The algorithm of Cotter et al. (2019) typically builds surrogates of the group fairness constraints from an upper bound of the 0-1 loss, such as the hinge loss or the cross-entropy loss. However, the fairness constraint terms modified with such surrogates are not necessarily sharp upper bounds of the original constraints, e.g., in the multi-class classification setting. This is because group fairness constraints, such as EO or ECA, are typically defined as gaps between averaged 0-1 losses that measure the parity over groups, and upper-bounding each averaged loss does not, in general, bound the gap between them. In contrast, if we use the cross-entropy loss as a surrogate in equation (6), we obtain a valid convex upper bound of (7) whenever $q^y$ is a maximizer of the inner maximization problem and all entries of $q^y$ are positive. Thus, we expect FairDRO to outperform PL (Cotter et al., 2019) by solving the original objective function more faithfully, which is consistent with our experimental results.

B.2 ROBUST OPTIMIZATION IN ALGORITHMIC FAIRNESS

Many recent works on algorithmic fairness have utilized DRO frameworks for their own goals. One line of work embeds DRO into fairness-constrained optimization problems to make a model both fair and robust to distribution shifts in the test dataset (Jiang et al., 2020; Taskesen et al., 2020; Wang et al., 2021; Mandal et al., 2020). For example, Mandal et al. (2020) aimed to train classifiers that remain fair with respect to weighted perturbations of the training distribution by considering all possible weight sets over the training samples. Wang et al. (2021) and Taskesen et al. (2020) proposed DRO objectives with Wasserstein uncertainty sets combined with group fairness constraints. Although all the aforementioned methods showed improvements in group fairness and robustness, the key differences from FairDRO are that we leverage the DRO framework to directly promote group fairness, not robustness, and consider an uncertainty set over groups, not samples.

Some works proposed DRO formulations for achieving fairness in more challenging settings, such as learning with imperfect group labels or sequential learning. Wang et al. (2020) showed how to achieve group fairness given only noisy group labels by considering the worst-case distribution of group labels. Hashimoto et al. (2018) devised repeated loss minimization using a DRO-based training objective with the $\chi^2$-divergence uncertainty set to encourage long-term fairness in a sequential setting where the group populations change over time. However, these works also do not use DRO as a direct tool for achieving group fairness.

There are some recent works that are perhaps closest to our method in terms of using DRO to achieve fairness. Rezaei et al. (2020) incorporated a group fairness constraint into a statistic-matching based robust log-loss minimization problem for training fair logistic regression models. Although they showed nice theoretical benefits in terms of convexity and convergence, their method has the limitation that group information is needed at test time. Yurochkin et al. (2019) aim to achieve the fairness of a classifier using DRO frameworks. However, they focus only on an individual fairness notion called distributionally robust fairness (DRF), not on group fairness.

B.3 MINIMAX OPTIMIZATION IN ALGORITHMIC FAIRNESS

Several works adopted min-max optimization techniques for group fairness (although they are not based on DRO frameworks). Zhang et al. (2018) proposed an adversarial debiasing technique that eliminates group information of latent features of a neural network (NN) model through the minimax game of an adversary and the NN model. Baharlouei et al. (2020) and Jiang et al. (2020) proposed novel regularization terms for group fairness using Rényi correlation and Wasserstein distance respectively, which lead to min-max formulations. We emphasize that these methods rely on additional terms or adversaries for promoting group fairness, while our method directly uses the DRO-based min-max formulation as a fairness regularization term.

C MORE IMPLEMENTATION DETAILS

We used PyTorch (Paszke et al., 2019). Experiments were performed on a server with AMD Ryzen Threadripper PRO 3975WX CPUs and NVIDIA RTX A5000 GPUs. All models are evaluated on separate test sets, and all experiments are repeated with 4 different random seeds. All reported results are averages over these seeds.

C.1 OPTIMIZATION

For tabular and vision datasets, we train all models with the AdamW optimizer (Loshchilov & Hutter, 2019) for 70 epochs. We set the mini-batch size and the weight decay as 128 and 0.001, respectively. The initial learning rate is set as 0.001 and decayed by cosine annealing technique (Loshchilov & Hutter, 2017) . For the language dataset, we fine-tune pre-trained BERT with the AdamW optimizer for 3 epochs. We set the mini-batch size and the weight decay as 24 and 0.001, respectively. The initial learning rate is set as 0.00002 and adjusted with a learning rate schedule using a warm-up phase followed by a linear decay. All results are reported for the model at the last epoch.
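The two learning-rate schedules described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions rather than the training code: the minimum learning rate of the cosine schedule (`min_lr = 0`) and the number of warm-up steps (`warmup_steps`) are not given in the text and are hypothetical, as are the function names.

```python
import math

def cosine_annealing_lr(epoch, total_epochs=70, base_lr=1e-3, min_lr=0.0):
    """Cosine annealing without restarts; min_lr is an assumed value."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)

def warmup_linear_decay_lr(step, total_steps, base_lr=2e-5, warmup_steps=100):
    """Linear warm-up to base_lr, then linear decay to zero.

    warmup_steps is hypothetical; the text only says a warm-up phase is used.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / (total_steps - warmup_steps))

# Learning rate at a few epochs of the 70-epoch tabular/vision schedule.
for epoch in (0, 35, 70):
    print(epoch, cosine_annealing_lr(epoch))

# Learning rate at a few steps of a hypothetical 1000-step fine-tuning run.
for step in (0, 50, 100, 550, 1000):
    print(step, warmup_linear_decay_lr(step, total_steps=1000))
```

In an actual PyTorch run, the same shapes would typically come from the library's built-in schedulers rather than hand-written functions.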

C.2 IMPLEMENTATION DETAILS OF IN-PROCESSING BASELINES

Here, we describe the implementation details of each in-processing baseline used in the experiments. The original LBC (Jiang & Nachum, 2020) requires multiple full-training iterations, alternately re-weighting each group based on the given fairness metric and re-training on the full dataset. Since this optimization procedure needs a very high computation budget, we modify the full-training iterations to 5 epochs and 14 iterations for vision and language datasets. FairBatch (Roh et al., 2020) and RW (Kamiran & Calders, 2012) are implemented in the way they were originally proposed.

Cov (Zafar et al., 2017b) utilizes a fairness constraint based on the covariance between the group label and the signed distance of the feature vectors from the decision boundary of a classifier. Although they used the disciplined convex-concave programming (DCCP) (Shen et al., 2016) solver for the constrained problem, we apply our optimization procedure (i.e., AdamW) by setting the covariance-based constraint as a regularization term. We note that extending Cov to multi-class classification tasks is non-trivial because the signed distance is not obviously defined for a multi-class decision boundary.

MFD (Jung et al., 2021) and FairHSIC (Quadrianto et al., 2019) use additional fairness-promoting regularization terms based on MMD and HSIC, respectively. For FairHSIC, we only implement the second term of their decomposition loss (i.e., the HSIC loss between the feature representations and the group labels). For implementing their regularization terms, we use the Gaussian RBF kernel whose variance parameter is set to the mean of the squared distances between all data points, following Jung et al. (2022). Since the regularization terms used in both MFD and FairHSIC are applied to the feature vectors of a DNN model, they are limited to DNN-based models.

Baharlouei et al. (2020) attempted to remove the non-linear dependency between the model prediction and the sensitive attribute by using the Rényi correlation as a regularization term. The resulting training objective is a minimax optimization problem where the maximization computes the Rényi correlation and the minimization learns a model. When the group label is binary, we use a modified version of Algorithm 2 in Baharlouei et al. (2020), where the minimization step (i.e., line 4 in Algorithm 2) is processed in a mini-batch manner. Although Algorithm 1 is designed for non-binary group labels, applying the algorithm to a task with non-binary group labels is not straightforward because the second term in the RHS of equation 6 is a biased estimator under mini-batch optimization.
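The kernel bandwidth heuristic used for the MFD and FairHSIC regularizers can be illustrated as follows. This is a sketch under assumptions, not the implementation used in the experiments: the kernel is taken to be exp(-‖u - v‖² / (2σ²)), the mean is taken over distinct pairs of points, and the feature vectors are hypothetical. In practice the same computation would be done on mini-batches of DNN features.

```python
import math

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean_squared_distance(points):
    """Mean of squared Euclidean distances over distinct ordered pairs (assumed convention)."""
    n = len(points)
    total = sum(
        squared_distance(points[i], points[j])
        for i in range(n)
        for j in range(n)
        if i != j
    )
    return total / (n * (n - 1))

def rbf_kernel_matrix(points):
    """Gaussian RBF kernel with variance set to the mean squared distance."""
    sigma2 = mean_squared_distance(points)
    return [
        [math.exp(-squared_distance(u, v) / (2.0 * sigma2)) for v in points]
        for u in points
    ]

# Hypothetical 2-dimensional feature vectors.
features = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
sigma2 = mean_squared_distance(features)
kernel = rbf_kernel_matrix(features)
print("variance parameter:", sigma2)
for row in kernel:
    print(row)
```

Tying the variance to the data scale keeps kernel values away from the degenerate regimes where all entries are close to 0 or close to 1.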

C.3 HYPERPARAMETERS FOR MAIN RESULTS

We perform the grid search on the hyperparameter candidates for every method. The full hyperparameter search space is illustrated in Tab. C.1.

C.4 MODEL SELECTION RULE FOR A SINGLE EVALUATION METRIC

The accuracy-fairness trade-off may exist in many real-world applications (Dutta et al., 2020); better accuracy leads to worse fairness, and vice versa. For example, Fig. 1 shows the trade-off on the Adult and COMPAS datasets. Therefore, when a single evaluation metric is used, the model should be carefully selected for a fair evaluation. To that end, we vary the hyperparameters (HPs) of each method that control the balance between accuracy and fairness.

We also emphasize that although the accuracy of FairDRO is slightly worse than that of RW and Scratch on COMPAS, FairDRO shows significantly better DCA (5.20) compared to RW (14.87) and Scratch (14.51). In Tab. D.2, we observe results consistent with Tab. D.1. We again note that we marked the results of Cov and FairHSIC as N/A because Cov is designed only for binary classification and FairHSIC fails to converge due to exploding gradients. We again observe that FairDRO achieves the best level of the fairness metric (DCA) compared to the other methods.

We also provide the difference of Equalized Odds (DEO) on UTKFace, denoted by $\Delta_{\mathrm{DEO}}$, which is defined as follows:

$$\Delta_{\mathrm{DEO}} \triangleq \frac{1}{|\mathcal{Y}|^2} \sum_{y, y' \in \mathcal{Y}} \max_{a, a'} \big| P(\hat{Y} = y' \mid A = a, Y = y) - P(\hat{Y} = y' \mid A = a', Y = y) \big|.$$

Note that $\Delta_{\mathrm{DCA}}$ and $\Delta_{\mathrm{DEO}}$ are the same for binary labels. Although LBC, FairBatch, and FairHSIC were devised for achieving EO, we again show that FairDRO outperforms the baselines on UTKFace with respect to DEO.
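The DEO definition above, and its agreement with DCA for binary labels, can be illustrated on a toy example. This is a minimal sketch with hypothetical data; it assumes that $\Delta_{\mathrm{DCA}}$ is the average over classes of the largest between-group gap in classwise accuracy, which is how the metric is described in the ablation study, and the function names are illustrative.

```python
from itertools import product

def conditional_rate(records, y, a, y_pred):
    """Empirical P(Yhat = y_pred | A = a, Y = y) from (y, a, yhat) records."""
    cell = [r for r in records if r[0] == y and r[1] == a]
    return sum(1 for r in cell if r[2] == y_pred) / len(cell)

def max_group_gap(records, y, y_pred, groups):
    """max over (a, a') of |rate_a - rate_a'|, i.e. max rate minus min rate."""
    rates = [conditional_rate(records, y, a, y_pred) for a in groups]
    return max(rates) - min(rates)

def delta_dca(records, classes, groups):
    # Average over y of the largest gap in P(Yhat = y | A = a, Y = y).
    return sum(max_group_gap(records, y, y, groups) for y in classes) / len(classes)

def delta_deo(records, classes, groups):
    # Average over all (y, y') pairs, following the definition above.
    total = sum(
        max_group_gap(records, y, y_pred, groups)
        for y, y_pred in product(classes, classes)
    )
    return total / len(classes) ** 2

# Hypothetical (true label, group, prediction) triples; every (y, a) cell has 4 samples.
records = (
    [(0, 0, 0)] * 3 + [(0, 0, 1)] * 1
    + [(0, 1, 0)] * 2 + [(0, 1, 1)] * 2
    + [(1, 0, 1)] * 3 + [(1, 0, 0)] * 1
    + [(1, 1, 1)] * 1 + [(1, 1, 0)] * 3
)
classes, groups = [0, 1], [0, 1]
dca = delta_dca(records, classes, groups)
deo = delta_deo(records, classes, groups)
print("DCA:", dca, "DEO:", deo)
```

For binary labels the off-diagonal rates are one minus the diagonal rates, so each gap appears twice in the DEO sum and the two metrics coincide; with more than two classes they generally differ.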
| Method | Accuracy (↑) | $\Delta_{\mathrm{DCA}}$ (↓) |
| --- | --- | --- |
| RW (Kamiran & Calders, 2012) | 77.60 (±0.68) | 15.49 (±1.41) |
| LBC (Jiang & Nachum, 2020) | 88.96 (±0.97) | 6.67 (±2.39) |
| FairBatch (Roh et al., 2020) | 83.85 (±0.70) | 10.62 (±0.86) |
| Cov (Zafar et al., 2017a) | 90.10 (±0.25) | 7.57 (±0.84) |
| Rényi (Baharlouei et al., 2020) | 86.22 (±0.79) | 10.62 (±1.10) |
| FairHSIC (Quadrianto et al., 2019) | 86.63 (±0.64) | 9.10 (±3.19) |
| MFD (Jung et al., 2021) | 90.97 (±0.35) | 4.44 (±1.13) |
| PL (Cotter et al., 2019) | 83.23 (±1.06) | |

To show consistent improvements, we conducted an additional experiment on the CelebA dataset, which includes about 200K celebrity face images with 40 annotated binary attributes. Among the attributes, we select "blond hair" as the class label and "gender" as the group label, a setting widely used in previous works (Sagawa et al., 2020; Jung et al., 2022; Chuang & Mroueh, 2021; Liu et al., 2021).

We continue from Sec. 5.4 and provide a further study of FairDRO. We analyze the effect of each constraint employed by FairDRO. Tab. D.5 shows the overall accuracy, $\Delta_{\mathrm{DCA}}$, and the worst accuracy over groups for each ablation case. "Classwise", "$\chi^2$-divergence", and "Quasi-prob" (4) indicate each component added when defining the uncertainty set of Group DRO (i.e., the simplex over groups), as described in Sec. 4. For a fair comparison, we apply the same optimization scheme (described in Sec. 4.4) to all DRO variants. As the supremum in (3) is attained at a vertex of the simplex $\Delta_{|\mathcal{Y}| \times |\mathcal{A}|}$, Group DRO minimizes the worst-case loss over the $(y, a)$ pairs, and thereby prevents the lowest accuracy across groups and classes from becoming too low. Thus, Group DRO substantially improves on ERM and achieves the best worst-case accuracy over groups (i.e., 1st → 2nd row). However, it shows the worst DCA among the DRO variants, confirming the mismatch between the criterion Group DRO uses and the group fairness metric. We observe that changing the simplex uncertainty set of Group DRO to the proposed $\chi^2$-divergence and quasi-prob uncertainty sets (4) (i.e., 2nd → 3rd row) mostly affects group fairness. In addition, reducing the size of the uncertainty sets via the classwise treatment (i.e., 3rd → 6th row) further improves group fairness. Finally, including negative values in the uncertainty set also helps to improve group fairness (i.e., 5th → 6th row) on most datasets, by more severely penalizing the major groups with high accuracy through negative weights.
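The statement that the Group DRO supremum is attained at a vertex of the simplex can be checked with a small experiment. The sketch below uses hypothetical per-(class, group) losses and illustrative function names: it compares the vertex value with the value at randomly sampled points of the simplex.

```python
import random

def weighted_loss(cell_losses, weights):
    """sum over (y, a) of q_{y,a} * loss_{y,a}."""
    return sum(weights[cell] * loss for cell, loss in cell_losses.items())

def group_dro_objective(cell_losses):
    """Supremum of the weighted loss over the probability simplex."""
    return max(cell_losses.values())

def random_simplex_point(cells, rng):
    # Normalized positive draws; the small offset avoids a zero total.
    raw = {cell: rng.random() + 1e-9 for cell in cells}
    total = sum(raw.values())
    return {cell: w / total for cell, w in raw.items()}

# Hypothetical losses for each (class y, group a) pair.
cell_losses = {(0, 0): 0.10, (0, 1): 0.30, (1, 0): 0.20, (1, 1): 0.45}

# The vertex that puts all mass on the worst (y, a) pair.
worst_cell = max(cell_losses, key=cell_losses.get)
vertex = {cell: 1.0 if cell == worst_cell else 0.0 for cell in cell_losses}

rng = random.Random(0)
samples = [
    weighted_loss(cell_losses, random_simplex_point(cell_losses, rng))
    for _ in range(1000)
]
print("vertex value:", weighted_loss(cell_losses, vertex))
print("largest sampled value:", max(samples))
```

Because the objective is linear in the weights, no interior point can exceed the vertex value, which is why Group DRO tracks only the single worst (y, a) pair rather than the between-group gaps that DCA measures.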



In this subsection, the term "group" is used in a more inclusive manner. Namely, the group here refers to any hierarchical structure in the data distribution, e.g., domain, environment, and sensitive attribute. The results of employing the original objective functions are reported in Appendix D.3. From the results, we observe that using the balanced ERM loss improves the balanced accuracy and DCA for most methods.



Figure 1: The trade-offs between accuracy and DCA on Adult (left) and COMPAS (right). The performances of "Scratch" are highlighted by the black dotted lines. Each point is the result corresponding to a different hyperparameter of each method. Each line is the convex envelope of the points of the corresponding method.

Figure 2: The trade-offs between accuracy and DCA on UTKFace (left) and CivilComments (right). Details are the same as Fig. 1.

Figure 3: Ablation study of FairDRO. RVP and FairDRO (w/o classwise) are for the ablation study of our two key components, the DRO formulation and the classwise treatment, respectively. The best performance of each result is in Tab. D.4.

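In (A.2), the first inequality holds by the Cauchy-Schwarz inequality. The second inequality follows from

$$\mathrm{Var}\big(\{\mathcal{L}(\theta, \mathcal{D}^y_a)\}_{a \in \mathcal{A}}\big) = \frac{1}{|\mathcal{A}|} \sum_{a \in \mathcal{A}} \Big( \mathcal{L}(\theta, \mathcal{D}^y_a) - \sum_{a' \in \mathcal{A}} \frac{1}{|\mathcal{A}|} \mathcal{L}(\theta, \mathcal{D}^y_{a'}) \Big)^2.$$

The equality in (A.2) is attained if and only if the vector $v^y$ satisfies: (i) $v^y$ and the $|\mathcal{A}|$-dimensional vector whose $a$-th element is $\mathcal{L}(\theta, \mathcal{D}^y_a) - \sum_{a' \in \mathcal{A}} \frac{1}{|\mathcal{A}|} \mathcal{L}(\theta, \mathcal{D}^y_{a'})$ point in the same direction; (ii) $\|v^y\|_2^2 = \frac{\rho}{|\mathcal{A}|}$. This implies that the $a$-th element of $v^y$ should be

$$v^y_a = \sqrt{\frac{\rho}{|\mathcal{A}|}} \times \frac{\mathcal{L}(\theta, \mathcal{D}^y_a) - \sum_{a' \in \mathcal{A}} \frac{1}{|\mathcal{A}|} \mathcal{L}(\theta, \mathcal{D}^y_{a'})}{\Big\| \Big( \mathcal{L}(\theta, \mathcal{D}^y_{b}) - \sum_{a' \in \mathcal{A}} \frac{1}{|\mathcal{A}|} \mathcal{L}(\theta, \mathcal{D}^y_{a'}) \Big)_{b \in \mathcal{A}} \Big\|_2}.$$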

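$$\sum_{a \in \mathcal{A}} \gamma^y_a = 0, \text{ and thus, } \gamma^y_i = -\sum_{a \in \mathcal{A} \setminus i} \gamma^y_a, \tag{A.3}$$

$$\sum_{a \in \mathcal{A}} (\gamma^y_a)^2 = 1, \text{ and thus, } (\gamma^y_i)^2 = 1 - \sum_{a \in \mathcal{A} \setminus i} (\gamma^y_a)^2. \tag{A.4}$$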

Figure D.1: Trade-offs between accuracy and DCA on CelebA. The "Scratch" performances are highlighted by the dotted lines.

Tab. D.3 and Fig. D.1 show a trend similar to that of UTKFace in Sec. 5.2. FairDRO again outperforms the baselines in terms of DCA and achieves a trade-off competitive with MFD. We also observe that the re-weighting based baselines show worse DCA than FairDRO, meaning that FairDRO finds better weights over groups for lower DCA.

D.3 RESULTS WITH THE STANDARD ERM LOSS

For a fair comparison with FairDRO, we used the balanced ERM loss for training each baseline, as mentioned in Sec. 5. Since all baselines were originally designed for the standard ERM loss, we further report the results for all methods trained with the standard ERM loss. Fig. D.2 shows the best performances and the accuracy-fairness trade-offs on the Adult, COMPAS, UTKFace, and CivilComments datasets.

Figure D.2: The trade-offs between accuracy and DCA. Each method is trained with the standard ERM. Details are the same as Fig. 1.

We then select the model with the best fairness criterion $\Delta_{\mathrm{DCA}}$ among those achieving at least 95% of the accuracy of the vanilla-trained model. With this selection rule, we report the best performance of each model throughout Appendix D. The HPs for each method were searched extensively, but fairly in terms of computation budgets. The search range of each hyperparameter is listed in Tab. C.1.
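The selection rule above can be written as a short filter-then-minimize step. The following is a minimal sketch with hypothetical numbers and field names (`hp`, `acc`, `dca`); it is not the evaluation script used for the reported tables.

```python
def select_model(candidates, vanilla_accuracy, ratio=0.95):
    """Pick the lowest-DCA candidate whose accuracy is at least ratio * vanilla accuracy.

    Returns None when no candidate meets the accuracy threshold.
    """
    threshold = ratio * vanilla_accuracy
    eligible = [c for c in candidates if c["acc"] >= threshold]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c["dca"])

# Hypothetical results of one method for three hyperparameter settings.
candidates = [
    {"hp": 0.1, "acc": 79.0, "dca": 9.0},
    {"hp": 1.0, "acc": 77.0, "dca": 4.0},
    {"hp": 10.0, "acc": 70.0, "dca": 1.0},  # fairest, but below the accuracy threshold
]
vanilla_accuracy = 80.0  # hypothetical accuracy of the vanilla-trained model

best = select_model(candidates, vanilla_accuracy)
print(best)
```

Here the setting with the lowest DCA is excluded because its accuracy falls below 95% of the vanilla accuracy, so the rule returns the middle setting.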

Table D.1: The best performances on the Adult and COMPAS datasets. The numbers in parentheses with ± are the standard deviations of each metric over independent runs with 4 different seeds. The target accuracy (higher is better) and DCA (1) (lower is better) are shown. We follow the model selection criterion proposed in Appendix C.4.

From Tab. D.1 and Tab. D.2, we observe that FairDRO achieves the best $\Delta_{\mathrm{DCA}}$ with moderate variances. For example, in Tab. D.1, while the best baselines achieve $\Delta_{\mathrm{DCA}}$ of 2.55 and 6.03 on Adult and COMPAS, respectively, FairDRO achieves 1.99 and 5.20 on the same datasets.

Table D.2: The best performances on UTKFace and CivilComments. The numbers in parentheses with ± are the standard deviations of each metric over independent runs with 4 different seeds. "N/A" denotes that the method is not trainable. All other details are the same as Tab. D.1.

1.46 (±0.84) FairDRO 87.95 (±1.61) 2.57 (±0.82)

Table D.3: The best performances on CelebA. Details are the same as Tab. D.1.

Table D.4: Ablation study of FairDRO. RVP (w/ classwise) and FairDRO (w/o classwise) are for the ablation study of our two key components, the DRO formulation and the classwise treatment, respectively. We follow the same model selection criterion introduced in Appendix C.4. Columns: Acc. (↑) and $\Delta_{\mathrm{DCA}}$ (↓) for each dataset.

ACKNOWLEDGMENTS

This work was supported in part by the NRF grant [NRF-2021R1A2C2007884] and IITP grants [No.2021-0-01343, No.2021-0-02068, No.2022-0-00959, No.2022-0-00113] funded by the Korean government, and SNU-NAVER Hyperscale AI Center.


Table D.5: Ablation study on the uncertainty set of FairDRO. We show the effect of each component of the proposed FairDRO: GDRO and the constraints on the uncertainty set, including the classwise uncertainty set, the $\chi^2$-divergence ball, and the quasi-probability. All numbers are averages over the Adult, COMPAS, UTKFace, and CivilComments datasets. GDRO is an abbreviation for Group DRO.

D.5 VISUALIZATION OF q 0

We continue from Sec. 5.4 and provide the visualization results including the weights (q y ), training accuracy, and test accuracy for y = 0. In 4 (right) shows that our smoothing technique has obvious effect on preventing the oscillation of q when using IBR. Furthermore, Tab. D.6 demonstrates that stable dynamics of q can significantly improve performance in terms of DCA.

D.7 ISSUE OF STANDARD ACCURACY

In the experiments, we reported the balanced accuracy, instead of the standard accuracy, to measure the performance of the models. While the balanced accuracy offers several advantages for measuring performance on imbalanced datasets, it also has some drawbacks. In general, it may underestimate the performance of groups with relatively many samples, which in most cases are the majority groups. Hence, the balanced accuracy metric may not capture an accuracy reduction in the majority group, which is also a concern in real-world applications.

However, we argue that reporting the standard accuracy would be problematic when the dataset is severely class-imbalanced. For example, Tab. D.7 shows the number of samples in each dataset with respect to the class and group labels. For the Adult dataset, 75.2% of the samples are labeled as $y = 0$; hence, a naive predictor that always predicts $\hat{Y} = 0$ achieves a standard accuracy of 75.2% and DCA = 0, which clearly overestimates the (accuracy, fairness) performance of the classifier. Therefore, to avoid this issue, we use the group-class balanced accuracy rather than the standard accuracy.
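The naive-predictor argument can be reproduced with a few lines of arithmetic. In the sketch below the per-(class, group) counts are hypothetical and chosen only so that 75.2% of the samples have y = 0 (the actual counts are in Tab. D.7). The group-class balanced accuracy is assumed to be the unweighted mean of the accuracies over all (y, a) cells, and DCA the average over classes of the largest between-group accuracy gap.

```python
def evaluate_constant_predictor(counts, predicted_class=0):
    """Standard accuracy, group-class balanced accuracy, and DCA of a constant predictor.

    counts maps (y, a) to the number of samples in that cell.
    """
    total = sum(counts.values())
    # A constant predictor is always right on its own class and always wrong elsewhere.
    cell_acc = {cell: 1.0 if cell[0] == predicted_class else 0.0 for cell in counts}

    standard = sum(counts[cell] * cell_acc[cell] for cell in counts) / total
    balanced = sum(cell_acc.values()) / len(cell_acc)

    classes = sorted({y for y, _ in counts})
    gaps = []
    for y in classes:
        accs = [acc for (cell_y, _), acc in cell_acc.items() if cell_y == y]
        gaps.append(max(accs) - min(accs))
    dca = sum(gaps) / len(gaps)
    return standard, balanced, dca

# Hypothetical counts with 752 of 1000 samples labeled y = 0.
counts = {(0, 0): 400, (0, 1): 352, (1, 0): 100, (1, 1): 148}
standard, balanced, dca = evaluate_constant_predictor(counts)
print("standard accuracy:", standard)
print("group-class balanced accuracy:", balanced)
print("DCA:", dca)
```

The constant predictor keeps its 75.2% standard accuracy and zero DCA, while its balanced accuracy drops to chance level, which is the behavior the balanced metric is meant to expose.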

