FAIRBATCH: BATCH SELECTION FOR MODEL FAIRNESS

Abstract

Training a fair machine learning model is essential to prevent demographic disparity. Existing techniques for improving model fairness require broad changes in either data preprocessing or model training, making them difficult to adopt in potentially already complex machine learning systems. We address this problem via the lens of bilevel optimization. While keeping the standard training algorithm as an inner optimizer, we incorporate an outer optimizer so as to equip the inner problem with an additional functionality: adaptively selecting minibatch sizes for the purpose of improving model fairness. Our batch selection algorithm, which we call FairBatch, implements this optimization and supports prominent fairness measures: equal opportunity, equalized odds, and demographic parity. FairBatch comes with a significant implementation benefit: it does not require any modification to data preprocessing or model training. For instance, a single-line change of PyTorch code that replaces the batch selection part of model training suffices to employ FairBatch. Our experiments conducted on both synthetic and benchmark real data demonstrate that FairBatch can provide such functionalities while achieving comparable (or even greater) performance relative to the state of the art. Furthermore, FairBatch can readily improve the fairness of any pre-trained model simply via fine-tuning. It is also compatible with existing batch selection techniques intended for different purposes, such as faster convergence, thus gracefully achieving multiple purposes.

1. INTRODUCTION

Model fairness is becoming essential in a wide variety of machine learning applications. Fairness issues often arise in sensitive applications like healthcare and finance, where a trained model must not discriminate among individuals based on age, gender, or race. While many fairness techniques have recently been proposed, they require a range of changes in either data generation or algorithmic design. There are two popular fairness approaches: (i) pre-processing, where training data is debiased (Choi et al., 2020) or re-weighted (Jiang and Nachum, 2020), and (ii) in-processing, in which a model of interest is retrained via fairness techniques such as fairness objectives (Zafar et al., 2017a;b), adversarial training (Zhang et al., 2018), or boosting (Iosifidis and Ntoutsi, 2019); see Sec. 5 for an in-depth discussion of related work. However, these approaches may require nontrivial re-configurations in modern machine learning systems, which often consist of many complex components. In an effort to enable easier-to-reconfigure implementation of fair machine learning, we address the problem via the lens of bilevel optimization, where one problem is embedded within another. While keeping the standard training algorithm as the inner optimizer, we design an outer optimizer that equips the inner problem with the added functionality of improving fairness through batch selection. Our main contribution is a batch selection algorithm (called FairBatch) that implements this optimization by adjusting the batch sizes w.r.t. sensitive groups based on the fairness measure of an intermediate model, measured in the current epoch. For example, consider the task of predicting whether individual criminals re-offend in the future subject to satisfying equalized odds (Hardt et al., 2016), where the model accuracies must be the same across sensitive groups.
If the model is less accurate for a certain group, FairBatch increases the batch-size ratio of that group in the next batch; see Sec. 3 for a detailed description of the adjusting mechanism. Fig. 1a shows FairBatch's behavior when running on the ProPublica COMPAS dataset (Angwin et al., 2016). For equalized odds, our framework (to be described in Sec. 2) introduces two reweighting parameters $(\lambda_1, \lambda_2)$ for the purpose of adjusting the batch-size ratios of two sensitive groups (in this experiment, men and women). After a few epochs, FairBatch indeed achieves equalized odds, i.e., the accuracy disparity between sensitive groups conditioned on the true label (denoted as "ED disparity") is minimized. FairBatch also supports other prominent group fairness measures: equal opportunity (Hardt et al., 2016) and demographic parity (Feldman et al., 2015). A key feature of FairBatch is its usability and simplicity. It only requires a slight modification in the batch selection part of model training, as demonstrated in Fig. 1b, and does not require any other changes in data preprocessing or model training. Experiments conducted on both synthetic and benchmark real datasets (ProPublica COMPAS (Angwin et al., 2016), AdultCensus (Kohavi, 1996), and UTKFace (Zhang et al., 2017)) show that FairBatch exhibits greater (or at least comparable) performance relative to the state of the art (spanning both pre-processing (Kamiran and Calders, 2011; Jiang and Nachum, 2020) and in-processing (Zafar et al., 2017a;b; Zhang et al., 2018; Iosifidis and Ntoutsi, 2019) techniques) w.r.t. all aspects in consideration: accuracy, fairness, and runtime. In addition, FairBatch can improve the fairness of any pre-trained model via fine-tuning. For example, Sec. 4.2 shows how FairBatch reduces the ED disparities of ResNet18 (He et al., 2016) and GoogLeNet (Szegedy et al., 2015) pre-trained models.
Finally, FairBatch can be gracefully merged with other batch selection techniques typically used for faster convergence, thereby improving fairness faster as well.

Notation Let $w$ be the parameter of a classifier of interest. Let $x \in \mathcal{X}$ be an input feature to the classifier, and let $\hat{y} \in \mathcal{Y}$ be the predicted class. Note that $\hat{y}$ is a function of $(x, w)$. We consider group fairness, which intends to ensure fairness across distinct sensitive groups (e.g., men versus women). Let $z \in \mathcal{Z}$ be a sensitive attribute (e.g., gender). Consider the 0/1 loss $\ell(y, \hat{y}) = \mathbb{1}(y \neq \hat{y})$, and let $m$ be the total number of train samples. Let $L_{y,z}(w)$ be the empirical risk aggregated over the samples with label $y$ and sensitive attribute $z$: $L_{y,z}(w) := \frac{1}{m_{y,z}} \sum_{i: y_i = y, z_i = z} \ell(y_i, \hat{y}_i)$, where $m_{y,z} := |\{i : y_i = y, z_i = z\}|$. Similarly, we define $L_{y,\star}(w) := \frac{1}{m_{y,\star}} \sum_{i: y_i = y} \ell(y_i, \hat{y}_i)$ and $L_{\star,z}(w) := \frac{1}{m_{\star,z}} \sum_{i: z_i = z} \ell(y_i, \hat{y}_i)$, where $m_{y,\star} := |\{i : y_i = y\}|$ and $m_{\star,z} := |\{i : z_i = z\}|$. The overall empirical risk is written as $L(w) = \frac{1}{m} \sum_i \ell(y_i, \hat{y}_i)$. We use $\nabla$ for the gradient and $\partial$ for the subdifferential.
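The group-conditional risks in the Notation paragraph can be computed directly from labels, predictions, and sensitive attributes. The following is a minimal NumPy sketch under the 0/1 loss; the function name `group_risks` and the toy arrays are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def group_risks(y, y_hat, z):
    """Return the overall 0/1 risk L(w) and a dict {(y, z): L_{y,z}(w)}.

    Hypothetical helper: the predictions y_hat are assumed to come from
    some already-trained classifier.
    """
    y, y_hat, z = map(np.asarray, (y, y_hat, z))
    loss = (y != y_hat).astype(float)          # 0/1 loss per example
    risks = {}
    for yv in np.unique(y):
        for zv in np.unique(z):
            mask = (y == yv) & (z == zv)       # samples with label yv, group zv
            if mask.any():                     # skip empty (y, z) cells
                risks[(int(yv), int(zv))] = float(loss[mask].mean())
    return float(loss.mean()), risks

# Toy data (made up for illustration)
y     = np.array([1, 1, 1, 1, 0, 0])
y_hat = np.array([1, 0, 1, 1, 0, 1])
z     = np.array([0, 0, 1, 1, 0, 1])
overall, risks = group_risks(y, y_hat, z)

# Equal opportunity disparity |L_{1,0} - L_{1,1}| on the toy data
eo_disparity = abs(risks[(1, 0)] - risks[(1, 1)])
```

On this toy input the positive-label risk is higher for group z = 0 than for z = 1, which is the kind of gap the equal opportunity criterion penalizes.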

2. BILEVEL OPTIMIZATION FOR FAIRNESS

In order to systematically design an adaptive batch selection algorithm, we formalize an implicit connection between adaptive batch selection and bilevel optimization. Bilevel optimization consists of an outer optimization problem and an inner optimization problem. The inner optimizer solves an inner optimization problem, and the outer optimizer solves an outer optimization problem based on the outcome of the inner optimization. By viewing minibatch SGD (Bottou, 2010) as an inner optimizer and viewing the batch selection algorithm as an outer optimizer, the process of training a fair classifier can be seen as a process of solving a bilevel optimization problem.

Batch selection + minibatch SGD = bilevel optimization solver Consider a scenario where one is minimizing the overall empirical risk $L(w)$ via minibatch SGD. The minibatch SGD algorithm picks $b$ of the $m$ indices uniformly at random, say $j_1, j_2, \ldots, j_b$, and updates its iterate with $\frac{1}{b} \sum_{i=1}^{b} \nabla \ell(y_{j_i}, \hat{y}_{j_i})$, called a batch gradient. Note that a batch gradient is an unbiased estimate of the true gradient $\nabla L(w)$. Since the empirical risk minimization (ERM) formulation does not take a fairness criterion into account, its minimizer usually does not satisfy the desired fairness criterion. To address this limitation of ERM, we adjust the way minibatches are drawn so that the desired fairness criterion is satisfied. For instance, as described in the introduction, we can draw minibatches with a larger number of train samples from a certain sensitive group so as to achieve a higher accuracy w.r.t. that group. Once the minibatch distribution deviates from the uniform distribution, the batch gradient is no longer an unbiased estimate of the gradient of the overall empirical risk. Instead, it is an unbiased estimate of the gradient of a reweighted empirical risk. In other words, if we draw train example $i$ with probability $p_i$ for all $i$ such that $\sum_i p_i = 1$, the batch gradient is an unbiased estimate of the gradient of $L'(w) = \sum_i p_i \ell(y_i, \hat{y}_i)$. This observation enables the following bilevel optimization-based interpretation of how batch selection interacts with the inner optimization algorithm.
At initialization, minibatch SGD optimizes the (unweighted) empirical risk. Based on the outcome of the inner optimization, the outer optimizer refines $p := (p_1, p_2, \ldots, p_m)$, the sampling probability of each train example. The inner optimizer now takes minibatches drawn from the new distribution and reoptimizes the inner objective function. Due to the new minibatch distribution, the inner objective now becomes a reweighted empirical risk w.r.t. $p$. This procedure is repeated until convergence. See Algorithm 1 for pseudocode. Therefore, a batch selection algorithm together with an inner optimization algorithm can be viewed as a pair of outer optimizer and inner optimizer for the following bilevel optimization problem:

$$\min_{p} \mathrm{Cost}(w_p), \quad w_p = \arg\min_{w} L'(w),$$

where $\mathrm{Cost}(\cdot)$ captures the goal of the optimization. Two questions arise. First, how can we design the cost function to capture a desired fairness criterion? Second, how can we design an update rule for the outer optimizer, ideally with a provable guarantee? In the rest of this section, we show how one can design proper cost functions to capture various fairness criteria. In Sec. 3, we will develop an efficient update rule of FairBatch.

Equal opportunity For illustrative purposes, assume for now the binary setting ($\mathcal{Y} = \mathcal{Z} = \{0, 1\}$). A model satisfies equal opportunity (Hardt et al., 2016) if we have equal positive prediction rates conditioned on $y = 1$, i.e., $L_{1,0}(w) = L_{1,1}(w)$. Since the ERM formulation does not take the fairness criterion into account, these two quantities differ in general. To mitigate this, we adjust the sampling probability between $L_{1,0}(w)$ and $L_{1,1}(w)$. More specifically, we propose the following procedure to draw a sample. First, we randomly pick which subset of the data to sample from. We pick the set $\{y = 1, z = 0\}$ with probability $\lambda$, the set $\{y = 1, z = 1\}$ with probability $\frac{m_{1,\star}}{m} - \lambda$, and the set $\{y = 0\}$ with probability $\frac{m_{0,\star}}{m}$.
We then pick a sample from the chosen set, uniformly at random. This leaves us with a single-dimensional outer optimization variable $\lambda$, which controls the sampling bias between data with $y = 1, z = 0$ and data with $y = 1, z = 1$. Also, we design the cost function as $|L_{1,0}(w_\lambda) - L_{1,1}(w_\lambda)|$ to capture the equal opportunity criterion. Thus, we have the following bilevel optimization problem:

$$\min_{\lambda \in [0, \frac{m_{1,\star}}{m}]} |L_{1,0}(w_\lambda) - L_{1,1}(w_\lambda)|, \quad w_\lambda = \arg\min_{w} \lambda L_{1,0}(w) + \left(\frac{m_{1,\star}}{m} - \lambda\right) L_{1,1}(w) + \frac{m_{0,\star}}{m} L_{0,\star}(w).$$

Equalized odds Similarly, we can design a bilevel optimization problem to capture equalized odds (Hardt et al., 2016), which requires the prediction to be independent of the sensitive attribute conditional on the true label, i.e., $L_{0,0}(w) = L_{0,1}(w)$ and $L_{1,0}(w) = L_{1,1}(w)$. Again, the empirical risk minimizer does not satisfy these two conditions in general. To mitigate these disparities, we adjust (i) the sampling probability between $L_{0,0}(w)$ and $L_{0,1}(w)$ and (ii) the sampling probability between $L_{1,0}(w)$ and $L_{1,1}(w)$. To achieve this, we use the following procedure to draw a sample. First, we pick the set $\{y = 0, z = 0\}$ with probability $\lambda_1$, the set $\{y = 0, z = 1\}$ with probability $\frac{m_{0,\star}}{m} - \lambda_1$, the set $\{y = 1, z = 0\}$ with probability $\lambda_2$, and the set $\{y = 1, z = 1\}$ with probability $\frac{m_{1,\star}}{m} - \lambda_2$. We then pick one data point at random from the chosen set. This leaves us with a two-dimensional outer optimization variable $\lambda = (\lambda_1, \lambda_2)$. To capture the equalized odds criterion, we design the outer objective function as $\max\{|L_{0,0}(w) - L_{0,1}(w)|, |L_{1,0}(w) - L_{1,1}(w)|\}$. This gives us the following bilevel optimization problem:

$$\min_{\lambda \in [0, \frac{m_{0,\star}}{m}] \times [0, \frac{m_{1,\star}}{m}]} \max\{|L_{0,0}(w_\lambda) - L_{0,1}(w_\lambda)|, |L_{1,0}(w_\lambda) - L_{1,1}(w_\lambda)|\},$$
$$w_\lambda = \arg\min_{w} \lambda_1 L_{0,0}(w) + \left(\frac{m_{0,\star}}{m} - \lambda_1\right) L_{0,1}(w) + \lambda_2 L_{1,0}(w) + \left(\frac{m_{1,\star}}{m} - \lambda_2\right) L_{1,1}(w).$$

Demographic parity Demographic parity (Feldman et al., 2015) is satisfied if two sensitive groups have equal positive prediction rates.
If the $m_{y,z}$'s are all equal, then $L_{0,0}(w) = L_{1,0}(w)$ and $L_{0,1}(w) = L_{1,1}(w)$ can serve as a sufficient condition for demographic parity; see Sec. A.1 for why this holds and for how to handle demographic parity when this condition does not hold. To satisfy this sufficient condition, we now adjust (i) the sampling probability between $L_{0,0}(w)$ and $L_{1,0}(w)$ and (ii) the sampling probability between $L_{0,1}(w)$ and $L_{1,1}(w)$. This gives us the following bilevel optimization problem:

$$\min_{\lambda \in [0, \frac{m_{\star,0}}{m}] \times [0, \frac{m_{\star,1}}{m}]} \max\{|L_{0,0}(w_\lambda) - L_{1,0}(w_\lambda)|, |L_{0,1}(w_\lambda) - L_{1,1}(w_\lambda)|\},$$
$$w_\lambda = \arg\min_{w} \lambda_1 L_{0,0}(w) + \left(\frac{m_{\star,0}}{m} - \lambda_1\right) L_{1,0}(w) + \lambda_2 L_{0,1}(w) + \left(\frac{m_{\star,1}}{m} - \lambda_2\right) L_{1,1}(w).$$

Beyond binary labels/sensitive attributes While the previous examples assumed binary-valued labels and sensitive attributes, our framework is applicable to cases where the alphabet sizes are beyond binary. As an example, consider the equal opportunity criterion when $\mathcal{Z} = \{0, 1, \ldots, n_z - 1\}$. The condition reads $L_{1,0}(w) = L_{1,1}(w) = \cdots = L_{1,n_z-1}(w)$. To satisfy this condition, we adjust the sampling probability between the $L_{1,j}(w)$'s by introducing a $\binom{n_z}{2}$-dimensional outer optimization variable $\lambda$, and design the outer objective function as $\max_{j_1, j_2 \in \mathcal{Z}} |L_{1,j_1}(w) - L_{1,j_2}(w)|$. In our implementation, however, we only use $(n_z - 1)$ disparity objectives as an approximation (i.e., $\max_{j_1 \in \{0, 1, \ldots, n_z - 2\}} |L_{1,j_1}(w) - L_{1,j_1+1}(w)|$) for better efficiency. Suppose the level of disparity is $\epsilon$ when FairBatch compares all possible pairs of sensitive groups. Now suppose we only optimize the $(n_z - 1)$ sequential disparity objectives. Then we can no longer ensure that other objectives like $|L_{1,3}(w) - L_{1,1}(w)|$ are within $\epsilon$. In the worst case, the objective $|L_{1,n_z-1}(w) - L_{1,0}(w)|$ may be as large as $(n_z - 1)\epsilon$, as we only guarantee that each $|L_{1,j_1}(w) - L_{1,j_1+1}(w)| \leq \epsilon$. If $\epsilon$ is small enough, the disparity of our approximation becomes reasonable as well.
One can also handle other fairness criteria in a similar way.
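The two-stage sampling procedure for equalized odds (first pick a (y, z) group with the λ-dependent probabilities, then pick a sample uniformly within it) can be sketched as follows. This is a minimal NumPy illustration, not the authors' PyTorch sampler; the names `equalized_odds_probs` and `sample_batch` are hypothetical, and the sketch assumes binary labels and attributes with every (y, z) group that has positive probability being non-empty.

```python
import numpy as np

def equalized_odds_probs(y, lam):
    """Sampling probabilities for the groups (y,z) = (0,0), (0,1), (1,0), (1,1)."""
    y = np.asarray(y)
    c0, c1 = np.mean(y == 0), np.mean(y == 1)      # m_{0,*}/m and m_{1,*}/m
    lam1, lam2 = lam
    probs = np.array([lam1, c0 - lam1, lam2, c1 - lam2], dtype=float)
    if np.any(probs < 0):
        raise ValueError("lambda lies outside the feasible box")
    return probs / probs.sum()                     # guard against rounding drift

def sample_batch(y, z, lam, batch_size, rng):
    """Draw batch indices: choose a group first, then a uniform member of it."""
    y, z = np.asarray(y), np.asarray(z)
    groups = [(0, 0), (0, 1), (1, 0), (1, 1)]
    members = [np.flatnonzero((y == yv) & (z == zv)) for yv, zv in groups]
    picks = rng.choice(len(groups), size=batch_size, p=equalized_odds_probs(y, lam))
    return np.array([rng.choice(members[g]) for g in picks])

# Toy usage (data made up for illustration)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
z = np.array([0, 0, 1, 1, 0, 1, 1, 1])
rng = np.random.default_rng(0)
idx = sample_batch(y, z, lam=(0.5, 0.0), batch_size=16, rng=rng)
```

With λ = (0.5, 0.0) on this toy data, the groups (0, 1) and (1, 0) receive probability zero, so every drawn index belongs to (0, 0) or (1, 1). Setting λ = (m_{0,0}/m, m_{1,0}/m) instead recovers sampling that matches the empirical group proportions.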

3. UPDATE RULE OF FAIRBATCH

We design efficient update rules of FairBatch for different numbers of disparities. Let us define d as the dimension of the outer optimization variable λ, which is the same as the total number of disparities. We first analyze the simplest case where d = 1. We show that a simple gradient descent algorithm can provably solve the outer optimization problem. The equal opportunity example in the previous section falls in this category. We then extend the algorithm developed for the one-dimensional case to the multi-dimensional (d > 1) case. Equalized odds and demographic parity fall in this category.

3.1. UPDATE RULE FOR d = 1

When $d = 1$, the general form of our bilevel optimization problem can be written as follows:

$$\min_{\lambda \in [0, c_1]} |f_1(w_\lambda) - g_1(w_\lambda)|, \quad w_\lambda = \arg\min_{w} \lambda f_1(w) + (c_1 - \lambda) g_1(w) + h(w),$$

where $c_1 > 0$ is a constant. Let $F(\lambda) = |f_1(w_\lambda) - g_1(w_\lambda)|$. The following lemma shows that $F(\lambda)$ is quasiconvex in $\lambda$ under some mild conditions, and that its signed gradient can be efficiently computed.

Lemma 1 (Quasi-convexity of $F(\lambda)$). For $d = 1$, if $f_1(\cdot)$, $g_1(\cdot)$, and $h(\cdot)$ satisfy (1) $h(w) = 0$, or (2) $f_1(\cdot)$, $g_1(\cdot)$, and $h(\cdot)$ are twice differentiable and $\lambda \nabla^2 f_1(w_\lambda) + (c_1 - \lambda) \nabla^2 g_1(w_\lambda) + \nabla^2 h(w_\lambda) \succ 0$ for every $\lambda \in [0, c_1]$, then $F(\lambda)$ is quasi-convex, i.e., $F(t\lambda + (1 - t)\lambda') \leq \max\{F(\lambda), F(\lambda')\}$ for all $t \in [0, 1]$ and $\lambda, \lambda'$. Also, if $F(\lambda) \neq 0$, then $\partial_\lambda F(\lambda) = \{v\}$ and $\operatorname{sign}(v) = \operatorname{sign}(g_1(w_\lambda) - f_1(w_\lambda))$.

Remark 1. The quasiconvexity of $F(\lambda)$ is valid when at least one of the conditions in Lemma 1 holds. For the second condition, if $f_1(\cdot)$, $g_1(\cdot)$, and $h(\cdot)$ are convex, this condition will hold unless all three functions share their stationary points, which is very unlikely. While there is no theoretical guarantee for non-convex settings, FairBatch still shows on-par or better results than the other fairness approaches in general settings where the functions may not be convex (see Sec. 4).

The proof of Lemma 1 can be found in Sec. A.2. Note that quasiconvexity immediately implies a unique minimum (Boyd et al., 2004). Thus, we design the following signed gradient-based optimization algorithm:

$$\forall t \in \{0, 1, \ldots\}: \quad \lambda^{(t+1)} = \lambda^{(t)} - \alpha \cdot \operatorname{sign}(g_1(w_\lambda) - f_1(w_\lambda)).$$

This algorithm increases $\lambda$ by $\alpha$ if $f_1(w_\lambda) > g_1(w_\lambda)$ and decreases $\lambda$ by $\alpha$ if $f_1(w_\lambda) < g_1(w_\lambda)$. This is consistent with our intuition: it increases the sampling probability of a disadvantaged group (the one with the higher risk) and decreases that of an advantaged group. The following proposition shows that the proposed algorithm converges to the optimal solution.

Proposition 1. Let $\lambda^* = \arg\min_\lambda F(\lambda)$ and $t \in \mathbb{Z}_{\geq 0}$.
Then, $|\lambda^{(t)} - \lambda^*| \leq \max\{|\lambda^{(0)} - \lambda^*| - t\alpha, \alpha\}$.

Remark 2. $F(\lambda)$ is not necessarily convex even when we assume the inner objective functions $f_1(\cdot)$ and $g_1(\cdot)$ are convex or even strongly convex. See Sec. A.3 for a counterexample.
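To make the signed update concrete, the sketch below runs it on a toy inner problem that can be solved in closed form. Everything specific here is an assumption chosen for illustration: the quadratics f(w) = (w - 1)^2 and g(w) = (w + 1)^2, h = 0, c_1 = 1, the helper names, and the clipping of λ to [0, c_1] (the constraint set in the text, although the projection step itself is not spelled out there).

```python
def signed_step(lam, f_val, g_val, alpha, c):
    """One update lam <- lam - alpha * sign(g - f), clipped to [0, c]."""
    sign = (g_val > f_val) - (g_val < f_val)      # sign(g - f) in {-1, 0, 1}
    return min(max(lam - alpha * sign, 0.0), c)

def inner_solution(lam):
    """Closed-form argmin_w of lam*(w-1)^2 + (1-lam)*(w+1)^2 (toy inner problem)."""
    return 2.0 * lam - 1.0

lam, alpha = 0.1, 0.05
for _ in range(50):
    w = inner_solution(lam)                       # stands in for inner training
    f_val, g_val = (w - 1.0) ** 2, (w + 1.0) ** 2
    lam = signed_step(lam, f_val, g_val, alpha, c=1.0)

# For this toy problem F(lam) = 4*|2*lam - 1|, so the minimizer is lam* = 0.5
gap_to_optimum = abs(lam - 0.5)
```

In this toy run λ first climbs toward 0.5 because f exceeds g (the λ-weighted group is worse off), then stays within one step size α of the optimum, in line with the bound in Proposition 1.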

3.2. UPDATE RULE FOR d ≥ 1

We now develop an efficient update algorithm for the following general bilevel optimization:

$$\min_{\lambda \in \Lambda} \max_{i = 1, \ldots, d} |f_i(w_\lambda) - g_i(w_\lambda)|, \quad w_\lambda = \arg\min_{w} \sum_{i=1}^{d} [\lambda_i f_i(w) + (c_i - \lambda_i) g_i(w)] + h(w).$$

Here, $\Lambda = [0, c_1] \times [0, c_2] \times \cdots \times [0, c_d]$, where the $c_i$'s are some positive constants. Denoting by $F(\lambda)$ the outer objective function, let us first derive its gradient. Under some mild conditions (see Sec. A.4) on the $f_i(\cdot)$'s, $g_i(\cdot)$'s, and $h(\cdot)$:

$$\gamma_i := \operatorname{sign}(g_{i^*}(w) - f_{i^*}(w)) \, (\nabla f_{i^*}(w) - \nabla g_{i^*}(w))^\top H_\lambda^{-1} (\nabla f_i(w) - \nabla g_i(w)) \in \partial_{\lambda_i} F(\lambda), \quad \forall i,$$

where $i^* = \arg\max_i |f_i(w) - g_i(w)|$, and $H_\lambda$ is positive definite. See Sec. A.4 for the derivation. Since the subdifferential is always a convex set, it follows that $\gamma := (\gamma_1, \gamma_2, \ldots, \gamma_d) \in \partial_\lambda F(\lambda)$. Computing the subgradient $\gamma$ requires us to compute $H_\lambda$, which involves the Hessian matrices of the inner objective function. To avoid this expensive computation, we approximate $\gamma \approx (0, 0, \ldots, \gamma_{i^*}, \ldots, 0)$. See Sec. A.5 for the rationale and intuition behind this approximation. Then, similar to the case of $d = 1$, we have $\operatorname{sign}(\gamma) = (0, 0, \ldots, \operatorname{sign}(g_{i^*}(w_\lambda) - f_{i^*}(w_\lambda)), 0, \ldots, 0)$. This gives us the general update rule of FairBatch (see Sec. A.6 for pseudocode):

$$\forall t \in \{0, 1, \ldots\}: \quad \lambda_{i^*}^{(t+1)} = \lambda_{i^*}^{(t)} - \alpha \cdot \operatorname{sign}(g_{i^*}(w_\lambda) - f_{i^*}(w_\lambda)), \quad \lambda_i^{(t+1)} = \lambda_i^{(t)}, \ \forall i \neq i^*.$$
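The general update rule touches only the coordinate with the largest disparity. A minimal NumPy sketch is given below; the function name `fairbatch_step` and the toy numbers are illustrative, and clipping λ to the box Λ is an assumption made here to keep the iterate feasible (the paper's own pseudocode is in Sec. A.6 and is not reproduced in this section).

```python
import numpy as np

def fairbatch_step(lam, f_vals, g_vals, alpha, c):
    """Update only the coordinate i* with the largest |f_i - g_i|.

    lam, f_vals, g_vals, c are length-d sequences; returns the new lambda.
    """
    lam = np.array(lam, dtype=float)               # copy so the input is untouched
    gaps = np.asarray(g_vals, dtype=float) - np.asarray(f_vals, dtype=float)
    i_star = int(np.argmax(np.abs(gaps)))          # most violated disparity
    lam[i_star] -= alpha * np.sign(gaps[i_star])   # signed step on that coordinate
    return np.clip(lam, 0.0, np.asarray(c, dtype=float))

# Toy usage with d = 2 (e.g., equalized odds); the risk values are made up
lam_new = fairbatch_step(
    lam=[0.2, 0.3], f_vals=[0.4, 0.1], g_vals=[0.1, 0.2], alpha=0.01, c=[0.5, 0.5]
)
```

In the toy call the first disparity (0.3) dominates the second (0.1), and since f_1 exceeds g_1 the first coordinate grows by α while the second stays fixed.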

4. EXPERIMENTS

We use logistic regression in all experiments except for Sec. 4.2, where we fine-tune ResNet18 (He et al., 2016) and GoogLeNet (Szegedy et al., 2015) in order to demonstrate FairBatch's ability to improve the fairness of pre-trained models. We evaluate all models on separate test sets and repeat all experiments with 10 different random seeds. We use PyTorch, and our experiments are performed on a server with Intel i7-6850 CPUs and NVIDIA TITAN Xp GPUs. See Sec. B.1 for more details.

Measuring Fairness Here we first focus on the equal opportunity (EO) and demographic parity (DP) measures in Sec.

Datasets We generate a synthetic dataset of 3,000 examples with two non-sensitive attributes $(x_1, x_2)$, a binary sensitive attribute $z$, and a binary label $y$, using a method similar to the one in (Zafar et al., 2017a). A tuple $(x_1, x_2, y)$ is randomly generated based on two Gaussian distributions: $(x_1, x_2) | y = 0 \sim \mathcal{N}([-2; -2], [10, 1; 1, 3])$ and $(x_1, x_2) | y = 1 \sim \mathcal{N}([2; 2], [5, 1; 1, 5])$. For $z$, we generate biased data using the unfair scenario $\Pr(z = 1) = \Pr((x'_1, x'_2) | y = 1) / [\Pr((x'_1, x'_2) | y = 0) + \Pr((x'_1, x'_2) | y = 1)]$, where $(x'_1, x'_2) = (x_1 \cos(\pi/4) - x_2 \sin(\pi/4), x_1 \sin(\pi/4) + x_2 \cos(\pi/4))$. We use the real benchmark datasets ProPublica COMPAS (Angwin et al., 2016) and AdultCensus (Kohavi, 1996), with 5,278 and 43,131 examples, respectively. We use the same pre-processing as in IBM's AI Fairness 360 (Bellamy et al., 2019) and use GENDER as the sensitive attribute. We also employ the UTKFace dataset (Zhang et al., 2017) with 23,708 images to demonstrate the fine-tuning ability of FairBatch in Sec. 4.2.

Baselines We employ three types of baselines: (1) non-fair training with logistic regression (LR); (2) fair training via pre-processing; and (3) fair training via in-processing.
For pre-processing methods, we first consider a simple approach that we call Cutting, which evens the data sizes of sensitive groups via saturating them to the smallest-group data size. One can think of a similar alternative approach: Boosting all of the smaller-group data sizes to the largest one, but we do not report herein due to similar performances that we found relative to Cutting. The other two are the state of the arts: reweighing (Kamiran and Calders, 2011) (RW) and Label Bias Correction (Jiang and Nachum, 2020) (LBC). RW intends to balance importance levels across sensitive groups via example weighting, but sticks with these weights throughout the entire model training, unlike FairBatch. LBC iteratively trains an entire model with example weighting towards an unbiased data distribution. For in-processing methods, we compare with the following three: Fairness Constraints (Zafar et al., 2017a;b) (FC), Adversarial Debiasing (Zhang et al., 2018) (AD), and AdaFair (Iosifidis and Ntoutsi, 2019) . FC incorporates a regularization term in an effort to reduce the disparities among sensitive groups. AD is an adversarial learning approach that intends to maximize the independence between the predicted labels and sensitive attributes. In our experiments, a slight modification is made to AD for improving training stability: Not employing one regularization term used for restricting the training direction. AdaFair is an ensemble technique that equips the prominent AdaBoost (Friedman et al., 2000) with a fairness aspect. Here the examples that lead to unfair and inaccurate performances are considered to be the difficult instances. In our experiments, natural generalization of AdaFair intended for ED is made to encompass EO and DP; see Sec. B.3 for the generalization. While AdaFair bears spiritual similarity to FairBatch in a sense that mistreated examples are weighted progressively, it comes with a significant distinction in update scale. 
It is basically a boosting technique; hence such updates are done in distinct predictors through different rounds; see Sec. 5 for details.

FairBatch Settings To set $\alpha$, we start from a candidate set of values within the range [0.0001, 0.05] and use cross-validation on the training set to choose the value that results in the highest accuracy with low fairness violation. The default batch sizes are: 100 (synthetic); 200 (COMPAS); 1,000 (AdultCensus); and 32 (UTKFace).

Table 1: Performance on the synthetic, COMPAS, and AdultCensus test sets w.r.t. equal opportunity (EO). We compare FairBatch with three types of baselines: (1) non-fair method: LR; (2) fair training via pre-processing: Cutting, RW (Kamiran and Calders, 2011), and LBC (Jiang and Nachum, 2020); (3) fair training via in-processing: FC (Zafar et al., 2017b), AD (Zhang et al., 2018), and AdaFair (Iosifidis and Ntoutsi, 2019). Experiments are repeated 10 times.

Table 1 compares FairBatch against the other approaches on the synthetic, COMPAS, and AdultCensus test sets w.r.t. accuracy, EO disparity, and complexity (reflected in the number of epochs). In Sec. B.4, we also present the convergence plot of EO disparity as a function of the number of epochs. LR in row 1 is logistic regression without any fairness technique. The pre-processing techniques in rows 2-4 reduce EO disparity while sacrificing accuracy. The in-processing techniques in rows 5-7 further reduce EO disparity, yet still sacrifice accuracy. FairBatch, presented in the last row, offers comparable (or even greater) fairness performance while sacrificing less accuracy. We also present accuracy and fairness trade-off curves of FairBatch in Sec. B.5. One key implementation benefit is reflected in the small numbers of epochs. We also obtain consistent wall clock times, presented in Sec. B.6.
As mentioned earlier, AdaFair is the most similar in spirit to FairBatch, as it adjusts example weights based on the fairness performance of prior models. We demonstrate in Sec. B.7 that FairBatch and AdaFair indeed show similar convergence behaviors, yet at different scales (rounds for AdaFair vs. epochs for FairBatch). One distinctive feature of FairBatch relative to AdaFair is the use of a single model training, which makes it much faster (22.5-96x). We also make similar comparisons w.r.t. another fairness measure, DP disparity; see Table 2. Recall that minimizing DP disparity involves adjusting two hyperparameters $(\lambda_1, \lambda_2)$, which also means that $d = 2$. Although FairBatch's theoretical guarantees hold only when using one hyperparameter (i.e., $d = 1$), we nonetheless see similar results, where FairBatch is on par with or better than the other approaches while being the most robust in all aspects.

While Tables 1 and 2 already demonstrate FairBatch's performance against the state of the art, in this section we emphasize the usability of FairBatch by showing how it can improve the fairness of any pre-trained unfair model via fine-tuning, and we only compare it with Cutting, which is also easy to adopt. Table 3 shows how FairBatch improves the fairness of pre-trained models (ResNet18 (He et al., 2016) and GoogLeNet (Szegedy et al., 2015)) on the UTKFace dataset (Zhang et al., 2017). Each image has three types of attributes: GENDER, RACE, and AGE. We use RACE as the sensitive attribute and consider two scenarios where the label attribute is GENDER or AGE. While GENDER is binary, AGE is multi-valued (<21, 21-40, 41-60, and >60), so we extend FairBatch in a straightforward fashion; see Sec. B.8 for details. Both Cutting and FairBatch reduce the ED disparities of the original pre-trained models. However, only FairBatch does so without sacrificing accuracy.

4.3. COMPATIBILITY WITH OTHER BATCH SELECTION TECHNIQUES

We demonstrate another key aspect of FairBatch: Compatibility with existing batch selection approaches that use importance sampling for faster convergence in training. The key functionality of the prior batch selection techniques is that examples considered to be "important" are given higher weights so as to be sampled more frequently. FairBatch can easily be tuned to accommodate such functionality: determining the batch-ratios of sensitive groups and then sampling using the importance weights per group. We evaluate FairBatch combined with one prominent technique, loss-based weighting (Loshchilov and Hutter, 2016) , on our synthetic dataset using EO and DP. We find that FairBatch indeed converges more quickly. It uses about 50 fewer epochs with similar fairness performances; see Sec. B.9 for the EO and DP convergence plots.
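The combination described above (fix the group ratios first, then sample within each group by importance) might look like the following NumPy sketch. It is an assumption-laden illustration: the function name `sample_with_importance` is hypothetical, the within-group probabilities are taken simply proportional to a non-negative per-example weight (for instance a recent loss value), which is a simplification rather than the exact scheme of Loshchilov and Hutter (2016), and each group is assumed to be non-empty with a positive total weight.

```python
import numpy as np

def sample_with_importance(group_ids, group_probs, weights, batch_size, rng):
    """Two-stage draw: group counts from group_probs, then weighted picks per group.

    group_ids[i] is the group index of example i; group_probs are the
    batch ratios chosen by the fairness-driven outer step.
    """
    group_ids = np.asarray(group_ids)
    weights = np.asarray(weights, dtype=float)
    counts = rng.multinomial(batch_size, group_probs)     # how many per group
    batch = []
    for g, k in enumerate(counts):
        members = np.flatnonzero(group_ids == g)
        p = weights[members] / weights[members].sum()     # importance within group
        batch.extend(rng.choice(members, size=k, p=p).tolist())
    return np.array(batch, dtype=int)

# Toy usage: only examples 0 and 5 carry weight, so only they can be drawn
group_ids = np.array([0, 0, 0, 1, 1, 1])
weights = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 1.0])
rng = np.random.default_rng(0)
idx = sample_with_importance(group_ids, [0.5, 0.5], weights, batch_size=8, rng=rng)
```

Keeping the two stages separate means the importance weights only reorder examples inside a group and leave the group ratios set by the fairness step untouched in expectation.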

5. RELATED WORK

Model Fairness Various fairness measures have been proposed to reflect legal and social issues (Narayanan, 2018) . Among them, we focus on group fairness measures: equal opportunity (Hardt et al., 2016) , equalized odds (Hardt et al., 2016) , and demographic parity (Feldman et al., 2015) . A variety of techniques have been proposed and can be categorized into (1) pre-processing techniques (Kamiran and Calders, 2011; Zemel et al., 2013; Feldman et al., 2015; du Pin Calmon et al., 2017; Choi et al., 2020; Jiang and Nachum, 2020) , which debias or reweight data, (2) in-processing techniques (Kamishima et al., 2012; Zafar et al., 2017a; b; Agarwal et al., 2018; Zhang et al., 2018; Cotter et al., 2019; Roh et al., 2020) , which tailor the model training for fairness, and (3) postprocessing techniques (Kamiran et al., 2012; Hardt et al., 2016; Pleiss et al., 2017; Chzhen et al., 2019) , which perturb only the model output without touching upon the inside. Most of these methods require broad changes in data preprocessing, model training, or model outputs in machine learning systems (Venkatasubramanian, 2019) . In contrast, FairBatch only requires a single-line change in code to replace batch selection while achieving comparable or even greater performances against the state of the arts. Among the fairness techniques, AdaFair (Iosifidis and Ntoutsi, 2019) is the most similar in spirit to FairBatch. AdaFair extends the well-known AdaBoost (Friedman et al., 2000) where examples that lead to poor accuracy or fairness are boosted, i.e., given higher weights during the next round of training a new model that is added to the ensemble. In comparison, FairBatch is based on theoretical foundations of bilevel optimization and effectively performs the reweighting during each epoch (not through rounds), which leads to an order of magnitude improvement in speed as shown in Sec. 4.1. 
Although not our immediate focus, there are other noteworthy fairness measures: (1) individual fairness (Dwork et al., 2012), where close examples should be treated similarly, (2) causality-based fairness (Kilbertus et al., 2017; Kusner et al., 2017; Zhang and Bareinboim, 2018; Nabi and Shpitser, 2018; Khademi et al., 2019), which aims to overcome the limitations of non-causal approaches by understanding the causal relationship between attributes, and (3) fairness based on distributionally robust optimization (DRO) (Sinha et al., 2017; Hashimoto et al., 2018), which achieves accuracy parity without knowledge of the sensitive attribute by balancing the risks across all distributions. Extending FairBatch to support these measures is an interesting direction for future work. Finally, Chouldechova and Roth (2018) describe three causes of unfairness that help clarify FairBatch's fairness contributions: (1) minimizing average error fits majority populations, (2) bias encoded in data, and (3) the need to explore and gather more data. FairBatch addresses cause (1) by balancing the sensitive group ratios within a batch. FairBatch also addresses (2) in some cases. For example, consider the recidivism prediction problem described in (Chouldechova and Roth, 2018), where minority populations have biased labels. In this case, FairBatch can be configured to make the recidivism prediction rate for the minority population similar to those of other populations. There may be other types of data bias that FairBatch is not able to address. Finally, FairBatch does not directly address (3), where one must gather more data for better fairness. Instead, there is a recent line of work that studies data collection techniques (Tae and Whang, 2021) for fairness.

Batch Selection

The batch selection literature for SGD focuses on analyzing the effect of batch sizes (Keskar et al., 2017; Masters and Luschi, 2018) and various sampling techniques (Shamir, 2016; Gürbüzbalaban et al., 2019) . More recently, importance sampling techniques have been proposed for faster convergence (Loshchilov and Hutter, 2016; Alain et al., 2016; Stich et al., 2017; Csiba and Richtárik, 2018; Katharopoulos and Fleuret, 2018; Johnson and Guestrin, 2018) . In comparison, FairBatch takes the novel approach of using batch selection for better fairness and is compatible with other existing techniques.

6. CONCLUSION

We addressed model fairness via the lens of bilevel optimization and proposed the FairBatch batch selection algorithm. Bilevel optimization provides a natural framework where the inner optimizer is SGD and the outer optimizer performs adaptive batch selection to improve fairness. We presented FairBatch for implementing this optimization and showed how its underlying theory supports three fairness measures: equal opportunity, equalized odds, and demographic parity. We showed that FairBatch offers respectable performance that is on par with or even better than the state of the art w.r.t. all aspects in consideration: accuracy, fairness, and runtime. Also, FairBatch can readily be adopted in machine learning systems with the minimal change of replacing the batch selection with a single line of code, and can be gracefully merged with other batch selection techniques used for faster convergence.

A APPENDIX -THEORY AND ALGORITHMS

A.1 DEMOGRAPHIC PARITY

We continue from Sec. 2 and provide more details on how we can capture demographic parity using our bilevel optimization framework.

Proposition 2. If $m_{0,0} = m_{0,1} = m_{1,0} = m_{1,1}$, then $L_{0,0}(w) = L_{1,0}(w)$ and $L_{0,1}(w) = L_{1,1}(w)$ can serve as a sufficient condition for demographic parity.

Proof. Slightly abusing the notation, we denote by $\Pr(\cdot)$ the empirical probability. Demographic parity is satisfied when $\Pr(\hat{y} = 1 \mid z = 0) = \Pr(\hat{y} = 1 \mid z = 1)$ holds. Thus,

$$\Pr(\hat{y} = 1, y = 0 \mid z = 0) + \Pr(\hat{y} = 1, y = 1 \mid z = 0) = \Pr(\hat{y} = 1, y = 0 \mid z = 1) + \Pr(\hat{y} = 1, y = 1 \mid z = 1).$$

Since $\ell(|1 - y|, \cdot) = 1 - \ell(y, \cdot)$, we have

$$\frac{1}{m_{\star,0}} \sum_{i: y_i = 0, z_i = 0} \bigl(1 - \ell(y_i, \hat{y}_i)\bigr) + \frac{1}{m_{\star,0}} \sum_{i: y_i = 1, z_i = 0} \ell(y_i, \hat{y}_i) = \frac{1}{m_{\star,1}} \sum_{i: y_i = 0, z_i = 1} \bigl(1 - \ell(y_i, \hat{y}_i)\bigr) + \frac{1}{m_{\star,1}} \sum_{i: y_i = 1, z_i = 1} \ell(y_i, \hat{y}_i).$$

By replacing $\sum_{i: y_i = y, z_i = z} \ell(y_i, \hat{y}_i) = m_{y,z} L_{y,z}(w)$, we obtain

$$\frac{m_{0,0}}{m_{\star,0}} \bigl(1 - L_{0,0}(w)\bigr) + \frac{m_{1,0}}{m_{\star,0}} L_{1,0}(w) = \frac{m_{0,1}}{m_{\star,1}} \bigl(1 - L_{0,1}(w)\bigr) + \frac{m_{1,1}}{m_{\star,1}} L_{1,1}(w).$$

If $m_{0,0} = m_{0,1} = m_{1,0} = m_{1,1}$, the above condition reduces to

$$-L_{0,0}(w) + L_{1,0}(w) = -L_{0,1}(w) + L_{1,1}(w).$$

A sufficient condition for the above condition is $L_{0,0}(w) = L_{1,0}(w)$ and $L_{0,1}(w) = L_{1,1}(w)$.

In general, the condition of the above proposition does not hold. Observe that another sufficient condition for demographic parity is as follows:

$$\frac{m_{1,0}}{m_{\star,0}} L_{1,0}(w) - \frac{m_{1,1}}{m_{\star,1}} L_{1,1}(w) = 0,$$

$$\frac{m_{0,0}}{m_{\star,0}} L_{0,0}(w) - \frac{m_{0,1}}{m_{\star,1}} L_{0,1}(w) = \frac{m_{0,0}}{m_{\star,0}} - \frac{m_{0,1}}{m_{\star,1}}.$$

Let us define $L'_{1,0}(w) = \frac{m_{1,0}}{m_{\star,0}} L_{1,0}(w)$, $L'_{1,1}(w) = \frac{m_{1,1}}{m_{\star,1}} L_{1,1}(w)$, $L'_{0,0}(w) = \frac{m_{0,0}}{m_{\star,0}} L_{0,0}(w)$, $L'_{0,1}(w) = \frac{m_{0,1}}{m_{\star,1}} L_{0,1}(w)$, and $c = \frac{m_{0,0}}{m_{\star,0}} - \frac{m_{0,1}}{m_{\star,1}}$. Also, define $|x|_c = \max\{x - c, c - x\}$. Then, we have the following bilevel optimization problem:

$$\min_{\lambda \in [0,1] \times [0,1]} \max\bigl\{ |L'_{1,0}(w_\lambda) - L'_{1,1}(w_\lambda)|, \; |L'_{0,0}(w_\lambda) - L'_{0,1}(w_\lambda)|_c \bigr\},$$

$$w_\lambda = \arg\min_w \; \lambda_1 L'_{0,0}(w) + (1 - \lambda_1) L'_{1,0}(w) + \lambda_2 L'_{0,1}(w) + (1 - \lambda_2) L'_{1,1}(w).$$
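To make the rescaled quantities concrete, the following minimal Python sketch evaluates the outer objective of the demographic parity problem for a fixed set of predictions. It assumes the 0/1 loss, binary labels and groups, and that every (label, group) cell is non-empty. The function names `group_stats` and `dp_outer_objective` and the toy data are hypothetical and introduced only for illustration; they are not part of the FairBatch code.

```python
from collections import defaultdict


def group_stats(y_true, y_pred, z):
    """Return counts m[(y, z)] and mean 0/1 losses L[(y, z)] per (label, group) cell."""
    m = defaultdict(int)
    errors = defaultdict(int)
    for y, y_hat, g in zip(y_true, y_pred, z):
        m[(y, g)] += 1
        errors[(y, g)] += int(y != y_hat)
    L = {cell: errors[cell] / m[cell] for cell in m}
    return dict(m), L


def dp_outer_objective(y_true, y_pred, z):
    """Evaluate max{|L'_{1,0} - L'_{1,1}|, |L'_{0,0} - L'_{0,1}|_c} for fixed predictions.

    Assumes (illustration only) that all four (label, group) cells are non-empty.
    """
    m, L = group_stats(y_true, y_pred, z)
    m_star = {g: m[(0, g)] + m[(1, g)] for g in (0, 1)}  # group sizes m_{*,z}
    # Rescaled losses L'_{y,z} = (m_{y,z} / m_{*,z}) * L_{y,z}
    L_scaled = {
        (y, g): m[(y, g)] / m_star[g] * L[(y, g)] for y in (0, 1) for g in (0, 1)
    }
    c = m[(0, 0)] / m_star[0] - m[(0, 1)] / m_star[1]
    term_y1 = abs(L_scaled[(1, 0)] - L_scaled[(1, 1)])
    x = L_scaled[(0, 0)] - L_scaled[(0, 1)]
    term_y0 = max(x - c, c - x)  # |x|_c
    return max(term_y1, term_y0), L_scaled, c


if __name__ == "__main__":
    y_true = [0, 0, 1, 1, 0, 1, 1, 1]
    z = [0, 0, 0, 0, 1, 1, 1, 1]
    y_pred = [0, 1, 1, 0, 0, 1, 1, 0]  # both groups receive positive predictions at the same rate
    objective, _, c = dp_outer_objective(y_true, y_pred, z)
    print(objective, c)
```

In this toy example both groups have the same positive prediction rate, so both terms of the objective vanish even though the group sizes per label are unequal, which is the situation that the offset $c$ is meant to handle.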

A.2 PROOF FOR LEMMA 1

We continue from Sec. 3.1 and provide a full proof for Lemma 1. Here we recall Lemma 1.

Lemma 1 (Quasi-convexity of $F(\lambda)$). For $d = 1$, if $f_1(\cdot)$, $g_1(\cdot)$, and $h(\cdot)$ satisfy either

1. $h(w) = 0$, or
2. $f_1(\cdot)$, $g_1(\cdot)$, and $h(\cdot)$ are twice differentiable and $\lambda \nabla^2 f_1(w_\lambda) + (c_1 - \lambda) \nabla^2 g_1(w_\lambda) + \nabla^2 h(w_\lambda) \succ 0$ for every $\lambda \in [0, c_1]$,

then $F(\lambda)$ is quasi-convex, i.e., $F(t\lambda + (1 - t)\lambda') \le \max\{F(\lambda), F(\lambda')\}$ for all $t \in [0, 1]$ and $\lambda, \lambda'$. Also, if $F(\cdot) \neq 0$, then $\partial_\lambda F(\lambda) = \{v\}$ and $\operatorname{sign}(v) = \operatorname{sign}(g_1(w_\lambda) - f_1(w_\lambda))$.

Proof. It is known that a continuous function $f: \mathbb{R} \to \mathbb{R}$ is quasiconvex if and only if at least one of the following conditions holds: $f$ is 1) nondecreasing, 2) nonincreasing, or 3) nonincreasing and then nondecreasing (Boyd et al., 2004). We will prove the lemma by showing that $F(\lambda)$ is nonincreasing and then nondecreasing. More precisely, we will show that $f_1(w_\lambda) - g_1(w_\lambda)$ is a nonincreasing function. It is easy to see that this directly implies that $|f_1(w_\lambda) - g_1(w_\lambda)|$ is nonincreasing and then nondecreasing.

Case 1 ($h(w) = 0$). Consider $\lambda_1$ and $\lambda_2$ such that $\lambda_1 > \lambda_2$. If we can show $f_1(w^*_{\lambda_1}) \le f_1(w^*_{\lambda_2})$ and $g_1(w^*_{\lambda_1}) \ge g_1(w^*_{\lambda_2})$, then this implies that $f_1(w_\lambda) - g_1(w_\lambda)$ is a nonincreasing function. Indeed, this is very intuitive: if we increase $\lambda$, the inner optimization problem puts a higher weight on $f_1(\cdot)$, resulting in a lower value of $f_1(w^*)$ and a higher value of $g_1(w^*)$. We formally show this by contradiction. By the definition of $w_\lambda$, we have the following two conditions:

$$\lambda_1 f_1(w^*_{\lambda_1}) + (c_1 - \lambda_1) g_1(w^*_{\lambda_1}) \le \lambda_1 f_1(w) + (c_1 - \lambda_1) g_1(w), \quad \forall w, \tag{1}$$

$$\lambda_2 f_1(w^*_{\lambda_2}) + (c_1 - \lambda_2) g_1(w^*_{\lambda_2}) \le \lambda_2 f_1(w) + (c_1 - \lambda_2) g_1(w), \quad \forall w. \tag{2}$$

This completes the proof of the first claim by contradiction. The second claim immediately follows from the first claim.
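Since $F(\lambda) = |f_1(w_\lambda) - g_1(w_\lambda)|$, we have

$$\frac{dF(\lambda)}{d\lambda} = \operatorname{sign}\bigl(f_1(w_\lambda) - g_1(w_\lambda)\bigr) \, \frac{d}{d\lambda}\bigl(f_1(w_\lambda) - g_1(w_\lambda)\bigr).$$

As shown in the earlier part of this proof, $f_1(w_\lambda) - g_1(w_\lambda)$ is a nonincreasing function, i.e., $\frac{d}{d\lambda}\bigl(f_1(w_\lambda) - g_1(w_\lambda)\bigr) \le 0$. Thus, $\operatorname{sign}\bigl(\frac{dF(\lambda)}{d\lambda}\bigr) = \operatorname{sign}(g_1(w_\lambda) - f_1(w_\lambda))$.

Case 2 ($f_1(\cdot)$, $g_1(\cdot)$, and $h(\cdot)$ are twice differentiable and $\lambda \nabla^2 f_1(w_\lambda) + (c_1 - \lambda) \nabla^2 g_1(w_\lambda) + \nabla^2 h(w_\lambda) \succ 0$ for every $\lambda \in [0, c_1]$). In this part of the proof, we will denote $w_\lambda$ by $w$ for simplicity. To show that $f_1(w) - g_1(w)$ is a nonincreasing function (in $\lambda$), consider the derivative:

$$\frac{d}{d\lambda}\bigl(f_1(w) - g_1(w)\bigr) = \bigl(\nabla f_1(w) - \nabla g_1(w)\bigr)^\top \frac{dw}{d\lambda}.$$

To compute $\frac{dw}{d\lambda}$, we implicitly differentiate (with respect to $\lambda$) the following stationary equation:

$$\lambda \nabla f_1(w) + (c_1 - \lambda) \nabla g_1(w) + \nabla h(w) = 0$$

$$\Rightarrow \; \nabla f_1(w) + \lambda \nabla^2 f_1(w) \frac{dw}{d\lambda} - \nabla g_1(w) + (c_1 - \lambda) \nabla^2 g_1(w) \frac{dw}{d\lambda} + \nabla^2 h(w) \frac{dw}{d\lambda} = 0.$$

By rearranging terms, we have

$$\bigl[\lambda \nabla^2 f_1(w) + (c_1 - \lambda) \nabla^2 g_1(w) + \nabla^2 h(w)\bigr] \frac{dw}{d\lambda} = -\bigl(\nabla f_1(w) - \nabla g_1(w)\bigr).$$

By the assumption, $\lambda \nabla^2 f_1(w) + (c_1 - \lambda) \nabla^2 g_1(w) + \nabla^2 h(w)$ is positive definite and hence invertible. Thus,

$$\frac{dw}{d\lambda} = -\bigl[\lambda \nabla^2 f_1(w) + (c_1 - \lambda) \nabla^2 g_1(w) + \nabla^2 h(w)\bigr]^{-1} \bigl(\nabla f_1(w) - \nabla g_1(w)\bigr).$$

Therefore,

$$\frac{d}{d\lambda}\bigl(f_1(w) - g_1(w)\bigr) = -\bigl(\nabla f_1(w) - \nabla g_1(w)\bigr)^\top \bigl[\lambda \nabla^2 f_1(w) + (c_1 - \lambda) \nabla^2 g_1(w) + \nabla^2 h(w)\bigr]^{-1} \bigl(\nabla f_1(w) - \nabla g_1(w)\bigr).$$

Note that $\bigl[\lambda \nabla^2 f_1(w) + (c_1 - \lambda) \nabla^2 g_1(w) + \nabla^2 h(w)\bigr]^{-1}$ is also positive definite. Thus, $\frac{d}{d\lambda}\bigl(f_1(w) - g_1(w)\bigr)$ is never positive, and hence $f_1(w) - g_1(w)$ is a nonincreasing function. Now, observe that $\operatorname{sign}\bigl(f_1(w) - g_1(w)\bigr) \cdot \frac{d}{d\lambda}\bigl(f_1(w) - g_1(w)\bigr) \in \partial_\lambda F(\lambda)$. Therefore, if $F(\cdot) \neq 0$, then $\partial_\lambda F(\lambda) = \{v\}$ and $\operatorname{sign}(v) = \operatorname{sign}(g_1(w) - f_1(w))$.

We continue from Sec. 3.2 and provide an example where the inner objective's convexity does not imply the outer objective's convexity. Consider the following strongly convex functions $f_1(\cdot)$ and $g_1(\cdot)$:

$$f_1(w) = \frac{e^{w} + e^{-w}}{5}, \qquad g_1(w) = (w - 1)^2.$$

Shown in Fig. 2 is the outer objective function $F(\lambda)$.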
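One can observe that it is not convex. Note that it is quasiconvex by Lemma 1.

A.4 GRADIENT WHEN d ≥ 1

We continue from Sec. 3.2 and derive the gradient of the outer objective function. Recall how we formulated the general bilevel optimization problem:

$$\min_{\lambda \in \Lambda} \max_{i = 1, \ldots, d} |f_i(w_\lambda) - g_i(w_\lambda)|, \qquad w_\lambda = \arg\min_w \sum_{i=1}^{d} \bigl[\lambda_i f_i(w) + (c_i - \lambda_i) g_i(w)\bigr] + h(w).$$

In this section, we will prove the following:

$$\operatorname{sign}\bigl(g_{i^*}(w) - f_{i^*}(w)\bigr) \bigl(\nabla f_{i^*}(w) - \nabla g_{i^*}(w)\bigr)^\top H_\lambda^{-1} \bigl(\nabla f_i(w) - \nabla g_i(w)\bigr) \in \partial_{\lambda_i} F(\lambda), \quad \forall i.$$

Assume that $\sum_{i=1}^{d} \bigl[\lambda_i \nabla^2 f_i(w_\lambda) + (c_i - \lambda_i) \nabla^2 g_i(w_\lambda)\bigr] + \nabla^2 h(w_\lambda) \succ 0$ for every $\lambda \in \Lambda$. In this part of the proof, we will denote $w_\lambda$ by $w$ for simplicity. To compute $\frac{\partial w}{\partial \lambda_i}$, we implicitly differentiate (with respect to $\lambda_i$) the following stationary equation:

$$\sum_{j=1}^{d} \bigl[\lambda_j \nabla f_j(w) + (c_j - \lambda_j) \nabla g_j(w)\bigr] + \nabla h(w) = 0$$

$$\Rightarrow \; \nabla f_i(w) + \lambda_i \nabla^2 f_i(w) \frac{\partial w}{\partial \lambda_i} - \nabla g_i(w) + (c_i - \lambda_i) \nabla^2 g_i(w) \frac{\partial w}{\partial \lambda_i} + \sum_{1 \le j \le d, \, j \ne i} \Bigl[\lambda_j \nabla^2 f_j(w) \frac{\partial w}{\partial \lambda_i} + (c_j - \lambda_j) \nabla^2 g_j(w) \frac{\partial w}{\partial \lambda_i}\Bigr] + \nabla^2 h(w) \frac{\partial w}{\partial \lambda_i} = 0.$$

By rearranging terms, we have

$$\Bigl[\sum_{j=1}^{d} \bigl(\lambda_j \nabla^2 f_j(w) + (c_j - \lambda_j) \nabla^2 g_j(w)\bigr) + \nabla^2 h(w)\Bigr] \frac{\partial w}{\partial \lambda_i} = -\bigl(\nabla f_i(w) - \nabla g_i(w)\bigr).$$

By the assumption, $H_\lambda := \sum_{j=1}^{d} \bigl(\lambda_j \nabla^2 f_j(w) + (c_j - \lambda_j) \nabla^2 g_j(w)\bigr) + \nabla^2 h(w)$ is positive definite and hence invertible. Thus,

$$\frac{\partial w}{\partial \lambda_i} = -H_\lambda^{-1} \bigl(\nabla f_i(w) - \nabla g_i(w)\bigr).$$

Now observe that $F(\lambda) = |f_{i^*}(w_\lambda) - g_{i^*}(w_\lambda)|$. Therefore,

$$\operatorname{sign}\bigl(f_{i^*}(w) - g_{i^*}(w)\bigr) \frac{\partial}{\partial \lambda_i}\bigl(f_{i^*}(w) - g_{i^*}(w)\bigr) \in \partial_{\lambda_i} F(\lambda).$$

Since $\frac{\partial}{\partial \lambda_i}\bigl(f_{i^*}(w) - g_{i^*}(w)\bigr) = -\bigl(\nabla f_{i^*}(w) - \nabla g_{i^*}(w)\bigr)^\top H_\lambda^{-1} \bigl(\nabla f_i(w) - \nabla g_i(w)\bigr)$, we have

$$-\operatorname{sign}\bigl(f_{i^*}(w) - g_{i^*}(w)\bigr) \bigl(\nabla f_{i^*}(w) - \nabla g_{i^*}(w)\bigr)^\top H_\lambda^{-1} \bigl(\nabla f_i(w) - \nabla g_i(w)\bigr) \in \partial_{\lambda_i} F(\lambda).$$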
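The case $d = 1$ of the derivation above can be sanity-checked numerically with the example functions $f_1(w) = (e^{w} + e^{-w})/5$ and $g_1(w) = (w - 1)^2$ and $h = 0$. The sketch below is only an illustration: the choice $c_1 = 1$, the bisection solver, and all function names are assumptions made here, not taken from the paper. It solves the inner problem, evaluates the implicit-differentiation formula for $dw/d\lambda$, and evaluates the outer objective $F(\lambda)$.

```python
import math

C1 = 1.0  # assumed value of c_1 for this illustration


def f1(w):
    return (math.exp(w) + math.exp(-w)) / 5.0


def g1(w):
    return (w - 1.0) ** 2


def df1(w):
    return (math.exp(w) - math.exp(-w)) / 5.0


def dg1(w):
    return 2.0 * (w - 1.0)


def d2f1(w):
    return (math.exp(w) + math.exp(-w)) / 5.0


def d2g1(w):
    return 2.0


def inner_solution(lam):
    """Solve lam * f1'(w) + (C1 - lam) * g1'(w) = 0 by bisection.

    For lam in [0, C1] the stationary point lies in [0, 1], because the
    derivative of the (convex) inner objective is <= 0 at w = 0 and >= 0 at w = 1.
    """
    lo, hi = 0.0, 1.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if lam * df1(mid) + (C1 - lam) * dg1(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def outer_F(lam):
    """Outer objective F(lam) = |f1(w_lam) - g1(w_lam)|."""
    w = inner_solution(lam)
    return abs(f1(w) - g1(w))


def dw_dlam(lam):
    """Implicit-differentiation formula: dw/dlam = -(f1'(w) - g1'(w)) / H_lam."""
    w = inner_solution(lam)
    hessian = lam * d2f1(w) + (C1 - lam) * d2g1(w)
    return -(df1(w) - dg1(w)) / hessian


if __name__ == "__main__":
    lam, h = 0.3, 1e-5
    finite_diff = (inner_solution(lam + h) - inner_solution(lam - h)) / (2 * h)
    print("finite difference:", finite_diff, "formula:", dw_dlam(lam))
    # Midpoint test for (non-)convexity of F on the segment [0, 0.9]
    print(outer_F(0.45), 0.5 * (outer_F(0.0) + outer_F(0.9)))
```

Under these assumptions, the finite-difference estimate of $dw/d\lambda$ should agree with the closed-form expression, and comparing $F$ at a midpoint with the average of $F$ at the endpoints gives a simple way to see that $F$ need not be convex even though it is quasiconvex.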

A.5 RATIONALE AND INTUITION BEHIND THE APPROXIMATION

We continue from Sec. 3.2 and provide more justifications for the gradient approximation technique. Assume that $\sum_{i=1}^{d} \bigl[\lambda_i \nabla^2 f_i(w_\lambda) + (c_i - \lambda_i) \nabla^2 g_i(w_\lambda)\bigr] + \nabla^2 h(w_\lambda) \succ 0$ for every $\lambda \in \Lambda$. Then, the gradient can be fully characterized as in equation 4. The rationale behind the approximation $\gamma \approx (0, 0, \ldots, \gamma_{i^*}, \ldots, 0)$ is that $|\gamma_i|$ will be maximized at $i = i^*$ if $\|\nabla f_1(w) - \nabla g_1(w)\| \approx \|\nabla f_2(w) - \nabla g_2(w)\| \approx \cdots \approx \|\nabla f_d(w) - \nabla g_d(w)\|$. This is because $\bigl(\nabla f_{i^*}(w) - \nabla g_{i^*}(w)\bigr)^\top H_\lambda^{-1} \bigl(\nabla f_i(w) - \nabla g_i(w)\bigr)$ is an inner product between $H_\lambda^{-1/2} \bigl(\nabla f_{i^*}(w) - \nabla g_{i^*}(w)\bigr)$ and $H_\lambda^{-1/2} \bigl(\nabla f_i(w) - \nabla g_i(w)\bigr)$, and they are always perfectly aligned when $i = i^*$.

This approximation is also intuitive. Recall that changing $\lambda_{i^*}$ affects the weights associated with $f_{i^*}(w)$ and $g_{i^*}(w)$ in the inner optimization problem. Thus, changes in $\lambda_{i^*}$ will directly affect $F(\lambda) = |f_{i^*}(w) - g_{i^*}(w)|$. On the other hand, changing $\lambda_i$ for $i \ne i^*$ does not affect the weights associated with $f_{i^*}(w)$ and $g_{i^*}(w)$ but only affects the weights of other terms, so it will only indirectly and weakly affect $F(\lambda)$.
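The inner-product argument can be illustrated with a small two-dimensional example. The sketch below is an assumption-laden illustration rather than the paper's procedure: the matrix `H`, the gradient-difference vectors `diffs`, and the helper names are all made up for the example. It computes every entry $\gamma_i$ from the formula of Sec. A.4 and then keeps only the $i^*$ entry, as the approximation does.

```python
def solve_2x2(H, b):
    """Solve H x = b for a 2x2 matrix H given as ((h00, h01), (h10, h11))."""
    (h00, h01), (h10, h11) = H
    det = h00 * h11 - h01 * h10
    return ((h11 * b[0] - h01 * b[1]) / det, (-h10 * b[0] + h00 * b[1]) / det)


def gamma(diffs, H, i_star, sign):
    """Entries sign * a_{i*}^T H^{-1} a_i, where a_i = grad f_i(w) - grad g_i(w).

    `sign` stands for sign(g_{i*}(w) - f_{i*}(w)); H is assumed positive definite.
    """
    a_star = diffs[i_star]
    return [
        sign * sum(p * q for p, q in zip(a_star, solve_2x2(H, a_i)))
        for a_i in diffs
    ]


def keep_dominant(g, i_star):
    """The approximation gamma ~ (0, ..., gamma_{i*}, ..., 0)."""
    return [value if i == i_star else 0.0 for i, value in enumerate(g)]


if __name__ == "__main__":
    H = ((2.0, 0.0), (0.0, 1.0))        # hypothetical positive definite H_lambda
    diffs = [(1.0, 0.0), (0.9, 0.1)]    # hypothetical gradient differences a_1, a_2
    g = gamma(diffs, H, i_star=0, sign=-1)
    print(g, keep_dominant(g, 0))
```

In this example the entry at $i^*$ has the largest magnitude, and the approximation simply discards the remaining entry.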

A.6 FAIRBATCH ALGORITHMS IN PSEUDOCODE

We continue from Sec. 3.2 and present the FairBatch algorithms in pseudocode. Algorithms 2, 3, and 4 show how $\lambda$ is adjusted for equal opportunity, equalized odds, and demographic parity, respectively. From the intermediate model at each epoch (or after a certain number of iterations), we first obtain $f(w)$ and $g(w)$, which correspond to the losses conditioned on each class. Then, one can update the current value of $\lambda$ by comparing $f(w)$ and $g(w)$.

Algorithm 4: Adaptive adjustment of $\lambda$ w.r.t. demographic parity.
Input: Intermediate model, criterion, train data $(x_{\text{train}}, z_{\text{train}}, y_{\text{train}})$, previous lambda $\lambda^{(t-1)}$, and FairBatch's learning rate $\alpha$
    output = model($x_{\text{train}}$)
    loss = criterion(output, 1)
    $d_{y=0}$ = sum(loss[(y = 0, z = 0)]) / len(z = 0) - sum(loss[(y = 0, z = 1)]) / len(z = 1)
    $d_{y=1}$ = sum(loss[(y = 1, z = 0)]) / len(z = 0) - sum(loss[(y = 1, z = 1)]) / len(z = 1)
    if $|d_{y=0}| > |d_{y=1}|$ then
        $\lambda_1^{(t)} = \lambda_1^{(t-1)} - \alpha$ if $d_{y=0} > 0$; $\lambda_1^{(t-1)} + \alpha$ if $d_{y=0} < 0$; $\lambda_1^{(t-1)}$ otherwise
    else
        $\lambda_2^{(t)} = \lambda_2^{(t-1)} + \alpha$ if $d_{y=1} > 0$; $\lambda_2^{(t-1)} - \alpha$ if $d_{y=1} < 0$; $\lambda_2^{(t-1)}$ otherwise
Output: Next lambda $\lambda^{(t)}$
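The update rule of Algorithm 4 can be written as a short plain-Python function. This is a sketch under stated assumptions, not the released FairBatch implementation: per-example losses are passed in as a list (assumed to be already computed against label 1, as in `criterion(output, 1)`), labels and groups are binary, and the function name `update_lambda_dp` is hypothetical. No clipping of $\lambda$ is applied because the pseudocode does not show any.

```python
def update_lambda_dp(losses, y, z, lam, alpha):
    """One adaptive update of (lambda_1, lambda_2) for demographic parity.

    losses: per-example losses (assumed computed against label 1)
    y, z:   binary labels and sensitive-group ids, same length as losses
    lam:    tuple (lambda_1, lambda_2) from the previous step
    alpha:  FairBatch's learning rate
    """

    def scaled_loss(label, group):
        # sum(loss[(y = label, z = group)]) / len(z = group)
        total = sum(l for l, yi, zi in zip(losses, y, z) if yi == label and zi == group)
        group_size = sum(1 for zi in z if zi == group)
        return total / group_size

    d_y0 = scaled_loss(0, 0) - scaled_loss(0, 1)
    d_y1 = scaled_loss(1, 0) - scaled_loss(1, 1)

    lam1, lam2 = lam
    if abs(d_y0) > abs(d_y1):
        if d_y0 > 0:
            lam1 -= alpha
        elif d_y0 < 0:
            lam1 += alpha
    else:
        if d_y1 > 0:
            lam2 += alpha
        elif d_y1 < 0:
            lam2 -= alpha
    return lam1, lam2


if __name__ == "__main__":
    y = [0, 0, 1, 1, 0, 0, 1, 1]
    z = [0, 0, 0, 0, 1, 1, 1, 1]
    losses = [0.9, 0.8, 0.2, 0.1, 0.3, 0.2, 0.2, 0.1]
    print(update_lambda_dp(losses, y, z, lam=(0.5, 0.5), alpha=0.1))
```

Only one of the two parameters moves per call, namely the one whose disparity currently has the larger magnitude, which mirrors the branch structure of the pseudocode.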

B.4 FAIRNESS CURVES

We continue from Sec. 4.1 and show in Figures 3 and 4 the EO and DP disparity curves against the number of epochs for each fairness technique on the synthetic dataset. We also directly compare the curves of all fairness techniques in one graph, as shown in Figure 5. Since LBC and AdaFair require more than 10x as many epochs as the other methods, we only show their first 1000 epochs. The curves show that FairBatch is one of the fastest methods to converge to low EO or DP disparities.

B.5 TRADE-OFF CURVES OF FAIRBATCH

We continue from Sec. 4.1 and show in Fig. 6 the accuracy-fairness trade-off curves of FairBatch for EO and DP on the synthetic dataset. FairBatch can be tuned by making it "less sensitive" to disparity. In Algorithms 2 and 4, notice that the λ parameters are updated if there is any disparity among sensitive groups. We modify this logic so that the λ parameters are only updated if the disparity is above some threshold T. The trade-off curves in Fig. 6 are thus generated by adjusting T. For both EO and DP, we observe a clear trade-off between accuracy and disparity.
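The thresholded variant can be sketched as a single update step. The helper below is hypothetical: its name, its signature, and the `direction` argument (which encodes whether the parameter should increase or decrease when the disparity is positive) are assumptions for illustration. With `threshold = 0` it reduces to the unthresholded sign-based step.

```python
def thresholded_step(lam, disparity, alpha, threshold, direction):
    """Move lam by alpha only if |disparity| exceeds the threshold T.

    direction = +1: lam increases when disparity > 0 and decreases when it is < 0.
    direction = -1: lam decreases when disparity > 0 and increases when it is < 0.
    """
    if abs(disparity) <= threshold:
        return lam  # disparity is tolerated, so the parameter is left unchanged
    step = alpha if disparity > 0 else -alpha
    return lam + direction * step


if __name__ == "__main__":
    # Disparity below T: no update. Disparity above T: sign-based update.
    print(thresholded_step(0.5, 0.03, alpha=0.1, threshold=0.05, direction=-1))
    print(thresholded_step(0.5, 0.20, alpha=0.1, threshold=0.05, direction=-1))
```

A larger threshold leaves the sampling ratios untouched more often, which is how the sketch would trade fairness for accuracy.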

B.6 WALL CLOCK TIMES

We continue from Sec. 4.1 and show in Table 5 the wall clock times (in seconds) of the experiments in Table 1 where we compare FairBatch against all the fairness techniques on the synthetic, COMPAS, and AdultCensus datasets. As a result, each runtime is proportional to the number of epochs shown in Table 1 . When comparing the runtimes of individual batches, FairBatch's batch takes 1.5x longer to run than LR's batch. 



(a) Accuracy difference across sensitive groups in the sense of equalized odds (that we denote as "ED disparity") when running FairBatch on the ProPublica COMPAS dataset.

    fairsampler = FairBatch(model, criterion, train_data, batch_size, alpha, target_fairness)
    loader = DataLoader(train_data, sampler=fairsampler)
    for epoch in range(epochs):
        for i, data in enumerate(loader):
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data
            ... model forward, backward, and optimization ...

(b) PyTorch code for model training where the batch selection is replaced with FairBatch.

Figure 1: The black path in the left figure shows how FairBatch adjusts the batch-size ratios of sensitive groups using two reweighting parameters λ 1 and λ 2 (hyperparameters employed in our framework to be described in Sec. 2), thus minimizing their ED disparity, i.e., achieving equalized odds. The code in the right figure shows how easily FairBatch can be incorporated in a PyTorch machine learning pipeline. It requires a single-line change to replace the existing sampler with FairBatch, marked in blue.

Bilevel optimization with MinibatchSGD

    Minibatch sampling distribution ← Uniform sampling
    for each epoch do
        Draw minibatches according to minibatch sampling distribution
        for each minibatch do
            w ← MinibatchSGD(w, each minibatch)
        Update minibatch sampling distribution

inner optimization problem, and the outer optimizer solves an outer optimization problem based on the outcomes of inner optimization. By viewing the standard training algorithm such as stochastic gradient descent (SGD)

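4.1 and Sec. 4.3. The equalized odds (ED) measure is used in Sec. 4.2 and Sec. B.2. To quantify EO, ED, and DP, we compute the disparity between sensitive groups:

$$\text{EO disparity} = \max_{z \in Z} \bigl| \Pr(\hat{\mathrm{y}} = 1 \mid \mathrm{z} = z, \mathrm{y} = 1) - \Pr(\hat{\mathrm{y}} = 1 \mid \mathrm{y} = 1) \bigr|,$$

$$\text{ED disparity} = \max_{z \in Z, \, y \in Y, \, \hat{y} \in \hat{Y}} \bigl| \Pr(\hat{\mathrm{y}} = \hat{y} \mid \mathrm{z} = z, \mathrm{y} = y) - \Pr(\hat{\mathrm{y}} = \hat{y} \mid \mathrm{y} = y) \bigr|,$$

$$\text{DP disparity} = \max_{z \in Z} \bigl| \Pr(\hat{\mathrm{y}} = 1 \mid \mathrm{z} = z) - \Pr(\hat{\mathrm{y}} = 1) \bigr|.$$

As we discussed in Sec. 3, EO has a single-dimension outer optimization where the number of disparities $d = 1$, while ED and DP have multi-dimensional outer optimizations where $d > 1$.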
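The three disparity definitions translate directly into code. The following plain-Python sketch uses empirical frequencies for the probabilities; the function names and the toy data are hypothetical, and the sketch assumes that every conditioning event (for example, a given group with a given label) occurs at least once in the data.

```python
def _rate(y_pred, keep, value=1):
    """Empirical Pr(y_hat = value) over the examples selected by the boolean mask `keep`."""
    selected = [p for p, k in zip(y_pred, keep) if k]
    return sum(1 for p in selected if p == value) / len(selected)


def eo_disparity(y_true, y_pred, z):
    """max_z |Pr(y_hat = 1 | z, y = 1) - Pr(y_hat = 1 | y = 1)|."""
    base = _rate(y_pred, [y == 1 for y in y_true])
    return max(
        abs(_rate(y_pred, [y == 1 and g == grp for y, g in zip(y_true, z)]) - base)
        for grp in set(z)
    )


def ed_disparity(y_true, y_pred, z):
    """max over z, y, y_hat of |Pr(y_hat | z, y) - Pr(y_hat | y)|."""
    worst = 0.0
    for label in set(y_true):
        for value in set(y_true) | set(y_pred):
            base = _rate(y_pred, [y == label for y in y_true], value)
            for grp in set(z):
                mask = [y == label and g == grp for y, g in zip(y_true, z)]
                worst = max(worst, abs(_rate(y_pred, mask, value) - base))
    return worst


def dp_disparity(y_pred, z):
    """max_z |Pr(y_hat = 1 | z) - Pr(y_hat = 1)|."""
    base = _rate(y_pred, [True] * len(y_pred))
    return max(abs(_rate(y_pred, [g == grp for g in z]) - base) for grp in set(z))


if __name__ == "__main__":
    y_true = [0, 0, 1, 1, 0, 1, 1, 1]
    z = [0, 0, 0, 0, 1, 1, 1, 1]
    y_pred = [0, 1, 1, 0, 0, 1, 1, 0]
    print(eo_disparity(y_true, y_pred, z))
    print(ed_disparity(y_true, y_pred, z))
    print(dp_disparity(y_pred, z))
```

Note that each disparity compares a group-conditional rate against the rate over all groups, following the definitions above, rather than comparing pairs of groups against each other.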

Figure 2: F (λ) is not convex, but quasi-convex.

Figure 3: EO disparity curves of algorithms on the synthetic dataset.

(a) EO disparity curve of FairBatch and the baselines.

DP disparity curve of FairBatch and the baselines.

Figure 5: Epochs-fairness disparity curves of all algorithms together.

Figure 6: Accuracy-fairness disparity trade-off curves of FairBatch on the synthetic dataset.

Epochs-class weight curve of FairBatch.

Figure 7: Comparison of the weight changes on AdaFair and FairBatch w.r.t. equalized odds on the synthetic dataset.

FairBatch .855±.000 .012±.001 300 .681±.001 .022±.005 100 .844±.001 .011±.003 400

Performances on the synthetic, COMPAS, and AdultCensus test sets w.r.t. demographic parity (DP). The other settings are identical to those in Table 1.

Performances of the pre-trained models fine-tuned with FairBatch on the UTKFace test set w.r.t. equalized odds (ED) for two fairness scenarios. While Tables 1 and 2 already demonstrate FairBatch's performance against the state of the art, the emphasis here is more on FairBatch's usability: it is easy to adopt and yet improves the fairness of existing models.

Performances on the synthetic, COMPAS, and AdultCensus test sets w.r.t. equalized odds (ED). The other settings are identical to Table 1.

Wall clock times (in seconds) of the experiments in Table 1 using the same settings.

ACKNOWLEDGEMENTS

Yuji Roh and Steven E. Whang were supported by a Google AI Focused Research Award and by the Engineering Research Center Program through the National Research Foundation of Korea (NRF) funded by the Korean Government MSIT (NRF-2018R1A5A1059921). Kangwook Lee was supported by NSF/Intel Partnership on Machine Learning for Wireless Networking Program under Grant No. CNS-2003129. Changho Suh was supported by Institute for Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-01396, Development of framework for analyzing, detecting, mitigating of bias in AI model and training data).


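If the lemma's statement is false, one of the following three events should occur:

1. $f_1(w^*_{\lambda_1}) > f_1(w^*_{\lambda_2})$ and $g_1(w^*_{\lambda_1}) \ge g_1(w^*_{\lambda_2})$: By adding these two inequalities with respective weights $\lambda_1$ and $c_1 - \lambda_1$, we have $\lambda_1 f_1(w^*_{\lambda_1}) + (c_1 - \lambda_1) g_1(w^*_{\lambda_1}) > \lambda_1 f_1(w^*_{\lambda_2}) + (c_1 - \lambda_1) g_1(w^*_{\lambda_2})$. This contradicts equation 1 (the optimality of $w^*_{\lambda_1}$) with $w = w^*_{\lambda_2}$.

2. $f_1(w^*_{\lambda_1}) \le f_1(w^*_{\lambda_2})$ and $g_1(w^*_{\lambda_1}) < g_1(w^*_{\lambda_2})$: Similarly, by adding these two inequalities with respective weights $\lambda_2$ and $c_1 - \lambda_2$, we have $\lambda_2 f_1(w^*_{\lambda_1}) + (c_1 - \lambda_2) g_1(w^*_{\lambda_1}) < \lambda_2 f_1(w^*_{\lambda_2}) + (c_1 - \lambda_2) g_1(w^*_{\lambda_2})$. This contradicts equation 2 (the optimality of $w^*_{\lambda_2}$) with $w = w^*_{\lambda_1}$.

3. $f_1(w^*_{\lambda_1}) > f_1(w^*_{\lambda_2})$ and $g_1(w^*_{\lambda_1}) < g_1(w^*_{\lambda_2})$: By adding equation 1 with $w = w^*_{\lambda_2}$ and equation 2 with $w = w^*_{\lambda_1}$, we have $\lambda_1 f_1(w^*_{\lambda_1}) + (c_1 - \lambda_1) g_1(w^*_{\lambda_1}) + \lambda_2 f_1(w^*_{\lambda_2}) + (c_1 - \lambda_2) g_1(w^*_{\lambda_2}) \le \lambda_1 f_1(w^*_{\lambda_2}) + (c_1 - \lambda_1) g_1(w^*_{\lambda_2}) + \lambda_2 f_1(w^*_{\lambda_1}) + (c_1 - \lambda_2) g_1(w^*_{\lambda_1})$. By rearranging terms, we have $(\lambda_1 - \lambda_2)\bigl(f_1(w^*_{\lambda_1}) - f_1(w^*_{\lambda_2})\bigr) \le (\lambda_1 - \lambda_2)\bigl(g_1(w^*_{\lambda_1}) - g_1(w^*_{\lambda_2})\bigr)$. This contradicts the condition as the left-hand side is positive while the right-hand side is negative.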


Algorithm 2: Adaptive adjustment of $\lambda$ w.r.t. equal opportunity.
Input: Intermediate model, criterion, train data $(x_{\text{train}}, z_{\text{train}}, y_{\text{train}})$, previous lambda $\lambda^{(t-1)}$, and FairBatch's learning rate $\alpha$
    output = model($x_{\text{train}}$)
    loss = criterion(output, $y_{\text{train}}$)

Algorithm 3: Adaptive adjustment of $\lambda$ w.r.t. equalized odds.
Input: Intermediate model, criterion, train data $(x_{\text{train}}, z_{\text{train}}, y_{\text{train}})$, previous lambda $\lambda^{(t-1)}$, and FairBatch's learning rate $\alpha$
    output = model($x_{\text{train}}$)
    loss = criterion(output, $y_{\text{train}}$)

We continue from Sec. 4 and provide more details on experimental settings. We use the Adam optimizer for all training runs. We perform cross-validation on the training sets to find the best hyperparameters for each algorithm. We evaluate models on separate test sets, and the ratios of the train versus test data for the synthetic and real datasets are 2:1 and 4:1, respectively.

B.2 EQUALIZED ODDS RESULTS

We continue from Sec. 4.1 and show 

