FAIRGRAD: FAIRNESS AWARE GRADIENT DESCENT

Abstract

We address the problem of group fairness in classification, where the objective is to learn models that do not unjustly discriminate against subgroups of the population. Most existing approaches are limited to simple binary tasks or involve difficult-to-implement training mechanisms, which reduces their practical applicability. In this paper, we propose FairGrad, a method to enforce fairness based on a reweighting scheme that iteratively learns group-specific weights based on whether the groups are advantaged or not. FairGrad is easy to implement and can accommodate various standard fairness definitions. Furthermore, we show that it is competitive with standard baselines over various datasets, including ones used in natural language processing and computer vision.

1. INTRODUCTION

Fair Machine Learning addresses the problem of learning models that are free of any discriminatory behavior against a subset of the population. For instance, consider a company that develops a model to predict whether a person would be a suitable hire based on their biography. A possible source of discrimination here arises if, in the data available to the company, individuals that are part of a subgroup formed based on their gender, ethnicity, or other sensitive attributes are consistently labelled as unsuitable hires regardless of their true competency, due to historical bias. This kind of discrimination can be measured by a fairness notion called Demographic Parity (Calders et al., 2009). If the data is unbiased, another source of discrimination may stem from the model itself if it consistently mislabels the competent individuals of a subgroup as unsuitable hires. This can be measured by a fairness notion called Equality of Opportunity (Hardt et al., 2016). Several such fairness notions have been proposed in the literature, as different problems call for different measures. These notions can be divided into two major paradigms, namely (i) Individual Fairness (Dwork et al., 2012; Kusner et al., 2017), where the idea is to treat similar individuals similarly regardless of the sensitive group they belong to, and (ii) Group Fairness (Calders et al., 2009; Hardt et al., 2016; Zafar et al., 2017a; Denis et al., 2021), where the underlying idea is that different sensitive groups should not be disadvantaged compared to an overall reference population. In this paper, we focus on group fairness in the context of classification. The existing approaches for group fairness in Machine Learning may be divided into three main paradigms. First, pre-processing methods aim at modifying a dataset to remove any intrinsic unfairness that may exist in the examples.
The underlying idea is that a model learned on this modified data is more likely to be fair (Dwork et al., 2012; Kamiran & Calders, 2012; Zemel et al., 2013; Feldman et al., 2015; Calmon et al., 2017). Then, post-processing approaches modify the predictions of an accurate but unfair model so that it becomes fair (Kamiran et al., 2010; Hardt et al., 2016; Woodworth et al., 2017; Iosifidis et al., 2019; Chzhen et al., 2019). Finally, in-processing methods aim at learning a model that is fair and accurate in a single step (Calders & Verwer, 2010; Kamishima et al., 2012; Goh et al., 2016; Zafar et al., 2017a;b; Donini et al., 2018; Krasanakis et al., 2018; Agarwal et al., 2018; Wu et al., 2019; Cotter et al., 2019; Iosifidis & Ntoutsi, 2019; Jiang & Nachum, 2020; Lohaus et al., 2020; Roh et al., 2020; Ozdayi et al., 2021). In this paper, we propose a new in-processing approach based on a reweighting scheme that may also be used as a kind of post-processing approach by fine-tuning existing classifiers.

Motivation. In-processing approaches can be further divided into several sub-categories (Caton & Haas, 2020). Common amongst them are methods that relax the fairness constraints under consideration to simplify the learning process (Zafar et al., 2017a; Donini et al., 2018; Wu et al., 2019).

    # The library is available as a part of the supplementary material.
    from fairgrad.torch import CrossEntropyLoss

    # Same as PyTorch's loss with some additional meta data.
    # A fairness rate of 0.01 is a good rule of thumb for standardized data.
    criterion = CrossEntropyLoss(y_train, s_train, fairness_measure, fairness_rate=0.01)

    # The dataloader and model are defined and used in the standard way.
    for x, y, s in data_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y, s)
        loss.backward()
        optimizer.step()

Figure 1: A standard training loop where the PyTorch loss is replaced by FairGrad's loss.
Indeed, standard fairness notions are usually difficult to handle as they are often non-convex and non-differentiable. Unfortunately, these relaxations may be far from the actual fairness measures, leading to sub-optimal models (Lohaus et al., 2020). Similarly, several approaches address the fairness problem by designing specific algorithms and solvers. This is, for example, done by reducing the optimization procedure to a simpler problem (Agarwal et al., 2018), altering the underlying solver (Cotter et al., 2019), or using adversarial learning (Raff & Sylvester, 2018). However, these approaches are often difficult to adapt to existing systems as they may require special training procedures or changes in the model. They are also often limited in the range of problems to which they can be applied (binary classification, two sensitive groups, . . . ). Furthermore, they may come with several hyperparameters that need to be carefully tuned to obtain fair models. The complexity of the existing methods might hinder their deployment in practical settings. Hence, there is a need for simpler methods that are straightforward to integrate into existing training loops. Contributions. In this paper, we present FairGrad, a general purpose approach to enforce fairness for gradient descent based methods. We propose to dynamically update the weights of the examples after each gradient descent update to precisely reflect the fairness level of the models obtained at each iteration and guide the optimization process in a relevant direction. Hence, the underlying idea is to use lower weights for examples from advantaged groups than for those from disadvantaged groups. Our method is inspired by recent reweighting approaches that also propose to change the importance of each group while learning a model (Krasanakis et al., 2018; Iosifidis & Ntoutsi, 2019; Jiang & Nachum, 2020; Roh et al., 2020; Ozdayi et al., 2021). We discuss these works in Appendix A.
A key advantage of FairGrad is that it is straightforward to incorporate into standard gradient based solvers that support example reweighting, like Stochastic Gradient Descent. Hence, we developed a Python library (provided in the supplementary material) where we augmented standard PyTorch losses to accommodate our approach. From a practitioner's point of view, it means that using FairGrad is as simple as replacing their existing loss from PyTorch with our custom loss and passing along some meta data, while the rest of the training loop remains identical. This is illustrated in Figure 1. It is interesting to note that FairGrad only brings one extra hyper-parameter, the fairness rate, besides the usual optimization ones (learning rates, batch size, . . . ). Another advantage of FairGrad is that, unlike the existing reweighting-based approaches which often focus on specific settings, it is compatible with various group fairness notions, including exact and approximate fairness, can handle both multiple sensitive groups and multiclass problems, and can fine-tune existing unfair models. Through extensive experiments, we show that, in addition to its versatility, FairGrad is competitive with several standard baselines in fairness on both standard datasets as well as complex natural language processing and computer vision tasks.

2. PROBLEM SETTING AND NOTATIONS

In the remainder of this paper, we assume that we have access to a feature space $\mathcal{X}$, a finite discrete label space $\mathcal{Y}$, and a set $\mathcal{S}$ of values for the sensitive attribute. We further assume that there exists an unknown distribution $D \in \mathcal{D}_Z$, where $\mathcal{D}_Z$ is the set of all distributions over $Z = \mathcal{X} \times \mathcal{Y} \times \mathcal{S}$, and that we only get to observe a finite dataset $\mathcal{T} = \{(x_i, y_i, s_i)\}_{i=1}^{n}$ of $n$ examples drawn i.i.d. from $D$. Our goal is then to learn an accurate model $h_\theta \in \mathcal{H}$, with learnable parameters $\theta \in \mathbb{R}^D$, such that $h_\theta : \mathcal{X} \to \mathcal{Y}$ is fair with respect to a given fairness definition that depends on the sensitive attribute. In Section 2.1, we formally define the family of fairness measures that are compatible with our approach and provide several popular notions belonging to this family. Finally, for ease of presentation, throughout this paper we slightly abuse the notation $P(E)$ and use it to represent both the true probability of an event $E$ and its estimated probability from a finite sample.

2.1. FAIRNESS DEFINITION

We assume that the data may be partitioned into $K$ disjoint groups denoted $\mathcal{T}_1, \ldots, \mathcal{T}_k, \ldots, \mathcal{T}_K$ such that $\bigcup_{k=1}^{K} \mathcal{T}_k = \mathcal{T}$ and $\bigcap_{k=1}^{K} \mathcal{T}_k = \emptyset$. These groups highly depend on the fairness notion under consideration. They might correspond to the usual sensitive groups, as in Accuracy Parity (see Example 1), or might be subgroups of the usual sensitive groups, as in Equalized Odds (see Example 2 in the appendix). For each group, we assume that we have access to a function $F_k : \mathcal{D}_Z \times \mathcal{H} \to \mathbb{R}$ such that $F_k > 0$ when group $k$ is advantaged and $F_k < 0$ when group $k$ is disadvantaged. Furthermore, we assume that the magnitude of $F_k$ represents the degree to which the group is (dis)advantaged. Finally, we assume that each $F_k$ can be rewritten as follows:

$$F_k(\mathcal{T}, h_\theta) = C^0_k + \sum_{k'=1}^{K} C^{k'}_k \, P(h_\theta(x) \neq y \mid \mathcal{T}_{k'}) \quad (1)$$

where the constants $C$ are group specific and independent of $h_\theta$. The probabilities $P(h_\theta(x) \neq y \mid \mathcal{T}_{k'})$ represent, with a slight abuse of notation, the error rates of $h_\theta$ over each group $\mathcal{T}_{k'}$. Below, we show that Accuracy Parity (Zafar et al., 2017a) respects this definition. In Appendix B, we show that Equality of Opportunity (Hardt et al., 2016), Equalized Odds (Hardt et al., 2016), and Demographic Parity (Calders et al., 2009) also respect this definition.

Example 1 (Accuracy Parity (AP) (Zafar et al., 2017a)). A model $h_\theta$ is fair for Accuracy Parity when the probability of being correct is independent of the sensitive attribute, that is, $\forall r \in \mathcal{S}$,

$$P(h_\theta(x) = y \mid s = r) = P(h_\theta(x) = y).$$

It means that we need to partition the space into $K = |\mathcal{S}|$ groups and, $\forall r \in \mathcal{S}$, we define $F_{(r)}$ as the fairness level of group $(r)$:

$$F_{(r)}(\mathcal{T}, h_\theta) = P(h_\theta(x) \neq y) - P(h_\theta(x) \neq y \mid s = r) = (P(s = r) - 1) P(h_\theta(x) \neq y \mid s = r) + \sum_{(r') \neq (r)} P(s = r') P(h_\theta(x) \neq y \mid s = r')$$

where the law of total probability was used to obtain the last equality. Thus, Accuracy Parity satisfies all our assumptions with $C^{(r)}_{(r)} = P(s = r) - 1$, $C^{(r')}_{(r)} = P(s = r')$ with $r' \neq r$, and $C^0_{(r)} = 0$.
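As a sanity check, the decomposition of Example 1 can be verified numerically. The following sketch uses hypothetical toy predictions and compares the fairness level computed from its definition with the one computed from the constants $C$; the arrays below are illustrative only.

```python
import numpy as np

# Hypothetical toy test set: predictions, labels, and a binary sensitive attribute.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 1])
s      = np.array([0, 0, 0, 0, 1, 1, 1, 1])

groups  = np.unique(s)
p_group = {r: np.mean(s == r) for r in groups}                        # P(s = r)
err     = {r: np.mean(y_pred[s == r] != y_true[s == r]) for r in groups}  # P(h(x) != y | s = r)
err_all = np.mean(y_pred != y_true)                                   # P(h(x) != y)

for r in groups:
    # Direct definition: F_r = P(h(x) != y) - P(h(x) != y | s = r).
    f_direct = err_all - err[r]
    # Via the constants of Example 1: C^r_r = P(s = r) - 1, C^{r'}_r = P(s = r').
    f_const = (p_group[r] - 1) * err[r] \
        + sum(p_group[rp] * err[rp] for rp in groups if rp != r)
    assert np.isclose(f_direct, f_const)
    # Here group 0 is advantaged (F > 0) and group 1 disadvantaged (F < 0).
```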

3. FAIRGRAD

In this section, we present FairGrad, the main contribution of this paper. We begin by discussing FairGrad for exact fairness and then present an extension to handle ε-fairness.

3.1. FAIRGRAD FOR EXACT FAIRNESS

To introduce our method, we first start with the following optimization problem that is standard in fair machine learning (Cotter et al., 2019):

$$\arg\min_{h_\theta \in \mathcal{H}} P(h_\theta(x) \neq y) \quad \text{s.t.} \quad \forall k \in [K], \ F_k(\mathcal{T}, h_\theta) = 0. \quad (2)$$

Then, using Lagrange multipliers, denoted $\lambda_1, \ldots, \lambda_K$, we obtain an unconstrained objective that should be minimized for $h_\theta \in \mathcal{H}$ and maximized for $\lambda_1, \ldots, \lambda_K \in \mathbb{R}$:

$$\mathcal{L}(h_\theta, \lambda_1, \ldots, \lambda_K) = P(h_\theta(x) \neq y) + \sum_{k=1}^{K} \lambda_k F_k(\mathcal{T}, h_\theta). \quad (3)$$

To solve this problem, we propose to use an alternating approach where the hypothesis and the multipliers are updated one after the other.

Updating the Multipliers. To update $\lambda_1, \ldots, \lambda_K$, we use a standard gradient ascent procedure. Given that the gradient of Problem (3) with respect to the multipliers is

$$\nabla_{\lambda_1, \ldots, \lambda_K} \mathcal{L}(h_\theta, \lambda_1, \ldots, \lambda_K) = \left( F_1(\mathcal{T}, h_\theta), \ldots, F_K(\mathcal{T}, h_\theta) \right)^\top,$$

we have the following update rule, $\forall k \in [K]$:

$$\lambda_k^{t+1} = \lambda_k^t + \eta_\lambda F_k(\mathcal{T}, h_\theta^t)$$

where $\eta_\lambda$ is a rate that controls the importance of each update. In the experiments, we use a constant fairness rate of 0.01 as our initial tests showed that it is a good rule of thumb when the data is properly standardized.

Updating the Model. To update the parameters $\theta \in \mathbb{R}^D$ of the model $h_\theta$, we use a standard gradient descent approach. However, first, we notice that, given the fairness notions considered, Equation (3) can be rewritten as

$$\mathcal{L}(h_\theta, \lambda_1, \ldots, \lambda_K) = \sum_{k=1}^{K} P(h_\theta(x) \neq y \mid \mathcal{T}_k) \left( P(\mathcal{T}_k) + \sum_{k'=1}^{K} C^{k}_{k'} \lambda_{k'} \right) + \sum_{k=1}^{K} \lambda_k C^0_k$$

where $\sum_{k=1}^{K} \lambda_k C^0_k$ is independent of $h_\theta$ by definition. Hence, at iteration $t$, the update rule becomes

$$\theta^{t+1} = \theta^t - \eta_\theta \sum_{k=1}^{K} \left( P(\mathcal{T}_k) + \sum_{k'=1}^{K} C^{k}_{k'} \lambda_{k'} \right) \nabla_\theta P(h_\theta(x) \neq y \mid \mathcal{T}_k)$$

where $\eta_\theta$ is the usual learning rate that controls the importance of each parameter update. Here, we obtain our group specific weights, $\forall k$,

$$w_k = P(\mathcal{T}_k) + \sum_{k'=1}^{K} C^{k}_{k'} \lambda_{k'},$$

that depend on the current fairness level of the model through $\lambda_1, \ldots, \lambda_K$, on the relative size of each group through $P(\mathcal{T}_k)$, and on the fairness notion under consideration through the constants $C$. The exact values of these constants are given in Section 2.1 and Appendix B for various group fairness notions. Overall, they are such that, at each iteration, the weights of the advantaged groups are reduced and the weights of the disadvantaged groups are increased.

The main limitation of the above update rule is that one needs to compute the group-wise gradients

$$\nabla_\theta P(h_\theta(x) \neq y \mid \mathcal{T}_k) = \frac{1}{n_k} \sum_{(x, y) \in \mathcal{T}_k} \nabla_\theta \, \mathbb{I}_{\{h_\theta(x) \neq y\}}.$$

Here, $\mathbb{I}_{\{h_\theta(x) \neq y\}}$ is the indicator function, also called the 0-1 loss, that is 1 when $h_\theta(x) \neq y$ and 0 otherwise. Unfortunately, it usually does not provide meaningful optimization directions. To address this issue, we follow the usual trend in machine learning and replace the 0-1 loss with one of its continuous and differentiable surrogates that provides meaningful gradients. For instance, in our experiments, we use the cross entropy loss.
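To make the update rules concrete, here is a minimal numeric sketch of one multiplier and weight update for Accuracy Parity with two groups; the group sizes, error rates, and fairness rate below are illustrative values, not outputs of the actual system.

```python
import numpy as np

# Hypothetical setting: K = 2 groups with relative sizes P(T_k).
p = np.array([0.7, 0.3])            # P(T_k)
eta_lambda = 0.01                   # fairness rate

# Constants of Example 1: C[k, k'] is the coefficient of P(h != y | T_k') in F_k.
K = len(p)
C = np.tile(p, (K, 1))
np.fill_diagonal(C, p - 1.0)        # C^k_k = P(T_k) - 1

lam = np.zeros(K)
err = np.array([0.10, 0.25])        # current per-group error rates (hypothetical)

F = C @ err                         # fairness levels F_k, Eq. (1) with C^0_k = 0
lam = lam + eta_lambda * F          # gradient ascent on the multipliers
w = p + C.T @ lam                   # w_k = P(T_k) + sum_k' C^k_k' lam_k'

# Group 0 has the lower error rate, so it is advantaged and its weight shrinks,
# while the disadvantaged group 1 sees its weight grow.
```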

3.2. COMPUTATIONAL OVERHEAD OF FAIRGRAD.

We summarize our approach in Algorithm 1. We consider batch gradient descent rather than full gradient descent as it is a popular optimization scheme. We empirically investigate the impact of the batch size in Section 4.7. We use italic font to highlight the steps inherent to FairGrad that do not appear in classic batch gradient descent. The main difference is Step 5, that is, the computation of the fairness levels for each group. However, these can be cheaply obtained from the predictions of $h_\theta^{(t)}$ on the current batch, which are always available since they are also needed to compute the gradient. Hence, the computational overhead of FairGrad is very limited.

Algorithm 1: FairGrad
3: Compute the predictions of the current model on the batch B.
4: Compute the group-wise losses using the predictions.
5: Compute the current fairness level using the predictions and update the group-wise weights.
6: Compute the overall weighted loss using the group-wise weights.
7: Compute the gradients based on the loss and update the model.
8: end for
9: return the trained model $h^*_\theta$
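The per-batch steps above can be sketched in plain PyTorch, without the fairgrad library, for Accuracy Parity on synthetic data; the model, data, and hyper-parameters below are all hypothetical.

```python
import torch

# A sketch of Algorithm 1 for Accuracy Parity; synthetic data, linear model.
torch.manual_seed(0)
n, d, K = 512, 5, 2
X = torch.randn(n, d)
y = (X[:, 0] > 0).long()                       # label depends on one feature
s = torch.randint(0, K, (n,))                  # sensitive group ids

p = torch.tensor([(s == k).float().mean().item() for k in range(K)])  # P(T_k)
C = p.repeat(K, 1)
C[torch.arange(K), torch.arange(K)] = p - 1.0  # C^k_k = P(T_k) - 1
lam = torch.zeros(K)
eta_lam = 0.01                                 # fairness rate

model = torch.nn.Linear(d, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X, y, s), batch_size=64)

for epoch in range(5):
    for xb, yb, sb in loader:
        opt.zero_grad()
        logits = model(xb)
        # Steps 3-4: predictions and group-wise (per-example) losses.
        per_ex = torch.nn.functional.cross_entropy(logits, yb, reduction="none")
        preds = logits.argmax(1)
        # Step 5: fairness levels estimated on the batch, then weight update.
        err = torch.tensor([
            (preds[sb == k] != yb[sb == k]).float().mean().item()
            if (sb == k).any() else 0.0 for k in range(K)])
        lam = lam + eta_lam * (C @ err)
        w = p + C.T @ lam                      # weights may become negative
        # Steps 6-7: overall weighted loss, then gradient step on the model.
        loss = (w[sb] * per_ex).mean()
        loss.backward()
        opt.step()
```

Note that the weights are recomputed from the batch-level fairness estimates at every iteration, which is where FairGrad's (small) overhead lies.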

3.3. IMPORTANCE OF NEGATIVE WEIGHTS.

A key property of FairGrad is that we allow the use of negative weights throughout the optimization process, that is, $w_k = P(\mathcal{T}_k) + \sum_{k'=1}^{K} C^{k}_{k'} \lambda_{k'}$ may become negative, while existing methods often restrict themselves to positive weights (Roh et al., 2020; Iosifidis & Ntoutsi, 2019; Jiang & Nachum, 2020). In this section, we show that these negative weights are important as they are sometimes necessary to learn fair models. Hence, in the next lemma, we provide sufficient conditions under which negative weights are mandatory if one wants to enforce Accuracy Parity.

Lemma 1 (Negative weights are necessary.). Assume that the fairness notion under consideration is Accuracy Parity (see Example 1). Let $h^*_\theta$ be the most accurate and fair model. Then using negative weights is necessary as long as

$$\min_{h_\theta \in \mathcal{H}, \ h_\theta \text{ unfair}} \ \max_{\mathcal{T}_k} P(h_\theta(x) \neq y \mid \mathcal{T}_k) < P(h^*_\theta(x) \neq y).$$

Intuitively, if this condition did not force negative weights, we would have a contradiction: since $P(h^*_\theta(x) \neq y) = P(h^*_\theta(x) \neq y \mid \mathcal{T}_k)$ for every group $\mathcal{T}_k$ by definition of Accuracy Parity, an unfair model whose worst-off group error is below the error of $h^*_\theta$ would be preferred by any positively weighted objective. In other words, a dataset where the most accurate model for a given group still disadvantages this group requires negative weights.
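The core argument behind this result (proved in Appendix C) is that weights restricted to the probability simplex can never push the weighted objective above the worst group's error rate. A small numeric illustration, with hypothetical error rates:

```python
import numpy as np

# If weights are nonnegative and sum to one, the inner maximisation over the
# weights of sum_k w_k * err_k is attained at a vertex of the simplex, i.e. it
# equals the worst group's error rate (hypothetical error rates below).
rng = np.random.default_rng(0)
err = np.array([0.10, 0.30, 0.22])      # P(h(x) != y | T_k) for K = 3 groups

# Vertices of the simplex: all weight on a single group.
best = max((err * w).sum() for w in np.eye(3))
assert np.isclose(best, err.max())

# Sampling many nonnegative weightings never beats the vertex solution.
W = rng.dirichlet(np.ones(3), size=10_000)
assert (W @ err <= err.max() + 1e-12).all()
```

Hence, when even the best unfair model's worst-group error is below the fair model's overall error, positive weights alone cannot make the fair model optimal.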

3.4. FAIRGRAD FOR ε-FAIRNESS

In the previous section, we mainly considered exact fairness and we showed that it could be achieved by using a reweighting approach. In fact, we can extend this procedure to the case of ε-fairness, where the fairness constraints are relaxed and a controlled amount of violation is allowed. Usually, ε is a user-defined parameter, but it can also be set by the law, as is the case with the 80% rule in the US. The main difference with the exact fairness case is that each equality constraint in Problem (2) is replaced with two inequalities of the form

$$\forall k \in [K], \ F_k(\mathcal{T}, h_\theta) \leq \varepsilon \quad \text{and} \quad \forall k \in [K], \ F_k(\mathcal{T}, h_\theta) \geq -\varepsilon.$$

The main consequence is that we need to maintain twice as many Lagrange multipliers and that the group-wise weights are slightly different. Since the two procedures are similar, we omit the details here but provide them in Appendix D for the sake of completeness.
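A minimal sketch of the resulting multiplier update, assuming (as is standard for inequality constraints) that the two sets of multipliers are kept nonnegative via projection; the fairness levels and rates below are illustrative values only.

```python
import numpy as np

# Sketch of the eps-fairness multiplier update: one pair of nonnegative
# multipliers per group, for the constraints F_k <= eps and F_k >= -eps.
eps, eta = 0.05, 0.01
K = 2
lam_up = np.zeros(K)               # multipliers for F_k - eps <= 0
lam_low = np.zeros(K)              # multipliers for -F_k - eps <= 0
F = np.array([0.12, -0.12])        # current fairness levels (hypothetical)

# Projected gradient ascent: inequality multipliers stay >= 0.
lam_up = np.maximum(0.0, lam_up + eta * (F - eps))
lam_low = np.maximum(0.0, lam_low + eta * (-F - eps))

# Only violated constraints accumulate a multiplier: group 0 exceeds +eps,
# group 1 exceeds -eps, and the two satisfied constraints stay at zero.
```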

4. EXPERIMENTS

In this section, we present several experiments that demonstrate the competitiveness of FairGrad as a procedure to learn fair models in a classification setting. We begin by presenting results over standard fairness datasets and a Natural Language Processing dataset in Section 4.4. We then study the behaviour of the ε-fairness variant of FairGrad in Section 4.5. Next, we showcase the fine-tuning ability of FairGrad on a Computer Vision dataset in Section 4.6. Finally, we investigate the impact of the batch size on the learned model in Section 4.7.

4.1. DATASETS

In the main paper, we consider 4 different datasets and postpone the results on another 6 datasets to Appendix E as they follow similar trends. We also postpone the detailed descriptions of these datasets as well as the pre-processing steps. On the one hand, we consider commonly used fairness datasets, namely Adult Income (Kohavi, 1996) and CelebA (Liu et al., 2015). Both are binary classification datasets with binary sensitive attributes (gender). We also consider a variant of the Adult Income dataset where we add a second binary sensitive attribute (race) to obtain a dataset with 4 disjoint sensitive groups. On the other hand, to showcase the wide applicability of FairGrad, we consider the Twitter Sentiment dataset, a Natural Language Processing task.

4.2. METRICS

To assess the fairness of the learned models, we use the average, across all groups, of the absolute fairness levels over the test set, that is $\frac{1}{K} \sum_{k=1}^{K} |F_k(\mathcal{T}, h_\theta)|$ (lower is better). To assess the utility of the learned models, we use their accuracy levels over the test set, that is $\frac{1}{n} \sum_{i=1}^{n} \mathbb{I}_{\{h_\theta(x_i) = y_i\}}$ (higher is better). All the results reported are averaged over 5 independent runs and standard deviations are provided. Note that, in the main paper, we graphically report a subset of the results over the aforementioned datasets. We provide detailed results in Appendix E, including the missing figures as well as complete tables with accuracy levels, fairness levels, and the fairness levels of the most well-off and worst-off groups for all the relevant methods.
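The two reported quantities can be sketched as follows for Accuracy Parity, assuming toy test-set predictions (the arrays below are illustrative only):

```python
import numpy as np

# Hypothetical test predictions, labels, and binary sensitive attribute.
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 1])
s      = np.array([0, 0, 0, 1, 1, 1])

accuracy = np.mean(y_pred == y_true)        # utility (higher is better)

# Fairness level per group for Accuracy Parity, averaged in absolute value.
err_all = np.mean(y_pred != y_true)
F = np.array([err_all - np.mean(y_pred[s == r] != y_true[s == r])
              for r in np.unique(s)])
avg_abs_fairness = np.abs(F).mean()         # lower is better
```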

4.3. METHODS

We compare FairGrad to 6 different baselines, namely (i) Unconstrained, which is oblivious to any fairness measure and trained using a standard batch gradient descent method, (ii) an Adversarial mechanism (Goodfellow et al., 2014) using a gradient reversal layer (Ganin & Lempitsky, 2015), similar to GRAD-Pred (Raff & Sylvester, 2018), where an adversary, with the objective of predicting the sensitive attribute, is added to the unconstrained model, (iii) BiFair (Ozdayi et al., 2021), (iv) FairBatch (Roh et al., 2020), (v) Constraints (Cotter et al., 2019), a non-convex constrained optimization method, and (vi) Weighted ERM, where each example is reweighted based on the size of the sensitive group it belongs to. In all our experiments, we consider two different hypothesis classes. On the one hand, we use linear models implemented in the form of neural networks with no hidden layers. On the other hand, we use a more complex, non-linear architecture with three hidden layers of respective sizes 128, 64, and 32. We use ReLU as our activation function with batch normalization and dropout set to 0.2. In both cases, we optimize the cross-entropy loss. We provide the exact setup and hyper-parameter tuning details for all the methods in Appendix E.1. In several experiments, we only consider subsets of the baselines due to the limitations of the methods. For instance, BiFair was designed to handle binary labels and binary sensitive attributes and thus is not considered for the datasets with more than two sensitive groups or more than two labels. Furthermore, we implemented it using the authors' code that is freely available online but does not include AP as a fairness measure; thus, we do not report results related to this measure for BiFair. Similarly, we also implemented FairBatch from the authors' code, which does not support AP as a fairness measure; thus, we also exclude it from the comparison for this measure. For Constraints, we based our implementation on the authors' publicly available library but were only able to reliably handle linear models, and thus we do not consider this baseline for non-linear models. Finally, for Adversarial, we used our custom-made implementation. However, it is only applicable when learning non-linear models since it requires at least one hidden layer to propagate its reversed gradient.

4.4. RESULTS FOR EXACT FAIRNESS

We report the results over the Adult Income dataset using a linear model, the Adult Income dataset with multiple groups using a non-linear model, and the Twitter Sentiment dataset using both linear and non-linear models in Figures 2, 3, and 4 respectively. In these figures, the best methods are closest to the bottom right corner. If a method is closer to the bottom left corner, it has good fairness but reduced accuracy. Similarly, a method closer to the top right corner has good accuracy but poor fairness, that is, it is close to the unconstrained model. The main take-away from these experiments is that there is no fairness enforcing method that is consistently better than the others. All of them have strengths, that is, datasets and fairness measures where they obtain good results, and weaknesses, that is, datasets and fairness measures for which they are sub-optimal. For instance, FairGrad achieves better fairness levels for EOdds and EOpp over the Adult dataset with a linear model. However, it pays a price in terms of accuracy in those settings. Similarly, FairBatch induces better accuracy than the other approaches over Adult with a linear model and EOdds and only pays a small price in terms of fairness. However, it is significantly worse in terms of fairness over the Adult Multigroup dataset with a non-linear model. Finally, BiFair is sub-optimal on Adult with EOpp, while being comparable to the other approaches on the Twitter Sentiment dataset. We observed similar trends on the other datasets, available in Appendix E.3, with different methods coming out on top for different datasets and fairness measures.

4.5. RESULTS FOR ε-FAIRNESS

In this second set of experiments, we demonstrate the capability of FairGrad to support approximate fairness (see Section 3.4). In Figure 5, we show the performances, as accuracy-fairness pairs, of several models learned on the CelebA dataset by varying the fairness level parameter ε. These results suggest that FairGrad respects the constraints well.
Indeed, the average absolute fairness level (across all the groups, see Section 4.2) achieved by FairGrad is either the same or less than the given threshold. It is worth mentioning that FairGrad is designed to enforce -fairness for each constraint individually which is slightly different from the summarized quantity displayed here. Finally, as the fairness constraint is relaxed, the accuracy of the model increases, reaching the same performance as the Unconstrained classifier when the fairness level of the latter is below .

4.6. FAIRGRAD AS A FINE-TUNING PROCEDURE

While FairGrad has primarily been designed to learn fair classifiers from scratch, it can also be used to fine-tune an existing classifier to achieve better fairness. To showcase this possibility, we fine-tune the ResNet18 (He et al., 2016) model, developed for image recognition, over the UTKFace dataset (Zhang et al., 2017), consisting of human face images tagged with Gender, Age, and Race information. Following the same process as Roh et al. (2020), we use Race as the sensitive attribute and consider two scenarios; in the first, the task is to predict the gender (binary) with Demographic Parity as the fairness notion. In both settings, FairGrad is able to learn models that are more fair than an Unconstrained fine-tuning procedure, albeit at the expense of accuracy.

4.7. IMPACT OF THE BATCH-SIZE

In this last set of experiments, we evaluate the impact of the batch size on the fairness and accuracy levels of the learned model. Indeed, at each iteration, in order to minimize the overhead associated with FairGrad (see Section 3.1), we update the weights using the fairness level of the model estimated solely on the current batch. When these batches are small, these estimates are unreliable and might lead the model astray. This can be observed in Table 2, where we present the performances of several linear models learned with different batch sizes on the CelebA dataset. On the one hand, for very small batch sizes, the learned models tend to have slightly lower accuracy and larger standard deviation in fairness levels. On the other hand, with a sufficiently large batch size, in this case 64 and above, the learned models are close to perfectly fair. Furthermore, they obtain reasonable levels of accuracy, since the Unconstrained model has an accuracy of 0.8532 for this problem.
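The unreliability of small-batch estimates can be illustrated with a simple binomial model of the per-batch group error-rate estimate; the true error rate and group fraction below are hypothetical.

```python
import numpy as np

# The batch-level estimate of a group's error rate is a binomial proportion,
# so its standard deviation scales like sqrt(p(1-p)/m) where m is the number
# of examples from that group in the batch (hypothetical true error rate).
rng = np.random.default_rng(0)
true_err = 0.2

def batch_estimate_std(batch_size, group_frac=0.5, trials=2000):
    m = max(1, int(batch_size * group_frac))   # expected group count per batch
    samples = rng.binomial(m, true_err, size=trials) / m
    return samples.std()

small, large = batch_estimate_std(8), batch_estimate_std(512)
# Smaller batches give much noisier fairness-level estimates.
```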

5. CONCLUSION

In this paper, we proposed FairGrad, a fairness aware gradient descent approach based on a reweighting scheme. We showed that it can be used to learn fair models for various group fairness definitions and is able to handle multiclass problems as well as settings where there are multiple sensitive groups. We empirically showed the competitiveness of our approach against several baselines on standard fairness datasets and on a Natural Language Processing task. We also showed that it can be used to fine-tune an existing model on a Computer Vision task. Finally, since it is based on gradient descent and has a small overhead, we believe that FairGrad could be used for a wide range of applications, even beyond classification.

Limitations and Societal Impact. While appealing, FairGrad also has limitations. It implicitly assumes that a set of weights that would lead to a fair model exists, but this might be difficult to verify in practice. Thus, even if in our experiments FairGrad seems to behave quite well, a practitioner using this approach should not trust it blindly. It remains important to always check the actual fairness level of the learned model. On the other hand, we believe that, due to its simplicity and its versatility, FairGrad could be easily deployed in various practical contexts and, thus, could contribute to the dissemination of fair models.

In this appendix, we provide several details that were omitted in the main paper. First, in Section A, we review several works related to ours. Then, in Section B, we show that several well known group fairness measures are compatible with FairGrad. In Section C, we prove Lemma 1. Next, in Section D, we derive the update rules for FairGrad with ε-fairness. Finally, in Section E, we provide additional experiments.

A RELATED WORK

The fairness literature is extensive and we refer the interested reader to recent surveys (Caton & Haas, 2020; Mehrabi et al., 2021) to get an overview of the subject. Here, we focus on recent works that are more closely related to our approach. BiFair (Ozdayi et al., 2021). This paper proposes a bilevel optimization scheme for fairness. The idea is to use an outer optimization scheme that learns weights for each example so that the trade-off between fairness and accuracy is as favorable as possible, while an inner optimization scheme learns a model that is as accurate as possible. One of the limitations of this approach is that it does not directly optimize the fairness level of the model but rather a relaxation that does not provide any guarantees on the quality of the learned predictor. Furthermore, it is limited to binary classification with a binary sensitive attribute. In this paper, we also learn weights for the examples in an iterative way. However, we use a different update rule. Furthermore, we focus on proper fairness definitions rather than relaxations, and our objective is to learn accurate models with given levels of fairness rather than a trade-off between the two. Finally, our approach is not limited to the binary setting. FairBatch (Roh et al., 2020). This paper proposes a batch gradient descent approach that can be used to learn fair models. More precisely, the idea is to draw the batch examples from a skewed distribution that favors the disadvantaged groups by oversampling them. In this paper, we propose to use a reweighting approach, which could also be interpreted as altering the distribution of the examples based on their fairness level if all the weights were positive. However, we allow the use of negative weights, and we prove that they are sometimes necessary to achieve fairness. Furthermore, we use a different update rule for the weights. AdaFair (Iosifidis & Ntoutsi, 2019). This paper proposes a boosting based framework to learn fair models.
The underlying idea is to modify the weights of the examples depending on both the performance of the current strong classifier and the group memberships. Hence, examples that belong to the disadvantaged group and are incorrectly classified receive higher weights than the examples that belong to the advantaged group and are correctly classified. In this paper, we use a similar high level idea but we use different weights that do not depend on the performance of the model. Furthermore, rather than a boosting based approach, we consider problems that can be solved using gradient descent. Finally, while AdaFair only focuses on Equalized Odds, we show that our approach works with several fairness notions. Identifying and Correcting Label Bias in Machine Learning (Jiang & Nachum, 2020). This paper considers the fairness problem from an original point of view as it assumes that the observed labels are biased compared to the true labels. The goal is then to learn a model with respect to the true labels using only the observed labels. To this end, the paper proposes to use an iterative reweighting procedure where positive weights for the examples and updated models are learned alternately. In this paper, we also propose a reweighting approach. However, we use different weights that are not necessarily positive. Furthermore, our approach is not limited to binary labels and can handle multiclass problems.

B REFORMULATION OF VARIOUS GROUP FAIRNESS NOTIONS

In this section, we present several group fairness notions which respect our fairness definition presented in Section 2.1.

Example 2 (Equalized Odds (EOdds) (Hardt et al., 2016)). A model $h_\theta$ is fair for Equalized Odds when the probability of predicting the correct label is independent of the sensitive attribute, that is, $\forall l \in \mathcal{Y}, \forall r \in \mathcal{S}$,

$$P(h_\theta(x) = l \mid s = r, y = l) = P(h_\theta(x) = l \mid y = l).$$

It means that we need to partition the space into $K = |\mathcal{Y} \times \mathcal{S}|$ groups and, $\forall l \in \mathcal{Y}, \forall r \in \mathcal{S}$, we define $F_{(l,r)}$ as

$$F_{(l,r)}(\mathcal{T}, h_\theta) = P(h_\theta(x) \neq l \mid y = l) - P(h_\theta(x) \neq l \mid s = r, y = l) = \sum_{(l,r') \neq (l,r)} P(s = r' \mid y = l) P(h_\theta(x) \neq l \mid s = r', y = l) - (1 - P(s = r \mid y = l)) P(h_\theta(x) \neq l \mid s = r, y = l)$$

where the law of total probability was used to obtain the last equation. Thus, Equalized Odds satisfies all our assumptions with $C^{(l,r)}_{(l,r)} = P(s = r \mid y = l) - 1$, $C^{(l,r')}_{(l,r)} = P(s = r' \mid y = l)$ with $r' \neq r$, $C^{(l',r')}_{(l,r)} = 0$ with $l' \neq l$, and $C^0_{(l,r)} = 0$.

Example 3 (Equality of Opportunity (EOpp) (Hardt et al., 2016)). A model $h_\theta$ is fair for Equality of Opportunity when the probability of predicting the correct label is independent of the sensitive attribute for a given subset $\mathcal{Y}' \subset \mathcal{Y}$ of labels called the desirable outcomes, that is, $\forall l \in \mathcal{Y}', \forall r \in \mathcal{S}$,

$$P(h_\theta(x) = l \mid s = r, y = l) = P(h_\theta(x) = l \mid y = l).$$

It means that we need to partition the space into $K = |\mathcal{Y} \times \mathcal{S}|$ groups and, $\forall l \in \mathcal{Y}, \forall r \in \mathcal{S}$, we define $F_{(l,r)}$ as

$$F_{(l,r)}(\mathcal{T}, h_\theta) = \begin{cases} P(h_\theta(x) \neq l \mid y = l) - P(h_\theta(x) \neq l \mid s = r, y = l) & \forall (l, r) \in \mathcal{Y}' \times \mathcal{S} \\ 0 & \forall (l, r) \in \mathcal{Y} \times \mathcal{S} \setminus \mathcal{Y}' \times \mathcal{S} \end{cases}$$

which can then be rewritten in the correct form in the same way as Equalized Odds, the only difference being that $C^{\bullet}_{(l,r)} = 0, \forall (l, r) \in \mathcal{Y} \times \mathcal{S} \setminus \mathcal{Y}' \times \mathcal{S}$.

Example 4 (Demographic Parity (DP) (Calders et al., 2009)). A model $h_\theta$ is fair for Demographic Parity when the probability of predicting a binary label is independent of the sensitive attribute, that is, $\forall l \in \mathcal{Y}, \forall r \in \mathcal{S}$,

$$P(h_\theta(x) = l \mid s = r) = P(h_\theta(x) = l).$$
It means that we need to partition the space into $K = |\mathcal{Y} \times \mathcal{S}|$ groups and, $\forall l \in \mathcal{Y}, \forall r \in \mathcal{S}$, define $F_{(l,r)}$ as
\begin{align*}
F_{(l,r)}(\mathcal{T}, h_\theta) ={}& P(h_\theta(x) = l) - P(h_\theta(x) = l \mid s = r) \\
={}& \left(P(y = l, s = r) - P(y = l \mid s = r)\right) P(h_\theta(x) = y \mid s = r, y = l) \\
& + \sum_{(l,r') \neq (l,r)} P(y = l, s = r')\, P(h_\theta(x) = y \mid s = r', y = l) \\
& + \left(P(y = \bar{l} \mid s = r) - P(y = \bar{l}, s = r)\right) P(h_\theta(x) = y \mid s = r, y = \bar{l}) \\
& - \sum_{(\bar{l},r') \neq (\bar{l},r)} P(y = \bar{l}, s = r')\, P(h_\theta(x) = y \mid s = r', y = \bar{l}) \\
& + P(y = \bar{l}) - P(y = \bar{l} \mid s = r),
\end{align*}
where $\bar{l}$ denotes the binary label different from $l$ and the law of total probability was used to obtain the last equation. Thus, Demographic Parity satisfies all our assumptions with $C^{(l,r)}_{(l,r)} = P(y = l, s = r) - P(y = l \mid s = r)$, $C^{(l,r')}_{(l,r)} = P(y = l, s = r')$ for $r' \neq r$, $C^{(\bar{l},r)}_{(l,r)} = P(y = \bar{l} \mid s = r) - P(y = \bar{l}, s = r)$, $C^{(\bar{l},r')}_{(l,r)} = -P(y = \bar{l}, s = r')$ for $r' \neq r$, and $C^0_{(l,r)} = P(y = \bar{l}) - P(y = \bar{l} \mid s = r)$.

C PROOF OF LEMMA 1

Lemma 2 (Negative weights are necessary.). Assume that the fairness notion under consideration is Accuracy Parity. Let $h^*_\theta$ be the most accurate and fair model. Then using negative weights is necessary as long as
$$\min_{h_\theta \in \mathcal{H},\, h_\theta \text{ unfair}}\ \max_{T_k} P(h_\theta(x) = y \mid T_k) < P(h^*_\theta(x) = y).$$

Proof. To prove this lemma, one first needs to notice that, for Accuracy Parity, since $\sum_{k=1}^K P(T_k) = 1$, we have that
$$\sum_{k'=1}^K C^{k'}_k = \left(P(T_k) - 1\right) + \sum_{k' \neq k} P(T_{k'}) = 0.$$
This implies that $\sum_{k=1}^K \left( P(T_k) + \sum_{k'=1}^K C^k_{k'} \lambda_{k'} \right) = 1$, that is, whatever our choice of $\lambda$, the weights will always sum to one. In other words, since we also have that $\sum_{k=1}^K \lambda_k C^0_k = 0$ by definition, for a given hypothesis $h_\theta$, we have that
\begin{align}
& \max_{\lambda_1, \ldots, \lambda_K \in \mathbb{R}}\ \sum_{k=1}^K P(h_\theta(x) = y \mid T_k) \left( P(T_k) + \sum_{k'=1}^K C^k_{k'} \lambda_{k'} \right) \tag{5} \\
={}& \max_{w_1, \ldots, w_K \in \mathbb{R} \text{ s.t. } \sum_k w_k = 1}\ \sum_{k=1}^K P(h_\theta(x) = y \mid T_k)\, w_k \tag{6}
\end{align}
where, given $w_1, \ldots, w_K$, the original values of $\lambda$ can be obtained by solving the linear system $C\lambda = w$ with
$$C = \begin{pmatrix} C^1_1 & \cdots & C^1_K \\ \vdots & \ddots & \vdots \\ C^K_1 & \cdots & C^K_K \end{pmatrix}, \quad \lambda = \begin{pmatrix} \lambda_1 \\ \vdots \\ \lambda_K \end{pmatrix}, \quad w = \begin{pmatrix} w_1 - P(T_1) \\ \vdots \\ w_K - P(T_K) \end{pmatrix},$$
which is guaranteed to have infinitely many solutions since the rank of the matrix $C$ is $K - 1$ and the rank of the augmented matrix $(C \mid w)$ is also $K - 1$. Here we are using the fact that $P(T_k) \neq 0, \forall k$, since all the groups have to be represented to be taken into account. We will now assume that all the weights are positive, that is, $w_k \geq 0, \forall k$. Then, the best strategy to solve Problem (6) is to put all the weight on the worst-off group $\tilde{k}$, that is, to set $w_{\tilde{k}} = 1$ and $w_k = 0, \forall k \neq \tilde{k}$. It implies that
$$\max_{w_1, \ldots, w_K \in \mathbb{R} \text{ s.t. } \sum_k w_k = 1}\ \sum_{k=1}^K P(h_\theta(x) = y \mid T_k)\, w_k = \max_k P(h_\theta(x) = y \mid T_k).$$
Furthermore, notice that, for models that are fair with respect to Accuracy Parity, we have that $P(h_\theta(x) = y \mid T_k) = P(h_\theta(x) = y), \forall k$. Thus, if it holds that
$$\min_{h_\theta \in \mathcal{H},\, h_\theta \text{ unfair}}\ \max_{T_k} P(h_\theta(x) = y \mid T_k) < P(h^*_\theta(x) = y),$$
where $h^*_\theta$ is the most accurate and fair model, then the optimal solution of Problem (3) in the main paper will be unfair. It implies that, in this case, using positive weights is not sufficient and negative weights are necessary.
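For concreteness, the group-wise fairness levels used throughout these reformulations can be estimated empirically from predictions. The following sketch (an illustrative helper of our own, not part of the paper's codebase) computes the empirical Equalized Odds gaps $F_{(l,r)}$ of Example 2:

```python
import numpy as np

def eodds_gaps(y_true, y_pred, s):
    """Empirical Equalized Odds gaps F_{(l, r)}:
    P(h(x) = l | y = l) - P(h(x) = l | s = r, y = l)."""
    gaps = {}
    for l in np.unique(y_true):
        mask_l = y_true == l
        # Overall true-label rate P(h(x) = l | y = l).
        overall = np.mean(y_pred[mask_l] == l)
        for r in np.unique(s):
            mask_lr = mask_l & (s == r)
            # Group-conditional rate P(h(x) = l | s = r, y = l).
            group = np.mean(y_pred[mask_lr] == l)
            gaps[(l, r)] = overall - group
    return gaps
```

A model is fair for Equalized Odds exactly when all these gaps are zero, which is what FairGrad's multipliers push towards.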

D FAIRGRAD FOR ε-FAIRNESS

To derive FairGrad for ε-fairness, we first consider the following standard optimization problem:
$$\arg\min_{h_\theta \in \mathcal{H}} P(h_\theta(x) \neq y) \quad \text{s.t.} \quad \forall k \in [K],\ F_k(\mathcal{T}, h_\theta) \leq \varepsilon \quad \text{and} \quad \forall k \in [K],\ F_k(\mathcal{T}, h_\theta) \geq -\varepsilon.$$
We, once again, use a standard multipliers approach to obtain the following unconstrained formulation:
$$\mathcal{L}(h_\theta, \lambda_1, \ldots, \lambda_K, \delta_1, \ldots, \delta_K) = P(h_\theta(x) \neq y) + \sum_{k=1}^K \Big( \lambda_k \left( F_k(\mathcal{T}, h_\theta) - \varepsilon \right) - \delta_k \left( F_k(\mathcal{T}, h_\theta) + \varepsilon \right) \Big) \tag{7}$$
where $\lambda_1, \ldots, \lambda_K$ and $\delta_1, \ldots, \delta_K$ are the multipliers that belong to $\mathbb{R}_+$, the set of positive reals. Once again, to solve this problem, we use an alternating approach where the hypothesis and the multipliers are updated one after the other.

Updating the Multipliers. To update the values $\lambda_1, \ldots, \lambda_K$ and $\delta_1, \ldots, \delta_K$, we use a standard gradient ascent procedure. Noting that the gradients of the previous formulation are
$$\nabla_{\lambda_1, \ldots, \lambda_K} \mathcal{L} = \begin{pmatrix} F_1(\mathcal{T}, h_\theta) - \varepsilon \\ \vdots \\ F_K(\mathcal{T}, h_\theta) - \varepsilon \end{pmatrix}, \quad \nabla_{\delta_1, \ldots, \delta_K} \mathcal{L} = \begin{pmatrix} -F_1(\mathcal{T}, h_\theta) - \varepsilon \\ \vdots \\ -F_K(\mathcal{T}, h_\theta) - \varepsilon \end{pmatrix},$$
we have the following update rules, $\forall k \in [K]$:
$$\lambda^{t+1}_k = \max\left(0,\ \lambda^t_k + \eta \left( F_k(\mathcal{T}, h^t_\theta) - \varepsilon \right)\right), \quad \delta^{t+1}_k = \max\left(0,\ \delta^t_k - \eta \left( F_k(\mathcal{T}, h^t_\theta) + \varepsilon \right)\right),$$
where $\eta$ is a learning rate that controls the importance of each weight update.

Updating the Model. To update the parameters $\theta \in \mathbb{R}^D$ of the model $h_\theta$, we proceed as before, using a gradient descent approach. However, first, we notice that, given the fairness notions that we consider, Equation (7) is equivalent to
$$\mathcal{L}(h_\theta, \lambda_1, \ldots, \lambda_K, \delta_1, \ldots, \delta_K) = \sum_{k=1}^K P(h_\theta(x) \neq y \mid T_k) \left( P(T_k) + \sum_{k'=1}^K C^k_{k'} (\lambda_{k'} - \delta_{k'}) \right) - \varepsilon \sum_{k=1}^K (\lambda_k + \delta_k) + \sum_{k=1}^K (\lambda_k - \delta_k) C^0_k. \tag{8}$$
Since the additional terms in the optimization problem do not depend on $h_\theta$, the main difference between exact and ε-fairness is the nature of the weights.
More precisely, at iteration $t$, the update rule becomes
$$\theta^{t+1} = \theta^t - \eta_\theta \sum_{k=1}^K \left( P(T_k) + \sum_{k'=1}^K C^k_{k'} (\lambda_{k'} - \delta_{k'}) \right) \nabla_\theta P(h_\theta(x) \neq y \mid T_k),$$
where $\eta_\theta$ is a learning rate. Once again, we obtain a simple reweighting scheme where the weights depend on the current fairness level of the model through $\lambda_1, \ldots, \lambda_K$ and $\delta_1, \ldots, \delta_K$, the relative size of each group through $P(T_k)$, and the fairness notion through the constants $C$.
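As an illustration, one round of the multiplier and weight updates above can be sketched as follows. The helper and its names are ours, under the assumption that the constants are stored as a matrix C with C[k, k'] holding $C^k_{k'}$:

```python
import numpy as np

def fairgrad_eps_update(P_T, C, lam, delta, F, eps, eta):
    """One round of the epsilon-fairness updates.
    P_T:   array of group probabilities P(T_k)
    C:     (K, K) matrix with C[k, k'] = C^k_{k'}
    F:     current fairness levels F_k(T, h_theta)
    lam, delta: multipliers, projected onto the positive reals."""
    # Gradient ascent on the multipliers, clipped at zero.
    lam = np.maximum(0.0, lam + eta * (F - eps))
    delta = np.maximum(0.0, delta - eta * (F + eps))
    # Group-specific weights: P(T_k) + sum_k' C^k_{k'} (lam_k' - delta_k').
    weights = P_T + C @ (lam - delta)
    return lam, delta, weights
```

For Accuracy Parity, the constants make each row of C sum to zero, so the weights returned by this update always sum to one, in line with the lemma of Appendix C.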

E EXTENDED EXPERIMENTS

In this section, we provide additional details related to the baselines and the hyper-parameter tuning procedure. We then describe the datasets and finally present the results.

E.1 BASELINES

• Adversarial: One of the common ways of removing sensitive information from a model's representation is adversarial learning. Broadly, it consists of three components, namely an encoder, a task classifier, and an adversary. On the one hand, the objective of the adversary is to predict the sensitive information from the encoder's representation. On the other hand, the encoder aims to create representations that are useful for the downstream task (the task classifier) while, at the same time, fooling the adversary. The adversary is generally connected to the encoder via a gradient reversal layer (Ganin & Lempitsky, 2015), which acts as an identity function during the forward pass and scales the gradient by -λ during the backward pass. In our setting, the encoder is a Multi-Layer Perceptron with two hidden layers of sizes 64 and 128 respectively, and the task classifier is another Multi-Layer Perceptron with a single hidden layer of size 32. The adversary has the same architecture as the task classifier. We use ReLU as the activation function with dropout set to 0.2 and employ batch normalization with default PyTorch parameters. As part of the hyper-parameter tuning, we did a grid search over λ, varying it from 0.1 to 3.0 with a step of 0.2.
• BiFair (Ozdayi et al., 2021): For this baseline, we fix the weight parameter to be of length 8 as suggested in the code released by the authors. In this fixed setting, we perform a grid search over the following hyper-parameters:
  - Batch Size: 128, 256, 512
  - Weight Decay: 0.0, 0.001
  - Fairness Loss Weight: 0.5, 1, 2, 4
  - Inner Loop Length: 5, 25, 50
• Constraints: We use the implementation available in the TensorFlow Constrained Optimization library with default hyper-parameters.
• FairBatch: We use the implementation publicly released by the authors.
• Weighted ERM: We reweigh each example in the dataset based on the inverse of the proportion of the sensitive group it belongs to.
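The gradient reversal layer used by the Adversarial baseline can be sketched in a few lines of PyTorch. This is a minimal illustration of the mechanism described above, not the exact implementation used in our experiments:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; scales gradients by -lambda on backward."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse and scale the gradient flowing back to the encoder.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)
```

Placed between the encoder and the adversary, this layer lets a single backward pass train the adversary to predict the sensitive attribute while pushing the encoder in the opposite direction.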
In our initial experiments, we varied the batch size and learning rate for both Constraints and FairBatch. However, we found that the default hyper-parameters specified by the authors resulted in the best performance. In the spirit of a comparable hyper-parameter search budget, we also fix all hyper-parameters of FairGrad apart from the batch size and weight decay. We experiment with two batch sizes, namely 64 and 512, for the standard fairness datasets. Similarly, we experiment with three weight decay values, namely 0.0, 0.001, and 0.01. Note that we also vary weight decay and batch size for the FairBatch, Adversarial, Unconstrained, and BiFair approaches. For all our experiments, apart from BiFair, we use batch gradient descent as the optimizer with a learning rate of 0.1 and gradient clipping at 0.05 to avoid exploding gradients. For BiFair, we employ the Adam optimizer, as suggested by the authors, with a learning rate of 0.001.

Hyper-parameter selection procedure. As mentioned above, all our baselines come with a number of hyper-parameters (learning rate, batch size, weight decay, . . . ) and selecting the best combination is often key to avoiding undesirable behaviours such as over-fitting. In this paper, we proceed as follows. First, for each method, we consider all the X possible hyper-parameter combinations and run the training procedure for 50 epochs for each combination. Then, we retain all the models returned by the last 5 epochs, that is, for a given method, we have 5X models, and the goal is to select the best one among them. Since we have access to two measures of performance, we can select either the most accurate model, the fairest model, or a trade-off between the two, depending on the goal of the practitioner. In this paper, we chose the third option and select the model with the lowest fairness score within a certain accuracy interval.
More specifically, let α* be the highest validation accuracy among the 5X models; we then choose the model with the lowest validation fairness score among all models with a validation accuracy in the interval [α* - 0.03, α*]. For FairGrad, FairBatch, and Unconstrained, we considered 6 hyper-parameter combinations. For BiFair, we considered 72 such combinations, while for Adversarial, there were 90.
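The selection rule above can be sketched as follows (an illustrative helper of our own; the model triples and the 0.03 slack follow the description above):

```python
def select_model(models, slack=0.03):
    """models: list of (val_accuracy, val_fairness_score, model) triples.
    Keep models within `slack` of the best validation accuracy, then
    return the triple with the lowest validation fairness score."""
    best_acc = max(acc for acc, _, _ in models)
    eligible = [m for m in models if m[0] >= best_acc - slack]
    return min(eligible, key=lambda m: m[1])
```

The slack trades a small amount of accuracy for fairness; setting it to zero recovers plain accuracy-based selection.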

E.2 DATASETS

Here, we provide additional details on the datasets used in our experiments. We begin by describing the standard fairness datasets, for which we follow the pre-processing procedure described in Lohaus et al. (2020).
• Adult: The dataset (Kohavi, 1996) is composed of 45,222 instances, with 14 features each describing several attributes of a person. The objective is to predict the income of a person (below or above 50k) while remaining fair with respect to gender (binary in this case). Following the pre-processing step of Wu et al. (2019), only 9 features were used for training.
• CelebA: The dataset (Liu et al., 2015) consists of 202,599 images, along with 40 binary attributes associated with each image. We use 38 of these as features while keeping gender as the sensitive attribute and "Smiling" as the class label.
• Dutch: The dataset (Žliobaite et al., 2011) is composed of 60,420 instances, each described by 12 features. The main classification task is to predict "Low Income" or "High Income", as dictated by the occupation, with gender as the sensitive attribute.
• Compas: The dataset (Larson et al., 2016) contains 6,172 data points, each with 53 features. The goal is to predict whether the defendant will be arrested again within two years of the decision. The sensitive attribute is race, which has been merged into "White" and "Non-White" categories.
• Communities and Crime: The dataset (Redmond & Baveja, 2002) is composed of 1,994 instances with 128 features, of which 29 have been dropped. The objective is to predict the number of violent crimes in the community, with race being the sensitive attribute.
• German Credit: The dataset (Dua et al., 2017) consists of 1,000 instances, each with 20 attributes. The objective is to predict a person's creditworthiness (binary), with gender being the sensitive attribute.
• Gaussian: A toy dataset with a binary task label and a binary sensitive attribute, introduced in Lohaus et al. (2020). It is constructed by drawing points from different Gaussian distributions. We follow the same mechanism as described in Lohaus et al. (2020) and sample 50,000 data points for each class.
• Adult Folktables: This dataset (Ding et al., 2021) is an updated version of the original Adult Income dataset. We use California census data with gender as the sensitive attribute. There are 195,665 instances, with 9 features describing several attributes of a person. We use the same pre-processing steps as recommended by the authors.
For all the datasets, we use 20% of the data as a test set and 80% as a train set. We further divide the train set into two and keep 25% of the training examples as a validation set. For each repetition, we randomly shuffle the data before splitting it, so that each random seed yields a unique split. As a last pre-processing step, we centered and scaled each feature independently by subtracting the mean and dividing by the standard deviation, both of which were estimated on the training set.
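The final centering and scaling step can be sketched as follows, using statistics estimated on the training set only (an illustrative helper; the zero-variance guard is our addition):

```python
import numpy as np

def standardize(X_train, X_test):
    """Center and scale each feature using training-set statistics only,
    so that no information leaks from the test set."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0.0] = 1.0  # guard against constant features
    return (X_train - mean) / std, (X_test - mean) / std
```

Estimating the mean and standard deviation on the training set and reusing them on the test set is what keeps the evaluation honest.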



It is worth noting that, here, we do not have formal duality guarantees and that the problem is not even guaranteed to have a fair solution. Nevertheless, the approach seems to work well in practice, as can be seen in the experiments.
Dataset sources: Twitter AAE: http://slanglab.cs.umass.edu/TwitterAAE/; UTKFace: https://susanqq.github.io/UTKFace/; Adult: https://archive.ics.uci.edu/ml/datasets/adult; CelebA: https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html; Dutch: https://sites.google.com/site/conditionaldiscrimination/; Compas: https://github.com/propublica/compas-analysis; Communities and Crime: http://archive.ics.uci.edu/ml/datasets/communities+and+crime; German Credit: https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29; Gaussian: https://github.com/mlohaus/SearchFair/blob/master/examples/get synthetic data.py; Adult Folktables: https://github.com/zykls/folktables



Figure 4: Results for the Twitter Sentiment dataset for Linear and Non Linear Models.

Figure 6: Results for the Adult dataset with different fairness measures.

Figure 7: Results for the CelebA dataset with different fairness measures.

Figure 8: Results for the Crime dataset with different fairness measures.

Figure 10: Results for the Compas dataset with different fairness measures.

Figure 11: Results for the Dutch dataset with different fairness measures.

Figure 12: Results for the German dataset with different fairness measures.

Figure 13: Results for the Gaussian dataset with different fairness measures.

Figure 14: Results for the Twitter Sentiment dataset with different fairness measures.

Algorithm 1 FairGrad for Exact Fairness
Input: Groups T_1, . . . , T_K; functions F_1, . . . , F_K; a function class H of models h_θ with parameters θ ∈ R^D; learning rates η_λ and η_θ; an iterator iter that returns batches of examples.
Output: A fair model h*_θ.
1: Initialize the group-specific weights and the model.

Results for the Adult Multigroup dataset using Non Linear models.

Results for the UTKFace dataset where a ResNet18 is fine-tuned using different strategies.

Effect of the batch size on the CelebA dataset with Linear Models and EOdds as the fairness measure.

(multi-valued) with Equalized Odds as fairness measure are used as the target label. The results are displayed in Table

• Twitter Sentiment Analysis: The dataset (Blodgett et al., 2016) consists of 200k tweets with a binary sensitive attribute (race) and a binary sentiment score. We follow the setup proposed by Han et al. (2021) and Elazar & Goldberg (2018) and create bias in the dataset by changing the proportion of each subgroup (race-sentiment) in the training set. With the two sentiment classes being happy and sad, and the two race classes being AAE and SAE, the training data consists of 40% AAE-happy, 10% AAE-sad, 10% SAE-happy, and 40% SAE-sad. The test set remains balanced. The tweets are encoded using the DeepMoji (Felbo et al., 2017) encoder with no fine-tuning, which has been pretrained over millions of tweets to predict their emoji, thereby predicting the sentiment. Note that the train-test splits are pre-defined and thus do not change with the random seed of the repetition.
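The biased training split described above can be constructed by subsampling each (race, sentiment) subgroup to its target proportion. A minimal sketch, assuming arrays of subgroup indices are available (function and variable names are ours):

```python
import numpy as np

def biased_split(groups, proportions, n_total, seed=0):
    """groups: dict mapping subgroup name -> array of example indices.
    proportions: dict mapping subgroup name -> target fraction of the split.
    Subsamples each subgroup without replacement to its target size."""
    rng = np.random.default_rng(seed)
    chosen = []
    for name, frac in proportions.items():
        k = int(frac * n_total)
        chosen.append(rng.choice(groups[name], size=k, replace=False))
    return np.concatenate(chosen)

# Target proportions from the setup of Han et al. (2021) described above.
proportions = {"AAE-happy": 0.4, "AAE-sad": 0.1,
               "SAE-happy": 0.1, "SAE-sad": 0.4}
```

The test set is left untouched and balanced, so the 40/10/10/40 skew exists only at training time.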

Results for the Adult dataset with Linear Models. All the results are averaged over 5 runs. Here MEAN ABS., MAXIMUM, and MINIMUM represent the mean absolute fairness value, the fairness level of the most well-off group, and the fairness level of the worst-off group, respectively.

Results for the Adult dataset with Non Linear Models. All the results are averaged over 5 runs. Here MEAN ABS., MAXIMUM, and MINIMUM represent the mean absolute fairness value, the fairness level of the most well-off group, and the fairness level of the worst-off group, respectively.

Results for the CelebA dataset with Linear Models. All the results are averaged over 5 runs. Here MEAN ABS., MAXIMUM, and MINIMUM represent the mean absolute fairness value, the fairness level of the most well-off group, and the fairness level of the worst-off group, respectively.

Results for the CelebA dataset with Non Linear Models. All the results are averaged over 5 runs. Here MEAN ABS., MAXIMUM, and MINIMUM represent the mean absolute fairness value, the fairness level of the most well-off group, and the fairness level of the worst-off group, respectively.

Results for the Crime dataset with Linear Models. All the results are averaged over 5 runs. Here MEAN ABS., MAXIMUM, and MINIMUM represent the mean absolute fairness value, the fairness level of the most well-off group, and the fairness level of the worst-off group, respectively.

Results for the Crime dataset with Non Linear Models. All the results are averaged over 5 runs. Here MEAN ABS., MAXIMUM, and MINIMUM represent the mean absolute fairness value, the fairness level of the most well-off group, and the fairness level of the worst-off group, respectively.

Results for the Adult with multiple groups dataset with different fairness measures.

Results for the Adult with multiple groups dataset with Linear Models. All the results are averaged over 5 runs. Here MEAN ABS., MAXIMUM, and MINIMUM represent the mean absolute fairness value, the fairness level of the most well-off group, and the fairness level of the worst-off group, respectively.

Results for the Adult with multiple groups dataset with Non Linear Models. All the results are averaged over 5 runs. Here MEAN ABS., MAXIMUM, and MINIMUM represent the mean absolute fairness value, the fairness level of the most well-off group, and the fairness level of the worst-off group, respectively.

Results for the Compas dataset with Linear Models. All the results are averaged over 5 runs. Here MEAN ABS., MAXIMUM, and MINIMUM represent the mean absolute fairness value, the fairness level of the most well-off group, and the fairness level of the worst-off group, respectively.

Results for the Compas dataset with Non Linear Models. All the results are averaged over 5 runs. Here MEAN ABS., MAXIMUM, and MINIMUM represent the mean absolute fairness value, the fairness level of the most well-off group, and the fairness level of the worst-off group, respectively.

Results for the Dutch dataset with Linear Models. All the results are averaged over 5 runs. Here MEAN ABS., MAXIMUM, and MINIMUM represent the mean absolute fairness value, the fairness level of the most well-off group, and the fairness level of the worst-off group, respectively.

Results for the Dutch dataset with Non Linear Models. All the results are averaged over 5 runs. Here MEAN ABS., MAXIMUM, and MINIMUM represent the mean absolute fairness value, the fairness level of the most well-off group, and the fairness level of the worst-off group, respectively.

Results for the German dataset with Linear Models. All the results are averaged over 5 runs. Here MEAN ABS., MAXIMUM, and MINIMUM represent the mean absolute fairness value, the fairness level of the most well-off group, and the fairness level of the worst-off group, respectively.

Results for the German dataset with Non Linear Models. All the results are averaged over 5 runs. Here MEAN ABS., MAXIMUM, and MINIMUM represent the mean absolute fairness value, the fairness level of the most well-off group, and the fairness level of the worst-off group, respectively.

Results for the Gaussian dataset with Linear Models. All the results are averaged over 5 runs. Here MEAN ABS., MAXIMUM, and MINIMUM represent the mean absolute fairness value, the fairness level of the most well-off group, and the fairness level of the worst-off group, respectively.

Results for the Gaussian dataset with Non Linear Models. All the results are averaged over 5 runs. Here MEAN ABS., MAXIMUM, and MINIMUM represent the mean absolute fairness value, the fairness level of the most well-off group, and the fairness level of the worst-off group, respectively.

Method         Accuracy          Fairness  Mean Abs.         Maximum           Minimum
Weighted ERM   0.8805 ± 0.0046   Eopp      0.0912 ± 0.0008   0.1812 ± 0.0024   -0.1837 ± 0.0045
Adversarial    0.8754 ± 0.0086   Eopp      0.0808 ± 0.0066   0.1605 ± 0.0128   -0.1628 ± 0.0143
BiFair         0.88 ± 0.003      Eopp      0.086 ± 0.005     0.17 ± 0.013      -0.172 ± 0.009
FairBatch      0.874 ± 0.0035    Eopp      0.0733 ± 0.0029   0.1465 ± 0.0054   -0.1467 ± 0.0066
FairGrad       0.8543 ± 0.0082   Eopp      0.0517 ± 0.0095   0.1028 ± 0.0191   -0.1041 ± 0.0192

Results for the Twitter Sentiment dataset with Linear Models. All the results are averaged over 5 runs. Here MEAN ABS., MAXIMUM, and MINIMUM represent the mean absolute fairness value, the fairness level of the most well-off group, and the fairness level of the worst-off group, respectively.

Method         Accuracy          Fairness  Mean Abs.         Maximum           Minimum
                                 Eopp      0.002 ± 0.001     0.005 ± 0.001     0.0 ± 0.0
BiFair         0.746 ± 0.009     Eopp      0.009 ± 0.004     0.017 ± 0.009     -0.017 ± 0.009
FairBatch      0.7426 ± 0.001    Eopp      0.0429 ± 0.0005   0.0858 ± 0.0011   -0.0858 ± 0.0011
FairGrad       0.7518 ± 0.0069   Eopp      0.0024 ± 0.002    0.0049 ± 0.004    -0.0049 ± 0.004

Results for the Folktables Adult dataset with Linear Models. All the results are averaged over 5 runs. Here MEAN ABS., MAXIMUM, and MINIMUM represent the mean absolute fairness value, the fairness level of the most well-off group, and the fairness level of the worst-off group, respectively.

Method         Accuracy          Fairness  Mean Abs.         Maximum           Minimum
Weighted ERM   0.7886 ± 0.0032   Eodds     0.0294 ± 0.012    0.0364 ± 0.0169   -0.0443 ± 0.0206
Constrained    0.663 ± 0.032     Eodds     0.008 ± 0.003     0.013 ± 0.004     0.004 ± 0.002
BiFair         0.768 ± 0.007     Eodds     0.008 ± 0.005     0.011 ± 0.006     -0.011 ± 0.008
FairBatch      0.788 ± 0.0027    Eodds     0.0045 ± 0.0033   0.0069 ± 0.0065   -0.0063 ± 0.0049
FairGrad       0.7885 ± 0.0027   Eodds     0.0043 ± 0.0019   0.0073 ± 0.0037   -0.0068 ± 0.0045
Unconstrained  0.7902 ± 0.0038   Eopp      0.0094 ± 0.0031   0.0162 ± 0.0053   -0.0215 ± 0.0071
                                 Eopp      0.0012 ± 0.0015   0.0022 ± 0.0026   -0.0026 ± 0.0034
FairGrad       0.7893 ± 0.0026   Eopp      0.0011 ± 0.0009   0.0024 ± 0.002    -0.0021 ± 0.0016

Results for the Folktables Adult dataset with Non Linear Models. All the results are averaged over 5 runs. Here MEAN ABS., MAXIMUM, and MINIMUM represent the mean absolute fairness value, the fairness level of the most well-off group, and the fairness level of the worst-off group, respectively.

Method         Accuracy          Fairness  Mean Abs.         Maximum           Minimum
Weighted ERM   0.7947 ± 0.0022   Eopp      0.0105 ± 0.0027   0.0181 ± 0.0047   -0.024 ± 0.0062
Adversarial    0.8108 ± 0.0161   Eopp      0.0034 ± 0.0057   0.0041 ± 0.0057   -0.0095 ± 0.017
BiFair         0.793 ± 0.008     Eopp      0.028 ± 0.017     0.048 ± 0.029     -0.064 ± 0.039
FairBatch      0.8038 ± 0.0063   Eopp      0.0008 ± 0.0005   0.0014 ± 0.0009   -0.0018 ± 0.0012
FairGrad       0.8058 ± 0.0035   Eopp      0.0014 ± 0.0014   0.003 ± 0.0031    -0.0026 ± 0.0024

