FIFA: MAKING FAIRNESS MORE GENERALIZABLE IN CLASSIFIERS TRAINED ON IMBALANCED DATA

Abstract

Algorithmic fairness plays an important role in machine learning and imposing fairness constraints during learning is a common approach. However, many datasets are imbalanced in certain label classes (e.g. "healthy") and sensitive subgroups (e.g. "older patients"). Empirically, this imbalance leads to a lack of generalizability not only of classification, but also of fairness properties, especially in over-parameterized models. For example, fairness-aware training may ensure equalized odds (EO) on the training data, but EO is far from being satisfied on new users. In this paper, we propose a theoretically-principled, yet Flexible approach that is Imbalance-Fairness-Aware (FIFA). Specifically, FIFA encourages both classification and fairness generalization and can be flexibly combined with many existing fair learning methods with logits-based losses. While our main focus is on EO, FIFA can be directly applied to achieve equalized opportunity (EqOpt); and under certain conditions, it can also be applied to other fairness notions. We demonstrate the power of FIFA by combining it with a popular fair classification algorithm, and the resulting algorithm achieves significantly better fairness generalization on several real-world datasets.

1. INTRODUCTION

[Figure caption (fragment): The generalization of fairness constraints (equalized odds) is substantially worse than the generalization of classification error.]

Machine learning systems are becoming increasingly vital in our daily lives. The growing concern that they may inadvertently discriminate against minorities and other protected groups when identifying or allocating resources has attracted considerable attention from various communities. While significant efforts have been devoted to understanding and correcting biases in classical models such as logistic regression and support vector machines (SVM), see, e.g., (Agarwal et al., 2018; Hardt et al., 2016), the resulting tools are far less effective on modern over-parameterized models such as neural networks (NN). Furthermore, in large models, it is also difficult for measures of fairness (such as equalized odds, to be introduced shortly) to generalize, as shown in Fig. 1. In other words, fairness-aware training (for instance, by imposing fairness constraints in training) may ensure measures of fairness on the training data, but those measures are far from being satisfied on test data. Here we find that sufficiently trained ResNet-10 models generalize well on classification error but poorly on fairness constraints: the gap in equalized odds between the test and training data is more than ten times larger than the gap for classification error between test and training.

In parallel, another outstanding challenge for generalization with real-world datasets is that they are often imbalanced across label and demographic groups (see Fig. 2 for imbalance in three commonly used datasets across various domains).
This inherent property of real-world data greatly hinders the generalization of classifiers that are unaware of the imbalance, especially when the performance measure places substantial emphasis on minority classes or subgroups without sufficient samples (e.g., when considering the average classification error for each label class). Generalization with imbalanced data has been extensively studied, and mitigation strategies have been proposed (Cao et al., 2019; Mani & Zhang, 2003; He & Garcia, 2009; An et al., 2021; He & Ma, 2013; Krawczyk, 2016).

Our contributions. Inspired by recent works on regularizing the minority classes more strongly than the frequent classes by imposing class-dependent margins (Cao et al., 2019) in standard supervised learning, we design a theoretically-principled, Flexible and Imbalance-Fairness-Aware (FIFA) approach that takes both classification error and fairness constraint violation into account when training the model. FIFA can be flexibly combined with many fair learning methods with logits-based losses, such as the soft margin loss (Liu et al., 2016), by encouraging larger margins for minority subgroups. While our method is motivated by over-parameterized models such as neural networks, it also helps simpler models such as logistic regression. Experiments on large datasets with over-parameterized models, as well as on smaller datasets with simpler models, demonstrate the effectiveness and flexibility of our approach in ensuring better fairness generalization while preserving good classification generalization.

Related work. Supervised learning with imbalanced datasets has attracted significant interest in the machine learning community, where several methods including resampling, reweighting, and data augmentation have been developed and deployed in practice (Mani & Zhang, 2003; He & Garcia, 2009; An et al., 2021).
Theoretical analyses of those methods include margin-based approaches (Li et al., 2002; Kakade et al., 2008; Khan et al., 2019; Cao et al., 2019). Somewhat tangentially, an outstanding and emerging problem faced by modern models with real-world data is algorithmic fairness (Dwork et al., 2012; Coley et al., 2021; Deng et al., 2023), where practical algorithms are developed for pre-processing (Feldman et al., 2015), in-processing (Zemel et al., 2013; Edwards & Storkey, 2015; Zafar et al., 2017; Donini et al., 2018; Madras et al., 2018; Martinez et al., 2020; Lahoti et al., 2020; Deng et al., 2020), and post-processing (Hardt et al., 2016; Kim et al., 2019) steps. Nonetheless, there are several challenges when applying fairness algorithms in practice (Beutel et al., 2019; Saha et al., 2020; Deng et al., 2022; Holstein et al., 2019). Specifically, as hinted in Fig. 1, the fairness generalization guarantee, especially in over-parameterized models and large datasets, is not well understood, leading to various practical concerns. We remark that although Kini et al. (2021) claim it is necessary to use multiplicative instead of additive logits adjustments, their motivating example is different from ours, and they study SVM with fixed and specified budgets for all inputs. Cotter et al. (2019) investigate the generalization of optimization with data-dependent constraints, but they do not address the inherent imbalance in real datasets, and their experiments do not use the large neural networks employed in practice. To the best of our knowledge, this paper is the first to tackle the open challenge of fairness generalization with imbalanced data.

2. BACKGROUND

Notation. For any $k \in \mathbb{N}_+$, we use $[k]$ to denote the set $\{1, 2, \dots, k\}$. For a vector $v$, let $v_i$ be the $i$-th coordinate of $v$. We use $\mathbb{1}$ to denote the indicator function. For a set $S$, we use $|S|$ to denote the cardinality of $S$. For two positive sequences $\{a_k\}$ and $\{b_k\}$, we write $a_k = O(b_k)$ (or $a_k \lesssim b_k$) and $a_k = o(b_k)$ if $\lim_{k\to\infty}(a_k/b_k) < \infty$ and $\lim_{k\to\infty}(a_k/b_k) = 0$, respectively. We use $\mathbb{P}$ for probability and $\mathbb{E}$ for expectation, and $\hat{\mathbb{P}}$ and $\hat{\mathbb{E}}$ for empirical probability and expectation. For two distributions $D_1$ and $D_2$, we use $pD_1 + (1-p)D_2$ for $p \in (0, 1)$ to denote the mixture distribution in which a sample is drawn from $D_1$ with probability $p$ and from $D_2$ with probability $1-p$. We use $N_p(\mu, \Sigma)$ to denote the $p$-dimensional Gaussian distribution with mean $\mu$ and covariance $\Sigma$.

Fairness notions. Throughout the paper, we consider datasets consisting of triplets of the form $(x, y, a)$, where $x \in \mathcal{X}$ is a feature vector, $a \in \mathcal{A}$ is a sensitive attribute such as race or gender, and $y \in \mathcal{Y}$ is the corresponding label. The underlying random triplet corresponding to $(x, y, a)$ is denoted by $(X, Y, A)$. Our goal is to learn a predictor $h \in \mathcal{H} : \mathcal{X} \to \mathcal{Y}$, where $h(X)$ is a prediction of the label $Y$ of input $X$. In this paper, we mainly consider equalized odds (EO) (Hardt et al., 2016), which has been widely used in the fairness literature. Our method can also be applied directly to equalized opportunity (EqOpt), given that EqOpt is quite similar to EO. In addition, under certain conditions, our method can be applied to demographic parity (DP), which we mainly discuss in the Appendix.

(i). Equalized odds (EO) and equalized opportunity (EqOpt). A predictor $h$ satisfies equalized odds if $h(X)$ is conditionally independent of the sensitive attribute $A$ given $Y$: $\mathbb{P}(h(X) = y \mid Y = y, A = a) = \mathbb{P}(h(X) = y \mid Y = y)$ for all $y \in \mathcal{Y}$ and $a \in \mathcal{A}$. If $\mathcal{Y} = \{0, 1\}$ and we only require $\mathbb{P}(h(X) = 1 \mid Y = 1, A = a) = \mathbb{P}(h(X) = 1 \mid Y = 1)$, we say $h$ satisfies equalized opportunity.

(ii). Demographic parity (DP). A predictor $h$ satisfies demographic parity if $h(X)$ is statistically independent of the sensitive attribute $A$: $\mathbb{P}(h(X) = y \mid A = a) = \mathbb{P}(h(X) = y)$ for all $y \in \mathcal{Y}$ and $a \in \mathcal{A}$.
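The definitions above can be checked empirically on a finite sample. The following is a minimal sketch in plain Python, not the paper's evaluation code: the function names and the toy data are hypothetical, and every subgroup is assumed to be non-empty.

```python
def conditional_rate(preds, keep, value):
    """Empirical P(h(X) = value | event); the event is given as a boolean mask."""
    selected = [p for p, k in zip(preds, keep) if k]
    return sum(p == value for p in selected) / len(selected)


def equalized_odds_gaps(preds, labels, attrs):
    """For each (y, a): |P(h = y | Y = y, A = a) - P(h = y | Y = y)|."""
    gaps = {}
    for y in sorted(set(labels)):
        in_class = [l == y for l in labels]
        overall = conditional_rate(preds, in_class, y)
        for a in sorted(set(attrs)):
            in_group = [l == y and s == a for l, s in zip(labels, attrs)]
            gaps[(y, a)] = abs(conditional_rate(preds, in_group, y) - overall)
    return gaps


def demographic_parity_gaps(preds, attrs, value=1):
    """For each a: |P(h = value | A = a) - P(h = value)|."""
    overall = sum(p == value for p in preds) / len(preds)
    return {
        a: abs(conditional_rate(preds, [s == a for s in attrs], value) - overall)
        for a in sorted(set(attrs))
    }


# Hypothetical toy sample: 8 points, two classes, two groups.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_attrs = ["a1", "a1", "a2", "a2", "a1", "a1", "a2", "a2"]
toy_preds = [1, 1, 0, 1, 0, 0, 1, 0]
toy_eo = equalized_odds_gaps(toy_preds, toy_labels, toy_attrs)
toy_dp = demographic_parity_gaps(toy_preds, toy_attrs)
```

A predictor satisfies EO (respectively DP) on the sample exactly when all returned gaps are zero; in the toy sample the DP gaps vanish while the EO gaps do not, which illustrates that the two notions are distinct.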

3. THEORY-INSPIRED DERIVATION

While we will formally introduce our new approach in Section 4, this section gives an informal derivation, with an emphasis on insights. We design an imbalance-fairness-aware approach that can be flexibly combined with fair learning methods with logits-based losses. Throughout the paper, we use lowercase letters, e.g. $x$, for realizations and capital letters, e.g. $X$, for random variables.

Consider the supervised $k$-class classification problem, where a model $f : \mathcal{X} \to \mathbb{R}^k$ provides $k$ scores, and the assigned label is the class with the highest score. The corresponding predictor is $h(x) = \arg\max_i f(x)_i$ if there are no ties. Let $P_i = \mathbb{P}(X \mid Y = i)$ denote the conditional distribution when the class label is $i$, for $i \in [k]$, and let $P_{\mathrm{bal}}$ denote the balanced distribution $P_{\mathrm{Idx}}$, where $\mathrm{Idx}$ is drawn uniformly from $[k]$, i.e., $P_{\mathrm{bal}} = \frac{1}{k}\sum_{i=1}^{k} P_i$. $P_{\mathrm{bal}}$ can be viewed as a distribution that weights each class equally. Similarly, let $P_{i,s} = \mathbb{P}(X \mid Y = i, A = s)$ denote the conditional distribution when $Y = i$ and $A = s$. The corresponding empirical distributions induced by the training data are $\hat{P}_i$, $\hat{P}_{\mathrm{bal}}$ and $\hat{P}_{i,s}$. For the training dataset $\{(x_j, y_j, a_j)\}_j$, let $S_i = \{j : y_j = i\}$ and $S_{i,a} = \{j : y_j = i, a_j = a\}$, and let the corresponding sample sizes be $n_i$ and $n_{i,a}$, respectively. Although $P_i$, $P_{\mathrm{bal}}$ and $P_{i,s}$ are all distributions on $\mathcal{X}$, we sometimes use notation such as $(X, Y) \sim P_i$ to denote the distribution of $(X, i)$, where $X \sim P_i$.

In classical imbalanced data analysis, the goal is to ensure a small balanced error $L_{\mathrm{bal}}[f] = \mathbb{P}_{(X,Y)\sim P_{\mathrm{bal}}}\big[f(X)_Y < \max_{l \neq Y} f(X)_l\big]$. For our goal, we not only want to ensure a small $L_{\mathrm{bal}}[f]$, we also want to keep the fairness violation error as small as possible. To do that, we need to take into account the margins of the subgroups obtained by dividing each label class according to the sensitive attribute (the so-called demographic subgroups in different classes).

Margin trade-off between classes of equalized odds.
In the setting of standard classification with imbalanced training datasets, such as in Cao et al. (2019) and Sagawa et al. (2020), the aim is to reach a small balanced test error $L_{\mathrm{bal}}[f]$. In a fair classification setting, however, our aim is not only to reach a small $L_{\mathrm{bal}}[f]$, but also to satisfy certain fairness constraints at test time. Specifically, for EO, the aim is

$$\min_f L_{\mathrm{bal}}[f] \quad \text{s.t.} \quad \forall y \in \mathcal{Y},\ a \in \mathcal{A}:\ \mathbb{P}(h(X) = y \mid Y = y, A = a) = \mathbb{P}(h(X) = y \mid Y = y),$$

where we recall that $h(\cdot) = \arg\max_i f(\cdot)_i$. We remark that in addition to the class-balanced loss $L_{\mathrm{bal}}[f]$, we can also consider a loss function that is balanced across all demographic subgroups in different classes; the derivation is similar and we omit it here.

Recall our motivating example in Figure 1: whether the fairness violation error is small at test time should also be taken into account. Thus, our performance criterion for optimization should be $L_{\mathrm{bal}}[f] + \alpha L_{\mathrm{fv}}$, where $L_{\mathrm{fv}}$ is a measure of fairness constraint violation that we will specify later, and $\alpha$ is a weight parameter chosen according to how much we care about the fairness constraint violation.

For simplicity, we start with $\mathcal{Y} = \{0, 1\}$ and $\mathcal{A} = \{a_1, a_2\}$. In the Appendix, we further discuss the case with multiple classes and multiple demographic groups. We also clarify that our setting differs from naively viewing each demographic group as a class and applying the method of Cao et al. (2019). The main difference is that our aim is to identify the class labels at test time, and we do not assume access to sensitive attributes at test time. As a result, the method of Cao et al. (2019) cannot be used directly.

Given that we mainly consider over-parameterized models, we assume the training data is well-separated, in the sense that all training samples are perfectly classified and the fairness constraints are perfectly satisfied.
This setting has been considered in (Cao et al., 2019) and can be satisfied if the model class is rich, for instance, for over-parameterized models such as neural networks. We also emphasize that even though our theory-derived method assumes well-separation, it can be applied to datasets that are not well-separated; see Section 6 for more details. If all the training samples are classified perfectly by $h$, then not only is $\mathbb{P}_{(X,Y)\sim \hat{P}_{\mathrm{bal}}}(h(X) \neq Y) = 0$ satisfied, we also have $\mathbb{P}_{(X,Y)\sim \hat{P}_{i,a_j}}(h(X) \neq Y) = 0$ for all $i \in \mathcal{Y}$ and $a_j \in \mathcal{A}$. We remark that $\mathbb{P}(h(X) = i \mid Y = i, A = a) = 1 - \mathbb{P}_{(X,Y)\sim P_{i,a}}(h(X) \neq Y)$. Our performance criterion for optimization in (A.4) is

$$M[f] = L_{\mathrm{bal}}[f] + \alpha \sum_{i \in \mathcal{Y}} \big|\mathbb{P}(h(X) = i \mid Y = i, A = a_1) - \mathbb{P}(h(X) = i \mid Y = i, A = a_2)\big|.$$

Using classical margin theory bounds, we can establish the connection between the margins of the demographic subgroups and the generalization performance in both classification and the fairness constraint, as stated in Theorem 3.1. Denote the margin for class $i$ by $\gamma_i = \min_{j \in S_i} \gamma(x_j, y_j)$, where $\gamma(x, y) = f(x)_y - \max_{l \neq y} f(x)_l$. One natural way to choose $L_{\mathrm{fv}}$ is to take $\sum_{i \in \mathcal{Y}} \big|\mathbb{P}(h(X) = i \mid Y = i, A = a_1) - \mathbb{P}(h(X) = i \mid Y = i, A = a_2)\big|$.

Theorem 3.1 (Informal) With high probability over the randomness of the training data, for $\mathcal{Y} = \{0, 1\}$, $\mathcal{A} = \{a_1, a_2\}$, and for some proper complexity measure $C(\mathcal{F})$ of the class $\mathcal{F}$ (see more details in the Appendix), for any $f \in \mathcal{F}$,

$$M[f] \lesssim \sum_{i \in \mathcal{Y}} \frac{1}{\gamma_i}\sqrt{\frac{C(\mathcal{F})}{n_i}} + \sum_{i \in \mathcal{Y}, a \in \mathcal{A}} \frac{2\alpha}{\gamma_{i,a}}\sqrt{\frac{C(\mathcal{F})}{n_{i,a}}} \le \sum_{i \in \mathcal{Y}} \frac{1}{\gamma_i}\sqrt{\frac{C(\mathcal{F})}{n_i}} + \sum_{i \in \mathcal{Y}, a \in \mathcal{A}} \frac{2\alpha}{\gamma_i}\sqrt{\frac{C(\mathcal{F})}{n_{i,a}}}, \tag{2}$$

where $\gamma_i$ is the margin of the $i$-th class's sample set $S_i$ and $\gamma_{i,a}$ is the margin of the demographic subgroup's sample set $S_{i,a}$.
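The margins appearing in Theorem 3.1 are simple functions of the model scores on the training set. The sketch below is illustrative only: the function names and toy scores are hypothetical, and scores are assumed to be given as one list of $k$ numbers per training sample.

```python
def sample_margin(scores, y):
    """gamma(x, y) = f(x)_y - max_{l != y} f(x)_l for one sample."""
    others = [s for l, s in enumerate(scores) if l != y]
    return scores[y] - max(others)


def class_and_subgroup_margins(all_scores, labels, attrs):
    """Return ({i: gamma_i}, {(i, a): gamma_{i,a}}) as minima over S_i and S_{i,a}."""
    class_margin, subgroup_margin = {}, {}
    for scores, y, a in zip(all_scores, labels, attrs):
        m = sample_margin(scores, y)
        class_margin[y] = min(m, class_margin.get(y, m))
        subgroup_margin[(y, a)] = min(m, subgroup_margin.get((y, a), m))
    return class_margin, subgroup_margin


# Hypothetical two-class scores f(x) for four well-separated training points.
toy_scores = [[2.0, 0.5], [1.0, 0.2], [0.1, 0.9], [0.0, 3.0]]
toy_labels = [0, 0, 1, 1]
toy_attrs = ["a1", "a2", "a1", "a2"]
gamma_class, gamma_subgroup = class_and_subgroup_margins(toy_scores, toy_labels, toy_attrs)
```

Because $S_{i,a} \subseteq S_i$, every subgroup margin is at least the margin of its class, which is the fact used to pass from the middle term to the last term of the bound.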
Optimizing the upper bound in (2) with respect to the margins, in the sense that $g(\gamma_0, \gamma_1) \le g(\gamma_0 - \delta, \gamma_1 + \delta)$ for

$$g(\gamma_0, \gamma_1) = \sum_{i \in \mathcal{Y}} \frac{1}{\gamma_i \sqrt{n_i}} + 2\alpha \sum_{i \in \mathcal{Y}, a \in \mathcal{A}} \frac{1}{\gamma_i \sqrt{n_{i,a}}}$$

and all $\delta \in [-\gamma_1, \gamma_0]$, we obtain $\gamma_0/\gamma_1 = \tilde{n}_1^{1/4}/\tilde{n}_0^{1/4}$, where the adjusted sample size is

$$\tilde{n}_i = \frac{n_i \prod_{a \in \mathcal{A}} n_{i,a}}{\Big(\sqrt{\prod_{a \in \mathcal{A}} n_{i,a}} + 2\alpha \sum_{a \in \mathcal{A}} \sqrt{n_i n_{i,a}}\Big)^2}$$

for $i \in \{0, 1\}$. For $|\mathcal{A}| = 2$ this is equivalent to $1/\sqrt{\tilde{n}_i} = 1/\sqrt{n_i} + 2\alpha \sum_{a \in \mathcal{A}} 1/\sqrt{n_{i,a}}$, so that $g(\gamma_0, \gamma_1) = \sum_{i \in \mathcal{Y}} 1/(\gamma_i \sqrt{\tilde{n}_i})$.

From Theorem 3.1, we see how the sample sizes of the subgroups are taken into account and how they affect the optimal ratio between class margins. Based on this theorem, we propose our theoretical framework in Section 4. A closely related derivation has been used in Cao et al. (2019), but their focus is only on the classification error and its generalization. As we show in Example 3.1, when fairness constraints are also considered, their method can sometimes perform poorly with respect to the generalization of those constraints. We remark that if we do not consider the fairness constraint violation, then $\alpha = 0$, and the adjusted sample sizes reduce to $\tilde{n}_i = n_i$.

For illustration, we demonstrate the advantage of applying our approach to select margins over directly using the margin selection in Cao et al. (2019) by considering Gaussian models, which are widely used in machine learning theory (Schmidt et al., 2018; Zhang et al., 2021; Deng et al., 2021). Specifically, our training data follow the distribution

$$X \mid Y = 0 \sim \sum_{i=1}^{2} \pi_{0,a_i} N_p(\mu_i, I), \qquad X \mid Y = 1 \sim \sum_{i=1}^{2} \pi_{1,a_i} N_p(\mu_i + \beta^*, I).$$

Here, in class $j$, subgroup $a_i$ is drawn with probability $\pi_{j,a_i}$; then, given that the sample is from subgroup $a_i$ in class $j$, the data is distributed as a Gaussian random vector. Recall that the training dataset indices of subgroup $a_i$ in class $j$ are denoted by $S_{j,a_i}$, with $|S_{j,a_i}| = n_{j,a_i}$. Consider the case $\alpha = 1$, $\pi_{0,a_1} = \pi_{0,a_2}$, and the following class of classifiers: $\mathcal{F} = \{\mathbb{1}\{\beta^{*\top} x > c\} : c \in \mathbb{R}\}$, which is a class of linear classifiers whose members differ from each other by a translation along a particular direction.
Example 3.1 Given a function $f$ and a set $S$, let $\mathrm{dist}(f, S) = \min_{x, s \in S} \|f(x) - s\|_2$. Consider two classifiers $\tilde{f}, f' \in \mathcal{F}$ such that $\mathrm{dist}(\tilde{f}, S_0)/\mathrm{dist}(\tilde{f}, S_1) = \tilde{n}_1^{1/4}/\tilde{n}_0^{1/4}$ and $\mathrm{dist}(f', S_0)/\mathrm{dist}(f', S_1) = n_0^{-1/4}/n_1^{-1/4}$. Suppose $\|\beta^*\| \gg \sqrt{p \log n}$, $\|\mu_i\| < C$, $(\mu_1 - \mu_2)^\top \beta^* = 0$, and $\pi_{1,a_2} \le c_1 \pi_{1,a_1}$ for a sufficiently small $c_1 > 0$. Then, when $n_0, n_1$ are sufficiently large, with high probability we have $M[\tilde{f}] < M[f']$. Here $L_{\mathrm{bal}}[f] = \frac{1}{2}\mathbb{P}[f(X)_1 < f(X)_0 \mid Y = 1] + \frac{1}{2}\mathbb{P}[f(X)_0 < f(X)_1 \mid Y = 0]$ denotes the balanced misclassification error.

Remark. We provide analyses for the 0-1 loss, as our ultimate goal is to strike a balance between good test accuracy and small fairness constraint violation. If we use surrogates for the 0-1 loss in training, such as the softmax-cross-entropy loss, our theoretical analyses still stand, since we always adjust margins based on the 0-1 loss and our interest is in quantities such as test accuracy. We provide analyses and experiments for DP in the Appendix. For ease of exposition, we focus solely on the EO constraint hereafter and discuss other constraints in the Appendix.

4. FLEXIBLE COMBINATION WITH LOGITS-BASED LOSSES

Inspired by the margin trade-off characterized in Section 3, we propose our FIFA approach for Flexible Imbalance-Fairness-Aware classification, which can be easily combined with different types of logits-based losses and further incorporated into existing fairness algorithms such as those discussed in Section 5.

Recall that $\gamma_{i,a}$ is the margin for demographic subgroups in Theorem 3.1. It can be written as $\gamma_{i,a} = \gamma_i + \delta_{i,a}$ with $\delta_{i,a} \ge 0$ (since $\gamma_i = \min\{\gamma_{i,a_1}, \gamma_{i,a_2}\}$; see also Fig. 3 for an illustration), hence the middle term of Eq. (2) can be further upper bounded by the last term in Eq. (2). The final upper bound in Eq. (2) is indeed sufficient for obtaining the margin trade-off between classes. Nonetheless, if we want to further enforce margins for each demographic subgroup in each class, we need to use the refined bound. Specifically, in Section 3 we identified a way to select $\gamma_0/\gamma_1$, based on which we propose to enforce margins for each demographic subgroup's training set $S_{i,a}$ of the form

$$\gamma_{i,a} = C/\tilde{n}_i^{1/4} + \delta_{i,a}, \tag{3}$$

where $\delta_{i,a}$ and $C$ are all non-negative tuning parameters. In light of the trade-off between the class margins, $\gamma_0/\gamma_1 = \tilde{n}_1^{1/4}/\tilde{n}_0^{1/4}$, we can set $\gamma_i$ of the form $C/\tilde{n}_i^{1/4}$. Given $\gamma_{i,a} \ge \gamma_i$, a natural choice of margins for subgroups is Eq. (3).

How to select $\delta_{i,a}$? Knowing the form of the margins from the preceding discussion, an outstanding question remains: how should $\delta_{i,a}$ be selected for imbalanced datasets? Let $\mathcal{Y} = \{0, 1\}$ and $\mathcal{A} = \{a_1, a_2\}$. Within each class $i$, we identify the $S_{i,a}$ with the largest cardinality $|S_{i,a}|$ and set the corresponding $\delta_{i,a} = 0$. The remaining $\delta_{i,\mathcal{A}\setminus a}$ is tuned as a non-negative parameter. As a further illustration, assume without loss of generality that $|S_{i,a_1}| \ge |S_{i,a_2}|$ for all $i$. The $\{\delta_{i,a}\}_{i,a}$ selected in this way ensure that the upper bound in the middle of Eq. (2) is tighter, in the sense that for any $\delta > 0$,

$$\sum_{i \in \mathcal{Y}} \left[\frac{1}{\gamma_i \sqrt{n_{i,a_1}}} + \frac{1}{(\gamma_i + \delta)\sqrt{n_{i,a_2}}}\right] \le \sum_{i \in \mathcal{Y}} \left[\frac{1}{(\gamma_i + \delta)\sqrt{n_{i,a_1}}} + \frac{1}{\gamma_i \sqrt{n_{i,a_2}}}\right].$$

In the Appendix, we present how to choose the $\delta_{i,a}$'s when there are multiple demographic groups. Briefly, results of the same type as the above inequality follow from the rearrangement inequality. Simple as it is, the high-level view is meaningful: the decision boundaries of a fair predictor should be farther away from the less-frequent subgroup than from the more-frequent subgroup to ensure better fairness generalization.

Flexible imbalance-fairness-aware (FIFA) approach. We now demonstrate how to apply the above motivations to design better margin losses. Loosely speaking, we consider a logits-based loss $\ell((x, y); f) = \ell\big(f(x)_y, \{f(x)_i\}_{i \in \mathcal{Y}\setminus y}\big)$, which is non-increasing with respect to its first coordinate when the second coordinate is fixed. Such losses include (i). 0-1 loss: $\mathbb{1}\{f(x)_y < \max_{i \in \mathcal{Y}\setminus y} f(x)_i\}$. (ii). Hinge loss: $\max\{\max_{i \in \mathcal{Y}\setminus y} f(x)_i - f(x)_y, 0\}$. (iii). Softmax-cross-entropy loss: $-\log\big(e^{f(x)_y}/(e^{f(x)_y} + \sum_{i \neq y} e^{f(x)_i})\big)$.
Our flexible imbalance-fairness-aware (FIFA) approach modifies the above losses by enforcing margins of the form in Eq. (3). Specifically, we use the following loss function during training:

$$\ell_{\mathrm{FIFA}}((x, y, a); f) = \ell\big(f(x)_y - \Delta_{y,a}, \{f(x)_i\}_{i \in \mathcal{Y}\setminus y}\big), \tag{4}$$

where $\Delta_{i,a} = C/\tilde{n}_i^{1/4} + \delta_{i,a}$. We remark that $\ell_{\mathrm{FIFA}}((x, y, a); f)$ is used only during the training phase, where we allow access to the sensitive attribute $a$. At test time, we only need $f$ and not $a$.

[Algorithm 1 (excerpt): compute the FIFA loss $\ell_{\mathrm{FIFA}}$ via (4) using reduction-labels $\hat{y}_{\mathrm{train}}$ (in mini-batches).]
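To make the construction concrete, the sketch below computes the adjusted sample size from Section 3 for two subgroups, the margin $\Delta_{i,a}$, and the softmax-cross-entropy variant of the FIFA loss for a single sample. It is a minimal illustration in plain Python rather than the authors' implementation; the function names, the subgroup counts and the values of $C$, $\alpha$ and $\delta$ are all hypothetical.

```python
import math


def adjusted_sample_size(n_a1, n_a2, alpha):
    """Adjusted sample size of one class with two subgroups (n_i = n_a1 + n_a2)."""
    n_i = n_a1 + n_a2
    prod = n_a1 * n_a2
    denom = math.sqrt(prod) + 2.0 * alpha * (
        math.sqrt(n_i * n_a1) + math.sqrt(n_i * n_a2)
    )
    return n_i * prod / denom ** 2


def fifa_margin(n_tilde, delta, C):
    """Delta_{i,a} = C / n_tilde_i^{1/4} + delta_{i,a}."""
    return C / n_tilde ** 0.25 + delta


def fifa_cross_entropy(scores, y, margin):
    """Softmax cross-entropy after lowering the true-class score by the margin."""
    shifted = list(scores)
    shifted[y] -= margin
    top = max(shifted)  # subtract the maximum for numerical stability
    log_norm = top + math.log(sum(math.exp(s - top) for s in shifted))
    return log_norm - shifted[y]


# Hypothetical counts: class 1 is the minority class, a2 the minority subgroup.
n_tilde = {
    0: adjusted_sample_size(900, 100, alpha=1.0),
    1: adjusted_sample_size(90, 10, alpha=1.0),
}
# delta = 0 for the larger subgroup a1, a tuned non-negative value for a2.
margin_1_a2 = fifa_margin(n_tilde[1], delta=0.05, C=0.5)
loss = fifa_cross_entropy([0.3, 1.2], y=1, margin=margin_1_a2)
```

With a zero margin the function reduces to the usual cross-entropy; a positive margin makes the loss larger for the same scores, so training has to push the true-class score further above the others, and more so for samples from smaller classes and subgroups.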

5. EXAMPLE: COMBINING FIFA WITH REDUCTIONS-BASED FAIR ALGORITHMS

[Algorithm 1 (excerpt, remaining steps): update $\theta$ in $h$ using back-propagation; log training metrics using true labels $\{y_i\}_{i=1}^n$ and attributes $\{a_i\}_{i=1}^n$; end for.]

In this section, we demonstrate the power of our approach by combining it with a popular reduction-based fair classification algorithm (Agarwal et al., 2018) as an example. In Section 6, we show that incorporating our approach brings a significant gain in terms of both combined loss and fairness generalization compared with directly applying their method to vanilla models trained with softmax-cross-entropy losses. The reduction approach proposed in Agarwal et al. (2018) has two versions: (i). Exponentiated gradient (ExpGrad), which produces a randomized classifier; and (ii). Grid search (GridS), which produces a deterministic classifier. Our approach can be combined with both. At a high level, both methods add the fairness constraint as a penalty to the objective loss function and then perform a min-max optimization; they differ only in whether the output classifier is randomized (ExpGrad) or deterministic (GridS). To incorporate our framework into the two algorithms above, we only need to slightly modify the error function by adding a margin-related term.

Exponentiated gradient (ExpGrad). We first briefly describe the algorithm. For $\mathcal{Y} = \{0, 1\}$, by Agarwal et al. (2018), EO constraints can be rewritten as $M\mu(h) \le c$ for certain $M$ and $c$, where $\mu_j(h) = \mathbb{E}[h(X) \mid E_j]$ for $j \in \mathcal{J}$, $M \in \mathbb{R}^{|\mathcal{K}| \times |\mathcal{J}|}$, and $c \in \mathbb{R}^{|\mathcal{K}|}$. Here, $\mathcal{K} = \mathcal{A} \times \mathcal{Y} \times \{+, -\}$ (the signs $+, -$ are used to recover $|\cdot|$ in the constraints) and $\mathcal{J} = (\mathcal{A} \cup \{*\}) \times \{0, 1\}$, with $E_{(a,y)} = \{A = a, Y = y\}$ and $E_{(*,y)} = \{Y = y\}$. Let $\mathrm{err}(h) = \mathbb{P}(h(X) \neq Y)$. Instead of considering $\min_{h \in \mathcal{H}} \mathrm{err}(h)$ such that $M\mu(h) \le c$, Agarwal et al. (2018) consider $\min_{Q \in \Delta_{\mathcal{H}}} \mathrm{err}(Q)$ such that $M\mu(Q) \le c$, where $\mathrm{err}(Q) = \sum_{h \in \mathcal{H}} Q(h)\,\mathrm{err}(h)$, $\mu(Q) = \sum_{h \in \mathcal{H}} Q(h)\mu(h)$, $Q$ is a distribution over $\mathcal{H}$, and $\Delta_{\mathcal{H}}$ is the collection of distributions on $\mathcal{H}$.
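Let us further use $\widehat{\mathrm{err}}(Q)$ and $\hat{\mu}(Q)$ to denote the empirical versions, and allow relaxation of the constraints by using $\hat{c} = c + \epsilon$, where $\hat{c}_k = c_k + \epsilon_k$ for relaxation $\epsilon_k \ge 0$. By classical optimization theory, the problem can be transformed into a saddle point problem, and Agarwal et al. (2018) aim to solve the following primal and dual problems simultaneously for $L(Q, \lambda) = \widehat{\mathrm{err}}(Q) + \lambda^\top (M\hat{\mu}(Q) - \hat{c})$:

$$(\mathrm{P}):\ \min_{Q \in \Delta_{\mathcal{H}}}\ \max_{\lambda \in \mathbb{R}_+^{|\mathcal{K}|},\, \|\lambda\|_1 \le B} L(Q, \lambda), \qquad (\mathrm{D}):\ \max_{\lambda \in \mathbb{R}_+^{|\mathcal{K}|},\, \|\lambda\|_1 \le B}\ \min_{Q \in \Delta_{\mathcal{H}}} L(Q, \lambda).$$

To summarize, ExpGrad takes the training data $\{(x_i, y_i, a_i)\}_{i=1}^n$, function class $\mathcal{H}$, constraint parameters $M, \hat{c}$, bound $B$, accuracy tolerance $\nu > 0$, and learning rate $\eta$ as inputs, and outputs $(\hat{Q}, \hat{\lambda})$ such that $L(\hat{Q}, \hat{\lambda}) \le L(Q, \hat{\lambda}) + \nu$ for all $Q \in \Delta_{\mathcal{H}}$ and $L(\hat{Q}, \hat{\lambda}) \ge L(\hat{Q}, \lambda) - \nu$ for all $\lambda \in \mathbb{R}_+^{|\mathcal{K}|}$ with $\|\lambda\|_1 \le B$; such a pair $(\hat{Q}, \hat{\lambda})$ is called a $\nu$-approximate saddle point. As implemented in Agarwal et al. (2018), $\mathcal{H}$ roughly consists of $h(x) = \mathbb{1}\{f(x)_1 \ge f(x)_0\}$ for $f \in \mathcal{F}$ (in fact, a smoothed version is considered in Agarwal et al. (2018)), which gives

$$\widehat{\mathrm{err}}(Q) = \sum_{h \in \mathcal{H}} \hat{\mathbb{P}}(h(X) \neq Y)\,Q(h) = \sum_{h \in \mathcal{H}} \hat{\mathbb{P}}\big(f(X)_Y < f(X)_{\{0,1\}\setminus Y}\big)\,Q(h).$$

To combine with our approach, we consider optimizing $\widehat{\mathrm{err}}_{\mathrm{new}}(Q) = \sum_{h \in \mathcal{H}} \hat{\mathbb{P}}\big(f(X)_Y - \Delta_{Y,A} \le f(X)_{\{0,1\}\setminus Y}\big)\,Q(h)$ such that $M\hat{\mu}^{\mathrm{new}}(Q) \le \hat{c}$, where $\hat{\mu}^{\mathrm{new}}(Q) = \sum_{h \in \mathcal{H}} Q(h)\hat{\mu}^{\mathrm{new}}(f)$ and $\hat{\mu}^{\mathrm{new}}_j(f) = \hat{\mathbb{P}}\big(f(X)_Y - \Delta_{Y,A} > f(X)_{\{0,1\}\setminus Y} \mid E_j\big)$. We can modify ExpGrad to optimize the primal and dual problems simultaneously for $L_{\mathrm{new}}(Q, \lambda) = \widehat{\mathrm{err}}_{\mathrm{new}}(Q) + \lambda^\top (M\hat{\mu}^{\mathrm{new}}(Q) - \hat{c})$. In practice, while Section 3 is motivated by deterministic classifiers, FIFA works for the randomized version too: the modified ExpGrad can be viewed as encouraging a distribution $Q$ that puts more weight on classifiers with a certain type of margin trade-off between classes. Moreover, the modified algorithm enjoys a convergence guarantee similar to that of the original one.

Theorem 5.1 Let $\rho = \max_f \|M\hat{\mu}^{\mathrm{new}}(f) - \hat{c}\|_\infty$. For $\eta = \nu/(2\rho^2 B)$, the modified ExpGrad will return a $\nu$-approximate saddle point of $L_{\mathrm{new}}$ in at most $4\rho^2 B^2 \log(|\mathcal{K}| + 1)/\nu^2$ iterations.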
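As a worked instance of Theorem 5.1, the sketch below evaluates the prescribed learning rate and the iteration bound for given $\rho$, $B$, $\nu$ and $|\mathcal{K}|$. The function name and the numerical values are hypothetical and chosen only for illustration; the logarithm is taken to be natural, and the bound is rounded up to an integer.

```python
import math


def expgrad_schedule(rho, B, nu, num_constraints):
    """Learning rate and iteration bound from Theorem 5.1.

    eta = nu / (2 rho^2 B); iterations <= 4 rho^2 B^2 log(|K| + 1) / nu^2.
    """
    eta = nu / (2.0 * rho ** 2 * B)
    max_iters = math.ceil(
        4.0 * rho ** 2 * B ** 2 * math.log(num_constraints + 1) / nu ** 2
    )
    return eta, max_iters


# Binary labels and two groups: |K| = |A| * |Y| * 2 = 8 signed constraints.
eta, max_iters = expgrad_schedule(rho=1.0, B=10.0, nu=0.1, num_constraints=8)
```

The bound grows quadratically in $B/\nu$ but only logarithmically in the number of constraints, so tightening the tolerance is far more expensive than adding sensitive groups.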
Grid search (GridS). When the number of constraints is small, e.g., when there are only a few sensitive attributes, one may directly perform a grid search over the $\lambda$ vectors to identify the deterministic classifier that attains the best trade-off between accuracy and fairness. In practice, GridS is preferred for larger models due to its memory efficiency, since ExpGrad needs to store all intermediate models to compute the randomized classifier at prediction time, which is less feasible for over-parameterized models. Algorithm 1 describes our flexible approach combined with GridS as used in practice in the official code base FairLearn (Bird et al., 2020).

6. EXPERIMENTS

We now apply our flexible approach to several datasets in a classification task with a sensitive attribute. Although our method is proposed for over-parameterized models, it can also boost the performance of small models. Depending on the specific dataset and model architecture, we use either the grid search or the exponentiated gradient method developed by Agarwal et al. (2018) as the fairness algorithm to enforce the fairness constraints, while adding our FIFA loss in the inner training loop. Note that our method can be combined with other fairness algorithms.

Datasets. We choose a large image dataset and two simpler datasets. We use the official train-test split of these datasets. More details and statistics are in the Appendix. (i). CelebA (Liu et al., 2015): the task is to predict whether the person in the image has blond hair or not, where the sensitive attribute is the gender of the person. (ii). AdultIncome (Dua & Graff, 2017): the task is to predict whether the income is above 50K per year, where the sensitive attribute is the gender. We also use the new AdultIncome (from California in 2021) introduced by Ding et al. (2021), where the sensitive attribute is the race. (iii). DutchConsensus (voor de Statistiek (Statistics Netherlands)): the task is to predict whether an individual has a prestigious occupation, and the sensitive attribute is the gender. Both the AdultIncome and DutchConsensus datasets are also used in Agarwal et al. (2018).

Method. For reasons of computational feasibility (ExpGrad needs to store all intermediate models at prediction time), we combine Grid Search with FIFA for the CelebA dataset with ResNet-18, and use both Grid Search and Exponentiated Gradient on AdultIncome with logistic regression. Besides $C$ and $\delta_{i,a}$, we also treat $\alpha$ (in Eq. (2)) as a tuning parameter.
We then perform hyper-parameter sweeps on the grids (if used) over $C$, $\delta_{i,a}$ and $\alpha$ for FIFA, and on the grids (if used) for vanilla training (fairness algorithms combined with the vanilla softmax-cross-entropy loss). More details are included in the Appendix.

Evaluation and Generalization. When evaluating a model, we are mostly interested in the generalization performance measured by a combined loss that takes into consideration both fairness violation and balanced error. We define the combined loss as $L_{\mathrm{cb}}[f] = \frac{1}{2} L_{\mathrm{bal}}[f] + \frac{1}{2} L_{\mathrm{fv}}[f]$, which favors classifiers that perform equally well in terms of classification and fairness. We consider both the value of the combined loss evaluated on the test set $S_{\mathrm{test}}$ and the generalization error, which for a loss $L$ is defined as $\mathrm{GenErr}[L, f] = |L[f](S_{\mathrm{test}}) - L[f](S_{\mathrm{train}})|$.

[Figure 4 (caption, beginning truncated): ... and generalization gaps (4c, 4d) of the combined loss and the fairness loss on the CelebA dataset. We repeat the experiment 20 times using the hyper-parameters corresponding to the best-performing models in Table 1. The solid blue line marks grid search combined with vanilla training, whereas the dashed orange line marks grid search combined with the FIFA loss. We also plot a 95% confidence band based on the 20 repeated runs. We observe that FIFA has significantly better generalization performance, in terms of both smaller losses on the test set and narrower generalization gaps.]
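The evaluation quantities above can be computed from predictions as sketched below. This is an illustrative plain-Python version under stated assumptions, not the experimental code: $L_{\mathrm{fv}}$ is taken to be the sum over classes of the absolute gap between the two groups (the choice discussed in Section 3), there are exactly two groups, every subgroup is non-empty, and all function names and toy data are hypothetical.

```python
def balanced_error(preds, labels):
    """L_bal: average of the per-class misclassification rates."""
    classes = sorted(set(labels))
    rates = []
    for c in classes:
        idx = [j for j, y in enumerate(labels) if y == c]
        rates.append(sum(preds[j] != c for j in idx) / len(idx))
    return sum(rates) / len(classes)


def fairness_violation(preds, labels, attrs, groups=("a1", "a2")):
    """L_fv: sum over classes i of |P(h=i | Y=i, A=a1) - P(h=i | Y=i, A=a2)|."""
    total = 0.0
    for c in sorted(set(labels)):
        rates = []
        for g in groups:
            idx = [j for j in range(len(labels)) if labels[j] == c and attrs[j] == g]
            rates.append(sum(preds[j] == c for j in idx) / len(idx))
        total += abs(rates[0] - rates[1])
    return total


def combined_loss(preds, labels, attrs):
    """L_cb = 0.5 * L_bal + 0.5 * L_fv."""
    return 0.5 * balanced_error(preds, labels) + 0.5 * fairness_violation(
        preds, labels, attrs
    )


def gen_err(loss_test, loss_train):
    """GenErr = |L(S_test) - L(S_train)|."""
    return abs(loss_test - loss_train)


# Hypothetical toy split: perfect fit on training data, imperfect on test data.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
attrs = ["a1", "a1", "a2", "a2", "a1", "a1", "a2", "a2"]
train_preds = list(labels)
test_preds = [1, 1, 0, 1, 0, 0, 1, 0]
gap = gen_err(
    combined_loss(test_preds, labels, attrs),
    combined_loss(train_preds, labels, attrs),
)
```

In this toy split the classifier is perfectly accurate and perfectly fair on the training data, yet the test errors fall entirely on group a2, so the fairness term contributes most of the generalization gap.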

6.1. EFFECTIVENESS OF FIFA ON OVER-PARAMETERIZED MODELS

In this subsection, we thoroughly analyze the results of applying FIFA to over-parameterized models for the CelebA dataset using ResNet-18. We use the grid search algorithm with fairness violation tolerance parameter $\epsilon \in \{0.01, 0.05, 0.1\}$ (with a slight abuse of notation) for all constraints. We perform sweeps on hyper-parameters $C \in [0, 0.01]$, $\alpha \in [0, 0.01]$, $\delta_{0,\mathrm{Male}}, \delta_{1,\mathrm{Male}} \in [0, 0.01]$, and $\delta_{0,\mathrm{Female}} = \delta_{1,\mathrm{Female}} = 0$. As a special case that may be of interest, when $\alpha = 0$ and $\delta_{0,\mathrm{Male}} = \delta_{1,\mathrm{Male}} = 0$, the FIFA loss coincides with the LDAM loss proposed in Cao et al. (2019), with one common hyper-parameter $C \in [0, 0.01]$. We log the losses on the whole training and test sets. We summarize our main findings below and give more details in the Appendix, including experiments with DP constraints, delayed reweighting (DRW, (Cao et al., 2019)), and reweighting methods.

Logits-based methods improve fairness generalization. We summarize the best results for each method under different tolerance parameters $\epsilon$ in Table 1. Note that the actual violation may exceed the tolerance $\epsilon$ on test data. Both FIFA and LDAM significantly improve the test performance in terms of both combined loss and fairness violation for all three choices of $\epsilon$, while having comparable training performance (omitted in the table). Interestingly, directly applying the reductions-based method with a vanilla model in the inner loop (the "Vanilla" columns) seems inferior, likely due to the imbalance across subgroups. This implies the effectiveness and necessity of using logits-based methods to ensure better fairness generalization.

FIFA accommodates both fairness generalization and dataset imbalance. Although both logits-based methods improve generalization, as seen in Table 1, FIFA has significantly better generalization performance than LDAM, especially in terms of fairness violation.
For example, when $\epsilon = 0.01$ and $0.05$, FIFA achieves a test fairness violation that is at least 2% smaller than that of LDAM. This further demonstrates the importance of our theoretical motivations.

Improvements of generalization are two-fold for FIFA. When it comes to generalization, two relevant notions are often used: the absolute performance on the test set, and the generalization error between the training and test sets. We compute the generalization error in Table 1 for both combined loss and fairness violation. We observe that FIFA generally dominates LDAM and vanilla in terms of both test performance and generalization error. We further illustrate this behavior in Fig. 4, where we give a 95% confidence band over the randomness in training. FIFA outperforms vanilla by a large margin in terms of both generalization notions, and the improvements are mostly due to better fairness generalization. In fact, as suggested by the similarity in the shapes of the curves in Fig. 4c and Fig. 4d, fairness generalization dominates classification generalization, and thus improvements in fairness generalization show up more prominently in the overall performance.

[Figure 5 (caption, beginning truncated): ... are used, where each configuration is repeated 20 times independently. Here blue and orange markers correspond to vanilla and FIFA respectively, and circular and cross markers correspond to training and testing metrics respectively. We observe that our FIFA method is effective in significantly lowering the Pareto frontier compared with the vanilla method, implying that FIFA mitigates the fairness generalization issues seen in Figure 1.]

Towards a more efficient Pareto frontier. In Fig. 5(a)-(c) we plot the Pareto frontier of balanced classification error ($L_{\mathrm{bal}}$) and fairness violation ($L_{\mathrm{fv}}$) for all three choices of $\epsilon$. In practice, one may be interested in a specific convex combination of the fairness violation and balanced error.
We thus consider the λ-weighted combined loss $L_\lambda = \lambda L_{\mathrm{fv}} + (1 - \lambda) L_{\mathrm{bal}}$, with λ ∈ [0, 1] being a user-specified weight. In Fig. 5(d)-(f), we compute $L_\lambda$ for a grid of 100 values of λ under the same setup. We observe that FIFA with GridS achieves frontiers that are lower and more centered than those of models trained with vanilla losses and GridS. Furthermore, for most values of the combining weight λ, FIFA achieves better test performance.
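As a minimal sketch of the λ-weighted combined loss, the snippet below evaluates $L_\lambda$ on a grid of 100 values of λ for two models. The function names and the numeric metrics are hypothetical and purely illustrative; they are not results from the paper.

```python
def combined_loss(l_fv, l_bal, lam):
    """lambda-weighted combined loss: lam * L_fv + (1 - lam) * L_bal."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * l_fv + (1.0 - lam) * l_bal


def sweep(l_fv, l_bal, num=100):
    """Evaluate the combined loss on an evenly spaced grid of `num` weights."""
    lams = [k / (num - 1) for k in range(num)]
    return [(lam, combined_loss(l_fv, l_bal, lam)) for lam in lams]


# Hypothetical test metrics for two models (illustrative numbers only).
vanilla = {"l_fv": 0.10, "l_bal": 0.08}
fifa = {"l_fv": 0.04, "l_bal": 0.07}

curve_vanilla = sweep(**vanilla)
curve_fifa = sweep(**fifa)

# Count the weights for which the second model has the lower combined loss.
wins = sum(f < v for (_, f), (_, v) in zip(curve_fifa, curve_vanilla))
```

Because $L_\lambda$ is linear in λ, a model that is better at both endpoints (λ = 0 and λ = 1) is better for every weight in between; the interesting cases in practice are those where the two curves cross.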

6.2. EFFECTIVENESS OF FIFA ON SMALLER DATASETS AND MODELS

We use logistic regression (implemented as a one-layer neural net) for the AdultIncome and DutchConsensus datasets, with sweeping procedures similar to those described in Section 6.1. Results. We tabulate the best-performing models (in terms of test combined loss) among sweeps in Table 2 and include more details in the Appendix. The observations are similar to those in Section 6.1: FIFA outperforms vanilla on both datasets across the three tolerance parameters ϵ; since the datasets are much simpler in this case, the improvements are less pronounced.

7. DISCUSSIONS AND CONCLUSIONS

Generalization (especially in over-parameterized models) has always been an important and difficult problem in machine learning research. In this paper, we present a first study of the generalization of fairness constraints, which has previously been overlooked. Our theoretically-backed FIFA approach is shown to mitigate the poor fairness generalization observed in vanilla models, large or small. We leave a more fine-grained analysis of the margins to future work.

A.2 ASSIGNMENT OF $\delta_{i,a}$ FOR MULTI-GROUPS

In this subsection, we describe how to choose $\delta_{i,a}$ for multiple demographic groups. Recall from Section A.1 that, for multiple demographic groups,

$$M_{\text{multi-groups}}[f] \lesssim \sum_{i\in\mathcal{Y}} \frac{1}{\gamma_i}\sqrt{\frac{C(\mathcal{F})}{n_i}} + \sum_{i\in\mathcal{Y},\, a\in\mathcal{A}} \frac{\bar\alpha}{\gamma_{i,a}}\sqrt{\frac{C(\mathcal{F})}{n_{i,a}}}. \tag{A.3}$$

Given $\gamma_{i,a} \ge \gamma_i$, similarly to the main text, we can take $\gamma_{i,a} = \gamma_i + \delta_{i,a}$, where $\delta_{i,a} \ge 0$ are tuning parameters. Assume there are $k$ groups. Let us first sort the $|S_{i,a}|$ in decreasing order; without loss of generality, $|S_{i,a_1}| \ge |S_{i,a_2}| \ge \cdots \ge |S_{i,a_k}|$. Thus, when tuning the parameters $\delta_{i,a}$, we set $\delta_{i,a_1} = 0$ (if there are ties with $|S_{i,a}| = |S_{i,a_1}|$, we can randomly choose another $a$ to set $\delta_{i,a} = 0$, but for simplicity we ignore this case), and we make sure that $\delta_{i,a_1} \le \delta_{i,a_2} \le \cdots \le \delta_{i,a_k}$. This is optimal in the sense that if there are $k$ constants $0 = \delta_1 \le \delta_2 \le \delta_3 \le \cdots \le \delta_k$, then

$$\sum_{i\in\mathcal{Y},\, a_j\in\mathcal{A}} \frac{\bar\alpha}{\gamma_i + \delta_{i,a_j}}\sqrt{\frac{C(\mathcal{F})}{n_{i,a_j}}} \le \sum_{i\in\mathcal{Y},\, a_j\in\mathcal{A}} \frac{\bar\alpha}{\gamma_i + \delta_{i,a_{\sigma(j)}}}\sqrt{\frac{C(\mathcal{F})}{n_{i,a_j}}},$$

where $\sigma(\cdot)$ is a permutation. In other words, our way of assigning the $\delta_{i,a}$ makes the upper bound on the RHS of Eq. (A.3) optimal. This is a direct application of the rearrangement inequality; see Lemma A.1.

Lemma A.1 For $x_1 \le x_2 \le \cdots \le x_k$, $y_1 \le y_2 \le \cdots \le y_k$, and any permutation $\sigma(\cdot)$,

$$x_k y_1 + x_{k-1} y_2 + \cdots + x_1 y_k \le x_{\sigma(1)} y_1 + x_{\sigma(2)} y_2 + \cdots + x_{\sigma(k)} y_k \le x_1 y_1 + x_2 y_2 + \cdots + x_k y_k.$$
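The ordering argument above can be checked numerically. The following sketch, under the assumption of one fixed class $i$ with hypothetical subgroup sizes, margin, and offsets (none taken from the paper), enumerates every permutation of the offsets and confirms that assigning them in nondecreasing order to subgroups sorted by decreasing size gives the smallest value of the subgroup term in Eq. (A.3).

```python
import itertools
import math


def subgroup_bound(gamma_i, deltas, sizes, alpha_bar=1.0, complexity=1.0):
    """Subgroup term of the bound for one class:
    sum_j alpha_bar / (gamma_i + delta_j) * sqrt(C(F) / n_j)."""
    return sum(
        alpha_bar / (gamma_i + d) * math.sqrt(complexity / n)
        for d, n in zip(deltas, sizes)
    )


# Hypothetical values: subgroup sizes in decreasing order,
# offsets in nondecreasing order with the first equal to zero.
sizes = [5000, 1200, 300, 40]
deltas = [0.0, 0.002, 0.005, 0.01]
gamma_i = 0.05

sorted_value = subgroup_bound(gamma_i, deltas, sizes)

# Brute force over all assignments of the same offsets to the subgroups.
all_values = [
    subgroup_bound(gamma_i, perm, sizes)
    for perm in itertools.permutations(deltas)
]
best = min(all_values)
```

The check works because $1/\sqrt{n_{i,a_j}}$ is nondecreasing in $j$ while $1/(\gamma_i + \delta_{i,a_j})$ is nonincreasing in $j$, which is exactly the oppositely ordered pairing that attains the lower bound in Lemma A.1.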

A.3 DERIVATION FOR OTHER FAIRNESS NOTIONS

In this subsection, we consider theory-inspired derivations for other fairness notions. For simplicity, we still focus on the case $\mathcal{Y} = \{0, 1\}$ (this is necessary for EqOpt) and $\mathcal{A} = \{a_1, a_2\}$. Given the derivation in this subsection, we can further derive FIFA for other fairness notions as in Section A.1.

Equalized opportunity. For EqOpt, the aim is

$$\min_f L_{\mathrm{bal}}[f] \quad \text{s.t.} \quad \forall y \in \mathcal{Y},\ a \in \mathcal{A}:\ P(h(X) = 1 \mid Y = 1, A = a) = P(h(X) = 1 \mid Y = 1).$$

This is a simpler version of EO in some sense. Directly using the derivation in Section A.1, we have $\gamma_0/\gamma_1 = n_0^{-1/4}/\tilde n_1^{-1/4}$, where the adjusted sample size is

$$\tilde n_1 = \frac{n_1\, n_{1,a_1} n_{1,a_2}}{\left(\sqrt{n_{1,a_1} n_{1,a_2}} + 2\alpha\left(\sqrt{n_1 n_{1,a_2}} + \sqrt{n_1 n_{1,a_1}}\right)\right)^2}.$$

Demographic parity. Similarly to EO, for DP the optimization aims at

$$\min_f L_{\mathrm{bal}}[f] \quad \text{s.t.} \quad \forall y \in \mathcal{Y},\ a \in \mathcal{A}:\ P(h(X) = y \mid A = a) = P(h(X) = y).$$

In this setting, we can no longer expect all training samples to be perfectly classified and the fairness constraints to be perfectly satisfied, because there is a trade-off between fairness and accuracy in the training phase for DP. However, in real applications, one commonly adopts a relaxation of the fairness constraints, i.e., $|P(h(X) = y \mid A = a) - P(h(X) = y)| < \epsilon$ for some $\epsilon > 0$ (if there are only two groups, one can alternatively use $|P(h(X) = y \mid A = a_1) - P(h(X) = y \mid A = a_2)| \le \epsilon$). When ϵ is large enough (or $P(Y = 1 \mid A = a_1)$ is close to $P(Y = 1 \mid A = a_2)$), if we use suitable models, then, as in the EO setting, we would expect all training examples to be classified perfectly while satisfying $|P(h(X) = y \mid A = a_1) - P(h(X) = y \mid A = a_2)| \le \epsilon$. We can then use similar techniques to characterize a trade-off between margins.

Specifically, a simple calculation leads to

$$\sum_{i\in\mathcal{Y}} |P(h(X) = i \mid A = a_1) - P(h(X) = i \mid A = a_2)| \le \sum_{j\in\{1,2\}} \sum_{i\in\{0,1\}} P(Y = i \mid A = a_j)\, L_{i,a_j}[f] + I,$$

where $I$ is a term not related to $f$. Thus, our optimization objective (not performance criterion) for DP can be taken as

$$L_{\mathrm{bal}}[f] + \alpha \sum_{i,a} P(Y = i \mid A = a)\, L_{i,a}[f] \tag{A.4}$$

for a weight α. We can use the training data to estimate $P(Y = i \mid A = a)$. For simplicity, we can also use $L_{\mathrm{bal}}[f] + \alpha \sum_{i,a} L_{i,a}[f]$, which shares the same upper bound as in Theorem 3.1 and also implies $\gamma_0/\gamma_1 = \tilde n_0^{-1/4}/\tilde n_1^{-1/4}$; this choice is also used in the experiments in later sections. We should also remark that when ϵ is too small, our method may not work; this is also reflected in Table 4 when ϵ = 0.01.

$f \in \mathcal{F}$,

$$M[f] \lesssim \sum_{i\in\mathcal{Y}} \frac{1}{\gamma_i}\sqrt{\frac{C(\mathcal{F})}{n_i}} + \sum_{i\in\mathcal{Y},\, a\in\mathcal{A}} \frac{2\alpha}{\gamma_{i,a}}\sqrt{\frac{C(\mathcal{F})}{n_{i,a}}} \le \sum_{i\in\mathcal{Y}} \frac{1}{\gamma_i}\sqrt{\frac{C(\mathcal{F})}{n_i}} + \sum_{i\in\mathcal{Y},\, a\in\mathcal{A}} \frac{2\alpha}{\gamma_i}\sqrt{\frac{C(\mathcal{F})}{n_{i,a}}},$$

where $\gamma_i$ is the margin of the $i$-th class's sample set $S_i$ and $\gamma_{i,a}$ is the margin of the demographic subgroup's sample set $S_{i,a}$.

The following lemma is the key lemma we will use. Let us define the empirical Rademacher complexity of $\mathcal{F}$ of subgroup/class margin on $S_*$ as

$$\hat R_i(\mathcal{F}) = \frac{1}{n_i}\, \mathbb{E}_\xi\Big[\sup_{f\in\mathcal{F}} \sum_{j\in S_i} \xi_j \big[f(x_j)_i - \max_{i'\neq i} f(x_j)_{i'}\big]\Big], \qquad \hat R_{i,a}(\mathcal{F}) = \frac{1}{n_{i,a}}\, \mathbb{E}_\xi\Big[\sup_{f\in\mathcal{F}} \sum_{j\in S_{i,a}} \xi_j \big[f(x_j)_i - \max_{i'\neq i} f(x_j)_{i'}\big]\Big],$$

where the $\xi_j$ are drawn i.i.d. from the uniform distribution on $\{-1, 1\}$.

Lemma A.2 Let $\hat L_{\gamma,i}[f] = P_{X\sim \hat P_i}(\max_{j\neq i} f(X)_j > f(X)_i - \gamma)$ and $\hat L_{\gamma,(i,a)}[f] = P_{X\sim \hat P_{i,a}}(\max_{j\neq i} f(X)_j > f(X)_i - \gamma)$. With probability at least $1 - \delta$ over the randomness of the training data, for some proper complexity measure of the class $\mathcal{F}$, for any $f \in \mathcal{F}$, $* \in \{i, (i, a) \mid i \in \mathcal{Y}, a \in \mathcal{A}\}$, and all margins $\gamma > 0$,

$$L_*[f] \lesssim \hat L_{\gamma,*}[f] + \frac{1}{\gamma} \hat R_*(\mathcal{F}) + \epsilon_*(n_*, \delta, \gamma_*), \tag{A.5}$$

where $\hat R_*(\mathcal{F})$ is the empirical Rademacher complexity of $\mathcal{F}$ of subgroup/class margin on the training dataset corresponding to the index set $S_*$, which can be further upper bounded by $\sqrt{C(\mathcal{F})/n_*}$.
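Also, $\epsilon_*(n_*, \delta, \gamma_*)$ is usually a low-order term in $n_*$.

Proof: This is a direct application of the standard margin-based generalization bound in Kakade et al. (2008). □

Proof of Theorem A.1. Notice that, since all the training samples are classified perfectly by $h$, not only is $P_{(X,Y)\sim \hat P_{\mathrm{bal}}}(h(X) \neq Y) = 0$ satisfied, but we also have $P_{(X,Y)\sim \hat P_{i,a_j}}(h(X) \neq Y) = 0$ for all $i \in \mathcal{Y}$ and $a_j \in \mathcal{A}$. We remark here that

$$P(h(X) = i \mid Y = i, A = a) = 1 - P_{(X,Y)\sim P_{i,a}}(h(X) \neq Y) = 1 - P\big(f(X)_Y < \max_{j\neq Y} f(X)_j\big).$$

Thus, we have

$$L_{\mathrm{bal}}[f] + \alpha \sum_{i\in\mathcal{Y}} |P(h(X) = i \mid Y = i, A = a_1) - P(h(X) = i \mid Y = i, A = a_2)| \le L_{\mathrm{bal}}[f] + \alpha \sum_{i\in\mathcal{Y}} \big(P_{(X,Y)\sim P_{i,a_1}}(h(X) \neq Y) + P_{(X,Y)\sim P_{i,a_2}}(h(X) \neq Y)\big).$$

Notice that $L_{\mathrm{bal}} = \frac{1}{2}\sum_{i\in\mathcal{Y}} P_{(X,Y)\sim P_i}(h(X) \neq Y)$. Then, plugging in Lemma A.2, noting that $\hat L_{\gamma,*} = 0$ (all training errors are 0) by our assumption, and ignoring the low-order terms $\epsilon_*(n_*, \delta, \gamma_*)$, we have

$$L_{\mathrm{bal}}[f] + \alpha \sum_{i\in\mathcal{Y}} \big(P_{(X,Y)\sim P_{i,a_1}}(h(X) \neq Y) + P_{(X,Y)\sim P_{i,a_2}}(h(X) \neq Y)\big) \le \sum_{i\in\mathcal{Y}} \frac{1}{2} L_i[f] + \alpha \sum_{i\in\mathcal{Y},\, a\in\mathcal{A}} L_{i,a}[f] \lesssim \sqrt{C(\mathcal{F})}\Big(\sum_{i\in\mathcal{Y}} \frac{1}{\gamma_i \sqrt{n_i}} + 2\alpha \sum_{i\in\mathcal{Y},\, a\in\mathcal{A}} \frac{1}{\gamma_{i,a} \sqrt{n_{i,a}}}\Big),$$

where we used $L_{i,a}[f] = P_{(X,Y)\sim P_{i,a}}(h(X) \neq Y)$. In the last formula, we multiply by 2 to obtain a nicer-looking expression; this does not affect the optimal ratio $\gamma_0/\gamma_1$. Also, since $\gamma_{i,a} \ge \gamma_i$, we have

$$\sqrt{C(\mathcal{F})}\Big(\sum_{i\in\mathcal{Y}} \frac{1}{\gamma_i \sqrt{n_i}} + 2\alpha \sum_{i\in\mathcal{Y},\, a\in\mathcal{A}} \frac{1}{\gamma_{i,a} \sqrt{n_{i,a}}}\Big) \le \sqrt{C(\mathcal{F})}\Big(\sum_{i\in\mathcal{Y}} \frac{1}{\gamma_i \sqrt{n_i}} + 2\alpha \sum_{i\in\mathcal{Y},\, a\in\mathcal{A}} \frac{1}{\gamma_i \sqrt{n_{i,a}}}\Big).$$

The proof is complete.

A.4.2 OPTIMIZATION OF $\gamma_0/\gamma_1$

Theorem A.2 For binary classification, let $\mathcal{F}$ be a class of neural networks with a bias term, i.e., $\mathcal{F} = \{f + b\}$, where $f$ is a neural net function and $b \in \mathbb{R}^2$ is a bias, with Rademacher complexity upper bound $\hat R_*(\mathcal{F}) \le \sqrt{C(\mathcal{F})/n_*}$. Suppose some classifier $f \in \mathcal{F}$ can achieve a total sum of margins $\gamma'_0 + \gamma'_1 = \beta$ with $\gamma'_0, \gamma'_1 > 0$. Then, there exists a classifier $f^* \in \mathcal{F}$ with margin ratio

$$\gamma^*_0/\gamma^*_1 = \tilde n_0^{-1/4}/\tilde n_1^{-1/4} = \tilde n_1^{1/4}/\tilde n_0^{1/4},$$

where the adjusted sample size is

$$\tilde n_i = \frac{n_i \prod_{a} n_{i,a}}{\Big(\sqrt{\prod_{a} n_{i,a}} + \alpha \sum_{j\in\mathcal{A}} \sqrt{n_i \prod_{a\in\mathcal{A}\setminus j} n_{i,a}}\Big)^2} \quad \text{for } i \in \mathcal{Y}.$$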
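Proof: This directly follows the proof of Theorem 3 in Cao et al. (2019). The only difference is that we need to solve

$$\min_{\gamma_0 + \gamma_1 = \beta}\ \sum_{i\in\mathcal{Y}} \frac{1}{\gamma_i}\sqrt{\frac{C(\mathcal{F})}{n_i}} + \sum_{i\in\mathcal{Y},\, a\in\mathcal{A}} \frac{2\alpha}{\gamma_i}\sqrt{\frac{C(\mathcal{F})}{n_{i,a}}}.$$

More specifically, by a simple calculation, for $\mathcal{Y} = \{0, 1\}$ and $\mathcal{A} = \{a_1, a_2\}$,

$$\sum_{i\in\mathcal{Y}} \frac{1}{\gamma_i}\sqrt{\frac{C(\mathcal{F})}{n_i}} + \sum_{i\in\mathcal{Y},\, a\in\mathcal{A}} \frac{2\alpha}{\gamma_i}\sqrt{\frac{C(\mathcal{F})}{n_{i,a}}} = \frac{1}{\gamma_0}\sqrt{\frac{C(\mathcal{F})}{\tilde n_0}} + \frac{1}{\gamma_1}\sqrt{\frac{C(\mathcal{F})}{\tilde n_1}};$$

then applying Theorem 3 in Cao et al. (2019) with the $n_i$ replaced by the $\tilde n_i$ gives the final result.

Consider two classifiers $\tilde f, f' \in \mathcal{F}$ such that $\mathrm{dist}(\tilde f, S_0)/\mathrm{dist}(\tilde f, S_1) = \tilde n_0^{-1/4}/\tilde n_1^{-1/4}$ and $\mathrm{dist}(f', S_0)/\mathrm{dist}(f', S_1) = n_0^{-1/4}/n_1^{-1/4}$. Suppose $\|\beta^*\| \gg \sqrt{p \log n}$, $\|\mu_i\| < C$, $(\mu^*_1 - \mu^*_2)^\top \beta = 0$, and $\pi_{1,a_2} \le c_1 \pi_{1,a_1}$ for a sufficiently small $c_1 > 0$. Then, when $n_0, n_1$ are sufficiently large, with high probability we have $M[\tilde f] < M[f']$.

Proof: Recall that

$$\tilde n_i = \frac{n_i \prod_{a\in\mathcal{A}} n_{i,a}}{\Big(\sqrt{\prod_{a\in\mathcal{A}} n_{i,a}} + \alpha \sum_{a\in\mathcal{A}} \sqrt{n_i n_{i,a}}\Big)^2} \quad \text{for } i \in \{0, 1\},$$

and that our training data follow the distribution

$$x \mid y = 0 \sim \sum_{i=1}^{2} \pi_{0,a_i}\, N_p(\mu_i, I), \qquad x \mid y = 1 \sim \sum_{i=1}^{2} \pi_{1,a_i}\, N_p(\mu_i + \beta^*, I).$$

We have

$$M[f] = \frac{1}{2} P(h(X) = 1 \mid Y = 0) + \frac{1}{2} P(h(X) = 0 \mid Y = 1) + \alpha \sum_{i\in\mathcal{Y}} |P(h(X) = i \mid Y = i, A = a_1) - P(h(X) = i \mid Y = i, A = a_2)|$$

$$= \frac{1}{2} \sum_{i=1}^{2} \pi_{0,a_i} \Phi\Big(\frac{\beta^{*\top}\mu_i - c}{\|\beta^*\|}\Big) + \frac{1}{2} \sum_{i=1}^{2} \pi_{1,a_i} \Phi\Big(\frac{c - \beta^{*\top}\mu_i - \|\beta^*\|^2}{\|\beta^*\|}\Big) + \alpha \Big|\Phi\Big(\frac{\beta^{*\top}\mu_0 - c}{\|\beta^*\|}\Big) - \Phi\Big(\frac{\beta^{*\top}\mu_1 - c}{\|\beta^*\|}\Big)\Big| + \alpha \Big|\Phi\Big(\frac{\beta^{*\top}\mu_0 + \|\beta^*\|^2 - c}{\|\beta^*\|}\Big) - \Phi\Big(\frac{\beta^{*\top}\mu_1 + \|\beta^*\|^2 - c}{\|\beta^*\|}\Big)\Big|.$$

For different margin ratios $\gamma$, we have $c = \mu_1^\top \beta^* + \frac{1}{1+\gamma}\|\beta^*\|^2 + O_P(\sqrt{p \log n})$, where the $O_P(\sqrt{p \log n})$ term accounts for the variation of the random samples and is based on the facts that $\|Z\|^2 \sim \chi^2_p$ if $Z \sim N_p(0, I)$ and $\max_i Z_i = O_P(p \log n)$ if $Z_1, \ldots, Z_n$ are i.i.d. $\chi^2_p$. Using the facts that $\|\beta^*\| \gg \sqrt{p \log n}$ and $\|\mu_i\| < C$, we then have $c = \big(\frac{1}{1+\gamma} + o_P(1)\big)\|\beta^*\|^2$. Similarly, we have

$$M[f] = \Phi\Big(-\frac{c}{\|\beta^*\|}\Big) + \Phi\Big(\frac{c - \|\beta^*\|^2}{\|\beta^*\|}\Big) + o(1).$$

Then let us consider the two ratios $\gamma = n_0^{-1/4}/n_1^{-1/4}$ and $\tilde\gamma = \tilde n_0^{-1/4}/\tilde n_1^{-1/4}$.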
In the following we compute $\tilde n_0$ and $\tilde n_1$. When $\pi_{1,a_2} \le c_1 \pi_{1,a_1}$,

$$\tilde n_1 = \frac{n_1 \prod_{a\in\mathcal{A}} n_{1,a}}{\Big(\sqrt{\prod_{a\in\mathcal{A}} n_{1,a}} + \alpha \sum_{a\in\mathcal{A}} \sqrt{n_1 n_{1,a}}\Big)^2} \in (0.9\, n_1, n_1).$$

When $\pi_{0,a_2} = \pi_{0,a_1}$,

$$\tilde n_0 = \frac{n_0 \prod_{a\in\mathcal{A}} n_{0,a}}{\Big(\sqrt{\prod_{a\in\mathcal{A}} n_{0,a}} + \alpha \sum_{a\in\mathcal{A}} \sqrt{n_0 n_{0,a}}\Big)^2} = \Big(\frac{1}{1 + 2\sqrt{2}}\Big)^2 n_0.$$

As a result, we have $\tilde\gamma/\gamma \in [1.9, 2]$. When the data is imbalanced such that $\gamma = (n_1/n_0)^{1/4} > 1$, we have $0 < \frac{1}{1+\tilde\gamma} < \frac{1}{1+\gamma} < 1/2$, and consequently

$$\Phi\Big(-\frac{\frac{1}{1+\tilde\gamma}\|\beta^*\|^2}{\|\beta^*\|}\Big) + \Phi\Big(\frac{\frac{1}{1+\tilde\gamma}\|\beta^*\|^2 - \|\beta^*\|^2}{\|\beta^*\|}\Big) < \Phi\Big(-\frac{\frac{1}{1+\gamma}\|\beta^*\|^2}{\|\beta^*\|}\Big) + \Phi\Big(\frac{\frac{1}{1+\gamma}\|\beta^*\|^2 - \|\beta^*\|^2}{\|\beta^*\|}\Big).$$

Additionally, when $\|\beta^*\| \to \infty$, the second term in $M[f]$, the fairness violation error, is $o(1)$. Combining all the pieces, we have $M[\tilde f] < M[f']$. □

AdultIncome and DutchConsensus. AdultIncome and DutchConsensus are two relatively small datasets that have been used to benchmark various fair classification algorithms, such as Agarwal et al. (2018). We convert all categorical variables to dummies and use the standard z-normalization to pre-process the data. There are 107 features in AdultIncome and 59 in DutchConsensus, both counting the sensitive attribute, gender. We intend to use these datasets to test smaller models such as logistic regression, which we implement as a one-layer neural net for consistency. The model is trained with full-batch updates using Adam with learning rate $1 \times 10^{-4}$ and weight decay $5 \times 10^{-5}$ for 10000 epochs. We set $\delta_{0,\mathrm{Female}}, \delta_{1,\mathrm{Female}} \in [0, 0.01]$ and $\delta_{0,\mathrm{Male}} = \delta_{1,\mathrm{Male}} = 0$ for the AdultIncome dataset, and $\delta_{0,\mathrm{Male}}, \delta_{1,\mathrm{Female}} \in [0, 0.01]$ and $\delta_{0,\mathrm{Female}} = \delta_{1,\mathrm{Male}} = 0$ for the DutchConsensus dataset. All models converge under this training schedule, as measured by the training metrics. Although the exact pre-processing procedures for these two datasets are not available in Agarwal et al. (2018), we found that on vanilla models under both the GridSearch and ExponentiatedGradient methods, the training and test performance (measured by total accuracy and fairness violation) is comparable with that reported in Agarwal et al. (2018).

Computational resource considerations. We perform all experiments on NVIDIA RTX 2080 Ti GPUs. Each experiment on CelebA usually takes less than two hours (wall-clock time) and each experiment on AdultIncome and DutchConsensus takes less than ten minutes.

C ADDITIONAL EXPERIMENTAL RESULTS

C.1 A TRAJECTORY ANALYSIS ON CELEBA

One observation we made in Table 1 is that the improvement in the generalization of the combined loss on CelebA is largely due to the improved generalization performance on fairness violations. It is natural to wonder whether this behavior suggests that the sweet spots of generalization performance for balanced error and fairness violation may not be aligned, i.e., whether there is a difference in the training time scales at which these two metrics reach their optimal generalization. Furthermore, it is also open whether one could enforce a certain early-stopping procedure (e.g., on the combined loss or the fairness violation) such that the generalization of vanilla models may be improved. To explore these two questions, we plot the trajectories of training and test metrics for FIFA and vanilla (with hyper-parameters chosen to be those corresponding to the best-performing models in Table 1) in Fig. 7. We observe that, for the vanilla models, it is difficult to (i) identify sweet spots of the generalization gaps; and (ii) enforce a reasonable early-stopping criterion that improves their generalization performance.

C.2 CELEBA AND THE DP CONSTRAINT

We presented in Table 1 our main results on the CelebA dataset trained with grid search under the EO constraint. We show in Table 4 the results under the DP constraint. Here all training configurations are the same as in Table 1, except that we replace the EO constraint by the DP constraint. For ease of comparison, we also recall the results on EO in Table 4. We also give in Table 5 the grid search results on the AdultIncome dataset, for both EO and DP constraints. We observe that on this small dataset and small model, FIFA can also improve generalization performance for both EO and DP constraints. This further exhibits the flexibility of the FIFA approach.

C.4 EXPGRAD ON THE NEW ADULTINCOME DATASET

We give in Table 6 the results of applying FIFA+ExpGrad to the new Adult Income dataset. We use the employment data from California in 2021 and keep the same yearly-income threshold of 50K to construct the label. However, the sensitive attribute is now race, coded as 0 for White and 1 for Black. We use the recommended covariates, which after dummification result in 20 covariates and 129563 samples. We split the samples randomly into an 80% training set and a 20% test set. The other configurations remain the same.

C.5 PER-GROUP RESULTS

To better understand the per-group performance, we take a deeper look at (i) CelebA using FIFA+GridS; (ii) AdultIncome using FIFA+GridS; and (iii) New AdultIncome using FIFA+ExpGrad, all optimizing over the EO constraint. We show in Table 7 the per-group accuracies and in Table 10 the differences between FPR and FNR across sensitive groups. Note that since the New AdultIncome dataset only contains 20 covariates, the performance of the different methods differs insignificantly.
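For concreteness, per-group accuracy and the FPR/FNR differences across two sensitive groups can be computed as in the sketch below. The helper names `group_rates` and `rate_gaps` and the toy labels are hypothetical; this is an illustration of the metrics for binary labels, not the evaluation code used for the tables.

```python
def group_rates(y_true, y_pred, groups):
    """Per-group accuracy, false positive rate, and false negative rate
    for binary labels in {0, 1}."""
    stats = {}
    for g in sorted(set(groups)):
        idx = [k for k, a in enumerate(groups) if a == g]
        tp = sum(1 for k in idx if y_true[k] == 1 and y_pred[k] == 1)
        tn = sum(1 for k in idx if y_true[k] == 0 and y_pred[k] == 0)
        fp = sum(1 for k in idx if y_true[k] == 0 and y_pred[k] == 1)
        fn = sum(1 for k in idx if y_true[k] == 1 and y_pred[k] == 0)
        stats[g] = {
            "acc": (tp + tn) / len(idx),
            # Rates are undefined (NaN) if a group lacks negatives/positives.
            "fpr": fp / (fp + tn) if fp + tn else float("nan"),
            "fnr": fn / (fn + tp) if fn + tp else float("nan"),
        }
    return stats


def rate_gaps(stats, g1, g2):
    """Absolute FPR and FNR differences between two groups."""
    return {
        "fpr_gap": abs(stats[g1]["fpr"] - stats[g2]["fpr"]),
        "fnr_gap": abs(stats[g1]["fnr"] - stats[g2]["fnr"]),
    }


# Toy example with two groups of four samples each (illustrative only).
y_true = [0, 0, 1, 1, 0, 0, 1, 1]
y_pred = [0, 1, 1, 1, 0, 0, 0, 1]
groups = ["a1"] * 4 + ["a2"] * 4

stats = group_rates(y_true, y_pred, groups)
gaps = rate_gaps(stats, "a1", "a2")
```

For two groups and binary labels, the sum of the two gaps equals the EO violation $\sum_{i} |P(h(X) = i \mid Y = i, A = a_1) - P(h(X) = i \mid Y = i, A = a_2)|$, since the true negative and true positive rates are one minus the FPR and FNR respectively.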



Figure 1: Each marker corresponds to a sufficiently well-trained ResNet-10 model trained on the imbalanced image classification dataset CelebA (Liu et al., 2015). The generalization of fairness constraints (EqualizedOdds) is substantially worse than the generalization of classification error.

Figure 2: The subgroups (sensitive attribute, either Male or Female, and label class, either + or -) are very imbalanced in many popular datasets across different domains.

Figure 3: Illustration of $\delta_{i,a}$ and the margin $\gamma$ of a classifier: $\delta_{1,a_1}$ is set to be non-negative and $\delta_{1,a_2}$ is set to be zero, as the subgroup $(1, a_2)$ is closer to the decision boundary than $(1, a_1)$.

Figure 4: Loss on the test set (4a, 4b) and generalization gaps (4c, 4d) of the combined loss and the fairness loss on the CelebA dataset. We repeat the experiment 20 times using the hyper-parameters corresponding to the best-performing models in Table 1. The solid blue line marks grid search combined with vanilla training, whereas the dashed orange line marks grid search combined with the FIFA loss. We also plot 95% confidence bands based on the 20 repeated runs. We observe that our method FIFA has significantly better generalization performance in terms of both smaller losses on the test set and narrower generalization gaps.


Figure 5: Pareto frontiers of the balanced loss ($L_{\mathrm{bal}}$) and fairness loss ($L_{\mathrm{fv}}$) on CelebA using ResNet-18, with grid search combined with FIFA and with the vanilla softmax-cross-entropy loss, respectively. Panels (a)-(c) show the Pareto frontiers and panels (d)-(f) illustrate the λ-combined loss. The best-performing hyper-parameters from Table 1 are used, and each configuration is repeated 20 times independently. Blue and orange markers correspond to vanilla and FIFA, respectively; circular and cross markers correspond to training and testing metrics, respectively. We observe that our FIFA method is effective in significantly lowering the Pareto frontier compared with the vanilla method, implying that FIFA mitigates the fairness generalization issues seen in Figure 1.

A.4 OMITTED PROOFS

A.4.1 PROOF OF THEOREM 3.1

Theorem A.1 (Restatement of Theorem 3.1) With high probability over the randomness of the training data, for $\mathcal{Y} = \{0, 1\}$, $\mathcal{A} = \{a_1, a_2\}$, and for some proper complexity measure $C(\mathcal{F})$ of the class $\mathcal{F}$, for any

of Example 3.1) Given a function $f$ and a set $S$, let $\mathrm{dist}(f, S) = \min_{x,\, s\in S} \|f(x) - s\|_2$.

Algorithm 1 FIFA Combined Grid Search

ExpGrad obtains the best randomized classifier, by sampling a classifier h ∈ H from a distribution over H. Formally, this optimization can be formulated as min Q∈∆ H

Grid search with EO constraint on the CelebA dataset (Liu et al., 2015) using ResNet-18; best results with respect to test combined loss among sweeps of hyper-parameters are shown. As an interesting special case of our FIFA method, we note that although the LDAM method improves the performance compared with vanilla GS, it is not as effective as our method.

Exponentiated gradient with EO constraint on the AdultIncome and DutchConsensus datasets using logistic regression (as a one-layer neural net); best results with respect to test combined loss among sweeps of hyper-parameters are shown.

Training and test trajectories of different metrics of ResNet-18 on the CelebA dataset under FIFA and vanilla losses, respectively. We note that the generalization performance of vanilla models is consistently poor as training time increases, suggesting that it is difficult to devise an early-stopping scheme that might alleviate poor fairness generalization.

that under this configuration the models usually converge within the first 1500 iterations in terms of training losses, and thus we fix the training time as 8000 iterations, which corresponds to roughly four epochs.

The observations are similar to those we made for Table 1, namely, FIFA improves significantly on the combined loss compared with vanilla.

Grid search with EO and DP constraints on the CelebA dataset (Liu et al., 2015) using ResNet-18; best results with respect to test combined loss among sweeps of hyper-parameters are shown.

C.3 GRID SEARCH ON ADULTINCOME (DP AND EO)

Grid search with EO and DP constraint on AdultIncome dataset using logistic regression, best results with respect to test combined loss among sweeps of hyper-parameters are shown.

FIFA+ExpGrad on the New Adult Income dataset.

Per-group accuracy on the CelebA dataset from FIFA+GridS (EO).

Per-group accuracy on the AdultIncome dataset from FIFA+GridS (EO).

ACKNOWLEDGEMENTS

The research of Jiayao Zhang is partially supported by DARPA FA8750-19-2-1004, IARPA 2019-19051600006, and ONR N00014-19-1-2620. The research of Linjun Zhang is partially supported by NSF DMS-2015378. The research of James Zou is partially supported by funding from NSF CAREER and the Sloan Fellowship. We would also like to thank the support of Kaiser Permanente Washington Health Research Institute and funding R01 MH125821.

ETHICS STATEMENT

We mainly consider specific types of fairness constraints, namely equalized odds, equalized opportunity, and demographic parity, which may be a limitation from an ethical perspective. We hope our argument can be generalized to other fairness notions in the future.

REPRODUCIBILITY STATEMENT

Our code is available to the public on GitHub at https://github.com/zjiayao/fifa-iclr23.

Appendix

A OMITTED DERIVATION

In this section, we provide several details omitted from the main text.

A.1 EXTENSION TO MULTI-CLASSES AND MULTI-GROUPS FOR EQUALIZED ODDS

We discussed the case $\mathcal{Y} = \{0, 1\}$ and $\mathcal{A} = \{a_1, a_2\}$ in the main text. In this subsection, we discuss the extension to multi-classes and multi-groups.

First, we extend the case to $\mathcal{Y} = \{0, 1\}$ and $|\mathcal{A}| \ge 2$. In that case, we can consider the constraints

$$\sum_{a, a' \in \mathcal{A},\, a \neq a'}\ \sum_{i\in\mathcal{Y}} |P(h(X) = i \mid Y = i, A = a) - P(h(X) = i \mid Y = i, A = a')|,$$

regardless of the order of $(a, a')$; there are $\binom{|\mathcal{A}|}{2}$ pairs $(a, a')$. Thus, our performance criterion $M_{\text{multi-groups}}[f]$ can be taken as: We overload the notation $\bar\alpha = 2\alpha(|\mathcal{A}| - 1)$, then we have Notice that the difference between Eq. (A.1) and Eq. (2) is that in Eq. (A.1), $|\mathcal{A}| \ge 2$. Thus, by a proof similar to that of Theorem A.1, we can obtain, where the adjusted sample size Given the results above, for multiple classes with multiple groups, for $i, j \in \mathcal{Y}$, where the adjusted sample size is

$$\tilde n_i = \frac{n_i \prod_{a} n_{i,a}}{\Big(\sqrt{\prod_{a} n_{i,a}} + \bar\alpha \sum_{j\in\mathcal{A}} \sqrt{n_i \prod_{a\in\mathcal{A}\setminus j} n_{i,a}}\Big)^2} \quad \text{for } i \in \mathcal{Y}.$$

FIFA for multi-classes and multi-groups. We will demonstrate how to apply the above motivations to design better margin losses. Consider a logits-based loss $\ell((x, y); f) = \ell(f(x)_y, \{f(x)_i\}_{i\in\mathcal{Y}\setminus y})$, which is non-increasing with respect to its first coordinate if we fix the second coordinate. Our flexible imbalance-fairness-aware (FIFA) approach modifies the above losses during training $\ell_{\mathrm{FIFA}}((x, y, a); f$ where $\Delta_{i,a} = C/\tilde n_i^{1/4} + \delta_{i,a}$ and $\delta_{i,a} \ge 0$. The specific assignment of $\delta_{i,a} \ge 0$ is described in Section A.2.

A.4.4 PROOF OF THEOREM 5.1

, the modified ExpGrad will return a $\nu$-approximate saddle point of $L_{\mathrm{new}}$ in at most $4\rho^2 B^2 \log(|K| + 1)/\nu^2$ iterations.

Proof: We consider an extended version of $h$ in Agarwal et al. (2018), which is a function of $(x, y, a)$ instead of just a function of $x$, i.e., $h : (x, y, a) \to \{0, 1\}$. Notice that $\mu(h)$ also satisfies the requirement in Agarwal et al. (2018) with the extended version of $h$. Thus, the result follows directly from the classic result in Freund & Schapire (1996) and Theorem 1 in Agarwal et al. (2018). □
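The adjusted sample size $\tilde n_i$ and the margin $\Delta_{i,a} = C/\tilde n_i^{1/4} + \delta_{i,a}$ from Section A.1 can be sketched in a few lines. The functions below are hypothetical helpers written for illustration. In particular, `margin_cross_entropy` assumes that the margin is enforced by subtracting $\Delta_{y,a}$ from the true-class logit inside a softmax cross-entropy, in the style of the LDAM loss of Cao et al. (2019); that form is an assumption of this sketch rather than a statement of the exact FIFA loss.

```python
import math


def adjusted_sample_size(n_i, subgroup_sizes, alpha_bar):
    """Adjusted sample size for class i with subgroup sizes n_{i,a}:
    n_i * prod_a n_{i,a}
      / (sqrt(prod_a n_{i,a}) + alpha_bar * sum_j sqrt(n_i * prod_{a != j} n_{i,a}))^2
    """
    prod_all = math.prod(subgroup_sizes)
    # prod_{a != j} n_{i,a} equals prod_all / n_{i,j}.
    cross = sum(math.sqrt(n_i * prod_all / n_ij) for n_ij in subgroup_sizes)
    return n_i * prod_all / (math.sqrt(prod_all) + alpha_bar * cross) ** 2


def fifa_margin(C, n_tilde_i, delta_ia):
    """Margin Delta_{i,a} = C / n_tilde_i^(1/4) + delta_{i,a}."""
    return C / n_tilde_i ** 0.25 + delta_ia


def margin_cross_entropy(logits, y, margin):
    """Softmax cross-entropy with the true-class logit reduced by `margin`
    (assumed LDAM-style enforcement of the margin)."""
    shifted = list(logits)
    shifted[y] -= margin
    m = max(shifted)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in shifted))
    return log_z - shifted[y]


# Hypothetical class with 1000 samples split evenly over two groups.
n_tilde = adjusted_sample_size(1000, [500, 500], alpha_bar=1.0)
margin = fifa_margin(C=0.01, n_tilde_i=n_tilde, delta_ia=0.005)

plain = margin_cross_entropy([2.0, 0.5], y=0, margin=0.0)
adjusted = margin_cross_entropy([2.0, 0.5], y=0, margin=margin)
```

With two equally sized groups and a weight of 1, the helper reproduces the closed form $\tilde n_0 = (1/(1 + 2\sqrt{2}))^2 n_0$ used in the proof of Example 3.1, and with a weight of 0 it returns $n_i$ unchanged. For very large subgroup counts the product of sizes may exceed floating-point range, in which case the computation would need to be carried out in log space.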

A.5 COMBINATION WITH OTHER ALGORITHMS

As stated earlier, the algorithm presented in the main text is just one example of an algorithm that can be combined with our approach. FIFA can also be applied to many other popular algorithms, such as fair representation learning (Madras et al., 2018). We show here how to combine FIFA with fair representation learning.

In Madras et al. (2018), there are several components: an encoder $\rho$, an adversary $v$, a decoder $k$, and a predictor $g$. The optimization is: where for cross-entropy loss $\ell_c$, decoding loss $\ell_{\mathrm{dec}}$, and adversary loss $\ell_{\mathrm{adv}}$. We can modify the cross-entropy loss $\ell_c$ to $\ell_{\mathrm{FIFA}}$. So, Actually, for $\ell_{\mathrm{adv}}$, we can similarly modify the indices, but this is somewhat complicated and notation-heavy, so we omit it here.

We use the official train-test split for the CelebA dataset. For AdultIncome and DutchConsensus, we use the train_test_split procedure of the scikit-learn package with a training-test set ratio of 0.8 and a random seed of 1 to generate the training and test sets. We tabulate the sizes of the subgroups in Table 3.

B IMPLEMENTATION DETAILS

The sweeps are done on the wandb platform (Biewald, 2020), where all hyper-parameters except for the grid are searched using its built-in Bayesian backend. All models for the same dataset are trained for a fixed number of epochs, at which the training accuracies converge. Batch training with size 128 is used for CelebA and full-batch training is used for AdultIncome. As a special case of FIFA, when $\delta_{i,a} = 0$ for all $i, a$ and α = 0, the FIFA loss degenerates to the non-fairness-aware LDAM loss proposed in Cao et al. (2019); FIFA further fine-tunes $\delta_{i,a}$ and α. To ensure a fair comparison, we set the same coverage for the common hyper-parameter C in the sweeps. In Fig. 6, we show the histogram of the hyper-parameter C in the sweeps for FIFA and LDAM. Note that the sweeps cover approximately the same range for this common hyper-parameter.

Details on CelebA. We use the same pre-processing steps as in Sagawa et al. (2020) to crop the images in CelebA to 224 × 224 × 3 and perform the same z-normalization for both the training and test sets. We use ResNet-18 models for training, with the last layer replaced by the NormLinear layer used by Cao et al. (2019), which ensures that the input as well as the columns of the weight matrix (with 2 rows corresponding to each label class) have norm 1. This ensures that our adjustments on the logits are comparable. We use the Adam optimizer with learning rate $1 \times 10^{-4}$ and weight decay $5 \times 10^{-5}$ to train these models with stochastic batches of size 128. We performed pilot experiments and learnt 

