FAIR MIXUP: FAIRNESS VIA INTERPOLATION

Abstract

Training classifiers under fairness constraints such as group fairness regularizes the disparities of predictions between the groups. Nevertheless, even though the constraints are satisfied during training, they might not generalize at evaluation time. To improve the generalizability of fair classifiers, we propose fair mixup, a new data augmentation strategy for imposing the fairness constraint. In particular, we show that fairness can be achieved by regularizing the models on paths of interpolated samples between the groups. We use mixup, a powerful data augmentation strategy, to generate these interpolates. We analyze fair mixup and empirically show that it ensures better generalization for both accuracy and fairness measurements on tabular, vision, and language benchmarks. The code is available at https://github.com/chingyaoc/fair-mixup.

1. INTRODUCTION

Fairness has received increasing attention in machine learning, with the aim of mitigating unjustified bias in learned models. Various statistical metrics have been proposed to measure the disparities of model outputs and performance when conditioned on sensitive attributes such as gender or race. Equipped with these metrics, one can formulate constrained optimization problems to impose fairness as a constraint. Nevertheless, these constraints do not necessarily generalize since they are data-dependent, i.e., they are estimated from finite samples. In particular, models that minimize the disparities on training sets do not necessarily achieve the same fairness metric on testing sets (Cotter et al., 2019). Conventionally, regularization is required to improve the generalization ability of a model (Zhang et al., 2016). On one hand, explicit regularization such as weight decay and dropout constrains the model capacity. On the other hand, implicit regularization such as data augmentation enlarges the support of the training distribution via prior knowledge (Hernández-García & König, 2018).

In this work, we propose a data augmentation strategy for optimizing group fairness constraints such as demographic parity (DP) and equalized odds (EO) (Barocas et al., 2019). Given two sensitive groups such as male and female, instead of directly restricting the disparity, we propose to regularize the model on interpolated distributions between them. Those augmented distributions form a path connecting the two sensitive groups. Figure 1 provides an illustrative example of the idea. The path simulates how the distribution transitions from one group to another via interpolation. Ideally, if the model is invariant to the sensitive attribute, the expected prediction of the model along the path should behave smoothly. Therefore, we propose a regularization that favors smooth transitions along the path, which provides a stronger prior on the model class.
We adopt mixup (Zhang et al., 2018b), a powerful data augmentation strategy, to construct the interpolated samples. Owing to mixup's simple form, the smoothness regularization we introduce has a closed-form expression that can be easily optimized. One disadvantage of mixup is that the interpolated samples might not lie on the natural data manifold. Verma et al. (2019) propose Manifold Mixup, which generates the mixup samples in a latent space. Previous works (Bojanowski et al., 2018; Berthelot et al., 2018) have shown that interpolations between a pair of latent features correspond to semantically meaningful, smooth interpolations in the input space. By constructing the path in the latent space, we can better capture the semantic changes while traveling between the sensitive groups, which results in a better fairness regularizer that we coin fair mixup. Empirically, fair mixup improves the generalizability for both DP and EO on tabular, computer-vision, and natural language benchmarks. Theoretically, we prove for a particular case that fair mixup corresponds to a Mahalanobis metric in the feature space in which we perform the classification. This metric ensures group fairness of the model, and involves the Jacobian of the feature map as we travel along the path. In short, this work makes the following contributions:

• We develop fair mixup, a data augmentation strategy that improves the generalization of group fairness metrics;
• We provide a theoretical analysis to deepen our understanding of the proposed method;
• We evaluate our approach via experiments on tabular, vision, and language benchmarks.

2. RELATED WORK

Machine Learning Fairness. To mitigate unjustified bias in machine learning systems, various fairness definitions have been proposed. The definitions can usually be classified into individual fairness or group fairness. A system that is individually fair will treat similar users similarly, where the similarity between individuals can be obtained via prior knowledge or metric learning (Dwork et al., 2012; Yurochkin et al., 2019). Group fairness metrics measure the statistical parity between subgroups defined by the sensitive attributes such as gender or race (Zemel et al., 2013; Louizos et al., 2015; Hardt et al., 2016). While fairness can be achieved via pre- or post-processing, optimizing fair metrics at training time can lead to the highest utility (Barocas et al., 2019). For instance, Woodworth et al. (2017) impose independence via regularizing the covariance between predictions and sensitive attributes. Zafar et al. (2017) regularize decision boundaries of convex margin-based classifiers to minimize the disparity between groups. Zhang et al. (2018a) mitigate the bias via minimizing an adversary's ability to predict sensitive attributes from predictions. Nevertheless, these constraints are data-dependent: even though they are satisfied during training, the model may behave differently at evaluation time. Agarwal et al. (2018) analyze the generalization error of fair classifiers obtained via two-player games. To improve the generalizability, Cotter et al. (2019) inherit the two-player setting while training each player on two separate datasets. In spite of the analytical solutions and theoretical guarantees, game-theoretic approaches could be hard to scale for complex model classes. In contrast, our proposed fair mixup is a general data augmentation strategy for optimizing the fairness constraints, which is easily compatible with any dataset modality or model class.
Data Augmentation and Regularization. Data augmentation expands the training data with examples generated via prior knowledge, which can be seen as an implicit regularization (Zhang et al., 2016; Hernández-García & König, 2018) where the prior is specified as virtual examples. Zhang et al. (2018b) propose mixup, which generates augmented samples via convex combinations of pairs of examples. In particular, given two examples $z_i, z_j \in \mathbb{R}^d$, where $z$ could include both input and label, mixup constructs virtual samples as $t z_i + (1 - t) z_j$ for $t \in [0, 1]$. State-of-the-art results are obtained via training on mixup samples in different modalities. Verma et al. (2019) introduce manifold mixup and show that performing mixup in a latent space further improves the generalization. While previous works focus on general learning scenarios, we show that regularizing models on mixup samples can lead to group fairness and improve generalization.
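As a minimal sketch of the convex combination described above, the following pure-Python snippet builds a virtual sample from two examples. The helper names (`mixup`, `sample_mixup`) and the default `alpha=1.0` are assumptions made for illustration; drawing $t$ from a Beta($\alpha$, $\alpha$) distribution follows the usual mixup recipe, but the value of $\alpha$ is not specified in this section.

```python
import random


def mixup(z_i, z_j, t):
    """Return the convex combination t * z_i + (1 - t) * z_j of two vectors."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if len(z_i) != len(z_j):
        raise ValueError("examples must have the same dimension")
    return [t * a + (1.0 - t) * b for a, b in zip(z_i, z_j)]


def sample_mixup(z_i, z_j, alpha=1.0, rng=random):
    """Draw t ~ Beta(alpha, alpha) (alpha is an assumed hyper-parameter) and mix."""
    t = rng.betavariate(alpha, alpha)
    return t, mixup(z_i, z_j, t)
```

Here `z_i` and `z_j` may hold only the input or the input concatenated with the label, matching the remark that $z$ could include both.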

3. GROUP FAIRNESS

Without loss of generality, we consider the standard fair binary classification setup where we obtain inputs $X \in \mathcal{X} \subset \mathbb{R}^d$, labels $Y \in \mathcal{Y} = \{0, 1\}$, sensitive attribute $A \in \{0, 1\}$, and prediction score $\hat{Y} \in [0, 1]$ from model $f : \mathbb{R}^d \to [0, 1]$. We will focus on demographic parity (DP) and equalized odds (EO) in this work, while our approach also encompasses other fairness metrics (detailed discussion in section 5). DP requires the predictions $\hat{Y}$ to be independent of the sensitive attribute $A$, that is, $P(\hat{Y} \mid A = 0) = P(\hat{Y} \mid A = 1)$. However, DP ignores the possible correlations between $Y$ and $A$ and could rule out the perfect predictor if $Y$ and $A$ are not independent ($Y \not\perp A$). EO overcomes this limitation of DP by conditioning on the label $Y$. In particular, EO requires $\hat{Y}$ and $A$ to be conditionally independent given $Y$, that is, $P(\hat{Y} \mid A = 1, Y = y) = P(\hat{Y} \mid A = 0, Y = y)$ for $y \in \{0, 1\}$. Given the difficulty of optimizing the independence constraints, Madras et al. (2018) propose the following relaxed metrics:

$$\Delta\mathrm{DP}(f) = \left| \mathbb{E}_{x \sim P_0} f(x) - \mathbb{E}_{x \sim P_1} f(x) \right|, \qquad \Delta\mathrm{EO}(f) = \sum_{y \in \{0, 1\}} \left| \mathbb{E}_{x \sim P_0^y} f(x) - \mathbb{E}_{x \sim P_1^y} f(x) \right|,$$

where we define $P_a = P(\cdot \mid A = a)$ and $P_a^y = P(\cdot \mid A = a, Y = y)$ for $a, y \in \{0, 1\}$. We denote the joint distribution of $X$ and $Y$ by $P$. Similar metrics have also been used in Agarwal et al. (2018), Wei et al. (2019), and Taskesen et al. (2020). One can formulate a penalized optimization problem to regularize the fairness measurement, for instance,

$$\text{(Gap Regularization):} \quad \min_f \ \mathbb{E}_{(x, y) \sim P} \left[ \ell(f(x), y) \right] + \lambda \, \Delta\mathrm{DP}(f), \qquad (1)$$

where $\ell$ is the classification loss. In spite of its simplicity, our experiments show that small training values of $\Delta\mathrm{DP}(f)$ do not necessarily generalize well at evaluation time (see section 6). To improve the generalizability, we introduce a data augmentation strategy via a dynamic form of group fairness metrics.
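The relaxed metrics above reduce to differences of group-wise mean scores, which the following sketch computes on a finite sample. It is an illustration under stated assumptions rather than the authors' evaluation code: the function names are hypothetical, and scores, groups, and labels are assumed to be plain Python lists of equal length.

```python
def _mean(values):
    """Arithmetic mean that fails loudly on an empty group."""
    values = list(values)
    if not values:
        raise ValueError("a group has no samples")
    return sum(values) / len(values)


def delta_dp(scores, groups):
    """|E[f(x) | A=0] - E[f(x) | A=1]| estimated from samples."""
    s0 = [s for s, a in zip(scores, groups) if a == 0]
    s1 = [s for s, a in zip(scores, groups) if a == 1]
    return abs(_mean(s0) - _mean(s1))


def delta_eo(scores, groups, labels):
    """Sum over y in {0, 1} of the DP gap restricted to samples with label y."""
    total = 0.0
    for y in (0, 1):
        s0 = [s for s, a, l in zip(scores, groups, labels) if a == 0 and l == y]
        s1 = [s for s, a, l in zip(scores, groups, labels) if a == 1 and l == y]
        total += abs(_mean(s0) - _mean(s1))
    return total
```

In the gap-regularized objective (1), `delta_dp` evaluated on a training batch would play the role of the penalty term multiplied by $\lambda$.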

4. DYNAMIC FORMULATION OF FAIRNESS: PATHS BETWEEN GROUPS

For simplicity, we will first consider ∆DP as the fairness metric, and extend our development to ∆EO in section 5. ∆DP provides a static measurement by quantifying the expected difference at $P_0$ and $P_1$. In contrast, one can consider a dynamic metric that measures the change of $\hat{Y}$ while transitioning gradually from $P_0$ to $P_1$. To convert from the static to the dynamic formulation, we start with a simple lemma that bridges two groups with an interpolator $T(x_0, x_1, t)$, which generates interpolated samples between $x_0$ and $x_1$ based on step $t$.

Lemma 1. Let $T : \mathcal{X}^2 \times [0, 1] \to \mathcal{X}$ be a function continuously differentiable w.r.t. $t$ such that $T(x_0, x_1, 0) = x_0$ and $T(x_0, x_1, 1) = x_1$. For any differentiable function $f$, we have

$$\Delta\mathrm{DP}(f) = \left| \int_0^1 \frac{d}{dt} \int f(\underbrace{T(x_0, x_1, t)}_{\text{interpolation}}) \, dP_0(x_0) \, dP_1(x_1) \, dt \right| =: \left| \int_0^1 \frac{d}{dt} \mu_f(t) \, dt \right|,$$

where we define $\mu_f(t) = \mathbb{E}_{x_0 \sim P_0, x_1 \sim P_1} f(T(x_0, x_1, t))$, the expected output of $f$ with respect to $T(x_0, x_1, t)$.

Figure 2 provides an illustrative example of the idea. Lemma 1 relaxes the binary sensitive attribute into a continuous variable $t \in [0, 1]$, where $\mu_f$ captures the behavior of $f$ while traveling from group 0 to group 1 along the path constructed with the interpolator $T$. In particular, given two examples $x_0$ and $x_1$ drawn from each group, $T$ generates interpolated samples that change smoothly with respect to $t$. For instance, given two racial backgrounds in the dataset, $\mu_f$ simulates how the prediction of $f$ changes while the data of one group smoothly transforms to another. We can then detect whether there are "unfair" changes in $\mu_f$ along the path. The dynamic formulation allows us to measure the sensitivity of $f$ with respect to a relaxed continuous sensitive attribute $t$ via the derivative $\frac{d}{dt} \mu_f(t)$. Ideally, if $f$ is invariant to the sensitive attribute, $\frac{d}{dt} \mu_f(t)$ should be small along the path from $t = 0$ to $1$.
Importantly, a small ∆DP does not imply that $|\frac{d}{dt} \mu_f(t)|$ is small for all $t \in [0, 1]$, since the derivative could fluctuate, as can be seen in Figure 2.

4.1. SMOOTHNESS REGULARIZATION

To make $f$ invariant to $t$, we propose to regularize the derivative along the path:

$$\text{(Smoothness Regularizer):} \quad R_T(f) = \int_0^1 \left| \frac{d}{dt} \mu_f(t) \right| dt.$$

Interestingly, $R_T(f)$ is the arc length of the curve defined by $\mu_f(t)$ for $t \in [0, 1]$. Now, we can interpret the problem from a geometric point of view. The interpolator $T$ defines a curve $\mu_f(t) : [0, 1] \to \mathbb{R}$, and $\Delta\mathrm{DP}(f) = |\mu_f(0) - \mu_f(1)|$ is the Euclidean distance between the endpoints of the curve at $t = 0$ and $t = 1$. Hence $\Delta\mathrm{DP}(f)$ fails to capture the behavior of $f$ while transitioning from $P_0$ to $P_1$. In contrast, regularizing the arc length $R_T(f)$ favors a smooth transition from $t = 0$ to $1$, which constrains the fluctuation of the function as the sensitive attribute changes. By Jensen's inequality, $\Delta\mathrm{DP}(f) \le R_T(f)$ for any $f$, which further justifies the validity of regularizing $\Delta\mathrm{DP}(f)$ with $R_T(f)$.
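To make the gap between the static and dynamic views concrete, here is a small numerical sketch. It assumes scalar inputs, a linear (mixup-style) interpolator, and a hypothetical oscillating predictor `wavy`; none of these come from the paper's experiments. The path length is approximated by summing absolute increments of $\mu_f$ on a grid of $t$ values.

```python
import math


def mu_f(f, xs0, xs1, t):
    """Average of f over all interpolated pairs t * x0 + (1 - t) * x1 (scalar inputs)."""
    vals = [f(t * x0 + (1.0 - t) * x1) for x0 in xs0 for x1 in xs1]
    return sum(vals) / len(vals)


def path_metrics(f, xs0, xs1, steps=1000):
    """Return (endpoint gap, approximate path length) of mu_f on a uniform grid."""
    mus = [mu_f(f, xs0, xs1, k / steps) for k in range(steps + 1)]
    gap = abs(mus[0] - mus[-1])  # plays the role of Delta DP
    length = sum(abs(b - a) for a, b in zip(mus, mus[1:]))  # approximates R_T
    return gap, length


def wavy(x):
    """Toy predictor (an assumption): equal at both endpoints, oscillating in between."""
    return 0.5 + 0.4 * math.sin(2.0 * math.pi * x)


# Group 0 is the single point 0.0 and group 1 the single point 1.0.
gap, length = path_metrics(wavy, [0.0], [1.0])
```

For this toy predictor the endpoint gap is numerically zero while the path length is about 1.6, so the inequality $\Delta\mathrm{DP}(f) \le R_T(f)$ holds with a large slack and only the smoothness regularizer detects the fluctuation.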

5. FAIR MIXUP: REGULARIZING MIXUP PATHS

It remains to determine the interpolator $T$. A good interpolator should (1) generate meaningful interpolations and (2) make the derivative of $\mu_f(\cdot)$ with respect to $t$ easy to compute. In this section, we show that mixup (Zhang et al., 2018b), a powerful data augmentation strategy, is itself a valid interpolator that satisfies both criteria.

Input Mixup. We first adopt the standard mixup (Zhang et al., 2018b) by setting the interpolator as the linear interpolation in input space: $T_{\text{mixup}}(x_0, x_1, t) = t x_0 + (1 - t) x_1$. It can be verified that $T_{\text{mixup}}$ satisfies the interpolator criterion defined in Lemma 1 up to the direction of travel (here $t = 0$ gives $x_1$ and $t = 1$ gives $x_0$), which does not affect the absolute value in ∆DP. The resulting smoothness regularizer has the following closed-form expression (footnote 1):

$$R^{\mathrm{DP}}_{\text{mixup}}(f) = \int_0^1 \left| \int \left\langle \nabla_x f(t x_0 + (1 - t) x_1), \, x_0 - x_1 \right\rangle dP_0(x_0) \, dP_1(x_1) \right| dt.$$

The regularizer can be easily optimized by computing the Jacobian of $f$ on mixup samples. Jacobian regularization is a common approach to regularize neural networks (Drucker & LeCun, 1992). For instance, regularizing the norm of the Jacobian can improve adversarial robustness (Chan et al., 2019; Hoffman et al., 2019). Here, we regularize the expected inner product between the Jacobian on mixup samples and the difference $x_0 - x_1$.

Manifold Mixup. One disadvantage of input mixup is that the curve is defined with mixup samples, which might not lie on the natural data manifold. Verma et al. (2019) propose Manifold Mixup, which generates the mixup samples in the latent space $\mathcal{Z}$. In particular, manifold mixup assumes a compositional hypothesis $f \circ g$, where $g : \mathcal{X} \to \mathcal{Z}$ is the feature encoder and the predictor $f : \mathcal{Z} \to \mathcal{Y}$ takes the encoded feature to perform prediction. Similarly, we can establish the equivalence between ∆DP and manifold mixup:

$$\Delta\mathrm{DP}(f \circ g) = \left| \int_0^1 \frac{d}{dt} \int f(\underbrace{t g(x_0) + (1 - t) g(x_1)}_{\text{Manifold Mixup}}) \, dP_0(x_0) \, dP_1(x_1) \, dt \right|,$$

which results in the following smoothness regularizer:

$$R^{\mathrm{DP}}_{\text{m-mixup}}(f \circ g) = \int_0^1 \left| \int \left\langle \nabla_z f(t g(x_0) + (1 - t) g(x_1)), \, g(x_0) - g(x_1) \right\rangle dP_0(x_0) \, dP_1(x_1) \right| dt.$$

Previous works (Bojanowski et al., 2018; Berthelot et al., 2018) have shown that interpolations between a pair of latent features correspond to semantically meaningful, smooth interpolations in input space. By constructing a curve in the latent space, we can better capture the semantic changes while traveling from $P_0$ to $P_1$.
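The input-mixup regularizer can be estimated numerically once the gradient of the model is available. The sketch below assumes a logistic model $f(x) = \sigma(\langle w, x \rangle + b)$, whose input gradient is known in closed form, and approximates the outer integral over $t$ with a midpoint rule. The model choice, the helper names, and the number of integration steps are all assumptions for illustration; a neural network would obtain the same inner product through automatic differentiation.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mixup_derivative(w, b, xs0, xs1, t):
    """Pair-averaged <grad_x f(x_t), x0 - x1> with x_t = t * x0 + (1 - t) * x1."""
    total = 0.0
    for x0 in xs0:
        for x1 in xs1:
            x_t = [t * a + (1.0 - t) * c for a, c in zip(x0, x1)]
            diff = [a - c for a, c in zip(x0, x1)]
            p = sigmoid(dot(w, x_t) + b)
            # For the logistic model, grad_x f(x) = p * (1 - p) * w.
            total += p * (1.0 - p) * dot(w, diff)
    return total / (len(xs0) * len(xs1))


def fair_mixup_dp(w, b, xs0, xs1, steps=200):
    """Midpoint-rule estimate of the integral over t of |d/dt mu_f(t)|."""
    acc = 0.0
    for k in range(steps):
        t = (k + 0.5) / steps
        acc += abs(mixup_derivative(w, b, xs0, xs1, t))
    return acc / steps
```

When the derivative keeps the same sign along the whole path, this estimate coincides with ∆DP up to integration error; it exceeds ∆DP only when $\mu_f$ fluctuates.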

Extensions and Implementation

The derivations presented so far can be easily extended to Equalized Odds (EO). In particular, Lemma 1 can be extended to ∆EO by interpolating $P_0^y$ and $P_1^y$ for $y \in \{0, 1\}$:

$$\Delta\mathrm{EO}(f) = \sum_{y \in \{0, 1\}} \left| \int_0^1 \frac{d}{dt} \int f(T(x_0, x_1, t)) \, dP_0^y(x_0) \, dP_1^y(x_1) \, dt \right|.$$

The corresponding mixup regularizers can be obtained similarly by substituting $P_0$ and $P_1$ in $R_{\text{mixup}}$ and $R_{\text{m-mixup}}$ with $P_0^y$ and $P_1^y$:

$$R^{\mathrm{EO}}_{\text{mixup}}(f) = \sum_{y \in \{0, 1\}} \int_0^1 \left| \int \left\langle \nabla_x f(t x_0 + (1 - t) x_1), \, x_0 - x_1 \right\rangle dP_0^y(x_0) \, dP_1^y(x_1) \right| dt.$$

Our formulation also encompasses other fairness metrics that quantify the expected difference between groups. This includes group fairness metrics such as accuracy equality, which compares the mistreatment rate between groups (Berk et al., 2018). Similar to equation (1), we formulate a penalized optimization problem to enforce fairness via fair mixup:

$$\text{(Fair Mixup):} \quad \min_f \ \mathbb{E}_{(x, y) \sim P} \left[ \ell(f(x), y) \right] + \lambda R_{\text{mixup}}(f).$$

Implementation-wise, we follow Zhang et al. (2018b), where only one $t$ is sampled per batch to perform mixup. This strategy works well in practice and reduces the computational requirements.
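The penalized objective with a single $t$ per batch can be sketched as follows. This is a hedged illustration, not the released implementation: it assumes a logistic model, a binary cross-entropy classification loss, index-wise pairing of the two group batches, and a caller that samples `t` (for example uniformly with `random.random()`). All function names are hypothetical.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def bce(p, y, eps=1e-12):
    """Binary cross-entropy for one prediction (assumed classification loss)."""
    return -(y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps))


def path_derivative(w, b, xs0, xs1, t):
    """Index-paired mean of <grad_x f(x_t), x0 - x1> at a single mixup step t."""
    pairs = list(zip(xs0, xs1))
    if not pairs:
        raise ValueError("need at least one sample from each group")
    total = 0.0
    for x0, x1 in pairs:
        x_t = [t * a + (1.0 - t) * c for a, c in zip(x0, x1)]
        p = sigmoid(dot(w, x_t) + b)
        total += p * (1.0 - p) * dot(w, [a - c for a, c in zip(x0, x1)])
    return total / len(pairs)


def classification_loss(w, b, samples):
    return sum(bce(sigmoid(dot(w, x) + b), y) for x, y in samples) / len(samples)


def fair_mixup_loss(w, b, group0, group1, lam, t):
    """DP variant: group0 / group1 are lists of (x, y) from each sensitive group."""
    xs0 = [x for x, _ in group0]
    xs1 = [x for x, _ in group1]
    penalty = abs(path_derivative(w, b, xs0, xs1, t))
    return classification_loss(w, b, group0 + group1) + lam * penalty


def fair_mixup_eo_loss(w, b, group0, group1, lam, t):
    """EO variant: the penalty is summed over label-conditioned sub-batches."""
    penalty = 0.0
    for y in (0, 1):
        xs0 = [x for x, lbl in group0 if lbl == y]
        xs1 = [x for x, lbl in group1 if lbl == y]
        if xs0 and xs1:  # skip a label that is missing in either group
            penalty += abs(path_derivative(w, b, xs0, xs1, t))
    return classification_loss(w, b, group0 + group1) + lam * penalty
```

Sampling one `t` per batch makes the penalty a stochastic estimate of the integral over $t$; averaging over many batches recovers the regularizer in expectation when `t` is uniform on $[0, 1]$.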

5.1. THEORETICAL ANALYSIS

To gain deeper insight, we analyze the optimal solution of fair mixup in a simple case. In particular, we consider the classification loss $\ell(f(x), y) = -y f(x)$ and the following hypothesis class:

$$\mathcal{H} = \{ f \mid f(x) = \langle v, \Phi(x) \rangle, \ v \in \mathbb{R}^m, \ \Phi : \mathcal{X} \to \mathbb{R}^m \}.$$

Define $m_\pm = \mathbb{E}_{x \sim P_\pm} \Phi(x)$, the label conditional mean embeddings, and $m_0 = \mathbb{E}_{x \sim P_0} \Phi(x)$ and $m_1 = \mathbb{E}_{x \sim P_1} \Phi(x)$, the group mean embeddings. We then define the expected differences $\delta_\pm = m_+ - m_-$ and $\delta_{0,1} = m_0 - m_1$. To derive an interpretable solution, we will consider the L2 variants of the penalized optimization problem. The following proposition gives the analytical solution when we regularize the model with ∆DP.

Proposition 2. (Gap Regularization) Consider the following minimization problem:

$$\min_{f \in \mathcal{H}} \ \mathbb{E}_{(x, y) \sim P} \left[ \ell(f(x), y) \right] + \frac{\lambda_1}{2} \Delta\mathrm{DP}(f)^2 + \frac{\lambda_2}{2} \|f\|_{\mathcal{H}}^2.$$

For a fixed embedding $\Phi$, the optimal solution $f^*$ corresponds to $v^*$ given by the following closed form:

$$v^* = \frac{1}{\lambda_2} \left( \delta_\pm - \mathrm{proj}^{\lambda_2 / \lambda_1}_{\delta_{0,1}}(\delta_\pm) \right),$$

where $\mathrm{proj}$ is the soft projection defined as $\mathrm{proj}^{\beta}_{u}(x) = \frac{u \otimes u}{\|u\|^2 + \beta} x$.

The solution $v^*$ can be interpreted as the projection of the label discriminating direction $\delta_\pm$ on the subspace that is orthogonal to the group discriminating direction $\delta_{0,1}$. By projecting to this orthogonal subspace, we can prevent the model from using group-specific directions, which are unfair directions when performing prediction. Interestingly, the projection trick has been used in Zhang et al. (2018a), where they subtract from the gradient of the model parameters in each update step its projection on unfair directions. We then derive the optimal solution of fair mixup with the same setup as above. Similarly, we introduce an L2 variant of the fair mixup regularizer defined as follows:

$$R^{\mathrm{DP\text{-}2}}_{\text{mixup}}(f) = \int_0^1 \left| \int \left\langle \nabla_x f(t x_0 + (1 - t) x_1), \, x_0 - x_1 \right\rangle dP_0(x_0) \, dP_1(x_1) \right|^2 dt,$$

where we consider the squared absolute value of the derivative within the integral in order to get a closed-form solution.

Proposition 3. (Fair Mixup) Consider the following minimization problem:

$$\min_{f \in \mathcal{H}} \ \mathbb{E}_{(x, y) \sim P} \left[ \ell(f(x), y) \right] + \frac{\lambda_1}{2} R^{\mathrm{DP\text{-}2}}_{\text{mixup}}(f) + \frac{\lambda_2}{2} \|f\|_{\mathcal{H}}^2.$$

Let $m_t = \mathbb{E}_{x_0 \sim P_0, x_1 \sim P_1} \left[ \Phi(t x_0 + (1 - t) x_1) \right]$ be the $t$-dependent mean embedding, and $\dot{m}_t$ its derivative with respect to $t$. Let $D$ be a positive semi-definite matrix defined as follows: $D = \int_0^1 \dot{m}_t \otimes \dot{m}_t \, dt$. Given an embedding $\Phi$, the optimal solution $v^*$ has the following form:

$$v^* = (\lambda_1 D + \lambda_2 I_m)^{-1} \delta_\pm.$$

Hence the optimal fair mixup classifier can finally be written as

$$f(x) = \left\langle \delta_\pm, \, (\lambda_1 D + \lambda_2 I_m)^{-1} \Phi(x) \right\rangle,$$

which means that fair mixup changes the geometry of the decision boundary via a new dot product in the feature space that ensures group fairness, instead of simply projecting on the subspace orthogonal to a single direction as in gap regularization. This dot product leads to a Mahalanobis distance in the feature space that is defined via the covariance of time derivatives of mean embeddings of intermediate densities between the groups. To understand this better, given two points $x_0$ in group 0 and $x_1$ in group 1, by the mean value theorem, there exists $x_c$ such that

$$f(x_0) = f(x_1) + \left\langle \nabla f(x_c), \, x_0 - x_1 \right\rangle = f(x_1) + \left\langle \delta_\pm, \, (\lambda_1 D + \lambda_2 I)^{-1} J\Phi(x_c)(x_0 - x_1) \right\rangle.$$

Note that $D$ provides the correct average conditioning for $J\Phi(x_c)(x_0 - x_1)$; this can be seen from the expression of $\dot{m}_t$ ($D$ is a covariance of $J\Phi(x_c)(x_0 - x_1)$). This conditioned Jacobian ensures that the function does not fluctuate much between the groups, which matches our motivation.
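The closed form of Proposition 2 can be checked numerically against its first-order condition $(\lambda_1 \delta_{0,1} \otimes \delta_{0,1} + \lambda_2 I_m) v^* = \delta_\pm$. The following sketch does so in pure Python for arbitrary vectors; the function names and the example vectors are assumptions chosen for illustration only.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def soft_projection(u, x, beta):
    """proj^beta_u(x) = (u (x) u) x / (||u||^2 + beta) = u * <u, x> / (||u||^2 + beta)."""
    scale = dot(u, x) / (dot(u, u) + beta)
    return [ui * scale for ui in u]


def gap_reg_solution(delta_pm, delta_01, lam1, lam2):
    """v* = (delta_pm - proj^{lam2/lam1}_{delta_01}(delta_pm)) / lam2."""
    proj = soft_projection(delta_01, delta_pm, lam2 / lam1)
    return [(a - p) / lam2 for a, p in zip(delta_pm, proj)]


def first_order_residual(v, delta_pm, delta_01, lam1, lam2):
    """Gradient of the objective: -delta_pm + lam1 * (d (x) d) v + lam2 * v."""
    s = dot(delta_01, v)
    return [-a + lam1 * d * s + lam2 * vi
            for a, d, vi in zip(delta_pm, delta_01, v)]


# Hypothetical label and group directions in R^3.
delta_pm = [1.0, 2.0, -0.5]
delta_01 = [0.5, -1.0, 2.0]
v_star = gap_reg_solution(delta_pm, delta_01, lam1=3.0, lam2=0.7)
residual = first_order_residual(v_star, delta_pm, delta_01, 3.0, 0.7)
```

The residual vanishes up to floating-point error, and increasing `lam1` shrinks the component of `v_star` along `delta_01`, which is the soft-projection behavior described above. Proposition 3 has the same structure with the rank-one matrix $\delta_{0,1} \otimes \delta_{0,1}$ replaced by $D$, so it requires a general linear solve instead of this rank-one shortcut.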

6. EXPERIMENTS

We now examine fair mixup with binary classification tasks on a tabular benchmark (Adult), a visual recognition benchmark (CelebA), and a language dataset (Toxicity). For evaluation, we show the trade-offs between average precision (AP) and fairness metrics (∆DP/∆EO) by varying the hyper-parameter λ in the objective. We evaluate both AP and fairness metrics on a testing set to assess the generalizability of learned models. For a fair comparison, we compare fair mixup with baselines that optimize the fairness constraint at training time. In particular, we compare our method with (a) empirical risk minimization (ERM), which trains the model without regularization, (b) gap regularization, which directly regularizes the model as given in Equation (1), and (c) adversarial debiasing (Zhang et al., 2018a), introduced in section 2. Details about the baselines and experimental setups for each dataset can be found in the appendix.

6.1. ADULT

The UCI Adult dataset (Dua & Graff, 2017) contains information about over 40,000 individuals from the 1994 US Census. The task is to predict whether the income of a person is greater than $50k given attributes about the person. We consider gender as the sensitive attribute to measure the fairness of the algorithms. The models are two-layer ReLU networks with hidden size 200. We only evaluate input mixup for the Adult dataset, as the network is not deep enough to produce meaningful latent representations. We retrain each model 10 times and report the mean accuracy and fairness measurement. In each trial, the dataset is randomly split into a training, validation, and testing set with partition 60%, 20%, and 20%, respectively. The models are then selected via the performance on the validation set. Figure 3 (a) shows the tradeoff between AP and ∆DP. We can see that fair mixup consistently achieves a better tradeoff compared to the baselines. We then show the tradeoff between AP and ∆EO in Figure 3 (b). For this metric, fair mixup performs slightly better than directly regularizing the EO gap. Interestingly, fair mixup even achieves a better AP compared to ERM, indicating that mixup regularization not only improves the generalization of fairness constraints but also overall accuracy.

To understand the effect of fair mixup, we visualize the expected output $\mu_f$ along the path for each method (i.e., $\mu_f$ as a function of $t$). For a fair comparison, we select the models that have similar AP for the visualization. As we can see in Figure 3 (c), the flatness of the path is highly correlated with ∆DP. Training without any regularization leads to the largest derivative along the path, which eventually leads to a large ∆DP. All the fairness-aware algorithms regularize the slope to some extent; nevertheless, fair mixup achieves the shortest arc length and hence leads to the smallest ∆DP.

6.2. CELEBA

Next, we show that fair mixup generalizes well to high-dimensional tasks with the CelebA face attributes dataset (Liu et al.). CelebA contains over 200,000 images of celebrity faces, where each image is associated with 40 human-labeled binary attributes including gender. Among the attributes, we select attractive, smile, and wavy hair and use them to form three binary classification tasks while treating gender as the sensitive attribute (footnote 2). We choose these three attributes because, in each of these tasks, one sensitive group has more positive samples than the other. For each task, we train a ResNet-18 (He et al., 2016) along with two hidden layers for final prediction. To implement manifold fair mixup, we interpolate the representations before the average pooling layer. The first row in Figure 4 shows the tradeoff between AP and ∆DP for each task. Again, fair mixup consistently outperforms the baselines by a large margin. We also observe that manifold mixup further boosts the performance for all the tasks. The tradeoffs between AP and ∆EO are shown in the second row of Figure 4. Again, both input mixup and manifold mixup yield classifiers that generalize well. To gain further insights, we plot the path in both input space and latent space in Figure 5 (a) and (b) for the "attractive" attribute classification task. Fair mixup leads to a smoother path in both cases. Without mixup augmentation, gap regularization and adversarial debiasing present similar paths and both have larger ∆DP. We also observe that the expected output $\mu_f$ in the latent path is almost linear with respect to the continuous sensitive attribute $t$, with manifold mixup being the curve with the smallest slope and hence the smallest ∆DP.

6.3. TOXICITY CLASSIFICATION

Lastly, we consider comment toxicity classification with the Jigsaw toxic comment dataset (Jigsaw, 2018). The data was initially released by the Civil Comments platform and was then extended to a public Kaggle challenge. The task is to predict whether a comment is toxic or not while being fair across groups. A subset of comments have been labeled with identity attributes, including gender and race. It has been shown that some of the identities (e.g., black) are correlated with the toxicity label. In this work, we consider race as the sensitive attribute and select the subset of comments that contain the identities black or asian, as these two groups have the largest gap in terms of probability of being associated with a toxic comment. We use pretrained BERT embeddings (Devlin et al., 2019) to encode each comment into a vector of size 768. A three-layer ReLU network is then trained to perform the prediction with the encoded feature. We directly adopt manifold mixup, since input mixup is equivalent to manifold mixup by simply setting the encoder $g$ to BERT. Similarly, we retrain each model 10 times using randomly split training, validation, and testing sets, and report the mean accuracy and fairness measurement. Figures 6 (a) and (b) show the tradeoff between AP and ∆DP/∆EO, respectively. Again, fair mixup consistently achieves a better tradeoff for both ∆DP and ∆EO. We then show the visualization of calibrated paths for ∆DP-regularized models in Figure 6 (c). We can see that even with the powerful BERT embedding, all the baselines present fluctuating paths with similar patterns. In contrast, fair mixup yields a nearly linear curve with a small slope, which eventually leads to the smallest ∆DP.

7. CONCLUSION

In this work, we propose fair mixup, a data augmentation strategy to optimize fairness constraints. By bridging sensitive groups with interpolated samples, fair mixup consistently improves the generalizability of fairness constraints across benchmarks with different modalities. Interesting future directions include (1) generating interpolated samples that lie on the natural data manifold with generative models or via dynamic optimal transport paths between the groups (Benamou & Brenier, 2000), (2) extending fair mixup to other group fairness metrics such as accuracy equality, and (3) estimating the generalization of fairness constraints (Chuang et al., 2020).

A PROOFS

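A.1 PROOF OF LEMMA 1

Lemma 1. Let $T : \mathcal{X}^2 \times [0, 1] \to \mathcal{X}$ be a function continuously differentiable w.r.t. $t$ such that $T(x_0, x_1, 0) = x_0$ and $T(x_0, x_1, 1) = x_1$. For any differentiable function $f$, we have

$$\Delta\mathrm{DP}(f) = \left| \int_0^1 \frac{d}{dt} \int f(\underbrace{T(x_0, x_1, t)}_{\text{interpolation}}) \, dP_0(x_0) \, dP_1(x_1) \, dt \right|.$$

Proof. The result follows from the fundamental theorem of calculus. In particular, given an interpolator $T$, we first rewrite ∆DP with $T$:

$$\Delta\mathrm{DP}(f) = \left| \mathbb{E}_{x \sim P_0} f(x) - \mathbb{E}_{x \sim P_1} f(x) \right| = \left| \mathbb{E}_{x_0 \sim P_0, x_1 \sim P_1} \left[ f(x_0) - f(x_1) \right] \right| = \left| \mathbb{E}_{x_0 \sim P_0, x_1 \sim P_1} f(T(x_0, x_1, 0)) - \mathbb{E}_{x_0 \sim P_0, x_1 \sim P_1} f(T(x_0, x_1, 1)) \right|.$$

Note that $\mathbb{E}_{x_0 \sim P_0, x_1 \sim P_1} f(T(x_0, x_1, t))$ is a real-valued continuous function of $t \in [0, 1]$. Therefore, we have the following equivalence via the fundamental theorem of calculus:

$$\Delta\mathrm{DP}(f) = \left| \mathbb{E}_{x_0 \sim P_0, x_1 \sim P_1} f(T(x_0, x_1, 0)) - \mathbb{E}_{x_0 \sim P_0, x_1 \sim P_1} f(T(x_0, x_1, 1)) \right| = \left| \int_0^1 \frac{d}{dt} \mathbb{E}_{x_0 \sim P_0, x_1 \sim P_1} f(T(x_0, x_1, t)) \, dt \right|.$$

A.2 PROOF OF PROPOSITION 2

Proposition 2. (Gap Regularization) Consider the following minimization problem:

$$\min_{f \in \mathcal{H}} \ \mathbb{E}_{(x, y) \sim P} \left[ \ell(f(x), y) \right] + \frac{\lambda_1}{2} \Delta\mathrm{DP}(f)^2 + \frac{\lambda_2}{2} \|f\|_{\mathcal{H}}^2.$$

For a fixed embedding $\Phi$, the optimal solution $f^*$ corresponds to $v^*$ given by the following closed form:

$$v^* = \frac{1}{\lambda_2} \left( \delta_\pm - \mathrm{proj}^{\lambda_2 / \lambda_1}_{\delta_{0,1}}(\delta_\pm) \right),$$

where $\mathrm{proj}$ is the soft projection defined as $\mathrm{proj}^{\beta}_{u}(x) = \frac{u \otimes u}{\|u\|^2 + \beta} x$.

Proof. The problem above can be written as follows:

$$\min_{v \in \mathbb{R}^m} L(v) := -\langle v, \delta_\pm \rangle + \frac{\lambda_1}{2} \left| \langle v, \delta_{0,1} \rangle \right|^2 + \frac{\lambda_2}{2} \|v\|_2^2.$$

Setting the first-order condition to zero,

$$\nabla_v L(v) = -\delta_\pm + \lambda_1 (\delta_{0,1} \otimes \delta_{0,1}) v + \lambda_2 v = 0,$$

we obtain $(\lambda_1 \delta_{0,1} \otimes \delta_{0,1} + \lambda_2 I_m) v^* = \delta_\pm$. By inverting and applying the Sherman-Morrison lemma, we have

$$v^* = (\lambda_1 \delta_{0,1} \otimes \delta_{0,1} + \lambda_2 I_m)^{-1} \delta_\pm = \lambda_1^{-1} \left( \frac{\lambda_2}{\lambda_1} I_m + \delta_{0,1} \otimes \delta_{0,1} \right)^{-1} \delta_\pm = \lambda_1^{-1} \left( \frac{\lambda_1}{\lambda_2} I_m - \frac{\left( \frac{\lambda_1}{\lambda_2} \right)^2 \delta_{0,1} \otimes \delta_{0,1}}{1 + \|\delta_{0,1}\|^2 \frac{\lambda_1}{\lambda_2}} \right) \delta_\pm = \frac{1}{\lambda_2} \left( I_m - \frac{\delta_{0,1} \otimes \delta_{0,1}}{\frac{\lambda_2}{\lambda_1} + \|\delta_{0,1}\|^2} \right) \delta_\pm.$$

Note that the soft projection on a vector $u$ is defined as $\mathrm{proj}^{\beta}_{u}(x) = \frac{u \otimes u}{\|u\|^2 + \beta} x$. It follows that

$$v^* = \frac{1}{\lambda_2} \left( \delta_\pm - \mathrm{proj}^{\lambda_2 / \lambda_1}_{\delta_{0,1}}(\delta_\pm) \right),$$

which can be interpreted as the projection of the label discriminating direction $\delta_\pm$ on the subspace that is orthogonal to the group discriminating direction $\delta_{0,1}$.

A.3 PROOF OF PROPOSITION 3

Proposition 3. (Fair Mixup) Consider the following minimization problem:

$$\min_{f \in \mathcal{H}} \ \mathbb{E}_{(x, y) \sim P} \left[ \ell(f(x), y) \right] + \frac{\lambda_1}{2} R^{\mathrm{DP\text{-}2}}_{\text{mixup}}(f) + \frac{\lambda_2}{2} \|f\|_{\mathcal{H}}^2.$$

Let $m_t = \mathbb{E}_{x_0 \sim P_0, x_1 \sim P_1} \left[ \Phi(t x_0 + (1 - t) x_1) \right]$ be the $t$-dependent mean embedding, and $\dot{m}_t$ its derivative with respect to $t$. Let $D$ be a positive semi-definite matrix defined as follows: $D = \int_0^1 \dot{m}_t \otimes \dot{m}_t \, dt$. Given an embedding $\Phi$, the optimal solution $v^*$ has the following form:

$$v^* = (\lambda_1 D + \lambda_2 I_m)^{-1} \delta_\pm.$$

Proof. For $t \in [0, 1]$, we denote by $\rho_t$ the distribution of $T(x_0, x_1, t)$ for $x_0 \sim P_0$, $x_1 \sim P_1$, and write $m_t = \mathbb{E}_{x \sim \rho_t} \Phi(x)$, with $\dot{m}_t$ its time derivative. We consider the L2 variant of $R_T(f)$ in the analysis:

$$R_T^2(f) = \int_0^1 \left| \frac{d}{dt} \mu_f(t) \right|^2 dt = \int_0^1 \left| \frac{d}{dt} \langle v, m_t \rangle \right|^2 dt = \int_0^1 \left| \langle v, \dot{m}_t \rangle \right|^2 dt = \left\langle v, \left( \int_0^1 \dot{m}_t \otimes \dot{m}_t \, dt \right) v \right\rangle.$$

When $T$ is mixup, $\dot{m}_t = \mathbb{E}_{x_0 \sim P_0, x_1 \sim P_1} \left[ J\Phi(t x_0 + (1 - t) x_1)(x_0 - x_1) \right]$, where $J$ denotes the Jacobian. Note that $D = \int_0^1 \dot{m}_t \otimes \dot{m}_t \, dt$; hence, for classification with the fair mixup regularizer, the problem is equivalent to

$$\min_{v \in \mathbb{R}^m} L(v) := -\langle v, \delta_\pm \rangle + \frac{\lambda_1}{2} \langle v, D v \rangle + \frac{\lambda_2}{2} \|v\|_2^2.$$

Setting the first-order condition to zero, we obtain $(\lambda_1 D + \lambda_2 I_m) v^* = \delta_\pm$, which gives the optimal solution $v^* = (\lambda_1 D + \lambda_2 I_m)^{-1} \delta_\pm$. The corresponding optimal fair mixup classifier can finally be written as $f(x) = \left\langle \delta_\pm, (\lambda_1 D + \lambda_2 I_m)^{-1} \Phi(x) \right\rangle$.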
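As a sanity check on how the two propositions relate, the following worked special case may help. It is a side remark under an added assumption (an affine embedding $\Phi(x) = Wx + c$, which the analysis above does not require), not a result stated in the paper.

```latex
% Assumption: \Phi(x) = W x + c is affine, so it commutes with convex combinations.
\begin{align*}
m_t &= \mathbb{E}_{x_0 \sim P_0,\, x_1 \sim P_1}\, \Phi\bigl(t x_0 + (1 - t) x_1\bigr)
     = t\, m_0 + (1 - t)\, m_1, \\
\dot{m}_t &= m_0 - m_1 = \delta_{0,1}, \\
D &= \int_0^1 \dot{m}_t \otimes \dot{m}_t \, dt = \delta_{0,1} \otimes \delta_{0,1}, \\
v^* &= \bigl(\lambda_1\, \delta_{0,1} \otimes \delta_{0,1} + \lambda_2 I_m\bigr)^{-1} \delta_\pm .
\end{align*}
```

Under this assumption the fair mixup solution coincides with the gap regularization solution of Proposition 2, so any difference between the two regularizers in this linear-in-$v$ setting comes from the nonlinearity of $\Phi$ along the mixup path.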

B EXPERIMENT DETAILS

Adult. We follow the preprocessing procedure of Yurochkin et al. (2019) by removing some features in the dataset (footnote 3). We then encode the discrete and quantized continuous attributes with one-hot encoding. We retrain each model 10 times with batch size 1000 and report the mean accuracy and fairness measurement. The models are selected via the performance on the validation set. In each trial, the dataset is randomly split into a training and a testing set with partition 80% and 20%, respectively. The models are optimized with the Adam optimizer (Kingma & Ba, 2014) with learning rate 1e-3. For DP, we sample 500 datapoints for each A ∈ {0, 1} to form a batch. Similarly, for EO, we sample 250 datapoints for each (A, Y) pair where A, Y ∈ {0, 1}.

Here, T is the set of threshold values used by the average ∆DP of section C.2, which averages ∆DP over binarized predictions derived via different thresholds. For instance, averaging ∆DP over the thresholds T = [0.1, 0.2, ..., 0.9] gives an average ∆DP of roughly 0.5 instead of 0 for the example in section C.2, which captures the unfairness between groups. We report the average ∆DP for each method with thresholds T = [0.1, 0.2, ..., 0.9] in Figure 8. Similarly, fair mixup exhibits the best tradeoff compared to the baselines for demographic parity. For equalized odds, the performances of fair mixup and GapReg are similar, with fair mixup achieving a better tradeoff when ∆EO is small.

C.3 SMALLER MODEL SIZE

To examine the effect of model size, we reduce the hidden size from 200 to 50 and show the result in Figure 9. Overall, the performance does not vary significantly after reducing the model size. We can again observe that fair mixup outperforms the baselines for ∆DP. Similar to the results in section C.2, the performances of fair mixup and GapReg are similar for equalized odds, with fair mixup achieving a better tradeoff when ∆EO is small.



Footnote 1: We exchange the derivative and the integral via the Dominated Convergence Theorem.
Footnote 2: Disclaimer: the attractive experiment is an illustrative example and such classifiers of subjective attributes are not ethical.
Footnote 3: https://github.com/IBM/sensitive-subspace-robustness
Footnote 4: https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data



Figure 1: (a) Visualization of the path constructed via mixup interpolations between groups that have distributions $P_0$ and $P_1$, respectively. (b) Fair mixup penalizes the changes in the model's expected prediction with respect to the interpolated distributions. The regularized model (blue curve) has smaller slopes compared to the unregularized one (orange curve) along the path from $P_0$ to $P_1$, which eventually leads to a smaller demographic parity gap ∆DP.

Figure 2: The expected output µ f (t) gradually changes as t → 1. Even when ∆DP is small, | d dt µ f (t)| could still be large along the path.

Figure 3: Adult Dataset. (a,b) The tradeoff between AP and ∆DP/∆EO. (c) Visualization of the mixup path for models that regularize ∆DP with different algorithms. We plot the calibrated curve $\tilde{\mu}_f(t) := \mu_f(t) - \mu_f(0)$ for a better visualization. In this case, $\tilde{\mu}_f(0) = 0$ and $|\tilde{\mu}_f(1)| = \Delta\mathrm{DP}$ for all the calibrated curves $\tilde{\mu}_f$. Therefore, we can compare the ∆DP of each method via the absolute value of the last point ($t = 1$). The flatness of the path is highly correlated with ∆DP.

Figure 5: Visualization of calibrated paths on attractive classification task for ∆DP regularized models. The flatness of both input and latent path plays an important role in regularizing ∆DP.

Figure 6: Toxic Classification (a,b) The tradeoff between AP and ∆DP/∆EO. (c) Visualization of the calibrated paths for models that regularize ∆DP with different algorithms. Interestingly, fair mixup presents a nearly linear curve with small slope, while the baselines introduce "inverted-U" shaped curves.

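We then expand $\dot{m}_t$ when $T$ is mixup:

$$\dot{m}_t = \frac{d}{dt} \mathbb{E}_{x_0 \sim P_0, x_1 \sim P_1} \Phi(t x_0 + (1 - t) x_1) = \mathbb{E}_{x_0 \sim P_0, x_1 \sim P_1} \left[ J\Phi(t x_0 + (1 - t) x_1)(x_0 - x_1) \right].$$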

Figure 8: Average ∆DP and ∆EO on Adult Dataset.

Figure 9: Reducing the Model Size on Adult Dataset.

annex

CelebA. Model-wise, we extract the feature of size 512 after the average pooling layer of ResNet-18. A two-layer ReLU network with hidden size 512 is then trained to perform prediction. Among the positive-labeled datapoints for attractive, wavy hair, and smiling, the percentages that are male are 22.7%, 18.36%, and 34.6%, respectively. We use the original validation set of CelebA to perform model selection and report the accuracy and fairness metrics on the testing set. The visualization paths are also plotted with respect to the testing data. To implement manifold mixup, we interpolate the spatial features before the average pooling layer. Similarly, all the models are optimized with the Adam optimizer with learning rate 1e-3.

Toxicity Classification

We download the Jigsaw toxic comment dataset from the Kaggle website (footnote 4). The percentages of positive-labeled datapoints for black and asian are 18.8% and 6.4%, respectively; together the two groups form a dataset of size 22835. We retrain each model 10 times with batch size 200 and report the mean accuracy and fairness measurement. The models are selected via the performance on the validation set. The batch-sampling and data-splitting procedure is the same as the one for the Adult dataset. The models are again optimized with the Adam optimizer with learning rate 1e-3.
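The group-balanced batch sampling reused here from the Adult setup (equal numbers of datapoints per sensitive group for DP, and per (A, Y) pair for EO) can be sketched as below. The data layout `(x, y, a)`, the function names, and sampling without replacement are assumptions; applying the same proportions to a batch size of 200 (100 per group, 50 per pair) is also an inference from "the same procedure", not a stated detail.

```python
import random


def balanced_dp_batch(data, batch_size, rng=random):
    """Sample batch_size // 2 items per sensitive group from (x, y, a) tuples."""
    per_group = batch_size // 2
    batches = []
    for a in (0, 1):
        pool = [item for item in data if item[2] == a]
        # random.sample raises ValueError if the pool is smaller than per_group.
        batches.append(rng.sample(pool, per_group))
    return batches[0], batches[1]


def balanced_eo_batch(data, batch_size, rng=random):
    """Sample batch_size // 4 items for every (a, y) pair."""
    per_cell = batch_size // 4
    cells = {}
    for a in (0, 1):
        for y in (0, 1):
            pool = [item for item in data if item[2] == a and item[1] == y]
            cells[(a, y)] = rng.sample(pool, per_cell)
    return cells
```

With `batch_size=1000` this yields 500 datapoints per group for DP and 250 per (A, Y) pair for EO, matching the Adult description.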

C.2 EVALUATION METRIC

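The relaxed evaluation metric could overestimate the performance when the predicted confidence is significantly different between groups. For instance, a classifier $f$ can be unfair while having zero relaxed ∆DP: $f(x) = 1$ w.p. 60% and $0$ w.p. 40% on $P_0$, and $f(x) = 0.6$ w.p. 100% on $P_1$. Both groups then have expected prediction 0.6, so the expectation-based condition is satisfied. However, the classifier is highly unfair if we binarize the prediction with threshold 0.5: every sample from $P_1$ is predicted positive, compared to only 60% of the samples from $P_0$. To overcome this issue, let $f_t$ be the binarized predictor $f_t(x) = \mathbb{1}(f(x) \ge t)$; we evaluate the model with the average ∆DP ($\overline{\Delta\mathrm{DP}}$) defined as follows:

$$\overline{\Delta\mathrm{DP}}(f) = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \left| \mathbb{E}_{x \sim P_0} f_t(x) - \mathbb{E}_{x \sim P_1} f_t(x) \right|,$$

where $\mathcal{T}$ is a set of threshold values.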
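A minimal sketch of the threshold-averaged metric, applied to the two-group example of this section, is given below. Score distributions are represented as lists of `(score, probability)` pairs, which is an assumption made so that the computation is exact rather than sampled; the function names are hypothetical.

```python
def expected_score(dist):
    """E[f(x)] for a discrete score distribution [(score, prob), ...]."""
    return sum(s * p for s, p in dist)


def positive_rate(dist, t):
    """E[f_t(x)] = P(f(x) >= t) for the binarized predictor f_t."""
    return sum(p for s, p in dist if s >= t)


def relaxed_dp(dist0, dist1):
    """Expectation-based gap |E_{P0} f - E_{P1} f|."""
    return abs(expected_score(dist0) - expected_score(dist1))


def average_dp(dist0, dist1, thresholds):
    """Mean over thresholds of |E_{P0} f_t - E_{P1} f_t|."""
    gaps = [abs(positive_rate(dist0, t) - positive_rate(dist1, t))
            for t in thresholds]
    return sum(gaps) / len(gaps)


# The example from this section: f is 1 w.p. 0.6 and 0 w.p. 0.4 on P0,
# and constantly 0.6 on P1.
P0 = [(1.0, 0.6), (0.0, 0.4)]
P1 = [(0.6, 1.0)]
THRESHOLDS = [k / 10 for k in range(1, 10)]  # 0.1, 0.2, ..., 0.9

relaxed = relaxed_dp(P0, P1)
averaged = average_dp(P0, P1, THRESHOLDS)
```

For this example the relaxed gap is zero while the averaged gap comes out at about 0.47 with the `>=` convention used here (the threshold 0.6 counts the constant score 0.6 as positive), that is, close to 0.5, so the averaged metric exposes the unfairness that the expectation-based metric misses.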

