SENSEI: SENSITIVE SET INVARIANCE FOR ENFORCING INDIVIDUAL FAIRNESS

Abstract

In this paper, we cast fair machine learning as invariant machine learning. We first formulate a version of individual fairness that enforces invariance on certain sensitive sets. We then design a transport-based regularizer that enforces this version of individual fairness and develop an algorithm to minimize the regularizer efficiently. Our theoretical results guarantee the proposed approach trains certifiably fair ML models. Finally, in the experimental studies we demonstrate improved fairness metrics in comparison to several recent fair training procedures on three ML tasks that are susceptible to algorithmic bias.

1. INTRODUCTION

As machine learning (ML) models replace humans in high-stakes decision-making and decision-support roles, concern regarding the consequences of algorithmic bias is growing. For example, ML models are routinely used in criminal justice and welfare to supplement humans, but they may have racial, class, or geographic biases (Metz & Satariano, 2020). In response, researchers have proposed many formal definitions of algorithmic fairness as a first step towards combating algorithmic bias. Broadly speaking, there are two kinds of definitions of algorithmic fairness: group fairness and individual fairness. In this paper, we focus on enforcing individual fairness. At a high level, the idea of individual fairness is the requirement that a fair algorithm should treat similar individuals similarly. Individual fairness was initially dismissed as impractical because, for many ML tasks, there is no consensus on which users are similar. Fortunately, a flurry of recent work addresses this issue (Ilvento, 2019; Wang et al., 2019; Yurochkin et al., 2020; Mukherjee et al., 2020). In this paper, we assume there is a similarity metric for the ML task at hand and consider the task of enforcing individual fairness. Our main contributions are:

1. we define distributional individual fairness, a variant of Dwork et al.'s original definition of individual fairness that is (i) more amenable to statistical analysis and (ii) easier to enforce by regularization;
2. we develop a stochastic approximation algorithm to enforce distributional individual fairness when training smooth ML models;
3. we show that the stochastic approximation algorithm converges and the trained ML model generalizes under standard conditions;
4. we demonstrate the efficacy of the approach on three ML tasks that are susceptible to algorithmic bias: income-level classification, occupation prediction, and toxic comment detection.

2. ENFORCING INDIVIDUAL FAIRNESS WITH SENSITIVE SET INVARIANCE (SENSEI)

2.1 A TRANSPORT-BASED DEFINITION OF INDIVIDUAL FAIRNESS

Let X and Y be the spaces of inputs and outputs, respectively, for the supervised learning task at hand. For example, in classification tasks, Y may be the probability simplex. An ML model is a function h : X → Y in a space of functions H (e.g. the set of all neural nets with a certain architecture). Dwork et al. (2011) define individual fairness as L-Lipschitz continuity of an ML model h with respect to appropriate metrics on X and Y:

d_Y(h(x), h(x')) ≤ L d_X(x, x') for all x, x' ∈ X. (2.1)

The choice of d_Y is often determined by the form of the output. For example, if the ML model outputs a vector of logits, then we may pick the Euclidean norm as d_Y (Kannan et al., 2018; Garg et al., 2018). The metric d_X is the crux of (2.1) because it encodes our intuition of which inputs are similar for the ML task at hand. For example, in natural language processing tasks, d_X may be a metric on word/sentence embeddings that ignores variation in certain sensitive directions. In light of the importance of the similarity metric in (2.1) to enforcing individual fairness, there is also a line of work on learning the similarity metric from data (Ilvento, 2019; Wang et al., 2019; Mukherjee et al., 2020). In our experiments, we adapt the methods from Yurochkin et al. (2020) to learn similarity metrics.

Although intuitive, individual fairness is statistically and computationally intractable. Statistically, it is generally impossible to detect violations of individual fairness on a zero-measure subset of the sample space. Computationally, individual fairness is a Lipschitz restriction, and such restrictions are hard to enforce. In this paper, we address both issues by lifting (2.1) to the space of probability distributions on X to obtain an "average case" version of individual fairness. This version (i) is more amenable to statistical analysis, (ii) is easy to enforce by minimizing a data-dependent regularizer, and (iii) preserves the intuition behind Dwork et al. (2011)'s original definition of individual fairness.

Definition 2.1 (distributional individual fairness (DIF)). Let ε, δ > 0 be tolerance parameters and ∆(X × X) be the set of probability measures on X × X. Define

R(h) ≜ sup_{Π ∈ ∆(X × X)} E_Π[d_Y(h(X), h(X'))] subject to E_Π[d_X(X, X')] ≤ ε, Π(·, X) = P_X, (2.2)

where P_X is the (marginal) distribution of the inputs in the ML task at hand. An ML model h is (ε, δ)-distributionally individually fair (DIF) iff R(h) ≤ δ.

We remark that DIF only depends on h and P_X. It does not depend on the (conditional) distribution of the labels P_{Y|X}, so it does not depend on the performance of the ML model. In other words, it is possible for a model to perform poorly and be perfectly DIF (e.g. the constant model h(x) = 0). The optimization problem in (2.2) formalizes correspondence studies in the empirical literature (Bertrand & Duflo, 2016). Here is a prominent example.

Example 2.2. Bertrand & Mullainathan studied racial bias in the US labor market. The investigators responded to help-wanted ads in Boston and Chicago newspapers with fictitious resumes. To manipulate the perception of race, they randomly assigned African-American or white sounding names to the resumes. The investigators concluded there is discrimination against African-Americans because the resumes assigned white-sounding names received 50% more callbacks for interviews than the resumes assigned African-American-sounding names. We view Bertrand & Mullainathan's investigation as evaluating the objective in (2.3) at a special T. Let X be the space of resumes, and h : X → {0, 1} be the decision rule that decides whether a resume receives a callback. Bertrand & Mullainathan implicitly pick the T that reassigns the name on a resume from an African-American sounding name to a white one (or vice versa) and measure discrimination with the difference between callback rates before and after reassignment: E_P[1{h(X) ≠ h(T(X))}] = P{h(X) ≠ h(T(X))}.

2.2 DIF AND INDIVIDUAL FAIRNESS

We consider distributional individual fairness a variant of Dwork et al.'s definition. To compare the two, consider the Monge version of (2.2):

sup_{T : X → X} E_P[d_Y(h(X), h(T(X)))] subject to E_P[d_X(X, T(X))] ≤ ε. (2.3)

The map corresponding to (2.1),

T_IF(x) ≜ argmax_{x' : d_X(x, x') ≤ ε} d_Y(h(x), h(x')),

maps each x to its worst-case counterpart x' in (2.1). It is not hard to see that T_IF is a feasible map for (2.3), but it may not be optimal. This is because (2.3) only restricts T to transport points by at most ε on average; the optimal T may transport some points by more than ε. To make the two definitions more comparable, we consider an ε-δ version of individual fairness. A model h : X → Y satisfies (ε, δ)-individual fairness at x ∈ X if and only if

d_Y(h(x), h(T_IF(x))) = sup_{x' : d_X(x, x') ≤ ε} d_Y(h(x), h(x')) ≤ δ. (2.4)

To arrive at (2.4), we start by observing that (2.1) is equivalent to d_Y(h(x), h(T_IF(x))) ≤ Lε for any x ∈ X and ε > 0. We fix x and ε and re-parameterize the right side with δ to obtain (2.4). It is possible to show that if h is (ε, δ)-DIF, then there exists δ' such that it satisfies (ε, δ')-individual fairness for "most" x's. We formally state this result in a proposition.

Proposition 2.3. If h : X → Y is (ε, δ)-DIF, then P_X(d_Y(h(X), h(T_IF(X))) ≥ τ) ≤ δ/τ for any τ > 0.
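For intuition, the Monge-style objective in (2.3) can be approximated on a finite sample by letting T map each point to the most discrepant other sample point within distance ε. The sketch below is purely illustrative and makes loud assumptions not in the paper: h is a scalar-output model, d_X is Euclidean distance, d_Y is the absolute difference of outputs, and T is restricted to map into the sample itself.

```python
import numpy as np

def monge_fair_penalty(h, X, eps):
    """Discrete sketch of the Monge objective (2.3): map each x to the
    in-sample point x' within d_X-distance eps that maximizes
    d_Y(h(x), h(x')), then average over the sample.
    Assumptions (illustrative only): d_X is Euclidean, d_Y is the absolute
    difference of scalar outputs."""
    vals = []
    for x in X:
        within = X[np.linalg.norm(X - x, axis=1) <= eps]  # feasible x'
        vals.append(np.abs(h(within) - h(x)).max())       # worst-case d_Y
    return float(np.mean(vals))
```

With eps = 0 the only feasible x' is x itself, so the penalty is 0; growing eps imposes progressively stronger invariance requirements on h.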

2.3. ENFORCING DIF

There are two general approaches to enforcing invariance conditions such as (2.2). The first is distributionally robust optimization (DRO):

min_{h ∈ H} L_adv(h) ≜ sup_{P' : W(P, P') ≤ ε} E_{P'}[ℓ(Y', h(X'))], (2.5)

where ℓ is a loss function and W(P, Q) is the Wasserstein distance between distributions on X × Y induced by the transport cost function c((x, y), (x', y')) ≜ d_X(x, x') + ∞ · 1{y ≠ y'}. This approach is very similar to adversarial training, and it was considered by Yurochkin et al. (2020) for enforcing (their modification of) individual fairness. In this paper, we consider a regularization approach to enforcing (2.2):

min_{h ∈ H} L(h) + ρ R(h), L(h) ≜ E[ℓ(Y, h(X))], (2.6)

where ρ > 0 is a regularization parameter and the regularizer R is defined in (2.2). An obvious advantage of the regularization approach is that it allows the user to fine-tune the trade-off between goodness-of-fit and fairness by adjusting ρ (see Figure 1; in Figure 3 of Appendix D we show the lack of such flexibility in the method of Yurochkin et al. (2020)). Although the two approaches share many theoretical properties, we show in Section 4 that the regularization approach has superior empirical performance. We defer a more in-depth comparison between the two approaches to subsection 2.4.

At first blush, the regularized risk minimization problem (2.6) is not amenable to stochastic optimization because R is not an expected value of a function of the training examples. Fortunately, by appealing to duality, it is possible to obtain a dual formulation of R that is suitable for stochastic optimization.

Theorem 2.4 (dual form of R). If d_Y(h(x), h(x')) − λ d_X(x, x') is continuous (in (x, x')) for any λ ≥ 0, then

R(h) = inf_{λ ≥ 0} { λε + E_{P_X}[r_λ(h, X)] }, r_λ(h, X) ≜ sup_{x' ∈ X} { d_Y(h(X), h(x')) − λ d_X(X, x') }.

We defer the proof of this result to Appendix A.
In light of the dual form of the fair regularizer, the regularized risk minimization problem is equivalently

min_{h ∈ H} inf_{λ ≥ 0} E_P[ℓ(Y, h(X)) + ρ(λε + r_λ(h, X))], (2.7)

where r_λ is defined in Theorem 2.4. This problem has the form of minimizing an expected value of a function of the training examples. To optimize with respect to h, we parameterize the function space H with a parameter θ ∈ Θ ⊂ R^d and consider a stochastic approximation approach to finding the best parameter. Let w ≜ (θ, λ) and Z ≜ (X, Y). The stochastic optimization problem we wish to solve is

min_{w ∈ Θ × R_+} F(w) ≜ E_P[f(w, Z)], f(w, Z) ≜ ℓ(Y, h_θ(X)) + ρ(λε + r_λ(h_θ, X)). (2.8)

We summarize Sensitive Set Invariance (SenSeI) for stochastic optimization of (2.8) in Algorithm 1.

Algorithm 1 SenSeI: Sensitive Set Invariance
inputs: starting point (θ_0, λ_0), step sizes (η_t)
repeat
  (X_{t,1}, Y_{t,1}), ..., (X_{t,B}, Y_{t,B}) ~ P   {sample mini-batch from P}
  x'_{t,b} ← argmax_{x' ∈ X} { d_Y(h_{θ_t}(X_{t,b}), h_{θ_t}(x')) − λ_t d_X(X_{t,b}, x') }, b ∈ [B]   {generate worst-case examples}
  λ_{t+1} ← max{ 0, λ_t − η_t ρ(ε − (1/B) Σ_{b=1}^B d_X(X_{t,b}, x'_{t,b})) }
  θ_{t+1} ← θ_t − η_t (1/B) Σ_{b=1}^B ( ∂_θ ℓ(Y_{t,b}, h_{θ_t}(X_{t,b})) + ρ ∂_θ d_Y(h_{θ_t}(X_{t,b}), h_{θ_t}(x'_{t,b})) )
until converged

2.4 ADVERSARIAL TRAINING VS FAIR REGULARIZATION

Adversarial training is a popular approach to training invariant ML models. It was originally developed to defend ML models against adversarial examples (Goodfellow et al., 2014; Madry et al., 2017). There are many versions of adversarial training; the Wasserstein distributionally robust optimization (DRO) version by Sinha et al. (2017) is most closely related to (2.5). The direct goal of adversarial training is training ML models whose risk is small on adversarial examples, and the robust risk (2.5) that adversarial training seeks to minimize is exactly the risk on adversarial examples.
An indirect consequence of adversarial training is invariance to imperceptible changes to the inputs. Recall that an adversarial example shares the label of a training example but its inputs differ imperceptibly from those of the training example. Thus (successful) adversarial training leads to ML models that ignore such imperceptible changes and are therefore invariant in "imperceptible neighborhoods" of the training examples. Unlike adversarial training, which leads to invariance as an indirect consequence of adversarial robustness, fair regularization enforces fairness by explicitly minimizing a fair regularizer. A key benefit of invariance regularization is that it permits practitioners to fine-tune the trade-off between goodness-of-fit and invariance by adjusting the regularization parameter (see Figure 1).
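Algorithm 1 can be sketched in code for a toy setting. The sketch below is a simplified instance under loud assumptions that are ours, not the paper's: h_θ is a logistic model, d_Y is the squared difference of sigmoid outputs, d_X is squared Euclidean distance, full-batch gradients replace mini-batches, and the inner maximization over x' is taken over a finite, user-supplied list of counterfactual candidates per example rather than over all of X.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sensei_train(X, y, counterfactuals, rho=1.0, eps=0.1, lr=0.1, steps=200):
    """Toy sketch of Algorithm 1 (SenSeI) for logistic regression.

    counterfactuals[i] is a list of candidate x' for example i; the inner
    max of d_Y(h(x), h(x')) - lam * d_X(x, x') is taken over this set.
    """
    n, d = X.shape
    theta, lam = np.zeros(d), 1.0
    for _ in range(steps):
        p = sigmoid(X @ theta)
        loss_grad = X.T @ (p - y) / n            # gradient of the logistic loss
        fair_grad, dx_mean = np.zeros(d), 0.0
        for i in range(n):
            # generate the worst-case example for x_i over the candidate set
            cands = counterfactuals[i]
            scores = [(sigmoid(X[i] @ theta) - sigmoid(c @ theta)) ** 2
                      - lam * np.sum((X[i] - c) ** 2) for c in cands]
            c = cands[int(np.argmax(scores))]
            si, sc = sigmoid(X[i] @ theta), sigmoid(c @ theta)
            # gradient of d_Y(h(x_i), h(x')) w.r.t. theta, x' held fixed
            fair_grad += 2 * (si - sc) * (si * (1 - si) * X[i]
                                          - sc * (1 - sc) * c)
            dx_mean += np.sum((X[i] - c) ** 2) / n
        theta -= lr * (loss_grad + rho * fair_grad / n)   # primal descent step
        lam = max(0.0, lam - lr * rho * (eps - dx_mean))  # dual step on lambda
    return theta, lam
```

Increasing rho trades goodness-of-fit for invariance to the supplied counterfactuals, mirroring the role of ρ in (2.6).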

2.5. RELATED WORK

There are three lines of work on enforcing individual fairness. The first enforces group fairness with respect to many (possibly overlapping) groups to avoid disparate treatment of individuals (Hébert-Johnson et al., 2017; Kearns et al., 2017; Kim et al., 2018a;b). At a high level, these methods repeatedly find groups in which group fairness is violated and update the ML model to correct the violations. Compared to these methods, which approximate individual fairness with group fairness, we directly enforce individual fairness. The second line of work enforces individual fairness without knowledge of the similarity metric. Gillen et al. (2018), Rothblum & Yona (2018), and Jung et al. (2019) reduce the problem of enforcing individual fairness to a supervised learning problem by minimizing the number of violations. Instead of a similarity metric, these algorithms rely on an oracle that detects violations of individual fairness. Garg et al. (2018) enforce individual fairness by penalizing the expected difference in predictions between counterfactual inputs. Instead of a similarity metric, this algorithm relies on a way of generating counterfactual inputs. Our approach complements these methods by relying on a similarity metric instead of such oracles. This allows us to take advantage of recent work on learning fair metrics (Ilvento, 2019; Wang et al., 2019; Yurochkin et al., 2020; Mukherjee et al., 2020). Most similar to our work is SenSR (Yurochkin et al., 2020), which also assumes access to a similarity metric. SenSR is based on adversarial training, i.e. it enforces a risk-based notion of individual fairness that only requires the risk of the ML model to be similar on similar inputs. In contrast, our method is based on fair regularization, i.e. it enforces a notion of individual fairness that requires the outputs of the ML model to be similar.
The latter is stronger (it implies the former) and is much closer to Dwork et al.'s original definition. In addition, the risk-based notion of SenSR ties accuracy to fairness. Our approach separates these two (usually conflicting) goals and allows practitioners to more easily adjust the trade-off between accuracy and fairness, as demonstrated in our experimental studies. Finally, the third line of work proposes pre-processing techniques that learn a fair representation of the individuals and train an ML model that accepts the fair representation as input (Zemel et al., 2013; Bower et al., 2018; Madras et al., 2018; Lahoti et al., 2019). These works are complementary to ours, as we propose an in-processing algorithm.

3. THEORETICAL PROPERTIES OF SENSEI

In this section, we describe some theoretical properties of SenSeI. We defer all proofs to Appendix A. First, Algorithm 1 is an instance of a stochastic gradient method, and its convergence properties are well-studied. Even if f(w, Z) is non-convex in w, the algorithm converges (globally) to a stationary point (see Appendix A for a rigorous statement). Second, the fair regularizer is data-dependent, and it is unclear whether minimizing its empirical counterpart (3.1) enforces distributional fairness. We show that the fair regularizer generalizes under standard conditions. Consequently,

1. it is possible for practitioners to certify that an ML model h is DIF a posteriori by checking R̂(h) (even if h was trained by minimizing R̂);
2. fair regularization enforces distributional individual fairness (as long as the hypothesis class includes DIF ML models).

Notation. Let {(X_i, Y_i)}_{i=1}^n be the training set and P̂_X be the empirical distribution of the inputs. Define L̂ : H → R as the empirical risk and R̂ : H → R as the empirical counterpart of the fair regularizer R (2.2):

R̂(h) ≜ max_{Π ∈ ∆(X × X)} E_Π[d_Y(h(X), h(X'))] subject to E_Π[d_X(X, X')] ≤ ε, Π(·, X) = P̂_X. (3.1)

Define the loss class L and its counterpart for the fair regularizer as

L ≜ { ℓ_h : Z → R | h ∈ H }, ℓ_h(z) ≜ ℓ(h(x), y),
D ≜ { d_h : X × X → R_+ | h ∈ H }, d_h(x, x') ≜ d_Y(h(x), h(x')).

We measure the complexity of D and L with their entropy integrals with respect to the uniform metric:

J(D) ≜ ∫_0^∞ (log N(D, ‖·‖_∞, ε̄))^{1/2} dε̄ (and similarly for J(L)),

where N(D, ‖·‖_∞, ε̄) is the ε̄-covering number of D in the uniform metric. The main benefit of using entropy integrals instead of Rademacher or Gaussian complexities to measure the complexity of D and L is that they do not depend on the distribution of the training examples. This allows us to obtain generalization error bounds for counterfactual training sets that are similar to the (observed) training set.
Finally, define the diameters of X in the d_X metric and of Y in the d_Y metric as

D_X ≜ sup_{x, x' ∈ X} d_X(x, x'), D_Y ≜ sup_{y, y' ∈ Y} d_Y(y, y').

The first result shows that the fair regularizer R̂(h) generalizes. We assume D_X, D_Y < ∞. This is a boundedness condition on X × Y, and it is a common simplifying assumption in statistical learning theory. We also assume J(D) < ∞. This is a standard assumption that appears in many uniform convergence results.

Theorem 3.1. As long as D_X, D_Y, and J(D) are all finite, with probability at least 1 − t:

sup_{h ∈ H} |R̂(h) − R(h)| ≤ 48(J(D) + (1/ε) D_X D_Y) / √n + D_Y (log(2/t) / (2n))^{1/2}.

Theorem 3.1 implies it is possible to certify that an ML model h satisfies distributional individual fairness (modulo error terms that vanish in the large-sample limit) by inspecting R̂(h). This is important because a practitioner may inspect R̂(h) after training to verify whether the trained ML model h is fair enough. Theorem 3.1 assures the user that R̂(h) is close to R(h).

4. COMPUTATIONAL EXPERIMENTS

In this section we present empirical evidence that SenSeI trains individually fair ML models in practice and study the trade-off between accuracy and fairness parametrized by ρ (defined in (2.6)).

Baselines. We compare SenSeI to empirical risk minimization (Baseline) and two recent approaches for training individually fair ML models: Sensitive Subspace Robustness (SenSR) (Yurochkin et al., 2020), which uses DRO to achieve robustness to perturbations in a fair metric, and Counterfactual Logit Pairing (CLP) (Garg et al., 2018), which penalizes the difference between the outputs of an ML model on training examples and hand-crafted counterfactuals. We provide implementation details of SenSeI and the baselines in Appendix B.

4.1. TOXIC COMMENT DETECTION

We consider the task of training a classifier to identify toxic comments, i.e. rude or disrespectful messages in online conversations. Identifying and moderating toxic comments is crucial for facilitating inclusive online conversations. Data is available through the "Toxic Comment Classification Challenge" Kaggle competition. We utilize the subset of the dataset that is labeled with a range of identity contexts (e.g. "muslim", "white", "black", "homosexual gay or lesbian"). Many of the toxic comments in the training data also relate to these identities, leading to a classifier with poor test performance on the sets of comments with some of the identity contexts (a group fairness violation) and a prediction rule that utilizes words such as "gay" to flag a comment as toxic (an individual fairness violation). To obtain good features we use the last-layer representation of BERT (base, uncased) (Devlin et al., 2018) fine-tuned on a separate subset of 500k randomly selected comments without identity labels. We then train a neural network with 2000 hidden units on these BERT features.

Counterfactuals and fair metric. CLP (Garg et al., 2018) requires defining a set of counterfactual tokens. The training proceeds by taking an input comment and, if it contains a counterfactual token, replacing it with another random counterfactual token. For example, if "gay" and "straight" are among the counterfactual tokens, the comment "Some people are gay" may be modified to "Some people are straight". The difference in logit outputs of the classifier on the original and modified comments is then used as a regularizer. For toxicity classification, Garg et al. (2018) adopted a set of 50 counterfactual tokens from Dixon et al. (2018). Counterfactuals allow for a simple fair metric learning procedure via factor analysis: since any variation between the representation of a data point and its counterfactuals is considered undesired, Yurochkin et al. (2020) and Mukherjee et al. (2020) proposed to use a Mahalanobis metric with the major directions of variation among counterfactuals projected out. We utilize this approach.

Comparison metrics. To evaluate the individual fairness of the classifiers, we use test data and the 50 counterfactual tokens to check whether the classifier's toxicity decision varies across counterfactuals. For example, is the prediction for "Some people are gay" the same as for "Some people are straight"? An intuitive fair metric should be 0 on such a pair of comments, so by Dwork et al.'s individual fairness definition the prediction of the classifier should not change. We report the Counterfactual Token Fairness (CTF) score (Garg et al., 2018), which quantifies the variance across counterfactuals of the predicted probability that a comment is toxic, and Prediction Consistency (PC), equal to the proportion of test comments where the prediction is the same across all 50 counterfactual variations. For goodness-of-fit we use balanced accuracy due to class imbalance. To quantify group fairness we follow the accuracy parity notion (Zafar et al., 2017; Zhao et al., 2019). Here protected groups correspond to the identity context labels available in the data, and accuracy parity quantifies whether a classifier is equally accurate on, e.g., comments labeled with "white" context and those labeled with "black". There are 9 protected groups, and we report the standard deviation of the accuracies and balanced accuracies across them. We present mathematical expressions for each of the metrics in Appendix C for completeness.

Results. We repeat our experiment 10 times with random 70-30 train-test splits. SenSeI also outperforms CLP: our approach uses optimization to find worst-case perturbations of the data according to a fair metric induced by counterfactuals, while CLP chooses a random perturbation among the counterfactuals.
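The CLP perturbation step described above can be sketched as follows. The four-token list is a placeholder for illustration only (Garg et al. (2018) use a set of 50 tokens from Dixon et al. (2018), not these):

```python
import random

# Placeholder identity-token list (illustrative; not the actual 50 tokens).
TOKENS = ["gay", "straight", "muslim", "christian"]

def random_counterfactual(comment, tokens=TOKENS, rng=random):
    """If the comment contains an identity token, replace it with another
    randomly chosen identity token, as in CLP's training-time perturbation."""
    out = []
    for w in comment.split():
        if w.lower() in tokens:
            out.append(rng.choice([t for t in tokens if t != w.lower()]))
        else:
            out.append(w)
    return " ".join(out)
```

For example, `random_counterfactual("Some people are gay")` may return "Some people are straight"; comments without identity tokens pass through unchanged.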
SenSeI searches for a perturbation of every data point, which potentially allows it to generalize to unseen counterfactuals, while CLP can only perturb comments that explicitly contain a counterfactual token known at training time. In Figure 2 we verify that both SenSeI and CLP allow one to "trade" fairness and accuracy by varying the regularization strength ρ (in the table we used ρ = 5 for both). We also notice the effect of worst-case optimization as opposed to random sampling: SenSeI has higher prediction consistency at the cost of balanced accuracy.
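The prediction consistency (PC) metric described above can be computed as in this sketch, where `predict` and `make_variants` are assumed, user-supplied callables (a label function and a counterfactual generator):

```python
def prediction_consistency(predict, comments, make_variants):
    """Fraction of comments whose predicted label agrees across the original
    comment and all of its counterfactual variants (the PC metric, sketched).
    predict: comment -> label; make_variants: comment -> list of variants."""
    consistent = 0
    for c in comments:
        labels = {predict(v) for v in [c] + make_variants(c)}
        consistent += int(len(labels) == 1)
    return consistent / len(comments)
```

A PC of 1.0 means no test comment's prediction flips under any counterfactual substitution.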

4.2. OCCUPATION PREDICTION

An online professional presence via dedicated services or personal websites is common practice in many industries. ML systems can be trained using data from these sources to identify a person's occupation and used by recruiters to assist in finding suitable candidates for job openings.

Counterfactuals and fair metric. The definition of counterfactuals for this problem is the gender analog of the Bertrand & Mullainathan (2004) investigation of racial bias in the labor market. For each bio we create a counterfactual bio by replacing male pronouns with the corresponding female ones and vice versa. For the fair metric we use the same approach as in the Toxicity study.

Comparison metrics. We use the same individual fairness metrics. To compare group fairness we report the root mean squared gap (Gap RMS) and mean absolute gap (Gap ABS) between male and female true positive rates for each of the occupations, following prior studies of this dataset (Romanov et al., 2019; Prost et al., 2019). For performance we report balanced accuracy due to the imbalance in occupation proportions in the data.

Results. We repeat the experiment 10 times with 70-30 train-test splits and summarize results in Table 2 (for SenSeI and CLP we set ρ = 5). Compared to the Toxicity experiment, we note that the individual fairness metrics are much better. In particular, attaining prediction consistency is easier because there is only one type of counterfactual. Overall, fairness metrics are comparable across fair training methods with a slight SenSeI advantage. It is interesting to note the mild accuracy improvement: learning a classifier invariant to gender can help to avoid spurious correlations between occupations and gender present in the data. We present the fairness-accuracy "trade-off" in Figure 2: increasing regularization strength has a clear upward trend in terms of prediction consistency without decreasing accuracy. This experiment is an example where fairness can be improved without "trading" performance.
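The pronoun-swap counterfactual described above can be sketched as follows. The mapping table is a deliberate simplification for illustration: it ignores capitalization and always maps "her" to "his", although "her" can correspond to either "his" or "him" depending on grammatical role; a real implementation needs case handling and part-of-speech disambiguation.

```python
# Simplified pronoun table (illustrative; see caveats in the lead-in).
SWAP = {"he": "she", "she": "he", "him": "her", "his": "her",
        "her": "his", "himself": "herself", "herself": "himself"}

def swap_pronouns(bio):
    """Replace each gendered pronoun with its counterpart (lowercased)."""
    return " ".join(SWAP.get(w, w) for w in bio.lower().split())
```

Applying `swap_pronouns` to each bio yields its gender counterfactual, the analog of the name reassignment in Example 2.2.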
We note two prior group fairness studies of the Bios dataset: Romanov et al. (2019) and Prost et al. (2019). Both reported worse classification results and fairness metrics. The better classification performance in our work is likely attributable to using BERT to obtain bio feature vectors. For group fairness, we note that the improvement over the baseline with SenSeI is more significant.

4.3. INCOME PREDICTION

The Adult dataset (Bache & Lichman, 2013) is a common benchmark in the group fairness literature. The task is to predict if a person earns more than $50k per year using information about their education, gender, race, marital status, hours worked per week, etc. Yurochkin et al. (2020) studied individual fairness on Adult by considering prediction consistency with respect to demographic features: race and gender (GR-Con.) and marital status (S-Con., i.e. spouse consistency). To quantify group fairness they used RMS gaps and maximum gaps between true positive rates across genders (Gap RMS-G and Gap max-G) and races (Gap RMS-R and Gap max-R). Due to class imbalance, performance is quantified with balanced accuracy (B-Acc). For the fair metric they used a Mahalanobis distance with race, gender, and a logistic regression vector predicting gender projected out. We note that CLP was proposed as a fair training method for text classification (Garg et al., 2018) and is not applicable on Adult because it is not clear how to define counterfactuals. We compare SenSeI to the results reported in Yurochkin et al. (2020). In Table 3 we show that with a sufficiently large regularization strength ρ = 40, SenSeI is able to further reduce all group fairness gaps and improve one of the individual fairness metrics, at the cost of some accuracy.
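The true-positive-rate gap metrics used above can be computed as in this sketch for a binary protected attribute; `tpr_gaps` is a hypothetical helper name, and the per-class formulation is an illustrative reading of the Gap RMS / Gap max metrics:

```python
import numpy as np

def tpr_gaps(y_true, y_pred, group, classes=(0, 1)):
    """RMS and maximum absolute gap between the two groups' true positive
    rates, computed per class (a sketch of the Gap RMS / Gap max metrics)."""
    gaps = []
    for c in classes:
        tpr = []
        for g in (0, 1):
            mask = (y_true == c) & (group == g)
            tpr.append(np.mean(y_pred[mask] == c))  # group-g TPR for class c
        gaps.append(tpr[0] - tpr[1])
    gaps = np.asarray(gaps)
    return float(np.sqrt(np.mean(gaps ** 2))), float(np.max(np.abs(gaps)))
```

Both quantities are 0 exactly when the two groups have identical true positive rates on every class.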

5. SUMMARY AND DISCUSSION

In this paper, we studied a regularization approach to enforcing individual fairness. We defined distributional individual fairness, a variant of Dwork et al.'s original definition of individual fairness, along with a data-dependent regularizer that enforces this distributional fairness (see Definition 2.1). We also developed a stochastic approximation algorithm to solve the regularized empirical risk minimization problem and showed that it trains ML models with distributional fairness guarantees. Finally, we showed that the algorithm mitigates algorithmic bias on three ML tasks that are susceptible to such biases: income-level classification, occupation prediction, and toxic comment detection.

A THEORETICAL PROPERTIES

We collect the proofs of all the theoretical results in the paper here. We restate the results before proving them for the reader's convenience. We assume that (X , d X ) and (Y, d Y ) are complete and separable metric spaces (Polish spaces).

A.1 DIF AND INDIVIDUAL FAIRNESS

Proposition A.1 (Proposition 2.3). If h : X → Y is (ε, δ)-DIF, then P_X(sup_{x' : d_X(x, x') ≤ ε} d_Y(h(x), h(x')) ≥ τ) ≤ δ/τ for any τ > 0.

Proof. Recall T_IF(x) ≜ argmax_{x' : d_X(x, x') ≤ ε} d_Y(h(x), h(x')). The coupling of X and T_IF(X) is feasible for (the Monge version of) (2.2). Thus R(h) ≤ δ implies E_{P_X}[d_Y(h(X), h(T_IF(X)))] ≤ δ. Markov's inequality implies P_X(d_Y(h(X), h(T_IF(X))) ≥ τ) ≤ δ/τ for any τ > 0.

A.2 CONVERGENCE PROPERTIES OF SENSEI

Algorithm 1 is an instance of a stochastic gradient method, and its convergence properties are well-studied. Even if f(w, Z) is non-convex in w, the algorithm converges (globally) to a stationary point. This is a well-known result in stochastic approximation, and we state it here for completeness.

Theorem A.2 (Ghadimi & Lan (2013)). Let σ² ≥ E[‖(1/B) Σ_{b=1}^B ∂_w f(w, Z_b) − ∂F(w)‖²₂] be an upper bound on the variance of the stochastic gradient. As long as F is L-strongly smooth, i.e.

F(w') ≤ F(w) + ⟨∂F(w), w' − w⟩ + (L/2)‖w' − w‖²₂ for any w, w' ∈ Θ × R_+,

then Algorithm 1 with constant step sizes η_t = (2BΔ₀ / (Lσ²T))^{1/2} satisfies

(1/T) Σ_{t=1}^T E[‖∂F(w_t)‖²₂] ≤ σ (8LΔ₀ / (BT))^{1/2},

where Δ₀ is any upper bound on the suboptimality of w₀. In other words, Algorithm 1 finds an approximate stationary point of (2.8), with expected squared gradient norm at most ε̄, in at most O(1/ε̄²) iterations. If F has more structure (e.g. convexity), then Algorithm 1 may converge faster.

A.3 PROOF OF DUALITY RESULTS IN SECTION 2

Theorem A.3 (Theorem 2.4). If d_Y(h(x), h(x')) − λ d_X(x, x') is continuous (in (x, x')) for any λ ≥ 0, then

R(h) = inf_{λ ≥ 0} { λε + E_{P_X}[r_λ(h, X)] }, r_λ(h, X) ≜ sup_{x' ∈ X} { d_Y(h(X), h(x')) − λ d_X(X, x') }.

Proof. We abuse notation and denote the function d_Y(h(x), h(x')) by d_Y ∘ h. We recognize the optimization problem in (2.2) as an (infinite-dimensional) linear optimization problem:

R(h) = sup_{Π ∈ ∆(X × X)} ⟨Π, d_Y ∘ h⟩ subject to ⟨Π, d_X⟩ ≤ ε, Π(·, X) = P_X,

where ⟨Π, f⟩ ≜ E_Π[f(X, X')]. It is not hard to check Slater's condition: dΠ(x, x') = 1{x' = x} dP(x) is strictly feasible. Thus we have strong duality (see Theorem 8.7.1 in Luenberger (1968)):

R(h) = sup_{Π : Π(·, X) = P_X} inf_{λ ≥ 0} ⟨Π, d_Y ∘ h⟩ + λ(ε − ⟨Π, d_X⟩)
     = sup_{Π : Π(·, X) = P_X} inf_{λ ≥ 0} λε + ⟨Π, d_Y ∘ h − λ d_X⟩
     = inf_{λ ≥ 0} { λε + sup_{Π : Π(·, X) = P_X} ⟨Π, d_Y ∘ h − λ d_X⟩ }.

It remains to show

sup_{Π : Π(·, X) = P_X} ⟨d_Y ∘ h − λ d_X, Π⟩ = E_P[ sup_{x' ∈ X} { d_Y(h(X), h(x')) − λ d_X(X, x') } ]. (A.1)

(≤ direction) The integrand in (A.1) satisfies

(d_Y ∘ h − λ d_X)(x, x') = d_Y(h(x), h(x')) − λ d_X(x, x') ≤ sup_{x' ∈ X} { d_Y(h(x), h(x')) − λ d_X(x, x') },

so the integrals satisfy the ≤ version of (A.1).

(≥ direction) Let Q be the set of all Markov kernels from X to X. We have

sup_{Π : Π(·, X) = P_X} ⟨d_Y ∘ h − λ d_X, Π⟩ = sup_{Q ∈ Q} ∫_{X × X} ( d_Y(h(x), h(x')) − λ d_X(x, x') ) dQ(x' | x) dP(x)
  ≥ sup_{T : X → X} ∫_X d_Y(h(x), h(T(x))) − λ d_X(x, T(x)) dP(x),

where we recalled that Q(A | x) = 1{T(x) ∈ A} is a Markov kernel in the second step. (Technically, in the second step, we only take the sup over T's that are decomposable (see Definition 14.59 in Rockafellar & Wets (2004)) with respect to P, but we gloss over this detail here.) We appeal to the technology of normal integrands (Rockafellar & Wets, 2004) to interchange integration and maximization. We assumed d_Y ∘ h − λ d_X is continuous, so it is a normal integrand (see Corollary 14.34 in Rockafellar & Wets (2004)). Thus it is permissible to interchange integration and maximization (see Theorem 14.60 in Rockafellar & Wets (2004)):

sup_{T : X → X} ∫_X d_Y(h(x), h(T(x))) − λ d_X(x, T(x)) dP(x) = ∫_X sup_{x' ∈ X} { d_Y(h(x), h(x')) − λ d_X(x, x') } dP(x).

This shows the ≥ direction of (A.1).

We remark that it is not necessary to rely on the technology of normal integrands to interchange expectation and maximization in the proof of Theorem 2.4. For example, Blanchet & Murthy (2016) prove a similar strong duality result without resorting to normal integrands. We do so here to simplify the proof.

A.4 PROOFS OF GENERALIZATION RESULTS IN SECTION 3

Theorem A.4 (Theorem 3.1). As long as D X , D Y , J(G) are all finite, sup f ∈F |E P f (Z) -E P f (Z) | ≤ 48(J(D)+ 1 D X D Y ) √ n + D Y ( log 2 t 2n ) 1 2 with probability at least 1 -t. Proof. Let R(h)      max Π∈∆(X ×X ) E Π d Y (h(X), h(X )) subject to E Π d X (X, X ) ≤ Π(•, X ) = P X ,      . where P X is the empirical distribution of the inputs in the training set. By Theorem 2.4, we have R(h) -R(h) = inf λ≥0 {λ + E P X r λ (h, X) } -inf λ≥0 {λ + E P X r λ (h, X) } = inf λ≥0 {λ + E P X r λ (h, X) } -λ * -E P X r λ * (h, X) ≤ E P X r λ * (h, X) } -E P X r λ * (h, X) , (A.2) where λ * ∈ arg min λ≥0 λ + E P X r λ (h, X) . The infimum is attained because inf λ≥0 {λ + E P X r λ (h, X) } is an strictly feasible (infinite-dimensional) linear optimization problem (see proof of Theorem 2.4). To bound λ * , we observe that r λ (h, X) ≥ 0 for any h ∈ H, λ ≥ 0: r λ (h, X) = sup x ∈X {d Y (h(X), h(x )) -λd X (X, x )} ≥ d Y (h(X), h(X)) -λd X (X, X). This implies R(h) = λ * + E P X r λ * (h, X) ≥ λ * . We rearrange to obtain a bound on λ * : λ * ≤ 1 R(h) ≤ 1 D Y λ. (A.3) This is admittedly a crude bound, but it is good enough here. Similarly, R(h) -R(h) ≤ E P X r λ * (h, X) } -E P X r λ * (h, X) , (A.4) where λ * ∈ arg min λ≥0 λ + E P X r λ (h, X) , and λ * ≤ λ. Combining (A.2) and (A.4), we obtain | R(h) -R(h)| ≤ sup f ∈F |E P f (Z) -E P f (Z) |, where F {r λ (h, •) | h ∈ H, λ ∈ [0, λ]}. It is possible to bound sup f ∈F |E P f (Z) -E P f (Z) | with results from statistical learning theory. First, we observe that the functions in F are bounded: 0 ≤ r λ (h, X) ≤ 1 sup y,y ∈Y d Y (y, y ). Thus sup f ∈F |E P f (Z) -E P f (Z) | has bounded differences inequality, so it concentrates sharply around its expectation. 
By the bounded-differences inequality and a standard symmetrization argument,

$$\sup_{f\in\mathcal{F}} \big|\mathbb{E}_P[f(Z)] - \mathbb{E}_{\widehat{P}}[f(Z)]\big| \le 2\mathfrak{R}_n(\mathcal{F}) + D_{\mathcal{Y}}\Big(\frac{\log\frac{2}{t}}{2n}\Big)^{\frac12}$$

with probability at least $1-t$, where $\mathfrak{R}_n(\mathcal{F})$ is the Rademacher complexity of $\mathcal{F}$:

$$\mathfrak{R}_n(\mathcal{F}) = \mathbb{E}\Big[\sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i f(Z_i)\Big].$$

It remains to study $\mathfrak{R}_n(\mathcal{F})$. First, we show that the $\mathcal{F}$-indexed Rademacher process $X_f \triangleq \frac{1}{n}\sum_{i=1}^n \sigma_i f(Z_i)$ is sub-Gaussian with respect to the metric

$$d_{\mathcal{F}}\big((h_1,\lambda_1),(h_2,\lambda_2)\big) \triangleq \sup_{x_1,x_2\in\mathcal{X}} \big|d_{\mathcal{Y}}(h_1(x_1),h_1(x_2)) - d_{\mathcal{Y}}(h_2(x_1),h_2(x_2))\big| + D_{\mathcal{X}}\, |\lambda_1 - \lambda_2|:$$

$$\mathbb{E}[\exp(t(X_{f_1}-X_{f_2}))] = \mathbb{E}\Big[\exp\Big(\tfrac{t}{n}\textstyle\sum_{i=1}^n \sigma_i\big(r_{\lambda_1}(h_1,X_i)-r_{\lambda_2}(h_2,X_i)\big)\Big)\Big] = \mathbb{E}\Big[\exp\Big(\tfrac{t}{n}\sigma\big(r_{\lambda_1}(h_1,X_i)-r_{\lambda_2}(h_2,X_i)\big)\Big)\Big]^n$$

$$= \mathbb{E}\Big[\exp\Big(\tfrac{t}{n}\sigma\Big(\sup_{x_1'\in\mathcal{X}}\inf_{x_2'\in\mathcal{X}} d_{\mathcal{Y}}(h_1(X_i),h_1(x_1')) - \lambda_1 d_{\mathcal{X}}(x_1',X_i) - d_{\mathcal{Y}}(h_2(X_i),h_2(x_2')) + \lambda_2 d_{\mathcal{X}}(X_i,x_2')\Big)\Big)\Big]^n$$

$$\le \mathbb{E}\Big[\exp\Big(\tfrac{t}{n}\sigma\Big(\sup_{x_1'\in\mathcal{X}} d_{\mathcal{Y}}(h_1(X_i),h_1(x_1')) - d_{\mathcal{Y}}(h_2(X_i),h_2(x_1')) + (\lambda_2-\lambda_1)\, d_{\mathcal{X}}(x_1',X_i)\Big)\Big)\Big]^n \le \exp\Big(\tfrac{t^2}{2n}\, d_{\mathcal{F}}\big((h_1,\lambda_1),(h_2,\lambda_2)\big)^2\Big).$$

Let $N(\mathcal{F}, d_{\mathcal{F}}, u)$ be the $u$-covering number of $\mathcal{F}$ in the $d_{\mathcal{F}}$ metric. We observe

$$N(\mathcal{F}, d_{\mathcal{F}}, u) \le N\big(\mathcal{D}, \|\cdot\|_\infty, \tfrac{u}{2}\big)\cdot N\big([0,\bar{\lambda}], |\cdot|, \tfrac{u}{2 D_{\mathcal{X}}}\big). \quad \text{(A.5)}$$

By Dudley's entropy integral,

$$\mathfrak{R}_n(\mathcal{F}) \le \frac{12}{\sqrt{n}} \int_0^\infty \big(\log N(\mathcal{F}, d_{\mathcal{F}}, u)\big)^{\frac12}\,du \le \frac{12}{\sqrt{n}} \int_0^\infty \Big(\log N\big(\mathcal{D}, \|\cdot\|_\infty, \tfrac{u}{2}\big) + \log N\big([0,\bar{\lambda}], |\cdot|, \tfrac{u}{2D_{\mathcal{X}}}\big)\Big)^{\frac12}\,du$$

$$\le \frac{12}{\sqrt{n}}\Big(\int_0^\infty \big(\log N\big(\mathcal{D}, \|\cdot\|_\infty, \tfrac{u}{2}\big)\big)^{\frac12}\,du + \int_0^\infty \big(\log N\big([0,\bar{\lambda}], |\cdot|, \tfrac{u}{2D_{\mathcal{X}}}\big)\big)^{\frac12}\,du\Big) \le \frac{24 J(\mathcal{D})}{\sqrt{n}} + \frac{24 D_{\mathcal{X}} \bar{\lambda}}{\sqrt{n}} \int_0^{\frac12} \big(\log\tfrac{1}{u}\big)^{\frac12}\,du.$$

We check that $\int_0^{1/2} (\log\frac{1}{u})^{\frac12}\,du < 1$ to arrive at Theorem 3.1.

The chief technical novelty of this proof is the bound on $\widehat{\lambda}^*$ in terms of the diameter of the output space. This bound allows us to restrict the relevant function class so that we can appeal to standard techniques from empirical process theory to obtain uniform convergence results. In prior work (e.g. Lee & Raginsky (2017)), the analogous bound relies on smoothness properties of the loss, which precludes the non-smooth $d_{\mathcal{Y}}$'s in our problem setting.

Corollary A.5. Assume there is $h_0\in\mathcal{H}$ such that $L(h_0) + \rho R(h_0) < \delta_0$.
As long as $D_{\mathcal{X}}$, $D_{\mathcal{Y}}$, $J(\mathcal{L})$, and $J(\mathcal{D})$ are all finite, any global minimizer $\hat{h} \in \arg\min_{h\in\mathcal{H}} \widehat{L}(h) + \rho\widehat{R}(h)$ satisfies

$$L(\hat{h}) + \rho R(\hat{h}) \le \delta_0 + 2\bigg(\frac{24 J(\mathcal{L}) + 48\rho\big(J(\mathcal{D}) + \frac{1}{\epsilon} D_{\mathcal{X}} D_{\mathcal{Y}}\big)}{\sqrt{n}} + \big(\|\ell\|_\infty + \rho D_{\mathcal{Y}}\big)\Big(\frac{\log\frac{2}{t}}{2n}\Big)^{\frac12}\bigg)$$

with probability at least $1-2t$.

Proof. Let $F(h) \triangleq L(h) + \rho R(h)$ and $\widehat{F}$ be its empirical counterpart. The optimality of $\hat{h}$ implies

$$F(\hat{h}) = F(\hat{h}) - \widehat{F}(\hat{h}) + \widehat{F}(\hat{h}) - \widehat{F}(h_0) + \widehat{F}(h_0) - F(h_0) + F(h_0) \le \delta_0 + 2\sup_{h\in\mathcal{H}} |\widehat{F}(h) - F(h)|.$$

We have

$$\sup_{h\in\mathcal{H}} |\widehat{F}(h) - F(h)| \le \sup_{h\in\mathcal{H}} |\widehat{L}(h) - L(h)| + \rho \sup_{h\in\mathcal{H}} |\widehat{R}(h) - R(h)|. \quad \text{(A.6)}$$

We assumed the loss $\ell$ is bounded, so $\sup_{h\in\mathcal{H}} |\widehat{L}(h) - L(h)|$ satisfies the bounded-differences property and concentrates sharply around its expectation. By the bounded-differences inequality and a standard symmetrization argument,

$$\sup_{h\in\mathcal{H}} |\widehat{L}(h) - L(h)| \le 2\mathfrak{R}_n(\mathcal{L}) + \|\ell\|_\infty\Big(\frac{\log\frac{2}{t}}{2n}\Big)^{\frac12}$$

with probability at least $1-t$, where $\mathfrak{R}_n(\mathcal{L})$ is the Rademacher complexity of $\mathcal{L}$. By Dudley's entropy integral, $\mathfrak{R}_n(\mathcal{L}) \le \frac{12}{\sqrt{n}} \int_0^\infty (\log N(\mathcal{L}, \|\cdot\|_\infty, u))^{\frac12}\,du$, so the first term on the right side of (A.6) is at most

$$\sup_{h\in\mathcal{H}} |\widehat{L}(h) - L(h)| \le \frac{24 J(\mathcal{L})}{\sqrt{n}} + \|\ell\|_\infty\Big(\frac{\log\frac{2}{t}}{2n}\Big)^{\frac12}$$

with probability at least $1-t$. Theorem 3.1 implies the second term on the right side of (A.6) is at most

$$\sup_{h\in\mathcal{H}} |\widehat{R}(h) - R(h)| \le \frac{48\big(J(\mathcal{D}) + \frac{1}{\epsilon} D_{\mathcal{X}} D_{\mathcal{Y}}\big)}{\sqrt{n}} + D_{\mathcal{Y}}\Big(\frac{\log\frac{2}{t}}{2n}\Big)^{\frac12}$$

with probability at least $1-t$. We combine the two bounds with a union bound to arrive at the stated result.

B SENSEI AND BASELINES IMPLEMENTATION DETAILS

In this section we describe the implementation details of all methods and the hyperparameter selection to facilitate reproducibility of the experimental results reported in the main text.

Improving balanced accuracy. All three datasets we consider have noticeable class imbalances: over 80% of the comments in toxicity classification are non-toxic; several occupations in the Bias in Bios dataset are scarcely represented (see Figure 1 in De-Arteaga et al. (2019) for details); and about 75% of individuals in the Adult dataset make below $50k a year. Because of this class imbalance we report balanced accuracy, i.e. the average of the true positive rates of all classes, to quantify the classification performance of the different methods. To improve balanced accuracy for all methods we use balanced mini-batches following Yurochkin et al. (2020), i.e. when sampling a mini-batch we enforce that every class is equally represented.

Fair regularizer distance metric. Recall that the fair regularizer in Definition 2.1 of the main text requires a distance metric on the classifier outputs, $d_{\mathcal{Y}}(h(x), h(x'))$. This distance is also required for the implementation of Counterfactual Logit Pairing (CLP) (Garg et al., 2018). In a $K$-class problem, let $h(x)\in\mathbb{R}^K$ denote the vector of $K$ logits of a classifier for an observation $x$; then for both SenSeI and CLP we define $d_{\mathcal{Y}}(h(x), h(x')) = \frac{1}{K}\|h(x) - h(x')\|_2^2$, i.e. the mean squared difference between the logits of $x$ and $x'$. This is one of the choices studied empirically by Yang et al. (2019) for image classification. We defer exploring alternative fairness regularizer distance metrics to future work.

Data processing and classifier architecture. The data processing and classifier are shared across all methods in all experiments. In the Toxicity experiment we utilized BERT (Devlin et al., 2018) fine-tuned on a random 33% subset of the data. We downloaded the fine-tuned model from one of the Kaggle kernels.[1]
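The two implementation choices described above, the mean-squared logit distance and balanced mini-batch sampling, can be sketched as follows (a minimal NumPy sketch; the function names are ours, not from the released code):

```python
import numpy as np

def d_y(logits_a, logits_b):
    """Output distance used for SenSeI and CLP in the experiments:
    mean squared difference between the K logits of two inputs."""
    diff = np.asarray(logits_a, dtype=float) - np.asarray(logits_b, dtype=float)
    return float(np.mean(diff ** 2))

def balanced_batch_indices(labels, batch_size, rng):
    """Sample a mini-batch in which every class is equally represented
    (batch_size is assumed divisible by the number of classes)."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    per_class = batch_size // len(classes)
    idx = []
    for c in classes:
        pool = np.flatnonzero(labels == c)
        # sample with replacement so scarce classes can still fill their share
        idx.extend(rng.choice(pool, size=per_class, replace=True))
    return np.array(idx)
```

Sampling with replacement inside each class mirrors the motivation in the text: rare occupations or toxic comments would otherwise be underrepresented in most batches.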
In the Bios experiment, for each train-test split we fine-tuned BERT-Base, Uncased[2] for 3 epochs with mini-batch size 32, learning rate 2e-5, and maximum sequence length 128. In both the Toxicity and Bios experiments we obtained 768-dimensional sentence representations by average-pooling the token embeddings of the corresponding fine-tuned BERTs. Then we trained a fully connected neural network with one hidden layer of 2000 ReLU neurons using the BERT sentence representations as inputs. For the Adult experiment we followed the data processing and classifier choice (i.e. a neural network with 100 hidden units) described in Yurochkin et al. (2020).

Hyperparameter selection. In Table 4, for each hyperparameter we summarize its meaning, abbreviation, name in the code provided with the submission,[3] and the methods where it is used. To select hyperparameters for each experiment we performed a grid search on an independent train-test split. Then we fixed the selected hyperparameters and ran 10 experiment repetitions with random train-test splits (these results are reported in the main text). Hyperparameter choices for all experiments are summarized in Tables 5, 6, and 7. For the Adult experiment we duplicated the results for all prior methods from Yurochkin et al. (2020).

Fair metric. Following Yurochkin et al. (2020), we consider a fair metric of the form $d_{\mathcal{X}}(x, x') = (x - x')^T \Sigma (x - x')$. We utilize their sensitive subspace idea, writing $\Sigma = I - P_{\mathrm{ran}(A)}$, i.e. the projector onto the orthogonal complement of the subspace spanned by the columns of $A\in\mathbb{R}^{d\times k}$. Here $A$ encodes the $k$ directions of sensitive variation that should be ignored by the fair metric ($d$ is the data dimension), such as differences in sentence embeddings due to gender pronouns in the Bios experiment or due to identity (counterfactual) tokens in the Toxicity experiment.
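The sensitive-subspace fair metric $\Sigma = I - P_{\mathrm{ran}(A)}$ can be sketched as follows (function names are ours; assumes the columns of $A$ are linearly independent):

```python
import numpy as np

def fair_metric_matrix(A):
    """Sigma = I - P_ran(A): projector onto the orthogonal complement of
    the sensitive subspace spanned by the columns of A (shape d x k)."""
    Q, _ = np.linalg.qr(A)              # orthonormal basis of ran(A)
    return np.eye(A.shape[0]) - Q @ Q.T

def d_x(x, x_prime, Sigma):
    """Fair distance (x - x')^T Sigma (x - x')."""
    v = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
    return float(v @ Sigma @ v)
```

With the synthetic-experiment choice $A = [1\ 0]^T$, movement along the x-axis costs nothing under this metric, which is exactly the invariance the fair regularizer is asked to enforce.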

Synthetic experiment

In the synthetic experiment in Figure 1 we consider a fair metric ignoring variation along the x-axis, i.e. $A = [1\ 0]^T$.

Toxicity experiment

To compute $A$ we utilize the FACE algorithm of Mukherjee et al. (2020) (see Section 2.1 and Algorithm 1 in their paper). Here the groups of comparable samples are the BERT embeddings of sentences from the train data and of their modifications obtained using the 25 counterfactuals known at training time. For example, suppose we have the sentence "Some people are gay" in the train data and the list of known counterfactuals is "gay", "straight", and "muslim". Then we can obtain two comparable sentences: "Some people are straight" and "Some people are muslim". The BERT embeddings of the original and the two created sentences constitute a group of comparable sentences. Embeddings of groups of comparable sentences are the inputs to Algorithm 1 of Mukherjee et al. (2020), which consists of a per-group centering step followed by a singular value decomposition. Taking the top $k = 25$ singular vectors gives us the matrix of sensitive directions $A$ defining the fair metric.
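The per-group centering and SVD steps can be sketched as follows (a minimal sketch of our use of Algorithm 1 of Mukherjee et al. (2020); the function name is ours, and their paper is the authoritative reference):

```python
import numpy as np

def face_sensitive_directions(groups, k):
    """Center each group of comparable embeddings at its mean, stack the
    residuals, and take the top-k right singular vectors as the sensitive
    directions A (shape d x k)."""
    residuals = []
    for G in groups:                      # G: (m_i, d) embeddings of one group
        G = np.asarray(G, dtype=float)
        residuals.append(G - G.mean(axis=0, keepdims=True))
    H = np.vstack(residuals)
    _, _, Vt = np.linalg.svd(H, full_matrices=False)
    return Vt[:k].T
```

In the Toxicity experiment each group holds the embeddings of one sentence and its counterfactual variants, and $k = 25$.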

Bios experiment

We again utilize the FACE algorithm of Mukherjee et al. (2020) to obtain the fair metric. Here the counterfactual modification is based on male and female gender pronouns. For example, the sentence "He went to law school" is modified to "She went to law school". As a result, each group of comparable samples consists of a pair of bios (original and modified). Let $X\in\mathbb{R}^{n\times d}$ be the data matrix of BERT embeddings of the $n$ train bios, and let $X'\in\mathbb{R}^{n\times d}$ be that of the corresponding modified bios. In this case Algorithm 1 of Mukherjee et al. (2020) is equivalent to performing SVD on $X - X'$. We take the top $k = 25$ singular vectors to obtain the sensitive directions $A$ and the corresponding fair metric.

Adult experiment

In this experiment $A$ consists of three vectors: a vector of zeros with 1 in the gender coordinate; a vector of zeros with 1 in the race coordinate; and a vector of logistic regression coefficients trained to predict gender using the remaining features (and 0 in the gender coordinate). This sensitive subspace construction replicates the approach Yurochkin et al. (2020) utilized in their Adult experiment for obtaining the fair metric. Please see Appendix B.1 and Appendix D in their paper for additional details.
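The Adult sensitive subspace construction can be sketched as follows. To keep the sketch self-contained we fit the gender-predicting logistic regression with plain gradient descent; the function names and training details are ours, not the exact setup of Yurochkin et al. (2020):

```python
import numpy as np

def fit_logreg(X, y, steps=2000, lr=0.1):
    """Plain gradient-descent logistic regression (returns coefficients only)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def adult_sensitive_directions(X, gender_idx, race_idx):
    """A with three columns: indicator of the gender coordinate, indicator of
    the race coordinate, and logistic-regression coefficients for predicting
    gender from the remaining features (0 in the gender coordinate)."""
    d = X.shape[1]
    e_gender, e_race = np.zeros(d), np.zeros(d)
    e_gender[gender_idx], e_race[race_idx] = 1.0, 1.0
    rest = [j for j in range(d) if j != gender_idx]
    w = np.zeros(d)
    w[rest] = fit_logreg(X[:, rest], X[:, gender_idx])
    return np.stack([e_gender, e_race, w], axis=1)   # A in R^{d x 3}
```

The third column captures directions that are predictive of gender even after the explicit gender feature is removed, so the fair metric also discounts such proxy variation.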

C FAIRNESS EVALUATION METRICS DEFINITIONS

Individual fairness. To compare individual fairness we used two metrics: prediction consistency and the Counterfactual Token Fairness (CTF) score of Garg et al. (2018). The idea behind these metrics is to quantify changes in predictions when modifying the original data in ways that intuitively should not change the behavior of an individually fair classifier. In the Toxicity experiment, an individually fair classifier should not change its prediction when the word "gay" in a comment is replaced with the word "straight". For example, we expect the toxicity predictions on "Some people are gay" and "Some people are straight" to be the same. Following prior work (Dixon et al., 2018; Garg et al., 2018) we considered a set of 50 tokens[4] that should not affect the classifier when interchanged. For any comment that contains at least one of these 50 tokens we can create 49 versions of it via simple word replacement and evaluate the classifier's prediction and probability of being toxic for each of the 50 variations (including the original). Prediction consistency is the proportion of comments (with at least one of the 50 tokens) where the prediction is the same on all 50 variations. The CTF score is the average (across all comments with at least one of the 50 tokens) standard deviation of the toxicity probability across the 50 variations. We use similar individual fairness metrics for the Bios experiment. We create a single variation of each bio by interchanging "he" and "she"; "his" and "her"; "him" and "hers"; "himself" and "herself"; "mr" and "ms" or "mrs"; and the original name with a random name of a different gender sampled among those present in the data. Prediction consistency is computed as before using the 2 variations (including the original) of each bio. Note that although there are fewer variations, there are significantly more classes in the Bios dataset. The CTF score is the average (across all bios) squared Euclidean distance between the vectors of class probabilities for the 2 bio variations.
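Given model outputs for all variants of each example, both metrics reduce to a few lines (a sketch with our own function names; `pred_variants` holds predicted class ids and `prob_variants` toxicity probabilities, one row per example):

```python
import numpy as np

def prediction_consistency(pred_variants):
    """Fraction of examples whose predicted class is identical across all
    variants. pred_variants: (n_examples, n_variants) array of class ids."""
    P = np.asarray(pred_variants)
    return float(np.mean(np.all(P == P[:, :1], axis=1)))

def ctf_score_toxicity(prob_variants):
    """Average over examples of the standard deviation of the toxicity
    probability across that example's variants."""
    return float(np.mean(np.std(np.asarray(prob_variants, dtype=float), axis=1)))
```

A perfectly invariant classifier attains prediction consistency 1 and CTF score 0; larger CTF scores indicate larger prediction shifts under counterfactual token substitution.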
In the Adult experiment we compute the same individual fairness metrics as in Yurochkin et al. (2020). S-Con. (spouse consistency) is the prediction consistency when creating data variations by altering the marital status feature. GR-Con. (gender and race consistency) is the prediction consistency when creating data variations by altering the race and gender features.

Group fairness

In our experiments we observed that enforcing individual fairness also has a positive effect on group fairness metrics. In the Toxicity experiment we used accuracy parity (Zafar et al., 2017; Zhao et al., 2019) to quantify group fairness. There are multiple protected groups in the Toxicity dataset (e.g. "muslim", "white", "black", "homosexual or lesbian") that correspond to human-annotated identity contexts (not necessarily mutually exclusive). To account for this when evaluating accuracy parity, we computed accuracies for each of the protected groups and reported their standard deviation. A large standard deviation implies that the classifier is significantly more accurate on some protected groups than on others. Because of the class imbalance we also reported the standard deviation of the corresponding balanced accuracies. For the Bios experiment we used the same group fairness metrics as the prior works studying this dataset (Romanov et al., 2019; Prost et al., 2019). Here the protected attribute is binary: male or female gender. Let $\mathrm{TPR}_{0,k}$ and $\mathrm{TPR}_{1,k}$ denote the true positive rates for class $k$ for protected attributes 0 and 1. Then the TPR gap for class $k$ is $\mathrm{Gap}_k = |\mathrm{TPR}_{0,k} - \mathrm{TPR}_{1,k}|$. The summary statistics we report are $\mathrm{Gap}^{\mathrm{RMS}} = \big(\frac{1}{K}\sum_k \mathrm{Gap}_k^2\big)^{1/2}$ and $\mathrm{Gap}^{\mathrm{ABS}} = \frac{1}{K}\sum_k \mathrm{Gap}_k$. For the Adult experiment we used the same group fairness metrics as Yurochkin et al. (2020), which correspond to the $\mathrm{Gap}^{\mathrm{RMS}}$ described above and $\mathrm{Gap}^{\mathrm{MAX}} = \max_k \mathrm{Gap}_k$, evaluated with respect to race and gender (both binary protected attributes in the dataset).
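The TPR gap summaries can be computed as follows (a sketch; assumes both protected groups are represented in every class):

```python
import numpy as np

def tpr_gaps(y_true, y_pred, group, n_classes):
    """Per-class TPR gap between the two protected groups,
    Gap_k = |TPR_{0,k} - TPR_{1,k}|, plus the RMS and ABS summaries."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    gaps = np.zeros(n_classes)
    for k in range(n_classes):
        tprs = []
        for g in (0, 1):
            mask = (y_true == k) & (group == g)
            tprs.append(np.mean(y_pred[mask] == k))   # TPR_{g,k}
        gaps[k] = abs(tprs[0] - tprs[1])
    gap_rms = float(np.sqrt(np.mean(gaps ** 2)))
    gap_abs = float(np.mean(gaps))
    return gaps, gap_rms, gap_abs
```

Gap MAX, used in the Adult experiment, is simply `gaps.max()` over the same per-class gaps.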

Synthetic data experiment

Figure 1 shows the flexibility of SenSeI in trading off accuracy and individual fairness by varying $\rho$. In Figure 3 we show the lack of such flexibility in SenSR (Yurochkin et al., 2020): varying the radius of the DRO ball in their definition of individual fairness results in a horizontal decision boundary even for $\epsilon = 0$. SenSR ties the loss to fairness in its objective, and in this experiment the loss can be increased significantly for anything but a horizontal decision boundary (the fair metric allows free movement along the x-axis, so a data point can be perturbed in the horizontal direction even for $\epsilon = 0$).

Toxicity and Bios experiments

In Figure 2 we presented trade-offs between prediction consistency and balanced accuracy for SenSeI and CLP (Garg et al., 2018) on the Toxicity and Bios experiments. For completeness we also present the corresponding CTF score and balanced accuracy trade-offs in Figure 4. As with prediction consistency, we see that increasing the fair regularization strength $\rho$ makes it possible to train classifiers with better individual fairness properties. SenSeI outperforms CLP, as it trains classifiers with a lower CTF score across all values of $\rho$.



Footnotes:
[1] https://www.kaggle.com/taindow/bert-a-fine-tuning-example
[2] https://github.com/google-research/bert
[3] We will open-source the code and merge variable names with their abbreviations.
[4] https://github.com/conversationai/unintended-ml-bias-analysis/blob/master/unintended_ml_bias/bias_madlibs_data/adjectives_people.txt



Figure 1: The decision surface of a one-hidden-layer neural network trained with SenSeI as the fair regularization parameter $\rho$ varies. In this ML task, points on a horizontal line (points with identical y-values) are similar, but the training data is biased because $P_{Y|X}$ is not constant on horizontal lines. We see that fair regularization (eventually) corrects the bias in the data.

Figure 2: Balanced accuracy (BA) and prediction consistency (PC) trade-off.


Figure 3: The decision surface of a one-hidden-layer neural network trained with SenSR (Yurochkin et al., 2020) as the DRO radius $\epsilon$ varies. The problem setting is the same as in Figure 1. Even for $\epsilon = 0$, SenSR prioritizes fairness over accuracy, producing a horizontal decision surface. It is unable to achieve the intermediate behaviors of SenSeI trading accuracy against fairness as in Figure 1 (a,b,c).

Figure 4: Balanced accuracy (BA) and CTF score trade-off on Toxicity and Bios experiments

while DIF is parameterized by an $(\epsilon, \delta)$ pair. Intuitively, (2.1) enforces (approximate) invariance at all scales (at any $\epsilon > 0$), while DIF only enforces invariance at one scale (determined by the input tolerance parameter). Second, (2.1) enforces invariance uniformly on $\mathcal{X}$, while (2.2) enforces invariance on average. Although DIF seems a weaker notion of algorithmic fairness than (2.1) (average fairness vs uniform fairness), DIF is actually more stringent in some ways because the constraints in (2.2) are looser. This is evident in the Monge version of the optimization problem in (2.2): sup

Summary of the Toxicity classification experiment over 10 restarts (train-test splits), every time utilizing a random subset of 25 counterfactuals during training. Results are summarized in Table 1. SenSeI on average outperforms the other fair training methods on all individual and group fairness metrics, at the cost of slightly lower balanced accuracy. SenSR has the lowest prediction consistency score, suggesting that our fair regularization is more effective at enforcing individual fairness than adversarial training. This observation aligns with the empirical study by Yang et al. (2019) comparing various invariance-enforcing techniques for spatial robustness in image recognition.

Adult experiment over 10 restarts. Prior methods are duplicated from Yurochkin et al. (2020). Columns: BA %, S-Con., GR-Con., Gap RMS.

The Bios dataset consists of 400k textual bio descriptions and the goal is to predict one of 28 occupations. We again use BERT fine-tuned on the train data to obtain bio representations and then train a neural network with 2000 hidden neurons using each of the fair training methods.



Hyperparameter choices in Bios experiment

ACKNOWLEDGEMENTS

This paper is based upon work supported by the National Science Foundation (NSF) under grants no. 1830247 and 1916271. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the NSF.

