STATISTICAL INFERENCE FOR INDIVIDUAL FAIRNESS

Abstract

As we rely on machine learning (ML) models to make more consequential decisions, the issue of ML models perpetuating or even exacerbating undesirable historical biases (e.g. gender and racial biases) has come to the fore of the public's attention. In this paper, we focus on the problem of detecting violations of individual fairness in ML models. We formalize the problem as measuring the susceptibility of ML models to a form of adversarial attack and develop a suite of inference tools for the adversarial cost function. The tools allow auditors to assess the individual fairness of ML models in a statistically principled way: form confidence intervals for the worst-case performance differential between similar individuals and test hypotheses of model fairness with (asymptotic) non-coverage/Type I error rate control. We demonstrate the utility of our tools in a real-world case study.

1. INTRODUCTION

The problem of bias in machine learning systems is at the forefront of contemporary ML research. Numerous media outlets have scrutinized machine learning systems deployed in practice for violations of basic societal equality principles (Angwin et al., 2016; Dastin, 2018; Vigdor, 2019). In response, researchers have developed many formal definitions of algorithmic fairness along with algorithms for enforcing these definitions in ML models (Dwork et al., 2011; Hardt et al., 2016; Berk et al., 2017; Kusner et al., 2018; Ritov et al., 2017; Yurochkin et al., 2020). Despite this flurry of ML fairness research, the basic question of assessing the fairness of a given ML model in a statistically principled way remains largely unexplored.

In this paper we propose a statistically principled approach to assessing the individual fairness (Dwork et al., 2011) of ML models. One of the main benefits of our approach is that it allows the investigator to calibrate the method; i.e. it allows the investigator to prescribe a Type I error rate. Passing a test that has a guaranteed small Type I error rate is the usual standard of proof in scientific investigations because it guarantees the results are reproducible (to a certain degree). This is also highly desirable in detecting bias in ML models because it allows us to certify whether an ML model will behave fairly at test time. Our method for auditing ML models abides by this standard.

There are two main challenges associated with developing a hypothesis test for individual fairness. First, how do we formalize the notion of individual fairness in an interpretable null hypothesis? Second, how do we devise a test statistic and calibrate it so that auditors can control the Type I error rate? In this paper we propose a test motivated by the relation between individual fairness and adversarial robustness (Yurochkin et al., 2020). At a high level, our approach consists of two parts:

1. Generating unfair examples: by an unfair example we mean an example that is similar to a training example, but treated differently by the ML model. Such examples are similar to adversarial examples (Goodfellow et al., 2014), except they are only allowed to differ from a training example in certain protected or sensitive ways.

2. Summarizing the behavior of the ML model on unfair examples: we propose a loss-ratio based approach that is not only scale-free, but also interpretable. For classification problems, we propose a variation of our test based on the ratio of error rates.

1.1. RELATED WORK

At a high level, our approach is to use the difference between the empirical risk and the distributionally robust risk as a test statistic. The distributionally robust risk is the maximum risk of the ML model on similar training examples. Here similarity is measured by a fair metric that encodes our intuition of which inputs should be treated similarly by the ML model. We note that DRO has been extensively studied in the recent literature (Duchi et al., 2016; Blanchet & Murthy, 2016; Hashimoto et al., 2018), however outside of the fairness context with the exception of Yurochkin et al. (2020) and Xue et al. (2020). Yurochkin et al. (2020) focus on training fair or robust ML models instead of auditing ML models. Xue et al. (2020) also use the difference between the empirical and distributionally robust risks as a test statistic, but their test is only applicable to ML problems with finite feature spaces. This limitation severely restricts the applicability of their test. On the other hand, our test is suitable for ML problems with continuous feature spaces. We note that the technical exposition in Xue et al. (2020) is dependent on the finite feature space assumption, and in this work we develop a novel perspective of the problem that allows us to handle continuous feature spaces.

2. GRADIENT FLOW FOR FINDING UNFAIR EXAMPLES

In this section, we describe a gradient flow-based approach to finding unfair examples that form the basis of our suite of inferential tools. Imagine an auditor assessing whether an ML model f : X → Y is fair or not, where fairness is judged with respect to a fair metric d_x on inputs and a metric d_y on outputs:

d_y(f(x_1), f(x_2)) ≤ L_fair · d_x(x_1, x_2) for all x_1, x_2 ∈ X, (2.1)

for some Lipschitz constant L_fair > 0. The fair metric encodes which inputs should be treated similarly by the ML model; Section 4 provides details of picking a fair metric in our empirical studies.

To motivate our approach, we recall the distributionally robust optimization (DRO) approach to training individually fair ML models (Yurochkin et al., 2020). Let f : X → Y be an ML model and ℓ(f(x), y) : Z → R_+ be any smooth loss (e.g. the cross-entropy loss). To search for differential treatment in the ML model, Yurochkin et al. (2020) solve the optimization problem

max_{P : W(P, P_n) ≤ ε} ∫_Z ℓ(f(x), y) dP(z), (2.2)

where W is the Wasserstein distance on probability distributions on the feature space induced by the fair metric, P_n is the empirical distribution of the training data, and ε is a budget parameter that ensures the adversarial examples are close to the (original) training examples in the fair metric. Formally, this search for differential treatment checks for violations of distributionally robust fairness.

Definition 2.1 (distributionally robust fairness (DRF) (Yurochkin et al., 2020)). An ML model h : X → Y is (ε, δ)-distributionally robustly fair (DRF) with respect to the fair metric d_x iff

sup_{P : W(P, P_n) ≤ ε} ∫_Z ℓ(z, h) dP(z) - ∫_Z ℓ(z, h) dP_n(z) ≤ δ. (2.3)

The optimization problem (2.2) is an infinite-dimensional problem, but its dual is more tractable. Blanchet & Murthy show that the dual of (2.2) is

max_{P : W(P, P_n) ≤ ε} E_P[ℓ(f(x), y)] = min_{λ ≥ 0} {λε + E_{P_n}[ℓ^c_λ(x, y)]}, (2.4)

ℓ^c_λ(x_i, y_i) ≜ max_{x ∈ X} {ℓ(f(x), y_i) - λ d_x^2(x, x_i)}. (2.5)

In practice, since (2.5) is highly non-convex in general, auditors use gradient-based optimization algorithms to solve it and terminate the algorithm after a few iterations. As a result, one cannot guarantee optimality of the solution.
However, optimality is required to establish convergence guarantees for DRO algorithms. This issue is typically ignored in practice when developing training algorithms, e.g. as in Yurochkin et al. (2020), but it should be treated with care when considering the limiting distribution of the related quantities required to calibrate a test. We note that Xue et al. (2020) needed the discrete feature space assumption due to the aforementioned concern: when the feature space is discrete, it is possible to solve (2.5) optimally by simply comparing the objective value at all points of the sample space. In this paper we adapt theory to practice, i.e. we analyze the limiting distribution of (2.5) optimized for a fixed number of gradient steps. The effect of early termination can be characterized by a continuous-time approximation of the adversarial dynamics, which we call the gradient flow attack. Given a sample (x_0, y_0), the gradient flow attack solves the continuous-time ordinary differential equation (ODE)

Ẋ(t) = ∇_x {ℓ(f(X(t)), y_0) - λ d_x^2(X(t), x_0)}, X(0) = x_0, (2.6)

over time t ≥ 0. For a fixed penalty parameter λ and stopping time T > 0, the unfair map Φ : X × Y → X is

Φ(x_0, y_0) ≜ X(T). (2.7)

Here the map Φ is well-defined as long as g(x) ≜ ∇_x {ℓ(f(x), y_0) - λ d_x^2(x, x_0)} is Lipschitz, i.e., ‖g(x_1) - g(x_2)‖_2 ≤ L ‖x_1 - x_2‖_2 for some L > 0. Under this assumption, the autonomous Cauchy problem (2.6) has a unique solution and thus Φ : X × Y → X is a one-to-one function. We call Φ an unfair map because it maps samples in the data to similar areas of the sample space on which the ML model performs poorly. We note that the data in this case is an audit dataset chosen by the auditor to evaluate the individual fairness of the given model. The audit data does not need to be picked carefully and can simply be an iid sample (e.g. the testing data).
The unfair map plays a key role because it allows us to identify areas of the sample space where the model violates individual fairness, even if the audit samples themselves reveal no such violations. In the rest of the paper, we define the test statistic in terms of the unfair map instead of the optimal point of (2.5). This has two main benefits:

1. Computational tractability: evaluating the unfair map is computationally tractable because integrating initial value problems (IVPs) is a well-developed area of scientific computing (Heath, 2018). Auditors may appeal to any globally stable method for solving IVPs to evaluate the unfair map.

2. Reproducibility: the non-convex nature of (2.5) means the actual output of any attempt at solving (2.5) is highly dependent on the algorithm and the initial iterate. By defining the test statistic algorithmically, we avoid ambiguity in the algorithm and initial iterate, thereby ensuring reproducibility.

Of course, the tractability and reproducibility of the resulting statistical tests come at a cost: power. Because we are not exactly maximizing (2.5), the ability of the test statistic to detect violations of individual fairness is limited by the ability of (2.7) to find (unfair) adversarial examples.
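For concreteness, the unfair map (2.7) can be evaluated with any off-the-shelf IVP solver. The sketch below is a minimal illustration, not the paper's implementation: it uses scipy's `solve_ivp` on a toy logistic regression model with cross-entropy loss, and assumes the squared Euclidean distance as a stand-in for the fair metric.

```python
import numpy as np
from scipy.integrate import solve_ivp

def unfair_map(x0, y0, theta, lam=1.0, T=1.0):
    """Evaluate Phi(x0, y0) = X(T) of (2.7) by integrating the ODE (2.6).

    Toy setting: logistic model f(x) = sigmoid(theta @ x) with cross-entropy
    loss; d_x^2(x, x0) = ||x - x0||^2 is an illustrative placeholder for the
    fair metric."""
    def velocity(t, x):
        p = 1.0 / (1.0 + np.exp(-theta @ x))
        grad_loss = (p - y0) * theta          # gradient of the loss w.r.t. x
        grad_penalty = 2.0 * lam * (x - x0)   # gradient of lam * d_x^2(x, x0)
        return grad_loss - grad_penalty
    sol = solve_ivp(velocity, (0.0, T), np.asarray(x0, float), rtol=1e-8, atol=1e-8)
    return sol.y[:, -1]

def cross_entropy(x, y, theta):
    p = 1.0 / (1.0 + np.exp(-theta @ x))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))
```

By construction of the gradient flow, the loss at Φ(x_0, y_0) is no smaller than the loss at x_0, which is the monotonicity property (3.1) that the test statistic builds on.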

2.1. EVALUATING THE TEST STATISTIC

Solving IVPs is a well-studied problem in scientific computing, and there are many methods available. For our purposes, any globally stable method can be used to evaluate the unfair map. One simple method is the forward Euler method with a sufficiently small step size. Let 0 = t_0 < t_1 < ... < t_N = T be a partition of [0, T], and denote the step sizes by η_k = t_k - t_{k-1} for k = 1, ..., N. Initialized at x^(0) = x_0, the forward Euler method repeats the iterations

x^(k) = x^(k-1) + η_k · ∇_x {ℓ(f(x^(k-1)), y_0) - λ d_x^2(x^(k-1), x_0)} (2.8)

for k = 1, ..., N. The validity of this discretization is guaranteed by the following uniform bound.

Theorem 2.2 (Global stability of the forward Euler method). Consider the solution path {X(t)}_{0 ≤ t ≤ T} given by (2.6) and the sequence {x^(k)}_{k=0}^N given by (2.8). Let the maximal step size be h = max{η_1, ..., η_N}. Suppose that ‖J_g(x) g(x)‖_∞ ≤ m, where g(x) = ∇_x {ℓ(f(x), y_0) - λ d_x^2(x, x_0)} and J_g is the Jacobian matrix of g. Then we have

max_{k=1,...,N} ‖X(t_k) - x^(k)‖_2 ≤ (h m √d / (2L)) (e^{LT} - 1). (2.9)

Theorem 2.2 indicates that the global approximation error (2.9) decreases linearly in h, the maximal step size. Therefore, by taking h small enough, the value of the unfair map Φ can be approximated by x^(N) with arbitrarily small error.
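The Euler iteration (2.8) is straightforward to implement. Below is a minimal sketch under the same assumed toy setting as before (logistic model, cross-entropy loss, and an illustrative squared-Euclidean fair metric):

```python
import numpy as np

def euler_attack(x0, y0, theta, lam=1.0, T=1.0, N=1000):
    """Forward Euler discretization (2.8) of the gradient flow attack (2.6)
    with uniform step size eta = T / N; returns x^(N), an approximation of
    the unfair map Phi(x0, y0)."""
    eta = T / N
    x = np.asarray(x0, float).copy()
    for _ in range(N):
        p = 1.0 / (1.0 + np.exp(-theta @ x))
        # gradient of the penalized objective l(f(x), y0) - lam * ||x - x0||^2
        grad = (p - y0) * theta - 2.0 * lam * (x - x0)
        x = x + eta * grad
    return x
```

By Theorem 2.2, halving the maximal step size roughly halves the worst-case deviation from the exact flow, so N can be chosen to meet any target accuracy.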

3. TESTING INDIVIDUAL FAIRNESS OF AN ML MODEL

Although gradient flows are a good way of finding unfair examples, they do not provide an interpretable summary of the ML model's outputs. In this section, we propose a loss-ratio based approach to measuring unfairness with unfair examples. Given a sample point (x_0, y_0) ∈ Z, the gradient flow attack (2.6) always increases the regularized loss in (2.5); that is,

ℓ(f(x_0), y_0) ≤ ℓ(f(X(T)), y_0) - λ d_x^2(X(T), x_0) ≤ ℓ(f(X(T)), y_0). (3.1)

Therefore the unfair map Φ : Z → X always increases the loss value of the original sample. In other words,

ℓ(f(Φ(x, y)), y) / ℓ(f(x), y) ≥ 1 for all (x, y) ∈ Z. (3.2)

Recall that the unfair map Φ moves a sample point to similar points, as characterized by the fair metric d_x. The fair metric reflects the auditor's particular concern of individual fairness, so the original sample (x, y) and the mapped sample (Φ(x, y), y) should be treated similarly. If there is no bias/unfairness in the ML model, then we expect the ratio ℓ(f(Φ(x, y)), y) / ℓ(f(x), y) to be close to 1. With this intuition, to test whether the ML model is individually fair, the auditor considers the hypothesis testing problem

H_0 : E_P[ℓ(f(Φ(x, y)), y) / ℓ(f(x), y)] ≤ δ versus H_1 : E_P[ℓ(f(Φ(x, y)), y) / ℓ(f(x), y)] > δ, (3.3)

where P is the true data generating process and δ > 1 is a constant specified by the auditor. Using the ratio of losses in (3.3) has two main benefits:

1. Scale-freeness: multiplying the loss function by a constant factor does not change the interpretation of the null hypothesis.

2. Interpretability: the tolerance δ is the maximum loss differential above which we consider an ML model unfair. In applications where the loss can be interpreted as a measure of the negative impact of an ML model, there may be legal precedents on acceptable levels of differential impact. In our computational results, we set δ according to the four-fifths rule in US labor law.
Please see Section 3.2 for further discussion regarding δ and interpretability. We note that using the ratio of losses as the test statistic is not without its drawbacks. If the loss is small but non-zero, then the variance of the test statistic is inflated and the test loses power, but Type I error rate control is maintained.

3.1. THE AUDIT VALUE

The auditor collects a set of audit data {(x_i, y_i)}_{i=1}^n and, by solving the gradient flow attack (2.6), computes the empirical mean and variance of the ratio ℓ(f(Φ(x, y)), y) / ℓ(f(x), y):

S_n = (1/n) Σ_{i=1}^n ℓ(f(Φ(x_i, y_i)), y_i) / ℓ(f(x_i), y_i) and V_n^2 = (1/(n-1)) Σ_{i=1}^n (ℓ(f(Φ(x_i, y_i)), y_i) / ℓ(f(x_i), y_i) - S_n)^2. (3.4)

The first two empirical moments, S_n and V_n^2, are sufficient for the auditor to form confidence intervals and perform hypothesis tests for the population mean E_P[ℓ(f(Φ(x, y)), y) / ℓ(f(x), y)], the audit value.

Theorem 3.1 (Asymptotic distribution). Assume that ∇_x {ℓ(f(x), y) - λ d_x^2(x, y)} is Lipschitz in x, and that ℓ(f(Φ(x, y)), y) / ℓ(f(x), y) has finite first and second moments. If the ML model f is independent of {(x_i, y_i)}_{i=1}^n, then

√n V_n^{-1} (S_n - E_P[ℓ(f(Φ(x, y)), y) / ℓ(f(x), y)]) → N(0, 1) (3.5)

in distribution, as n → ∞.

The first inferential task is to provide confidence intervals for the audit value. The two-sided equal-tailed confidence interval for the audit value E_P[ℓ(f(Φ(x, y)), y) / ℓ(f(x), y)] with asymptotic coverage probability 1 - α is

CI_two-sided = [S_n - (z_{1-α/2}/√n) V_n, S_n + (z_{1-α/2}/√n) V_n], (3.6)

where z_q is the q-th quantile of the standard normal distribution N(0, 1).

Corollary 3.2 (Asymptotic coverage of the two-sided confidence interval). Under the assumptions in Theorem 3.1,

lim_{n→∞} P(E_P[ℓ(f(Φ(x, y)), y) / ℓ(f(x), y)] ∈ [S_n - (z_{1-α/2}/√n) V_n, S_n + (z_{1-α/2}/√n) V_n]) = 1 - α. (3.7)

The second inferential task is to test restrictions on the audit value, that is, the hypothesis testing problem (3.3). Similar to the two-sided confidence interval (3.6), we may also form a one-sided confidence interval for the audit value with asymptotic coverage probability 1 - α, i.e.,

CI_one-sided = [S_n - (z_{1-α}/√n) V_n, ∞). (3.8)

The one-sided confidence interval (3.8) allows us to test simple restrictions of the form

E_P[ℓ(f(Φ(x, y)), y) / ℓ(f(x), y)] ≤ δ (3.9)

by checking whether δ falls in the (1 - α)-level confidence region. By the duality between confidence intervals and hypothesis tests, this test has asymptotic Type I error rate at most α. With this intuition, a valid test is:

Reject H_0 if T_n > δ, where T_n ≜ S_n - (z_{1-α}/√n) V_n (3.10)

is the test statistic.

Corollary 3.3 (Asymptotic validity of the test). Under the assumptions in Theorem 3.1,

1. if E_P[ℓ(f(Φ(x, y)), y) / ℓ(f(x), y)] ≤ δ, then lim_{n→∞} P(T_n > δ) ≤ α;

2. if E_P[ℓ(f(Φ(x, y)), y) / ℓ(f(x), y)] > δ, then lim_{n→∞} P(T_n > δ) = 1.
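Given per-sample losses before and after the attack, the confidence interval (3.6) and the test (3.10) reduce to a few lines of code. A minimal sketch (the array names are illustrative):

```python
import numpy as np
from statistics import NormalDist

def audit(loss_orig, loss_attacked, delta=1.25, alpha=0.05):
    """Compute S_n and V_n of (3.4), the two-sided CI (3.6), and the
    one-sided test statistic T_n of (3.10); reject H_0 iff T_n > delta."""
    r = np.asarray(loss_attacked, float) / np.asarray(loss_orig, float)
    n = r.size
    S_n = r.mean()
    V_n = r.std(ddof=1)                      # square root of V_n^2 in (3.4)
    z2 = NormalDist().inv_cdf(1 - alpha / 2)
    z1 = NormalDist().inv_cdf(1 - alpha)
    ci = (S_n - z2 * V_n / np.sqrt(n), S_n + z2 * V_n / np.sqrt(n))
    T_n = S_n - z1 * V_n / np.sqrt(n)
    return {"S_n": S_n, "ci": ci, "T_n": T_n, "reject": T_n > delta}
```

Note that T_n is the lower endpoint of the one-sided interval (3.8), so rejecting when T_n > δ is exactly the duality-based test described above.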

3.2. TEST INTERPRETABILITY, TEST TOLERANCE AND AN ALTERNATIVE FORMULATION

To utilize our test, an auditor should set a tolerance (or threshold) parameter δ. It is important that the auditor can understand and interpret the meaning of the threshold they choose. The appropriate choice of δ can vary by application; here we consider a general choice motivated by the US Equal Employment Opportunity Commission's four-fifths rule, which states that the "selection rate for any race, sex, or ethnic group [must be at least] four-fifths (4/5) (or eighty percent) of the rate for the group with the highest rate". Rephrasing this rule in the context of the loss ratio, we consider the following: the largest permissible loss increase on an individual should be at most five-fourths (5/4) of its original loss. This corresponds to the null hypothesis threshold δ = 1.25. The aforementioned wording of the four-fifths rule is based on the demographic parity group fairness definition; however, it can be generalized to other group fairness definitions as follows: "performance of the model on any protected group must be at least four-fifths of the best performance across groups". Depending on what we mean by performance, we can obtain other group fairness definitions such as accuracy parity when auditing an ML classification system.

In our test, we use the loss ratio because the loss is a general measure of the performance of an ML system. However, in the context of supervised learning, the loss is often a mathematically convenient proxy for the ultimate quantity of interest, the error rate. For classification problems, it is possible to modify our test statistic based on the ratio of error rates instead of losses (maintaining δ = 1.25 according to the five-fourths rule). Let ℓ_{0,1} be the 0-1 loss. Naively, we could consider the mean of the ratio ℓ_{0,1}(f(Φ(x, y)), y) / ℓ_{0,1}(f(x), y) as a test statistic, but this is problematic because the ratio is not well-defined when the classifier correctly classifies x.
To avoid this issue, we propose considering the ratio of means (instead of the mean of the ratio) as a test statistic. Formally, we wish to test the hypothesis

H_0 : E_P[ℓ_{0,1}(f(Φ(x, y)), y)] / E_P[ℓ_{0,1}(f(x), y)] ≤ δ versus H_1 : E_P[ℓ_{0,1}(f(Φ(x, y)), y)] / E_P[ℓ_{0,1}(f(x), y)] > δ. (3.11)

We emphasize that the gradient flow attack is still performed with respect to a smooth loss function; we merely use the 0-1 loss to evaluate the accuracy of the classifier on the original and adversarial examples. The auditor collects a set of audit data {(x_i, y_i)}_{i=1}^n and, by performing the gradient flow attack (2.6), computes the ratio of empirical risks and the empirical covariance matrix of the pairs v_i = (ℓ_{0,1}(f(Φ(x_i, y_i)), y_i), ℓ_{0,1}(f(x_i), y_i))^T:

S̃_n = (1/n) Σ_{i=1}^n ℓ_{0,1}(f(Φ(x_i, y_i)), y_i) / ((1/n) Σ_{i=1}^n ℓ_{0,1}(f(x_i), y_i)) and Ṽ_n ≜ (1/n) Σ_{i=1}^n (v_i - v̄)(v_i - v̄)^T, (3.12)

where v̄ = (1/n) Σ_{i=1}^n v_i. Let A_n and B_n be the numerator and denominator of S̃_n. Parallel to the intuition of (3.10), the proposed test is:

Reject H_0 if T̃_n > δ, where T̃_n ≜ S̃_n - (z_{1-α} / (√n B_n^2)) √(A_n^2 (Ṽ_n)_{2,2} + B_n^2 (Ṽ_n)_{1,1} - 2 A_n B_n (Ṽ_n)_{1,2}) (3.13)

is the test statistic. Please see Appendix C for the asymptotic normality theorem and Type I error guarantees.
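A sketch of the error-rates version, taking arrays of 0-1 losses on the original and attacked points as inputs. Here Ṽ_n is taken to be the empirical covariance matrix of the pair, an assumption consistent with the CLT variance in Appendix C:

```python
import numpy as np
from statistics import NormalDist

def error_rate_audit(err_orig, err_attacked, delta=1.25, alpha=0.05):
    """Compute the ratio of empirical risks S~_n and the test statistic
    T~_n of (3.13); reject H_0 iff T~_n > delta."""
    a = np.asarray(err_attacked, float)   # 0-1 losses on attacked examples (numerator)
    b = np.asarray(err_orig, float)       # 0-1 losses on original examples (denominator)
    n = a.size
    A, B = a.mean(), b.mean()
    V = np.cov(np.stack([a, b]))          # 2x2 empirical covariance; V[0,0] = (V~)_{1,1}
    z = NormalDist().inv_cdf(1 - alpha)
    se = np.sqrt(A**2 * V[1, 1] + B**2 * V[0, 0] - 2 * A * B * V[0, 1]) / (np.sqrt(n) * B**2)
    T = A / B - z * se
    return {"S_n": A / B, "T_n": T, "reject": T > delta}
```

The standard error term mirrors the denominator of the normalization in Theorem C.1, so the rejection rule has the same duality-with-a-one-sided-interval structure as (3.10).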

4. INDIVIDUAL FAIRNESS TESTING IN PRACTICE

In our experiments we first verify our methodology in simulations and then present a case study of testing individual fairness on the Adult dataset (Dua & Graff, 2017). An auditor performing the testing needs to make three important choices: the fair metric d_x(·, ·); the testing threshold δ, to have a concrete null hypothesis; and the level of significance (the maximum allowed Type I error of the hypothesis test, i.e. the p-value cutoff) for deciding whether the null (the classifier is fair) should be rejected. The fair metric can be provided by a subject expert, as considered in our simulation studies, or estimated from data using fair metric learning techniques proposed in the literature, as we do in the Adult experiment. Following the discussion in Section 3.2, we set δ = 1.25. For the significance level, the typical choice in many sciences utilizing statistical inference is 0.05, which we follow in our experiments; however, this is not a universal rule and should be adjusted in practice when needed.

4.1. STUDYING TEST PROPERTIES WITH SIMULATIONS

We first investigate the ability of our test to identify an unfair classifier, explore robustness to fair metric misspecification (verifying Theorem A.1), and discuss the implications of the choice of the null hypothesis parameter δ. We simulate a 2-dimensional binary classification dataset with two subgroups of observations that differ only in the first coordinate (we provide additional details and a data visualization in Appendix E). One of the subgroups is underrepresented in the training data, yielding a logistic regression classifier that overfits the majority subgroup and consequently differentiates data points (i.e. "individuals") along both coordinates. Recall that a pair of points that differ only in the first coordinate are considered similar by the problem design, i.e. their fair distance is 0, and predictions for such points should be the same to satisfy individual fairness.

Fair metric. Our expert knowledge of the problem allows us to specify the fair metric d_x^2((x_1, x_2), (x_1', x_2')) = (x_2 - x_2')^2. Evidently, an individually fair logistic regression should have first-coordinate coefficient θ_1 = 0, while the intercept and θ_2 can be arbitrary. The more θ_1 differs from 0, the larger the individual fairness violation. In Figure 1 (left) we visualize the heatmap of the test statistic (3.10) over a grid of (θ_1, θ_2) values (the intercept is estimated from the data for each coefficient pair). Recall that when this value exceeds δ = 1.25 our test rejects the null (fairness) hypothesis (red heatmap area). Our test aligns well with the intuitive interpretation of the problem, i.e. the test statistic increases as θ_1 deviates from 0 and is independent of the θ_2 value.

Metric misspecification

We also consider fair metric misspecification in the center and right heatmaps of Figure 1. Here the discounted movement direction in the metric is rotated, i.e. d_x^2((x_1, x_2), (x_1', x_2')) = (sin^2 β)(x_1 - x_1')^2 + (cos^2 β)(x_2 - x_2')^2 for β = 5° (center) and β = 10° (right). We see that the test starts to reject fairness of the models with larger θ_2 magnitudes due to the misspecification of the metric; however, it remains robust in terms of identifying θ_1 = 0 as the fair model.

Null hypothesis threshold. Finally, we assess the null hypothesis choice δ = 1.25. We saw that the test permits (approximately) θ_1 < 1.5; whether this causes minor or severe individual fairness violations depends on the problem at hand. An auditor who has access to expert knowledge for defining the fair metric and desires stricter individual fairness guarantees may consider smaller values of δ. In this simulated example, we see that as δ approaches 1, the test constructed with the correct fair metric (Figure 1, left) will reject all models with θ_1 ≠ 0, while permitting any θ_2 values.
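The correct and rotated metrics above can both be written as quadratic forms d_x^2(x, x') = (x - x')^T M (x - x'); a small sketch (the function names are our own):

```python
import numpy as np

def metric_matrix(beta_deg):
    """Quadratic form M of the (possibly misspecified) fair metric:
    d_x^2(x, x') = (sin^2 b)(x_1 - x_1')^2 + (cos^2 b)(x_2 - x_2')^2.
    beta_deg = 0 recovers the correct metric (x_2 - x_2')^2."""
    b = np.deg2rad(beta_deg)
    return np.diag([np.sin(b)**2, np.cos(b)**2])

def fair_dist_sq(x, xp, beta_deg=0.0):
    d = np.asarray(x, float) - np.asarray(xp, float)
    return float(d @ metric_matrix(beta_deg) @ d)
```

At β = 0, two points differing only in the first coordinate are at fair distance 0; at β = 10° they are no longer treated as identical, which is what drives the spurious rejections at larger θ_2 magnitudes in Figure 1 (right).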

4.2. REAL DATA CASE-STUDY

We present a scenario of how our test can be utilized in practice. To this end, we consider the income classification problem on the Adult dataset (Dua & Graff, 2017). The goal is to predict whether a person earns over $50k per year using features such as education, hours worked per week, etc. (we exclude race and gender from the predictor variables; please see Appendix F and the code in the supplementary materials for data processing details).

Learning the fair metric. In lieu of expert knowledge to define a fair metric, we utilize the technique of Yurochkin et al. (2020) to learn a fair metric from data. They propose a fair metric of the form d_x^2(x, x') = ⟨x - x', P(x - x')⟩, where P is the projection matrix orthogonal to a "sensitive" subspace. Similar to their Adult experiment, we learn this subspace by fitting two logistic regression classifiers to predict gender and race, and taking the span of the coefficient vectors (i.e. the vectors orthogonal to the decision boundaries) as the sensitive subspace. The intuition behind this metric is that the subspace captures variation in the data pertaining to racial and gender differences. A fair metric should treat individuals that only differ in gender and/or race as similar; therefore it assigns 0 distance to any pair of individuals that differ only by a vector in the sensitive subspace (similar to the fair metric we used in the simulations, which discounts any variation along the first coordinate). Our hypothesis test is an audit procedure performed at test time, so we learn the fair metric using test data to examine the fairness of several methods that only have access to an independent train set to learn their decision functions.
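The projection-based metric can be sketched as follows. The coefficient vectors here are placeholders; in the experiment they come from the logistic regressions predicting gender and race:

```python
import numpy as np

def sensitive_projection(coef_vectors):
    """Projection matrix P orthogonal to the span of the 'sensitive'
    coefficient vectors, so that d_x^2(x, x') = (x - x')^T P (x - x')
    assigns 0 distance to pairs differing only within that span."""
    A = np.stack(coef_vectors, axis=1).astype(float)  # d x k sensitive directions
    Q, _ = np.linalg.qr(A)                            # orthonormal basis of the subspace
    return np.eye(A.shape[0]) - Q @ Q.T               # projector onto the orthogonal complement
```

Any displacement lying in the sensitive subspace is annihilated by P, while displacements orthogonal to it keep their full Euclidean length.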

Results

We test 4 classifiers: a baseline NN, the group fairness Reductions algorithm (Agarwal et al., 2018), the individual fairness SenSR algorithm (Yurochkin et al., 2020), and a basic Project algorithm that pre-processes the data by projecting out the "sensitive" subspace. For the SenSR fair metric and for Project we use training data to learn the "sensitive" subspace. All methods are trained to account for the class imbalance in the data, and we report test balanced accuracy as a performance measure following prior studies of this dataset (Yurochkin et al., 2020; Romanov et al., 2019). Results of the 10 experiment repetitions are summarized in Table 1 (see Table 3 in Appendix F.6 for the standard deviations). We compare group fairness using the average odds difference (AOD) (Bellamy et al., 2018) for gender and race. The significance level for null hypothesis rejection is 0.05 and δ = 1.25 (see Appendix F and the code for details regarding the algorithms and comparison metrics).

The baseline exhibits clear violations of both individual fairness (T_n, T̃_n well above 1.25 and a rejection proportion of 1) and group fairness (both AODs are far from 0). Simple projection pre-processing improves individual fairness; however, the null is still rejected in the majority of experiment repetitions (the balanced accuracy improvement is accidental). The more sophisticated individual fairness algorithm SenSR does perform well according to our test, with a test statistic close to 1 (the ideal value), and the test fails to reject individual fairness of SenSR every time. Lastly, we examine the trade-off between individual and group fairness. Enforcing group fairness with Reductions leads to the best AOD values; however, it worsens individual fairness (compared to the baseline) as measured by the test statistic. On the contrary, enforcing individual fairness with SenSR also improves the group fairness metrics, although at the cost of the lowest balanced accuracy. We present a similar study of the COMPAS dataset in Appendix G.
Results there follow the same pattern, with the exception of Reductions slightly improving individual fairness in comparison to the baseline, although it is still rejected by our test in all experiment repetitions. The loss-ratio test T_n and the error-rate ratio test T̃_n give almost identical results. The only difference is that the loss-ratio test rejected Project in 9 out of 10 trials, while the error-rate ratio test did so in 8 out of 10 trials.

Setting the stopping time T. Recall that the test statistic T_n depends on the stopping time T of the gradient flow attack (2.6). Corollary 3.3 guarantees Type I error control for any T, i.e. it controls the error of rejecting a fair classifier regardless of the choice of stopping time. Theoretical guarantees for the Type II error, i.e. failing to reject an unfair classifier, are hard to provide in general (one needs to know the expected value of the loss ratio for a given T and a specific model). In practice, we recommend running the gradient flow attack long enough (based on the available computation budget) to keep the Type II error small. In our Adult experiment we set T = 500. In Figure 2

5. DISCUSSION AND CONCLUSION

We developed a suite of inferential tools for detecting and measuring individual bias in ML models. The tools require access to the gradients/parameters of the ML model, so they are most suitable for internal investigators. We hope our tools can help auditors verify the individual fairness of ML models and help researchers benchmark individual fairness algorithms. Future work on learning flexible individual fairness metrics from data will expand the range of applicability of our test. We demonstrated the utility of our tools by using them to reveal gender and racial biases in an income prediction model.

In our experiments, we discovered that enforcing group fairness may incur individual bias. In other words, an algorithm may sacrifice individual fairness in order to preserve parity of certain metrics across groups. For example, one of the earliest methods for enforcing group fairness explicitly treated examples from the majority and minority groups differently (Hardt et al., 2016). We conjecture that even the more modern methods for enforcing group fairness could be forcibly balancing outcomes among demographic groups, leading to instances where similar individuals in different demographic groups are treated differently. The possible trade-off between individual and group fairness warrants further investigation, but is beyond the scope of this paper.

B.1 PROOF OF THEOREM 2.2

Proof of Theorem 2.2. By Taylor's theorem, for each coordinate i = 1, ..., d there is some t̃^(i)_{k-1} ∈ [t_{k-1}, t_k] such that

X^(i)(t_k) = X^(i)(t_{k-1}) + η_k Ẋ^(i)(t_{k-1}) + (1/2) η_k^2 Ẍ^(i)(t̃^(i)_{k-1}). (B.1)

Compactly, we have

X(t_k) = X(t_{k-1}) + η_k Ẋ(t_{k-1}) + (1/2) η_k^2 (Ẍ^(1)(t̃^(1)_{k-1}), ..., Ẍ^(d)(t̃^(d)_{k-1}))^T. (B.2)

For k = 1, ..., N, we let

T_k ≜ (X(t_k) - X(t_{k-1})) / η_k - g(X(t_{k-1})) (B.3)
= (1/η_k) (X(t_k) - X(t_{k-1}) - η_k Ẋ(t_{k-1})). (B.4)

Note that

‖Ẍ(t)‖_∞ = ‖J_g(X(t)) g(X(t))‖_∞ ≤ m, (B.5)

and η_k ≤ h, so for k = 1, ..., N we have

‖T_k‖_2 = (1/2) η_k ‖(Ẍ^(1)(t̃^(1)_{k-1}), ..., Ẍ^(d)(t̃^(d)_{k-1}))‖_2 (B.6)
≤ (1/2) η_k √d ‖(Ẍ^(1)(t̃^(1)_{k-1}), ..., Ẍ^(d)(t̃^(d)_{k-1}))‖_∞ (B.7)
≤ h m √d / 2.
(B.8)

Let e_k = X(t_k) - x^(k). From (B.2)-(B.4) and the Euler iteration (2.8),

e_k = X(t_{k-1}) - x^(k-1) + η_k (g(X(t_{k-1})) - g(x^(k-1))) + η_k T_k (B.9)
= e_{k-1} + η_k (g(X(t_{k-1})) - g(x^(k-1))) + η_k T_k. (B.10)

Since g is L-Lipschitz, we have

‖e_k‖_2 ≤ ‖e_{k-1}‖_2 + η_k L ‖e_{k-1}‖_2 + η_k h m √d / 2. (B.11)

Then,

‖e_k‖_2 + h m √d / (2L) ≤ (1 + L η_k) (‖e_{k-1}‖_2 + h m √d / (2L)) (B.12)
≤ e^{L η_k} (‖e_{k-1}‖_2 + h m √d / (2L)). (B.13)

For k = 1, ..., N,

‖e_k‖_2 + h m √d / (2L) ≤ e^{L(η_1 + ... + η_k)} (‖e_0‖_2 + h m √d / (2L)),

and since e_0 = 0 and η_1 + ... + η_k ≤ T, we conclude ‖e_k‖_2 ≤ (h m √d / (2L)) (e^{LT} - 1).

B.2 PROOFS OF COROLLARY 3.2 AND COROLLARY 3.3

Proof of Corollary 3.2. By the asymptotic distribution given by Theorem 3.1,

P(E_P[ℓ(f(Φ(x, y)), y) / ℓ(f(x), y)] ∈ [S_n - (z_{1-α/2}/√n) V_n, S_n + (z_{1-α/2}/√n) V_n]) (B.19)
= P(z_{α/2} ≤ √n V_n^{-1} (S_n - E_P[ℓ(f(Φ(x, y)), y) / ℓ(f(x), y)]) ≤ z_{1-α/2}) → 1 - α (B.20)

as n → ∞.

Proof of Corollary 3.3. Let τ = E_P[ℓ(f(Φ(x, y)), y) / ℓ(f(x), y)]. By the asymptotic distribution given by Theorem 3.1,

P(T_n > δ) = 1 - P(S_n - (z_{1-α}/√n) V_n ≤ δ) (B.21)
= 1 - P(√n V_n^{-1} (S_n - τ) ≤ z_{1-α} + √n V_n^{-1} (δ - τ)) (B.22)
→ {0 if τ < δ; α if τ = δ; 1 if τ > δ} (B.23)

as n → ∞.

B.3 PROOF OF THEOREM IN APPENDIX A

Proof of Theorem A.1. For any (x_0, y_0) ∈ Z, let {X_1(t)}_{0 ≤ t ≤ T} solve

Ẋ_1(t) = ∇_x {ℓ(f(X_1(t)), y_0) - λ d_1^2(X_1(t), x_0)}, X_1(0) = x_0, (B.24)

and {X_2(t)}_{0 ≤ t ≤ T} solve

Ẋ_2(t) = ∇_x {ℓ(f(X_2(t)), y_0) - λ d_2^2(X_2(t), x_0)}, X_2(0) = x_0. (B.25)

Consider

y(t) = ‖X_1(t) - X_2(t)‖_2^2 + λ D^2 δ / L. (B.26)

We have

y(0) = λ D^2 δ / L, y(t) ≥ 0, (B.27)

and

ẏ(t) = 2 ⟨X_1(t) - X_2(t), Ẋ_1(t) - Ẋ_2(t)⟩ (B.28)
≤ 2 ‖X_1(t) - X_2(t)‖_2 · ‖Ẋ_1(t) - Ẋ_2(t)‖_2 (B.29)
≤ 2 ‖X_1(t) - X_2(t)‖_2 · {L ‖X_1(t) - X_2(t)‖_2 + λ D δ} (B.30)
≤ 2L (‖X_1(t) - X_2(t)‖_2^2 + λ D^2 δ / L) (B.31)
= 2L · y(t). (B.32)

By Gronwall's inequality,

y(T) ≤ e^{2LT} y(0), (B.33)

that is,

‖X_1(T) - X_2(T)‖_2^2 ≤ (λ D^2 δ / L)(e^{2LT} - 1), (B.34)

which implies

‖Φ_1(x_0, y_0) - Φ_2(x_0, y_0)‖_2 = ‖X_1(T) - X_2(T)‖_2 ≤ √(λ δ / L) D e^{LT}. (B.35)

By Assumption A3, we have

|ℓ(f(Φ_1(x_0, y_0)), y_0) / ℓ(f(x_0), y_0) - ℓ(f(Φ_2(x_0, y_0)), y_0) / ℓ(f(x_0), y_0)| ≤ (L_0 D / c) √(λ δ / L) e^{LT}. (B.36)

C ASYMPTOTIC NORMALITY AND ASYMPTOTIC VALIDITY OF THE ERROR-RATES RATIO TEST

The auditor collects a set of audit data $\{(x_i, y_i)\}_{i=1}^n$ and, by performing the gradient flow attack (2.6), computes the ratio of empirical risks
$$\tilde S_n = \frac{\frac1n \sum_{i=1}^n \ell_{0,1}(f(\Phi(x_i, y_i)), y_i)}{\frac1n \sum_{i=1}^n \ell_{0,1}(f(x_i), y_i)}$$
and the empirical second-moment matrix
$$\tilde V_n \triangleq \frac1n \sum_{i=1}^n \begin{pmatrix} \ell_{0,1}(f(\Phi(x_i, y_i)), y_i) \\ \ell_{0,1}(f(x_i), y_i) \end{pmatrix} \begin{pmatrix} \ell_{0,1}(f(\Phi(x_i, y_i)), y_i) \\ \ell_{0,1}(f(x_i), y_i) \end{pmatrix}^\top. \quad (\text{C.1})$$
Let $A_n$ and $B_n$ be the numerator and denominator of $\tilde S_n$. First we derive the limiting distribution of a calibrated test statistic for the error-rates ratio test.

Theorem C.1 (Asymptotic distribution). Assume that $\nabla_x\{\ell(f(x), y) - \lambda d_x^2(x, x_0)\}$ is Lipschitz in $x$. If the ML model $f$ is independent of $\{(x_i, y_i)\}_{i=1}^n$, then
$$\frac{\sqrt n\, B_n^2}{\sqrt{A_n^2 (\tilde V_n)_{2,2} + B_n^2 (\tilde V_n)_{1,1} - 2 A_n B_n (\tilde V_n)_{1,2}}}\Big(\tilde S_n - \frac{E_P[\ell_{0,1}(f(\Phi(x, y)), y)]}{E_P[\ell_{0,1}(f(x), y)]}\Big) \to N(0, 1) \quad (\text{C.2})$$
in distribution, as $n \to \infty$.

Type I error rate control is formalized in the following:

Corollary C.2 (Asymptotic validity of test). Under the assumptions in Theorem C.1,
1. if $E_P[\ell_{0,1}(f(\Phi(x, y)), y)]/E_P[\ell_{0,1}(f(x), y)] \le \delta$, we have $\lim_{n\to\infty} P(\tilde T_n > \delta) \le \alpha$;
2. if $E_P[\ell_{0,1}(f(\Phi(x, y)), y)]/E_P[\ell_{0,1}(f(x), y)] > \delta$, we have $\lim_{n\to\infty} P(\tilde T_n > \delta) = 1$.

Proof of Theorem C.1. By the central limit theorem (CLT),
$$\sqrt n\,\Big(\begin{pmatrix} A_n \\ B_n \end{pmatrix} - \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}\Big) \to N(0, \Sigma) \quad (\text{C.3})$$
in distribution, for finite $\mu_x$, $\mu_y$, and $\Sigma$ with finite entries. Let $g(x, y) = x/y$; then $\nabla g(\mu_x, \mu_y) = \mu_y^{-2}(\mu_y, -\mu_x)^\top$. By the delta method,
$$\sqrt n\,\big(g(A_n, B_n) - g(\mu_x, \mu_y)\big) \to N\big(0, \nabla g(\mu_x, \mu_y)^\top \Sigma\, \nabla g(\mu_x, \mu_y)\big), \quad (\text{C.4})$$
which implies
$$\sqrt n\,\Big(\tilde S_n - \frac{E_P[\ell_{0,1}(f(\Phi(x, y)), y)]}{E_P[\ell_{0,1}(f(x), y)]}\Big) \to N\Big(0, \frac{\mu_x^2 \Sigma_{2,2} + \mu_y^2 \Sigma_{1,1} - 2 \mu_x \mu_y \Sigma_{1,2}}{\mu_y^4}\Big), \quad (\text{C.5})$$
or
$$\frac{\sqrt n\, \mu_y^2}{\sqrt{\mu_x^2 \Sigma_{2,2} + \mu_y^2 \Sigma_{1,1} - 2 \mu_x \mu_y \Sigma_{1,2}}}\Big(\tilde S_n - \frac{E_P[\ell_{0,1}(f(\Phi(x, y)), y)]}{E_P[\ell_{0,1}(f(x), y)]}\Big) \to N(0, 1). \quad (\text{C.6})$$
The claim follows by Slutsky's theorem since $A_n$, $B_n$, and the entries of $\tilde V_n$ are consistent.

The simulated data used in Appendix E are generated according to
$$G_i \overset{iid}{\sim} \mathrm{Bernoulli}(0.1), \qquad X_i \sim N\Big((1 - G_i)\begin{pmatrix} -1.5 \\ 0 \end{pmatrix} + G_i \begin{pmatrix} 1.5 \\ 0 \end{pmatrix},\; (0.25)^2 I_2\Big),$$
$$Y_i = \mathbb 1\Big\{\Big((1 - G_i)\begin{pmatrix} -0.2 \\ -0.01 \end{pmatrix} + G_i \begin{pmatrix} 0.2 \\ -0.01 \end{pmatrix}\Big)^\top X_i + N(0, 10^{-4}) > 0\Big\} \quad \text{for } i = 1, \dots, 400. \quad (\text{E.1})$$
The data are plotted in Figure 3. As seen in the figure, the feature vectors for the two groups differ mainly in the first coordinate, so the discounted movement direction is $(1, 0)^\top$.
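Returning to the error-rates ratio test of Appendix C: the statistic $\tilde S_n$ and its delta-method standard error from (C.2) are straightforward to compute from per-example losses. A minimal sketch on synthetic loss arrays (the `loss_attacked` / `loss_clean` inputs are illustrative placeholders, not losses from a real attacked model):

```python
import numpy as np
from statistics import NormalDist

def ratio_lower_bound(loss_attacked, loss_clean, alpha=0.05):
    """Delta-method one-sided lower bound for the population risk ratio
    E[l(f(Phi(x,y)),y)] / E[l(f(x),y)], following (C.1)-(C.2)."""
    a = np.asarray(loss_attacked, float)
    b = np.asarray(loss_clean, float)
    n = len(a)
    A, B = a.mean(), b.mean()
    S = A / B                                      # ratio statistic S~_n
    V = np.vstack([a, b]) @ np.vstack([a, b]).T / n  # second-moment matrix V~_n (C.1)
    var = (A**2 * V[1, 1] + B**2 * V[0, 0] - 2 * A * B * V[0, 1]) / B**4
    z = NormalDist().inv_cdf(1 - alpha)            # z_{1-alpha}
    return S, S - z * np.sqrt(var / n)

rng = np.random.default_rng(0)
clean = rng.uniform(0.5, 1.0, 400)
attacked = 1.5 * clean + rng.normal(0, 0.05, 400)  # attack inflates the loss ~1.5x
S, lower = ratio_lower_bound(attacked, clean)
# reject H0 (population ratio <= delta) at level alpha whenever lower > delta
```

With a true ratio of about 1.5, the lower confidence bound sits just below the point estimate, so a threshold such as $\delta = 1.25$ would be rejected.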

E.2 CLASSIFIERS

For comparison purposes we consider logistic models of the form
$$f_{b, w_1, w_2}(x) = \operatorname{expit}\big(b + w_1 X_i^{(1)} + w_2 X_i^{(2)}\big), \quad (\text{E.2})$$
where $\operatorname{expit}(x) \triangleq \frac{e^x}{1 + e^x}$ and the weights are chosen on the grid $(w_1, w_2) \in \{-4, -3.6, \dots, 4\}^2$. For a given $(w_1, w_2)$ the bias $b$ is chosen as
$$b(w_1, w_2) \triangleq \operatorname*{arg\,min}_{b \in \mathbb R} \sum_{i=1}^{400} \ell(f_{b, w_1, w_2}(X_i), Y_i),$$
where $\ell$ is the logistic loss.
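The intercept fit in (E.2) is a one-dimensional minimization of the logistic loss for fixed weights. A minimal sketch using a grid search over $b$ (the grid resolution and the synthetic data here are illustrative assumptions, not the authors' exact setup):

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_intercept(X, y, w, grid=np.linspace(-5, 5, 2001)):
    """Choose b minimizing the total logistic loss for fixed weights w,
    mirroring the construction b(w1, w2) in (E.2)."""
    scores = X @ w                                  # fixed linear part w1*x1 + w2*x2
    # logistic loss with labels in {0, 1}: log(1 + exp(-(2y - 1) * score))
    losses = [np.sum(np.log1p(np.exp(-(2 * y - 1) * (b + scores)))) for b in grid]
    return grid[int(np.argmin(losses))]

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + 0.3 + rng.normal(scale=0.1, size=400) > 0).astype(float)
b = fit_intercept(X, y, np.array([4.0, 0.0]))
# with slope 4 along x1 and a decision boundary near x1 = -0.3, b lands near 1.2
```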

E.3 LOWER BOUND

To calculate the lower bounds we use Algorithm 1 with the following choices: the loss $\ell$ and the classifiers $f$ are as described in the previous subsection, and the fair distances are as described in Section 4. We choose the regularizer $\lambda = 100$, the number of steps $T = 400$, and the step sizes $\eta_t = 0.02/t^{2/3}$.

F SUPPLEMENTARY DETAILS FOR THE ADULT EXPERIMENT

F.1 DATA

The dataset is described in Dua & Graff (2017). We remove fnlwgt, education, native-country from the features. race and sex are considered as protected attributes and are not included in the feature vectors for classification. race is treated as a binary attribute: White and non-White. We remove any data point with a missing entry and end up with 45222 data points.

F.2 FAIR METRIC

To learn the sensitive subspace, we perform logistic regression of race and sex on the other features, and use the weight vectors as the vectors spanning the sensitive subspace $H$. The fair metric is then obtained as $d_x^2(x_1, x_2) = \|(I - \Pi_H)(x_1 - x_2)\|_2^2$, where $\Pi_H$ is the orthogonal projection onto $H$.
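The construction above can be sketched in a few lines: orthonormalize the learned weight vectors, form the complementary projection, and evaluate squared distances. A minimal sketch (the toy subspace below is an assumption for illustration; in the experiment the columns of `A` would be the logistic-regression weight vectors for race and sex):

```python
import numpy as np

def fair_metric_from_subspace(A):
    """Build d_x^2(x1, x2) = ||(I - Pi_H)(x1 - x2)||_2^2, where H is
    spanned by the columns of A."""
    Q, _ = np.linalg.qr(A)                 # orthonormal basis of H
    P = np.eye(A.shape[0]) - Q @ Q.T       # projection onto the complement of H
    return lambda x1, x2: float((x1 - x2) @ P @ (x1 - x2))

# toy check: differences along the sensitive direction e1 are discounted
A = np.array([[1.0], [0.0], [0.0]])
d2 = fair_metric_from_subspace(A)
same_up_to_sensitive = d2(np.array([1.0, 0, 0]), np.array([-2.0, 0, 0]))  # ~ 0
genuinely_far = d2(np.array([0, 1.0, 0]), np.array([0, -1.0, 0]))          # ~ 4
```

Pairs that differ only inside $H$ are treated as identical by the metric, which is exactly the "similar individuals" notion the audit relies on.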

F.3 HYPERPARAMETERS AND TRAINING

For each model, 10 random train/test splits of the dataset are used, with 80% of the data used for training. All compared methods are adjusted to account for class imbalance during training.

F.3.1 BASELINE AND PROJECT

The Baseline is obtained by fitting a 2-layer fully connected neural network with 50 neurons in the hidden layer. It does not enforce any fairness constraints on the model. Project uses a similar architecture, except for a pre-processing layer that projects the sensitive subspace out of the features; Project is thus a simple, naive way to enforce fairness. Both models involve the same parameters: learning_rate (step size for the Adam optimizer), batch_size (mini-batch size at training time), and num_steps (number of training steps to be performed). We present the choice of hyperparameters in Table 2.

F.3.2 SENSR

Code for SenSR (Yurochkin et al., 2020) is provided with the submission, together with a demonstration of fitting the model in which the choice of hyperparameters is provided.

F.3.3 REDUCTION

We provide code for the Reductions (Agarwal et al., 2018) approach, along with a demonstration of fitting the reduction model with the choice of hyperparameters for this experiment. The code can also be found at https://github.com/fairlearn/fairlearn. We used the Equalized Odds fairness constraint (Hardt et al., 2016) with the constraint violation tolerance parameter set to 0.03.

F.4 LOWER BOUND AND TESTING

To calculate the lower bounds we use Algorithm 1. The loss $\ell$ is the logistic loss. The test data is provided as input, and the fair metric is also learnt from the test data. For each of the models we choose the regularizer $\lambda = 50$, the number of steps $T = 500$, and the step size $\eta_t = 0.01$.






The auditor aims to detect violations of individual fairness in the ML model. Recall Dwork et al. (2011)'s definition of individual fairness. Let $\mathcal X \subset \mathbb R^d$ be the input space, $\mathcal Y$ the output space, and $f: \mathcal X \to \mathcal Y$ an ML model to audit. The ML model $f$ is individually fair if
$$d_y(f(x_1), f(x_2)) \le L\, d_x(x_1, x_2) \quad \text{for all } x_1, x_2 \in \mathcal X,$$
where $d_x$ and $d_y$ are metrics on $\mathcal X$ and $\mathcal Y$ respectively and $L > 0$ is a Lipschitz constant.
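The Lipschitz condition can be probed empirically by taking the worst observed ratio $d_y(f(x_1), f(x_2))/d_x(x_1, x_2)$ over pairs of audit points. A minimal sketch with a toy model that ignores the sensitive coordinate (this crude pairwise probe is for intuition only; it is not the paper's gradient-flow attack):

```python
import numpy as np

def worst_case_ratio(f, d_x, d_y, X, eps=1e-12):
    """Largest observed d_y(f(x1), f(x2)) / d_x(x1, x2) over pairs of
    audit points -- an empirical probe of the Lipschitz condition."""
    worst = 0.0
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            denom = d_x(X[i], X[j])
            if denom > eps:                         # skip (near-)identical pairs
                worst = max(worst, d_y(f(X[i]), f(X[j])) / denom)
    return worst

# toy model that depends only on the non-sensitive coordinate x[1]
f = lambda x: 1.0 / (1.0 + np.exp(-x[1]))
d_x = lambda a, b: abs(a[1] - b[1])                 # fair metric discounts coordinate 0
d_y = lambda p, q: abs(p - q)
X = np.random.default_rng(2).normal(size=(30, 2))
L_hat = worst_case_ratio(f, d_x, d_y, X)
# the sigmoid is 1/4-Lipschitz, so the observed ratio cannot exceed 0.25
```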

Figure 1: Heatmaps of the test statistic $T_n$ (3.10) for a logistic regression classifier on a grid of coefficients $(\theta_1, \theta_2)$. An individually fair classifier corresponds to $\theta_1$ close to 0 and any $\theta_2$. The black line in each plot represents the null hypothesis rejection decision boundary $T_n > 1.25$. Blue represents the acceptance region, whereas red corresponds to regions of unfair coefficients. The true fair metric discounts any differences along the first coordinate (left); (center, right) are results with a misspecified fair metric, i.e. the discounting direction is rotated by 5° and 10° respectively.

Figure 2: T n as a function of stopping time T on Adult data (log-log scale)

Figure 3: Simulated data for the synthetic experiment. There are two groups, with group 1 the minority group. The features for the two groups differ mainly in the first coordinate, so the discounted movement direction is $(1, 0)^\top$.

In Figure 2 (results on Adult data over 10 experiment repetitions; note the log-log scale) we present an empirical study of the test statistic $T_n$ as a function of the stopping time $T$. We see that our test fails to reject SenSR, the classifier we found individually fair, for any value of $T$, verifying our Type I error guarantees in practice. Rejection of the unfair classifiers requires a sufficiently large $T$, supporting our recommendation for Type II error control in practice.


Table 2: Choice of hyperparameters (learning_rate, batch_size, num_steps) for Baseline and Project. The continuous features in Adult are: Age, fnlwgt, capital-gain, capital-loss, hours-per-week, and education-num. The categorical features are: work-class, education, marital-status, occupation, relationship, race, sex, native-country. The detailed descriptions can be found in Dua & Graff (2017).

Table 4 reports results on COMPAS data over 10 experiment repetitions. Our test fails to reject individual fairness of SenSR in all experiment repetitions. Examining the tradeoff between individual and group fairness, we see that both SenSR and Reductions improve all fairness metrics in comparison to the baseline. However, the improvement in individual fairness with Reductions is marginal. SenSR provides a sizeable improvement in gender AOD, but only a marginal improvement in race AOD.

ACKNOWLEDGEMENTS

This paper is based upon work supported by the National Science Foundation (NSF) under grants no. 1830247 and 1916271. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the NSF.

A ROBUSTNESS OF TEST STATISTIC TO THE CHOICE OF FAIR METRIC

Since the fair metric $d_x$ is picked by an expert's domain knowledge or in a data-driven way, the auditor may face fair metric misspecification. Fortunately, the test statistic (3.10) is robust to small changes in the fair metric. Let $d_1, d_2: \mathcal X \times \mathcal X \to \mathbb R_+$ be two different fair metrics on $\mathcal X$, and let $\Phi_1, \Phi_2: \mathcal X \times \mathcal Y \to \mathcal X$ be the unfair maps induced by $d_1$, $d_2$. We start by stating the following assumptions:

(A1) $\nabla_x\{\ell(f(x), y) - \lambda d_1^2(x, x_0)\}$ is $L$-Lipschitz in $x$ with respect to $\|\cdot\|_2$;
(A2) $\sup_{x, x' \in \mathcal X} \|x - x'\|_2 = D < \infty$;
(A3) $\ell(f(x), y)$ is $L_0$-Lipschitz in $x$ with respect to $\|\cdot\|_2$, and lower bounded by $c > 0$;
(A4) $\sup_{x, x' \in \mathcal X} \|\nabla_x d_1^2(x, x') - \nabla_x d_2^2(x, x')\|_2 \le \delta_d D$.

Assumption A1 is always assumed for the existence and uniqueness of the ODE solution. Assumption A2, that the feature space $\mathcal X$ is bounded, and the first part of Assumption A3 are standard in the DRO literature. The second part of Assumption A3 avoids singularities in computing the loss ratio. Assumption A4 is worth discussing: the constant $\delta_d$ in A4 characterizes the level of fair metric misspecification. Moreover, Assumption A4 is mild under Assumption A2. For example, let $d_1^2(x, x') = (x - x')^\top \Sigma_{\mathrm{exact}}(x - x')$ and $d_2^2(x, x') = (x - x')^\top \Sigma_{\mathrm{mis}}(x - x')$ be the exact and misspecified fair metrics respectively. Then
$$\sup_{x, x' \in \mathcal X} \|\nabla_x d_1^2(x, x') - \nabla_x d_2^2(x, x')\|_2 \le 2\,\|\Sigma_{\mathrm{exact}} - \Sigma_{\mathrm{mis}}\|_{\mathrm{op}}\, D,$$
so A4 holds with $\delta_d = 2\|\Sigma_{\mathrm{exact}} - \Sigma_{\mathrm{mis}}\|_{\mathrm{op}}$. The level of fair metric misspecification $\delta_d$ vanishes as long as $\Sigma_{\mathrm{mis}}$ estimates $\Sigma_{\mathrm{exact}}$ consistently.

Theorem A.1 (Robustness of test statistic). Suppose that the support of the data distribution $P$ is such that, for any $(x_0, y_0) \in \operatorname{supp}(P)$, the solution path of the ODE (2.6) corresponding to $(x_0, y_0)$ and $d_1$ (or $d_2$) lies in $\mathcal X$. Under Assumptions A1-A4, we have
$$\Big|\frac{\ell(f(\Phi_1(x_0, y_0)), y_0)}{\ell(f(x_0), y_0)} - \frac{\ell(f(\Phi_2(x_0, y_0)), y_0)}{\ell(f(x_0), y_0)}\Big| \le \sqrt{\frac{\lambda \delta_d}{L}}\, \frac{L_0 D e^{LT}}{c}$$
for any $(x_0, y_0) \in \operatorname{supp}(P)$.

The first assumption in Theorem A.1 is mild since we always perform early termination in the gradient flow attack. In the literature, the fair metric can be learned from an additional dataset, different from the training, test, and audit datasets. In that case the constant $\delta_d$, which characterizes the goodness-of-fit of the estimated fair metric to the exact fair metric, shrinks to 0 as $n \to \infty$. Theorem A.1 then provides two key insights. First, as long as $\delta_d$ tends to 0, we ultimately test the same null hypothesis, since the worst-case loss ratios under $d_1$ and $d_2$ coincide in the limit. Second, the error in the test statistic induced by the misspecification of the fair metric is negligible as long as $\delta_d$ is sufficiently small.

Proof of Corollary C.2. The argument mirrors the proof of Corollary 3.3: by the asymptotic distribution given by Theorem C.1, $P(\tilde T_n > \delta)$ converges to a limit at most $\alpha$ when the population error-rates ratio is at most $\delta$, and to 1 when it exceeds $\delta$, as $n \to \infty$.

D IMPLEMENTATION OF THE PROPOSED TEST

Algorithm 1 provides a step-by-step procedure for calculating the lower bound. For a choice of $\delta$ (the threshold in the null hypothesis; see equation 3.3), at significance level 0.05, we reject the null hypothesis if the lower bound is greater than $\delta$.

Algorithm 1 Individual fairness testing
Input: ML model $f$; loss $\ell$; data $\{(X_i, Y_i)\}_{i=1}^n$; fair distance $d_x$; regularization parameter $\lambda$; number of steps $T$; step sizes $\{\eta_t\}_{t=1}^T$
Require: $f$ provides class probabilities; $\eta_t$ is decreasing
for $i = 1, \dots, n$ do
    compute $\Phi(X_i, Y_i)$ by running the gradient flow attack (2.6) for $T$ steps with step sizes $\{\eta_t\}$
    $r_i \leftarrow \ell(f(\Phi(X_i, Y_i)), Y_i)\, /\, \ell(f(X_i), Y_i)$
end for
Output: lower bound $= \operatorname{mean}(r) - \frac{1.645}{\sqrt n} \operatorname{std}(r)$
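Algorithm 1 can be sketched in a few dozen lines. The attack loop below uses an explicit gradient function and a toy linear loss with a projection-based fair penalty; these toy choices (`c`, `P`, the decreasing step schedule) are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from statistics import NormalDist

def gradient_flow_attack(grad_obj, x0, T=100, step=lambda t: 0.01 / t ** (2 / 3)):
    """Discretized gradient flow attack: ascend the penalized objective
    loss(f(x), y) - lambda * d_x^2(x, x0) with decreasing step sizes.
    grad_obj(x, x0) returns the gradient of that objective."""
    x = x0.copy()
    for t in range(1, T + 1):
        x = x + step(t) * grad_obj(x, x0)
    return x

def fairness_lower_bound(loss_ratios, alpha=0.05):
    """Algorithm 1 output: one-sided lower confidence bound for the mean
    loss ratio; reject H0 (ratio <= delta) when it exceeds delta."""
    r = np.asarray(loss_ratios, float)
    z = NormalDist().inv_cdf(1 - alpha)            # 1.645 at alpha = 0.05
    return r.mean() - z * r.std(ddof=1) / np.sqrt(len(r))

# toy attack: linear loss gradient c, fair metric discounting the first coordinate
lam = 50.0
c = np.array([1.0, 1.0])                           # gradient of the (toy) loss
P = np.diag([0.0, 1.0])                            # penalizes only non-discounted movement
grad_obj = lambda x, x0: c - 2 * lam * (P @ (x - x0))
x_adv = gradient_flow_attack(grad_obj, np.zeros(2))
# the attack moves freely along the discounted direction, barely along the other

lb = fairness_lower_bound(1.5 + np.random.default_rng(3).normal(0, 0.1, 200))
# lb > 1.25 would reject the null hypothesis of individual fairness at delta = 1.25
```

The structure mirrors the pseudocode: an inner attack per audit point, a per-point loss ratio, and a normal-approximation lower bound on the mean ratio.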

E SUPPLEMENTARY DETAILS FOR SIMULATIONS

Here we provide further details for the experiment with simulated data.

E.1 DATA

We considered one group variable $G$ with two labels. The 2-dimensional features were generated so that the two groups differ mainly in the first coordinate; the detailed generating model is given in (E.1).

Published as a conference paper at ICLR 2021

Performance. Let $C$ be the set of classes, and let $Y$ and $\hat Y$ be the observed and predicted labels for a data point, respectively. The balanced accuracy is defined as
$$\mathrm{BA} = \frac{1}{|C|} \sum_{c \in C} P(\hat Y = c \mid Y = c).$$

Group fairness. Let $G$ be the protected attribute taking values in $\{0, 1\}$. The average odds difference (AOD) (Bellamy et al., 2018) for group $G$ is defined as
$$\mathrm{AOD} = \tfrac12\big[(\mathrm{FPR}_{G=1} - \mathrm{FPR}_{G=0}) + (\mathrm{TPR}_{G=1} - \mathrm{TPR}_{G=0})\big].$$

In Table 3 we present extended results of the Adult experiment with standard deviations computed from 10 experiment repetitions.
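The two evaluation metrics above are easy to compute from predictions; a minimal sketch on a hand-made toy batch (the arrays are illustrative, and the AOD sign convention here is group 1 minus group 0):

```python
import numpy as np

def balanced_accuracy(y, yhat):
    """Mean of the per-class recalls over the set of classes C."""
    classes = np.unique(y)
    return float(np.mean([np.mean(yhat[y == c] == c) for c in classes]))

def average_odds_difference(y, yhat, g):
    """AOD: average of the TPR and FPR gaps between groups g = 0 and g = 1."""
    def rates(mask):
        tpr = np.mean(yhat[mask & (y == 1)] == 1)
        fpr = np.mean(yhat[mask & (y == 0)] == 1)
        return tpr, fpr
    (tpr0, fpr0), (tpr1, fpr1) = rates(g == 0), rates(g == 1)
    return 0.5 * ((fpr1 - fpr0) + (tpr1 - tpr0))

y    = np.array([1, 1, 0, 0, 1, 1, 0, 0])
yhat = np.array([1, 0, 0, 0, 1, 1, 1, 0])
g    = np.array([0, 0, 0, 0, 1, 1, 1, 1])
ba  = balanced_accuracy(y, yhat)            # 0.75: recall is 3/4 for each class
aod = average_odds_difference(y, yhat, g)   # 0.5: group 1 gets more positives
```

An AOD near 0 indicates parity of both error rates across groups; large absolute values indicate group-fairness violations.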

G COMPAS EXPERIMENT

In the COMPAS recidivism prediction dataset (Larson et al., 2016) the task is to predict whether a criminal defendant will recidivate within two years. We consider race (Caucasian or not-Caucasian) and sex (binary) as the sensitive attributes. The features in COMPAS are: sex, race, priors_count, age_cat = 25 to 45, age_cat = Greater than 45, age_cat = Less than 25, and c_charge_degree = F. priors_count is standardized.

As before, we perform testing on four classifiers: the baseline NN, the group fairness Reductions (Agarwal et al., 2018) algorithm, the individual fairness SenSR (Yurochkin et al., 2020) algorithm, and a basic Project algorithm that pre-processes the data by projecting out the "sensitive" subspace. Baseline and Project have the same architecture and parameters as in the experiment with the Adult dataset. For the SenSR fair metric and for Project we use the train data to learn the "sensitive" subspace. Further details on the choice of parameters are provided in the code. For Reductions we used the Equalized Odds fairness constraint (Hardt et al., 2016) with the constraint violation tolerance parameter set to 0.16.

All methods are trained to account for class imbalance in the data, and we report test balanced accuracy as the performance measure. Results of the 10 experiment repetitions are summarized in Table 4. We compare group fairness using the average odds difference (AOD) (Bellamy et al., 2018) for gender and race. The significance level for null hypothesis rejection is 0.05 and $\delta = 1.25$.

The Baseline exhibits clear violations of both individual fairness (the test rejects with proportion 1) and group fairness (both AODs have large absolute magnitude). The Reductions method achieves significant group fairness improvements, but is individually unfair. Simple pre-processing is more effective here than in the Adult experiment, with a rejection proportion of 0.2. SenSR is the most effective at improving individual fairness.

