STATISTICAL INFERENCE FOR INDIVIDUAL FAIRNESS

Abstract

As we rely on machine learning (ML) models to make more consequential decisions, the issue of ML models perpetuating or even exacerbating undesirable historical biases (e.g. gender and racial biases) has come to the fore of public attention. In this paper, we focus on the problem of detecting violations of individual fairness in ML models. We formalize the problem as measuring the susceptibility of ML models to a form of adversarial attack and develop a suite of inference tools for the adversarial cost function. The tools allow auditors to assess the individual fairness of ML models in a statistically principled way: form confidence intervals for the worst-case performance differential between similar individuals and test hypotheses of model fairness with (asymptotic) non-coverage/Type I error rate control. We demonstrate the utility of our tools in a real-world case study.

1. INTRODUCTION

The problem of bias in machine learning systems is at the forefront of contemporary ML research. Numerous media outlets have scrutinized machine learning systems deployed in practice for violations of basic societal equality principles (Angwin et al., 2016; Dastin, 2018; Vigdor, 2019). In response, researchers have developed many formal definitions of algorithmic fairness along with algorithms for enforcing these definitions in ML models (Dwork et al., 2011; Hardt et al., 2016; Berk et al., 2017; Kusner et al., 2018; Ritov et al., 2017; Yurochkin et al., 2020). Despite this flurry of ML fairness research, the basic question of assessing the fairness of a given ML model in a statistically principled way remains largely unexplored.

In this paper we propose a statistically principled approach to assessing the individual fairness (Dwork et al., 2011) of ML models. One of the main benefits of our approach is that it allows the investigator to calibrate the method, i.e. to prescribe a Type I error rate. Passing a test with a guaranteed small Type I error rate is the usual standard of proof in scientific investigations because it guarantees that the results are reproducible (to a certain degree). This is also highly desirable when detecting bias in ML models because it allows us to certify whether an ML model will behave fairly at test time. Our method for auditing ML models abides by this standard.

There are two main challenges in developing a hypothesis test for individual fairness. First, how do we formalize the notion of individual fairness in an interpretable null hypothesis? Second, how do we devise a test statistic and calibrate it so that auditors can control the Type I error rate? In this paper we propose a test motivated by the relation between individual fairness and adversarial robustness (Yurochkin et al., 2020). At a high level, our approach consists of two parts:

1. Generating unfair examples: by an unfair example we mean an example that is similar to a training example but treated differently by the ML model. Such examples are similar to adversarial examples (Goodfellow et al., 2014), except they are only allowed to differ from a training example in certain protected or sensitive ways; a sketch of this generation step is given below.
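To make the idea concrete, the following is a minimal sketch (not the authors' implementation) of how an unfair example might be generated: projected gradient ascent on the model's loss, where the perturbation is restricted to a protected/sensitive subspace so that the perturbed point remains "similar" to the original. The names model, loss_fn, and sensitive_basis are hypothetical; sensitive_basis is assumed to have orthonormal columns spanning the sensitive directions (e.g. learned from data), and the admissible perturbations are assumed to form a Euclidean ball in that subspace.

import torch

def generate_unfair_example(model, loss_fn, x, y, sensitive_basis,
                            step_size=0.1, n_steps=20, radius=1.0):
    """Gradient ascent on the loss, restricted to the sensitive subspace.

    x: feature vector of shape (d,); y: its label.
    sensitive_basis: (d, k) matrix with orthonormal columns spanning the
    protected/sensitive directions (an assumption for this sketch).
    """
    delta = torch.zeros_like(x, requires_grad=True)
    # Projector onto the sensitive subspace (valid when columns are orthonormal).
    proj = sensitive_basis @ sensitive_basis.T
    for _ in range(n_steps):
        loss = loss_fn(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            # Move only along sensitive directions to increase the loss.
            delta += step_size * (grad @ proj)
            # Keep the perturbation small so the example stays "similar".
            norm = delta.norm()
            if norm > radius:
                delta *= radius / norm
    return (x + delta).detach()

An auditor could then compare the model's behavior on x and on the returned point: a large performance gap between two points that differ only in sensitive directions is evidence of an individual fairness violation, which is the quantity our inference tools summarize and test.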

