EXPLAINABILITY FOR FAIR MACHINE LEARNING

Abstract

As the decisions made or influenced by machine learning models increasingly impact our lives, it is crucial to detect, understand, and mitigate unfairness. But even simply determining what "unfairness" should mean in a given context is non-trivial: there are many competing definitions, and choosing between them often requires a deep understanding of the underlying task. It is thus tempting to use model explainability to gain insights into model fairness; however, existing explainability tools do not reliably indicate whether a model is indeed fair. In this work we present a new approach to explaining fairness in machine learning, based on the Shapley value paradigm. Our fairness explanations attribute a model's overall unfairness to individual input features, even in cases where the model does not operate on sensitive attributes directly. Moreover, motivated by the linearity of Shapley explainability, we propose a meta algorithm for applying existing training-time fairness interventions, wherein one trains a perturbation to the original model, rather than a new model entirely. By explaining the original model, the perturbation, and the fair-corrected model, we gain insight into the accuracy-fairness trade-off that is being made by the intervention. We further show that this meta algorithm enjoys both flexibility and stability benefits with no loss in performance.

1. INTRODUCTION

Machine learning has repeatedly demonstrated astonishing predictive power due to its capacity to learn complex relationships from data. However, it is well known that machine learning models risk perpetuating or even exacerbating unfair biases learnt from historical data (Barocas & Selbst, 2016; Bolukbasi et al., 2016; Caliskan et al., 2017; Lum & Isaac, 2016). As such models are increasingly used for decisions that impact our lives, we are compelled to ensure those decisions are made fairly. In the pursuit of training a fair model, one encounters the immediate challenge of how fairness should be defined. There exist a wide variety of definitions of fairness - some based on statistical measures, others on causal reasoning, some imposing constraints on group outcomes, others at the individual level - and each notion is often incompatible with its alternatives (Berk et al., 2018; Corbett-Davies et al., 2017; Kleinberg et al., 2017; Lipton et al., 2018; Pleiss et al., 2017). Deciding which measure of fairness to impose thus requires extensive contextual understanding and domain knowledge. Further still, one should understand the downstream consequences of a fairness intervention before imposing it on the model's decisions (Hu et al., 2019; Liu et al., 2018). To help understand whether a model is making fair decisions, and choose an appropriate notion of fairness, one might be tempted to turn to model explainability techniques. Unfortunately, it has been shown that many standard explanation methods can be manipulated to suppress the reported importance of the protected attribute without substantially changing the output of the model (Dimanov et al., 2020). Consequently such explanations are poorly suited to assessing or quantifying unfairness.
In this work, we introduce new explainability methods for fairness based on the Shapley value framework for model explainability (Datta et al., 2016; Štrumbelj & Kononenko, 2010; Lipovetsky & Conklin, 2001; Lundberg & Lee, 2017; Štrumbelj & Kononenko, 2014). We consider a broad set of widely applied group-fairness criteria and propose a unified approach to explaining unfairness within any one of them. This set of fairness criteria includes demographic parity, equalised odds, equal opportunity, and conditional demographic parity (see Sec. 2.1). We show that for each of these definitions it is possible to choose Shapley value functions which capture the overall unfairness in the model, and attribute it to individual features. We also show that because the fairness Shapley values collectively must sum to the chosen fairness metric, we cannot hide unfairness by manipulating the explanations of individual features, thereby overcoming the problems with accuracy-based explanations observed by Dimanov et al. (2020). Motivated by the attractive linearity properties of Shapley value explanations, we also introduce a meta algorithm for training a fair model. Rather than learning a fair model directly, we propose instead learning an additive correction to an existing unfair model. We use training-time fairness algorithms to train the correction, thereby ensuring the corrected model is fair. We show that this approach gives new perspectives helpful for understanding fairness, benefits from greater flexibility due to model-agnosticism, and enjoys improved stability, all while maintaining the performance of the chosen training-time algorithm.
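To make the additive-correction idea concrete, the following toy sketch equalises the group means of an unfair model's scores by adding a group-dependent offset. This is the crudest possible demographic-parity intervention, shown purely for illustration; the functions f and g, the offset rule, and the data-generating process are all our own assumptions, not the algorithm used in this paper, where g would be learnt by a training-time fairness algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
a = rng.integers(0, 2, n)            # protected attribute
x = a + rng.normal(size=n)           # feature acting as a proxy for a

def f(x):
    # Original (unfair) model: its score leans entirely on the proxy feature.
    return x

scores = f(x)

# Additive correction g: shift each group's scores so their means coincide.
# (Hypothetical closed-form correction; real training-time interventions
# would learn g by optimising a fairness penalty.)
mu0, mu1 = scores[a == 0].mean(), scores[a == 1].mean()

def g(a):
    return np.where(a == 1, -(mu1 - mu0) / 2, (mu1 - mu0) / 2)

corrected = scores + g(a)            # the fair-corrected model f + g

gap_before = abs(scores[a == 1].mean() - scores[a == 0].mean())
gap_after = abs(corrected[a == 1].mean() - corrected[a == 0].mean())
```

Because the correction is additive and Shapley explanations are linear, the attributions of f + g decompose into those of f plus those of g, which is what lets one inspect the original model, the perturbation, and the corrected model separately.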

2. EXPLAINABLE FAIRNESS

In this section we give an overview of the Shapley value paradigm for machine learning explainability, and show how it can be adapted to explain fairness. Motivated by the axiomatic properties of Shapley values, we also introduce a meta algorithm for applying training-time fairness algorithms to a perturbation rather than a fresh model, giving us multiple perspectives on fairness.

2.1. BACKGROUND AND NOTATION

We consider fairness in the context of supervised classification, where the data consists of triples (x, a, y): x ∈ X are the features, a ∈ A is a protected attribute (e.g. sex or race), and y ∈ Y is the target. We allow, but do not require, a to be a component of x. The task is to train a model f to predict y from x while avoiding unfair discrimination with respect to a. We assume A and Y are both finite, discrete sets. Our fairness explanations apply to any definition that can be formulated as (conditional) independence of the model output and the protected attribute. This includes demographic parity (Calders et al., 2009; Feldman et al., 2015; Kamiran & Calders, 2012; Zafar et al., 2017), conditional demographic parity (Corbett-Davies et al., 2017), and equalised odds and equal opportunity (Hardt et al., 2016).

Definition 1. DEMOGRAPHIC PARITY. The model f satisfies demographic parity if f(x) is independent of a, or equivalently P(f(x) = ỹ | a) = P(f(x) = ỹ) for all ỹ ∈ Y and a ∈ A.

Definition 2. CONDITIONAL DEMOGRAPHIC PARITY. The model f satisfies conditional demographic parity with respect to a set of legitimate risk factors {v_1, . . . , v_n} if f(x) is independent of a conditional on the v_i, or equivalently P(f(x) = ỹ | a, v_1, . . . , v_n) = P(f(x) = ỹ | v_1, . . . , v_n) for all ỹ ∈ Y and a ∈ A.

Definition 3. EQUALISED ODDS. The model f satisfies equalised odds if f(x) is independent of a conditional on y, or equivalently P(f(x) = ỹ | a, y) = P(f(x) = ỹ | y) for all ỹ, y ∈ Y and a ∈ A.

If Y = {0, 1} is binary, then equalised odds implies that the true and false positive rates on each protected group must agree. Furthermore, assuming that y = 1 corresponds to the "privileged outcome", we can define equal opportunity as follows.

Definition 4. EQUAL OPPORTUNITY. The model f satisfies equal opportunity if f(x) is independent of a conditional on y = 1, or equivalently P(f(x) = ỹ | a, y = 1) = P(f(x) = ỹ | y = 1) for all ỹ ∈ Y and a ∈ A.
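Each of the group-fairness criteria above can be measured empirically as a gap in prediction rates between protected groups. A minimal sketch for binary predictions and a binary protected attribute follows; the function names and the max-over-labels aggregation for equalised odds are our own conventions, not notation from this paper.

```python
import numpy as np

def demographic_parity_gap(y_pred, a):
    """|P(f(x)=1 | a=1) - P(f(x)=1 | a=0)| for binary predictions and binary a."""
    return abs(y_pred[a == 1].mean() - y_pred[a == 0].mean())

def equalised_odds_gap(y_pred, y_true, a):
    """Worst demographic-parity gap within each true-label slice y in {0, 1}."""
    return max(
        demographic_parity_gap(y_pred[y_true == y], a[y_true == y])
        for y in (0, 1)
    )

def equal_opportunity_gap(y_pred, y_true, a):
    """Parity gap restricted to the privileged outcome y=1 (a TPR difference)."""
    return demographic_parity_gap(y_pred[y_true == 1], a[y_true == 1])
```

Each criterion is satisfied exactly when its gap is zero; in practice one monitors or penalises these empirical gaps on held-out data.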

2.2. ADAPTING EXPLAINABILITY TO FAIRNESS

Fairness in decision making - automated or not - is a subtle topic. Choosing an appropriate definition of fairness requires both context and domain knowledge. In seeking to improve our understanding of the problem, we might be tempted to use model explainability methods. However, Dimanov et al. (2020) show that such methods are poorly suited for understanding fairness. In particular we should not try to quantify unfairness by looking at the feature importance of the protected attribute, as such measures can be easily manipulated. Part of the problem is that most explainability methods attempt to determine which features are important contributors to the model's accuracy. We seek to introduce explanations that instead determine which features contributed to unfairness in the model. Toward this end, we work within the Shapley value paradigm, which is widely used as a model-agnostic and theoretically principled approach to model explainability (Datta et al., 2016; Štrumbelj & Kononenko, 2010; Lipovetsky & Conklin, 2001; Lundberg & Lee, 2017; Štrumbelj & Kononenko, 2014). We will first review the application of Shapley values to explaining model accuracy, then show how this can be adapted to explaining model unfairness. See Frye et al. (2020b) for a detailed analysis of the axiomatic foundations of Shapley values in the context of model explainability.
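To ground the Shapley machinery, here is an exact (exponential-time) Shapley computation over an arbitrary coalition value function v(S). For accuracy explanations, v(S) would measure model performance using only the features in S; for fairness explanations, v(S) would instead measure unfairness. The value function below is a toy with a known answer, and the whole sketch is illustrative, not the estimator used in practice, where the sum over subsets is approximated by sampling.

```python
import itertools
import math

def shapley_values(value, n_features):
    """Exact Shapley values phi_i for a coalition value function value(S),
    where S is a frozenset of feature indices.  Exponential in n_features."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for r in range(len(others) + 1):
            for subset in itertools.combinations(others, r):
                S = frozenset(subset)
                # Shapley weight: |S|! (n - |S| - 1)! / n!
                w = (math.factorial(len(S))
                     * math.factorial(n_features - len(S) - 1)
                     / math.factorial(n_features))
                phi[i] += w * (value(S | {i}) - value(S))
    return phi

# Toy value function: each feature contributes 1, and features 0 and 1
# jointly contribute one extra unit when both are present.
def v(S):
    return len(S) + (1 if {0, 1} <= S else 0)

phi = shapley_values(v, 3)
```

By the efficiency axiom the attributions sum to v(N) - v(∅). This is precisely the property exploited for fairness: when v measures unfairness, the per-feature values must add up to the chosen fairness metric, so no individual explanation can be manipulated to hide overall unfairness.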

