EXPLAINABILITY FOR FAIR MACHINE LEARNING

Abstract

As the decisions made or influenced by machine learning models increasingly impact our lives, it is crucial to detect, understand, and mitigate unfairness. But even simply determining what "unfairness" should mean in a given context is non-trivial: there are many competing definitions, and choosing between them often requires a deep understanding of the underlying task. It is thus tempting to use model explainability to gain insights into model fairness; however, existing explainability tools do not reliably indicate whether a model is indeed fair. In this work we present a new approach to explaining fairness in machine learning, based on the Shapley value paradigm. Our fairness explanations attribute a model's overall unfairness to individual input features, even in cases where the model does not operate on sensitive attributes directly. Moreover, motivated by the linearity of Shapley explainability, we propose a meta-algorithm for applying existing training-time fairness interventions, wherein one trains a perturbation to the original model rather than an entirely new model. By explaining the original model, the perturbation, and the fair-corrected model, we gain insight into the accuracy-fairness trade-off being made by the intervention. We further show that this meta-algorithm enjoys both flexibility and stability benefits with no loss in performance.

1. INTRODUCTION

Machine learning has repeatedly demonstrated astonishing predictive power due to its capacity to learn complex relationships from data. However, it is well known that machine learning models risk perpetuating or even exacerbating unfair biases learnt from historical data (Barocas & Selbst, 2016; Bolukbasi et al., 2016; Caliskan et al., 2017; Lum & Isaac, 2016). As such models are increasingly used for decisions that impact our lives, we are compelled to ensure those decisions are made fairly.

In the pursuit of training a fair model, one encounters the immediate challenge of how fairness should be defined. There exist a wide variety of definitions of fairness (some based on statistical measures, others on causal reasoning; some imposing constraints on group outcomes, others at the individual level), and each notion is often incompatible with its alternatives (Berk et al., 2018; Corbett-Davies et al., 2017; Kleinberg et al., 2017; Lipton et al., 2018; Pleiss et al., 2017). Deciding which measure of fairness to impose thus requires extensive contextual understanding and domain knowledge. Further still, one should understand the downstream consequences of a fairness intervention before imposing it on the model's decisions (Hu et al., 2019; Liu et al., 2018).

To help understand whether a model is making fair decisions, and to choose an appropriate notion of fairness, one might be tempted to turn to model explainability techniques. Unfortunately, it has been shown that many standard explanation methods can be manipulated to suppress the reported importance of the protected attribute without substantially changing the output of the model (Dimanov et al., 2020). Consequently, such explanations are poorly suited to assessing or quantifying unfairness.
In this work, we introduce new explainability methods for fairness based on the Shapley value framework for model explainability (Datta et al., 2016; Štrumbelj & Kononenko, 2010; Lipovetsky & Conklin, 2001; Lundberg & Lee, 2017; Štrumbelj & Kononenko, 2014). We consider a broad set of widely applied group-fairness criteria and propose a unified approach to explaining unfairness under any one of them. This set of fairness criteria includes demographic parity, equalised odds, equal opportunity, and conditional demographic parity (see Sec. 2.1). We show that for each of these definitions it is possible to choose Shapley value functions which capture the overall unfairness in the model and attribute it to individual features. We also show that, because the fairness Shapley values must collectively sum to the chosen fairness metric, unfairness cannot be hidden by manipulating the explanations of individual features, thereby overcoming the problems with accuracy-based explanations observed by Dimanov et al. (2020).
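To make the efficiency property concrete, the following sketch computes exact Shapley attributions of a demographic-parity gap on a small synthetic example. The dataset, the fixed linear scorer, and the mean-imputation baseline used to marginalise out-of-coalition features are illustrative assumptions, not the paper's exact construction; the point is only that the per-feature attributions sum exactly to the model's overall parity gap, so no single feature's reported unfairness can be suppressed without the discrepancy appearing elsewhere.

```python
import itertools
from math import factorial

import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Synthetic data: binary sensitive attribute a; x1 is a proxy for a.
a = rng.integers(0, 2, n)
x1 = a + rng.normal(0.0, 0.5, n)   # correlated with the sensitive attribute
x2 = rng.normal(0.0, 1.0, n)       # independent feature
x3 = rng.normal(0.0, 1.0, n)       # independent feature
X = np.column_stack([x1, x2, x3])

def model(X):
    # Fixed linear scorer; note it never sees `a` directly.
    return X @ np.array([1.0, 0.5, 0.25])

def dp_gap(scores):
    # Demographic parity gap: difference in mean score between the two groups.
    return scores[a == 1].mean() - scores[a == 0].mean()

def value(S):
    # v(S): parity gap when features outside coalition S are marginalised
    # by mean imputation (a simple interventional baseline, assumed here).
    Xs = np.tile(X.mean(axis=0), (n, 1))
    Xs[:, list(S)] = X[:, list(S)]
    return dp_gap(model(Xs))

# Exact Shapley values over all 2^d coalitions (d is small here).
d = X.shape[1]
phi = np.zeros(d)
for j in range(d):
    others = [k for k in range(d) if k != j]
    for r in range(d):
        for S in itertools.combinations(others, r):
            w = factorial(len(S)) * factorial(d - len(S) - 1) / factorial(d)
            phi[j] += w * (value(S + (j,)) - value(S))

print("per-feature unfairness attributions:", np.round(phi, 4))
print("sum of attributions:", round(float(phi.sum()), 4))
print("model's overall DP gap:", round(float(dp_gap(model(X))), 4))
```

Because v(empty coalition) is the gap of a constant predictor (zero), the attributions sum to the full model's demographic parity gap, and the proxy feature x1 receives the bulk of the blame even though the model never consumes `a` directly.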

