EVALUATING VISUAL COUNTERFACTUAL EXPLAINERS

Abstract

Explainability methods have been widely used to provide insight into the decisions made by statistical models, thus facilitating their adoption in various domains within the industry. Counterfactual explanation methods aim to improve our understanding of a model by perturbing samples in a way that changes the model's response. This information helps users and machine learning practitioners understand and improve their models. Given the value provided by counterfactual explanations, there is a growing interest in the research community to investigate and propose new methods. However, we identify two issues that could hinder progress in this field. (1) Existing metrics do not accurately reflect the value of an explainability method for the users. (2) Comparisons between methods are usually performed with datasets like CelebA, where images are annotated with attributes that do not fully describe them and with subjective attributes such as "Attractive". In this work, we address these problems by proposing an evaluation method with a principled metric to evaluate and compare different counterfactual explanation methods. The evaluation is based on a synthetic dataset where images are fully described by their annotated attributes. As a result, we are able to perform a fair comparison of multiple explainability methods in the recent literature, obtaining insights about their performance. We make the code and data public to the research community.

1. INTRODUCTION

The popularity of deep learning methods is a testament to their effectiveness across a multitude of tasks in different domains. This effectiveness has led to their widespread industrial adoption (e.g., self-driving cars, screening systems, healthcare, etc.), where the need to explain a model's decision becomes paramount. However, due to the high level of complexity of deep learning models, it is difficult to understand their decision-making process (Burkart & Huber, 2021). This opacity has slowed down the adoption of these systems in critical domains. Hence, in order to ensure algorithmic fairness in deep learning and to identify potential biases in training data and models, it is key to explore the reasoning behind their decisions (Buhrmester et al., 2021). In an attempt to convincingly tackle the why question, there has been a surge of work in the field of explainability for machine learning models (Joshi et al., 2018; Mothilal et al., 2020; Rodríguez et al., 2021). The goal of this field is to provide explanations for the decisions of a classifier, which often come in the form of counterfactuals. These counterfactuals provide insight into why the output of an algorithm is what it is and how it could be changed (Goyal et al., 2019). In essence, a counterfactual explanation answers the question: "For situation X, why was the outcome Y and not Z?", describing what changes in a situation would have produced a different decision. Ideally, counterfactual methods produce explanations that are interpretable by humans while reflecting the factors that influence the decisions of a model (Mothilal et al., 2020). Thus, given an input sample and a model, a counterfactual explainer perturbs certain attributes of the sample, producing a counterexample, i.e., a counterfactual, that shifts the model's prediction, thus revealing which semantic attributes the model is sensitive to.
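The perturbation process described above can be sketched in a minimal form. The snippet below is a hypothetical illustration, not the method of any cited paper: it finds a small perturbation of an input that flips a simple logistic classifier's prediction, with an L2 proximity term keeping the counterfactual close to the original sample. The classifier, its parameters, and all hyperparameters are stand-ins chosen for illustration.

```python
import numpy as np

def counterfactual_search(x, w, b, target=1, lam=0.1, lr=0.5, steps=200):
    """Toy counterfactual search for a logistic classifier p = sigma(w.x + b).

    Gradient descent on a loss combining (a) the classifier's cross-entropy
    toward the desired `target` class and (b) an L2 proximity term that keeps
    the counterfactual x_cf close to the original input x.
    """
    x_cf = x.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w @ x_cf + b)))      # model prediction
        # gradient of BCE(target, p) w.r.t. x_cf, plus proximity gradient
        grad = (p - target) * w + lam * (x_cf - x)
        x_cf -= lr * grad
    return x_cf

# Example: an input initially classified as class 0 is nudged until the
# classifier predicts class 1, while staying near the original point.
w, b = np.array([1.0, -1.0]), 0.0
x = np.array([-1.0, 1.0])                              # sigma(w.x + b) < 0.5
x_cf = counterfactual_search(x, w, b, target=1)
```

The proximity weight `lam` controls the trade-off between flipping the prediction and staying close to the original sample, which is exactly the tension that evaluation metrics for counterfactual explainers try to capture.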
In this work, we focus on the image domain given the recent surge of explainability methods for image classifiers (Joshi et al., 2018; Rodríguez et al., 2021; Singla et al., 2019; Chang et al., 2018). A particular challenge of the image domain is that changes in the pixel space are difficult to interpret and resemble adversarial attacks (Goodfellow et al., 2014b) (see Figure 4 for an example), so current explainers tend to search for counterfactuals in a latent space produced by a generative model.
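The latent-space search mentioned above can be sketched by restricting the optimization to a decoder's input rather than to raw pixels. The example below is a hedged illustration with a linear stand-in decoder `x = D @ z` (real explainers use deep generative models); perturbing only `z` keeps the decoded counterfactual on the decoder's manifold of plausible images instead of producing adversarial pixel noise.

```python
import numpy as np

def latent_counterfactual(z, D, w, b, target=1, lam=0.05, lr=0.5, steps=300):
    """Toy latent-space counterfactual search.

    A linear "decoder" x = D @ z stands in for a generative model; the
    classifier is p = sigma(w.x + b). We optimize the latent code z_cf so the
    decoded image flips the prediction, with an L2 term keeping z_cf near z.
    """
    z_cf = z.copy()
    for _ in range(steps):
        x = D @ z_cf                                    # decode the latent
        p = 1.0 / (1.0 + np.exp(-(w @ x + b)))          # classifier output
        # chain rule through the linear decoder, plus latent proximity
        grad_z = (p - target) * (D.T @ w) + lam * (z_cf - z)
        z_cf -= lr * grad_z
    return z_cf

# Example: a latent code whose decoding is classified as class 0 is moved
# until the decoded counterfactual is classified as class 1.
D = np.array([[2.0, 0.0], [0.0, 1.0]])
w, b = np.array([1.0, -1.0]), 0.0
z = np.array([-1.0, 1.0])                               # decodes to class 0
z_cf = latent_counterfactual(z, D, w, b, target=1)
```

Because the gradient is pushed through the decoder, every intermediate counterfactual is an output of the generative model, which is what makes latent-space counterfactuals interpretable where raw pixel perturbations are not.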

