EVALUATING VISUAL COUNTERFACTUAL EXPLAINERS

Abstract

Explainability methods have been widely used to provide insight into the decisions made by statistical models, thus facilitating their adoption in various domains within the industry. Counterfactual explanation methods aim to improve our understanding of a model by perturbing samples in ways that alter its response. This information is helpful for users and for machine learning practitioners to understand and improve their models. Given the value provided by counterfactual explanations, there is a growing interest in the research community to investigate and propose new methods. However, we identify two issues that could hinder progress in this field. (1) Existing metrics do not accurately reflect the value of an explainability method for users. (2) Comparisons between methods are usually performed on datasets like CelebA, where images are annotated with attributes that do not fully describe them and with subjective attributes such as "Attractive". In this work, we address these problems by proposing an evaluation method with a principled metric to evaluate and compare different counterfactual explanation methods. The evaluation relies on a synthetic dataset where images are fully described by their annotated attributes. As a result, we are able to perform a fair comparison of multiple explainability methods from the recent literature and obtain insights about their performance. We make the code and data public to the research community.

1. INTRODUCTION

The popularity of deep learning methods is a testament to their effectiveness across a multitude of tasks in different domains. This effectiveness has led to their widespread industrial adoption (e.g., self-driving cars, screening systems, healthcare), where the need to explain a model's decision becomes paramount. However, due to the high complexity of deep learning models, it is difficult to understand their decision-making process (Burkart & Huber, 2021). This opacity has slowed down the adoption of these systems in critical domains. Hence, in order to ensure algorithmic fairness in deep learning and to identify potential biases in training data and models, it is key to explore the reasoning behind their decisions (Buhrmester et al., 2021).

In an attempt to convincingly tackle the why question, there has been a surge of work in the field of explainability for machine learning models (Joshi et al., 2018; Mothilal et al., 2020; Rodríguez et al., 2021). The goal of this field is to provide explanations for the decisions of a classifier, which often come in the form of counterfactuals. These counterfactuals provide insight into why the output of an algorithm is not any different and how it could be changed (Goyal et al., 2019). In essence, a counterfactual explanation answers the question "for situation X, why was the outcome Y and not Z?", describing which changes to a situation would have produced a different decision. Ideally, counterfactual methods produce explanations that are interpretable by humans while reflecting the factors that influence the decisions of a model (Mothilal et al., 2020). Thus, given an input sample and a model, a counterfactual explainer perturbs certain attributes of the sample, producing a counterexample, i.e., a counterfactual, that shifts the model's prediction, thereby revealing which semantic attributes the model is sensitive to.

In this work, we focus on the image domain, given the recent surge of explainability methods for image classifiers (Joshi et al., 2018; Rodríguez et al., 2021; Singla et al., 2019; Chang et al., 2018). A particular challenge of the image domain is that changes in the pixel space are difficult to interpret and resemble adversarial attacks (Goodfellow et al., 2014b) (see Figure 4 for an example), so current explainers tend to search for counterfactuals in a latent space produced by, e.g., a variational autoencoder (VAE) (Kingma & Welling, 2013), or by conditioning on annotated attributes (Denton et al., 2019; Joshi et al., 2018; Singla et al., 2019; Rodríguez et al., 2021). Although explanations produced in the latent space are easier to interpret than those in the pixel space, they depend on the chosen or learned decomposition of the input into attributes or latent factors. In the case of VAEs, these factors could be misaligned with the real underlying generative process of the images. Moreover, different methods in the literature rely on different autoencoding architectures or generative models to infer semantic attributes from images, which makes their counterfactual search algorithms not directly comparable. In the case of annotations, since datasets do not provide access to the whole data-generating process, they tend to focus on arbitrary aspects of the input (such as facial attributes for CelebA (Liu et al., 2015b)), ignoring other aspects that could influence a classifier's decision boundaries, such as illumination, background color, or shadows.
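To make the latent-space search concrete, the following is a minimal sketch of a gradient-based counterfactual search through a VAE, in the spirit of the explainers cited above but not the exact algorithm of any of them. The names `vae`, `classifier`, and `target_class` are hypothetical placeholders, and the proximity weight is an arbitrary choice for illustration.

```python
# Minimal sketch: search for a counterfactual in a VAE latent space by
# gradient descent on the classifier loss for a desired target class.
import torch

def find_counterfactual(x, vae, classifier, target_class, steps=100, lr=0.05):
    """Perturb the latent code of `x` until `classifier` predicts `target_class`."""
    with torch.no_grad():
        z0 = vae.encode(x)              # latent code of the input sample
    z = z0.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        x_cf = vae.decode(z)            # candidate counterfactual image
        logits = classifier(x_cf)
        # Push the prediction toward the target class while penalizing
        # distance from the original latent code (proximity term).
        loss = torch.nn.functional.cross_entropy(
            logits, torch.tensor([target_class])
        ) + 0.1 * (z - z0).pow(2).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return vae.decode(z).detach()
```

Because the perturbation happens in latent space rather than pixel space, the resulting changes are (ideally) semantic rather than adversarial; as noted above, however, the quality of the explanation is bounded by how well the latent factors align with the true generative factors.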
This raises the need for evaluating explainers on a known set of attributes that represent the real generative factors of the input. In this work, we propose to fill this gap by introducing a new explainability benchmark based on a synthetic image dataset, where we model the whole data-generating process and samples are fully described by a controlled set of attributes. An additional challenge when evaluating explainers is that there is no consensus on the metric that should be used. While there has been some effort to provide a general metric to evaluate explainers (Mothilal et al., 2020; Rodríguez et al., 2021), most of the proposed metrics can be gamed to maximize the score of a given explainer without actually improving its quality for a user. For example, since current metrics reward producing many explanations, the score can be increased by (i) producing random samples that cannot be related to the ones being explained, which has motivated measuring the proximity of explanations (Mothilal et al., 2020); (ii) repeating the same explanation many times, which has motivated measuring diversity (Mothilal et al., 2020). However, we found that it is possible to maximize existing diversity measures by always applying the same perturbation to a sensitive attribute while applying random perturbations to the remaining attributes that describe a sample (see the sketch after this paragraph). As a result, although one counterfactual changing the sensitive attribute would suffice, an explainer can obtain a higher score by producing more redundant explanations. We argue that instead of providing many explanations, explainers should be designed to produce the minimal set of counterfactuals that represents each of the factors influencing a model's decision. Finally, the score can be increased by (iii) providing uninformative or trivial explanations (Rodríguez et al., 2021). This has motivated us to compare the model's predictions with those expected from an "oracle" classifier: model predictions that deviate from the expected value are more informative than those that behave as expected (see Appendix A.2).

In this work, we address these problems by proposing a fair way to evaluate and compare different counterfactual explanation methods. Our contributions can be summarized as follows: (i) we present a benchmark to evaluate counterfactuals generated by any explainer in a fair way (Section 3); (ii) we offer insights on why existing explainability methods have strong limitations, such as an ill-defined oracle (Section 3.3); (iii) we introduce a new set of metrics to evaluate the quality of counterfactuals (Section 3.4); and (iv) we evaluate six explainers across different dataset configurations (Section 4).
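The following sketch illustrates the diversity-gaming issue described above. It uses a naive mean-pairwise-distance diversity score, which stands in for existing diversity measures and is not the exact metric of any cited work; the attribute dimensionality and noise scale are arbitrary assumptions.

```python
# Illustrative sketch: a naive pairwise-distance diversity score is inflated
# by repeating the same change to one sensitive attribute while adding
# random noise to the remaining, irrelevant attributes.
import numpy as np

def naive_diversity(counterfactuals):
    """Mean pairwise L2 distance between counterfactuals in attribute space."""
    cfs = np.asarray(counterfactuals)
    n = len(cfs)
    if n < 2:
        return 0.0
    dists = [np.linalg.norm(cfs[i] - cfs[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))

rng = np.random.default_rng(0)
x = np.zeros(8)                          # original sample with 8 annotated attributes

# One informative counterfactual: only the sensitive attribute (index 0) changes.
informative = [x + np.eye(8)[0]]

def gamed_cf():
    cf = x + np.eye(8)[0]                # identical change to the sensitive attribute
    cf[1:] += 0.5 * rng.standard_normal(7)  # random noise on irrelevant attributes
    return cf

gamed = [gamed_cf() for _ in range(5)]   # five redundant explanations

print(naive_diversity(informative))      # single counterfactual -> score of 0
print(naive_diversity(gamed))            # high score despite one underlying factor
```

Under such a metric, the redundant set scores strictly higher than the single informative counterfactual, which is exactly the failure mode that motivates rewarding a minimal set of explanations instead.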

2. RELATED WORK

Explainability methods. Since most successful machine learning models are uninterpretable (He et al., 2016; Jégou et al., 2017; LeCun et al., 1989), modern explainability methods have emerged to provide explanations for these types of models; such explanations are known as post-hoc methods. An important approach to post-hoc explanation is to establish feature importance for a given prediction. These methods (Guidotti et al., 2018; Ribeiro et al., 2016; Shrikumar et al., 2017; Bach et al., 2015) locally approximate the machine learning model being explained with a simpler, interpretable model. However, the use of proxy models hinders the truthfulness of the explanations. Another explainability technique is to visualize the factors that influenced a model's decision through heatmaps (Fong et al., 2019; Elliott et al., 2021; Zhou et al., 2022). Heatmaps are useful for understanding which objects present in the image contributed to a classification. However, heatmaps do not show how areas of the image should be changed, and they cannot explain factors that are not spatially localized (e.g., size, color, brightness). Explanation through examples, or counterfactual explanation, addresses these limitations by synthesizing alternative inputs (counterfactuals) in which a small set of attributes is changed, resulting in a different classification. These counterfactuals are usually created using generative models. A set of

