TOWARDS AN OBJECTIVE EVALUATION OF THE TRUSTWORTHINESS OF CLASSIFIERS

Abstract

With the widespread deployment of AI models in applications that impact human lives, research on model trustworthiness has become increasingly important. As a result, model effectiveness alone (measured, e.g., with accuracy or F1) should not be the only criterion for evaluating predictive models; their trustworthiness should also be factored in. It has been argued that the features deemed important by a black-box model should align with human perception of the data, which in turn should contribute to the trustworthiness of the model. Existing research in XAI evaluates such alignment with user studies, the limitations being that these studies are subjective, difficult to reproduce, and time-consuming to conduct. We propose an evaluation framework that provides a quantitative measure of the trustworthiness of a black-box model, and hence enables a fair comparison between a number of different black-box models. Our framework is applicable to both text and images, and our experimental results show that a model with higher accuracy does not necessarily exhibit better trustworthiness.

1. INTRODUCTION

Owing to the success and promising results achieved by data-driven (deep) approaches for supervised learning, there has been a growing interest in the AI community in applying such models in domains such as healthcare (Asgarian et al., 2018; Spann et al., 2020; Yasodhara et al., 2020), criminal justice (Rudin, 2019) and finance (Dixon et al., 2020). As ML models become embedded into critical aspects of decision making, their successful adoption depends heavily on how well different stakeholders (e.g., users or developers of ML models) can understand and trust their predictions. As a result, there has been a recent surge of interest in making ML models worthy of human trust (Wiens et al., 2019), and researchers have proposed a variety of methods to explain ML models to stakeholders (Bhatt et al., 2020), with examples such as DARPA's Explainable AI (XAI) initiative (Gunning et al., 2019) and the 'human-interpretable machine learning' community (Abdul et al., 2018). Although standard evaluation metrics exist to evaluate the performance of a predictive model, there is no consistent evaluation strategy for XAI. Consequently, a common evaluation strategy is to show individual, potentially cherry-picked, examples that look reasonable (Murdoch et al., 2019) and pass the first test of having 'face-validity' (Doshi-Velez & Kim, 2018). Moreover, evaluating the ability of an explanation to convince a human is different from evaluating its correctness: while Petsiuk et al. (2018) believe that keeping humans out of the loop for evaluation makes it fairer and truer to the classifier's own view of the problem rather than representing a human's view, Gilpin et al. (2018) explain that a non-intuitive explanation could indicate either an error in the reasoning of the predictive model, or an error in the explanation-producing method.
Visual inspection of the plausibility of explanations, such as anecdotal evidence, cannot distinguish whether a non-intuitive explanation is the outcome of an error in the reasoning of the predictive model, or an error attributable to the explanation-generating method itself. Zhang et al. (2019) identify such visual inspections as one of the main shortcomings in evaluating XAI and state that checking whether an explanation "looks reasonable" only evaluates the accuracy of the black-box model, not the faithfulness of the explanation. These commentaries relate to the inherent coupling of evaluating the black-box model's predictive accuracy with explanation quality. As pointed out by Robnik-Sikonja & Bohanec (2018), the correctness of an explanation and the accuracy of the predictive model may be orthogonal. Although the correctness of the explanation is independent of the correctness of the prediction, visual inspection cannot distinguish between the two, and validating explanations with users can unintentionally conflate the evaluation of explanation correctness with the evaluation of the predictive model's correctness. Synthetic datasets are useful for evaluating explanations for black-box models (Oramas et al., 2019). By designing a dataset in a controlled manner, it should be possible to argue, with relatively high confidence, that a predictive model should reason in a particular way; a set of 'gold' explanations can thus be created in a controlled manner using the data generation process. Subsequently, the agreement of the generated explanations with these true explanations can be measured. For example, Oramas et al. (2019) generate an artificial image dataset of flowers, where color is the discriminative feature between classes. Our work compares multiple underlying predictive models in terms of trustworthiness, rather than the XAI methods themselves.
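The controlled data generation described above can be illustrated with a minimal sketch. The function below (a hypothetical example, not from the paper or from an8Flower) generates toy images in which a single colored patch is the only class-discriminative feature, and emits the patch location as a ground-truth explanation mask alongside each image:

```python
import numpy as np

def make_synthetic_dataset(n_samples=100, size=32, seed=0):
    """Toy images where a colored square is the only class-discriminative
    feature; the gold explanation is the mask covering that square."""
    rng = np.random.default_rng(seed)
    images, labels, masks = [], [], []
    for _ in range(n_samples):
        img = rng.uniform(0.0, 0.3, size=(size, size, 3))  # noisy background
        label = int(rng.integers(0, 2))                     # two classes
        # discriminative patch: red for class 0, green for class 1
        y, x = rng.integers(4, size - 12, size=2)
        img[y:y + 8, x:x + 8] = [1.0, 0.0, 0.0] if label == 0 else [0.0, 1.0, 0.0]
        mask = np.zeros((size, size), dtype=bool)
        mask[y:y + 8, x:x + 8] = True                       # gold explanation
        images.append(img)
        labels.append(label)
        masks.append(mask)
    return np.stack(images), np.array(labels), np.stack(masks)
```

Because the generator controls exactly which pixels carry class information, any faithful explanation of a well-performing model should concentrate its attribution on the masked region.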
Evaluating whether the features deemed important by a predictive model conform with those deemed important by a human is an intrinsically human-centric task that ideally requires human studies. However, performing such studies multiple times during the model development phase is not feasible. To this end, the major contributions of our work are as follows. First, we generate a synthetic dataset and its associated ground-truth explanations for a multi-objective image classification task. We also manually create the ground-truth explanations for two image classification datasets, namely the MNIST '3 vs. 8' classification and the Plant-Village (Mohanty et al., 2016) disease classification tasks, and for our text experiments we make use of a dataset of legal documents with existing ground-truth explanation units (Malik et al., 2021) (dataset and code will be released). Second, we propose a general framework to quantify the trustworthiness of a black-box model. Our approach is agnostic to both the explanation methodology and the data modality. In our experiments, we compare the performance of predictive models both in terms of effectiveness and trustworthiness on synthetic and real-world datasets using data from two different modalities: image and text.
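The general shape of such a framework can be sketched as follows. This is a hypothetical illustration, not the paper's actual metric: it averages a per-example agreement score between a model's explanations and the gold explanation masks, with one possible agreement function (the fraction of attribution mass falling inside the gold region) shown as an example:

```python
import numpy as np

def trustworthiness_score(explanations, gold_masks, agreement_fn):
    """Average per-example agreement between a model's explanations and
    ground-truth explanation masks; higher suggests more trustworthy.

    explanations : sequence of per-feature attribution arrays
    gold_masks   : sequence of boolean arrays marking gold explanation units
    agreement_fn : callable scoring one (explanation, mask) pair in [0, 1]
    """
    scores = [agreement_fn(e, m) for e, m in zip(explanations, gold_masks)]
    return float(np.mean(scores))

def mass_inside_mask(expl, mask):
    """One possible agreement function: fraction of total non-negative
    attribution mass that falls inside the gold region."""
    expl = np.clip(expl, 0.0, None)
    total = expl.sum()
    return float(expl[mask].sum() / total) if total > 0 else 0.0
```

Because the agreement function is a parameter, the same scaffold applies to any explanation method and any modality whose gold explanation units can be expressed as a mask, matching the modality- and method-agnostic design stated above.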

2. RELATED WORK

Several works have used synthetic datasets for evaluating XAI algorithms. Liu et al. (2021) released XAI-BENCH, a suite of synthetic datasets along with a library for benchmarking feature attribution algorithms. The authors argue that their synthetic datasets offer a wide variety of parameters which can be configured to simulate real-world data and have the potential to identify subtle failures, such as the deterioration of performance on datasets with high feature correlation. They give examples of how real datasets can be converted to similar synthetic datasets, thereby allowing XAI methods to be benchmarked on realistic synthetic datasets. Oramas et al. (2019) introduce an8Flower, a dataset specifically designed for objective quantitative evaluation of methods for visual explanation. They generate two synthetic datasets, 'an8Flower-single-6c' and 'an8Flower-double-12c', with 6 and 12 classes respectively. In the former, a fixed single part of the object is allowed to change color, and this color defines the classes of interest. In the latter, a combination of color and the part on which it is located defines the discriminative feature. After defining these features, they generate masks that overlap with the discriminative regions. Then, they threshold the heatmaps at given values and measure the pixel-level intersection over union (IoU) of a model explanation (produced by the method to be evaluated) with respect to these masks. We argue that the importance of each pixel as output by the XAI model is different, and that this information is not captured by a simple technique such as the pixel-level IoU of a model's explanations relative to the ground-truth explanation masks. In our work, we propose two new metrics for evaluating predictive models in terms of their trustworthiness.
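The threshold-and-IoU protocol described above can be sketched in a few lines. This is a minimal reading of that protocol, not the an8Flower authors' code; note how the binarization step discards the relative magnitudes of the attribution values, which is the limitation argued above:

```python
import numpy as np

def heatmap_iou(heatmap, gold_mask, threshold=0.5):
    """Pixel-level IoU of a thresholded saliency heatmap against a
    ground-truth explanation mask."""
    # normalize to [0, 1] so the threshold is comparable across heatmaps
    h = heatmap - heatmap.min()
    if h.max() > 0:
        h = h / h.max()
    pred = h >= threshold   # binarization: per-pixel magnitudes are lost here
    inter = np.logical_and(pred, gold_mask).sum()
    union = np.logical_or(pred, gold_mask).sum()
    return float(inter / union) if union > 0 else 1.0
```

Two heatmaps that assign very different importance values inside the gold region can yield the same IoU once thresholded, which motivates metrics that weight pixels by their attribution scores.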
Some argue in favor of automated metrics where no user involvement is needed; e.g., in the context of usability evaluation in the Human Computer Interaction (HCI) community, Greenberg & Buxton (2008) argue that there is a risk in executing user studies in an early design phase, since this can quash creative ideas or promote poor ones. Miller et al. (2017) therefore argue that proxy studies are especially valid in early development. Qi et al. (2021) indicate that "evaluating explanations

