TOWARDS AN OBJECTIVE EVALUATION OF THE TRUSTWORTHINESS OF CLASSIFIERS

Abstract

With the widespread deployment of AI models in applications that impact human lives, research on model trustworthiness has become increasingly important. As a result, model effectiveness alone (measured, e.g., with accuracy or F1) should not be the only criterion for evaluating predictive models; the trustworthiness of these models should also be factored in. It has been argued that the features deemed important by a black-box model should align with the human perception of the data, which in turn should contribute to increasing the trustworthiness of the model. Existing research in XAI evaluates such alignment with user studies, the limitations being that these studies are subjective, difficult to reproduce, and time-consuming to conduct. We propose an evaluation framework that provides a quantitative measure of the trustworthiness of a black-box model, and hence enables a fair comparison between a number of different black-box models. Our framework is applicable to both text and images, and our experimental results show that a model with higher accuracy does not necessarily exhibit better trustworthiness.

1. INTRODUCTION

Owing to the success and promising results of data-driven (deep) approaches to supervised learning, there has been growing interest in the AI community in applying such models in domains such as healthcare (Asgarian et al., 2018; Spann et al., 2020; Yasodhara et al., 2020), criminal justice (Rudin, 2019), and finance (Dixon et al., 2020). As ML models become embedded into critical aspects of decision making, their successful adoption depends heavily on how well different stakeholders (e.g., users or developers of ML models) can understand and trust their predictions. As a result, there has been a recent surge of work on making ML models worthy of human trust (Wiens et al., 2019), and researchers have proposed a variety of methods to explain ML models to stakeholders (Bhatt et al., 2020), with examples such as DARPA's Explainable AI (XAI) initiative (Gunning et al., 2019) and the 'human-interpretable machine learning' community (Abdul et al., 2018). Although standard evaluation metrics exist for assessing the performance of a predictive model, there is no consistent evaluation strategy for XAI. Consequently, a common evaluation strategy is to show individual, potentially cherry-picked, examples that look reasonable (Murdoch et al., 2019) and pass the first test of having 'face-validity' (Doshi-Velez & Kim, 2018). Moreover, evaluating the ability of an explanation to convince a human is different from evaluating its correctness: while Petsiuk et al. (2018) argue that keeping humans out of the evaluation loop makes it fairer and truer to the classifier's own view of the problem rather than a human's view, Gilpin et al. (2018) note that a non-intuitive explanation could indicate either an error in the reasoning of the predictive model or an error in the explanation-producing method.
Visual inspection of the plausibility of explanations, such as anecdotal evidence, cannot distinguish whether a non-intuitive explanation is the outcome of an error in the reasoning of the predictive model or an error attributable to the explanation-generating method itself. Zhang et al. (2019) identify such visual inspections as one of the main shortcomings in evaluating XAI, stating that checking whether an explanation "looks reasonable" only evaluates the accuracy of the black-box model and does not evaluate the faithfulness of the explanation. These commentaries relate to the inherent coupling of evaluating the black-box model's predictive accuracy

