SOUNDNESS AND COMPLETENESS: AN ALGORITHMIC PERSPECTIVE ON EVALUATION OF FEATURE ATTRIBUTION

Abstract

Feature attribution is a fundamental approach to explaining neural networks by quantifying the importance of input features for a model's prediction. Although a variety of feature attribution methods have been proposed, there is little consensus on how attribution methods should be assessed. In this study, we empirically show the limitations of order-based and model-retraining metrics. To overcome these limitations and enable finer-grained evaluation, we propose a novel method to evaluate the completeness and soundness of feature attribution methods. Our proposed evaluation metrics are mathematically grounded in algorithm theory and require no knowledge of "ground truth" informative features. We validate the proposed metrics by conducting experiments on synthetic and real-world datasets. Lastly, we use the proposed metrics to benchmark a wide range of feature attribution methods. Our evaluation results provide an innovative perspective on comparing feature attribution methods. Code is in the supplementary material.

1. INTRODUCTION

Explainable machine learning (XML) is an important component of trustworthy machine learning in various domains, such as medical diagnosis (Bernhardt et al., 2022; Khakzar et al., 2021a;b), drug discovery (Callaway, 2022; Jiménez-Luna et al., 2020), and autonomous driving (Kaya et al., 2022; Can et al., 2022). One fundamental approach to interpreting neural networks is feature attribution, which indicates how much each feature contributes to a model's prediction. However, different feature attribution methods can produce conflicting results for a given input (Krishna et al., 2022). To evaluate how well a feature attribution explains a prediction, various evaluation metrics have been proposed in the literature. Despite these valuable efforts and significant contributions, several concerns remain. (1) Some evaluation strategies use duplicate or even conflicting definitions. For instance, Ancona et al. (2018) define sensitivity-n as the equality between the sum of the attributions of a set of features and the output variation after removing those features, whereas Yeh et al. (2019) define sensitivity as the impact of insignificant perturbations on the attribution result. Lundstrom et al. (2022) provide yet another version of sensitivity, under which the attribution of a feature should be zero if the feature does not contribute to the output. (2) Evaluations that retrain models on a modified dataset (Hooker et al., 2019; Zhou et al., 2022) are time-consuming. Furthermore, many retraining-based evaluations rest on the strong assumption that only part of the input is learned during retraining. For instance, ROAR (Hooker et al., 2019) assumes that a model only learns to use the features that are not removed during perturbation, and Zhou et al. (2022) assume that a retrained model only learns the watermarks added to the original images of a semi-natural dataset.
Later in this work, we show that a model can pick up any feature in the dataset during retraining. (3) Many evaluation metrics are only order-sensitive: they evaluate whether one feature is more important than another while ignoring how differently the two features contribute to the output. To overcome these challenges, we propose to evaluate the alignment between attributions and informative features from an algorithmic perspective. The two proposed metrics work in conjunction with each other and reflect different characteristics of feature attribution methods. They are both order-sensitive and value-sensitive, allowing for a stricter differentiation between feature attribution methods.
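To make the distinction between order-sensitive and value-sensitive evaluation concrete, consider the following toy sketch (not from this paper; the attribution vectors `attr_a` and `attr_b` are hypothetical). A purely rank-based comparison, such as Spearman correlation, scores two attribution maps identically whenever they order the features the same way, even if their magnitudes disagree drastically; a value-sensitive comparison can tell them apart.

```python
import numpy as np

def rank_correlation(a, b):
    """Spearman rank correlation computed from feature ranks (order-sensitive only)."""
    ra = np.argsort(np.argsort(a))  # rank of each feature in a
    rb = np.argsort(np.argsort(b))  # rank of each feature in b
    return np.corrcoef(ra, rb)[0, 1]

# Two hypothetical attribution maps over the same 4 features:
# identical feature ordering, very different magnitudes.
attr_a = np.array([0.10, 0.20, 0.30, 0.40])
attr_b = np.array([0.01, 0.02, 0.03, 0.94])

# An order-based metric cannot distinguish the two maps ...
print(rank_correlation(attr_a, attr_b))  # 1.0

# ... while a value-sensitive comparison (here, L1 distance) can.
print(np.abs(attr_a - attr_b).sum())  # ≈ 1.08
```

The sketch is only meant to show why order sensitivity alone gives a coarse signal; the metrics proposed in this work differentiate such cases by taking attribution values into account as well.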

