SOUNDNESS AND COMPLETENESS: AN ALGORITHMIC PERSPECTIVE ON EVALUATION OF FEATURE ATTRIBUTION

Abstract

Feature attribution is a fundamental approach to explaining neural networks by quantifying the importance of input features for a model's prediction. Although a variety of feature attribution methods have been proposed, there is little consensus on how attribution methods should be assessed. In this study, we empirically show the limitations of order-based and model-retraining metrics. To overcome these limitations and enable evaluation at higher granularity, we propose a novel method to evaluate the completeness and soundness of feature attribution methods. Our proposed evaluation metrics are mathematically grounded in algorithm theory and require no knowledge of "ground truth" informative features. We validate our proposed metrics by conducting experiments on synthetic and real-world datasets. Lastly, we use the proposed metrics to benchmark a wide range of feature attribution methods. Our evaluation results provide an innovative perspective on comparing feature attribution methods. Code is in the supplementary material.

1. INTRODUCTION

Explaining the predictions of machine learning models (XML) is an important component of trustworthy machine learning in various domains, such as medical diagnosis (Bernhardt et al., 2022; Khakzar et al., 2021a;b), drug discovery (Callaway, 2022; Jiménez-Luna et al., 2020), and autonomous driving (Kaya et al., 2022; Can et al., 2022). One fundamental approach to interpreting neural networks is feature attribution, which indicates how much each feature contributes to a model's prediction. However, different feature attribution methods can produce conflicting results for a given input (Krishna et al., 2022). To evaluate how well a feature attribution explains the prediction, different evaluation metrics have been proposed in the literature. Despite valuable efforts and significant contributions toward new evaluation strategies for feature attribution methods, several problems of concern remain. (1) Some evaluation strategies use duplicate or even conflicting definitions. For instance, Ancona et al. (2018) define sensitivity-n as the equality between the sum of attributions and the output variation after removing the attributed features, while Yeh et al. (2019) define sensitivity as the impact of insignificant perturbations on the attribution result. Lundstrom et al. (2022) provide yet another version of sensitivity, in which the attribution of a feature should be zero if the feature does not contribute to the output. (2) Retraining on a modified dataset (Hooker et al., 2019; Zhou et al., 2022) is time-consuming. Furthermore, many retraining-based evaluations rest on the strong assumption that only part of the input is learned during retraining. For instance, ROAR (Hooker et al., 2019) assumes that a model only learns to use features that are not removed during perturbation, and Zhou et al. (2022) assume that a retrained model only learns the watermarks added to the original images of a semi-natural dataset.
Later in this work, we show that a model can pick up any feature in the dataset during retraining. (3) Many evaluation metrics are only order-sensitive: they evaluate whether one feature is more important than another while ignoring how differently the two features contribute to the output. To overcome these challenges, we propose to evaluate the alignment between attribution and informative features from an algorithmic perspective. The two proposed metrics work in conjunction with each other and reflect different characteristics of feature attribution methods. Our proposed metrics are both order-sensitive and value-sensitive, allowing for stricter differentiation between two attribution methods; the information they provide is therefore more fine-grained than that of existing metrics. In addition, our proposed metrics can perform the evaluation without knowing the "ground truth" informative features for model inference, as we utilize the model's performance as an approximate measure for comparing different feature attribution methods. To summarize our contributions:

• We empirically reveal the limitations of existing evaluation strategies for feature attribution. We show that order-based evaluations can underperform. We further demonstrate that evaluations with retraining are not guaranteed to be correct.

• We draw inspiration from algorithm theory and propose two novel metrics for faithfully evaluating feature attribution methods. Our approach requires no prior knowledge of "ground truth" informative features. We conduct extensive experiments to validate the correctness of our metrics and empirically show that our approach overcomes the problems associated with existing metrics.

• We comprehensively benchmark feature attribution methods with our proposed metrics. Our benchmark reveals some undiscovered properties of existing feature attribution methods. We also examine the effectiveness of ensemble methods to showcase that our metrics can help create better feature attribution methods.
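To make the order-sensitive versus value-sensitive distinction above concrete, the following minimal sketch (all attribution values and output drops are made up for illustration) shows two attribution maps that an order-based metric cannot distinguish, while a value-sensitive comparison can:

```python
# Two hypothetical attribution maps over three features.
# Method A and method B rank the features identically,
# but assign very different importance values.
attr_a = [0.90, 0.50, 0.10]   # method A: strongly peaked
attr_b = [0.34, 0.33, 0.33]   # method B: nearly uniform, same order

def ranking(scores):
    """Feature indices sorted by descending importance."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])

# An order-based metric sees the two methods as identical:
print(ranking(attr_a) == ranking(attr_b))  # True

# A value-sensitive comparison does not. Suppose removing each
# feature individually drops the model output by these amounts
# (made-up reference values standing in for measured behavior):
output_drops = [0.85, 0.50, 0.12]

err_a = sum(abs(a - d) for a, d in zip(attr_a, output_drops))
err_b = sum(abs(b - d) for b, d in zip(attr_b, output_drops))
print(err_a < err_b)  # True: method A tracks the output drops more closely
```

A purely order-based evaluation therefore cannot separate two methods that agree on the ranking but disagree, possibly drastically, on the magnitude of each feature's contribution.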

2. RELATED WORK

Feature Attribution Attribution methods explain a model by identifying informative input features given an associated output. Gradient-based methods (Simonyan et al., 2014; Baehrens et al., 2010; Springenberg et al., 2015; Zhang et al., 2018; Shrikumar et al., 2017) derive attribution from gradients of the output with respect to the input. Perturbation-based methods (Fong et al., 2019) rely on perturbing input features and measuring the impact on the output. Moreover, IBA (Schulz et al., 2020) and InputIBA (Zhang et al., 2021) are derived from information bottlenecks; InputIBA finds a better prior for the information bottleneck at the input, thus producing more fine-grained attribution maps than IBA. Lastly, attribution methods based on activation maps, such as CAM (Zhou et al., 2016) and GradCAM (Selvaraju et al., 2017), use activations or gradients of hidden layers.

Evaluation metrics for feature attribution Earlier efforts have attempted to evaluate various feature attribution methods. One category is expert-grounded metrics that rely on visual inspection (Yang et al., 2022), the pointing game (Zhang et al., 2018), or human-AI collaborative tasks (Nguyen et al., 2021). However, the evaluation outcome is subjective and its consistency is not guaranteed. Another category is functional-grounded metrics (Petsiuk et al., 2018; Samek et al., 2016; Ancona et al., 2018) that perturb the input according to the attribution order and measure the change in the model's output. Prior works usually consider two removal orders: MoRF (Most Relevant First) and LeRF (Least Relevant First). Hooker et al. (2019) argued that removing features can lead to adversarial effects and proposed ROAR, which retrains the model on the perturbed dataset to mitigate this effect. Later, ROAD (Rong et al., 2022) showed that the perturbation masks can leak class information, leading to an overestimated evaluation outcome. Recently, Zhou et al. (2022) proposed to inject "ground truth" features into the training dataset and force the model to learn these features exclusively; they then tested whether an attribution method identifies these features. Furthermore, Khakzar et al. (2022) propose to generate datasets with null features and then test the axioms for feature attribution. Another sub-type of functional-grounded metrics focuses on sanity checks for attribution methods (Adebayo et al., 2018). The key differences between our approach

