RE-CALIBRATING FEATURE ATTRIBUTIONS FOR MODEL INTERPRETATION

Abstract

The ability to interpret machine learning models is critical for high-stakes applications. Owing to its desirable theoretical properties, path integration is a widely used scheme for computing feature attributions that interpret model predictions. However, the methods implementing this scheme currently rely on absolute attribution scores to provide sensible interpretations. This not only contradicts the premise that features with larger attribution scores are more relevant to the model prediction, but also conflicts with the theoretical settings under which the desirable properties of the attributions are proven. We address this by devising a method that first computes an appropriate reference for the path integration scheme. This reference further helps identify valid interpolation points on a desired integration path. The reference is computed in a gradient-ascending direction on the model's loss surface, while the interpolations are performed by analyzing the model gradients and the variations between the reference and the input. The eventual integration is effectively performed along a non-linear path. Our scheme can be incorporated into existing integral-based attribution methods. We also devise an effective sampling and integration procedure that enables employing our scheme with multi-reference path integration efficiently. By enhancing a range of integral-based attribution methods with our scheme, we achieve a marked performance boost on both local and global evaluation metrics. Our extensive results also show improved sensitivity, sanity preservation and model robustness with the proposed re-calibration of the attribution techniques.

1. INTRODUCTION

How to interpret deep learning predictions is a major concern for real-world applications, especially in high-stakes domains. Feature attribution methods explain a model's prediction by assigning importance scores (attributions) to the input features (Simonyan et al., 2014; Springenberg et al., 2015; Shrikumar et al., 2017). They assert that features with larger attribution scores are more relevant to the model prediction than those with smaller scores. Among the variety of attribution methods, integral-based techniques (Sundararajan et al., 2017; Sturmfels et al., 2020) are particularly attractive because they satisfy certain desirable axiomatic properties that others do not. This also makes them suitable for model regularization (Chen et al., 2019; Erion et al., 2021). Inspired by cooperative game theory, integral-based attribution methods introduce a reference to represent the absence of the input signal. This allows them to calculate the attribution for the presence of an input feature (Sundararajan et al., 2017; Erion et al., 2021; Pan et al., 2021). Since the attribution is computed with respect to a reference, finding the correct reference, i.e., one that represents the absence of a feature in a true sense, is critical for the reliability of these methods. In their Integrated Gradients (IG) method, Sundararajan et al. (2017) chose a black image (zero input) as the reference. However, this results in always assigning zero attribution to black input pixels. Sturmfels et al. (2020) confirmed that a fixed reference renders an attribution method blind to the features that the reference
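To make the zero-baseline blindness concrete, the following is a minimal sketch of Integrated Gradients approximated by a Riemann sum along the straight-line path from the reference to the input. The toy model `f(x) = sum(x**2)` and its analytic gradient stand in for a trained network; these, and the helper names, are illustrative assumptions rather than part of the method described above.

```python
import numpy as np

def integrated_gradients(f_grad, x, baseline, steps=50):
    """Riemann-sum approximation of IG along the straight-line path
    from `baseline` to `x`: (x - baseline) * mean of gradients at the
    interpolation points."""
    alphas = np.linspace(0.0, 1.0, steps)
    diff = x - baseline
    total = np.zeros_like(x)
    for a in alphas:
        total += f_grad(baseline + a * diff)
    return diff * total / steps

# Toy differentiable model: f(x) = sum(x**2), with gradient 2x.
f_grad = lambda x: 2.0 * x

x = np.array([0.0, 0.5, 1.0])       # first feature is "black" (zero)
baseline = np.zeros_like(x)         # zero reference, as in the original IG

attr = integrated_gradients(f_grad, x, baseline)
# The zero-valued feature receives exactly zero attribution: the factor
# (x - baseline) vanishes there no matter what the gradients are.
```

Note that the attributions still satisfy completeness (they sum to `f(x) - f(baseline)`), yet the black pixel is invisible to the explanation, which is precisely the issue a fixed zero reference introduces.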



1 Our code is available at https://github.com/ypeiyu/attribution_recalibration

