RETHINKING THE ROLE OF GRADIENT-BASED ATTRIBUTION METHODS FOR MODEL INTERPRETABILITY

Abstract

Current methods for the interpretability of discriminative deep neural networks commonly rely on the model's input-gradients, i.e., the gradients of the output logits w.r.t. the inputs. The common assumption is that these input-gradients contain information regarding p_θ(y | x), the model's discriminative capabilities, thus justifying their use for interpretability. However, in this work we show that these input-gradients can be arbitrarily manipulated as a consequence of the shift-invariance of softmax, without changing the discriminative function. This leaves an open question: if input-gradients can be arbitrary, why are they highly structured and explanatory in standard models? We investigate this by re-interpreting the logits of standard softmax-based classifiers as unnormalized log-densities of the data distribution and show that input-gradients can be viewed as gradients of a class-conditional density model p_θ(x | y) implicit within the discriminative model. This leads us to hypothesize that the highly structured and explanatory nature of input-gradients may be due to the alignment of this class-conditional model p_θ(x | y) with that of the ground-truth data distribution p_data(x | y). We test this hypothesis by studying the effect of density alignment on gradient explanations. To achieve this density alignment, we use an algorithm called score-matching, and propose novel approximations to this algorithm to enable training large-scale models. Our experiments show that improving the alignment of the implicit density model with the data distribution enhances gradient structure and explanatory power, while reducing this alignment has the opposite effect. This also leads us to conjecture that unintended density alignment in standard neural network training may explain the highly structured nature of input-gradients observed in practice.
Overall, our finding that input-gradients capture information regarding an implicit generative model implies that we need to re-think their use for interpreting discriminative models.

1. INTRODUCTION

Input-gradients, or gradients of outputs w.r.t. inputs, are commonly used for the interpretation of deep neural networks (Simonyan et al., 2013). For image classification tasks, an input pixel with a larger input-gradient magnitude is attributed a higher 'importance' value, and the resulting maps are observed to agree with human intuition regarding which input pixels are important for the task at hand (Adebayo et al., 2018). Quantitative studies (Samek et al., 2016; Shrikumar et al., 2017) also show that these importance estimates are meaningful in predicting model response to larger structured perturbations. These results suggest that input-gradients do indeed capture relevant information regarding the underlying model. However, in this work, we show that input-gradients can be arbitrarily manipulated using the shift-invariance of softmax without changing the underlying discriminative model, which calls into question the reliability of input-gradient-based attribution methods for interpreting arbitrary black-box models.

Given that input-gradients can be arbitrarily structured, the reason for their highly structured and explanatory nature in standard pre-trained models is puzzling. Why are input-gradients relatively well-behaved when they can just as easily be arbitrarily structured, without affecting discriminative performance? What factors influence input-gradient structure in standard deep neural networks?

To answer these questions, we consider the connections between softmax-based discriminative classifiers and generative models (Bridle, 1990; Grathwohl et al., 2020), made by viewing the logits of standard classifiers as unnormalized log-densities. This connection reveals an alternate interpretation of input-gradients: they represent the log-gradients of a class-conditional density model which is implicit within standard softmax-based deep models, which we shall call the implicit density model.
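For concreteness, the basic saliency computation described above can be sketched as follows. This is an illustrative toy, not from the paper: the two-layer random network stands in for a trained model, and central finite differences stand in for backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image" and a tiny random two-layer classifier (a hypothetical
# stand-in for a trained network; weights are random, so the resulting
# map is illustrative only).
D, H, C = 16, 8, 3
W1 = rng.standard_normal((H, D))
W2 = rng.standard_normal((C, H))

def logits(x):
    # Pre-softmax outputs f(x) of the toy network.
    return W2 @ np.tanh(W1 @ x)

def saliency(x, i, eps=1e-5):
    """Logit-gradient magnitudes |df_i/dx_d| via central differences."""
    grad = np.zeros(D)
    for d in range(D):
        e = np.zeros(D)
        e[d] = eps
        grad[d] = (logits(x + e)[i] - logits(x - e)[i]) / (2 * eps)
    return np.abs(grad)

x = rng.standard_normal(D)
s = saliency(x, i=0)
print(s.shape)  # one importance score per input dimension: (16,)
```

In practice these maps are computed with a single backward pass through the network rather than D forward passes; the finite-difference version here only serves to make the definition explicit.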
This connection compels us to consider the following hypothesis: perhaps input-gradients are highly structured because this implicit density model is aligned with the 'ground truth' class-conditional data distribution? The core of this paper is dedicated to testing this hypothesis: do input-gradients become more structured and explanatory when this alignment increases, and less so when it decreases?

To validate this hypothesis, we require mechanisms to increase or decrease the alignment between the implicit density model and the data distribution. To this end, we consider a generative modelling approach called score-matching, which reduces the density modelling problem to one of local geometric regularization. By using score-matching, we are thus able to view commonly used geometric regularizers in deep learning as density modelling methods. In practice, the score-matching objective is known to be computationally expensive and unstable to train (Song & Ermon, 2019; Kingma & LeCun, 2010). We therefore also introduce approximations and regularizers which allow us to use score-matching on practical large-scale discriminative models.

This work is broadly connected to the literature on the unreliability of saliency methods. While most such works consider how the explanations for nearly identical images can be arbitrarily different (Dombrowski et al., 2019; Subramanya et al., 2019; Zhang et al., 2020; Ghorbani et al., 2019), our work considers how one may change the model itself to yield arbitrary explanations without affecting discriminative performance. This is similar to Heo et al. (2019), who show this experimentally, whereas we provide an analytical reason for why this happens, relating to the shift-invariance of softmax. The rest of the paper is organized as follows.
We show in § 2 that it is trivial to manipulate input-gradients of standard classifiers using the shift-invariance of softmax without affecting the discriminative model. In § 3 we state our main hypothesis, describe the details of score-matching, and present a tractable approximation that eliminates the need for expensive Hessian computations. § 4 revisits other interpretability tools from a density modelling perspective. Finally, § 5 presents experimental evidence for the validity of the hypothesis that improved alignment between the implicit density model and the data distribution can improve the structure and explanatory nature of input-gradients.
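As background for the score-matching discussion (a generic sketch, not the paper's specific approximation): the score-matching objective involves the trace of the Jacobian of the model's score function, which is what makes it expensive, and a standard way to make such traces tractable is Hutchinson's estimator, tr(J) ≈ E_v[vᵀ J v] with random probe vectors v. The score function and dimensions below are hypothetical, chosen so the exact trace is known for checking.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10

# Hypothetical score function s(x) = -A x (the score of a zero-mean
# Gaussian with precision A), chosen because its Jacobian is -A, so the
# exact trace tr(-A) is available to sanity-check the estimator.
A = rng.standard_normal((D, D))
A = A @ A.T / D  # symmetric positive semi-definite

def score(x):
    return -A @ x

def hutchinson_trace(fn, x, n_samples=2000, eps=1e-4):
    """Estimate tr(dfn/dx) as an average of v . (J v) over Rademacher
    probes v, using a finite-difference Jacobian-vector product so no
    Jacobian or Hessian is ever materialized."""
    est = 0.0
    for _ in range(n_samples):
        v = rng.choice([-1.0, 1.0], size=x.shape)
        jvp = (fn(x + eps * v) - fn(x - eps * v)) / (2 * eps)
        est += v @ jvp
    return est / n_samples

x = rng.standard_normal(D)
print(hutchinson_trace(score, x), np.trace(-A))  # the two should agree closely
```

In autodiff frameworks the finite-difference Jacobian-vector product would be replaced by an exact vector-Jacobian product from one backward pass per probe, which is the usual trick for scaling such trace terms.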

2. INPUT-GRADIENTS ARE NOT UNIQUE

In this section, we show that it is trivial to manipulate input-gradients of discriminative deep networks, using the well-known shift-invariance property of softmax. Here we shall make a distinction between two types of input-gradients: logit-gradients and loss-gradients. Logit-gradients are gradients of the pre-softmax output of a given class w.r.t. the input, while loss-gradients are gradients of the loss w.r.t. the input. In both cases, we only consider outputs of a single class, usually the target class.

Let x ∈ R^D be a data point, which is the input for a neural network model f : R^D → R^C intended for classification, which produces pre-softmax logits for C classes. The cross-entropy loss function for some class i ∈ N, 1 ≤ i ≤ C, corresponding to an input x is given by ℓ(f(x), i) ∈ R^+, which is shortened to ℓ_i(x) for convenience. Note that here the loss function subsumes the softmax function as well. The logit-gradients are given by ∇_x f_i(x) ∈ R^D for class i, while loss-gradients are ∇_x ℓ_i(x) ∈ R^D. Let the softmax function be p(y = i | x) = exp(f_i(x)) / Σ_{j=1}^{C} exp(f_j(x)), which we denote as p_i for simplicity.

Here, we make the observation that upon adding the same scalar function g to all logits, the logit-gradients can arbitrarily change, but the softmax outputs, and hence the loss values, do not.

Observation. Assume an arbitrary function g : R^D → R. Consider another neural network function given by f̃_i(·) = f_i(·) + g(·), for 1 ≤ i ≤ C, for which we obtain ∇_x f̃_i(·) = ∇_x f_i(·) + ∇_x g(·). Since the factor exp(g(x)) cancels in the numerator and denominator of the softmax, the probabilities p_i, and hence the losses ℓ_i, are unchanged, while the logit-gradients shift by the arbitrary vector ∇_x g(x).
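The observation above is easy to check numerically. In the toy sketch below (not from the paper), the logits are linear, f(x) = W x, so the logit-gradients are the rows of W, and g(x) = c·x is an arbitrary linear function: adding g to every logit leaves the softmax probabilities, and hence the cross-entropy loss, unchanged, while shifting every logit-gradient by the arbitrary vector c.

```python
import numpy as np

rng = np.random.default_rng(0)
D, C = 5, 3
W = rng.standard_normal((C, D))

def softmax(z):
    z = z - z.max()  # numerically stable; valid precisely because softmax is shift-invariant
    e = np.exp(z)
    return e / e.sum()

x = rng.standard_normal(D)
c = rng.standard_normal(D)  # gradient of an arbitrary linear g(x) = c . x

f = W @ x             # original logits; the logit-gradient for class i is W[i]
f_tilde = f + c @ x   # add the same scalar g(x) to every logit

p, p_tilde = softmax(f), softmax(f_tilde)
print(np.allclose(p, p_tilde))  # → True: probabilities (and loss) unchanged
# The logit-gradients, however, become W[i] + c, which is arbitrary since c is.
```

The same cancellation holds for any differentiable g, not just linear ones; the linear case simply makes the shifted gradients explicit.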

