VARIATIONAL SALIENCY MAPS FOR EXPLAINING A MODEL'S BEHAVIOR

Abstract

Saliency maps have been widely used to explain the behavior of an image classifier. We introduce a new interpretability method that treats a saliency map as a random variable and aims to calculate the posterior distribution over the saliency map. The likelihood function is designed to measure the distance between the classifier's predictive probability for an image and that for a locally perturbed image. For the prior distribution, we make the attributions of adjacent pixels positively correlated. We use a variational approximation, and show that the approximate posterior is effective in explaining the classifier's behavior. It also has the benefit of providing uncertainty over the explanation, giving experts auxiliary information on how trustworthy the explanation is.

1. INTRODUCTION

Since the advent of deep learning brought significant improvements in general machine learning tasks (Krizhevsky et al., 2012), explaining deep networks has become an important issue (Ribeiro et al., 2016). Problems inherent in training a deep neural network, such as unfairness (Arrieta et al., 2020) or the model classifying based on unintended features (Ribeiro et al., 2016), can be mitigated when the model is finely explained. Therefore, models that have gained users' trust through explanation are preferred in practical applications.

Saliency maps, also called attribution maps or relevance maps, have been widely used as interpretability methods in classification tasks, typically in the image domain (Simonyan et al., 2013). A saliency map represents the importance of each feature of the given data in the model's decision. There have been several approaches for obtaining a saliency map: backpropagation based methods (Ancona et al., 2017; Bach et al., 2015; Lundberg & Lee, 2017; Montavon et al., 2017; Selvaraju et al., 2017; Shrikumar et al., 2017; Simonyan et al., 2013; Smilkov et al., 2017; Srinivas & Fleuret, 2019; Sundararajan et al., 2017) and perturbation based methods (Chang et al., 2019; Chen et al., 2018; Dabkowski & Gal, 2017; Fong et al., 2019; Fong & Vedaldi, 2017; Schulz et al., 2020; Zeiler & Fergus, 2014; Zintgraf et al., 2017). Regardless of the approach, the common implicit assumption shared by most previous interpretability methods is that a saliency map exists deterministically once a model and an input are given: one attribution map is provided to explain the model's decision for each data point. In place of this implicit assumption, we propose a stochastic approach called Variational Saliency maps (VarSal), which assumes that the interpretation has inherent randomness. The intuition stems from the observation that stochasticity makes interpretation methods more explainable.
For instance, FIDO (Chang et al., 2019) expands the search space of the mask by drawing it from a Bernoulli distribution. This prevents the mask from being confined to a local region of the search space, as happens when it is optimized directly (Fong & Vedaldi, 2017). The example suggests that stochasticity yields better interpretations. We define the posterior distribution as the probability of the saliency map given the training data and the classifier. For the posterior to behave as a distribution over explanations, it is essential to carefully design the likelihood function and the prior distribution. We follow the idea of perturbation based methods to form the likelihood: an input that retains only the features with high attribution in a saliency map should reproduce the classifier's behavior. For the prior, we propose a new covariance matrix for a Gaussian distribution that induces a positive correlation among the attributions of adjacent pixels. As this property mimics total variation (TV) regularization, we name the prior the soft-TV Gaussian prior. After modeling the likelihood and the prior, a variational Bayesian method (Hoffman et al., 2013; Kingma & Welling, 2013) is used since the posterior is intractable. After optimization, unlike most perturbation based methods, VarSal produces a real-time saliency map, since only a single forward pass is required to generate it. The VarSal method also performs well under visual inspection, producing sophisticated borderlines with object-oriented attention. We compare VarSal with baseline methods on the perturbation benchmark test to show the effectiveness of our approach. Finally, we examine a benefit of employing a posterior distribution: uncertainty over the explanation.
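To make the soft-TV idea concrete, a Gaussian prior with positively correlated neighboring attributions can be sketched as below. The exponential-decay parameterization (correlation rho^d for pixels at Manhattan distance d) is our illustrative assumption, not necessarily the paper's exact covariance; it is the tensor product of two AR(1) kernels and is therefore positive definite for 0 < rho < 1.

```python
import numpy as np

def soft_tv_covariance(h, w, sigma2=1.0, rho=0.5):
    """Covariance for a Gaussian prior over an h*w saliency map.

    Adjacent pixels receive positive correlation rho, decaying with
    Manhattan distance (an assumed parameterization for illustration;
    the paper's exact soft-TV form may differ).
    """
    n = h * w
    ys, xs = np.divmod(np.arange(n), w)  # pixel coordinates, row-major
    # Manhattan distance between every pair of pixels
    d = np.abs(ys[:, None] - ys[None, :]) + np.abs(xs[:, None] - xs[None, :])
    return sigma2 * rho ** d             # correlation decays with distance

cov = soft_tv_covariance(4, 4)
# sanity check: the kernel is positive definite, so sampling is valid
assert np.all(np.linalg.eigvalsh(cov) > 0)
```

Sampling a saliency prior draw is then `np.random.default_rng().multivariate_normal(np.zeros(16), cov)`; nearby pixels tend to take similar values, which is the soft analogue of a TV penalty.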

2. RELATED WORK

In this section, we take a look at perturbation based interpretability methods. Fong & Vedaldi (2017) optimize a cost function with respect to a mask that indicates the features in an image most important for the classifier's prediction. This approach is further developed by Fong et al. (2019), who introduce a new method for generating the perturbed image that reduces the number of hyper-parameters and produces better qualitative results. Both methods must optimize the mask every time they receive an input, which is computationally expensive. Dabkowski & Gal (2017) relax the problem of time complexity by using a trained network whose output is a saliency mask. However, all three methods are limited in producing an importance ranking among the features of a given image, since their objective is to produce a binary mask. PDA (Zintgraf et al., 2017) produces a saliency map from a different perspective. It computes the importance of each pixel by regarding it as unobserved and marginalizing it out to obtain the predictive probability of the classifier. The same idea is used in FIDO (Chang et al., 2019) to generate a perturbed image that is regarded as a sample from the training data distribution. FIDO optimizes the parameters of a Bernoulli dropout distribution to generate a saliency mask. Because the mask is sampled from the distribution at each training iteration, the search space of binary masks is explored rather than limited to a local search. Our method is similar to FIDO in that VarSal also explores the search space by sampling the saliency map from the encoder in the training phase. There is also an information theoretic approach for explaining the classifier's prediction. Schulz et al. (2020) adopt an information bottleneck that restricts the flow of information in an intermediate layer by adding noise, and find the importance of each feature by calculating the information flow. Chen et al. (2018) also adopt the mutual information concept and optimize its variational bound for training a network that maps an input image to a saliency map. VarSal is similar in that we also train the encoder network by optimizing the evidence lower bound (ELBO). However, our method differs in that we regard the saliency map as a random variable and aim to calculate the posterior over the saliency map.

3. METHOD

In this section, we introduce the details of the VarSal method, which provides stochastic saliency maps. Let us define the pre-trained classifier that we aim to interpret as M : R^(c×h×w) → Y, where x ∈ R^(c×h×w) is an input with c, h, and w the channel, height, and width of the input image, respectively, and Y = {1, 2, . . . , K} is the set of classes. The classifier M provides a categorical probability P_M(x) = ŷ ∈ Δ^(K−1), where Δ^(K−1) is the (K−1)-simplex. Since the purpose of a saliency map s ∈ R^(h×w) is to describe the behavior of the classifier's prediction, our goal is to calculate the posterior distribution of the saliency map, p(s | x, ŷ) (solid lines in Figure 1). By Bayes' rule, the posterior is stated as:

p(s | x, ŷ) = p(ŷ | x, s) p(s | x) / Z,
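Up to the constant log Z, the log-posterior above decomposes into a likelihood term and a prior term. A minimal sketch of that decomposition follows, with a toy stand-in classifier; the use of a negative KL divergence between the classifier's predictions on the full and saliency-masked inputs is our assumption for illustration, matching the spirit (not necessarily the exact form) of the perturbation based likelihood described earlier.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy stand-in classifier: logits are a linear map of the flattened input.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 16))               # K = 3 classes, 4x4 "image"
classifier = lambda x: softmax(W @ x.ravel())  # returns a point on the simplex

def log_likelihood(x, s, eps=1e-12):
    """log p(ŷ | x, s) up to a constant: negative KL divergence between
    the prediction on x and on the saliency-masked input (assumed form)."""
    y_full = classifier(x)
    y_masked = classifier(x * s)               # keep only salient pixels
    return -np.sum(y_full * (np.log(y_full + eps) - np.log(y_masked + eps)))

def unnorm_log_posterior(x, s, log_prior):
    # log p(s | x, ŷ) + log Z = log p(ŷ | x, s) + log p(s | x)
    return log_likelihood(x, s) + log_prior(s)

x = rng.standard_normal((4, 4))
full_mask = np.ones((4, 4))
# An all-ones saliency map preserves the prediction exactly, so KL = 0.
print(log_likelihood(x, full_mask))
```

Since KL divergence is nonnegative, this likelihood is maximized (at zero) exactly when the masked input reproduces the classifier's original prediction, which is the behavior the likelihood is meant to reward.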


where Z is the marginal likelihood. To calculate the posterior, we should model two terms: the likelihood p(ŷ | x, s) and the prior p(s | x).
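Because Z is intractable, the posterior is approximated variationally. A sketch of the resulting Monte Carlo ELBO with a factorized Gaussian q(s | x) and reparameterized samples is shown below; for brevity the prior is simplified to a standard normal so the KL term has its usual closed form, whereas the paper's soft-TV prior is correlated, and the toy likelihood is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo(log_lik, mu, log_var, n_samples=8):
    """Monte Carlo ELBO: E_q[log p(ŷ|x,s)] - KL(q(s|x) || p(s|x)).

    q is N(mu, diag(exp(log_var))); the prior is simplified to N(0, I)
    so the KL is closed-form (an assumption for brevity)."""
    std = np.exp(0.5 * log_var)
    lik = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(mu.shape)
        s = mu + std * eps                     # reparameterization trick
        lik += log_lik(s)
    lik /= n_samples
    # closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over pixels
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    return lik - kl

# Toy likelihood: rewards saliency maps close to a target map.
target = np.zeros((4, 4)); target[1:3, 1:3] = 1.0
log_lik = lambda s: -np.sum((s - target) ** 2)

print(elbo(log_lik, np.zeros((4, 4)), np.zeros((4, 4))))
```

In VarSal an encoder network would output mu and log_var from x (amortized inference), so generating a saliency map after training takes a single forward pass, as noted in the introduction.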



Figure 1: Graphical model.

