VARIATIONAL SALIENCY MAPS FOR EXPLAINING MODEL'S BEHAVIOR

Abstract

Saliency maps have been widely used to explain the behavior of an image classifier. We introduce a new interpretability method that treats the saliency map as a random variable and aims to compute the posterior distribution over saliency maps. The likelihood function is designed to measure the distance between the classifier's predictive probability for an image and that for a locally perturbed image. For the prior distribution, we design a covariance structure under which the attributions of adjacent pixels are positively correlated. We use a variational approximation and show that the approximate posterior is effective in explaining the classifier's behavior. It also provides uncertainty over the explanation, giving experts auxiliary information about how trustworthy the explanation is.

1. INTRODUCTION

Since the advent of deep learning brought significant improvements in general machine learning tasks (Krizhevsky et al., 2012), explaining deep networks has become an important issue (Ribeiro et al., 2016). Problems inherent in training a deep neural network, such as fairness concerns (Arrieta et al., 2020) or the model classifying based on unintended features (Ribeiro et al., 2016), can be mitigated when the model is finely explained. Models that have gained users' trust through explanation are therefore preferred in practical applications. Saliency maps, also called attribution maps or relevance maps, have been widely used as interpretability methods in classification tasks, typically in the image domain (Simonyan et al., 2013). A saliency map represents the importance of each feature of the given data in the model's decision. There have been several approaches for obtaining a saliency map: backpropagation-based methods (Ancona et al., 2017; Bach et al., 2015; Lundberg & Lee, 2017; Montavon et al., 2017; Selvaraju et al., 2017; Shrikumar et al., 2017; Simonyan et al., 2013; Smilkov et al., 2017; Srinivas & Fleuret, 2019; Sundararajan et al., 2017) and perturbation-based methods (Chang et al., 2019; Chen et al., 2018; Dabkowski & Gal, 2017; Fong et al., 2019; Fong & Vedaldi, 2017; Schulz et al., 2020; Zeiler & Fergus, 2014; Zintgraf et al., 2017). Regardless of the approach, the common implicit assumption shared by most previous interpretability methods is that the saliency map is deterministic once a model and an input are given: a single attribution map is provided to explain the model's decision for each data point. In place of this assumption, we propose a stochastic approach called Variational Saliency maps (VarSal), which assumes that the interpretation has inherent randomness. The intuition stems from the observation that stochasticity can make interpretation methods more effective.
For instance, FIDO (Chang et al., 2019) expands the search space of the mask by drawing it from a Bernoulli distribution. This prevents the mask from being confined to a local region of the search space, as happens when it is optimized directly (Fong & Vedaldi, 2017). This example suggests that stochasticity yields better interpretations. We define the posterior distribution as the probability of the saliency map given the training data and the classifier. For the posterior to behave as a distribution over explanations, it is essential to carefully design the likelihood function and the prior distribution. We follow the idea of perturbation-based methods to form the likelihood: an input that retains only the features with high attribution in the saliency map should reproduce the classifier's behavior. For the prior, we propose a new covariance matrix for a Gaussian distribution that encodes positive correlation among the attributions of adjacent pixels. As this property
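As an illustration of the two modeling ingredients just described, the sketch below builds a Gaussian prior whose covariance makes adjacent pixels positively correlated, and a divergence measure of the kind a perturbation-based likelihood could use to compare the classifier's predictive distributions on the original and perturbed inputs. The RBF kernel, hyperparameters, and function names here are our own illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def spatial_covariance(h, w, length_scale=1.5, sigma=1.0):
    """RBF covariance over pixel-grid coordinates (illustrative choice):
    nearby pixels get high covariance, distant ones near zero."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)  # (h*w, 2)
    sq_dists = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    return sigma ** 2 * np.exp(-sq_dists / (2 * length_scale ** 2))

def sample_prior_saliency(h, w, n_samples=1, seed=0):
    """Draw saliency maps from the zero-mean Gaussian prior N(0, K)."""
    rng = np.random.default_rng(seed)
    cov = spatial_covariance(h, w)
    maps = rng.multivariate_normal(np.zeros(h * w), cov, size=n_samples)
    return maps.reshape(n_samples, h, w)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two predictive distributions; a likelihood could
    reward saliency maps whose retained features keep this divergence small."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float((p * (np.log(p) - np.log(q))).sum())
```

With this kernel, samples from the prior are spatially smooth maps rather than independent per-pixel noise, which matches the intuition that an object occupying adjacent pixels should receive coherent attributions.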

