EXPLAINABILITY AS STATISTICAL INFERENCE

Abstract

A wide variety of model explanation approaches have been proposed in recent years, all guided by very different rationales and heuristics. In this paper, we take a new route and cast interpretability as a statistical inference problem. We propose a general deep probabilistic model designed to produce interpretable predictions. The model's parameters can be learned via maximum likelihood, and the method can be adapted to any predictor network architecture and any type of prediction problem. Our method is an instance of amortized interpretability models, where a neural network is used as a selector to allow for fast interpretation at inference time. Several popular interpretability methods are shown to be particular cases of regularized maximum likelihood for our general model. We propose new datasets with ground-truth selection which allow for the evaluation of the feature importance map. Using these datasets, we show experimentally that using multiple imputation provides more reasonable interpretations.

1. INTRODUCTION

Fueled by the recent advances in deep learning, machine learning models are becoming omnipresent in society. Their widespread use for decision making or predictions in critical fields leads to a growing need for transparency and interpretability of these methods. While Rudin (2019) argues that we should always favor interpretable models for high-stake decisions, in practice, black-box methods are used due to their superior predictive power. Researchers have proposed a variety of model explanation approaches for black-box models, and we refer to Linardatos et al. (2021) for a recent survey. Assessing interpretable methods is hard: the multiplicity of evaluation methods (Afchar et al., 2021; Jethani et al., 2021a; Liu et al., 2021; Hooker et al., 2019) makes it difficult to compare the qualities of the different methods. In this paper, we focus on methods that offer an understanding of which features are important for the prediction of a given instance. These types of methods are called instance-wise feature selection and quantify how much a prediction changes when only a subset of features is shown to the model. Ribeiro et al. (2016; 2020) propose to create a saliency map by evaluating the change in performance of the model when exposed to different selections of features. In practice, these methods show very good results but focus on local explanations for a single instance. An evaluation of the selection for a single image requires an exponential number of passes through the black-box model. It is therefore of particular interest to obtain explanations of multiple instances using amortized explanation methods. The idea of such methods is to train a selector network that will generalize the selection given for a single instance. While there is a higher cost of entry due to training an extra network, the interpretation at test time is much faster. Jethani et al. (2021b) proposed to obtain Shapley values with a selector network. Chen et al. (2018) and Yoon et al. (2018) both proposed to train a selector that selects a minimal subset of features while maximizing an information-theoretic criterion. Jethani et al. (2021a) showed that the selector in such models was not really constrained to select the important features, but would instead encode target information to facilitate the prediction, and proposed a method that trains a surrogate predictor to alleviate this issue. In this paper, we propose LEX (Latent Variable as Explanation), a modular self-interpretable probabilistic model class that allows for instance-wise feature selection. LEX is composed of three different modules: a predictor, a selector, and an imputation scheme. We show that up to different optimization procedures, other existing amortized explanation methods (L2X (Chen et al., 2018), INVASE (Yoon et al., 2018), and REAL-X (Jethani et al., 2021a)) optimize an objective that can be framed as the maximization of a LEX model. LEX can be used either "In-Situ," where the selector and predictor are trained jointly, or "Post-Hoc," to explain an already learned predictor.
We propose two new datasets to evaluate the performance of instance-wise feature selection and experimentally show that using multiple imputation leads to more plausible selections.

Notation. Random variables are capitalized; their realizations are not. Exponents correspond to the index of realisations and indices correspond to the considered feature. For instance, x^i_j corresponds to the i-th realization of the random variable X_j, which is the j-th feature of the random variable X. For i ∈ {1, . . . , D}, x_{-i} is defined as the vector (x_1, . . . , x_{i-1}, x_{i+1}, . . . , x_D), i.e., the vector with the i-th dimension removed. For z ∈ {0, 1}^D, x_z is defined as the vector (x_j)_{j : z_j = 1}, where we only keep the dimensions where z_j = 1, and x_{1-z} denotes the vector (x_j)_{j : z_j = 0}, where we only keep the dimensions where z_j = 0. In particular, x_z is ∥z∥-dimensional and x_{1-z} is (D − ∥z∥)-dimensional.
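As an illustrative sketch (not from the paper), the masking notation can be mirrored in NumPy: a binary mask z selects the observed features x_z, and its complement gives x_{1-z}:

```python
import numpy as np

D = 5
x = np.array([0.3, 1.2, -0.7, 2.1, 0.5])  # one realization of X
z = np.array([1, 0, 1, 1, 0])             # a binary mask in {0, 1}^D

x_z = x[z == 1]      # observed features: (x_j) for j with z_j = 1
x_comp = x[z == 0]   # removed features:  (x_j) for j with z_j = 0

# dimensions match the notation: x_z is ||z||-dimensional,
# x_{1-z} is (D - ||z||)-dimensional
assert x_z.shape[0] == z.sum()
assert x_comp.shape[0] == D - z.sum()
```

Here x_z = [0.3, -0.7, 2.1] and x_{1-z} = [1.2, 0.5]; the concrete values are only for illustration.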

2. CASTING INTERPRETABILITY AS STATISTICAL LEARNING

Let X = ∏_{d=1}^{D} X_d be a D-dimensional feature space and Y be the target space. We consider two random variables X = (X_1, . . . , X_D) ∈ X and Y ∈ Y following the true data-generating distribution p_data(x, y). We have access to N i.i.d. realisations of these two random variables, x^1, . . . , x^N ∈ X, and labels y^1, . . . , y^N ∈ Y. We want to approximate the conditional distribution p_data(y|x) of the labels and discover which subset of features is useful for each local prediction.

2.1. STARTING WITH A STANDARD PREDICTIVE MODEL

To approximate this conditional distribution, a standard approach would be to consider a predictive model Φ(y | f_θ(x)), where f_θ : R^D → H is a neural network and (Φ(· | η))_{η ∈ H} is a parametric family of densities over the target space, here parameterized by the output of f_θ. Usually, Φ is a categorical distribution for a classification task and a normal distribution for a regression task. The model being



Figure 1: The LEX pipeline allows us to transform any prediction model into an explainable one. In supervised learning, a standard approach uses a function f_θ (usually a neural network) to parameterize a prediction distribution p_θ; in that framework, we would feed the input data directly to the neural network f_θ. Within the LEX pipeline, we instead obtain a distribution of masks p_γ, parameterized by a neural network g_γ applied to the input data. Samples from this mask distribution are applied to the original image x to produce incomplete samples x_z. We handle the induced missingness by sampling imputed versions of the masked features from a generative model conditioned on both the mask and the original image. These completed samples are then fed to a classifier f_θ to obtain a prediction. As opposed to previous methods, multiple imputation allows us to minimise the encoding happening in the mask and to get a more faithful selection.
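The pipeline in Figure 1 can be sketched in a few lines of NumPy. This is a hypothetical minimal sketch, not the paper's implementation: `selector` stands in for g_γ, `predictor` for f_θ, and mean imputation stands in for the conditional generative model used to fill in masked features; K Monte Carlo samples of masks and imputations are averaged to form the prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 4, 8  # feature dimension, number of mask/imputation samples

def selector(x):
    # g_gamma: maps x to per-feature selection probabilities
    # (illustrative: a fixed elementwise sigmoid, not a trained network)
    return 1.0 / (1.0 + np.exp(-0.5 * x))

def predictor(x):
    # f_theta: maps a completed input to class probabilities
    # (illustrative two-class softmax head)
    scores = np.array([x.sum(), -x.sum()])
    e = np.exp(scores - scores.max())
    return e / e.sum()

def lex_forward(x, imputation_value):
    pi = selector(x)                  # mask distribution p_gamma(z | x)
    probs = np.zeros(2)
    for _ in range(K):
        z = rng.random(D) < pi        # sample a binary mask z
        # fill in masked-out features; mean imputation is a stand-in
        # for the conditional generative model of the paper
        x_tilde = np.where(z, x, imputation_value)
        probs += predictor(x_tilde)   # feed completed sample to f_theta
    return probs / K                  # Monte Carlo average over imputations

x = np.array([1.0, -2.0, 0.5, 3.0])
p = lex_forward(x, imputation_value=np.zeros(D))
assert np.isclose(p.sum(), 1.0)
```

The key design point the sketch illustrates is that the predictor only ever sees completed samples, so the selection probabilities π are the sole channel through which the selector influences the prediction.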

