AN ADDITIVE INSTANCE-WISE APPROACH TO MULTI-CLASS MODEL INTERPRETATION

Abstract

Interpretable machine learning offers insights into what factors drive a certain prediction of a black-box system. A large number of interpreting methods focus on identifying explanatory input features, which generally fall into two main categories: attribution and selection. A popular attribution-based approach is to exploit local neighborhoods for learning instance-specific explainers in an additive manner. This process is inefficient and susceptible to poorly-conditioned samples. Meanwhile, many selection-based methods directly optimize local feature distributions in an instance-wise training framework, and can thereby leverage global information from other inputs. However, they can only interpret single-class predictions, and many suffer from inconsistency across different settings due to their strict reliance on a pre-defined number of selected features. This work combines the strengths of both approaches and proposes a framework for learning local explanations simultaneously for multiple target classes. Our model explainer significantly outperforms additive and instance-wise counterparts on faithfulness while producing more compact and comprehensible explanations. We also demonstrate its capacity to select stable and important features through extensive experiments on various datasets and black-box model architectures.

1. INTRODUCTION

Black-box machine learning systems enjoy remarkable predictive performance at the cost of interpretability. This trade-off has motivated a number of interpreting approaches for explaining the behavior of these complex models. Such explanations are particularly useful for high-stakes applications such as healthcare (Caruana et al., 2015; Rich, 2016), cybersecurity (Nguyen et al., 2021), or criminal investigation (Lipton, 2018). While model interpretation can be done in various ways (Mothilal et al., 2020; Bodria et al., 2021), our discussion focuses on the feature-importance or saliency-based approach, that is, assigning relative importance weights to individual features with respect to the model's prediction on an input example. Features here refer to input components interpretable to humans; for high-dimensional data such as texts or images, features can be a bag of words/phrases or a group of pixels/super-pixels (Ribeiro et al., 2016). Explanations are generally made by selecting the top K features with the highest weights, signifying the K features most important to the black-box model's decision. Note that this work tackles feature selection locally for an input data point, instead of generating global explanations for an entire dataset.

An abundance of interpretation methods follows the removal-based explanation approach (Covert et al., 2021), which quantifies the importance of features by removing them from the model. Based on how feature influence is summarized into an explanation, methods in this line of work can be broadly categorized as feature attribution and feature selection. In general, attribution methods assign a relative importance score to each feature, whereas selection methods directly identify the subset of features most relevant to the model behavior being explained.

One popular approach to learning attribution is through an additive model (Ribeiro et al., 2016; Zafar & Khan, 2019; Zhao et al., 2021). The underlying principle was originally proposed by LIME (Ribeiro et al., 2016), which approximates the black-box model around a given input by fitting an interpretable additive (e.g., linear) surrogate on perturbed samples drawn from the input's local neighborhood.
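To make the additive principle concrete, below is a minimal LIME-style sketch, not the method proposed in this work. It assumes a hypothetical `black_box` callable mapping a batch of inputs to class probabilities; the zero-masking baseline, the proximity kernel, and all names are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_style_attribution(black_box, x, num_samples=1000, kernel_width=0.75):
    """Fit a locally weighted linear surrogate to the black box around
    input x; the surrogate's coefficients serve as feature attributions.

    `black_box` is assumed to map a batch of inputs (N, d) to class
    probabilities (N, C); we explain the predicted class of x.
    """
    d = x.shape[0]
    # Perturb x by randomly masking features (a binary interpretable representation).
    masks = np.random.binomial(1, 0.5, size=(num_samples, d))
    perturbed = masks * x  # masked features are zeroed out (one common baseline choice)
    # Query the black box on the perturbed neighborhood.
    target = np.argmax(black_box(x[None, :])[0])
    preds = black_box(perturbed)[:, target]
    # Weight samples by proximity to the original input (all-ones mask).
    dist = np.linalg.norm(masks - 1.0, axis=1) / np.sqrt(d)
    weights = np.exp(-(dist ** 2) / (kernel_width ** 2))
    # Fit the additive surrogate; its coefficients are the attributions.
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(masks, preds, sample_weight=weights)
    return surrogate.coef_

# An explanation is then read off by taking the top-K coefficients:
# top_k = np.argsort(np.abs(attributions))[::-1][:K]
```

Since the surrogate is re-fitted on freshly sampled perturbations for every input to be explained, this style of explainer incurs many black-box queries per instance and its output can vary with the sampled neighborhood, which is the inefficiency and instability the abstract refers to.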

