HUMAN-INTERPRETABLE MODEL EXPLAINABILITY ON HIGH-DIMENSIONAL DATA

Abstract

The importance of explainability in machine learning continues to grow as both neural-network architectures and the data they model become increasingly complex. Unique challenges arise when a model's input features become high-dimensional: on one hand, principled model-agnostic approaches to explainability become too computationally expensive; on the other, more efficient explainability algorithms lack natural interpretations for general users. In this work, we introduce a framework for human-interpretable explainability on high-dimensional data, consisting of two modules. First, we apply a semantically meaningful latent representation, both to reduce the raw dimensionality of the data and to ensure its human interpretability. These latent features can be learnt, e.g. explicitly as disentangled representations or implicitly through image-to-image translation, or they can be based on any computable quantities the user chooses. Second, we adapt the Shapley paradigm for model-agnostic explainability to operate on these latent features. This yields interpretable model explanations that are both theoretically controlled and computationally tractable. We benchmark our approach on synthetic data and demonstrate its effectiveness on several image-classification tasks.

1. INTRODUCTION

The explainability of AI systems is important, both for model development and model assurance. This importance continues to rise as AI models, and the data on which they are trained, become ever more complex. Moreover, methods for AI explainability must be adapted to maintain the human interpretability of explanations in the regime of highly complex data.

Many explainability methods exist in the literature. Model-specific techniques refer to the internal structure of a model in formulating explanations (Chen & Guestrin, 2016; Shrikumar et al., 2017), while model-agnostic methods are based solely on input-output relationships and treat the model as a black box (Breiman, 2001; Ribeiro et al., 2016). Model-agnostic methods offer wide applicability and, importantly, fix a common language for explanations across different model types. The Shapley framework for model-agnostic explainability stands out due to its theoretically principled foundation and its incorporation of interaction effects between the data's features (Shapley, 1953; Lundberg & Lee, 2017). The Shapley framework has been used for explainability in machine learning for years (Lipovetsky & Conklin, 2001; Kononenko et al., 2010; Štrumbelj & Kononenko, 2014; Datta et al., 2016). Unfortunately, the combinatorics required to capture interaction effects make Shapley values computationally intensive and thus ill-suited for high-dimensional data.

More computationally efficient methods have been developed to explain model predictions on high-dimensional data. Gradient- and perturbation-based methods measure a model prediction's sensitivity to each of its raw input features (Selvaraju et al., 2020; Zhou et al., 2016; Zintgraf et al., 2017).
Other methods estimate the mutual information between input features and the model's prediction (Chen et al., 2018a; Schulz et al., 2020), or generate counterfactual feature values that change the model's prediction (Chang et al., 2019; Goyal et al., 2019; Wang & Vasconcelos, 2020). See Fig. 1 for explanations produced by several of these methods (with details given in Sec. 3.5).

When thoroughly understood by the practitioner, these methods for model explainability can be useful, e.g. for model development. However, many alternative methods exist to achieve broadly the same goal (i.e. to monitor how outputs change as inputs vary) with alternative design choices that


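To make concrete why exact Shapley values are ill-suited to high-dimensional inputs, the following sketch (illustrative only, not the paper's method; the toy additive value function and all names are our own) computes them by direct enumeration of feature coalitions. The number of model evaluations grows as O(2^d) in the number of features d, which motivates operating on a small set of latent features instead.

```python
# Illustrative sketch: exact Shapley values by brute-force coalition
# enumeration. The value function, weights, and function names here are
# hypothetical; the exponential cost in the feature count is the point.
from itertools import combinations
from math import factorial

def exact_shapley(value_fn, n_features):
    """Shapley value of each feature for a coalition value function.

    value_fn(S) maps a frozenset of feature indices to the model's value
    when only those features are "present" (e.g. unmasked).
    """
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for k in range(len(others) + 1):
            # Shapley weight for coalitions of size k: k!(d-k-1)!/d!
            w = factorial(k) * factorial(n_features - k - 1) / factorial(n_features)
            for S in combinations(others, k):
                S = frozenset(S)
                # Weighted marginal contribution of feature i to coalition S
                phi[i] += w * (value_fn(S | {i}) - value_fn(S))
    return phi

# Toy additive value function: v(S) = sum of weights of present features.
# For an additive game the Shapley values recover the weights (up to
# floating-point error), and sum(phi) = v(all features) by efficiency.
weights = [1.0, 2.0, 3.0]
v = lambda S: sum(weights[j] for j in S)
phi = exact_shapley(v, 3)  # approximately [1.0, 2.0, 3.0]
```

Even this tiny example makes 2^(d-1) marginal-contribution evaluations per feature; at d = 1024 raw pixels the enumeration is hopeless, whereas a handful of semantically meaningful latent features keeps it tractable.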