A LEARNING THEORETIC PERSPECTIVE ON LOCAL EXPLAINABILITY

Abstract

In this paper, we explore connections between interpretable machine learning and learning theory through the lens of local approximation explanations. First, we tackle the traditional problem of performance generalization and bound the test-time predictive accuracy of a model using a notion of how locally explainable it is. Second, we explore the novel problem of explanation generalization, an important concern for a growing class of finite sample-based local approximation explanations. Finally, we validate our theoretical results empirically and show that they reflect what can be observed in practice.

1. INTRODUCTION

There has been growing interest in interpretable machine learning, which seeks to help people better understand their models. While interpretable machine learning encompasses a wide range of problems, it is a fairly uncontroversial hypothesis that there exists a trade-off between a model's complexity and general notions of interpretability. This hypothesis suggests a seemingly natural connection to the field of learning theory, which has thoroughly explored relationships between a function class's complexity and generalization. However, formal connections between interpretability and learning theory remain relatively unstudied.

Though there are several ways of conveying interpretability, one common and flexible approach is to use local approximations. Formally, local approximation explanations (which we will refer to as "local explanations") provide insight into a model's behavior as follows: for any black-box model f ∈ F and input x, the explanation system produces a simple function, g_x(x′) ∈ G_local, which approximates f on a chosen neighborhood, x′ ∼ N_x. Crucially, the freedom to specify both G_local and N_x grants local explanations great versatility.

In this paper, we provide two connections between learning theory and how well f can be approximated locally (i.e., the fidelity of local explanations). Our first result studies the standard problem of performance generalization by relating test-time predictive accuracy to a notion of local explainability. As it turns out, our focus on local explanations leads us to unique tools and insights from a learning theory point of view. Our second result identifies and addresses an unstudied, yet important, question regarding explanation generalization. This question pertains to a growing class of explanation systems, such as MAPLE (Plumb et al., 2018) and RL-LIM (Yoon et al., 2019), which we call finite sample-based local explanations¹.
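To make the notion of a local explanation concrete, the following is a minimal LIME-style sketch: it fits a weighted linear surrogate g_x to a black-box f over a Gaussian neighborhood around x. The Gaussian choice of N_x, the kernel bandwidth, and all function names here are illustrative assumptions, not the specific systems discussed in the paper.

```python
import numpy as np

def local_linear_explanation(f, x, sigma=0.5, n_samples=500, seed=0):
    """Fit a linear surrogate g_x(x') = a.x' + b approximating the
    black-box f on a Gaussian neighborhood N_x centered at x.
    A LIME-style sketch; parameters are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    # Draw neighborhood samples x' ~ N_x = N(x, sigma^2 I).
    X = x + sigma * rng.standard_normal((n_samples, d))
    y = np.array([f(xp) for xp in X])
    # Weight each sample by its proximity to x (Gaussian kernel).
    w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * sigma ** 2))
    # Solve the weighted least-squares problem for slope a and intercept b.
    A = np.hstack([X, np.ones((n_samples, 1))])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(sw[:, None] * A, sw * y, rcond=None)
    return coef[:-1], coef[-1]  # slope a, intercept b

# Example: locally explain a nonlinear black box around x0 = (0, 1).
f = lambda z: np.sin(z[0]) + z[1] ** 2
x0 = np.array([0.0, 1.0])
a, b = local_linear_explanation(f, x0)
```

Here G_local is the class of linear functions, and the recovered slope a approximates the gradient of f near x0, illustrating how the choice of G_local and N_x shapes what the explanation conveys.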
These methods learn their local approximations using a common finite sample drawn from the data distribution D (in contrast to canonical local approximation methods such as LIME (Ribeiro et al., 2016)) and, as a result, run the risk of overfitting to this finite sample. In light of this, we answer the following question: for these explanation-learning systems, how well does the quality of local explanations generalize to data not seen during training?
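The quality in question can be measured as the fidelity of g_x to f over the neighborhood N_x; comparing this quantity on training points versus held-out points gives an empirical handle on explanation generalization. Below is a minimal sketch of such a fidelity estimate; the squared-error metric, the Gaussian neighborhood, and the function name are illustrative assumptions rather than the paper's formal definitions.

```python
import numpy as np

def neighborhood_fidelity(f, g, x, sigma=0.5, n_samples=1000, seed=1):
    """Monte-Carlo estimate of E_{x' ~ N_x}[(f(x') - g(x'))^2],
    i.e., how poorly the local explanation g matches the black box f
    on a Gaussian neighborhood around x (lower is better)."""
    rng = np.random.default_rng(seed)
    X = x + sigma * rng.standard_normal((n_samples, x.shape[0]))
    errs = np.array([f(xp) - g(xp) for xp in X])
    return float(np.mean(errs ** 2))
```

Averaging this quantity over points from the training sample and over fresh points from D, and comparing the two averages, measures the explanation-generalization gap that finite sample-based methods risk widening by overfitting.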



¹ This terminology is not to be confused with "example-based explanations", where the explanation itself takes the form of data instances rather than a function.

