A LEARNING THEORETIC PERSPECTIVE ON LOCAL EXPLAINABILITY

Abstract

In this paper, we explore connections between interpretable machine learning and learning theory through the lens of local approximation explanations. First, we tackle the traditional problem of performance generalization and bound the test-time predictive accuracy of a model using a notion of how locally explainable it is. Second, we explore the novel problem of explanation generalization, which is an important concern for a growing class of finite sample-based local approximation explanations. Finally, we validate our theoretical results empirically and show that they reflect what can be seen in practice.

1. INTRODUCTION

There has been growing interest in interpretable machine learning, which seeks to help people better understand their models. While interpretable machine learning encompasses a wide range of problems, it is a fairly uncontroversial hypothesis that there exists a trade-off between a model's complexity and general notions of interpretability. This hypothesis suggests a seemingly natural connection to the field of learning theory, which has thoroughly explored relationships between a function class's complexity and generalization. However, formal connections between interpretability and learning theory remain relatively unstudied.

Though there are several ways of conveying interpretability, one common and flexible approach is to use local approximations. Formally, local approximation explanations (which we will refer to as "local explanations") provide insight into a model's behavior as follows: for any black-box model $f \in \mathcal{F}$ and input $x$, the explanation system produces a simple function $g_x \in \mathcal{G}_{\mathrm{local}}$ that approximates $f$ on a chosen neighborhood of $x$, i.e., on points $x' \sim N_x$. Crucially, the freedom to specify both $\mathcal{G}_{\mathrm{local}}$ and $N_x$ grants local explanations great versatility.

In this paper, we provide two connections between learning theory and how well $f$ can be approximated locally (i.e., the fidelity of local explanations). Our first result studies the standard problem of performance generalization by relating test-time predictive accuracy to a notion of local explainability. As it turns out, our focus on local explanations leads us to unique tools and insights from a learning theory point of view. Our second result identifies and addresses an unstudied yet important question regarding explanation generalization. This question pertains to a growing class of explanation systems, such as MAPLE (Plumb et al., 2018) and RL-LIM (Yoon et al., 2019), which we call finite sample-based local explanations¹.
These methods learn their local approximations using a common finite sample drawn from the data distribution $D$ (in contrast to canonical local approximation methods such as LIME (Ribeiro et al., 2016)) and, as a result, run the risk of overfitting to this finite sample. In light of this, we answer the following question: for these explanation-learning systems, how well does the quality of local explanations generalize to data not seen during training?

We address these questions with two bounds, which we outline now. Regarding performance generalization, we derive our first main result, Theorem 1, which bounds the expected test mean squared error (MSE) of any $f$ in terms of its MSE over the $m$ samples in the training set, $S = \{(x_i, y_i)\}_{i=1}^m$:

$$\underbrace{\mathbb{E}_{(x,y)\sim D}\left[(f(x) - y)^2\right]}_{\text{Test MSE}} \;\le\; \tilde{O}\Bigg(\underbrace{\frac{1}{m}\sum_{i=1}^{m}(f(x_i) - y_i)^2}_{\text{Train MSE}} \;+\; \underbrace{\mathbb{E}_{x\sim D,\, x'\sim N_x}\left[(g_{x'}(x) - f(x))^2\right]}_{\text{Interpretability Term (MNF)}} \;+\; \underbrace{\rho_S\,\hat{\mathcal{R}}_S(\mathcal{G}_{\mathrm{local}})}_{\text{Complexity Term}}\Bigg)$$

Regarding explanation generalization for finite sample-based explanation-learning systems, we apply a similar proof technique to obtain Theorem 2, which bounds the quality of the system's explanations on unseen data in terms of their quality on the data on which the system was trained:

$$\underbrace{\mathbb{E}_{x\sim D,\, x'\sim N_x}\left[(g_{x'}(x) - f(x))^2\right]}_{\text{Test MNF}} \;\le\; \underbrace{\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}_{x'\sim N_{x_i}}\left[(f(x_i) - g_{x'}(x_i))^2\right]}_{\text{Train MNF}} \;+\; \tilde{O}\left(\underbrace{\rho_S\,\hat{\mathcal{R}}_S(\mathcal{G}_{\mathrm{local}})}_{\text{Complexity Term}}\right)$$

Before summarizing our contributions, we discuss the key terms and their relationships.

• Interpretability terms: The terms involving MNF correspond to Mirrored Neighborhood Fidelity, a metric we use to measure local explanation quality. As we discuss in Section 3, this is a reasonable modification of the commonly used Neighborhood Fidelity (NF) metric (Ribeiro et al., 2016; Plumb et al., 2018). Intuitively, we generally expect MNF to be larger when the neighborhood sizes are larger, since each $g_x$ is required to extrapolate farther.
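To make the MNF term concrete, the following is a minimal sketch of a Monte Carlo estimate of Mirrored Neighborhood Fidelity. It assumes Gaussian neighborhoods $N_x$ and per-point linear explanations fit by least squares; the toy black box `f`, the widths, and all function names are our own illustrative choices, not details fixed by this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(X):
    # Toy black-box model: smooth but nonlinear in its single feature.
    return np.sin(3 * X[:, 0]) + 0.5 * X[:, 0] ** 2

def fit_local_linear(center, width, n_samples=200):
    """Fit g_center in G_local (linear functions) to f over N_center,
    taken here to be a Gaussian neighborhood of the given width."""
    Z = center + width * rng.standard_normal((n_samples, center.shape[0]))
    A = np.hstack([Z, np.ones((n_samples, 1))])  # design matrix with intercept
    coef, *_ = np.linalg.lstsq(A, f(Z), rcond=None)
    return lambda X: np.hstack([X, np.ones((len(X), 1))]) @ coef

def mnf(data, width, n_neighbors=20):
    """Monte Carlo estimate of MNF = E_{x~D, x'~N_x}[(g_{x'}(x) - f(x))^2]:
    the explanation fit at a *neighbor* x' is evaluated back at x itself."""
    errs = []
    for x in data:
        for _ in range(n_neighbors):
            x_prime = x + width * rng.standard_normal(x.shape)  # x' ~ N_x
            g = fit_local_linear(x_prime, width)
            errs.append((g(x[None])[0] - f(x[None])[0]) ** 2)
    return float(np.mean(errs))

X = rng.uniform(-2, 2, size=(30, 1))
mnf_narrow, mnf_wide = mnf(X, width=0.1), mnf(X, width=1.0)
print(mnf_narrow, mnf_wide)
```

Note the "mirrored" direction: each explanation is fit at the neighbor $x'$ but scored at $x$. On this toy model the wide-width estimate should come out substantially larger than the narrow one, matching the intuition above that larger neighborhoods force each $g_x$ to extrapolate farther.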
• Complexity term: This term measures the complexity of the local explanation system $g$ in terms of (a) the complexity of the local explanation class $\mathcal{G}_{\mathrm{local}}$ and (b) $\rho_S$, a quantity that we define and refer to as the neighborhood disjointedness factor. As we discuss in Section 4, $\rho_S$ is a value in $[1, \sqrt{m}]$ (where $m = |S|$) that is proportional to the level of disjointedness of the neighborhoods for points in the sample $S$. Intuitively, we expect $\rho_S$ to be larger when the neighborhood sizes are smaller, since smaller neighborhoods overlap less.

Notably, both of our bounds capture the following key trade-off: as neighborhood widths increase, MNF increases but $\rho_S$ decreases. As such, our bounds are non-trivial only if the neighborhoods $N_x$ can be chosen such that MNF remains small while $\rho_S$ grows slower than $\tilde{O}(\sqrt{m})$ (since $\hat{\mathcal{R}}_S(\mathcal{G}_{\mathrm{local}})$ typically decays as $\tilde{O}(1/\sqrt{m})$).

We summarize our main contributions as follows:

(1) We make a novel connection between performance generalization and local explainability, arriving at Theorem 1. Given the relationship between MNF and $\rho_S$, this bound roughly captures that an easier-to-interpret $f$ enjoys better generalization guarantees, a potentially valuable result when reasoning about $\mathcal{F}$ is difficult (e.g., for neural networks). Further, our proof technique may be of independent theoretical interest, as it provides a new way to bound the Rademacher complexity of a particular class of randomized functions (see Section 4).

(2) We motivate and explore an important generalization question about expected explanation quality. Specifically, we arrive at Theorem 2, a bound on test MNF in terms of train MNF. This bound suggests that practitioners can better guarantee good local explanation quality (measured by MNF) by using methods that encourage wider neighborhoods (see Section 5).

(3) We verify empirically on UCI regression datasets that our results non-trivially reflect the two types of generalization in practice.
First, we demonstrate that $\rho_S$ can indeed exhibit slower than $\tilde{O}(\sqrt{m})$ growth without significantly increasing the MNF terms. Second, for Theorem 2, we show that the generalization gap indeed improves with larger neighborhoods (see Section 6).

(4) Primarily to aid our theoretical results, we propose MNF as a novel yet reasonable measure of local explainability. Additionally, we argue that this metric presents a promising avenue for future study, as it may naturally complement NF and offer a unique advantage when evaluating local explanations on "realistic" on-distribution data (see Section 3).
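The explanation-generalization question behind Theorem 2 can also be simulated directly. The sketch below uses our own illustrative stand-in for a finite sample-based explainer (a linear model fit by kernel-weighted ridge regression over a fixed training sample, not the actual MAPLE or RL-LIM algorithms) and measures train versus test MNF at two neighborhood widths; in line with Theorem 2, one would expect the test-train gap to shrink as the neighborhoods widen:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(X):
    # Toy black-box model the explainer tries to approximate locally.
    return np.sin(3 * X[:, 0]) + 0.5 * X[:, 0] ** 2

def make_explainer(S, width):
    """Hypothetical finite sample-based explainer: g_{x'} is a linear model
    fit by kernel-weighted ridge regression over the fixed training sample S."""
    A = np.hstack([S, np.ones((len(S), 1))])  # design matrix with intercept
    f_S = f(S)
    def explain(x_prime):
        w = np.exp(-np.sum((S - x_prime) ** 2, axis=1) / (2 * width ** 2))
        lhs = A.T @ (w[:, None] * A) + 1e-6 * np.eye(A.shape[1])  # small ridge for stability
        coef = np.linalg.solve(lhs, A.T @ (w * f_S))
        return lambda X: np.hstack([X, np.ones((len(X), 1))]) @ coef
    return explain

def mnf(explain, data, width, n_neighbors=20):
    """Monte Carlo MNF: the explanation fit at neighbor x' is scored back at x."""
    errs = []
    for x in data:
        for _ in range(n_neighbors):
            g = explain(x + width * rng.standard_normal(x.shape))  # x' ~ N_x
            errs.append((g(x[None])[0] - f(x[None])[0]) ** 2)
    return float(np.mean(errs))

S_train = rng.uniform(-2, 2, size=(40, 1))  # sample the explainer is fit on
S_test = rng.uniform(-2, 2, size=(40, 1))   # unseen data from the same D
results = {}
for width in (0.05, 1.0):
    explain = make_explainer(S_train, width)
    results[width] = (mnf(explain, S_train, width), mnf(explain, S_test, width))
    print(f"width={width}: train MNF={results[width][0]:.4f}, "
          f"test MNF={results[width][1]:.4f}")
```

The narrow-width explainer effectively fits each $g_{x'}$ on the handful of training points near $x'$, which is exactly the overfitting risk described above for methods that learn explanations from a common finite sample.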



¹ This terminology is not to be confused with "example-based explanations," where the explanation itself is in the form of data instances rather than a function.

