CONSISTENT AND TRUTHFUL INTERPRETATION WITH FOURIER ANALYSIS

Abstract

For many interdisciplinary fields, ML interpretations need to be consistent with what-if scenarios related to the current case, i.e., if one factor changes, how does the model react? Although attribution methods are supported by elegant axiomatic systems, they mainly focus on individual inputs and are generally inconsistent. In this paper, we show that such inconsistency is not surprising: we prove an impossible trinity theorem, stating that interpretability, consistency, and efficiency cannot hold simultaneously. When consistent interpretation is required, we introduce a new notion called truthfulness as a relaxation of efficiency. Under the standard polynomial basis, we show that learning the Fourier spectrum is the unique way to design consistent and truthful interpreting algorithms. Experimental results show that, for neighborhoods with various radii, our method achieves 2x to 50x lower interpretation error than the other methods.

1. INTRODUCTION

Interpretability is a central problem in deep learning. During training, a neural network strives to minimize the training loss without other distracting objectives. However, to interpret the network, we have to construct a different model*, which tends to have a simpler structure and fewer parameters, e.g., a decision tree or a polynomial. Theoretically, these restricted models cannot perfectly interpret deep networks due to their limited representation power. Therefore, previous researchers had to introduce various relaxations. The most popular and elegant direction is the attribution methods with axiomatic systems (Sundararajan et al., 2017; Lundberg & Lee, 2017), which mainly focus on individual inputs. The interpretations of attribution methods do not automatically extend to neighboring points.

Take SHAP (Lundberg & Lee, 2017) as a motivating example, illustrated in Figure 1 on the task of sentiment analysis of movie reviews. In this example, the interpretations of the two slightly different sentences are not consistent: not only are the weights of each word significantly different, but also, after removing the word "very" of weight 19.9%, the network's output only drops by 97.8% - 88.7% = 9.1%. In other words, the interpretation does not explain the network's behavior even in a small neighborhood of the input.

Figure 1: Interpretations generated by SHAP on a movie review.

Inconsistency is not a vacuous concern. Imagine a doctor treating a diabetic patient with the help of an AI system. The patient has features A, B, and C, representing three positive signals from various tests. The AI recommends giving 4 units of insulin with the following explanation: A, B, and C have weights 1, 1, and 2, respectively, so 4 units in total. The doctor may then ask the AI: what if the patient only had A and B, but not C? One may expect the answer to be close to 2, as A+B has a weight of 2.
However, the network is highly non-linear and may output other suggestions, like 3 units, explaining that both A and B have a weight of 1.5. Such inconsistent behavior will drastically reduce the doctor's confidence in the interpretations, limiting the AI system's practical value.

Consistency (see Definition 3) is certainly not the only objective for interpretability. Equally important, efficiency is a commonly used axiom in attribution methods (Weber, 1988; Friedman & Moulin, 1999; Sundararajan & Najmi, 2020), also called local accuracy (Lundberg & Lee, 2017) or completeness (Sundararajan et al., 2017), stating that the model's output should equal the network's output for the given input (see Definition 4). Naturally, one may ask the following question:

Q1: Can we always generate an interpreting model that is both efficient and consistent?

Unfortunately, this is generally impossible. We prove the following theorem in Section 3:

Theorem 1 (Impossible trinity, informal version). Interpretability, consistency, and efficiency cannot hold simultaneously.

A few examples follow Theorem 1: (a) Attribution methods are interpretable and efficient, but not consistent. (b) The original (deep) network is consistent and efficient, but not interpretable. (c) If one model is interpretable and consistent, it cannot be efficient.

However, consistency is necessary in many scenarios, so one may have the follow-up question:

Q2: For consistent interpreting models, can they be approximately efficient?

The answer depends on how "approximately efficient" is defined. We introduce a new notion called truthfulness, which can be seen as a natural relaxation of efficiency, i.e., partial efficiency. Specifically, we split the functional space of f into two subspaces, the readable part and the unreadable part (see Definition 5). We say that g is truthful if it truthfully represents the readable part of f.
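The doctor's what-if scenario and the tension between efficiency and consistency can be reproduced on a toy function. The following sketch is our own illustrative construction, not the paper's experiments: it computes exact Shapley values for a small non-linear "network" with a three-way interaction. The attributions satisfy efficiency at the given input, yet the what-if of removing the third feature changes the output by 2 units, far from that feature's attributed weight of 2/3.

```python
from itertools import combinations
from math import factorial

def shapley(f, n):
    """Exact Shapley values of f over n present/absent (0/1) features."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                # Shapley weight for a coalition of size |S|
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                with_i = [1 if (j in S or j == i) else 0 for j in range(n)]
                without_i = [1 if j in S else 0 for j in range(n)]
                phi[i] += w * (f(with_i) - f(without_i))
    return phi

# Toy non-linear "network": two additive signals plus a 3-way interaction.
def f(x):
    a, b, c = x
    return 1.0 * a + 1.0 * b + 2.0 * a * b * c

phi = shapley(f, 3)
print(phi)            # [5/3, 5/3, 2/3]: efficiency holds, sum(phi) == f([1,1,1]) == 4
print(f([1, 1, 0]))   # 2.0: removing the third feature drops the output by 2, not by phi[2] == 2/3
```

Efficiency pins the attributions to this single input; as soon as a neighboring input such as (1, 1, 0) is queried, the additive reading of the weights breaks down, which is exactly the inconsistency Theorem 1 predicts.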
Notice that the unreadable part is not simply a "fitting error". Instead, it truthfully represents the higher-order non-linearities in the network that our interpreting model g, even doing its best, cannot cover. In short, what g tells is true, although it may not cover all the truth. Due to Theorem 1, this is essentially the best that consistent algorithms can achieve.

Truthfulness is a parameterized notion that depends on the choice of the readable subspace. While theoretically there are infinitely many possible choices of subspaces, in this paper we follow the previous research on interpretability with non-linearities (Sundararajan et al., 2020; Masoomi et al., 2022; Tsang et al., 2020) and use the basis that automatically induces interpretable terms like x_i x_j or x_i x_j x_k, which capture higher-order correlations and are easy to understand. The resulting subspace has the standard polynomial basis (or equivalently, the Fourier basis). Following the notions of consistency and truthfulness, our last question is:

Q3: Can we design consistent and truthful interpreting models with the polynomial basis?

It turns out that, when truthfulness is parameterized with the polynomial basis, designing consistent and truthful interpreting models is equivalent to learning the Fourier spectrum (see Lemma 1). In other words, this is the unique way to generate truthful and consistent interpretations of high-order correlations among input parameters.

In this paper, we focus on the case where f and g are Boolean functions, i.e., the input variables are binary. This is a commonly used assumption in the literature (LIME (Ribeiro et al., 2016), SHAP (Lundberg & Lee, 2017), Shapley Taylor (Sundararajan et al., 2020), and other methods (Sundararajan & Najmi, 2020; Zhang et al., 2021; Frye et al., 2020; Covert et al., 2020; Tsai et al., 2022)), although sometimes not explicitly stated.
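To make the Fourier view concrete, here is a minimal sketch with a hypothetical toy function (not the paper's learning algorithm). Over x in {-1, 1}^n, any f decomposes as f(x) = sum_S f_hat(S) * prod_{i in S} x_i; keeping only the low-degree (readable) coefficients yields a single model g that is consistent across the entire neighborhood and truthful on the readable subspace:

```python
from itertools import product, combinations

n = 3
inputs = list(product([-1, 1], repeat=n))

# Toy "network": a multilinear polynomial with a degree-3 (unreadable) term.
def f(x):
    return 0.5 * x[0] + 0.25 * x[0] * x[1] + 0.125 * x[0] * x[1] * x[2]

def chi(S, x):
    """Fourier character: product of the coordinates indexed by S."""
    p = 1
    for i in S:
        p *= x[i]
    return p

# Fourier coefficient f_hat(S) = E_x[f(x) * chi_S(x)] over the uniform cube.
subsets = [S for k in range(n + 1) for S in combinations(range(n), k)]
f_hat = {S: sum(f(x) * chi(S, x) for x in inputs) / len(inputs) for S in subsets}

# Readable part: degree <= 2 terms. g is one fixed model for every input.
def g(x):
    return sum(c * chi(S, x) for S, c in f_hat.items() if len(S) <= 2)

gap = max(abs(f(x) - g(x)) for x in inputs)
print(f_hat[(0, 1)], gap)   # 0.25 0.125: g recovers the readable spectrum exactly;
                            # the residual is exactly the degree-3 coefficient
```

Because g's coefficients equal f's low-degree spectrum, whatever g reports about x_0 or x_0 x_1 is true of f itself; the remaining gap is the unreadable degree-3 mass that no degree-2 model can carry, matching the efficiency loss that Theorem 1 makes unavoidable.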
For readers not familiar with Boolean functions, we remark that Boolean functions have very strong representation power. For example, empirically most human-readable interpretations can be converted into (ensembles of) decision trees, and theoretically all



* For simplicity, below we use model to denote the model that provides interpretation and network to denote the general black-box machine learning model that needs interpretation.

