CONSISTENT AND TRUTHFUL INTERPRETATION WITH FOURIER ANALYSIS

Abstract

For many interdisciplinary fields, ML interpretations need to be consistent with what-if scenarios related to the current case, i.e., if one factor changes, how does the model react? Although attribution methods are supported by elegant axiomatic systems, they mainly focus on individual inputs and are generally inconsistent. In this paper, we show that such inconsistency is not surprising by proving an impossible trinity theorem, which states that interpretability, consistency, and efficiency cannot hold simultaneously. When consistent interpretation is required, we introduce a new notion called truthfulness as a relaxation of efficiency. Under the standard polynomial basis, we show that learning the Fourier spectrum is the unique way to design consistent and truthful interpreting algorithms. Experimental results show that, for neighborhoods with various radii, our method achieves 2x-50x lower interpretation error than the other methods.

1. INTRODUCTION

Interpretability is a central problem in deep learning. During training, the neural network strives to minimize the training loss without other distracting objectives. However, to interpret the network, we have to construct a different model, which tends to have a simpler structure and fewer parameters, e.g., a decision tree or a polynomial. Theoretically, these restricted models cannot perfectly interpret deep networks due to their limited representation power, so previous researchers had to introduce various relaxations. The most popular and elegant direction is the attribution methods with axiomatic systems (Sundararajan et al., 2017; Lundberg & Lee, 2017), which mainly focus on individual inputs. The interpretations of attribution methods do not automatically extend to neighboring points. Take SHAP (Lundberg & Lee, 2017) as a motivating example, illustrated in Figure 1 on the task of sentiment analysis of movie reviews. In this example, the interpretations of two slightly different sentences are not consistent: not only are the weights of each word significantly different, but after removing the word "very", which has weight 19.9%, the network's output drops by only 97.8% − 88.7% = 9.1%. In other words, the interpretation does not explain the network's behavior even in a small neighborhood of the input.

Figure 1: Interpretations generated by SHAP on a movie review.

Inconsistency is not a vacuous concern. Imagine a doctor treating a diabetic with the help of an AI system. The patient has features A, B, and C, representing three positive signals from various tests. The AI recommends giving 4 units of insulin with the following explanation: A, B, and C have weights 1, 1, and 2, respectively, so 4 units in total. The doctor may then ask the AI: what if the patient only has A and B, but not C? One may expect the answer to be close to 2, as A+B has a weight of 2.
However, the network is highly non-linear and may output other suggestions, like 3 units, explaining that both A and B have a weight of 1.5. Such inconsistent behaviors drastically reduce the doctor's confidence in the interpretations, limiting the AI system's practical value.

Consistency (see Definition 3) is certainly not the only objective for interpretability. Equally important, efficiency is a commonly used axiom in the attribution methods (Weber, 1988; Friedman & Moulin, 1999; Sundararajan & Najmi, 2020), also called local accuracy (Lundberg & Lee, 2017) or completeness (Sundararajan et al., 2017), stating that the model's output should be equal to the network's output for the given input (see Definition 4). Naturally, one may ask the following question:

Q1: Can we always generate an interpreting model that is both efficient and consistent?

Unfortunately, this is generally impossible. We prove the following theorem in Section 3:

Theorem 1 (Impossible trinity, informal version). Interpretability, consistency, and efficiency cannot hold simultaneously.

A few examples follow from Theorem 1: (a) Attribution methods are interpretable and efficient, but not consistent. (b) The original (deep) network is consistent and efficient, but not interpretable. (c) If a model is interpretable and consistent, it cannot be efficient.

However, consistency is necessary for many scenarios, so one may have the follow-up question:

Q2: For consistent interpreting models, can they be approximately efficient?

The answer depends on how "approximately efficient" is defined. We introduce a new notion called truthfulness, which can be seen as a natural relaxation of efficiency, i.e., partial efficiency. Specifically, we split the functional space of f into two subspaces, the readable part and the unreadable part (see Definition 5). We say g is truthful if it truthfully represents the readable part of f.
Notice that the unreadable part is not simply a "fitting error". Instead, it truthfully represents the higher-order non-linearities in the network that our interpretation model g, even doing its best, cannot cover. In short, what g tells is true, although it may not cover all the truth. Due to Theorem 1, this is essentially the best that consistent algorithms can achieve.

Truthfulness is a parameterized notion, which depends on the choice of the readable subspace. While there are theoretically infinitely many possible choices of subspaces, in this paper we follow previous research on interpretability with non-linearities (Sundararajan et al., 2020; Masoomi et al., 2022; Tsang et al., 2020) and use the basis that automatically induces interpretable terms like x_i x_j or x_i x_j x_k, which capture higher-order correlations and are easy to understand. The resulting subspace has the standard polynomial basis (or equivalently, the Fourier basis). Following the notions of consistency and truthfulness, our last question is:

Q3: Can we design consistent and truthful interpreting models with the polynomial basis?

It turns out that, when truthfulness is parameterized with the polynomial basis, designing consistent and truthful interpreting models is equivalent to learning the Fourier spectrum (see Lemma 1). In other words, this is the unique way to generate truthful and consistent interpretations for higher-order correlations among input parameters. In this paper, we focus on the case where f and g are Boolean functions, i.e., the input variables are binary. This is a commonly used assumption in the literature (LIME (Ribeiro et al., 2016), SHAP (Lundberg & Lee, 2017), Shapley Taylor (Sundararajan et al., 2020), and other methods (Sundararajan & Najmi, 2020; Zhang et al., 2021; Frye et al., 2020; Covert et al., 2020; Tsai et al., 2022)), although it is sometimes not explicitly stated.
For readers not familiar with Boolean functions, we remark that Boolean functions have very strong representation power. For example, empirically most human-readable interpretations can be converted into (ensembles of) decision trees, and theoretically all (ensembles of) decision trees can be converted into Boolean functions. The widely used algorithm XGBoost (Chen & Guestrin, 2016) is based on an ensemble of decision trees. Given any Boolean function, we may expand it on the Fourier basis and get its Fourier spectrum. It is well known that the Fourier spectrum's complexity naturally characterizes the complexity of Boolean functions (O'Donnell, 2014); e.g., a small decision tree can be approximated with a low-degree sparse polynomial on the Fourier basis (Mansour, 1994). Therefore, we represent g by its Fourier spectrum. For learning truthful models, we apply two different methods from Boolean functional analysis, Harmonica (Hazan et al., 2018) and Low-degree (Linial et al., 1993). Both algorithms have rigorous theoretical guarantees on recovery performance and sample complexity. In Section 5, we demonstrate that on datasets like SST-2 and IMDb, Harmonica achieves 2x-50x lower interpretation error than other methods.

In summary, our contributions are:

• We prove the impossible trinity theorem for interpretability, which shows that interpretable algorithms cannot be consistent and efficient at the same time.

• For interpretable algorithms that are consistent but not efficient, we introduce a new notion called truthfulness, which can be seen as partial efficiency. Due to the impossible trinity theorem, this is the best one can achieve when consistency is required.
• For truthfulness with the polynomial basis, we prove that the problem is equivalent to learning Boolean functions, and empirically demonstrate that the Harmonica algorithm achieves much lower interpretation error and better truthfulness than the other methods.
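To make the Fourier-spectrum view above concrete, here is a small self-contained sketch (our own toy illustration, not from the paper): a depth-2 decision tree on {-1, 1}^3 is expanded in the Fourier basis by brute-force enumeration, and the resulting spectrum is sparse and of degree 2, as the cited results predict.

```python
from itertools import combinations, product

# Illustrative toy example (ours, not from the paper): a depth-2 decision
# tree on {-1, 1}^3, expanded in the Fourier basis by brute force.
def tree(x):
    # if x1 = +1 predict x2, otherwise predict x3 (1-indexed names)
    return x[1] if x[0] == 1 else x[2]

def fourier_coeff(f, S, n):
    # f_hat(S) = E_x[f(x) * prod_{i in S} x_i] under the uniform distribution
    total = 0.0
    for x in product((-1, 1), repeat=n):
        chi = 1
        for i in S:
            chi *= x[i]
        total += f(x) * chi
    return total / 2 ** n

n = 3
spectrum = {S: fourier_coeff(tree, S, n)
            for r in range(n + 1) for S in combinations(range(n), r)}
support = {S for S, c in spectrum.items() if abs(c) > 1e-9}
# tree(x) = 0.5*x2 + 0.5*x3 + 0.5*x1*x2 - 0.5*x1*x3: four terms, degree 2
```

Here the tree, with 8 possible inputs, compresses to just four Fourier coefficients of degree at most 2, illustrating the low-degree sparse approximation mentioned above.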

2. RELATED WORK

Interpretability is a critical topic in machine learning, and we refer the reader to (Doran et al., 2017; Lipton, 2018) for insightful general discussions. Below we discuss different types of interpretable models.

Model-specific interpretable models. Interpretable/white-box models are inherently ante-hoc and model-specific. One of the goals behind using interpretable models is to have inherent model interpretability. Current mainstream approaches include decision trees (Wang et al. (2015a), Balestriero (2017), Yang et al. (2018)), decision rules (Wang et al. (2015b), Su et al. (2015)), decision sets (Lakkaraju et al. (2019), Wang et al. (2017)), and linear models (Ustun & Rudin (2014), Ustun et al. (2014)).

Shapley value based explanations. The Shapley value (Shapley, 1953) was first introduced in cooperative game theory, with several strong axiomatic theoretical properties (Weber, 1988; Grabisch & Roubens, 1999). Recently, it has been adopted for explaining machine learning models (Lundberg & Lee, 2017; Štrumbelj & Kononenko, 2014; Sundararajan & Najmi, 2020; Wang et al., 2021a; Zhang et al., 2021; Frye et al., 2020; Yuan et al., 2021) and feature importance (Covert et al., 2020). Based on the Shapley value, Owen (1972) proposed the Shapley interaction value to study pairwise interactions between players. Grabisch & Roubens (1999) generalized it to interactions of higher orders and provided an axiomatic foundation. Starting from that, many researchers worked on higher-order feature interactions from different perspectives (Sundararajan et al., 2020; Masoomi et al., 2022; Tsang et al., 2020; Aas et al., 2021; Tsai et al., 2022).

Gradient-based explanations. These methods attribute the effect of input features on the model's predictions. Integrated Gradients (Sundararajan et al., 2017) distributes the change in output with respect to a baseline input by integrating gradients between the two input states, and Janizek et al. (2021) generalized this method to second order. Other methods in this family include DeepLIFT (Shrikumar et al., 2017) and LRP (Montavon et al., 2017).

3. OUR FRAMEWORK ON INTERPRETABILITY

We consider a Hilbert space H equipped with an inner product ⟨·,·⟩ and the induced ℓ2 norm ∥·∥. We denote the input space by X and the output space by Y, so H ⊆ X → Y. We focus on Boolean functions in this paper, so X = {−1, 1}^n and Y = R. We use −1/1 instead of 0/1 to represent the binary variables because it fits naturally into the Fourier basis. Due to the space limit, we defer a brief introduction to Fourier analysis to Appendix A. Fourier analysis of Boolean functions is a fascinating field, and we refer the reader to O'Donnell (2014) for a more comprehensive introduction.

We use G ⊂ H to denote the set of interpretable functions, and F ⊂ H to denote the set of machine learning models that need interpretation. In this paper, we focus on models that are not self-interpretable, i.e., f ∈ F \ G.

Definition 1 (Interpretable). A model g is interpretable if g ∈ G.

Interpretable models are generated by interpretation algorithms.

Definition 2 (Interpretation algorithm). An interpretation algorithm A takes f ∈ H and x ∈ X as inputs, and outputs A(f, x) ∈ G for interpreting f on x.

As mentioned previously, for many interdisciplinary fields the interpretation algorithm should be consistent.

Definition 3 (Consistent). Given f ∈ H, an interpretation algorithm A is consistent with respect to f if A(f, x) is the same for every x ∈ X.

Efficiency is an important property of the attribution methods.

Definition 4 (Efficient). A model g ∈ H is efficient with respect to f ∈ F on x ∈ X if g(x) = f(x).

The following theorem states that one cannot achieve the best of all three worlds.

Theorem 1 (Impossible trinity). For any interpretation algorithm A and function sets G ⊂ F ⊆ H, there exists f ∈ F such that with respect to f, either A is not consistent, or A(f, x) is not efficient on x for some x ∈ X.

Proof. Pick f ∈ F \ G. If A is consistent with respect to f, let g = A(f, x) ∈ G for any x ∈ X. If g(x) = f(x) for every x ∈ X, then g = f ∉ G, a contradiction. Therefore, there exists x ∈ X such that g(x) ≠ f(x).

Theorem 1 says efficiency is too restrictive for consistent interpretations. However, being inefficient does not mean the interpretation is wrong; it can still be truthful. Recall that a subspace V ⊂ H is closed if whenever {f_n} ⊂ V converges to some f ∈ H, then f ∈ V. We have:

Definition 5 (Truthful gap and truthful). Given a closed subspace V ⊆ H, g ∈ G ⊆ V and f ∈ F, the truthful gap of g to f for V is:

    T_V(f, g) = ∥f − g∥² − inf_{v ∈ V} ∥f − v∥²    (1)

When T_V(f, g) = 0, we say g is truthful for subspace V with respect to f, and we know (see e.g. Lemma 4.1 in Stein & Shakarchi (2009)) that ∀v ∈ V, ⟨f − g, v⟩ = 0. Truthfulness means g fully captures the information of f in the subspace V, so it can be seen as a natural relaxation of efficiency.

To characterize the interpretation quality, we introduce the following notion.

Definition 6 (Interpretation error). Given functions f, g ∈ X → Y, the interpretation error between f and g with respect to a measure µ is

    I_{p,µ}(f, g) = ( ∫_X |f(x) − g(x)|^p dµ(x) )^{1/p}    (2)

Notice that the interpretation error is only a loss function that measures the quality of the interpretation, not a metric on an ℓp space. Therefore, µ can be a non-uniform weight distribution following the data distribution. For real-world applications, interpreting the model over the whole X is unnecessary, so µ is usually defined as a uniform distribution on a neighborhood of the input x (under a certain metric), in which case we denote the distribution as N_x. Besides loss functions defined over the whole function space, we sometimes also need a notion of pointwise interpretation error.

Definition 7 (Pointwise interpretation error). Given two functions f, g ∈ H and an input x, the interpretation error of g to f on x is u(x) = |f(x) − g(x)|.

Discussion on universal consistency.
When discussing the notion of consistency, there are two entirely different settings, which we name "global consistency" and "universal consistency". Global consistency is what we focus on in this paper (Definition 3), i.e., the interpretation relates only to the input features and not to others. This scenario belongs to the category of "removal-based explanations", and almost all existing interpretable methods also belong to this category (26 of them are discussed in Covert et al. (2021)). On the other hand, universal consistency means the interpretation may depend on features that are different from the input features. Think about interpreting the sentence "I am happy" with the interpretation "this sentence does not include [very], so this person is not super happy". Universal consistency is much more challenging than global consistency, and we suspect more powerful machinery like foundation models is needed, which we leave as future work.
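The impossible trinity and the truthful gap can be checked on a two-variable toy example. The sketch below (our own construction, not from the paper) takes f(x) = x1*x2 with the readable subspace V spanned by the linear Fourier bases; the projection of f onto V is the best readable model, and it is truthful (zero truthful gap, Definition 5) while being efficient (Definition 4) at no input point.

```python
from itertools import product

pts = list(product((-1, 1), repeat=2))

def f(x):                      # f(x) = x1*x2: not in the linear span
    return x[0] * x[1]

def inner(h1, h2):
    # <h1, h2> = E_x[h1(x) * h2(x)] under the uniform distribution
    return sum(h1(x) * h2(x) for x in pts) / len(pts)

# Readable subspace V = span{1, x1, x2} (orthonormal Fourier bases)
basis = [lambda x: 1.0, lambda x: x[0], lambda x: x[1]]

# Project f onto V; all three coefficients vanish, so the best readable
# model is g = 0 and inf_{v in V} ||f - v||^2 = ||f||^2 = 1.
coeffs = [inner(f, chi) for chi in basis]
g = lambda x: sum(c * chi(x) for c, chi in zip(coeffs, basis))

best_err = inner(lambda x: f(x) - g(x), lambda x: f(x) - g(x))
truthful_gap = best_err - (inner(f, f) - sum(c * c for c in coeffs))

# g is truthful (gap 0) but efficient nowhere: g(x) = 0 while f(x) = +-1
efficient_points = [x for x in pts if g(x) == f(x)]
```

This matches example (c) after Theorem 1: g is interpretable and consistent (it does not depend on which x we interpret), but efficient at no point.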

4. LEARNING BOOLEAN FUNCTIONS

With the Fourier basis, we define our interpretable function set G as follows.

Definition 8 (C-readable function). Given a set of Fourier bases C, a function f is C-readable if it is supported on C. That is, for any χ_S ∉ C, ⟨f, χ_S⟩ = 0. Denote the corresponding subspace as V_C.

The readable notion is parameterized by C because it may differ case by case. If we set C to be all the single-variable bases, only linear functions are readable; if we set C to be all bases of degree at most 2, functions with pairwise interactions are also readable. Moreover, if we further add one higher-order term to C, e.g., χ_{x1,x2,x3,x4}, we can also reason about the factor x1x2x3x4 in the interpretation, which might be an important empirical factor that people can easily understand.

Starting from the basis set C, we have the following formula for computing the truthful gap.

Lemma 1 (Truthful gap for Boolean functions). Given a set of Fourier bases C and two functions f, g ∈ {−1, 1}^n → R, the truthful gap of g to f for C is

    T_{V_C}(f, g) = Σ_{χ_S ∈ C} ⟨f − g, χ_S⟩²    (3)

Proof. Denote the complement basis set as C̄. We may expand f, g, v on both basis sets, and get:

    ∥f − g∥² − inf_{v ∈ V_C} ∥f − v∥²
    = Σ_{S ∈ C} ⟨f − g, χ_S⟩² + Σ_{S ∈ C̄} ⟨f, χ_S⟩² − inf_{v ∈ V_C} ( Σ_{S ∈ C} ⟨f − v, χ_S⟩² + Σ_{S ∈ C̄} ⟨f, χ_S⟩² )
    = Σ_{S ∈ C} ⟨f − g, χ_S⟩² − inf_{v ∈ V_C} Σ_{S ∈ C} ⟨f − v, χ_S⟩²
    = Σ_{S ∈ C} ⟨f − g, χ_S⟩²,

where the last equality holds because we can set the Fourier coefficients v̂_S = f̂_S for every S ∈ C, which gives ⟨f − v, χ_S⟩ = 0.

With the previous definitions, it becomes clear that finding a truthful interpretation g is equivalent to accurately learning a Boolean function with respect to the readable basis set C. Intuitively, this means we want algorithms that compute the coefficients of the bases in C, i.e., the importance of bases like x1, x2x5, x2x6x7, etc.
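Lemma 1 can be verified numerically by brute force. The sketch below (our own check, with an arbitrary f, g, and C of degree at most 2) computes the truthful gap both from Definition 5, via projection onto V_C, and from the right-hand side of Eqn. (3), and confirms the two agree.

```python
from itertools import combinations, product

n = 3
pts = list(product((-1, 1), repeat=n))

def chi(S):
    # the Fourier basis function chi_S(x) = prod_{i in S} x_i
    def b(x):
        out = 1
        for i in S:
            out *= x[i]
        return out
    return b

def inner(h1, h2):
    return sum(h1(x) * h2(x) for x in pts) / len(pts)

# Readable bases C: all terms of degree at most 2 (our arbitrary choice)
C = [S for r in range(3) for S in combinations(range(n), r)]

f = lambda x: x[0] + 0.5 * x[0] * x[1] + 2.0 * x[0] * x[1] * x[2]
g = lambda x: 0.8 * x[0]                 # some candidate interpretation
diff = lambda x: f(x) - g(x)

# Right-hand side of Eqn. (3)
lemma_gap = sum(inner(diff, chi(S)) ** 2 for S in C)

# Definition 5: ||f - g||^2 minus the best achievable error over V_C,
# obtained by projecting f onto span(C)
proj_err = inner(f, f) - sum(inner(f, chi(S)) ** 2 for S in C)
def_gap = inner(diff, diff) - proj_err
```

Both computations give 0.2² + 0.5² = 0.29: the degree-3 term of f contributes to neither, since no readable v (and no readable g) can reduce it.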
Learning Boolean functions is a classical problem in learning theory, with many algorithms such as the KM algorithm (Kushilevitz & Mansour, 1991), the Low-degree algorithm (Linial et al., 1993), and Harmonica (Hazan et al., 2018). We pick two algorithms for our task: Harmonica and Low-degree. Compared with Low-degree, Harmonica has much better sampling efficiency based on compressed sensing techniques. Specifically, for general real-valued decision trees with s leaf nodes and bounded by B, the sample complexity of Harmonica is Õ(B²s²/ε · log n), while that of Low-degree is Õ(B⁴s²/ε² · log n) (Hazan et al., 2018). However, the Low-degree algorithm works in more general settings, while the theoretical guarantee of Harmonica depends on the assumption that f is approximately sparse in the Fourier space. Therefore, we include both algorithms for completeness. Empirically, Harmonica performs much better than Low-degree. We defer the algorithm descriptions and theoretical guarantees to Appendix B and Appendix C, and the comparison of our algorithms with existing algorithms to Appendix D.

Remarks. The theoretical guarantee of Harmonica assumes the target function f is approximately sparse in the Fourier space, which means most of the energy of the function is concentrated on the bases in C. This is not a strong assumption: if f is not approximately sparse, it has energy on many different bases, more specifically, bases of higher orders. In other words, f has a large variance and is difficult to interpret, and in this case no existing algorithm would be able to give consistent and meaningful interpretations. Likewise, although the Low-degree algorithm does not assume sparsity of f, it cannot learn all possible functions accurately either. There are 2^n different bases, and if we want to learn the coefficients of all of them, the cumulative error of g is of order Ω(2^n ε), which is exponentially large.
This is not surprising, due to the no-free-lunch theorem in generalization theory: we cannot expect to learn arbitrary functions without exponentially many samples.

5. EXPERIMENTS

5.1. ANALYSIS ON THE POLYNOMIAL FUNCTIONS

To investigate the performance of different interpretation methods, we manually inspect the outputs of different algorithms (LIME (Ribeiro et al., 2016), SHAP (Lundberg & Lee, 2017), Shapley Interaction Index (Owen, 1972), Shapley Taylor (Sundararajan et al., 2020; Hamilton et al., 2022), Faith-SHAP (Tsai et al., 2022), Harmonica, and Low-degree) on low-order polynomial functions. We observe that all algorithms can accurately learn the coefficients of first-order polynomials. For second-order polynomial functions, only Shapley Taylor, Faith-SHAP, Harmonica, and Low-degree learn all the coefficients accurately. For third-order polynomial functions, only Faith-SHAP, Harmonica, and Low-degree succeed. Due to the space limit, we defer the details to Appendix E. Faith-SHAP has a delicate representation theorem, which assigns coefficients to different terms under the Möbius transform. Since the basis induced by the Möbius transform is not orthonormal, it is not clear to us whether Faith-SHAP can theoretically compute accurate coefficients for higher-order functions. Moreover, the running time of Faith-SHAP has an exponential dependency on n, so empirically weighted sampling on subsets of features is needed (Tsai et al., 2022). This might be the main reason that our algorithms outperform Faith-SHAP in the experiments with real datasets.

5.2. EXPERIMENTAL SETUP

In the rest of this section, we quantitatively evaluate the interpretation error I_{p,N_x}(f, g) and truthful gap T_{V_C}(f, g) of Harmonica and other baseline algorithms on NLP and vision tasks. In our experiments, we choose 2nd-order and 3rd-order Harmonica, which correspond to setting C to all terms of order at most 2 and 3, respectively. The baseline algorithms chosen for comparison include LIME, Integrated Gradients (Sundararajan et al., 2017), SHAP, Integrated Hessians (Janizek et al., 2021), Shapley Taylor interaction index, and Faith-SHAP, where the first three are first-order algorithms and the last three are second-order algorithms. The two language tasks are the SST-2 (Socher et al., 2013) dataset for sentiment analysis and the IMDb (Maas et al., 2011) dataset for movie review classification; the vision task is ImageNet (Krizhevsky et al., 2012) for image classification.

5.3. SENTIMENT ANALYSIS

We start with a binary sentiment classification task on the SST-2 dataset. We interpret a convolutional neural network (see details in Appendix F) trained with the Adam (Kingma & Ba, 2015) optimizer for 10 epochs; the model has a test accuracy of 80.6%. For a given input sentence x of length l, we define the induced neighborhood N_x through a masking operation on the sentence. The radius 0 ≤ r ≤ l is defined as the maximum number of masked words.

Results on interpretation error. Figure 2 shows the interpretation error evaluated under neighborhoods with radius ranging from 1 to ∞. Here ∞ represents the maximum sentence length, which may vary across data points. The interpretation error is evaluated under the L2, L1, and L0 norms. Here L2 and L1 are defined according to Eqn. (2) with p = 2 and p = 1, respectively, and L0 denotes ∫_X 1{|f(x) − g(x)| ≥ 0.1} dµ(x). The detailed numerical results are presented in Table 4 in Appendix G. We can see that Harmonica consistently outperforms all the other baselines on all radii.

To estimate the truthful gap, by Eqn. (3) we have

    T_{V_C}(f, g) = Σ_{χ_S ∈ C} ⟨f − g, χ_S⟩²,

and we perform a sampling-based estimation of this quantity. It is worth mentioning that since the size of the set C_d satisfies |C_d| = Σ_{i=0}^{d} (n choose i) and the max length of all sentences is n* = 50, the summation of orthonormal bases Σ_{χ_S ∈ C} χ_S(x) is easy to compute on every sample x ∈ {−1, 1}^n (for very large n*, we perform another sampling step on this function).

Results on truthful gap. Figure 3 shows the truthful gap evaluated on the SST-2 dataset. We can see that Harmonica achieves the best performance for C_2 and C_3. For the simple linear case C_1, Harmonica is almost as good as LIME.
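The masking-based evaluation above can be sketched as a small Monte-Carlo routine. The code below is our own simplification: the `model` and `interp` callables stand in for f and g, and masking a word is modeled as flipping its coordinate to -1.

```python
import random

def sample_neighbor(x, radius, rng):
    # mask up to `radius` randomly chosen positions (set them to -1)
    x = list(x)
    k = rng.randint(0, radius)
    for i in rng.sample(range(len(x)), k):
        x[i] = -1
    return tuple(x)

def interpretation_error(model, interp, x, radius, p=2, n_samples=2000, seed=0):
    # Monte-Carlo estimate of I_{p, N_x}(f, g) from Eqn. (2), with N_x the
    # uniform distribution over the radius-r masking neighborhood of x
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = sample_neighbor(x, radius, rng)
        total += abs(model(z) - interp(z)) ** p
    return (total / n_samples) ** (1.0 / p)
```

A perfect interpretation scores zero at every radius, and a constant offset b between f and g shows up directly as an L1 error of b, which matches how the reported errors should be read.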

5.4. MOVIE REVIEW CLASSIFICATION

The IMDb dataset contains long paragraphs, each with many sentences. For the readability of results, we treat sentences as units instead of words: masking several words in a sentence may make the whole paragraph hard to understand or even meaningless, while masking a critical sentence has meaningful semantic effects. Therefore, the radius is defined as the maximum number of masked sentences. By default, we use periods, colons, and exclamation marks to separate sentences. The target network to be interpreted is a convolutional neural network (see details in Appendix F) trained over this dataset with an accuracy of 85.6%.

Results on interpretation error. Figure 4 shows the interpretation error evaluated on the IMDb dataset under the same settings as the SST-2 dataset, i.e., sampling-based estimation, with the slight modification that on IMDb the masking operation is performed on sentences in the input paragraph (with the definition of radii changed accordingly). We can see that Harmonica consistently outperforms all the other baselines on all radii. The detailed numerical results are presented in the appendix.

Results on truthful gap. Figure 5 shows the truthful gap evaluated on the IMDb dataset under the same settings as the SST-2 dataset. We can see that Harmonica consistently outperforms all the other baselines.

5.5. IMAGE CLASSIFICATION

We use ImageNet (Krizhevsky et al., 2012) for the image classification task.

Results on interpretation error. Figure 6 shows the interpretation error evaluated on 1000 random images from ImageNet, where the masking operation is performed on 16 superpixels in each input image (with the definition of radii changed accordingly). We can see that when the neighborhood's radius is greater than 1, Harmonica outperforms all the other baselines. The detailed numerical results are presented in the appendix.

Results on truthful gap. Figure 7 shows the truthful gap evaluated on the ImageNet dataset. We can see that Harmonica consistently outperforms all the other baselines.

5.6. ADDITIONAL EXPERIMENTS

We further explore the sample complexity of the Harmonica and Low-degree algorithms in Appendix H, which shows that Harmonica achieves better performance with the same sample size.

A PRELIMINARIES ON FOURIER ANALYSIS

We first introduce the Fourier basis.

Definition 9 (Fourier basis). For any subset of variables S ⊆ [n], we define the corresponding Fourier basis as χ_S(x) = Π_{i∈S} x_i, a function in {−1, 1}^n → {−1, 1}.

The Fourier basis is a complete orthonormal basis for Boolean functions under the uniform distribution on {−1, 1}^n. We remark that this uniform distribution is used for theoretical analysis and algorithm design, and is different from the measure µ for interpretation quality assessment in Definition 6. We define the inner product as follows.

Definition 10 (Inner product). Given two functions f, g ∈ {−1, 1}^n → R, their inner product is:

    ⟨f, g⟩ = 2^{−n} Σ_{x ∈ {−1,1}^n} f(x)g(x) = E_{x ∼ {−1,1}^n}[f(x)g(x)]

Then we can compute the Fourier spectrum of any Boolean function on the Fourier basis.

Definition 11 (Fourier expansion). Any Boolean function f ∈ {−1, 1}^n → R can be expanded as f(x) = Σ_{S⊆[n]} f̂(S) χ_S(x), where f̂(S) = ⟨f, χ_S⟩ is the Fourier coefficient on S. All the Fourier coefficients together are called the Fourier spectrum of f.

The inner product defines a kind of similarity between two functions and is invariant under different bases. Specifically, we have the following theorem.

Definition 12 (Plancherel's Theorem). Given two functions f, g ∈ {−1, 1}^n → R,

    ⟨f, g⟩ = Σ_{S⊆[n]} f̂(S) ĝ(S)

Setting f = g gives Parseval's identity: E[f²] = Σ_S f̂(S)².

B HARMONICA ALGORITHM

Algorithm 1 Harmonica
1. Given uniformly random samples x_1, ..., x_T, evaluate them on f: {f(x_1), ..., f(x_T)}.
2. Solve the following regularized regression problem:

    argmin_{α ∈ R^{|C|}} Σ_{i=1}^{T} ( Σ_{S: χ_S ∈ C} α_S χ_S(x_i) − f(x_i) )² + λ∥α∥_1    (4)

3. Output the polynomial g(x) = Σ_{S: χ_S ∈ C} α_S χ_S(x).

To present the theoretical guarantees of the Harmonica algorithm, we introduce the following definition, which is slightly different from its original version in Hazan et al. (2018).

Definition 13 (Approximately sparse function).
We say a function f ∈ {−1, 1}^n → R is (ϵ, s, C)-bounded if E[(f − Σ_{χ_S ∈ C} f̂(S)χ_S)²] ≤ ϵ and Σ_S |f̂(S)| ≤ s.

Here, f being (ϵ, s, C)-bounded means it is almost readable and has a bounded ℓ1 norm. Our algorithm is slightly different from the original algorithm proposed by Hazan et al. (2018), but similar theoretical guarantees still hold, as stated below.

Theorem 2. Given an (ε/4, s, C)-bounded function f, with T = Õ(s²/ε · log |C|) samples, Algorithm 1 outputs g such that E∥f − g∥² ≤ ε, with probability at least 1/2.

Our proof is similar to the one in the original paper (Hazan et al., 2018), with changes in the readable notion, which is now more flexible than being low-order. First recall the classical Chebyshev inequality.

Theorem 3 (Multidimensional Chebyshev inequality). Let X be an m-dimensional random vector with expected value µ = E[X] and covariance matrix V = E[(X − µ)(X − µ)^T]. If V is positive definite, then for any real number δ > 0:

    P( √((X − µ)^T V^{−1} (X − µ)) > δ ) ≤ m/δ²

Proof of Theorem 2. Let f be an (ε/4, s, C)-bounded function written in the orthonormal basis as Σ_S f̂(S)χ_S. We can equivalently write f as f = h + g, where h is supported on C and includes only the coefficients of magnitude at least ε/4s, together with the constant term of the polynomial expansion of f. Since L1(f) = Σ_S |f̂(S)| ≤ s, we know h is (4s²/ε + 1)-sparse. The function g is the sum of the remaining f̂(S)χ_S terms not included in h. Denote by R the set of bases that appear in C but not in h, so the coefficient of f on each basis in R is at most ε/4s.

Draw m (to be chosen later) random labeled examples (z_1, y_1), ..., (z_m, y_m) and enumerate all N = |C| basis functions χ_S ∈ C as {χ_1, ..., χ_N}. Form the matrix A with A_ij = χ_j(z_i), and consider the problem of recovering the (4s²/ε + 1)-sparse x given Ax + e = y, where x is the vector of coefficients of h, the i-th entry of y equals y_i, and e_i = g(z_i). We will prove that with constant probability over the choice of the m random examples, ∥e∥_2 ≤ √(εm).

Applying Theorem 5 in Hazan et al. (2018) with η = √ε, and observing that σ_{4s²/ε+1}(x)_1 = 0 (see the definition in that theorem), we recover x′ such that ∥x − x′∥²_2 ≤ c₂²ε for some constant c₂. As such, for the function f̃ = Σ_{i=1}^{N} x′_i χ_i, we have E∥h − f̃∥² ≤ c₂²ε by Parseval's identity. Note, however, that we may rescale ε by the constant factor 1/(2c₂²) to obtain error ε/2 and only incur an additional constant (multiplicative) factor in the sample complexity bound.

By the definition of g, we have

    ∥g∥² = Σ_{S: χ_S ∉ C} f̂(S)² + Σ_{S ∈ R} f̂(S)²    (5)

where each f̂(S) for S ∈ R has magnitude at most ε/4s. By Fact 4 in Hazan et al. (2018) and Parseval's identity, we have Σ_{S ∈ R} f̂(S)² ≤ ε/4. Since f is (ε/4, s, C)-bounded, we have Σ_{S: χ_S ∉ C} f̂(S)² ≤ ε/4. Thus ∥g∥² is at most ε/2. Therefore, by the triangle inequality, E∥f − f̃∥² ≤ E∥h − f̃∥² + E∥g∥² ≤ ε.

It remains to bound ∥e∥_2. Since the examples are chosen independently, the entries e_i = g(z_i) are independent random variables. Since g is a linear combination of orthonormal monomials (not including the constant term), we have E_{z∼D}[g(z)] = 0. We can apply linearity of variance (the covariance of χ_i and χ_j is zero for all i ≠ j) and calculate

    Var(g(z_i)) = Σ_{S: χ_S ∉ C} f̂(S)² + Σ_{S ∈ R} f̂(S)²

With the same calculation as in (5), Var(g(z_i)) is at most ε/2. Now consider the covariance matrix V of the vector e, which equals E[ee^⊤] (recall that every entry of e has mean 0). Then V is diagonal (the covariance between two independent samples is zero), and every diagonal entry is at most ε/2. Applying Theorem 3, we have P(∥e∥_2 > √(ε/2) · δ) ≤ m/δ². Setting δ = √(2m), we conclude that P(∥e∥_2 > √(εm)) ≤ 1/2. Hence with probability at least 1/2, we have ∥e∥_2 ≤ √(εm). From Theorem 5 in Hazan et al. (2018), we may choose m = Õ(s²/ε · log N). This completes the proof.
Note that the success probability 1/2 above can be boosted to any constant probability with a constant-factor loss in sample complexity. For the running time, we refer to Allen-Zhu & Yuan (2016) for optimizing linear regression with ℓ₁ regularization. The running time is O((T log(1/ϵ) + L/ϵ) · |C|), where L is the smoothness of each summand in the objective. Since each χ_S takes values in {−1, 1}, the smoothness is bounded by the number of entries in each summand, which is |C|. Therefore, the running time is bounded by O((T log(1/ϵ) + |C|/ϵ) · |C|).
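The recovery step in the proof, solving y = Ax + e for a sparse coefficient vector under an ℓ₁ penalty, can be sketched as follows. This is a minimal illustration, not the paper's implementation: we substitute a plain ISTA loop for the Allen-Zhu & Yuan (2016) solver, and the target function, basis set C, sample size, and regularization weight are our own illustrative choices.

```python
import itertools
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, y, lam, steps=300):
    """Minimize (1/2m)||Ax - y||^2 + lam*||x||_1 by iterative soft thresholding."""
    m, N = A.shape
    x = np.zeros(N)
    eta = m / np.linalg.norm(A, 2) ** 2  # step size 1/L for the smooth part
    for _ in range(steps):
        grad = A.T @ (A @ x - y) / m
        x = soft_threshold(x - eta * grad, eta * lam)
    return x

rng = np.random.default_rng(1)
n, m = 6, 2000
# C = all parity functions of degree <= 2 (plus the constant term)
C = [()] + [S for k in (1, 2) for S in itertools.combinations(range(n), k)]
Z = rng.choice([-1.0, 1.0], size=(m, n))
A = np.column_stack([np.prod(Z[:, list(S)], axis=1) if S else np.ones(m) for S in C])
y = 0.7 * Z[:, 0] - 0.3 * Z[:, 1] * Z[:, 2]  # a 2-sparse true spectrum, no noise
x_hat = lasso_ista(A, y, lam=0.01)           # recovers the two nonzero coefficients
```

Because random parities are nearly orthonormal over uniform samples, the ℓ₁-regularized fit zeroes out the spurious coefficients and keeps only the two true ones, up to a small shrinkage of order lam.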

C LOW-DEGREE ALGORITHM

The low-degree algorithm is based on the concentration inequality, and it estimates the coefficient of each basis individually.

Algorithm 2 Low-degree
1. Given uniformly random samples x₁, …, x_T, evaluate f on them: f(x₁), …, f(x_T).
2. For every χ_S ∈ C, let

$$\hat{g}_S = \frac{1}{T}\sum_{i=1}^{T} f(x_i)\chi_S(x_i).$$

3. Output the polynomial g(x) = Σ_{S: χ_S∈C} \hat{g}_S χ_S(x).

Theorem 4 (Linial et al. (1993)). Given any ϵ, δ > 0 and assuming that the function f is bounded by B, when T ≥ (2B²/ϵ²) log(2|C|/δ), we have

$$\Pr\big[\forall \chi_S \in C:\ |\hat{g}_S - \hat{f}(S)| \le \epsilon\big] \ge 1 - \delta.$$

Theorem 4 was proved using the Hoeffding bound; we include the proof here for completeness.

Proof. For each S, the estimator \hat{g}_S averages T independent samples of f(x)χ_S(x), each bounded in [−B, B], so we can directly apply the Hoeffding bound:

$$\Pr\big[|\hat{g}_S - \hat{f}(S)| \ge \epsilon\big] \le 2\exp\Big(-\frac{2T\epsilon^2}{4B^2}\Big) = 2\exp\Big(-\frac{T\epsilon^2}{2B^2}\Big).$$

Since T ≥ (2B²/ϵ²) log(2|C|/δ), the right-hand side is bounded by δ/|C|, and Theorem 4 follows by a union bound over all χ_S ∈ C.
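The steps of the low-degree algorithm can be sketched directly in NumPy. This is a minimal illustration for the uniform distribution over {−1, 1}^n; the toy function, basis family, and sample size below are ours, not taken from the paper's experiments.

```python
import itertools
import numpy as np

def chi(S, X):
    """Parity basis chi_S(x) = prod_{i in S} x_i, evaluated row-wise on X."""
    return np.prod(X[:, list(S)], axis=1) if S else np.ones(len(X))

def low_degree(f, n, C, T, rng):
    """Estimate the Fourier coefficient of every chi_S in C from T uniform samples."""
    X = rng.choice([-1.0, 1.0], size=(T, n))  # step 1: uniform random inputs
    y = f(X)                                   # evaluate f on the samples
    # step 2: empirical average of f(x) * chi_S(x) estimates each coefficient
    return {S: float(np.mean(y * chi(S, X))) for S in C}

# Toy check: f(x) = 0.5*x0 - 0.25*x1*x2 has coefficients 0.5 on {0}, -0.25 on {1,2}.
rng = np.random.default_rng(0)
f = lambda X: 0.5 * X[:, 0] - 0.25 * X[:, 1] * X[:, 2]
C = [S for k in (1, 2, 3) for S in itertools.combinations(range(3), k)]
g = low_degree(f, n=3, C=C, T=20000, rng=rng)
```

With T = 20000 samples, every estimate concentrates within a few hundredths of its true coefficient, in line with the Hoeffding bound in Theorem 4.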

D DISCUSSION ON THE EXISTING ALGORITHMS

In this section, we compare our approach with the existing techniques from the perspectives of interpretation error and truthfulness.

LIME (Ribeiro et al., 2016) Given an input x, LIME samples neighborhood points according to a sampling distribution Π_x and optimizes the following program:

$$\min_{g \in G}\ L(f, g, \Pi_x) + \Omega(g)$$

where L is the loss function measuring the distance between f and g on the sampled data points, G is the set of readable functions (e.g., the set of linear functions), and Ω(·) characterizes the complexity of g. In other words, LIME minimizes the fitting error while simultaneously minimizing the complexity of g (which is usually the sparsity of the linear function). By minimizing L, LIME also works towards minimizing the interpretation error, but the approach is purely heuristic, without theoretical guarantees. Although the readable function set can easily be generalized to include higher-order terms, the sampling distribution Π_x is not uniform, so it is difficult to incorporate the orthonormal basis into their framework. In other words, the model LIME computes is not truthful.

Attribution methods As we discussed in the introduction, attribution methods mainly focus on individual inputs instead of neighboring points. Therefore, it is difficult for attribution methods to achieve low inconsistency, especially for first-order methods like SHAP (Lundberg & Lee, 2017) and IG (Sundararajan et al., 2017). For higher-order attribution methods, consistency can potentially be improved due to their enhanced representation power. The classical Shapley interaction index has the problem of not precisely fitting the underlying function, as observed by Sundararajan et al. (2020), who proposed the Shapley Taylor interaction index with better empirical performance. The Shapley Taylor interaction index satisfies the generalized efficiency axiom, which says that for all f : {−1, 1}^n → R,

$$\sum_{S \subseteq [n],\, |S| \le k} I^k_S(f) = f([n]) - f(\emptyset)$$

We remark that neither the Shapley interaction index nor the Shapley Taylor interaction index was originally designed for consistent interpretations, so they do not specify how to generalize the interpretation to neighboring points.
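For concreteness, the LIME program discussed earlier in this section can be sketched as a weighted, penalized linear fit. This is an illustrative sketch, not LIME's reference implementation: the exponential proximity kernel stands in for Π_x, and a closed-form ridge penalty stands in for a generic Ω.

```python
import numpy as np

def lime_fit(f, x, n_samples=2000, width=1.0, alpha=0.1, rng=None):
    """Fit a weighted linear surrogate to f around x in {-1, 1}^n."""
    rng = rng or np.random.default_rng(0)
    n = len(x)
    # Perturb each coordinate of x independently with probability 1/2.
    Z = np.where(rng.random((n_samples, n)) < 0.5, x, -x)
    # Proximity kernel: closer perturbations (smaller Hamming distance) get more weight.
    w = np.exp(-np.sum(Z != x, axis=1) / width)
    y = f(Z)
    Zw = Z * w[:, None]
    # Ridge-penalized weighted least squares: (Z^T W Z + alpha I)^-1 Z^T W y
    return np.linalg.solve(Z.T @ Zw + alpha * np.eye(n), Zw.T @ y)

# On an exactly linear f, the weighted fit recovers the true weights.
x = np.ones(3)
f = lambda Z: 0.5 * Z[:, 0] - 0.2 * Z[:, 1]
coef = lime_fit(f, x)
```

Note that the sampled points are concentrated near x rather than uniform over {−1, 1}^n, which is precisely why this scheme cannot directly read off orthonormal Fourier coefficients.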
To this end, we make a global extension of the Shapley-value-based interpretations: we use the Shapley interaction indices or Shapley Taylor interaction indices as the coefficients of the corresponding terms of the polynomial surrogate function:

$$g(x_1, x_2, \cdots, x_n) = f(\emptyset) + \sum_{S \subseteq [n],\, S \neq \emptyset} I(f, S) \prod_{i \in S} x_i$$

However, these higher-order Shapley-value-based methods all build on the original Shapley value framework, so their interpretations are not truthful, i.e., they do not recover the exact coefficients of f even on the "simple bases". Moreover, as we show in our experiments, higher-order methods still incur high interpretation errors compared with our method. When applying Shapley value techniques to visual search, Hamilton et al. (2022) proposed an interesting and novel sampling + Lasso regression algorithm for efficiently computing the higher-order Shapley Taylor index in their experiments. However, their method is based on a sampling distribution generated from permutation numbers, which is far from uniform. Additionally, since their algorithm is based on the Shapley Taylor index, their method is not truthful either.
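The global extension above just evaluates a multilinear polynomial whose coefficients are the interaction indices. A minimal evaluator follows; the index values are placeholders taken from the doctor example in the introduction (weights 1, 1, 2), not computed Shapley Taylor indices.

```python
import numpy as np

def surrogate(x, f_empty, indices):
    """Evaluate g(x) = f(empty) + sum_S I(f, S) * prod_{i in S} x_i."""
    return f_empty + sum(I * np.prod([x[i] for i in S]) for S, I in indices.items())

# Hypothetical first-order interaction indices for a 3-feature model.
indices = {(0,): 1.0, (1,): 1.0, (2,): 2.0}
g_all = surrogate([1, 1, 1], f_empty=0.0, indices=indices)   # A, B, C all present
g_ab = surrogate([1, 1, -1], f_empty=0.0, indices=indices)   # the what-if: C flipped
```

Once the surrogate is a fixed polynomial, answering a what-if query is a single evaluation; whether the answer matches the network on that neighbor is exactly the consistency question studied in this paper.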

E TEST WITH LOWER ORDER POLYNOMIAL FUNCTIONS

E.1 FIRST ORDER POLYNOMIAL FUNCTION

To investigate the performance of different interpretation methods, let us take a closer look at a 1st-order polynomial function:

$$f_1(x_1, x_2, x_3) := \frac{1}{2}x_1 - \frac{1}{3}x_2 + \frac{1}{4}x_3$$

For this simple function, we can manually compute the outcome of each algorithm, as illustrated in Table 1. If the algorithm's output is correct, i.e., equal to the output of f₁, we write a check mark; otherwise, we write down the actual output of the given interpretation algorithm. As we can see, all methods are consistent and efficient in all cases. In fact, all variants of the Shapley indices degrade to 1st-order Shapley values here.

In addition, let us take a closer look at a 2nd-order polynomial function:

$$f_2(x_1, x_2, x_3) := \frac{1}{2}x_1 - \frac{1}{3}x_2 + \frac{1}{4}x_3 - \frac{1}{5}x_1x_2 + \frac{1}{6}x_1x_3 - \frac{1}{7}x_2x_3$$

For this function, we again manually compute the outcome of each algorithm, as illustrated in Table 2, using the same convention of check marks and actual outputs. As we can see, the 2nd-order interpretation algorithms, including the Shapley Taylor index, Faithful Shapley (Faith-Shap), Low-degree, and Harmonica, are consistent and efficient in all cases. The other methods can only fit a few inputs: the 2nd-order Shapley interaction index misses all cases because it is not efficient, and LIME misses all cases because f₂ is not a linear function.

Finally, we investigate the following 3rd-order polynomial and present the result in Table 3:

$$f_3(x_1, x_2, x_3) := \frac{1}{2}x_1 - \frac{1}{3}x_2 + \frac{1}{4}x_3 - \frac{1}{5}x_1x_2 + \frac{1}{6}x_1x_3 - \frac{1}{7}x_2x_3 + \frac{1}{8}x_1x_2x_3$$

As we can see, Faith-Shap, Low-degree, and Harmonica are consistent and efficient in all cases. The other methods can only fit a few inputs: the 3rd-order Shapley interaction index misses all cases because it is not efficient, and LIME misses all cases because f₃ is not a linear function.
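The check marks in these tables come down to a simple fact: computing the exact Fourier coefficients of f₂ by averaging over all eight inputs and re-expanding them reproduces f₂ everywhere in the neighborhood. A quick numeric verification (a Python sketch, not the paper's code):

```python
import itertools
import numpy as np

def f2(x):
    x1, x2, x3 = x
    return x1/2 - x2/3 + x3/4 - x1*x2/5 + x1*x3/6 - x2*x3/7

X = list(itertools.product([-1, 1], repeat=3))
# Exact Fourier coefficients: hat{f}(S) = E_x[f(x) * chi_S(x)] over all 8 inputs.
subsets = [S for k in range(4) for S in itertools.combinations(range(3), k)]
fourier = {S: sum(f2(x) * np.prod([x[i] for i in S]) for x in X) / 8 for S in subsets}
# The surrogate built from these coefficients reproduces f2 on every input,
# i.e., it is consistent on the whole cube and efficient at every point.
g = lambda x: sum(c * np.prod([x[i] for i in S]) for S, c in fourier.items())
max_err = max(abs(g(x) - f2(x)) for x in X)
```

The recovered spectrum is supported exactly on the six monomials of f₂ (the constant and third-order coefficients come out zero), which is what a truthful algorithm must report.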

F EXPERIMENT SETTINGS

In this section, we describe the detailed experiment settings. For the two language tasks, i.e., SST-2 and IMDb, we use the same CNN. The word embedding layer is pre-trained with GloVe (Pennington et al., 2014), and the maximum vocabulary size is set to 25,000. Besides the embedding layer, the network consists of several convolutional kernels with different kernel sizes (3, 4, and 5), followed by several fully connected layers, non-linear layers, and pooling layers that process the features. A sigmoid function is attached to the tail of the network so that the output can be read as a probability. The networks are trained with the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 0.01 for 5 epochs. For the vision task, we choose the official ResNet (He et al., 2016) architecture available in PyTorch; we do not discuss the architecture details here. Since our algorithm is a post-hoc, model-agnostic interpretation algorithm, we only need to query the original neural network f as an oracle on given inputs. This means that one can easily change the network architecture without any additional changes to our method. All experiments are run on a server with 4 Nvidia 2080 Ti GPUs. More information about the Python runtime environment and implementation details can be found in our code.

G DETAILED NUMERICAL RESULTS

In this section, we provide numerical results for Figures 2, 4, and 6 in Tables 4, 5, and 6, respectively. With the same sample size, the Harmonica algorithm outperforms the Low-degree algorithm by a large margin. We further increase the sample size for the Low-degree algorithm and see that its interpretation error gradually approaches that of Harmonica; however, even with 5x the sample size, the Low-degree algorithm still yields a larger interpretation error than Harmonica.



For simplicity, below we use "model" to denote the model that provides the interpretation and "network" to denote the general black-box machine learning model that needs interpretation.



Figure 2: Visualization of interpretation error I p, Nx (f, g) evaluated on SST-2 dataset.

Figure 3: Visualization of truthful gap T C (f, g) evaluated on SST-2 dataset.

Figure 4: Visualization of interpretation error I p, Nx (f, g) evaluated on IMDb dataset.

Figure 5: Visualization of truthful gap T C (f, g) evaluated on IMDb dataset.

Figure 6: Visualization of interpretation error I p, Nx (f, g) evaluated on ImageNet dataset.

Figure 7: Visualization of truthful gap T C (f, g) evaluated on ImageNet dataset.

Given an (ϵ/4, s, C)-bounded function f : {−1, 1}^n → R, Algorithm 1 finds a function g with interpretation error at most ϵ in time O((T log(1/ϵ) + |C|/ϵ) · |C|) and with sample complexity T = Õ((s²/ϵ) · log |C|).

Table 5 in Appendix G.

Table 6 in Appendix G.

Table 1: Interpretations by LIME, SHAP, Shapley Interaction Index, Shapley Taylor Index, Faith-Shap, Low-degree, and Harmonica on the 1st-order polynomial function f₁.

Table 2: Interpretations by LIME, SHAP, Shapley Interaction Index, Shapley Taylor Index, Faith-Shap, Low-degree, and Harmonica on the 2nd-order polynomial function f₂.

Table 3: Interpretations by LIME, SHAP, Shapley Interaction Index, Shapley Taylor Index, Faith-Shap, Low-degree, and Harmonica on the 3rd-order polynomial function f₃.

Table 4: The interpretation error of Harmonica and other baseline algorithms evaluated on the SST-2 dataset for neighborhoods with radii ranging from 1 to ∞ under the L₂, L₁, and L₀ norms.

H DISCUSSION ON THE LOW-DEGREE ALGORITHM

From Theorem 2 and Theorem 4, we know that the sample complexity of the Harmonica algorithm (Õ(1/ϵ)) is much better than that of the Low-degree algorithm (Õ(1/ϵ²)). Figure 8 shows that, when evaluating the interpretation error on the SST-2 dataset with the same sample size, the Harmonica algorithm achieves a substantially lower interpretation error than the Low-degree algorithm.

Table 5: The interpretation error of Harmonica and other baseline algorithms evaluated on the IMDb dataset for neighborhoods with radii ranging from 1 to ∞ under the L₂, L₁, and L₀ norms.

Table 6: The interpretation error of Harmonica and other baseline algorithms evaluated on the ImageNet dataset for neighborhoods with radii ranging from 1 to ∞ under the L₂, L₁, and L₀ norms.

