AXIOMATIC EXPLAINER LOCALITY WITH OPTIMAL TRANSPORT

Abstract

Explainability methods have been notoriously difficult to evaluate and compare. Because of this, practitioners are often left guessing which explainer they should use for their task. Locality is one critical property of explainers, granting insight into the diversity of the explanations they produce. In this paper, we define a set of axioms that align with natural intuition regarding globalness, the inverse of locality. We then introduce Wasserstein Globalness, a novel measure that uses optimal transport to quantify how local or global a given explainer is. Finally, we provide theoretical results describing the sample complexity of Wasserstein Globalness, and experimentally demonstrate how globalness can be used to compare explainers effectively. These results also illustrate connections between globalness and both explainer fidelity and explainer robustness.

1. INTRODUCTION

Machine Learning (ML) models are increasingly complex and capable of impressive performance in several domains. However, as models become more complex, they also become less interpretable. For this reason, researchers have begun to explore the topic of explainability, where model decisions are assigned an explanation. These explanations come in many forms, but often indicate how important each feature is to the model prediction. Explainers need to be trustworthy for their explanations to be valuable. However, ML practitioners have very little information at their disposal when deciding which explainer is right for them. Unlike traditional ML models, whose accuracy can be computed on held-out testing datasets, there is no obvious metric by which we can compare explainers. Ground-truth explanations are rarely known, meaning we cannot directly compute the accuracy of an explainer. Some authors have tried to argue that their explainer is best by proposing a variety of pseudo-accuracy metrics. When explaining image classifiers, for example, one may follow the lead of Zhang et al. (2018) and Wang et al. (2020) by evaluating explainers based on how well they concentrate their saliency map's energy around the object of interest. This kind of evaluation metric is far from perfect, as it penalizes explainers for using scene context and is heavily biased towards concentrated saliency. In addition, these pseudo-accuracy metrics are only applicable to the specific task of feature attribution for image data, when in reality there are many other types of explanations and many other types of data. Clearly, it would be of great interest to the explainability community to be able to compare and contrast general-purpose explainers for any ML task.

Globalness is a property that can be used to compare and contrast explainers. When one explanation fully explains the model's behavior, we call this a global explanation. In the past, models were typically explained globally.
For example, feature selection would be done globally, meaning it would generate a single group of salient features for the entire dataset (Song et al., 2010; Yu & Liu, 2004; John et al., 1994; Dy & Brodley, 2004). More recently, in order to explain complex black-box models, it has become common to generate instance-wise explanations rather than a single explanation for the entire model. In this case, the explainer outputs local explanations which apply to only a subset of the model inputs. Since this distinction between local explainers and global explainers emerged, researchers have begun to acknowledge locality/globalness as a property of explainers. Globalness is a meaningful property of explainers because it indicates how uniform the explanations are. In some cases, we expect or even desire all explanations to be similar to one another. Globalness is also related to the concepts of stability and robustness. The robustness of an explainer is limited by its globalness, and we can even use an explainer's globalness to measure its local robustness, since an explainer that is robust to small changes in the input will be near-global in a local neighborhood of the input space. This is another way in which a measure of globalness can facilitate a better understanding of our explainers and allow us to compare and contrast their behavior.

In this paper, we study the property of globalness. We provide a novel way to measure it and present several examples of how to use it in practice. To our knowledge, this is the first work to provide a formal measure of explainer locality/globalness. We believe that this advances the field by enabling a more thorough analysis of various explanation techniques. With our measure of globalness, one can compare and contrast explainers in a way that was previously difficult and heuristic. The contributions of the paper are as follows.
First, we introduce intuitive axiomatic properties which align with human intuition surrounding the notion of locality/globalness. Second, we propose Wasserstein Globalness, a novel measure of globalness, that satisfies all these properties. We also present theoretical results regarding the sample complexity of estimating Wasserstein Globalness. Finally, through our experiments, we demonstrate how explainers can be differentiated by their globalness, and make a connection between globalness and adversarial robustness.

2. RELATED WORK

As mentioned before, the community has struggled to effectively evaluate explainers, since real-world data rarely comes with ground-truth explanations. For this reason, several authors have turned to synthetic data, where the relevance of each feature is known (Chen et al., 2018; 2017). Instead of accuracy, others seek to describe explainer properties like transparency, sparsity, or robustness in order to inform ML practitioners' choice of explainer (Zhang et al., 2021). Because of the interaction between the data and the explainer, we often want to describe properties of an explainer when applied to a specific data sample, like locality. A measure of globalness which accounts for both the data and the explainer would provide a quantitative method for comparing and contrasting different explainers.

Several authors have discussed the locality/globalness of explainers. For example, Doshi-Velez & Kim (2017) describe two types of explainers: local and global. They claim that global explainers are useful for scientific understanding or bias detection, while local explainers are useful for understanding specific model predictions. Zhang et al. (2021) survey the interpretability literature and categorize explainers along several axes. One of these axes is related to the locality of the explainer, where explainers are either "global", "local", or "semi-local". The distinctions provided by Doshi-Velez & Kim (2017) and Zhang et al. (2021) are categorical rather than continuous. Though the community has begun to consider this property, there is still no formal continuous measure of locality/globalness. One common explainer, LIME, requires the user to specify a kernel width which is roughly related to locality/globalness (Ribeiro et al., 2016). Anchors are rule-based explanations given by a constrained optimization problem, where the objective is the "coverage" of the anchor (Ribeiro et al., 2018).
An anchor's coverage indicates how broadly applicable the rule is, so coverage is also related to locality/globalness, though it is tied directly into the objective of the explainer. Some authors, like Ribeiro et al. (2016), offer ways to construct a single global explanation from many local explanations, thereby granting a higher-level understanding of the model. While this offers explanations at two levels of globalness, we emphasize that this is again only a binary distinction. Works like these acknowledge the property of locality and its importance to explainability, but they offer no way to quantify the locality of an explainer. The need for such a measure is becoming increasingly apparent as more researchers study the theory of explainability. For example, Li et al. (2020) define an object called the neighborhood disjointedness factor in order to study the generalization of finite-sample-based local approximation explainers. Neighborhood disjointedness measures roughly how far apart points are from one another, and could be applied to explanations as a measure of globalness. However, it is limited to a small class of explainers and does not apply to general explanation frameworks. We advance the existing literature by formalizing the property of explainer globalness and proposing a method for measuring it in practice. This is a new tool that the machine learning community can use to better understand how our explainers interact with our data: we can order explainers by their globalness, identify grouped explanations, and compare explainers.

3. METHOD FORMULATION

There are a number of different ways one could measure the locality/globalness of an explainer. In order to guarantee the quality of our approach, we will first discuss the properties which should be satisfied by any measure of explainer globalness. In Section 3.2, these will be stated as axioms which align with our intuition about globalness. In Section 3.3, we will briefly review why other reasonable approaches fall short according to these properties. Then, in Section 3.4, we will give our definition of globalness and demonstrate that it satisfies all of these axioms.

3.1. NOTATIONS

Consider a black-box prediction model f : X → Y ⊆ R^c, where X ⊆ R^d is the space of model inputs, Y is the target space, d is the dimension of the data, and c is the number of class labels. An explainer for this model is a function E : X → E, where the outputs of the explainer are called explanations and comprise the space E. The space E is either a subset of R^d for feature-attribution explainers, or the vertices of a hypercube, {0, 1}^d, for feature-selection explainers. We will use P(E) to denote the set of probability distributions over the space of explanations, E. Let µ ∈ P(E) be one such probability distribution, generated by passing the data through our explainer. A metric measure space (mm-space) is a 3-tuple (E, d_E, µ), where E is the space of explanations and d_E is a distance metric between explanations.

Table 1: Recommended metric spaces for common explanation frameworks.

Explanation Framework | E | Distance Measure
Feature Attribution | R^d | Cosine
Feature Selection | {0, 1}^d | Hamming

For p ∈ P(X) and q ∈ P(Y), Π(p, q) will denote the set of joint distributions in P(X × Y) whose marginals are p and q. δ_x will denote the Dirac measure centered at x, which takes the value 1 at x and 0 elsewhere. We will denote the push-forward measure of µ by ϕ as ϕ_#µ(A) = µ(ϕ^{-1}(A)). For ease of reference, a summary of the notation used in this paper is given in Appendix A, Table 2.
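As a concrete illustration of Table 1, both recommended distance measures take only a few lines of code. The sketch below is our own (not from the paper) and assumes explanations are NumPy arrays:

```python
import numpy as np

def cosine_distance(x, y):
    # Recommended metric for feature-attribution explanations in R^d
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def hamming_distance(x, y):
    # Recommended metric for feature-selection explanations in {0, 1}^d
    return int(np.sum(x != y))

# Attribution vectors that agree in direction are at distance ~0,
# regardless of their magnitude.
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])
print(round(cosine_distance(a, b), 6))  # 0.0

# Selection masks differing in one feature are at Hamming distance 1.
m1 = np.array([1, 0, 1])
m2 = np.array([1, 1, 1])
print(hamming_distance(m1, m2))  # 1
```

The cosine distance ignores the magnitude of an attribution vector, which is usually desirable since only the relative importance of features matters.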

3.2. DESIRED PROPERTIES OF A GLOBALNESS MEASURE

In this section, we introduce the desired properties of a general globalness measure G : P(E) → R, which assigns a real value to a distribution of explanations. Our properties state that the measure should be non-negative, continuous, convex, and should encode similarity between explanations.

Property 1 (Non-negativity). For all mm-spaces (E, d_E, µ), the globalness is non-negative, i.e., G(µ) ≥ 0.

Property 2 (Fully-local measure). Let U be the uniform distribution on E. Then

µ = U ⇐⇒ G(µ) = 0. (1)

Property 2 states that the uniform distribution has zero globalness, and Property 1 states that this is the minimum-globalness distribution for a given (E, d_E).

Property 3 (Fully-global measure). A fully global explainer is one which produces the same explanation for every input, meaning its distribution of explanations can be represented by a Dirac distribution δ_{x_0}. Formally, there exists an x_0 ∈ E for which G_p(δ_{x_0}) ≥ G_p(µ) for all µ ∈ P(E).

Property 4 (Continuity). Let {µ_n} be a sequence of probability distributions which converges in distribution to µ. Then G_p(µ_n) → G_p(µ).

This property states that the globalness should be continuous with respect to the input distribution. In practice, the underlying distribution µ is often unknown, and we need to approximate it with an empirical distribution µ_n, where n is the number of samples. Property 4 guarantees that G can be approximated by sampling from µ and computing the globalness of the empirical distribution µ_n. Theorem 2, discussed in Section 4, will characterize the rate of this convergence in more detail.

The need to account for distance between explanations. Before we introduce the next property, we will motivate it by discussing the geometry of E. A measure of globalness should take into account a notion of difference between explanations; this can be captured by introducing a distance metric, d_E.
For an illustrative example of how globalness is related to the geometry of explanations, refer to Figure 1. Both distributions yield one of two explanations with equal probability. These two explanations are at a Hamming distance of 1 for the left explainer, and a distance of 2 for the right explainer. Since the explanations are more varied for the right explainer, it is less global.

The need for isometry invariance. The previous example demonstrated that globalness should account for the distance between explanations. Additionally, when the explanations change but the distances between them do not, the globalness should not change. In Figure 2, we show an example of this. The two distributions in this figure are equivalent up to a re-ordering of the features. Clearly, we would consider both of these to be equally global, because one distribution is simply a reflection of the other. Transformations like this, such as rotations and reflections, are referred to as isometries. The following property states that a measure of globalness should be invariant to isometries.

Property 5 (Isometry-invariance). Let T_{E,d_E} be the group of distance-preserving measurable maps ϕ:

T_{E,d_E} = {ϕ : E → E | d_E(x, y) = d_E(ϕ(x), ϕ(y)), ∀x, y ∈ E}. (2)

Then G is invariant with respect to the group T_{E,d_E}:

G(µ) = G(ϕ_#µ) for all ϕ ∈ T_{E,d_E}. (3)

Additionally, let S denote all the measurable bijective maps ψ : E → E. Then G is not invariant to S: there exists some ψ ∈ S for which G(µ) ≠ G(ψ_#µ).

This axiom states that globalness is invariant to isometries of the explanation space E, but not to arbitrary permutations. First, Equation 3 states that G should be invariant to distance-preserving transformations of the explanation space. Because globalness measures roughly how similar the set of explanations are to one another, any rigid transformation (rotation, reflection, etc.) of the space of explanations will have no effect on the globalness.
Recall that we want globalness to measure roughly how similar the explanations are. This notion of similarity is directly encoded by the distance d_E, hence an isometry which does not change these distances should not change the globalness. There is one more situation that we want our properties to address: when two sets of explanations are combined, how does the globalness change? The following property answers this question.

Property 6 (Convexity). If the explanations from two explainers are combined, the combined distribution will yield a lower globalness than the average globalness of the original explainers. Formally, let R = λP + (1 − λ)Q for some λ ∈ [0, 1] and P, Q ∈ P(E). Then

G(R) ≤ λG(P) + (1 − λ)G(Q).

This property states that G is convex with respect to the distribution µ, implying that a mixture of two explainers is less global than the average of each explainer's globalness. By mixture, we mean probabilistically generating explanation E_1(x) with probability p or explanation E_2(x) with probability (1 − p), independently for each x ∈ X. Concavity is a common requirement of diversity measures. For example, in genetics, where biological diversity is measured, the diversity of two populations combined is expected to be greater than the diversity within either population alone (Ricotta, 2003). We instead require convexity because globalness is inversely related to the diversity of explanations.
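Property 6 can be illustrated numerically in a simple one-dimensional setting. The sketch below is our own illustration, not the paper's code; it uses SciPy's one-dimensional `wasserstein_distance` against a uniform reference sample as a stand-in for a globalness measure, and checks the convexity inequality for a 50/50 mixture of two synthetic explainers:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
uniform_ref = rng.uniform(0.0, 1.0, 4000)  # reference samples from U on E = [0, 1]

def globalness(samples):
    # 1-Wasserstein distance to the uniform reference (the p = 1 case)
    return wasserstein_distance(samples, uniform_ref)

# Explainer P concentrates its explanations near 0.3, Q near 0.7.
p = rng.normal(0.3, 0.05, 2000)
q = rng.normal(0.7, 0.05, 2000)
lam = 0.5
mixture = np.concatenate([p[:1000], q[:1000]])  # 50/50 mixture of P and Q

# Property 6: the mixture is no more global than the average of P and Q.
lhs = globalness(mixture)
rhs = lam * globalness(p) + (1 - lam) * globalness(q)
print(lhs < rhs)  # True: the mixture's explanations are more diverse
```

Here the inequality is strict, because the mixture spreads its mass over two clusters and therefore sits closer to the uniform distribution than either component alone.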

3.3. CANDIDATE MEASURES

We have considered several "candidate" approaches for G. We will now discuss some of these and illustrate why they fall short as measures of globalness.

Shannon Entropy. Since globalness is related to how diverse the explanations are, one might initially consider using the Shannon entropy of the explainer's outputs as a measure of locality (Shannon, 1948):

H(X) = E[−log p(X)].

Entropy is maximized when all outcomes are equally likely, and minimized when the explainer generates only one explanation for all of the data, clearly satisfying Properties 1-3. In fact, entropy satisfies several of our desiderata in Section 3.2, but fails on Property 5: entropy cannot distinguish between how similar explanations are to one another.

F-Divergences. One might also consider measuring globalness with an f-divergence (Csiszár, 1964) between the distribution of explanations µ and the uniform distribution U:

D_f(P ∥ Q) = E_{x∼Q}[f(P(x)/Q(x))].

Remark (Alternative approaches). Shannon entropy and f-divergences both violate Property 5.

F-divergences fail for the same reason that entropy fails: they consider only the probability of each explanation and ignore the degree of similarity between explanations. The proof of this remark is given in Appendix E.
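The failure of entropy to satisfy Property 5 is easy to see concretely: two explainers with identical output probabilities but differently spaced explanations receive exactly the same entropy. A minimal sketch (our own, assuming binary feature-selection masks):

```python
import numpy as np

def shannon_entropy(probs):
    probs = np.asarray(probs, dtype=float)
    return float(-np.sum(probs * np.log2(probs)))

# Explainer A outputs masks (0,0,0) and (0,0,1) with equal probability;
# explainer B outputs masks (0,0,0) and (1,1,1) with equal probability.
A = [np.array([0, 0, 0]), np.array([0, 0, 1])]
B = [np.array([0, 0, 0]), np.array([1, 1, 1])]

# Entropy sees only the probabilities, so both explainers look identical.
print(shannon_entropy([0.5, 0.5]))  # 1.0 (bits), for either explainer

# A distance-aware view separates them: B's two explanations disagree on
# three features, A's on only one, so B is the less global explainer.
print(int(np.sum(A[0] != A[1])), int(np.sum(B[0] != B[1])))  # 1 3
```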

3.4. THE PROPOSED APPROACH

The framework presented in this paper is very general; only a metric space (E, d_E) needs to be defined. For common explanation frameworks, recommended choices for d_E are listed in Table 1. Given a metric space (E, d_E), we propose to measure the p-Wasserstein Globalness of µ, denoted G_p(µ), as the Wasserstein distance between µ and the uniform measure on E (Kantorovich, 1960). Intuitively, the Wasserstein distance between distributions P and Q measures how much "work" one would have to do to move probability mass from P until it matches Q. For this reason, it is sometimes called the earth-mover's distance when p = 1.

Definition 1 (Wasserstein Globalness).

G_p(µ) := d_W^p(µ, U) (7)
= inf_{π∈Π(µ,U)} E_{(x,y)∼π}[d_E(x, y)^p]^{1/p}. (8)

For all of our experiments we set p = 1, but p is left general in the remainder of the paper, with the exception of Theorem 2, where p affects the convergence of approximate solutions to Equation 7.

Theorem 1. The Wasserstein Globalness G_p : P(E) → R satisfies Properties 1-6, where P(E) is the set of probability distributions over E.

The proof of Theorem 1 is provided in Appendix C. Because it satisfies these intuitive properties, the Wasserstein Globalness is a satisfactory way to quantify the globalness of an explainer.
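For equally weighted empirical samples, Definition 1 can be estimated by solving an optimal assignment problem between the explanation sample and a uniform sample. The following sketch is our own illustration (not the authors' implementation); it works on {0, 1}^3 with the Hamming distance and uses SciPy's Hungarian solver:

```python
import itertools
import numpy as np
from scipy.optimize import linear_sum_assignment

def hamming(x, y):
    return int(np.sum(x != y))

def wasserstein_globalness(explanations, uniform_samples, dist):
    """Empirical 1-Wasserstein Globalness for two equal-size samples.

    With N equally weighted points on each side, the optimal transport
    problem reduces to an optimal assignment over the N x N cost matrix."""
    cost = np.array([[dist(e, u) for u in uniform_samples] for e in explanations])
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

d = 3
cube = [np.array(v) for v in itertools.product([0, 1], repeat=d)]
rng = np.random.default_rng(0)
uniform_samples = [cube[i] for i in rng.integers(0, len(cube), 64)]

# A fully global explainer emits one explanation for every input; a fully
# local one emits explanations spread uniformly over the cube.
global_expl = [np.array([1, 0, 1])] * 64
local_expl = [cube[i] for i in rng.integers(0, len(cube), 64)]

print(wasserstein_globalness(global_expl, uniform_samples, hamming))  # high
print(wasserstein_globalness(local_expl, uniform_samples, hamming))   # near 0
```

The ordering matches Properties 2 and 3: the Dirac-like explainer receives a large value, while the near-uniform explainer receives a value close to zero.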

4. SAMPLE COMPLEXITY

In practice, we do not work with the true distributions µ and U, but with their discrete approximations. The discrete approximation of a distribution p is defined as p_N = (1/N) Σ_{i=1}^N δ_{X_i}, where X_i ∼ p i.i.d. Using a discrete approximation for the uniform distribution allows us to estimate the Wasserstein Globalness with a discretized formulation:

Definition 2 (Empirical Wasserstein Globalness). Ĝ_p(µ) := d_W^p(µ, U_N).

In this section we address the question of how accurately Ĝ_p(µ_N) approximates G_p(µ). Fortunately, related works have studied the convergence of empirical distributions under the Wasserstein metric, notably Fournier & Guillin (2015) and Dereich et al. (2013). These prior results provide the foundation for the following theorem.

Theorem 2. Let p be the order of the Wasserstein Globalness, and let M_q(µ) = ∫_{R^d} |x|^q dµ(x). If p ∈ (0, d/2), q ≥ dp/(d − p), and q ≠ d/(d − p), then:

E[|Ĝ_p(µ_N) − G_p(µ)|] ≤ κ_{p,q} M_q(µ)^{1/q} N^{−1/d} + C_{p,q,d} M_q^{p/q}(µ)(N^{−p/d} + N^{−(q−p)/q}). (10)

From this theorem, we can immediately see that E[|Ĝ_p(µ_N) − G_p(µ)|] → 0 as N → ∞. However, increasing d leads to a looser bound, meaning we need more samples as d increases. This is a problem if we want to compute Wasserstein Globalness in high-dimensional problem settings like computer vision tasks. To scale to high dimensions, we utilize the Sinkhorn distance (Cuturi, 2013), which is described in Appendix B.

5. EXPERIMENTS

Wasserstein Globalness allows us to compare and contrast explainers. In this section, we demonstrate certain insights it can provide about both our data and our explainers through two kinds of experiment. The first demonstrates how globalness can help identify clustered explanations. The second uses globalness to identify how locally stable an explainer is. The code used in the following experiments will be made publicly available.

Datasets. Our first experiment uses a simple synthetic dataset where the ground-truth explanations are known. To create this dataset, we generate d Gaussian clusters in R^d. Each cluster has a single (known) relevant feature. We use two other datasets in these experiments, which contain high-dimensional image data. The first, MNIST (Deng, 2012), is comprised of grayscale images depicting the handwritten digits 0-9. The last dataset, CIFAR-10 (Krizhevsky, 2009), is comprised of 32×32 color images, where the targets are one of ten classes of common objects.

Explainers. Before describing the experiments, we will briefly discuss the explanation methods used. The Gradients (Simonyan et al., 2013) method directly uses the gradient of the class score with respect to the input image as a saliency map. The Integrated Gradients method accumulates this gradient at several points along a path from some baseline input to the input to be explained (Sundararajan et al., 2017). The Expected Gradients method extends Integrated Gradients by allowing a distribution over baselines and computing the expected value of Integrated Gradients with respect to this distribution (Erion et al., 2021). The SmoothGrad method smooths the gradients with a Gaussian kernel to alleviate the high-frequency fluctuations commonly seen in partial derivatives (Smilkov et al., 2017). The Deep-SHAP method (Lundberg & Lee, 2017) approximates Shapley values by extending the DeepLIFT algorithm (Shrikumar et al., 2017).
LIME (Ribeiro et al., 2016) forms a local approximation to the decision boundary by training a linear model on distance-weighted samples from the input space X. Guided Backpropagation (Springenberg et al., 2015) modifies the gradient approach by adding a signal that prevents the backwards flow of negative gradients during backpropagation.

Identifying Grouped Explanations. In some problems, one may observe that certain groups of inputs receive similar explanations. In this case, it could be beneficial to determine how many such groups exist. Our technique allows us to answer such questions because the number of clusters is inversely related to globalness. In this section, we demonstrate this relationship empirically. Specifically, we observe that the Wasserstein Globalness decreases as we add more classes, because images in a given class receive similar explanations. We train a simple MLP classifier on the previously described MNIST dataset, with no regularization and hidden layers of size 300, 500, and 300, respectively. We use cross-entropy loss and train for 5 epochs. For all image data in our experiments, including MNIST, we use a feature-attribution explainer. Feature-attribution explainers score each feature according to its importance; because of this, explanations fall in R^d rather than {0, 1}^d. Following the canonical distance metric from Table 1, we endow our metric space (E, d_E) with the cosine distance: d_cos(x, y) = 1 − (x · y)/(∥x∥₂ ∥y∥₂). Figure 3a shows the results of this experiment. As expected, each explainer's Wasserstein Globalness decreases as we increase the number of digits explained.

Identifying Local Explainer Stability. If explanations change drastically after a very small change to the input, the explainer may be considered less trustworthy, because we expect our explainers to be locally stable. Of course, the degree of this stability depends on the specific problem in question.
We will later discuss how this explainer stability is related to the classifier's stability. By evaluating Wasserstein Globalness in a small neighborhood around a single data point, we can see how robust the explainer is to such perturbations. The following experiments demonstrate this insight.

In our first stability experiment, we use the synthetic data shown in Figure 4. We attempt to fool the explainer by making the originally straight decision boundary jagged, thereby obscuring the truly relevant feature. To make the boundary jagged, we randomly visit a point x, then flip all nearby points x′ to the label of x, ensuring that visited points x and x′ are never revisited. This data modification is illustrated in Figure 4, and pseudocode is provided in Appendix A, Algorithm 1. In this experiment, we select the feature with the highest attribution and use the Hamming distance in our explanation space: d_H(x, y) = Σ_i 1(x_i ≠ y_i). The results in Figure 3b demonstrate how the explainer accuracy, Acc_exp, is related to the explainer globalness G in this synthetic-data experiment. We expect that, as explainers become less accurate, the variance of their globalness increases; accurate explainers, on the other hand, should yield globalness values which are consistently close to the ground-truth globalness. In Figure 3b, we see this relationship: as the explainer accuracy decreases, the Wasserstein Globalness varies farther from the ground-truth globalness. Furthermore, the Integrated Gradients method is more robust to the jagged boundaries, and this is reflected in it having the smallest standard deviation of Wasserstein Globalness.

We conduct our next experiment on the CIFAR-10 dataset (Krizhevsky, 2009). This experiment demonstrates how globalness can be used as an indicator of explainer robustness in a local neighborhood. First, we train a CNN on the dataset as our classification model.
We generate a set of inputs by taking one CIFAR-10 image and performing 300 independent adversarial attacks on that image. Each adversarial input is explained, and we evaluate the Wasserstein Globalness for that set of adversarial inputs. This is repeated several times, each time allowing stronger adversarial attacks, so that each evaluation of Wasserstein Globalness covers a larger local neighborhood of X. By doing this, we can examine how globalness differs across explainers. Though we do not know the ground-truth globalness for these images, we expect the globalness to be near-maximal in this setting, since feature importances should not be affected by very small perturbations. We are primarily interested in perturbations which are small enough that the classifier still performs well; intuitively, we hope that, as long as the classification does not change, the reason for the classification should also remain roughly constant.

Figures 5 and 6 show that, even with only 300 explanations in 32 × 32 × 3 dimensions, the Wasserstein Globalness can differentiate the explainers. In Figure 5, we can see how explanations change after the adversarial attacks. This figure also illustrates that the perturbed images are not too different from the original image: though the images are perceptually near-identical, the explanations visibly change, leading to a change in globalness. In Figure 6, we observe exactly how this globalness changes for the batch of explanations at each σ. The Deep-SHAP explainer is the most robust to local perturbations (its Wasserstein Globalness stays close to the maximum for a variety of σ), while the Gradient and SmoothGrad explainers are the most sensitive. This result is not surprising, as SmoothGrad injects noise in order to perform its smoothing operation, and the Gradient approach is the most naive, forming the basis for many of the other approaches.

6. CONCLUSION

Explainability is a branch of machine learning that not only facilitates trust between the user and the model, but also allows us to learn from our models. Thus, it is critical that we understand our explainers to use them reliably. In this paper, we have argued that globalness is a theoretical property of explainers which can be valuable to the explainability community. We introduced Wasserstein Globalness, a model-agnostic method for measuring the globalness of explainers which satisfies intuitive axioms regarding the notion of globalness. Finally, we presented experimental results which demonstrate how globalness can be used to differentiate explainers and choose the best one for a given task.

ETHICS STATEMENT

As ML models grow in both complexity and popularity, they have also become less interpretable. These complex models often exhibit surprising behavior. In extreme cases, they can propagate racial and gender bias and even put humans in physical danger. However, even the benign cases erode trust between the user and the model. ML practitioners are responsible for the harmful effects of their models, and are therefore obliged to build reliable, trustworthy systems. Explainability is one way in which we can begin to build trustworthy models. It is crucial that we seek a better understanding of uninterpretable models in order to deploy them safely in the real world, whether that means building interpretable models in the first place or explaining them after the fact. In this paper, we discuss a theoretical property of explainability methods and present a technique for measuring this quantity in practice. We intend for our globalness measure to facilitate responsible practices by enabling humans to better understand their models. This understanding can facilitate responsible decisions, enable practitioners to remove harmful behavior from their systems, and increase our trust in the models themselves.

A APPENDIX

Algorithm 1 Jagged-Boundary
1: Let X be a finite data sample.
2: Let label(·) give the label of each point in X.
3: Let N_x be a neighborhood around point x.
4: while not all points visited do
5:   for unvisited x do
6:     for unvisited x′ ∈ N_x do
7:       label(x′) ← label(x)
8:       mark x′ as visited
9:     end for
10:   end for
11: end while

We measure the wall-clock timing of the three computationally expensive tasks. Because these quantities vary with the explainer used and the random input image, we present the average over both the explainers and 10 randomly selected CIFAR-10 images. These results are shown in Table 3.
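A runnable version of Algorithm 1 might look as follows. This is our own sketch; the neighborhood N_x is taken to be a Euclidean ball whose radius is an assumed parameter:

```python
import numpy as np

def jagged_boundary(X, labels, radius, seed=0):
    """Randomly visit points; overwrite the labels of unvisited neighbors
    (here, points within a Euclidean ball of the given radius), marking
    everything touched as visited so no point is revisited."""
    labels = labels.copy()
    visited = np.zeros(len(X), dtype=bool)
    rng = np.random.default_rng(seed)
    for i in rng.permutation(len(X)):
        if visited[i]:
            continue
        visited[i] = True
        near = np.linalg.norm(X - X[i], axis=1) <= radius  # neighborhood N_x
        flip = near & ~visited
        labels[flip] = labels[i]   # label(x') <- label(x)
        visited[flip] = True
    return labels

# Two tight pairs of points with conflicting labels: after the pass,
# each pair shares whichever label its first-visited member had.
X = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 0.0], [2.1, 0.0]])
y = np.array([0, 1, 0, 1])
print(jagged_boundary(X, y, radius=0.5))
```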

Table 3: Average wall-clock time for the CIFAR-10 experiment.

Task | Average Time (s)
Adversarial Perturbations | 56.17
Generating Explanations | 19.42
Computing Globalness | 6.97

As expected, computing globalness is less computationally expensive than generating the explanations themselves. Figure 7 shows all of the perturbed images from the CIFAR-10 experiment presented in Section 5; Figure 5 is formed from a subset of the rows in Figure 7. Figure 8 depicts another set of explanations, with a different original input image. In Figure 11, we show an example of the different perturbed images for a single adversarial radius σ.

B SINKHORN DISTANCE

This empirical estimate also scales poorly with the dimension of the data. However, several techniques have been developed to efficiently compute Wasserstein distances in high dimensions. One such technique, called entropy-regularized optimal transport, relies on regularizing the entropy of the joint distribution π ∈ Π(P, Q). The Wasserstein distance can be written as an optimization over the set of product measures Π(P, Q). In its discrete form, the expectation E_{(x,y)∼π}[d_E(x, y)] from Equation 7 can be written as a Frobenius inner product between the matrix of distances D and the transport plan R:

d_W^1(p, q) = min_{R∈Π(p,q)} ⟨R, D⟩_F.

The Sinkhorn distance, introduced to the ML community by Cuturi (2013), utilizes a technique called entropic regularization. The entropy-regularized transport problem adds a regularization term Ω(R) = Σ_{i,j} r_{i,j} log(r_{i,j}) to the objective:

⟨R, D⟩_F + λΩ(R).

This encourages the transport plan to have less mutual information with the marginals p and q. By searching just over these "smooth" transport plans, we have convexified the optimization problem and can now use matrix-scaling algorithms to solve it. There are also existing theoretical guarantees that d_Sinkhorn will be close to d_W, for which we refer the reader to Cuturi (2013). By utilizing entropic regularization, we can find good solutions faster.
However, we can always decrease the strength of this regularization in order to eliminate noise from the computation. As we have discussed, this is especially useful when we work with high-dimensional data, as is done in Section 5.
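As a rough illustration of the matrix-scaling approach, the NumPy sketch below alternately rescales the rows and columns of the Gibbs kernel e^{−λD} so that the plan's marginals match p and q. The marginals, cost matrix, and regularization strength are arbitrary illustrative choices, not quantities from our experiments:

```python
import numpy as np

def sinkhorn(p, q, D, lam=10.0, n_iter=200):
    """Sinkhorn iterations for entropy-regularized OT (illustrative sketch).

    Approximately solves min_{R in Pi(p, q)} <R, D>_F + (1/lam) * sum r_ij log r_ij;
    larger lam means weaker regularization (a sharper plan).
    """
    K = np.exp(-lam * D)              # Gibbs kernel
    u = np.ones_like(p)
    for _ in range(n_iter):           # alternate scaling to match each marginal
        v = q / (K.T @ u)
        u = p / (K @ v)
    R = u[:, None] * K * v[None, :]   # transport plan diag(u) K diag(v)
    return float(np.sum(R * D)), R    # <R, D>_F and the plan itself

# Toy example: two uniform marginals, unit cost off the diagonal.
p = np.array([0.5, 0.5])
q = np.array([0.5, 0.5])
D = np.array([[0.0, 1.0], [1.0, 0.0]])
cost, R = sinkhorn(p, q, D)
```

Here the optimal plan keeps all mass on the diagonal, so the regularized cost is close to zero; decreasing `lam` smooths the plan toward the independent coupling.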

C PROOF OF THEOREM 1

We will now restate and prove that the axioms from Section 3.2 apply to the definition of G in Section 3.4.

Property 1 states that G_p(µ) ≥ 0 for all µ.

Proof. Since d_W is a distance metric, it is non-negative by construction. Therefore G_p(P) = d_W(P, U) is non-negative.

Property 4 Let {µ_n} be a sequence of probability distributions which converges in distribution to µ. Then G_p(µ_n) → G_p(µ).

Proof. By Remark 2.4 in (Mémoli, 2011),

|d_W^p(A, B) − d_W^p(A_n, B_n)| ≤ d_W^p(A, A_n) + d_W^p(B, B_n)   (13)

This implies that

|d_W^p(µ, U) − d_W^p(µ_n, U_n)| ≤ d_W^p(µ, µ_n) + d_W^p(U, U_n)

However, U does not need to be approximated, so U = U_n, and therefore

|G_p(µ) − G_p(µ_n)| = |d_W^p(µ, U) − d_W^p(µ_n, U)| ≤ d_W^p(µ, µ_n)

Since d_W^p(µ, µ_n) → 0, it follows that |G_p(µ) − G_p(µ_n)| → 0.

From this proof we can see that not only does G_p(µ_n) converge to G_p(µ), but it converges at least as quickly as µ_n → µ.

Property 3 states that there exists an x_0 ∈ E for which G_p(δ_{x_0}) ≥ G_p(µ) for all µ ∈ P(E). Furthermore, we will also show that, for this x_0, s_{x_0,p} ≥ s_{x,p} for all x ∈ E, where s_{x,p} is the p-eccentricity, defined in Mémoli (2011) as s_{x,p} = (∫_X d^p(x, x′) dx′)^{1/p}.foot_1 The p-eccentricity appears in this property in order to characterize the x_0 for which δ_{x_0} achieves maximum globalness.

Proof. Without loss of generality, assume s_{x_0,p} ≥ s_{x,p} for all x ∈ E. Any discrete distribution can be represented as a delta-train P = Σ_i λ_i δ(x_i) such that λ_i ≥ 0 for all i and Σ_i λ_i = 1. By Property 6, we know that G_p(P) ≤ Σ_i λ_i G_p(δ(x_i)). But for a Dirac measure, Equation 7 reduces to

G_p(δ_{x_0}) = (∫_X d^p(x_0, x) dx)^{1/p} = s_{x_0,p}

So G_p(P) ≤ Σ_i λ_i s_{x_i,p} ≤ max_i s_{x_i,p} = s_{x_0,p}. This proves the axiom for any discrete distribution. But general distributions can be approximated with these weighted delta-trains, so by Property 4, the proof holds for general distributions in the limit.

Property 2 states that µ = U ⟺ G(µ) = 0.

Proof. First note that d_W^p(P, Q) is a metric.
From the identity of indiscernibles axiom of distance metrics, d(x, y) = 0 ⟺ x = y. This directly implies that d_W(µ, U) = 0 ⟺ µ = U.

We will now prove the first part of Property 5, restated below: Let T_{E,d_E} be the group of isometries of (E, d_E) defined in Equation 2. Then G_p is T_{E,d_E}-invariant.

Proof. We need to show that G_p(µ) = G_p(ϕ_#µ) for all ϕ ∈ T_{E,d_E}. One consequence of the definition of the push-forward measure is that

P_{ϕ_#q}(ϕx) = P_q(ϕ^{−1}ϕx) = P_q(x) ⟹ E_{x∼ϕ_#q}[ϕx] = E_{x∼q}[x]   (17)

Consider the set of measure couplings (feasible transport plans) between µ and ν, denoted Π(µ, ν). Let h_ϕ : Π(µ, ν) → Π(ϕ_#µ, ϕ_#ν) be given by h_ϕ ∘ π(A, B) = π(ϕ^{−1}A, ϕ^{−1}B). Since h_ϕ is bijective, and since we have shown in 17 that E_{x∼π}[x] = E_{x∼h_ϕ(π)}[ϕ(x)], it holds that

G_p(µ) = inf_{π∈Π(µ,U)} E_{(x,y)∼π}[d^p(x, y)]^{1/p}   (19)

Because ϕ is an isometry,

= inf_{π∈Π(µ,U)} E_{(x,y)∼h_ϕ(π)}[d^p(ϕx, ϕy)]^{1/p}   (20)

Because E_{x∼π}[x] = E_{x∼h_ϕ(π)}[ϕ(x)],

= inf_{π∈Π(µ,U)} E_{(x,y)∼h_ϕ(π)}[d^p(x, y)]^{1/p}   (21)

Finally, since h_ϕ is a bijection between Π(µ, U) and Π(ϕ_#µ, ϕ_#U),

= inf_{π∈Π(ϕ_#µ,ϕ_#U)} E_{(x,y)∼π}[d^p(x, y)]^{1/p} = G_p(ϕ_#µ)

We will now prove the second part of Property 5, which states that there exists ϕ : E → E with ϕ ∉ T_{E,d_E} for which G_p(ϕ_#µ) ≠ G_p(µ), for some µ.

Proof. Consider a ϕ ∈ S, ϕ : E → E, such that d_E(x, y) = c · d_E(ϕx, ϕy) for some constant c ≠ 1 and some x, y ∈ E. Note that ϕ is clearly not in T_{E,d_E}, the group of isometries of (E, d_E). Like before, we can write:

G_p(µ) = inf_{π∈Π(µ,U)} E_{(x,y)∼π}[d^p(x, y)]^{1/p}   (22)

= inf_{π∈Π(ϕ_#µ,ϕ_#U)} E_{(x,y)∼π}[d^p(ϕ^{−1}x, ϕ^{−1}y)]^{1/p}   (23)

But since ϕ is not isometric, d^p(ϕ^{−1}x, ϕ^{−1}y) = c^p d^p(x, y), so

= inf_{π∈Π(ϕ_#µ,ϕ_#U)} E_{(x,y)∼π}[c^p d^p(x, y)]^{1/p} = c · G_p(ϕ_#µ)   (24)

≠ G_p(ϕ_#µ)

for any µ with G_p(ϕ_#µ) > 0.

Property 6 (convexity) Let R be a convex combination of measures P and Q. That is, R = λP + (1 − λ)Q for some λ ∈ [0, 1]. Then G(R) ≤ λG(P) + (1 − λ)G(Q).

Proof. Let P, Q be probability distributions. Let ϕ_P be the optimal transport plan between P and U.
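Both halves of Property 5 can be checked numerically in one dimension, where the empirical 1-Wasserstein distance between equal-size samples reduces to the mean gap between sorted values. The sample arrays below are arbitrary illustrations; translations and reflections are isometries of (R, |·|), while scaling is not:

```python
import numpy as np

def w1(x, y):
    """Empirical 1-D W1 between two equal-size samples: mean sorted-value gap."""
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

x = np.array([0.0, 1.0, 2.0, 5.0])
y = np.array([1.0, 1.5, 3.0, 4.0])

d = w1(x, y)
d_shift = w1(x + 7.0, y + 7.0)   # translation: an isometry, distance unchanged
d_flip = w1(-x, -y)              # reflection: an isometry, distance unchanged
d_scale = w1(3.0 * x, 3.0 * y)   # scaling by 3: not an isometry, distance rescales
```

The two isometries leave the distance untouched, while the scaling map multiplies it by exactly c = 3, matching the c · G_p(ϕ_#µ) computation above.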
Similarly, let ϕ_Q be the optimal transport plan between Q and U. Then the convex combination λϕ_P + (1 − λ)ϕ_Q is a feasible transport plan between R = λP + (1 − λ)Q and U, so the chain of (in)equalities in Equations 26–28 gives G_p(R) ≤ λG_p(P) + (1 − λ)G_p(Q).

D PROOF OF THEOREM 2

We will first introduce two lemmas to simplify the proof of Theorem 2. The first lemma concerns the error incurred by approximating µ with finite samples. The second concerns the error incurred by approximating U with finite samples. By bounding both of these errors, we obtain an upper bound on the error of using both finite-sample approximations.

Lemma 1. Let p ∈ (0, d/2). Also, let q ≠ d/(d − p). Let M_q(µ) = ∫_{R^d} |x|^q dµ(x). Then for some constant C_{p,q,d} which depends only on p, q, and d,

E|G_p(µ) − G_p(µ_N)| ≤ C_{p,q,d} M_q^{p/q}(µ)(N^{−p/d} + N^{−(q−p)/q})   (30)

Proof. From the triangle inequality,

d_W^p(µ_N, U) ≤ d_W^p(µ, U) + d_W^p(µ_N, µ)

From (Fournier & Guillin, 2015), E[d_W^p(µ_N, µ)] ≤ C_{p,q,d} M_q^{p/q}(µ)(N^{−p/d} + N^{−(q−p)/q}). So

E[G_p(µ_N)] − G_p(µ) ≤ C_{p,q,d} M_q^{p/q}(µ)(N^{−p/d} + N^{−(q−p)/q})   (32)

Similarly, we can invoke the triangle inequality with d_W^p(µ, U) on the left-hand side:

d_W^p(µ, U) ≤ d_W^p(µ_N, U) + d_W^p(µ_N, µ)

Making the same argument, we get

G_p(µ) − E[G_p(µ_N)] ≤ C_{p,q,d} M_q^{p/q}(µ)(N^{−p/d} + N^{−(q−p)/q})   (34)

Finally, combining 32 and 34,

E|G_p(µ) − G_p(µ_N)| ≤ C_{p,q,d} M_q^{p/q}(µ)(N^{−p/d} + N^{−(q−p)/q})

This lemma is related to Property 4 in the sense that the empirical distribution µ_N converges to µ. The lemma concerns the rate at which G_p(µ_N) converges, whereas Property 4 simply states that it converges. Perhaps the most important consequence of Lemma 1 is that E|G_p(µ) − G_p(µ_N)| → 0 as N → ∞.

Lemma 2. Let p ∈ (0, d/2), and q > dp/(d − p). Then for some constant κ_{p,q} which depends only on p and q,

E|Ĝ_p(µ_N) − G_p(µ_N)| ≤ κ_{p,q} (∫_{R^d} ∥x∥^q dµ(x))^{1/q} N^{−1/d}   (35)

Proof.
From the triangle inequality:

d_W(µ_N, U_N) ≤ d_W(µ_N, U) + d_W(U_N, U)   (36)

⟹ d_W(µ_N, U_N) − d_W(µ_N, U) ≤ d_W(U_N, U)

From (Dereich et al., 2013), there exists some constant κ_{p,q} such that E[d_W^p(U_N, U)] ≤ κ_{p,q} (∫_{R^d} ∥x∥^q dµ(x))^{1/q} N^{−1/d}. Therefore

E[d_W(µ_N, U_N) − d_W(µ_N, U)] ≤ E[d_W^p(U_N, U)] ≤ κ_{p,q} (∫_{R^d} ∥x∥^q dµ(x))^{1/q} N^{−1/d}

⟹ E[Ĝ_p(µ_N) − G_p(µ_N)] ≤ κ_{p,q} (∫_{R^d} ∥x∥^q dµ(x))^{1/q} N^{−1/d}

Similarly:

d_W(µ_N, U) ≤ d_W(µ_N, U_N) + d_W(U_N, U)

⟹ d_W(µ_N, U) − d_W(µ_N, U_N) ≤ d_W(U_N, U)

So we have that:

E[|Ĝ_p(µ_N) − G_p(µ_N)|] ≤ κ_{p,q} (∫_{R^d} ∥x∥^q dµ(x))^{1/q} N^{−1/d}   (38)

We will now combine the previous lemmas to get a final bound on the approximation error incurred by using discrete approximations for both µ and U.

Proof. The following one-line proof is a natural consequence of Lemmas 1 and 2.

E[|Ĝ_p(µ_N) − G_p(µ)|] ≤ E[|Ĝ_p(µ_N) − G_p(µ_N)|] + E[|G_p(µ_N) − G_p(µ)|]
≤ κ_{p,q} (∫_{R^d} ∥x∥^q dµ(x))^{1/q} N^{−1/d} + C_{p,q,d} M_q^{p/q}(µ)(N^{−p/d} + N^{−(q−p)/q})
= κ_{p,q} M_q(µ)^{1/q} N^{−1/d} + C_{p,q,d} M_q^{p/q}(µ)(N^{−p/d} + N^{−(q−p)/q})
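The convergence these bounds describe can be observed in a toy 1-D Monte Carlo sketch, under purely illustrative choices: take µ = δ_0 and U = Uniform[0, 1], for which G_1(µ) = E_{x∼U}|x − 0| = 0.5 in closed form, and estimate Ĝ_1(µ_N) as the empirical W1 between N samples of each measure:

```python
import numpy as np

rng = np.random.default_rng(0)

def g_hat(n):
    """Empirical W1 between n samples of mu = delta_0 and n samples of U[0,1].

    For equal-size 1-D samples, W1 is the mean gap between sorted values.
    """
    mu_n = np.zeros(n)                      # every draw from delta_0 is 0
    u_n = np.sort(rng.uniform(size=n))      # empirical approximation of U
    return float(np.mean(np.abs(np.sort(mu_n) - u_n)))

def mean_error(n, trials=100):
    """Monte Carlo estimate of E|G_hat_1(mu_N) - G_1(mu)|, with G_1(mu) = 0.5."""
    return float(np.mean([abs(g_hat(n) - 0.5) for _ in range(trials)]))

err_small = mean_error(50)     # coarse approximation
err_big = mean_error(2000)     # finer approximation
```

As the lemmas predict, the expected error shrinks as N grows; the specific sample sizes and trial count here are arbitrary.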



foot_0: For our experiments, we compute the Sinkhorn algorithm using the optimal transport solvers available in Flamary et al. (2021).

foot_1: Roughly, the p-eccentricity indicates how far a point x is from the rest of the points in E.



Figure 1: Two distributions of feature-selection explanations on {0, 1}^2 (red); the vertical axis represents the probability of seeing each explanation. The left distribution is more global because it outputs explanations which are closer to one another in the metric space (E, d_E).

Figure 2: An example of two sets of explanations. Though the distributions are different, they can be obtained from one another by simply swapping the ordering of the features. This corresponds to a reflection, which is an isometric transformation.

Figure 3: (a) Identifying Grouped Explanations. The results of computing Wasserstein Globalness for a variable number of classes, on the MNIST dataset. As we explain an increasing number of digits, the explanations become more local. (b) The results of the jagged-boundary experiment. The accuracy of both the explainers and the model (left) decreases as the Wasserstein Globalness (right) becomes more varied. These results are aggregated across 15 total runs (new data each time). We show the mean ± one standard deviation of Wasserstein Globalness.

Figure 4: A visualization of the synthetic data in R 2 . Data points are illustrated as dots, with color corresponding to class label. Each of the two clusters utilizes a different feature to discriminate between the two classes. The original data is shown (left), alongside a zoomed-in visualization of the original boundary (middle), and a zoomed-in visualization of our jagged-boundary modification (right).

Figure 5: The perturbed input image is shown in the left column. The remaining columns correspond to different explanation methods, and show an explanation overlaid on the original image.

Figure 6: The Wasserstein Globalness for each explainer in our CIFAR-10 experiment (left). The results of 11 separate runs are plotted in the same color for each explainer. The bold line with markers of that color represents the mean across all 11 runs. We also show the classifier accuracy (right) to indicate the point at which the perturbations become too strong for the explainer.

Figure 7: An extended version of Figure 5.

Figure 8: Another set of explanations for an image of a deer from the CIFAR-10 dataset.

Figure 9: Another set of explanations for an image of a ship from the CIFAR-10 Dataset.

Figure 10: Another set of explanations for an image of a truck from the CIFAR-10 dataset.

Figure 11: A visualization of the perturbed images used in our CIFAR-10 experiment. These correspond to the last row in Figure 7.

G_p(R) = inf_{π∈Π(R,U)} E_{(x,y)∼π}[d^p(x, y)]^{1/p}   (26)

≤ E_{(x,y)∼λϕ_P+(1−λ)ϕ_Q}[d^p(x, y)]^{1/p}   (27)

= λ E_{(x,y)∼ϕ_P}[d^p(x, y)]^{1/p} + (1 − λ) E_{(x,y)∼ϕ_Q}[d^p(x, y)]^{1/p}   (28)

= λ G_p(P) + (1 − λ) G_p(Q)

A summary of the notation and terminology used in the paper is provided in the accompanying notation table. Algorithm 1 describes the jagged-boundary data modification shown in Figure 4.

E PROOF OF REMARK 3.3

Proof. Let ψ ∈ S be a permutation of the set of explanations. By Equation 17, a direct consequence of the definition of the push-forward measure, we have that

Σ_{x∈E} ψ_#µ(x) g(ψ_#µ(x)) = Σ_{x∈E} µ(x) g(µ(x))

for some function g. Now observe that both entropy and f-divergences, as defined in Equations 5 and 6 respectively, can be written in the form above. Entropy is obtained by directly letting g(x) = −log(x). The f-divergence between µ and U is obtained by letting g(x) = f(c/x), and letting the constant c be the uniform probability of an element x, U(x).
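This permutation-invariance is easy to verify numerically. In the sketch below, the discrete distribution, the permutation, and the choice f(t) = t log t (giving a KL-style f-divergence from the uniform measure) are all arbitrary illustrations:

```python
import numpy as np

mu = np.array([0.5, 0.25, 0.15, 0.1])   # an arbitrary distribution over 4 explanations
u_prob = 0.25                           # uniform probability of each explanation
perm = np.array([2, 0, 3, 1])           # an arbitrary permutation psi of E

def entropy(p):
    """Shannon entropy: sum_x p(x) * (-log p(x)), i.e. g(x) = -log(x)."""
    return float(-np.sum(p * np.log(p)))

def kl_to_uniform(p):
    """f-divergence from uniform with f(t) = t log t (KL divergence)."""
    return float(np.sum(p * np.log(p / u_prob)))

h, h_perm = entropy(mu), entropy(mu[perm])
d, d_perm = kl_to_uniform(mu), kl_to_uniform(mu[perm])
```

Both quantities are sums over the explanation space, so relabeling the explanations by ψ leaves them unchanged, exactly as the proof argues.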

