EVALUATION OF SIMILARITY-BASED EXPLANATIONS

Abstract

Explaining the predictions made by complex machine learning models helps users to understand and accept the predicted outputs with confidence. One promising approach is similarity-based explanation, which provides similar instances as evidence to support model predictions. Several relevance metrics are used for this purpose. In this study, we investigated which relevance metrics can provide reasonable explanations to users. Specifically, we adopted three tests to evaluate whether the relevance metrics satisfy the minimal requirements for similarity-based explanation. Our experiments revealed that the cosine similarity of the gradients of the loss performs best, which would be a recommended choice in practice. In addition, we showed that some metrics perform poorly in our tests and analyzed the reasons for their failure. We expect our insights to help practitioners select appropriate relevance metrics and to aid further research on designing better relevance metrics for explanations.

1. INTRODUCTION

Explaining the predictions made by complex machine learning models helps users understand and accept the predicted outputs with confidence (Ribeiro et al., 2016; Lundberg & Lee, 2017; Guidotti et al., 2018; Adadi & Berrada, 2018; Molnar, 2020). Instance-based explanations are a popular type of explanation that achieve this goal by presenting one or several training instances that support the predictions of a model. Several types of instance-based explanations have been proposed, such as explaining with instances similar to the instance of interest (i.e., the test instance in question) (Charpiat et al., 2019; Barshan et al., 2020); harmful instances that degrade the performance of models (Koh & Liang, 2017; Khanna et al., 2019); counter-examples that contrast how a prediction can be changed (Wachter et al., 2018); and irregular instances (Kim et al., 2016).

Among these, we focus on the first one: the type of explanation that gives one or several training instances that are similar to the test instance in question, together with the corresponding model predictions. We refer to this type of instance-based explanation as similarity-based explanation. A similarity-based explanation is of the form "I (the model) think this image is cat because similar images I saw in the past were also cat." This type of explanation is analogous to the way humans make decisions by referring to their prior experiences (Klein & Calderwood, 1988; Klein, 1989; Read & Cesa, 1991). Hence, it tends to be easy to understand even for users with little expertise in machine learning. A report stated that with this type of explanation, users tend to have higher confidence in model predictions compared to explanations that present contributing features (Cunningham et al., 2003).

In the instance-based explanation paradigm, including similarity-based explanation, a relevance metric R(z, z') ∈ R is typically used to quantify the relationship between two instances, z = (x, y) and z' = (x', y').
Definition 1 (Instance-based Explanation Using Relevance Metric). Let D = {z_train^(i) = (x_train^(i), y_train^(i))}_{i=1}^N be a set of training instances and x_test be a test input of interest whose predicted output is given by y_test = f(x_test) with a predictive model f. An instance-based explanation method gives the most relevant training instance z̄ ∈ D to the test instance z_test = (x_test, y_test) by z̄ = arg max_{z_train ∈ D} R(z_test, z_train) using a relevance metric R(z_test, z_train).

Table 1: The relevance metrics and their evaluation results. For the model randomization test, the results that passed the test are colored. For the identical class test and identical subclass test, the results with the five highest average evaluation scores are colored. The details of the relevance metrics, the evaluation criteria, and the evaluation procedures can be found in Sections 1.2, 3, and 4, respectively.

An immediate critical question is which relevance metric is appropriate for which type of instance-based explanation. There is no doubt that different types of explanations require different metrics. Despite its potential importance, however, this question has been little explored. Given this background, in this study, we focused on similarity-based explanation and investigated its appropriate relevance metrics through comprehensive experiments.

Contributions We provide the first answer to the question of which relevance metrics have desirable properties for similarity-based explanation. For this purpose, we propose three minimal requirement tests to evaluate various relevance metrics in terms of their appropriateness. The first test is the model randomization test originally proposed by Adebayo et al. (2018) for evaluating saliency-based methods; the other two tests, the identical class test and identical subclass test, are newly designed in this study.
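Definition 1 can be sketched in a few lines of code. This is a minimal illustration: the relevance metric used here (negative squared Euclidean distance between inputs) is a placeholder assumption, not one of the metrics evaluated in this paper, and the data are made up.

```python
# Sketch of Definition 1: select the most relevant training instance for a
# test instance under a relevance metric R.
def most_relevant(train_set, z_test, R):
    """Return arg max over z_train in D of R(z_test, z_train)."""
    return max(train_set, key=lambda z_train: R(z_test, z_train))

def R_neg_dist(z_test, z_train):
    # Placeholder metric: negative squared Euclidean distance between inputs.
    (x_test, _), (x_train, _) = z_test, z_train
    return -sum((a - b) ** 2 for a, b in zip(x_test, x_train))

# Instances are (input, label) pairs; the values are illustrative.
train = [((0.0, 0.0), "cat"), ((1.0, 1.0), "dog"), ((0.1, 0.0), "cat")]
z_test = ((0.08, 0.0), "cat")
print(most_relevant(train, z_test, R_neg_dist))  # ((0.1, 0.0), 'cat')
```

Any of the metrics discussed below can be plugged in as `R` without changing the selection logic.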
As summarized in Table 1, our experiments revealed that (i) the cosine similarity of gradients performs best, which is probably a recommended choice for similarity-based explanation in practice, and (ii) some relevance metrics demonstrated poor performance on the identical class and identical subclass tests, indicating that their use should be deprecated for similarity-based explanation. We also analyzed the reasons behind the success and failure of the metrics. We expect these insights to help practitioners in selecting appropriate relevance metrics.

1.2. RELEVANCE METRICS

We present an overview of the two types of relevance metrics considered in this study, namely similarity metrics and gradient-based metrics. To the best of our knowledge, all major relevance metrics fall into one of these two families; full definitions are given in Appendix A. For example, Grad-Cos is defined as

• GC: R_GC(z, z') := cos(g_θ^z, g_θ^{z'}),

where g_θ^z := ∇_θ ℓ(z; θ̂) denotes the loss gradient, and H and I are the Hessian and Fisher information matrices of the loss L_train, respectively, which appear in some of the other gradient-based metrics.

2. RELATED WORK

Model-specific Explanation Aside from the relevance metrics, there is another approach for similarity-based explanation that uses specific models that can provide explanations by their design (Kim et al., 2014; Plötz & Roth, 2018; Chen et al., 2019). We set aside these specific models and focus on generic relevance metrics because of their applicability to a wide range of problems.

Evaluation of Metrics for Improving Classification Accuracy

In several machine learning problems, the metrics between instances play an essential role. For example, the distance between instances is essential for distance-based methods such as nearest neighbor methods (Hastie et al., 2009) . Another example is kernel models where the kernel function represents the relationship between two instances (Schölkopf et al., 2002) . Several studies have evaluated the desirable metrics for specific tasks (Hussain et al., 2011; Hu et al., 2016; Li & Li, 2018; Abu Alfeilat et al., 2019) . These studies aimed to find metrics that could improve the classification accuracy. Different from these evaluations based on accuracy, our goal in this study is to evaluate the validity of relevance metrics for similarity-based explanation; thus, the findings in these previous studies are not directly applicable to our goal.

Evaluation of Explanations

There are a variety of desiderata argued as requirements for explanations, such as faithfulness (Adebayo et al., 2018; Lakkaraju et al., 2019; Jacovi & Goldberg, 2020) , plausibility (Lei et al., 2016; Lage et al., 2019; Strout et al., 2019) , robustness (Alvarez-Melis & Jaakkola, 2018) , and readability (Wang & Rudin, 2015; Yang et al., 2017; Angelino et al., 2017) . It is important to evaluate the existing explanation methods considering these requirements. However, there is no standard test established for evaluating these requirements, and designing such tests still remains an open problem (Doshi-Velez & Kim, 2017; Jacovi & Goldberg, 2020) . In this study, as the first empirical study for evaluating the existing relevance metrics for similarity-based explanation, we take an alternative approach by designing minimal requirement tests for two primary requirements, namely faithfulness and plausibility. With this alternative approach, we can avoid the difficulty of directly evaluating these primary requirements.

3. EVALUATION CRITERIA FOR SIMILARITY-BASED EXPLANATION

This study aims to investigate the relevance metrics with desirable properties for similarity-based explanation. In this section, we propose three tests to evaluate whether the relevance metrics satisfy the minimal requirements for similarity-based explanation. If a relevance metric fails one of the tests, we can conclude that the metric does not meet the minimal requirements; thus, its use would be deprecated. The first test (model randomization test) assesses whether each relevance metric satisfies the minimal requirement for the faithfulness of explanation, which requires that an explanation of a model prediction reflect the underlying inference process (Adebayo et al., 2018; Lakkaraju et al., 2019; Jacovi & Goldberg, 2020). The latter two tests (identical class and identical subclass tests) are designed to assess relevance metrics in terms of the plausibility of the explanations they produce (Lei et al., 2016; Lage et al., 2019; Strout et al., 2019), which requires explanations to be sufficiently convincing to users.

3.1. MODEL RANDOMIZATION TEST

Explanations that are irrelevant to a model should be avoided because such fake explanations can mislead users. Thus, any valid relevance metric should be model-dependent, which constitutes the first requirement. We use the model randomization test of Adebayo et al. (2018) to assess whether a given relevance metric satisfies a minimal requirement for faithfulness. If a relevance metric produces almost the same explanations for the same inputs on two models with different inference processes, it is likely to ignore the underlying model, i.e., the metric is independent of the model. Thus, we can evaluate whether the metric is model-dependent by comparing explanations from two different models. In the test, a typical choice of the models is a well-trained model that can predict the output well and a randomly initialized model that can make only poor predictions. These two models have different inference processes; hence, their explanations should be different.

Definition 2 (Model Randomization Test). Let π_f be the permutation that sorts the training instances by their relevance to the test instance z_test under model f, i.e., R(z_test, z_train^(π_f(1))) ≥ R(z_test, z_train^(π_f(2))) ≥ … ≥ R(z_test, z_train^(π_f(N))). We define π_{f_rand} accordingly for a randomly initialized model f_rand. Then, we require π_f and π_{f_rand} to have a small rank correlation. If relevance metric R is independent of the model, it produces the same permutation for both f and f_rand, and their rank correlation becomes one. If the rank correlation is significantly smaller than one and close to zero, we can confirm that the relevance metric is model-dependent.
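The test statistic above can be sketched as follows. This is a minimal illustration with made-up relevance scores; the Spearman correlation is implemented directly using the no-ties formula rather than a library call.

```python
# Sketch of the model randomization test: rank all training instances by
# relevance under a trained model and under a randomly initialized model,
# then compare the two rankings with Spearman's rank correlation.
def spearman(scores_a, scores_b):
    """Spearman rank correlation (no-ties formula) of two score lists."""
    def ranks(scores):
        order = sorted(range(len(scores)), key=lambda i: -scores[i])
        r = [0] * len(scores)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    d2 = sum((a - b) ** 2 for a, b in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

trained = [0.9, 0.1, 0.5, 0.3]     # R(z_test, z_train) under trained f (made up)
randomized = [0.2, 0.8, 0.4, 0.6]  # same scores under randomized f_rand (made up)
rho = spearman(trained, randomized)
# A correlation near one suggests the metric ignores the model;
# a correlation near zero suggests it is model-dependent.
```

In the actual test, `trained` and `randomized` would hold the relevance scores of all N training instances under f and f_rand.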

3.2. IDENTICAL CLASS TEST

The second minimal requirement is that the raised similar instance should belong to the same class as the test instance, as shown in Figure 1. The violation of this requirement leads to nonsensical explanations such as "I think this image is cat because a similar image I saw in the past was dog." in Figure 1. When users encounter such explanations, they might question the validity of model predictions and ignore the predictions even if the underlying model is valid. This observation leads to the identical class test below.

Definition 3 (Identical Class Test). We require that the most similar (relevant) instance for a test instance z_test = (x_test, y_test) is a training instance of the same class as the test instance:

arg max_{z=(x,y) ∈ D} R(z_test, z) = (x̄, ȳ) ⟹ ȳ = y_test. (1)

Although this test may look trivial, some relevance metrics do not satisfy this minimal requirement, as demonstrated in Section 4.2.
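The success rate used to score this test can be sketched as follows. The data and the distance-based metric here are illustrative assumptions, not the paper's experimental setup.

```python
# Sketch of the identical class test (Definition 3): the fraction of test
# instances whose most relevant training instance shares their class label.
def R_neg_dist(z_test, z_train):
    # Placeholder metric: negative squared Euclidean distance between inputs.
    (x_t, _), (x_s, _) = z_test, z_train
    return -sum((a - b) ** 2 for a, b in zip(x_t, x_s))

def identical_class_success_rate(test_set, train_set, R):
    hits = 0
    for z_test in test_set:
        z_bar = max(train_set, key=lambda z: R(z_test, z))
        hits += (z_bar[1] == z_test[1])  # do the class labels agree?
    return hits / len(test_set)

train = [((0.0, 0.0), "cat"), ((1.0, 1.0), "dog")]
tests = [((0.1, 0.1), "cat"), ((0.9, 0.9), "dog"), ((0.4, 0.4), "dog")]
rate = identical_class_success_rate(tests, train, R_neg_dist)  # 2/3
```

The third test instance fails here because its nearest training instance has a different label, illustrating how a metric can violate the requirement even when it looks sensible.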

3.3. IDENTICAL SUBCLASS TEST

The third minimal requirement is that the raised similar instance should belong to the same subclass as the test instance when the classes consist of latent subclasses, as shown in Figure 2. For example, consider the problem of classifying CIFAR10 images into two classes, i.e., animal and vehicle. The animal class consists of images from subclasses such as cat and frog, while the vehicle class consists of images from subclasses such as airplane and automobile. Under the presence of subclasses, the violation of this requirement leads to nonsensical explanations such as "I think this image (cat) is animal because a similar image (frog) I saw in the past was also animal." in Figure 2. This observation leads to the identical subclass test below.

Definition 4 (Identical Subclass Test). Let s(z) be the subclass for class y of an instance z = (x, y). We require that the most similar (relevant) instance for a test instance z_test = (x_test, y_test) is a training instance of the same subclass as the test instance, under the assumption that the prediction on the test instance is correct:

arg max_{z ∈ D} R(z_test, z) = z̄ ⟹ s(z̄) = s(z_test). (2)

In the experiments, we used modified datasets: we split each dataset into two new classes (A and B) by randomly assigning the existing classes to either class. The two new classes then contain the original data classes as subclasses that are mutually exclusive and collectively exhaustive, which can be used for the identical subclass test.
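The dataset modification described above can be sketched as follows. The class names are illustrative; the key point is that the original labels survive as subclasses s(z) while a random split produces the new superclasses.

```python
# Sketch of the dataset modification for the identical subclass test:
# randomly assign the original classes to two new superclasses A and B,
# keeping the original labels as subclasses.
import random

original_classes = ["cat", "frog", "airplane", "automobile"]
rng = random.Random(0)  # fixed seed for reproducibility
shuffled = list(original_classes)
rng.shuffle(shuffled)
half = len(shuffled) // 2
assignment = {c: ("A" if c in shuffled[:half] else "B") for c in original_classes}

def relabel(dataset):
    """Map (x, y) to (x, superclass, subclass): the original label is s(z)."""
    return [(x, assignment[y], y) for x, y in dataset]
```

By construction, the subclasses within each superclass are mutually exclusive and collectively exhaustive, matching the requirement stated above.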

3.4. DISCUSSIONS ON VALIDITY OF CRITERIA

Here, we discuss the validity of the new criteria, i.e., the identical class and identical subclass tests.

Why do relevance metrics that cannot pass these tests matter? Dietvorst et al. (2015) revealed a bias in humans, called algorithm aversion, which states that people tend to ignore an algorithm if it makes errors. It should be noted that explanations that do not satisfy the identical class test or identical subclass test appear to be logically broken, as shown in Figures 1 and 2. Given such logically broken explanations, users will consider that the models are making errors, even if they are making accurate predictions. Eventually, the users will start to ignore the models.

Is the identical subclass test necessary? This is an essential requirement for ensuring that the explanations are plausible to any user. Some users may not consider explanations that violate the identical subclass test to be logically broken. For example, some users may find a frog to be an appropriate explanation for a cat being animal by inferring a taxonomy of the classes (e.g., both have eyes). However, we cannot expect all users to infer the same taxonomy. Therefore, if there is a discrepancy between the explanation and the taxonomy inferred by a user, the user will consider the explanation implausible. To make explanations plausible to any user, instances of the same subclass need to be provided.

Is random class assignment in the identical subclass test appropriate? We adopted random assignment to evaluate the performance of each metric independently of the underlying taxonomy. If a specific taxonomy were considered in the evaluations, a metric that performed well with it would be highly valued. Random assignment eliminates such effects, so we can purely measure the performance of the metrics themselves.

Do classification models actually recognize subclasses? Is the identical subclass test suitable for evaluating the explanations of predictions made by practical models? It is true that if a model ignores subclasses in its training and inference processes, any explanation will fail the test. We conducted simple preliminary experiments and confirmed that the practical classification models used in this study capture the subclasses. See Appendix E for further details.

4. EVALUATION RESULTS

Here, we examine the validity of relevance metrics with respect to the three minimal requirements. For this evaluation, we used two image datasets (MNIST (LeCun et al., 1998), CIFAR10 (Krizhevsky, 2009)), two text datasets (TREC (Li & Roth, 2002), AGNews (Zhang et al., 2015)), and two table datasets (Vehicle (Dua & Graff, 2017), Segment (Dua & Graff, 2017)). As benchmarks, we employed logistic regression and deep neural networks trained on these datasets. Details of the datasets, models, and computing infrastructure used in this study are provided in Appendix B.

Procedure We repeated the following procedure 10 times for each evaluation test.
1. Train a model using a subset of training instances. Then, randomly sample 500 test instances from the test set.
2. For each test instance, compute the relevance score for all instances used for training.

3. (a) For the model randomization test, compute the Spearman rank correlation coefficients between the relevance scores from the trained model and the relevance scores from the randomized model. (b) For the identical class and identical subclass tests, compute the success rate, which is the ratio of test instances that passed the test.

In this section, we mainly present the results for CIFAR10 with CNN and AGNews with Bi-LSTM. The other results were similar and can be found in Appendix F.

Result Summary

We summarize the main results before discussing individual results.
• ℓ2_last, cos_last, and the gradient-based metrics scored low correlations in the model randomization test for all datasets and models, indicating that they are model-dependent.
• GC performed the best in most of the identical class and identical subclass tests; thus, GC would be the recommended choice in practice.
• The dot metrics as well as IF, FK, and GD performed poorly on the identical class test and identical subclass test.
In Section 5, we analyze why some relevance metrics succeed or fail in the identical class and identical subclass tests.

4.1. RESULT OF MODEL RANDOMIZATION TEST

Figure 3 shows the Spearman rank correlation coefficients for the model randomization test. The similarities with the identity feature map, i.e., ℓ2_x, cos_x, and dot_x, are irrelevant to the model, and their correlations are trivially one. In the figures, the other metrics scored correlations close to zero, indicating that they are model-dependent. However, the correlations of ℓ2_all, cos_all, and dot_last were observed to be more than 0.7 on the MNIST and Vehicle datasets (see Appendix F). Therefore, we conclude that these relevance metrics failed the model randomization test because they can raise instances irrelevant to the model for some datasets.

4.2. RESULTS OF IDENTICAL CLASS AND IDENTICAL SUBCLASS TESTS

Figure 4 depicts the success rates for the identical class and identical subclass tests. We also summarize the average success rates of our experiments in Table 1. It is noteworthy that GC performed consistently well on the identical class and identical subclass tests for all the datasets and models used in the experiment (see Appendix F). In contrast, some relevance metrics, namely the dot metrics as well as IF, FK, and GD, performed poorly on both tests. The reasons for their failure are discussed in the next section. To conclude, the results of our evaluations indicate that only GC performed well on all tests. That is, only GC seems to meet the minimal requirements; thus, it would be a recommended choice for similarity-based explanation.

5. WHY SOME METRICS ARE SUCCESSFUL AND WHY SOME ARE NOT

We observed that the dot metrics and the gradient-based metrics IF, FK, and GD failed the identical class and identical subclass tests, whereas GC exhibited remarkable performance. Here, we analyze the reasons why the former metrics failed while GC performed well. In Appendix D, we also discuss a way to repair IF, FK, and GD to improve their performance based on the findings in this section.

Given a criterion, let z_train^(i) be a desirable instance for a test instance z_test. The failure of a dot metric indicates the existence of an undesirable instance z_train^(j) such that ⟨φ(z_test), φ(z_train^(i))⟩ < ⟨φ(z_test), φ(z_train^(j))⟩. The following sufficient condition for z_train^(j) is useful for understanding the failure:

‖φ(z_train^(i))‖ < ‖φ(z_train^(j))‖ cos(φ(z_test), φ(z_train^(j))). (3)

The condition implies that any instance with an extremely large norm and a cosine only slightly larger than zero can be a candidate for z_train^(j). In our experiments, we observed that the condition on the norm is especially crucial. As shown in Figure 5, even though instances with significantly large norms were scarce, only such extreme instances were selected as relevant instances by IF, FK, and GD. This indicates that these metrics tend to consider such extreme instances as relevant. In contrast, GC was not attracted by large norms because it completely cancels the norms through normalization. Figure 6 shows some training instances frequently selected in the identical class test on CIFAR10 with CNN. When using IF, FK, and GD, these training instances were frequently selected irrespective of their classes because they had large norms; in these metrics, the term cos(φ(z_test), φ(z_train)) seems to have a negligible effect. In contrast, GC successfully selected the instances of the same class and ignored those with large norms.
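Condition (3) can be checked numerically. In this minimal illustration the feature vectors are made up: one is well aligned with the test instance but has a modest norm, while the other is barely aligned but has a huge norm.

```python
# Numeric illustration of sufficient condition (3): if ||phi(z_i)|| <
# ||phi(z_j)|| * cos(phi(z_test), phi(z_j)), then z_j beats z_i under any
# dot-product metric, regardless of how well z_i is aligned.
import math

def dot(a, b): return sum(x * y for x, y in zip(a, b))
def norm(a): return math.sqrt(dot(a, a))
def cos(a, b): return dot(a, b) / (norm(a) * norm(b))

phi_test = [1.0, 0.0]
phi_i = [0.9, 0.1]      # desirable: well aligned, modest norm
phi_j = [30.0, 100.0]   # undesirable: cosine barely above zero, huge norm

# Condition (3) holds ...
assert norm(phi_i) < norm(phi_j) * cos(phi_test, phi_j)
# ... and indeed the dot metric prefers the large-norm instance.
assert dot(phi_test, phi_j) > dot(phi_test, phi_i)
```

A cosine-based metric such as GC would rank `phi_i` first here, since normalization removes the norm advantage of `phi_j`.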

Success of GC

We now analyze why GC performed well, specifically in the identical class test. To simplify the discussion, we consider linear logistic regression whose conditional distribution p(y | x; θ) is given by the y-th entry of σ(Wx), where σ is the softmax function, θ = W ∈ R^{C×d}, and C and d denote the number of classes and the dimensionality of x, respectively. With some algebra, we obtain R_GC(z, z') = cos(r_z, r_{z'}) cos(x, x') for z = (x, y) and z' = (x', y'), where r_z = σ(Wx) − e_y is the residual for the prediction on z, and e_y is a vector whose y-th entry is one and zero otherwise. See Appendix C for the derivation. Here, the term cos(r_z, r_{z'}) plays an essential role in GC. By definition, (r_z)_c ≤ 0 if c = y, and (r_z)_c ≥ 0 otherwise. Thus, cos(r_z, r_{z'}) ≥ 0 always holds when y = y', while cos(r_z, r_{z'}) can be negative for y ≠ y'. Hence, the chance of R_GC(z, z') being positive is larger for instances from the same class than for those from a different class. Figure 7 shows that cos(r_{z_test}, r_{z_train}) is essential also for deep neural networks. Here, for each test instance z_test on CIFAR10 with CNN, we randomly sampled two training instances z_train (one with the same class and the other with a different class) and computed R_GC(z_test, z_train) and cos(r_{z_test}, r_{z_train}). We also note that cos(r_{z_test}, r_{z_train}) alone was not helpful for the identical subclass test, whose success rate was around the chance level. We thus conjecture that while cos(r_{z_test}, r_{z_train}) is particularly helpful for the identical class test, the use of the entire gradient is still essential for GC to work effectively.
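The factorization R_GC(z, z') = cos(r_z, r_{z'}) cos(x, x') for linear logistic regression can be verified numerically. The weight matrix, inputs, and labels below are arbitrary made-up values; the identity should hold for any of them up to floating-point rounding.

```python
# Numerical check of R_GC(z, z') = cos(r_z, r_{z'}) * cos(x, x') for linear
# logistic regression, where r_z = softmax(W x) - e_y.
import math

def softmax(v):
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def dot(a, b): return sum(p * q for p, q in zip(a, b))
def cos(a, b): return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

def grad_and_residual(W, x, y):
    """Flattened gradient of the cross-entropy loss w.r.t. W, and r_z."""
    p = softmax(matvec(W, x))
    r = [p_c - (1.0 if c == y else 0.0) for c, p_c in enumerate(p)]
    g = [r_c * x_i for r_c in r for x_i in x]  # row c of grad is (r_z)_c * x
    return g, r

W = [[0.2, -0.5], [0.1, 0.4], [-0.3, 0.7]]   # 3 classes, 2 features (made up)
g1, r1 = grad_and_residual(W, [1.0, 2.0], 0)
g2, r2 = grad_and_residual(W, [0.5, -1.0], 2)

lhs = cos(g1, g2)                             # cosine of the full gradients
rhs = cos(r1, r2) * cos([1.0, 2.0], [0.5, -1.0])
assert abs(lhs - rhs) < 1e-9
```

The check works because the flattened gradient is the outer product of r_z and x, so both the inner product and the norms factor exactly as in the derivation.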

6. CONCLUSION

We investigated and determined relevance metrics that are effective for similarity-based explanation. For this purpose, we evaluated whether the metrics satisfy the minimal requirements for similarity-based explanation. In this study, we conducted three tests: the model randomization test of Adebayo et al. (2018) to evaluate whether the metrics are model-dependent, and two newly designed tests, the identical class and identical subclass tests, to evaluate whether the metrics can provide plausible explanations. Quantitative evaluations based on these tests revealed that the cosine similarity of gradients performs best, which would be a recommended choice in practice. We also observed that some relevance metrics do not meet the requirements; thus, the use of such metrics would not be appropriate for similarity-based explanation. We expect our insights to help practitioners select appropriate relevance metrics, and also to aid further research on designing better relevance metrics for instance-based explanations. Finally, we present two future directions for this study. First, the proposed criteria evaluated only limited aspects of the faithfulness and plausibility of relevance metrics; thus, it is important to investigate further criteria for more detailed evaluations. Second, in addition to similarity-based explanation, it is necessary to consider the evaluation of other explanation methods, such as counter-examples. We expect this study to be the first step toward the rigorous evaluation of instance-based explanation methods.

A GRADIENT-BASED METRICS

In gradient-based metrics, we consider a model with parameter θ, its loss ℓ(z; θ), and its gradient ∇_θ ℓ(z; θ) to measure relevance, where z = (x, y) is an input-output pair.

Influence Function (Koh & Liang, 2017) Koh & Liang (2017) proposed to measure relevance according to "how largely the test loss will increase if the training instance is omitted from the training set." Here, the model parameter trained using all of the training set is denoted by θ̂, and the parameter trained using all of the training set except the i-th instance z_train^(i) is denoted by θ̂_{-i}. The relevance metric proposed by Koh & Liang (2017) is then defined as the difference between the test losses under parameters θ̂_{-i} and θ̂:

R_IF(z_test, z_train^(i)) := ℓ(z_test; θ̂_{-i}) − ℓ(z_test; θ̂).

Here, a greater value indicates that the loss on the test instance increases drastically by removing the i-th training instance from the training set. Thus, the i-th training instance is essential for predicting the test instance and is therefore highly relevant. In practice, the following approximation is used to avoid computing θ̂_{-i} explicitly:

R_IF(z_test, z_train^(i)) ≈ ⟨∇_θ ℓ(z_test; θ̂), H^{-1} ∇_θ ℓ(z_train^(i); θ̂)⟩,

where H is the Hessian matrix of the loss L_train.

Relative IF (Barshan et al., 2020) Barshan et al. (2020) proposed to measure relevance according to "how largely the test loss will increase if the training instance is omitted from the training set under the constraint that the expected squared change in loss is sufficiently small," which is a modified version of the influence function. Relative IF is computed as the cosine similarity of φ(z) = H^{-1/2} ∇_θ ℓ(z; θ̂):

R_RIF(z_test, z_train) := cos(H^{-1/2} ∇_θ ℓ(z_test; θ̂), H^{-1/2} ∇_θ ℓ(z_train; θ̂)).

Fisher Kernel (Khanna et al., 2019) Khanna et al. (2019) proposed to measure the relevance of instances using the Fisher kernel:

R_FK(z_test, z_train^(i)) := ⟨∇_θ ℓ(z_test; θ̂), I^{-1} ∇_θ ℓ(z_train^(i); θ̂)⟩,

where I is the Fisher information matrix of the loss L_train.

Grad-Dot, Grad-Cos (Perronnin et al., 2010; Yeh et al., 2018; Charpiat et al., 2019) Charpiat et al. (2019) proposed to measure relevance according to "how largely the loss will decrease when a small update is added to the model using the training instance." This can be computed as the dot product of the loss gradients, which we refer to as Grad-Dot:

R_GD(z_test, z_train) := ⟨∇_θ ℓ(z_test; θ̂), ∇_θ ℓ(z_train; θ̂)⟩.

Note that a similar metric was studied by Yeh et al. (2018) as the representer point value. As a modification of Grad-Dot, Charpiat et al. (2019) also proposed the following cosine version, which we refer to as Grad-Cos:

R_GC(z_test, z_train) := cos(∇_θ ℓ(z_test; θ̂), ∇_θ ℓ(z_train; θ̂)).

Note that the use of the cosine between the gradients was also proposed by Perronnin et al. (2010).
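The gradient-based metrics above can be sketched given precomputed loss gradients and (approximate) curvature matrices. This is an illustrative sketch using NumPy, assuming H and I are symmetric positive definite; in practice these matrices are far too large to invert directly and are approximated.

```python
# Sketch of the gradient-based metrics given loss gradients g and curvature
# matrices H (Hessian) and I (Fisher information).
import numpy as np

def gd(g_test, g_train):                      # Grad-Dot
    return float(g_test @ g_train)

def gc(g_test, g_train):                      # Grad-Cos
    return float(g_test @ g_train
                 / (np.linalg.norm(g_test) * np.linalg.norm(g_train)))

def influence(g_test, g_train, H):            # IF (approximation)
    return float(g_test @ np.linalg.solve(H, g_train))

def fisher_kernel(g_test, g_train, I):        # FK
    return float(g_test @ np.linalg.solve(I, g_train))

def relative_if(g_test, g_train, H):          # RIF: cosine of whitened gradients
    L = np.linalg.cholesky(np.linalg.inv(H))  # H^{-1} = L L^T
    a, b = L.T @ g_test, L.T @ g_train
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

As a sanity check, with H equal to the identity, `influence` reduces to Grad-Dot and `relative_if` reduces to Grad-Cos.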

B EXPERIMENTAL SETUP B.1 DATASETS AND MODELS

MNIST (LeCun et al., 1998) The MNIST dataset is used for handwritten digit image classification tasks. Here, the input x is an image of a handwritten digit, and the output y consists of 10 classes ("0" to "9").

C DERIVATION OF R_GC FOR LOGISTIC REGRESSION

The loss of linear logistic regression is given as ℓ(z; θ) = −Σ_{c=1}^C y_c ⟨w_c, x⟩ + log Σ_{c'=1}^C exp(⟨w_{c'}, x⟩), where W = [w_1, w_2, …, w_C]^T. Let e_y be a vector whose y-th entry is one and zero otherwise. Then, the gradient of the loss with respect to w_c can be expressed as ∇_{w_c} ℓ(z; θ) = (σ(Wx) − e_y)_c x = (r_z)_c x, where r_z = σ(Wx) − e_y is the residual for the prediction on z. Hence, we have

⟨∇_θ ℓ(z; θ), ∇_θ ℓ(z'; θ)⟩ = Σ_{c=1}^C ⟨∇_{w_c} ℓ(z; θ), ∇_{w_c} ℓ(z'; θ)⟩ (12)
= Σ_{c=1}^C (r_z)_c (r_{z'})_c ⟨x, x'⟩ = ⟨r_z, r_{z'}⟩ ⟨x, x'⟩,

which yields

R_GC(z, z') = ⟨r_z, r_{z'}⟩ ⟨x, x'⟩ / (‖r_z‖ ‖x‖ ‖r_{z'}‖ ‖x'‖) (15)
= cos(r_z, r_{z'}) cos(x, x').

D REPAIRING GRADIENT-BASED METRICS

As described in Section 5, we found that training instances with extremely large norms were selected as relevant by IF, FK, and GD. Thus, to repair these metrics, we need to design metrics that can ignore instances with large norms. A simple yet effective way of repairing the metrics is to use the ℓ2 distance or cosine similarity instead of the dot product. As Figure 4 shows, the ℓ2 and cosine metrics performed better than the dot metrics. Indeed, the ℓ2 metrics do not favor instances with large norms, which lead to large ℓ2 distances, and, through normalization, the cosine metrics completely cancel the effect of the norms. We name the repaired metrics of IF, FK, and GD based on the ℓ2 metric IF_ℓ2, FK_ℓ2, and GD_ℓ2, respectively, and the repaired metrics based on the cosine metric cos_IF, cos_FK, and cos_GD, respectively. We observed that these repaired metrics attained higher success rates on several evaluation criteria. The details of the results can be found in Appendix F.
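The repair amounts to swapping the scoring function while keeping the feature map of each metric. This is a minimal sketch; `u` and `v` stand for feature maps such as φ(z) = H^{-1/2}∇_θℓ(z; θ̂), and the example vectors are made up.

```python
# Sketch of the repaired metrics: keep the feature map of IF, FK, or GD, but
# score with negative L2 distance or cosine instead of the dot product, so
# large-norm instances no longer dominate.
import math

def dot(a, b): return sum(p * q for p, q in zip(a, b))
def norm(a): return math.sqrt(dot(a, a))

def score_dot(u, v):                 # original IF/FK/GD form
    return dot(u, v)

def score_l2(u, v):                  # e.g. IF_l2, FK_l2, GD_l2
    return -norm([p - q for p, q in zip(u, v)])

def score_cos(u, v):                 # e.g. cos_IF (= RIF), cos_GD (= GC)
    return dot(u, v) / (norm(u) * norm(v))

phi_test = [1.0, 0.0]
aligned = [0.9, 0.1]                 # same direction, modest norm
huge = [30.0, 100.0]                 # barely aligned, huge norm
assert score_dot(phi_test, huge) > score_dot(phi_test, aligned)
assert score_cos(phi_test, aligned) > score_cos(phi_test, huge)
assert score_l2(phi_test, aligned) > score_l2(phi_test, huge)
```

The assertions show the failure mode and its repair on the same pair of vectors: the dot score prefers the large-norm instance, while both repaired scores prefer the aligned one.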

E DO THE MODELS CAPTURE SUBCLASSES?

The identical subclass test requires the model to obtain internal representations that can distinguish subclasses. Here, we confirm that this condition is satisfied for all the datasets and models we used in the experiments. We consider that the model captures the subclasses if the latent representation h_all has cluster structures. Figure 9 visualizes h_all for each dataset and model using UMAP (McInnes et al., 2018). The figures show that the instances from different subclasses are not completely randomly mixed. MNIST and TREC have relatively clear cluster structures, while CIFAR10 and AGNews have vague clusters without explicit boundaries. These figures imply that the models capture subclasses (although perhaps not perfectly).

F.1 FULL RESULTS

We show the complete results of the model randomization test in Table 2 , the identical class test in Table 3 , and the identical subclass test in Table 4 . The results we present here are consistent with our observations in Section 4. 

F.2 ADDITIONAL RESULTS

The identical class test requires the most relevant instance to be of the same class as the test instance. In practice, users can be more confident about a model's output if several instances are provided as evidence. In other words, we expect the most relevant and the first few relevant instances to be of the same class. This observation leads to an additional criterion that generalizes the identical class test. We show the results of the top-10 identical class test in Table 3 and the top-10 identical subclass test in Table 4.

G EXAMPLES OF EACH EXPLANATION METHOD

We show some examples of the relevant instances using several relevance metrics on CIFAR10 with CNN in Figure 10 and Figure 11 and on AGNews with LSTM in Table 7 and Table 8 . We show examples of both correct (in Figure 10 and Table 7 ) and incorrect (in Figure 11 and Table 8 ) predictions. As mentioned in Section 5, the relevance metrics based on the dot product of the gradient, such as IF, FK, and GD, tend to select instances with large norms, and therefore we can see that non-typical instances have been selected. 



Footnotes
• Our implementation is available at https://github.com/k-hanawa/criteria_for_instance_based_explanation
• We require correct predictions in this test because the subclass does not match for incorrect cases.
• We randomly sampled 10% of MNIST and CIFAR10; 50% of TREC, Vehicle, and Segment; and 5% of AGNews.
• For the identical subclass test, we sampled instances with correct predictions only.
• https://sites.google.com/view/mimaizumi/event/mlcamp2018
• This metric is called -RelatIF by Barshan et al. (2020).
• Note that cos_IF is the same as RIF and cos_GD is the same as GC.



Notations. For vectors $a, b \in \mathbb{R}^p$, we denote the dot product by $\langle a, b \rangle := \sum_{i=1}^{p} a_i b_i$, the $\ell_2$ norm by $\|a\| := \sqrt{\langle a, a \rangle}$, and the cosine similarity by $\cos(a, b) := \langle a, b \rangle / (\|a\| \|b\|)$.

Classification Problem. We consider a standard classification problem as the evaluation benchmark, which is the most actively explored application of instance-based explanations. The model is the conditional probability $p(y \mid x; \theta)$ with parameter $\theta$. Let $\hat{\theta}$ be a trained parameter, $\hat{\theta} = \arg\min_\theta L_{\mathrm{train}}(\theta)$ with $L_{\mathrm{train}}(\theta) := \frac{1}{n} \sum_{i=1}^{n} \ell(z_i; \theta)$, where the loss function is the cross entropy $\ell(z; \theta) = -\log p(y \mid x; \theta)$ for an input-output pair $z = (x, y)$. The model classifies a test input $x_{\mathrm{test}}$ by assigning the class with the highest probability, $y_{\mathrm{test}} = \arg\max_y p(y \mid x_{\mathrm{test}}; \hat{\theta})$.
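These notations translate directly into code (a minimal numpy sketch, for reference only):

```python
import numpy as np

def dot(a, b):
    """<a, b> = sum_i a_i * b_i"""
    return float(np.sum(a * b))

def norm(a):
    """||a|| = sqrt(<a, a>)  (the l2 norm)"""
    return float(np.sqrt(dot(a, a)))

def cos(a, b):
    """cos(a, b) = <a, b> / (||a|| * ||b||)"""
    return dot(a, b) / (norm(a) * norm(b))

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])
print(norm(a))    # 5.0
print(cos(a, b))  # 24 / 25 = 0.96
```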

Figure 1: Valid (✓) and invalid (✗) examples for the identical class test.

Figure 4: Results of the identical class test and identical subclass test.

Figure 5: Distributions of norms of the feature maps of all training instances (colored) and the instances selected by the identical class test (meshed) on CIFAR10 with CNN.

Figure 6: Training instances frequently selected in the identical class test with multiple test instances on CIFAR10 with CNN, the cosine between them, and the norm of the training instances.

Failure of Dot Metrics and Gradient-based Metrics. To understand the failure, we reformulate IF, FK, and GD as dot metrics of the form $R_{\mathrm{dot}}(z_{\mathrm{test}}, z_{\mathrm{train}}) = \langle \phi(z_{\mathrm{test}}), \phi(z_{\mathrm{train}}) \rangle$, so that the following discussion is valid for any relevance metric of this form. It is evident that IF, FK, and GD can be expressed in this form by defining the feature maps as $\phi(z) = H^{-1/2} g(z; \hat{\theta})$, $\phi(z) = I^{-1/2} g(z; \hat{\theta})$, and $\phi(z) = g(z; \hat{\theta})$, respectively.
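The norm bias of such dot metrics can be illustrated with synthetic feature maps (all data below are hypothetical; this is a sketch of the phenomenon, not the paper's experiment). A single large-norm training instance dominates the dot metric for roughly half of random test inputs, while the cosine metric is unaffected:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical feature maps phi(z_train) for 100 training instances;
# instance 0 is an atypical instance with an unusually large norm
phi_train = rng.normal(size=(100, 16))
phi_train[0] *= 50.0
train_norms = np.linalg.norm(phi_train, axis=1)

n_top_dot = 0  # how often the outlier is most relevant under the dot metric
n_top_cos = 0  # ... and under the cosine metric
for _ in range(200):
    phi_test = rng.normal(size=16)
    dot_scores = phi_train @ phi_test  # R_dot(z_test, z_train)
    cos_scores = dot_scores / (train_norms * np.linalg.norm(phi_test))
    n_top_dot += int(np.argmax(dot_scores) == 0)
    n_top_cos += int(np.argmax(cos_scores) == 0)

# the dot metric selects the large-norm outlier far more often than cosine
print(n_top_dot > 50)  # True
print(n_top_cos < 20)  # True
```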

Figure 7: Distributions of R_GC(z_test, z_train) and cos(r_{z_test}, r_{z_train}) for training instances with the same / different classes on CIFAR10 with CNN.

Figure 8: Panels: MNIST with CNN (y = B); CIFAR10 with MobileNetV2 (y = A and y = B); TREC with LSTM (y = A); AGNews with LSTM (y = B).

Figure 9: Visualization of h_all for each dataset and model using UMAP.

Figure 10: Relevant instances selected for random test inputs with correct prediction using several relevance metrics on CIFAR10 with CNN.

Definition 2 (Model Randomization Test). Let R denote the relevance metric of interest. Let f and f_rand be a well-trained model and a randomly initialized model, respectively. For given R, f, and test instance z_test = (x_test, y_test), let π_f be the permutation of the indices of the training instances that sorts them by relevance to the test instance, i.e., $R(z_{\mathrm{test}}, z_{\pi_f(1)}) \ge R(z_{\mathrm{test}}, z_{\pi_f(2)}) \ge \cdots \ge R(z_{\mathrm{test}}, z_{\pi_f(n)})$. The test requires the rank correlation between π_f and π_{f_rand} to be close to zero.
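As an illustrative sketch (not the authors' code), the test statistic is a Spearman rank correlation between the two orderings; here `scores_trained` and `scores_random` stand in for the relevance scores R(z_test, ·) under f and f_rand:

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation (assuming no ties): the Pearson
    correlation of the rank vectors of x and y."""
    rx = np.argsort(np.argsort(x)).astype(float)  # ranks of x
    ry = np.argsort(np.argsort(y)).astype(float)  # ranks of y
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

rng = np.random.default_rng(0)
# hypothetical relevance scores from a trained and a randomized model;
# independent scores should give a correlation near zero
scores_trained = rng.normal(size=500)
scores_random = rng.normal(size=500)

print(abs(spearman(scores_trained, scores_random)) < 0.2)  # passes the test
print(abs(spearman(scores_trained, scores_trained) - 1.0) < 1e-12)  # identical ranking
```

A metric whose ranking is unchanged by randomizing the model (correlation far from zero) does not depend on what the model has learned, which is why correlations close to zero are ideal.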

Result of the model randomization test. Correlations close to zero are ideal.

Average Spearman rank correlation coefficients ± std. of each relevance metric for the model randomization test. The metrics prefixed with ♦ are the ones we have repaired. The results whose average score lies within the 95% confidence interval of the null distribution of zero correlation, which is [-0.088, 0.088], are colored.

Average success rate ± std. of each relevance metric for the identical class test. The metrics prefixed with ♦ are the ones we have repaired. The results with an average success rate over 0.5 are colored.

Average success rate ± std. of each relevance metric for the identical subclass test. The metrics prefixed with ♦ are the ones we have repaired. The results with an average success rate over 0.5 are colored.

Definition 5 (Top-k Identical Class Test). For z_test = (x_test, y_test), let z̄_j = (x̄_j, ȳ_j) be the training instance with the j-th largest relevance score. Then, we require ȳ_j = y_test for all j ∈ {1, 2, . . . , k}.

This observation also applies to the identical subclass test, which leads to the following criterion.

Definition 6 (Top-k Identical Subclass Test). For z_test = (x_test, y_test), let z̄_j = (x̄_j, ȳ_j) be the training instance with the j-th largest relevance score. Then, we require s(z̄_j) = s(z_test) for all j ∈ {1, 2, . . . , k}.

Average success rate ± std. of each relevance metric for the top-10 identical class test. The metrics prefixed with ♦ are the ones we have repaired. The results with an average success rate over 0.5 are colored.

Average success rate ± std. of each relevance metric for the top-10 identical subclass test. The metrics prefixed with ♦ are the ones we have repaired. The results with an average success rate over 0.5 are colored.

Relevant instances selected for random test inputs with incorrect predictions using several relevance metrics on AGNews with LSTM. Out-of-vocabulary words are followed by [unk].

Test input: devil[unk] rays[unk] stuck[unk] in florida hours[unk] before game (Sports)
cos x: profiting[unk] from moore[unk] 's law (Business)
cos last: devil[unk] rays[unk] stuck[unk] in florida hours[unk] before game (Sports)
cos all: devil[unk] rays[unk] stuck[unk] in florida hours[unk] before game (Sports)
dot x: italians[unk] , canadians[unk] gather[unk] to honour[unk] living legend[unk] : vc[unk] winner smoky[unk] smith[unk] ( canadian press )

ACKNOWLEDGMENTS

We thank Dr. Ryo Karakida and Dr. Takanori Maehara for their helpful advice. We also thank Overfit Summer Seminar 5 for an opportunity that inspired this research. Additionally, we are grateful to our laboratory members for their helpful comments. Sho Yokoi was supported by JST, ACT-X Grant Number JPMJAX200S, Japan. Satoshi Hara was supported by JSPS KAKENHI Grant Number 20K19860, and JST, PRESTO Grant Number JPMJPR20C8, Japan.

APPENDIX

to "9"). We adopted logistic regression and a CNN as the classification models. The CNN has six convolutional layers, with a max-pooling layer after every two convolutional layers. The features obtained by these layers are fed into a global average pooling layer followed by a single linear layer. The number of output channels of all the convolutional layers is set to 16. We trained the models using the Adam optimizer with a learning rate of 0.001. We used randomly sampled 5,500 training instances to train the models.

CIFAR10 (Krizhevsky, 2009) The CIFAR10 dataset is used for object recognition tasks. Here, input x is an image containing a certain object, and output y consists of 10 classes, e.g., "bird" or "airplane." We used the same models as for the MNIST dataset. In addition, we adopted MobileNetV2 (Sandler et al., 2018) as a model with higher performance than the previous models. We trained the models using the Adam optimizer with a learning rate of 0.001. In the experiments, we first pre-trained the models using all the training instances of CIFAR10 and then trained them on 5,000 randomly sampled training instances; without the pre-training, the classification performance of the models dropped significantly. Note that we did not examine IF and FK on MobileNetV2 because the matrix inverses in these metrics took too much time to compute, even with the conjugate gradient approximation proposed by Koh & Liang (2017).

TREC (Li & Roth, 2002) The TREC dataset is used for question classification tasks. Here, input x is a question sentence, and output y is a question category consisting of six classes, e.g., "LOC" and "NUM." We used bag-of-words logistic regression and a two-layer Bi-LSTM as the classification models. In the Bi-LSTM, the last state is fed into one linear layer. The word embedding dimension is set to 16, and the LSTM hidden dimension is also set to 16.
We trained the models using the Adam optimizer with a learning rate of 0.001. We used randomly sampled 2,726 training instances to train the models.

AGNews (Zhang et al., 2015) The AGNews dataset is used for news article classification tasks. Here, input x is a sentence, and output y is a category comprising four classes, e.g., "business" and "sports." We used the same models as for TREC. We trained the models using the Adam optimizer with a learning rate of 0.001. We used randomly sampled 6,000 training instances to train the models.

Vehicle (Dua & Graff, 2017) The Vehicle dataset is used for vehicle type classification tasks. Here, input x consists of 18 features, and output y is a type of vehicle comprising four classes, e.g., "bus" and "van." We used logistic regression and a three-layer MLP as the classification models. We trained the models using the Adam optimizer with a learning rate of 0.001. We used randomly sampled 423 training instances to train the models.

Segment (Dua & Graff, 2017) The Segment dataset is used for image segment classification tasks. Here, input x consists of 19 features, and output y consists of seven classes, e.g., "sky" and "window." We used the same models as for Vehicle. We trained the models using the Adam optimizer with a learning rate of 0.001. We used randomly sampled 924 training instances to train the models.

B.2 COMPUTING INFRASTRUCTURE

In our experiments, model training was run on an NVIDIA GTX 1080 GPU with an Intel Xeon Silver 4112 CPU and 64 GB RAM. Testing and computation of the relevance metrics were run on an Intel Xeon E5-2680 v2 CPU with 256 GB RAM.

C DERIVATION OF GC FOR LINEAR LOGISTIC REGRESSION

We consider linear logistic regression, whose conditional distribution $p(y \mid x; \theta)$ is given by the y-th entry of $\sigma(Wx)$, where $\sigma$ is the softmax function, $\theta = W \in \mathbb{R}^{C \times d}$, and C and d are the number of classes and the dimensionality of x, respectively. Recall that the cross entropy loss for linear logistic regression is $\ell(z; \theta) = -\log [\sigma(Wx)]_y$ for an input-output pair $z = (x, y)$.

Published as a conference paper at ICLR 2021
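The gradient used by GC then has a closed form; a standard softmax cross-entropy computation (consistent with the definitions above) gives:

```latex
\begin{align}
\ell(z;\theta) &= -\log[\sigma(Wx)]_y
  = -w_y^\top x + \log\sum_{c=1}^{C}\exp(w_c^\top x), \\
\frac{\partial \ell}{\partial w_c}
  &= \bigl([\sigma(Wx)]_c - \mathbb{1}[c = y]\bigr)\,x, \\
\nabla_W \ell(z;\theta) &= \bigl(\sigma(Wx) - e_y\bigr)\,x^\top,
\end{align}
```

where $w_c$ denotes the c-th row of W and $e_y$ the one-hot vector of class y. GC is then the cosine similarity between these gradients evaluated at the test and training instances.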

