LEARNING TO ESTIMATE SHAPLEY VALUES WITH VISION TRANSFORMERS

Abstract

Transformers have become a default architecture in computer vision, but understanding what drives their predictions remains a challenging problem. Current explanation approaches rely on attention values or input gradients, but these provide a limited view of a model's dependencies. Shapley values offer a theoretically sound alternative, but their computational cost makes them impractical for large, high-dimensional models. In this work, we aim to make Shapley values practical for vision transformers (ViTs). To do so, we first leverage an attention masking approach to evaluate ViTs with partial information, and we then develop a procedure to generate Shapley value explanations via a separate, learned explainer model. Our experiments compare Shapley values to many baseline methods (e.g., attention rollout, GradCAM, LRP), and we find that our approach provides more accurate explanations than existing methods for ViTs.

1. INTRODUCTION

Transformers (Vaswani et al., 2017) were originally introduced for NLP, but in recent years they have been successfully adapted to a variety of other domains (Wang et al., 2020; Jumper et al., 2021). In computer vision, transformer-based models are now used for problems including image classification, object detection and semantic segmentation (Dosovitskiy et al., 2020; Touvron et al., 2021; Liu et al., 2021), and they achieve state-of-the-art performance in many tasks (Wortsman et al., 2022). The growing use of transformers in computer vision motivates the question of what drives their predictions: understanding a complex model's dependencies is an important problem in many applications, but the field has not settled on a solution for the transformer architecture.

Transformers are composed of alternating self-attention and fully-connected layers, where the self-attention operation associates attention values with every pair of tokens. In vision transformers (ViTs) (Dosovitskiy et al., 2020), the tokens represent non-overlapping image patches, typically a total of 14 × 14 = 196 patches each of size 16 × 16. It is intuitive to view attention values as indicators of feature importance (Abnar and Zuidema, 2020; Ethayarajh and Jurafsky, 2021), but interpreting transformer attention in this way is potentially misleading. Recent work has raised questions about the validity of attention as explanation (Serrano and Smith, 2019; Jain and Wallace, 2019; Chefer et al., 2021), arguing that it provides an incomplete picture of a model's dependence on each token.

If attention is not a reliable indicator of feature importance, then what is? We consider the perspective that transformers are no different from any other architecture, and that we can explain their predictions using model-agnostic approaches that are currently used for other architectures.
Among these methods, Shapley values are a theoretically compelling approach with feature importance scores that are designed to satisfy many desirable properties (Shapley, 1953; Lundberg and Lee, 2017). The main challenge for Shapley values in the transformer context is calculating them efficiently, because a naive calculation has exponential running time in the number of patches. If Shapley values are poorly approximated, they are unlikely to reflect a model's true dependencies, but calculating them with high accuracy is currently too slow to be practical. Thus, our work aims to make Shapley values practical for transformers, and for ViTs in particular. Separately, prior work has proposed a connection between attention flow and Shapley values (Ethayarajh and Jurafsky, 2021), but this approach is fundamentally different from SHAP (Lundberg and Lee, 2017): attention flow treats each feature's influence on the model as strictly additive, which is computationally convenient but fails to represent feature interactions. Our work instead focuses on the original formulation (Lundberg and Lee, 2017) and aims to make Shapley values based on feature removal practical for ViTs.

3. BACKGROUND

Here, we define notation used throughout the paper and briefly introduce Shapley values.

3.1 NOTATION

Our focus is vision transformers trained for classification tasks, where x ∈ R^(224×224×3) denotes an image and y ∈ {1, . . . , K} denotes the class. We write the image patches as x = (x_1, . . . , x_d), where ViTs typically have x_i ∈ R^(16×16×3) and d = 196. The model is given by f(x; η) ∈ [0, 1]^K, and f_y(x; η) ∈ [0, 1] represents the probability for the yth class. Shapley values involve feature subsets, so we use s ∈ {0, 1}^d to denote a subset of indices and x_s = {x_i : s_i = 1} a subset of image patches. We also use 0 and 1 ∈ R^d to denote vectors of zeros and ones, and e_i ∈ R^d is a vector with a one in the ith position and zeros elsewhere. Finally, bold symbols x, y are random variables, x, y are possible values, and p(x, y) denotes the data distribution.

3.2 SHAPLEY VALUES

Shapley values were developed in game theory for allocating credit in coalitional games (Shapley, 1953). A coalitional game is represented by a set function, where the value for each subset indicates the profit achieved when the corresponding players participate. Given a game with d players, or a set function v : {0, 1}^d → R, the Shapley values are denoted by ϕ_1(v), . . . , ϕ_d(v) ∈ R for each player, and the value ϕ_i(v) for the ith player is defined as follows:

$$\phi_i(v) = \frac{1}{d} \sum_{s : s_i = 0} \binom{d-1}{\mathbf{1}^\top s}^{-1} \big( v(s + e_i) - v(s) \big). \tag{1}$$

Intuitively, eq. (1) represents the change in profit from introducing the ith player, averaged across all possible subsets to which i can be added. Shapley values are defined in this way to satisfy many reasonable properties: for example, the credits sum to the value when all players participate, players with equivalent contributions receive equal credit, and players with no contribution receive zero credit (Shapley, 1953).
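For intuition, eq. (1) can be evaluated exactly by brute-force enumeration when d is small. The sketch below (our own illustrative helper, not code from this work) implements the formula directly; it is exponential in d and only usable for toy games:

```python
from itertools import combinations
from math import comb

def shapley_values(v, d):
    """Exact Shapley values for a game v: frozenset -> float with d players.

    Direct implementation of eq. (1): phi_i is the average marginal
    contribution v(s + e_i) - v(s) over all subsets s not containing i,
    weighted by 1 / (d * C(d-1, |s|)). Exponential in d.
    """
    phi = [0.0] * d
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for m in range(d):  # subset sizes 0, ..., d-1
            for subset in combinations(others, m):
                s = frozenset(subset)
                phi[i] += (v(s | {i}) - v(s)) / (d * comb(d - 1, m))
    return phi
```

For an additive game the Shapley values recover each player's weight, and the efficiency property (values summing to v(1) − v(0)) holds by construction.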
These properties make Shapley values attractive in many settings: they have been applied with coalitional games that represent a model's prediction given a subset of features (SHAP) (Štrumbelj and Kononenko, 2010; Lundberg and Lee, 2017), as well as several other use-cases in machine learning (Ghorbani and Zou, 2019; 2020; Covert et al., 2020). There are two main challenges when using Shapley values to explain individual predictions (Chen et al., 2022). The first is properly withholding feature information, and we explore how to address this challenge in the ViT context (Section 4). The second is calculating Shapley values efficiently, because their computation scales exponentially with the number of inputs d. Traditionally, they are approximated using sampling-based estimators like KernelSHAP (Castro et al., 2009; Štrumbelj and Kononenko, 2010; Lundberg and Lee, 2017), but we build on a more efficient learning-based approach (FastSHAP) recently introduced by Jethani et al. (2021b) (Section 5).

4. EVALUATING VISION TRANSFORMERS WITH PARTIAL INFORMATION

The basic idea behind Shapley values, as well as other removal-based explanations (Covert et al., 2021) , is to evaluate the model with partial feature information and analyze how a prediction changes. Most models need values for all the features to make predictions, so in practice we require a mechanism to represent feature removal. For example, we can set held-out image regions to zero, or we can average the prediction across randomly sampled replacement values. With vision transformers, the options for removing features are slightly different. Recent work has demonstrated the robustness of ViTs to randomly zeroed pixel values (Naseer et al., 2021) , but the self-attention operation enables a more elegant approach: we can simply ignore tokens for image patches we wish to remove (Jain et al., 2021) . We achieve this by masking attention values at each self-attention layer, or setting them to a large negative value before applying the softmax operation (see Appendix A). This resembles causal attention masking in transformer language models like GPT-3 (Brown et al., 2020) , but we use masking for a different purpose. Alternatively, we could use a unique token value as in masked language models such as BERT (Devlin et al., 2018) , which would involve simply setting held-out tokens to the mask value. Using this attention masking approach, we can evaluate a ViT model f (x; η) given subsets of image patches, denoted by x s . However, because these partial inputs represent off-manifold examples, the predictions with partial information may not behave as desired. We have two options to correct this: 1) we can ensure that the model is trained with random masking, or 2) we can fine-tune the model to encourage sensible behavior with missing patches. The first option is more direct, but it does not allow us to explain models trained without masking. 
For the latter option, we can create an updated model denoted by g(x_s; β) that we fine-tune using the following loss:

$$\min_\beta \; \mathbb{E}_{p(x)} \, \mathbb{E}_{p(s)} \Big[ D_{\mathrm{KL}}\big( f(x; \eta) \,\big\|\, g(x_s; \beta) \big) \Big], \tag{2}$$

where p(s) is a distribution over subsets. In practice, we sample the cardinality m = 1^⊤s from m ∼ Unif(0, d) and then sample m patches uniformly at random. Intuitively, eq. (2) encourages g(x_s; β) to preserve the original model's predictions even with missing features. We use this loss because it satisfies the desirable property that the optimal model g(x_s; β*) outputs the expected prediction given the available information (Covert et al., 2021), or

$$g(x_s; \beta^*) = \mathbb{E}\big[ f(\mathbf{x}; \eta) \mid \mathbf{x}_s = x_s \big]. \tag{3}$$

Note that this represents a best-effort prediction, because if f(x; η) = p(y | x) then we have g(x_s; β*) = p(y | x_s). Similarly, in the case where f(x_s; η) is trained directly with random masking, the training process estimates f(x_s; η) ≈ p(y | x_s) (see Appendix B). We refer to the fine-tuned model g(x_s; β) as a surrogate, following the naming in prior work (Frye et al., 2020). Whether we use the original model or a version fine-tuned with random masking, our attention masking approach enables us to probe how individual predictions change as we remove groups of image patches.
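The subset distribution p(s) described above is straightforward to implement: draw a cardinality uniformly from {0, . . . , d}, then choose that many patches at random. A minimal numpy sketch (function and argument names are our own):

```python
import numpy as np

def sample_mask(d, rng):
    """Sample s ~ p(s) as in Section 4: cardinality m ~ Unif{0, ..., d},
    then m patch indices chosen uniformly without replacement."""
    m = rng.integers(0, d + 1)  # inclusive of both 0 (no patches) and d (all patches)
    s = np.zeros(d, dtype=np.int64)
    s[rng.choice(d, size=m, replace=False)] = 1
    return s
```

During surrogate fine-tuning, each training image would be paired with one or more such masks before computing the KL divergence in eq. (2).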

5. LEARNING TO ESTIMATE SHAPLEY VALUES

Given our approach for evaluating ViTs with partial information, we can use Shapley values to identify influential image patches for an input x and class y. This involves evaluating the model with many feature subsets x_s, so we define a coalitional game v_xy(s) = g_y(x_s; β). Alternatively, if we use a model trained with masking, we can define the coalitional game as v_xy(s) = f_y(x_s; η). Common Shapley value approximations are based on sampling feature permutations (Castro et al., 2009; Štrumbelj and Kononenko, 2010) or fitting a weighted least squares model (Lundberg and Lee, 2017; Covert and Lee, 2021), but these can require hundreds or thousands of model evaluations to explain a single prediction. Instead, we develop a learning-based estimation approach for ViTs. Our goal is to obtain an explainer model that estimates Shapley values directly. To do so, we train a new vision transformer ϕ_ViT(x, y; θ) ∈ R^d that outputs approximate Shapley values for an input-output pair (x, y) in a single forward pass. Crucially, rather than training the model using a dataset of ground truth Shapley value explanations, we train it by minimizing the following objective:

$$\mathcal{L}(\theta) = \mathbb{E}_{p(x, y)} \, \mathbb{E}_{p_{\mathrm{Sh}}(s)} \Big[ \big( v_{xy}(s) - v_{xy}(0) - s^\top \phi_{\mathrm{ViT}}(x, y; \theta) \big)^2 \Big] \tag{4}$$
$$\text{s.t.} \quad \mathbf{1}^\top \phi_{\mathrm{ViT}}(x, y; \theta) = v_{xy}(1) - v_{xy}(0) \quad \forall \, (x, y),$$

where p_Sh(s) is a distribution defined as p_Sh(s) ∝ (1^⊤s − 1)!(d − 1^⊤s − 1)! for 0 < 1^⊤s < d and p_Sh(1) = p_Sh(0) = 0. Intuitively, eq. (4) encourages the explainer model to output feature scores that provide an additive approximation for the predictions with partial information, where the predictions are represented by v_xy(s) and the additive approximation by v_xy(0) + s^⊤ϕ_ViT(x, y; θ). The loss in eq. (4) was introduced by Jethani et al. (2021b) and is derived from an optimization-based characterization of the Shapley value (Charnes et al., 1988).
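To make the objective concrete, the expectation over p_Sh(s) can be enumerated exactly for a toy game with small d. The sketch below (our own illustration, not the training code) evaluates the inner objective for a candidate attribution vector; for an additive game the exact Shapley values drive the loss to zero:

```python
import numpy as np
from itertools import product
from math import factorial

def expected_loss(phi, v, d):
    """Exact value of the eq. (4) objective for one input: the squared gap
    between v(s) and the additive approximation v(0) + s^T phi, averaged
    under p_Sh(s) by full enumeration (feasible only for small d)."""
    zero = np.zeros(d, dtype=int)
    weights, gaps = [], []
    for bits in product([0, 1], repeat=d):
        s = np.array(bits)
        m = int(s.sum())
        if m == 0 or m == d:
            continue  # p_Sh places no mass on the empty and full subsets
        weights.append(factorial(m - 1) * factorial(d - m - 1))
        gaps.append((v(s) - v(zero) - s @ phi) ** 2)
    w = np.array(weights, dtype=float)
    return float((w / w.sum()) @ np.array(gaps))
```

For an additive game v(s) = s^⊤w, the Shapley values equal w, every gap is exactly zero, and the loss vanishes; any other attribution vector incurs a positive loss, consistent with Lemma 1's strong convexity.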
To rigorously justify this training approach, we derive new results that show how this objective controls the Shapley value estimation error. Proofs are in Appendix D. First, we show that the explainer's loss for a single input is strongly convex in the prediction, a result that implies the existence of unique optimal predictions.

Lemma 1. For a single input-output pair (x, y), the expected loss under eq. (4) for the prediction ϕ_ViT(x, y; θ) is µ-strongly convex with µ = H_{d−1}^{−1}, where H_{d−1} is the (d − 1)th harmonic number.

Next, we utilize the strong convexity property from Lemma 1 to prove our main result: that the explainer model's loss function upper bounds the distance between the exact and approximated Shapley values. This is notable because we do not utilize ground truth values during training.

Theorem 1. For a model ϕ_ViT(x, y; θ) whose predictions satisfy the constraint in eq. (4), the objective value L(θ) upper bounds the Shapley value estimation error as follows:

$$\mathbb{E}_{p(x, y)} \big\| \phi_{\mathrm{ViT}}(x, y; \theta) - \phi(v_{xy}) \big\|^2 \leq 2 H_{d-1} \big( \mathcal{L}(\theta) - \mathcal{L}^* \big),$$

where L* represents the loss achieved by the exact Shapley values. This shows that our objective is a viable approach for training without exact Shapley values, because optimizing eq. (4) minimizes an upper bound on the estimation error. In other words, if we can iteratively optimize the explainer model so that its loss approaches the optimum obtained by the exact Shapley values (L(θ) → L*), our estimation error will go to zero. In practice, we train the explainer model ϕ_ViT(x, y; θ) using stochastic gradient descent, and several other steps are important during training. First, we normalize the explainer's unconstrained predictions in order to satisfy the objective's constraint in eq. (4); this ensures that the Shapley value's efficiency property holds (Shapley, 1953).
Next, rather than training the explainer from scratch, we fine-tune an existing model that can be either the original classifier or a ViT pre-trained on a different supervised or self-supervised learning task (Touvron et al., 2021; He et al., 2021); ViTs are more difficult to train than convolutional networks, and we find that fine-tuning is important to train the explainer effectively (Table 3). Finally, we simplify the architecture by estimating Shapley values for all classes simultaneously. Our training approach is described in more detail in Appendix C. By using a ViT to estimate Shapley values, we model the true explanation function and learn rich representations that capture not only which class is represented, but where key information is located. And by fine-tuning an existing model, we allow the explainer to re-use visual features that were informative for other challenging tasks. Ultimately, the explainer cannot guarantee exact Shapley values, but no approximation algorithm can; instead, it offers a favorable trade-off between accuracy and efficiency, and we find empirically that this approach offers a powerful alternative to the methods currently used for ViTs.

Figure 2: ViT predictions given partial information. We delete patches at random using several removal mechanisms, and then measure the quality of the resulting predictions via the KL divergence relative to the original, full-image predictions (lower is better).

6. EXPERIMENTS

We now demonstrate the effectiveness of our approach, termed ViT Shapley. First, we evaluate attention masking for handling held-out patches in ViTs (Section 6.1). Next, we compare explanations from ViT Shapley to several existing methods (Section 6.2). Our baselines include attention-, gradient- and removal-based explanations, and we compare these methods via several metrics for explanation quality, including insertion/deletion of important features (Petsiuk et al., 2018), sensitivity-n (Ancona et al., 2018), faithfulness (Bhatt et al., 2021) and ROAR (Hooker et al., 2019). Our experiments are based on three image datasets: ImageNette, a natural image dataset consisting of ten ImageNet classes (Howard and Gugger, 2020; Deng et al., 2009); MURA, a medical image dataset of musculoskeletal radiographs classified as normal or abnormal (Rajpurkar et al., 2017); and the Oxford-IIIT Pets dataset, which has 37 classes (Parkhi et al., 2012). See Figure 1 for example images. The main text shows results for ImageNette and MURA, and Pets results are in Appendix H. We use ViT-Base models (Wightman, 2019) as classifiers for all datasets, unless otherwise specified.

6.1. EVALUATING IMAGE PATCH REMOVAL

Our initial experiments test whether attention masking is effective for handling held-out image patches. We fine-tuned the classifiers for each dataset following the procedure described in Section 4, and we also tested several approaches without performing any fine-tuning: attention masking, attention masking applied after the softmax operation (how dropout is often implemented for ViTs, Wightman 2019), setting input patches to zero (Naseer et al., 2021), setting token embeddings to zero, and replacing with random patches from the dataset. Finally, we performed identical fine-tuning while replacing input patches with zeros, which is equivalent to introducing a fixed mask token. As a measure of how well missing patches are handled, we calculated the KL divergence relative to the full-image predictions as random patches are removed. This can be interpreted as a divergence measure between the masked predictions and the predictions with patches marginalized out (see Appendix B), or how close we are to correctly removing patch information. The metric is calculated with randomly generated patch subsets, and it represents whether the model makes reasonable predictions given partial inputs. Similar results for top-1 accuracy are in Appendix H.

Figure 2 shows the results. Most methods perform well with <25% of patches missing, leading to only small increases in KL divergence. This is especially true for ImageNette, where large objects make the model more robust to missing patches. However, the methods with no fine-tuning begin to diverge as larger numbers of patches are removed and the partial inputs become increasingly off-manifold. Thus, fine-tuning becomes necessary to properly account for partial inputs as more patches are removed. For all datasets, we find that fine-tuning with either attention masking or input patches set to zero provides comparable performance, and that these approaches perform best across all numbers of patches.
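The evaluation metric described above is just the KL divergence between the full-image prediction and the masked prediction. A minimal numpy sketch (our own helper; the paper averages this quantity over many random patch subsets):

```python
import numpy as np

def kl_to_full_prediction(p_full, p_masked, eps=1e-12):
    """KL(p_full || p_masked) between the full-image class distribution and
    the class distribution with patches removed; lower means the removal
    mechanism better preserves the original prediction."""
    p = np.clip(p_full, eps, 1.0)
    q = np.clip(p_masked, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))
```

The divergence is zero when masking leaves the prediction unchanged and grows as the masked prediction drifts off-manifold.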
This means that fine-tuning makes attention masking significantly more effective for marginalizing out missing patches, and these results suggest that training ViTs with held-out tokens may be necessary to enable robustness to partial information. As prior work suggests, properly handling held-out information is crucial for generating informative explanations (Frye et al., 2020; Covert et al., 2021) , so the remainder of our experiments proceed with the fine-tuned attention masking approach.

6.2. EVALUATING EXPLANATION ACCURACY

Next, we implemented ViT Shapley by training explainer models for both datasets. We used the fine-tuned classifiers from Section 6.1 to handle partial information, and we used the ViT-Base architecture with extra output layers to generate Shapley values for all patches. The explainer models were trained by optimizing eq. (4) using stochastic gradient descent (see details in Appendix C), and once trained, the explainer outputs approximate Shapley values in a single forward pass (Figure 1). As comparisons for ViT Shapley, we considered a large number of baselines. For attention-based methods, we use attention rollout and the last layer's attention directed to the class token (Abnar and Zuidema, 2020). Similar to prior work (Chefer et al., 2021), we did not use attention flow due to the computational cost. Next, for gradient-based methods, we use Vanilla Gradients (Simonyan et al., 2013), IntGrad (Sundararajan et al., 2017), SmoothGrad (Smilkov et al., 2017), VarGrad (Hooker et al., 2019), LRP (Chefer et al., 2021) and GradCAM (Selvaraju et al., 2017). For removal-based methods, we use the leave-one-out approach (Zeiler and Fergus, 2014) and RISE (Petsiuk et al., 2018). Appendix F describes the baselines in more detail, including how several were modified to provide patch-level results, and Appendix H shows the running time for each method. Given our set of baselines, we used several metrics to evaluate ViT Shapley. Evaluating explanation accuracy is difficult when the true importance is not known a priori, so we rely on metrics that test how removing (un)important features affects a model's predictions. Intuitively, removing influential features for a particular class should reduce the class probability, and removing non-influential features should not affect or even increase the class probability.
Removal-based explanations are implicitly related to such metrics (Covert et al., 2021), but attention- and gradient-based methods might be hoped to provide strong performance with lower computational cost. First, we implemented the widely used insertion and deletion metrics (Petsiuk et al., 2018). For these, we generate predictions while inserting/removing features in order of most to least important, and we then evaluate the area under the curve of prediction probabilities (see Figure 1). Here, we average the results across 1,000 images for their true class. We use random test set images for ImageNette, and for MURA we use test examples that were classified as abnormal because these are more important in practice. When removing information, we use the fine-tuned classifier because this represents the closest approximation to properly removing information from the model (Section 4). Practically, this means that ViT Shapley identifies important features that quickly drive the prediction towards a given class, and that quickly reduce the prediction probability when deleted. Next, we modified these metrics to address a common issue with model explanations: that their results are not specific to each class (Rudin, 2019). ViT Shapley produces separate explanations for each class, so it can identify relevant patches even for non-target classes (see Figure 1). The insertion and deletion metrics only test importance rankings, so we require other metrics to test the specific attribution values. Sensitivity-n (Ancona et al., 2018) was proposed for this purpose, and it measures whether attributions correlate with the impact on a model's prediction when a feature is removed. The correlation is typically calculated across subsets of a fixed size, and then averaged across many predictions. Faithfulness (Bhatt et al., 2021) is a similar metric where the correlation is calculated across subsets of all sizes. Table 1 and Table 2 show faithfulness results.
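The deletion metric can be sketched in a few lines: remove patches from most to least important, record the class probability after each removal, and report the mean of the curve (lower is better). The helper below is our own simplification, and `predict_with_mask` is an assumed callable mapping a binary mask to a class probability:

```python
import numpy as np

def deletion_curve_auc(attributions, predict_with_mask, d):
    """Deletion metric sketch (Petsiuk et al., 2018): delete patches in
    decreasing order of attribution and average the resulting class
    probabilities (area under the deletion curve, lower is better)."""
    order = np.argsort(-attributions)  # most important patch first
    s = np.ones(d, dtype=int)
    probs = [predict_with_mask(s)]
    for i in order:
        s = s.copy()
        s[i] = 0  # delete the next-most-important patch
        probs.append(predict_with_mask(s))
    return float(np.mean(probs))
```

With a toy model whose probability is the fraction of truly important patches still present, an attribution vector that ranks those patches first achieves a lower (better) score than one that ranks them last. The insertion metric is the mirror image, starting from the empty mask.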
Among the baselines, RISE and LRP remain most competitive, but ViT Shapley again performs best for both datasets. Figure 3 shows sensitivity-n results calculated across a range of subset sizes. Leave-one-out naturally performs best for large subset sizes, but ViT Shapley performs the best overall, particularly with smaller subsets. The sensitivity-n results focus on the target class, but Appendix H shows results for non-target classes where ViT Shapley's advantage over many baselines (including LRP) is even larger. Finally, we performed an evaluation inspired by ROAR (Hooker et al., 2019), which tests how a model's accuracy degrades as important features are removed. ROAR suggests retraining with masked inputs, but this is unnecessary here because the fine-tuned classifier is designed to handle held-out patches. We therefore generated multiple versions of the metric. First, we evaluated accuracy while using the fine-tuned classifier to handle masked patches. Second, we repeated the evaluation using a separate evaluator model trained directly with held-out patches, similar to EVAL-X (Jethani et al., 2021a). Third, we performed masked retraining as described by ROAR. The first version represents the original classifier's best-effort prediction, and the second is a best-effort prediction disconnected from the original model; masked retraining is similar, but the retrained model can exploit information communicated by the masking, such as the shape and position of the removed object. Figure 4 shows the results when removing important patches. ViT Shapley consistently outperforms the baselines across the first two versions of the metric, yielding faster degradation when important patches are removed. ViT Shapley also performs best when inserting important patches, yielding a faster increase in accuracy (Appendix H).
It is outperformed by several baselines with masked retraining in the deletion direction (Figure 4 bottom left), but we find that this is likely due to spatial information leaked by ViT Shapley's deleted patches; indeed, when we retrained without positional embeddings, we found that ViT Shapley achieved the fastest degradation with a small number of deleted patches (Figure 4 bottom right). Interestingly, positional embeddings in ViTs offer a unique approach to alleviate ROAR's known information leakage issue (Jethani et al., 2021a). In addition to these experiments, we include many further results in the supplement (Appendix H). First, we observe similar benefits for ViT Shapley when using the Oxford-IIIT Pets dataset. Next, regarding the choice of architecture, we observe consistent results when replacing ViT-Base with ViT-Tiny, -Small or -Large. We also replicate our results when using a classifier trained directly with random masking, an approach discussed in prior work to accommodate partial input information (Covert et al., 2021).

A ATTENTION MASKING

This section describes our attention masking approach in detail. First, recall that ViTs use query-key-value self-attention (Vaswani et al., 2017; Dosovitskiy et al., 2020), which accepts a set of input tokens and produces a weighted sum of learned token values. Given an input z ∈ R^(d×h) and parameters U_qkv ∈ R^(h×3h′), we compute the self-attention output SA(z) for a single head as follows:

$$[Q, K, V] = z U_{qkv} \tag{5}$$
$$A = \mathrm{softmax}\big( Q K^\top / \sqrt{h'} \big) \tag{6}$$
$$\mathrm{SA}(z) = A V. \tag{7}$$

In multihead self-attention, we perform this operation in parallel over k attention heads and project the concatenated outputs. Denoting each head's output as SA_i(z) and the projection matrix as U_msa ∈ R^(k·h′×h), the multihead self-attention output MSA(z) is

$$\mathrm{MSA}(z) = [\mathrm{SA}_1(z), \ldots, \mathrm{SA}_k(z)] \, U_{msa}. \tag{8}$$

Multihead self-attention can operate with any number of tokens, so given a subset s ∈ {0, 1}^d and an input x, we can evaluate a ViT using only tokens for the patches x_s = {x_i : s_i = 1}. However, for implementation purposes it is preferable to maintain the same number of tokens within a minibatch. We therefore provide all tokens to the model and achieve the same effect using attention masking. Our exact approach is described below.

Let z ∈ R^(d×h) represent the full token set for an input x and let s be a subset. At each self-attention layer, we construct a mask matrix S = [s, . . . , s]^⊤ ∈ {0, 1}^(d×d) and calculate the masked self-attention output SA(z, s) as follows:

$$A = \mathrm{softmax}\big( (Q K^\top - (1 - S) \cdot \infty) / \sqrt{h'} \big) \tag{9}$$
$$\mathrm{SA}(z, s) = A V. \tag{10}$$

The masked multihead self-attention output is then calculated similarly to the original version:

$$\mathrm{MSA}(z, s) = [\mathrm{SA}_1(z, s), \ldots, \mathrm{SA}_k(z, s)] \, U_{msa}. \tag{11}$$

Due to the masking in eq. (9), each output token in MSA(z, s) is guaranteed not to attend to tokens from x_{1−s} = {x_i : s_i = 0}.
We use masked self-attention in all layers of the network, so that the tokens for x s remain invariant to those for x 1-s throughout the entire model, including after the layer norm and fully-connected layers. When the final prediction is calculated using the class token, the output is equivalent to using only the tokens for x s . If the final prediction is instead produced using global average pooling (Beyer et al., 2022) , we can modify the average to account only for tokens we wish to include.
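A single masked self-attention head is simple to sketch in numpy. The helper below is our own illustration (argument names are assumptions); it applies the mask after the √h′ scaling and uses a large negative constant in place of −∞, both inessential simplifications relative to eq. (9):

```python
import numpy as np

def masked_self_attention(z, s, U_qkv, h_prime):
    """Single-head masked self-attention per eqs. (9)-(10): attention logits
    toward held-out tokens are replaced with a large negative constant before
    the softmax, so retained tokens never attend to removed ones.
    z is (d, h), s is a binary mask of length d, U_qkv is (h, 3*h_prime)."""
    Q, K, V = np.split(z @ U_qkv, 3, axis=-1)
    logits = Q @ K.T / np.sqrt(h_prime)
    logits = np.where(s[None, :] == 1, logits, -1e9)  # mask columns of removed tokens
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(logits)
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V
```

Because the masked columns receive zero attention weight, perturbing a removed token's content leaves the outputs for all retained tokens unchanged, which is exactly the invariance property the appendix describes.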

B MASKED TRAINING

In this section, we provide proofs to justify training a ViT classifier with held-out tokens, either as part of the original training or as part of a post-hoc fine-tuning procedure (the surrogate model training described in Section 4). Our proofs are similar to those in prior work that discusses marginalizing out features using their conditional distribution (Covert et al., 2021).

First, consider a model trained directly with masking. Given a subset distribution p(s) and the data distribution p(x, y), we can train a model f(x_s; η) with cross-entropy loss and random masking by minimizing the following:

$$\min_\eta \; \mathbb{E}_{p(x, y)} \, \mathbb{E}_{p(s)} \big[ -\log f_y(x_s; \eta) \big]. \tag{12}$$

To understand the global optimizer for this loss function, consider the expected loss for the prediction given a fixed model input x_s:

$$\mathbb{E}_{p(y, x_{1-s} \mid x_s)} \big[ -\log f_y(x_s; \eta) \big] = \mathbb{E}_{p(y \mid x_s)} \big[ -\log f_y(x_s; \eta) \big]. \tag{13}$$

The expression in eq. (13) is equal to the KL divergence D_KL(p(y | x_s) || f(x_s; η)) up to a constant value, so the prediction that minimizes this loss is p(y | x_s). For any subset s ∈ {0, 1}^d where p(s) > 0, we then have the following result for the model f(x_s; η*) that minimizes eq. (12):

$$f_y(x_s; \eta^*) = p(y \mid x_s) \quad \text{a.e. in } p(x).$$

Intuitively, this means that training the original model with masking estimates f(x_s; η) ≈ p(y | x_s). In practice, we use a subset distribution p(s) where p(s) > 0 for all s ∈ {0, 1}^d: we set p(s) by sampling the cardinality uniformly at random and then sampling the members, which is equivalent to defining p(s) as

$$p(s) = \binom{d}{\mathbf{1}^\top s}^{-1} \frac{1}{d + 1}.$$

Alternatively, we can use a model f(x; η) trained without masking and fine-tune it to better handle held-out features. In our case, this yields a surrogate model (Frye et al., 2020) denoted as g(x_s; β) that we fine-tune by minimizing the following loss:

$$\min_\beta \; \mathbb{E}_{p(x)} \, \mathbb{E}_{p(s)} \Big[ D_{\mathrm{KL}}\big( f(x; \eta) \,\big\|\, g(x_s; \beta) \big) \Big]. \tag{14}$$
To understand the global optimizer for the above loss, we can again consider the expected loss given a fixed input x_s:

$$\mathbb{E}_{p(x_{1-s} \mid x_s)} \Big[ D_{\mathrm{KL}}\big( f(x; \eta) \,\big\|\, g(x_s; \beta) \big) \Big] = D_{\mathrm{KL}}\Big( \mathbb{E}\big[ f(\mathbf{x}; \eta) \mid x_s \big] \,\Big\|\, g(x_s; \beta) \Big) + \text{const.}$$

The distribution that minimizes this loss is the expected output given the available features, or E[f(x; η) | x_s]. By the same argument presented above, we then have the following result for the optimal surrogate g(x_s; β*) that minimizes eq. (14):

$$g(x_s; \beta^*) = \mathbb{E}\big[ f(\mathbf{x}; \eta) \mid x_s \big] \quad \text{a.e. in } p(x).$$

Notice that if the initial model is optimal, or f(x; η) = p(y | x), then the optimal surrogate satisfies g(x_s; β*) = p(y | x_s).

C EXPLAINER TRAINING APPROACH

In this section, we summarize our approach for training the explainer model and describe several design choices. Recall that the explainer is a vision transformer ϕ_ViT(x, y; θ) ∈ R^d that we train by minimizing the following loss:

$$\min_\theta \; \mathbb{E}_{p(x, y)} \, \mathbb{E}_{p_{\mathrm{Sh}}(s)} \Big[ \big( v_{xy}(s) - v_{xy}(0) - s^\top \phi_{\mathrm{ViT}}(x, y; \theta) \big)^2 \Big]$$
$$\text{s.t.} \quad \mathbf{1}^\top \phi_{\mathrm{ViT}}(x, y; \theta) = v_{xy}(1) - v_{xy}(0) \quad \forall \, (x, y).$$

Additive efficient normalization

The constraint on the explainer predictions is necessary to ensure that the global optimizer outputs the exact Shapley values, and we use the same approach as prior work to enforce this constraint (Jethani et al., 2021b). We allow the model to make unconstrained predictions that we then modify using the following transformation:

$$\phi_{\mathrm{ViT}}(x, y; \theta) \leftarrow \phi_{\mathrm{ViT}}(x, y; \theta) + \frac{v_{xy}(1) - v_{xy}(0) - \mathbf{1}^\top \phi_{\mathrm{ViT}}(x, y; \theta)}{d}. \tag{15}$$

This operation is known as the additive efficient normalization (Ruiz et al., 1998), and it can be interpreted as projecting the predictions onto the hyperplane where the constraint holds (Jethani et al., 2021b). We implement it as an output activation function, similar to how softmax is used to ensure valid probabilistic predictions for classification models.
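Eq. (15) is a one-line operation in practice. A numpy sketch (our own helper; in the actual model this would act on the explainer's output tensor):

```python
import numpy as np

def additive_efficient_normalization(phi, v_one, v_zero):
    """Eq. (15): shift unconstrained predictions phi (shape (..., d)) by a
    constant per example so they sum to v(1) - v(0), projecting onto the
    efficiency hyperplane."""
    d = phi.shape[-1]
    return phi + (v_one - v_zero - phi.sum(-1, keepdims=True)) / d
```

After the shift, the attributions satisfy the efficiency constraint exactly, regardless of the raw predictions.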

Subset distribution

The specific distribution p_Sh(s) in our loss function is motivated by the Shapley value's weighted least squares characterization (Charnes et al., 1988; Lundberg and Lee, 2017). This result states that the Shapley values for a game v : {0, 1}^d → R are the solution to the following optimization problem:

$$\min_{\phi \in \mathbb{R}^d} \; \sum_{0 < \mathbf{1}^\top s < d} \frac{d - 1}{\binom{d}{\mathbf{1}^\top s} (\mathbf{1}^\top s)(d - \mathbf{1}^\top s)} \big( v(s) - v(0) - s^\top \phi \big)^2$$
$$\text{s.t.} \quad \mathbf{1}^\top \phi = v(1) - v(0).$$

We obtain p_Sh(s) by normalizing the weighting term in the summation, and doing so yields a distribution p_Sh(s) ∝ (1^⊤s − 1)!(d − 1^⊤s − 1)! for 0 < 1^⊤s < d and p_Sh(1) = p_Sh(0) = 0. To sample from p_Sh(s), we calculate the probability mass on each cardinality, sample a cardinality m from this multinomial distribution, and then select m indices uniformly at random.
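The cardinality distribution has a convenient closed form: there are C(d, m) subsets of size m, so the total mass on cardinality m is proportional to C(d, m)(m − 1)!(d − m − 1)! = d!/(m(d − m)), i.e., proportional to 1/(m(d − m)). A numpy sketch of the sampling procedure (names are our own):

```python
import numpy as np

def sample_shapley_subset(d, rng):
    """Sample s ~ p_Sh(s): the mass on cardinality m is proportional to
    C(d, m) * (m-1)! * (d-m-1)!, which simplifies to 1 / (m * (d - m));
    then choose m patch indices uniformly without replacement."""
    m_vals = np.arange(1, d)  # p_Sh excludes the empty and full subsets
    w = 1.0 / (m_vals * (d - m_vals))
    m = rng.choice(m_vals, p=w / w.sum())
    s = np.zeros(d, dtype=np.int64)
    s[rng.choice(d, size=m, replace=False)] = 1
    return s
```

Note that the weight 1/(m(d − m)) concentrates mass on very small and very large subsets, mirroring the Shapley kernel used by KernelSHAP.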

Stochastic gradient descent

As is common in deep learning, we optimize our objective using stochastic gradients rather than exact gradients. To estimate our objective, we require a set of tuples $(x, y, s)$ that we obtain as follows. First, we sample an input $x \sim p(x)$. Next, we sample multiple subsets $s \sim p_{\text{Sh}}(s)$. To reduce gradient variance, we use the paired sampling trick (Covert and Lee, 2021) and pair each subset $s$ with its complement $1 - s$. Then, we use our explainer to output Shapley values simultaneously for all classes $y \in \{1, \ldots, K\}$. Finally, we minibatch this procedure across multiple inputs $x$ and calculate our loss across the resulting set of tuples $(x, y, s)$.

Fine-tuning. Rather than training the ViT explainer from scratch, we find that fine-tuning an existing model leads to better performance. This is consistent with recent work that finds ViTs challenging to train from scratch (Dosovitskiy et al., 2020). We have several options for initializing the explainer: we can use 1) the original classifier $f(x; \eta)$, 2) the fine-tuned classifier $g(x_s; \beta)$, or 3) a ViT pre-trained on another task. We treat this choice as a hyperparameter, selecting the initialization that yields the best performance. We also experiment with freezing certain layers in the model, but we find that training all the parameters leads to the best performance.

Explainer architecture. We use standard ViT architectures for the explainer. These typically append a class token to the set of image tokens (Dosovitskiy et al., 2020), and we find it beneficial to preserve this token in pre-trained architectures even though it is unnecessary for Shapley value estimation. We require a separate output head from the pre-trained architecture, and our explainer head consists of one additional self-attention block followed by three fully-connected layers. Each image patch yields one Shapley value estimate per class, and we discard the results for the class token.
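To make the subset-generation step concrete, here is a minimal sketch of assembling the masks for one SGD step with paired sampling; the function name and batch layout are our own assumptions, not the authors' implementation:

```python
import numpy as np

def paired_subset_batch(n_inputs, n_subsets, d, rng):
    """Subset masks for one SGD step, using the paired sampling trick.

    Half of the masks are drawn from p_Sh(s) (via its cardinality
    distribution, proportional to 1/(k(d-k))); each draw is paired with
    its complement 1 - s, which reduces gradient variance (Covert and
    Lee, 2021).  Returns an array of shape (n_inputs, n_subsets, d).
    """
    k_vals = np.arange(1, d)
    probs = 1.0 / (k_vals * (d - k_vals))
    probs /= probs.sum()
    masks = np.zeros((n_inputs, n_subsets, d), dtype=int)
    for i in range(n_inputs):
        for j in range(0, n_subsets, 2):
            k = rng.choice(k_vals, p=probs)
            s = np.zeros(d, dtype=int)
            s[rng.choice(d, size=k, replace=False)] = 1
            masks[i, j] = s
            masks[i, j + 1] = 1 - s  # paired complement
    return masks

rng = np.random.default_rng(0)
masks = paired_subset_batch(n_inputs=4, n_subsets=8, d=196, rng=rng)
```

Each mask would then be applied to the classifier via attention masking to obtain $v_{xy}(s)$, while the explainer produces all $K$ class outputs in one forward pass.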
Hyperparameter tuning. To select hyperparameters related to the learning rate, initialization, and architecture, we use a pre-computed set of tuples $(x, y, s)$ to calculate a validation loss. These are generated using inputs $x$ that were not used for training, so our validation loss can be interpreted as an unbiased estimator of the objective function. This approach serves as an inexpensive alternative to comparing with ground truth Shapley values for a large number of samples. Each training step applies the additive efficient normalization, calculates the loss, and updates the parameters:

set $\phi \leftarrow \phi + d^{-1}\left( v_{xy}(1) - v_{xy}(0) - \mathbf{1}^\top \phi \right)$
calculate $L \leftarrow \left( v_{xy}(s) - v_{xy}(0) - s^\top \phi \right)^2$
update $\theta \leftarrow \theta - \alpha \nabla_\theta L$
end

C.1 HYPERPARAMETER CHOICES

When training the original classifier and fine-tuned classifier models, we used a learning rate of $10^{-5}$ and trained for 25 epochs and 50 epochs, respectively. The MURA classifier was trained with an upweighted loss for negative examples to account for class imbalance. The best model was selected based on the validation criterion, where we used 0-1 accuracy for ImageNette and Oxford-IIIT Pets, and Cohen's kappa for MURA. When training the explainer model, we used the same ViT-Base architecture as the original classifier and initialized using the fine-tuned classifier, as this gave the best results. We used the AdamW optimizer (Loshchilov and Hutter, 2018) with a cosine learning rate schedule and a maximum learning rate of $10^{-4}$, and we trained the model for 100 epochs, selecting the best model based on the validation loss. We used standard data augmentation steps: random resized crops, vertical flips, horizontal flips, and color jittering including brightness, contrast, saturation, and hue. We used minibatches of size 64 with 32 subset samples $s$ per $x$ sample, and we found that using a tanh nonlinearity on the explainer predictions was helpful to stabilize training.
Finally, we modified the ViT architecture to output Shapley values for each token and each class: we removed the classification head and added an extra attention layer, followed by three fully-connected layers with width 4 times the embedding dimension, and we fine-tuned the entire ViT backbone. These choices were determined by an ablation study with different model configurations, and we also compared with training the ViT from scratch and training a separate U-Net explainer model (Ronneberger et al., 2015) (see Table 3). We used a machine with 2 GeForce RTX 2080Ti GPUs to train the explainer model, and due to GPU memory constraints we loaded the classifier and explainer onto separate GPUs and trained with mixed precision using PyTorch Lightning. Training the explainer model required roughly 19 hours for the ImageNette dataset and 60 hours for the MURA dataset.

D PROOFS

Here, we re-state and prove our main results from Section 5.

Lemma 1. For a single input-output pair $(x, y)$, the expected loss under eq. (4) for the prediction $\phi_{\text{ViT}}(x, y; \theta)$ is $\mu$-strongly convex with $\mu = H_{d-1}^{-1}$, where $H_{d-1}$ is the $(d-1)$th harmonic number.

Proof. For an input-output pair $(x, y)$, the expected loss for the prediction $\phi = \phi_{\text{ViT}}(x, y; \theta)$ under our objective is given by

$$h_{xy}(\phi) = \phi^\top \mathbb{E}_{p_{\text{Sh}}(s)}[s s^\top] \phi - 2 \phi^\top \mathbb{E}_{p_{\text{Sh}}(s)}\left[ s \left( v_{xy}(s) - v_{xy}(0) \right) \right] + \mathbb{E}_{p_{\text{Sh}}(s)}\left[ \left( v_{xy}(s) - v_{xy}(0) \right)^2 \right].$$

This is a quadratic function of $\phi$ with its Hessian given by $\nabla^2_\phi h_{xy}(\phi) = 2 \cdot \mathbb{E}_{p_{\text{Sh}}(s)}[s s^\top]$. The convexity of $h_{xy}(\phi)$ is determined by the Hessian's eigenvalues, and its entries can be derived from the subset distribution $p_{\text{Sh}}(s)$; see similar results in Simon and Vincent (2020) and Covert and Lee (2021). The distribution assigns equal probability to subsets of equal cardinality, so we define the shorthand notation $p_k \equiv p_{\text{Sh}}(s)$ for $s$ such that $\mathbf{1}^\top s = k$. Specifically, we have

$$p_k = Q^{-1} \binom{d}{k}^{-1} \frac{d-1}{k(d-k)}, \qquad Q = \sum_{k=1}^{d-1} \frac{d-1}{k(d-k)}.$$

We can then write $A \equiv \mathbb{E}_{p_{\text{Sh}}(s)}[s s^\top]$ and derive its entries as follows:

$$A_{ii} = \Pr(s_i = 1) = \sum_{k=1}^{d-1} \binom{d-1}{k-1} p_k = Q^{-1} \sum_{k=1}^{d-1} \frac{d-1}{d(d-k)} = \frac{\sum_{k=1}^{d-1} \frac{d-1}{d(d-k)}}{\sum_{k=1}^{d-1} \frac{d-1}{k(d-k)}},$$

$$A_{ij} = \Pr(s_i = s_j = 1) = \sum_{k=2}^{d-1} \binom{d-2}{k-2} p_k = Q^{-1} \sum_{k=2}^{d-1} \frac{k-1}{d(d-k)} = \frac{\sum_{k=2}^{d-1} \frac{k-1}{d(d-k)}}{\sum_{k=1}^{d-1} \frac{d-1}{k(d-k)}}.$$

Based on this, we can see that $A$ has the structure $A = (b - c) I_d + c \mathbf{1}\mathbf{1}^\top$ for $b \equiv A_{ii}$ and $c \equiv A_{ij}$. Following the argument by Simon and Vincent (2020), the minimum eigenvalue is then given by $\lambda_{\min}(A) = b - c$. Deriving the specific value shows that it depends on the $(d-1)$th harmonic number, $H_{d-1}$:

$$\lambda_{\min}(A) = b - c = A_{ii} - A_{ij} = \frac{\frac{1}{d} + \sum_{k=2}^{d-1} \left( \frac{d-1}{d(d-k)} - \frac{k-1}{d(d-k)} \right)}{\sum_{k=1}^{d-1} \frac{d-1}{k(d-k)}} = \frac{1}{d \sum_{k=1}^{d-1} \frac{1}{k(d-k)}} = \frac{1}{2 \sum_{k=1}^{d-1} \frac{1}{k}} = \frac{1}{2 H_{d-1}}.$$

The minimum eigenvalue is therefore strictly positive, and this implies that $h_{xy}(\phi)$ is $\mu$-strongly convex with $\mu$ given by $\mu = 2 \cdot \lambda_{\min}(A) = H_{d-1}^{-1}$.
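Lemma 1 can be checked numerically for small $d$ by enumerating all proper subsets and forming $\mathbb{E}[s s^\top]$ directly; the following sketch is our own verification code, not part of the paper's implementation:

```python
import numpy as np
from itertools import product
from math import factorial

def min_eig_check(d):
    """Numerically verify Lemma 1: lambda_min(E[s s^T]) = 1 / (2 H_{d-1})."""
    subsets, weights = [], []
    for s in product([0, 1], repeat=d):
        k = sum(s)
        if 0 < k < d:  # p_Sh puts zero mass on the empty and full subsets
            subsets.append(np.array(s))
            weights.append(factorial(k - 1) * factorial(d - k - 1))
    w = np.array(weights, dtype=float)
    w /= w.sum()  # normalize p_Sh over all proper subsets
    A = sum(p * np.outer(s, s) for p, s in zip(w, subsets))
    lam_min = np.linalg.eigvalsh(A)[0]       # eigvalsh sorts ascending
    H = sum(1.0 / k for k in range(1, d))    # (d-1)th harmonic number
    return lam_min, 1.0 / (2 * H)

lam, pred = min_eig_check(6)  # the two values agree to machine precision
```

Enumeration is exponential in $d$, so this check is only feasible for small dimensions, but it confirms the closed-form eigenvalue used in the strong convexity constant.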
Note that the strong convexity constant $\mu$ does not depend on $(x, y)$ and is determined solely by the number of features $d$.

Theorem 1. For a model $\phi_{\text{ViT}}(x, y; \theta)$ whose predictions satisfy the constraint in eq. (4), the objective value $\mathcal{L}(\theta)$ upper bounds the Shapley value estimation error as follows:

$$\mathbb{E}_{p(x,y)} \left\| \phi_{\text{ViT}}(x, y; \theta) - \phi(v_{xy}) \right\| \le \sqrt{2 H_{d-1} \left( \mathcal{L}(\theta) - \mathcal{L}^* \right)},$$

where $\mathcal{L}^*$ represents the loss achieved by the exact Shapley values.

Proof. We first consider a single input-output pair $(x, y)$ with prediction given by $\phi = \phi_{\text{ViT}}(x, y; \theta)$. Rather than writing the expected loss $h_{xy}(\phi)$, we now write the Lagrangian $L_{xy}(\phi, \gamma)$ to account for the linear constraint in our objective, see eq. (4):

$$L_{xy}(\phi, \gamma) = h_{xy}(\phi) + \gamma \left( v_{xy}(1) - v_{xy}(0) - \mathbf{1}^\top \phi \right).$$

Regardless of the Lagrange multiplier value $\gamma \in \mathbb{R}$, the Lagrangian $L_{xy}(\phi, \gamma)$ is a $\mu$-strongly convex quadratic with the same Hessian as $h_{xy}(\phi)$: $\nabla^2_\phi L_{xy}(\phi, \gamma) = \nabla^2_\phi h_{xy}(\phi)$. Strong convexity enables us to bound $\phi$'s distance to the global minimizer via the Lagrangian's value. First, we denote the Lagrangian's optimizer as $(\phi^*, \gamma^*)$, where $\phi^*$ is given by the exact Shapley values (Charnes et al., 1988): $\phi^* = \phi(v_{xy})$. Next, we use the first-order strong convexity condition to write the following inequality:

$$L_{xy}(\phi, \gamma^*) \ge L_{xy}(\phi^*, \gamma^*) + (\phi - \phi^*)^\top \nabla_\phi L_{xy}(\phi^*, \gamma^*) + \frac{\mu}{2} \|\phi - \phi^*\|_2^2.$$

According to the Lagrangian's KKT conditions (Boyd et al., 2004), we have the property that $\nabla_\phi L_{xy}(\phi^*, \gamma^*) = 0$. The inequality therefore simplifies to

$$L_{xy}(\phi, \gamma^*) \ge L_{xy}(\phi^*, \gamma^*) + \frac{\mu}{2} \|\phi - \phi^*\|_2^2,$$

or equivalently,

$$\|\phi - \phi^*\|_2^2 \le \frac{2}{\mu} \left( L_{xy}(\phi, \gamma^*) - L_{xy}(\phi^*, \gamma^*) \right).$$

If we constrain $\phi$ to be a feasible solution (i.e., it satisfies our objective's linear constraint), the KKT primal feasibility condition implies that the inequality further simplifies to

$$\|\phi - \phi^*\|_2^2 \le \frac{2}{\mu} \left( h_{xy}(\phi) - h_{xy}(\phi^*) \right). \tag{17}$$

Now, we can consider this bound in expectation over $p(x, y)$.
First, we denote our full training loss as $\mathcal{L}(\theta)$, which is equal to

$$\mathcal{L}(\theta) = \mathbb{E}_{p(x,y)} \mathbb{E}_{p_{\text{Sh}}(s)} \left[ \left( v_{xy}(s) - v_{xy}(0) - s^\top \phi_{\text{ViT}}(x, y; \theta) \right)^2 \right] = \mathbb{E}_{p(x,y)} \left[ h_{xy}\big( \phi_{\text{ViT}}(x, y; \theta) \big) \right].$$

Next, we denote $\mathcal{L}^*$ as the training loss achieved by the exact Shapley values, or $\mathcal{L}^* = \mathbb{E}_{p(x,y)} \left[ h_{xy}\big( \phi(v_{xy}) \big) \right]$. Given a network $\phi_{\text{ViT}}(x, y; \theta)$ whose predictions are constrained to satisfy the linear constraint, taking the bound from eq. (17) in expectation yields the following bound on the distance between the predicted and exact Shapley values:

$$\mathbb{E}_{p(x,y)} \left\| \phi_{\text{ViT}}(x, y; \theta) - \phi(v_{xy}) \right\|_2^2 \le \frac{2}{\mu} \left( \mathcal{L}(\theta) - \mathcal{L}^* \right).$$

Applying Jensen's inequality to the left side, we can rewrite the bound as follows:

$$\mathbb{E}_{p(x,y)} \left\| \phi_{\text{ViT}}(x, y; \theta) - \phi(v_{xy}) \right\| \le \sqrt{\frac{2}{\mu} \left( \mathcal{L}(\theta) - \mathcal{L}^* \right)}.$$

Substituting in the strong convexity parameter $\mu$ from Lemma 1, we arrive at the final bound:

$$\mathbb{E}_{p(x,y)} \left\| \phi_{\text{ViT}}(x, y; \theta) - \phi(v_{xy}) \right\| \le \sqrt{2 H_{d-1} \left( \mathcal{L}(\theta) - \mathcal{L}^* \right)}.$$

We also present a corollary to Theorem 1. This result formalizes the intuition that if we can iteratively optimize the explainer such that its loss approaches the optimum, our Shapley value estimation error will go to zero.

Corollary 1. Given a sequence of models $\phi_{\text{ViT}}(x, y; \theta_1), \phi_{\text{ViT}}(x, y; \theta_2), \ldots$ whose predictions satisfy the constraint in eq. (4) and where $\mathcal{L}(\theta_n) \to \mathcal{L}^*$, the Shapley value estimation error goes to zero:

$$\lim_{n \to \infty} \mathbb{E}_{p(x,y)} \left\| \phi_{\text{ViT}}(x, y; \theta_n) - \phi(v_{xy}) \right\| = 0.$$

Proof. Fix $\epsilon > 0$. By assumption, there exists a value $n'$ such that $\mathcal{L}(\theta_n) - \mathcal{L}^* < \frac{\mu \epsilon^2}{2}$ for $n > n'$. Following the result in Theorem 1, we have $\mathbb{E}_{p(x,y)} \left\| \phi_{\text{ViT}}(x, y; \theta_n) - \phi(v_{xy}) \right\| < \epsilon$ for $n > n'$.

Finally, we also consider the role of our loss function in quantifying the Shapley value estimation error, which we define for a given explainer model $\phi_{\text{ViT}}(x, y; \theta)$ as

$$\text{SVE} = \mathbb{E}_{p(x,y)} \left\| \phi_{\text{ViT}}(x, y; \theta) - \phi(v_{xy}) \right\|.$$

One natural approach is to use an external dataset (e.g., the test data) consisting of samples $(x_i, y_i)$ for $i = 1, \ldots, n$, calculate their exact Shapley values $\phi(v_{x_i y_i})$, and generate a Monte Carlo estimate as follows:

$$\widehat{\text{SVE}}_n = \frac{1}{n} \sum_{i=1}^n \left\| \phi_{\text{ViT}}(x_i, y_i; \theta) - \phi(v_{x_i y_i}) \right\|.$$

While standard concentration inequalities allow us to bound SVE using $\widehat{\text{SVE}}_n$, generating the ground truth values can be computationally costly, particularly for large $n$. Instead, another approach is to use our result from Theorem 1, which bypasses the need for ground truth Shapley values. For this, recall that $\mathcal{L}(\theta)$ represents our weighted least squares loss function, where we assume that the explainer $\phi_{\text{ViT}}(x, y; \theta)$ satisfies the constraint in eq. (4) for all predictions. If we know $\mathcal{L}(\theta)$ exactly, then Theorem 1 yields the following bound with probability 1:

$$\text{SVE} \le \sqrt{2 H_{d-1} \left( \mathcal{L}(\theta) - \mathcal{L}^* \right)}.$$

If we do not know $\mathcal{L}(\theta)$ exactly, we can instead form a Monte Carlo estimate $\hat{\mathcal{L}}(\theta)_n$ using samples $(x_i, y_i, s_i)$ for $i = 1, \ldots, n$. Then, using concentration inequalities like Chebyshev or Hoeffding (the latter only applies if we assume bounded errors), we can get probabilistic bounds of the form $P\big(|\mathcal{L}(\theta) - \hat{\mathcal{L}}(\theta)_n| > \epsilon\big) \le \delta$. With these, we can say with probability at least $1 - \delta$ that $\mathcal{L}(\theta) \le \hat{\mathcal{L}}(\theta)_n + \epsilon$. Finally, combining this with the last steps of our Theorem 1 proof, we obtain the following bound with probability at least $1 - \delta$:

$$\text{SVE} \le \sqrt{2 H_{d-1} \left( \hat{\mathcal{L}}(\theta)_n - \mathcal{L}^* + \epsilon \right)}.$$

Naturally, $\delta$ is a function of $\epsilon$ and the number of samples $n$ used to estimate $\hat{\mathcal{L}}(\theta)_n$, with the rate of convergence to probability 1 depending on the choice of concentration inequality (Chebyshev or Hoeffding). Although this procedure yields an inexpensive upper bound on the Shapley value estimation error, the bound's looseness, as well as the fact that we do not know $\mathcal{L}^*$ a priori, make it unappealing as an evaluation metric.
The more important takeaways are 1) that training with the loss $\mathcal{L}(\theta)$ effectively minimizes an upper bound on the Shapley value estimation error, and 2) that comparing explainer models via their validation loss, which is effectively $\hat{\mathcal{L}}(\theta)_n$, is a principled approach to perform model selection and hyperparameter tuning.
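The probabilistic bound described above can be sketched in code, using Hoeffding's inequality for the concentration step; all names are our own, and in practice `loss_opt` ($\mathcal{L}^*$) and `loss_range` would need to be estimated or assumed:

```python
import numpy as np

def sve_upper_bound(losses, loss_opt, d, delta, loss_range):
    """Probabilistic upper bound on the Shapley value estimation error (SVE).

    losses:     per-sample values of the weighted least squares loss
    loss_opt:   L*, the loss achieved by the exact Shapley values (assumed known)
    d:          number of features (image patches)
    delta:      failure probability for the concentration bound
    loss_range: bound on the loss values, required for Hoeffding's inequality
    """
    n = len(losses)
    loss_hat = np.mean(losses)                               # Monte Carlo estimate of L(theta)
    eps = loss_range * np.sqrt(np.log(2 / delta) / (2 * n))  # Hoeffding deviation
    harmonic = sum(1.0 / k for k in range(1, d))             # H_{d-1}
    return np.sqrt(2 * harmonic * (loss_hat - loss_opt + eps))

rng = np.random.default_rng(0)
bound = sve_upper_bound(rng.uniform(0.1, 0.3, size=1000), loss_opt=0.05,
                        d=196, delta=0.05, loss_range=0.3)
```

As expected, the bound tightens as more samples are used to estimate the loss, since the Hoeffding deviation shrinks at rate $O(n^{-1/2})$.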

E DATASETS

The training and validation data were used to train the original classifiers, fine-tuned classifiers and explainer models, and the test data was used only when calculating performance metrics.

F BASELINE METHODS

This section provides implementation details for the baseline explanation methods. We used a variety of attention-, gradient- and removal-based methods as comparisons for ViT Shapley, and we modified several approaches to arrive at patch-level feature attribution scores.

Attention last. This approach calculates the attention directed from each image token into the class token in the final self-attention layer, summed across attention heads (Abnar and Zuidema, 2020; Chefer et al., 2021). The results are provided at the patch level, but they are not generated separately for each output class.

Attention rollout. This approach accounts for the flow of attention between tokens by summing across attention heads and multiplying the resulting attention matrices at each layer (Abnar and Zuidema, 2020). Like the previous method, results are not generated separately for each output class. We used an implementation provided by prior work (Chefer et al., 2021).

Common gradient-based methods. Several methods that operate via input gradients are Vanilla gradients (Simonyan et al., 2013), SmoothGrad (Smilkov et al., 2017), VarGrad (Hooker et al., 2019), and IntGrad (Sundararajan et al., 2017). These methods were run using the Captum package (Kokhlikyan et al., 2020), and we used 10 samples per image for SmoothGrad, VarGrad and IntGrad. We tried applying these at the level of pixels and patch embeddings, and in both cases we created class-specific, patch-level attributions by summing across the unnecessary dimensions. We calculated the absolute value before summing for Vanilla gradients and SmoothGrad, VarGrad automatically produces non-negative values, and we preserved the sign for IntGrad because it should be meaningful.

GradCAM. Originally designed for intermediate convolutional layers (Selvaraju et al., 2017), GradCAM has since been generalized to the ViT context.
The main operations remain the same, except that the representation being analyzed is the layer-normed input to the final self-attention layer, and the aggregation is across the embedding dimension rather than convolutional channels (GradCAM LN) (Gildenblat and contributors, 2021). We also experimented with using a different internal layer for generating explanations (the attention weights computed in the final self-attention layer, denoted as GradCAM Attn.) (Chefer et al., 2021).

Layer-wise relevance propagation (LRP). Originally described as a set of constraints for a modified backpropagation routine (Bach et al., 2015), LRP has since been implemented for a variety of network layers and architectures, and it was recently adapted to ViTs (Chefer et al., 2021). We used an implementation provided by prior work (Chefer et al., 2021).

Leave-one-out. The importance scores in this approach are the difference in prediction probability between the full image and the image with a single patch removed. We removed patches by setting pixels to zero, similar to the original version for CNNs (Zeiler and Fergus, 2014).

RISE. This approach involves sampling many occlusion masks and reporting the mean prediction when each patch is included. The original version for CNNs (Petsiuk et al., 2018) used a complex approach to generate masks, but we simply sampled subsets of patches. As in the original work, we sample from all subsets with equal probability, and we use 2,000 mask samples per image to be explained. We occlude patches by setting pixel values to zero, similar to the original work.

Random. Finally, we included a random baseline as a comparison for the insertion, deletion and ROAR metrics. These metrics only require a ranking of important patches, so we generated ten random orderings and averaged the results across these orderings.

Table 4 shows the same metrics as Table 1 with additional results for alternative implementations of several baselines.
For the methods based on input gradients, we experimented with generating explanations at both the pixel level and the embedding level; the preferred approach depends on the method and metric, but both versions tend to underperform ViT Shapley, with the exception of faithfulness on ImageNette, where the 95% confidence intervals overlap for many methods. We also experimented with two versions of GradCAM (described above) and find that the GradCAM LN implementation generally performs slightly better. In the main text, we present results only for GradCAM LN and for the remaining gradient-based methods generated at the embedding level.
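As a reference for the rollout computation described above, here is a minimal sketch (our own, not the implementation by Chefer et al. (2021)); we assume per-layer attention tensors of shape (heads, tokens, tokens) with the class token at index 0:

```python
import numpy as np

def attention_rollout(attentions):
    """Attention rollout (Abnar and Zuidema, 2020), a minimal sketch.

    Heads are averaged, the identity is added to model residual
    connections, rows are renormalized, and the per-layer matrices are
    multiplied from the first layer upward.  The class token's row
    (excluding itself) gives per-patch relevance.
    """
    rollout = None
    for attn in attentions:
        a = attn.mean(axis=0)                  # fuse heads
        a = a + np.eye(a.shape[0])             # residual connection
        a = a / a.sum(axis=-1, keepdims=True)  # renormalize rows
        rollout = a if rollout is None else a @ rollout
    return rollout[0, 1:]                      # class token -> patches

rng = np.random.default_rng(0)
attns = [rng.random((3, 5, 5)) for _ in range(4)]  # 4 layers, 3 heads, 5 tokens
scores = attention_rollout(attns)
```

Note that, as discussed above, the resulting scores are non-negative and shared across all output classes, which is one reason attention-based explanations cannot capture class-specific dependencies.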

G METRICS DETAILS

This section provides additional details about the performance metrics used in the main text experiments (Section 6).

Insertion/deletion. These metrics involve repeatedly making predictions while either inserting or deleting features in order of most to least important (Petsiuk et al., 2018). While the original work removed features by setting them to zero, we use the fine-tuned classifier that was trained to handle partial information. We calculated the area under the curve for individual predictions and then averaged the results across 1,000 test set examples; we used random examples for ImageNette, and only examples that were predicted to be abnormal for MURA. Table 1 presents results for the true class only, and Table 2 presents results averaged across all the remaining classes.

Sensitivity-n. This metric samples feature subsets at random and calculates the correlation between the prediction with each subset and the sum of the corresponding features' attribution scores (Ancona et al., 2018). It typically considers subsets of a fixed size $n$, which means sampling from the following subset distribution $p_n(s)$:

$$p_n(s) = \mathbb{1}(\mathbf{1}^\top s = n) \binom{d}{n}^{-1}.$$

Mathematically, the metric is defined for a model $f(x)$, an individual sample $x$ and label $y$, feature attributions $\phi \in \mathbb{R}^d$ and subset size $n$ as follows:

$$\text{Sens}(f, x, y, \phi, n) = \text{Corr}_{p_n(s)} \left( s^\top \phi, \; f_y(x) - f_y(x_{1-s}) \right).$$

Similar to insertion/deletion, we use the fine-tuned classifier to handle held-out patches and calculate the metric across 1,000 test set images. We use subset sizes ranging from 14 to 182 patches with step size 14, and we estimate the correlation for each example and subset size using 1,000 subset samples.

Faithfulness. This metric is nearly identical to sensitivity-n, only it calculates the correlation across subsets of all sizes (Bhatt et al., 2021).
Mathematically, it is defined as

$$\text{Faith}(f, x, y, \phi) = \text{Corr}_{p(s)} \left( s^\top \phi, \; f_y(x) - f_y(x_{1-s}) \right),$$

and we sample from a distribution with equal probability mass on all cardinalities, or

$$p(s) = \left[ \binom{d}{\mathbf{1}^\top s} \cdot (d + 1) \right]^{-1}.$$

We use the fine-tuned classifier to handle held-out patches, and we compute faithfulness across 1,000 test set images and with 1,000 subset samples per image.

ROAR. Finally, ROAR evaluates the model's accuracy after removing features in order from most to least important (Hooker et al., 2019). We also experimented with inserting features in order of most to least important. Crucially, the ROAR authors propose handling held-out features by retraining the model with masked inputs. We performed masked retraining by applying test-time augmentations to all training, validation and test set images, generating explanations to identify the most important patches for the true class, and setting the corresponding pixels to zero. Because masked retraining leaks information through the masking, we also replicated this metric using the fine-tuned classifier model, and with a separate evaluator model trained directly with random masking; the evaluator model trained with random masking has been used in prior work (Jethani et al., 2021a;b). We generated results for each number of inserted/deleted patches (1, 3, 7, 14, 28, 56, 84, 112, 140, 168, and 182), with the final accuracy computed across the entire test set.

Ground truth metrics. Previous work has considered evaluations involving comparison with ground truth importance, where the ground truth is either identified by humans (Chefer et al., 2021) or introduced via synthetic dataset modifications (Zhou et al., 2022). An important issue with such methods is that they test explanations against what a model should depend on rather than what it does depend on, so the results do not reflect the explanation's accuracy for the specific model (Petsiuk et al., 2018). We thus decided against including such metrics.
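The faithfulness computation above can be sketched as follows; `predict(x, s)` is an assumed interface that returns the model's output for $x$ with only the patches in the binary mask $s$ kept (e.g., via attention masking), and the names are our own:

```python
import numpy as np

def faithfulness(predict, x, phi, n_samples, rng):
    """Monte Carlo estimate of the faithfulness metric (Bhatt et al., 2021).

    Subsets are drawn with equal mass on every cardinality, and we
    correlate s^T phi with the prediction drop when the patches in s
    are withheld.
    """
    d = phi.shape[0]
    full = predict(x, np.ones(d, dtype=int))
    sums, drops = [], []
    for _ in range(n_samples):
        k = rng.integers(0, d + 1)                   # uniform cardinality
        s = np.zeros(d, dtype=int)
        s[rng.choice(d, size=k, replace=False)] = 1
        sums.append(s @ phi)
        drops.append(full - predict(x, 1 - s))       # remove the patches in s
    return np.corrcoef(sums, drops)[0, 1]

# Sanity check with a linear "model", whose exact Shapley values are x itself.
x = np.arange(1.0, 9.0)
corr = faithfulness(lambda x_, s: s @ x_, x, phi=x, n_samples=200,
                    rng=np.random.default_rng(0))
```

For the linear model in the sanity check, the attribution exactly predicts every prediction drop, so the correlation is 1; for real models and explanations the correlation quantifies how well the additive attributions summarize the model's behavior.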

H ADDITIONAL RESULTS

This section provides additional experimental results. We first show results involving similar baselines and metrics as in the main text, and we then show results comparing ViT Shapley to KernelSHAP.

H.1 MAIN BASELINES AND METRICS

Figure 5 shows our evaluation of attention masking for handling held-out image patches using two separate metrics: 1) KL divergence relative to the full-image predictions (also shown in the main text), and 2) top-1 accuracy relative to the true labels. The former can be understood as a divergence measure between the predictions with masked inputs and the predictions with patches marginalized out using their conditional distribution (see Appendix B). The latter is a more intuitive measure of how much the performance degrades given partial inputs. The results are similar between the two metrics, showing that the predictions diverge more quickly if the model is not fine-tuned with random masking.

Table 5 shows insertion, deletion and faithfulness results for the MURA dataset with examples that were predicted to be normal, but while evaluating explanations for the abnormal class. ViT Shapley outperforms the baseline methods, reflecting that our explanations correctly identify patches that influence the prediction towards and against the abnormal class even for normal examples.

Table 6 shows insertion, deletion and faithfulness results for the Pets dataset. We observe that ViT Shapley outperforms other methods for all metrics with the exception of faithfulness for target classes, where 95% confidence intervals overlap for many methods (similar to the other datasets).

Table 7, Table 8, and Table 9 show insertion, deletion, and faithfulness metrics for ImageNette when using other ViT architectures (i.e., ViT-Tiny, -Small, and -Large, respectively) (Wightman, 2019; Dosovitskiy et al., 2020) for the classifier and explainer. Each table shows results for both target-class and non-target-class explanations. The results are consistent with those obtained for ViT-Base, and ViT Shapley outperforms the baseline methods across all three metrics.
This shows that our explainer model can be trained successfully with architectures of different sizes, including when using a relatively small number of parameters. Table 10 shows insertion, deletion and faithfulness results for a ViT classifier trained directly with random masking. Whereas our Section 6 experiments utilize a fine-tuned classifier to handle missing patches, a classifier trained with random masking allows us to bypass the fine-tuning stage and train the explainer directly. The results are similar to Table 1 , and we find that ViT Shapley consistently achieves the best performance. Figure 6 , Figure 7 , and Figure 8 show the average curves used to generate the insertion/deletion AUC results. All sets of plots reflect that explanations from ViT Shapley identify relevant patches that quickly move the prediction towards or away from a given class. In the case of ImageNette and Pets, we observe that this holds for both target and non-target classes. Figure 9 shows the sensitivity-n metric evaluated for non-target classes on the ImageNette dataset. Similarly, these results show that ViT Shapley generates attribution scores that represent the impact of withholding features from a model, even for non-target classes. In this case, RISE and leave-one-out are more competitive with ViT Shapley, but their performance is less competitive when the correlation is calculated for subsets of all sizes (see faithfulness in Table 2 ). Next, Figure 10 shows ROAR results generated in both the insertion and deletion directions, using the four patch removal approaches: 1) the fine-tuned classifier, 2) the separate evaluator model trained directly with random masking, 3) masked retraining, and 4) masked retraining without positional embeddings. The results show that in addition to strong performance in the deletion direction, ViT Shapley consistently achieves the best results in the insertion direction, even in the case of masked retraining with positional embeddings. 
Figure 11 shows ROAR results for the same settings, but when using the ViT-Small architecture. We observe results consistent with those obtained with ViT-Base: except for the deletion direction with masked retraining and positional embeddings enabled, ViT Shapley achieves the best performance among all methods. Finally, Table 11 shows the time required to generate explanations using each approach. Because ViT Shapley requires a single forward pass through the explainer model, it is among the fastest approaches and is matched only by the attention-based methods. The gradient-based methods require forward and backward passes for all classes, and sometimes for many altered inputs (e.g., with noise injected for SmoothGrad). RISE is the slowest of all the approaches tested because it requires making several thousand predictions to explain each sample. Our evaluation was conducted on a GeForce RTX 2080 Ti GPU, with minibatches of 16 samples for attention last, attention rollout and ViT Shapley; a batch size of 1 for Vanilla gradients, GradCAM, LRP, leave-one-out and RISE; and internal minibatching for SmoothGrad, IntGrad and VarGrad (implemented via Captum (Kokhlikyan et al., 2020)). ViT Shapley is the only method considered here that requires training time, and as described in Appendix C, training the explainer models required roughly 0.8 days for ImageNette and 2.5 days for MURA. The training time is not insignificant, but investing time in training the explainer is worthwhile if 1) high-quality explanations are required, 2) there are many examples to be explained (e.g., an entire dataset), or 3) fast explanations are required during a model's deployment.

First, Figure 12 compares the approximation quality of Shapley value estimates produced by ViT Shapley and KernelSHAP. The estimates are evaluated in terms of L2 distance, Pearson correlation and Spearman (rank) correlation, and our ground truth is generated by running KernelSHAP for a large number of iterations.
Specifically, we use the convergence detection approach described by Covert and Lee (2021) with a threshold of t = 0.1. The results are computed using just 100 randomly selected ImageNette images due to the significant computational cost. Based on Figure 12, we observe that the original version of KernelSHAP takes roughly 120,000 model evaluations to reach the accuracy that ViT Shapley reaches with a single model evaluation. KernelSHAP with paired sampling (Covert and Lee, 2021) converges faster, but it still requires roughly 40,000 model evaluations on average. ViT Shapley's estimates are not perfect, but they reach nearly 0.8 correlation with the ground truth for the target class, and nearly 0.7 correlation on average across non-target classes. Next, Table 12 compares the ViT Shapley estimates and the fully converged estimates from KernelSHAP via the insertion and deletion metrics. The fully converged KernelSHAP estimates performed better than ViT Shapley on both metrics, and the gap is largest for deletion with the target class. The results were also computed using only 100 ImageNette examples due to the computational cost. These results reflect that there is room for further improvement if ViT Shapley's estimates can be made more accurate. KernelSHAP itself is not a viable option in practice, as we found that its estimates took between 30 minutes and 2 hours to converge when using paired sampling (Covert and Lee, 2021) (equivalent to roughly 300,000 and 1,200,000 model evaluations), but it represents an upper bound on how well ViT Shapley could perform with near-perfect estimation quality.

I QUALITATIVE EXAMPLES

This section provides qualitative examples for ViT Shapley and the baseline methods. When visualizing explanations from each method, we used the icefire color palette, a diverging colormap implemented in the Python Seaborn package (Waskom, 2021). Negative influence is an important feature for ViT Shapley, and a diverging colormap allows us to highlight both positive and negative contributions. To generate the plots shown in this paper, we first calculated the maximum absolute value of an explanation and then rescaled the values to each end of the colormap; next, we plotted the original image with an alpha of 0.85, and finally we performed bilinear upsampling on the explanation and overlaid the color-mapped result with an alpha of 0.9. The alpha values can be tuned to control the visibility of the original image.

Figure 13 shows a comparison between ViT Shapley and KernelSHAP explanations for several examples from the ImageNette dataset. The results are nearly identical, but ViT Shapley produces explanations in a single forward pass while KernelSHAP is considerably slower. Determining the number of samples required for KernelSHAP to converge is challenging, and we used the approach proposed by prior work with a convergence threshold of t = 0.2 (Covert and Lee, 2021). With this setting, and even with acceleration using paired sampling, the KernelSHAP explanations required between 30 minutes and 1 hour to generate per image, versus a single forward pass for ViT Shapley.

Figure 14, Figure 15 and Figure 16 show comparisons between ViT Shapley and the baselines on ImageNette samples. We only show results for attention last, attention rollout, Vanilla gradients, Integrated Gradients, SmoothGrad, LRP, leave-one-out, and ViT Shapley; we excluded VarGrad, GradCAM and RISE because their results were less visually appealing. The explanations are shown only for the target class, and we observe that ViT Shapley often highlights the main object of interest.
We also find that the model is prone to confounders, as ViT Shapley often highlights parts of the background that are correlated with the true class (e.g., the face of a man holding a tench, the clothes of a man holding a chainsaw, the sky in a parachute image). Similarly, Figure 17, Figure 18 and Figure 19 compare ViT Shapley to the baselines on example images from the MURA dataset. The ViT Shapley explanations almost always highlight clear signs of abnormality, which are only sometimes highlighted by the baselines. Among the methods shown here, LRP is most similar to ViT Shapley, but they disagree in several cases. Next, Figure 20, Figure 21, Figure 22, and Figure 23 compare ViT Shapley to the baselines on example images from the Pets dataset. We show one randomly sampled image per class (i.e., breed). We observe that ViT Shapley highlights distinctive features of each breed (e.g., the mouth of the Boxer or the fur pattern of the Egyptian Mau), and rarely puts significant importance on background patches. Finally, Figure 24 shows several examples of non-target class explanations. These results show that ViT Shapley highlights patches that can push the model's prediction towards certain non-target classes, which is not the case for other methods. These results corroborate those in Table 2 and Table 5, which show that ViT Shapley offers the most accurate class-specific explanations.
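The normalization and upsampling steps of the visualization procedure can be sketched as follows; this is our own illustration, and for brevity it uses nearest-neighbor upsampling via `np.kron` where the paper uses bilinear interpolation:

```python
import numpy as np

def overlay_heatmap_values(expl, patch_size=16):
    """Prepare a patch-level explanation for plotting, a minimal sketch.

    Values are rescaled so the largest absolute attribution maps to the
    ends of a diverging colormap (the paper uses Seaborn's icefire), then
    upsampled to pixel resolution.  The rescaled map would then be drawn
    over the original image (image alpha 0.85, heatmap alpha 0.9).
    """
    scale = np.abs(expl).max()
    normed = expl / scale if scale > 0 else expl  # symmetric range [-1, 1]
    return np.kron(normed, np.ones((patch_size, patch_size)))

expl = np.array([[0.5, -1.0], [0.25, 0.0]])  # toy 2 x 2 patch grid
heat = overlay_heatmap_values(expl, patch_size=16)
```

Rescaling by the maximum absolute value keeps zero attribution at the colormap's neutral midpoint, so positive and negative contributions are visually symmetric.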



The number of model evaluations depends on how fast the estimators converge, and we find that KernelSHAP requires >100,000 samples to converge for ViTs (Appendix H).
https://github.com/suinleelab/vit-shapley
https://github.com/PyTorchLightning/pytorch-lightning



Figure 1: Explanations where our approach identifies relevant information for target and non-target classes. Left: original images from the ImageNette and MURA datasets. Middle left: explanations generated by ViT Shapley for specific classes. Right: probability of the class being explained after the insertion or deletion of important patches (higher is better for insertion, lower for deletion).
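The insertion/deletion metric referenced in the caption can be sketched as follows. This is an illustrative simplification: `predict` stands in for the model's class probability given a masked input, and patches are masked with a zero baseline here, whereas the paper evaluates masked inputs via attention masking. All names are ours:

```python
import numpy as np

def insertion_deletion_curves(predict, image, attribution, baseline=0.0):
    """Insertion/deletion curves: add (or remove) patches in order of
    estimated importance and track the explained class's probability.

    predict: callable mapping a (d,)-masked feature vector to a score.
    attribution: (d,) per-patch importance scores.
    """
    order = np.argsort(-attribution)  # most important patches first
    d = len(attribution)
    ins, dele = [], []
    for k in range(d + 1):
        mask = np.zeros(d)
        mask[order[:k]] = 1.0
        # Insertion: keep only the top-k patches; deletion: remove them.
        ins.append(predict(image * mask + baseline * (1 - mask)))
        dele.append(predict(image * (1 - mask) + baseline * mask))
    return np.array(ins), np.array(dele)
```

A good explanation yields an insertion curve that rises quickly (high area under the curve) and a deletion curve that falls quickly (low area), matching the "higher is better for insertion, lower for deletion" convention in Figure 1.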

Figure 3: Sensitivity-n evaluation for different subset sizes. The metric is generated separately for a range of subset sizes, whereas faithfulness is calculated jointly over subsets of all sizes.
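The sensitivity-n metric (Ancona et al., 2018) evaluated in Figure 3 can be sketched as follows. This is a minimal Monte Carlo estimate under our own naming; as above, `predict` is a stand-in for the model's output on a masked input, and zero-masking is a simplification of the paper's attention masking:

```python
import numpy as np

def sensitivity_n(predict, x, attribution, n, num_samples=100, rng=None):
    """Sensitivity-n: Pearson correlation between the summed attributions of
    a removed size-n subset and the resulting drop in the model's output,
    estimated over randomly sampled subsets."""
    rng = np.random.default_rng(rng)
    d = len(attribution)
    full = predict(x)
    attr_sums, drops = [], []
    for _ in range(num_samples):
        idx = rng.choice(d, size=n, replace=False)
        mask = np.ones(d)
        mask[idx] = 0.0  # remove a random subset of n features
        attr_sums.append(attribution[idx].sum())
        drops.append(full - predict(x * mask))
    return np.corrcoef(attr_sums, drops)[0, 1]
```

For a linear model with exact attributions the correlation is 1 for every n; computing the metric separately across a range of n values reproduces the per-subset-size evaluation in Figure 3.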

Figure 5: ViT predictions given partial information. We delete patches at random using several removal mechanisms, and we then measure the quality of the resulting predictions via two metrics: the KL divergence relative to the original, full-image predictions (top), and the top-1 accuracy relative to the true labels (bottom).

Figure 12: Comparing the quality of Shapley value estimates obtained by ViT Shapley and KernelSHAP. The shaded areas represent 95% confidence intervals.

Evaluating ViT Shapley using standard explanation metrics, with explanations calculated for the target class only. Methods that fail to outperform the random baseline are shown in gray, and the best results are shown in bold (accounting for 95% confidence intervals).



Evaluating ViT Shapley for explaining non-target classes. Methods that fail to outperform the random baseline are shown in gray, and the best results are shown in bold (accounting for 95% confidence intervals).

We then generated metrics comparing ViT Shapley's approximation quality with that of KernelSHAP (Lundberg and Lee, 2017), and we found that ViT Shapley's accuracy is equivalent to running KernelSHAP for roughly 120,000 model evaluations (Appendix H). Lastly, we provide qualitative examples in Appendix I, including comparisons with the baselines. Overall, these results show that ViT Shapley is a practical and effective approach for explaining ViT predictions.

Training pseudocode. Algorithm 1 shows a simplified version of our training algorithm, without minibatching, sampling multiple subsets s, or parallelizing across the classes y.
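Since Algorithm 1 is not reproduced here, one unbatched step can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: the surrogate and explainer are toy linear layers, the subset distribution follows the Shapley kernel weighting, and the loss is a weighted least-squares regression of the explainer's additive predictions onto the surrogate's masked outputs. All dimensions and names are ours:

```python
import torch

d, K = 16, 3  # illustrative: 16 patches, 3 classes

def sample_subset(d):
    """Draw a subset from the Shapley kernel distribution:
    p(|s|) proportional to (d - 1) / (|s| * (d - |s|)), then a uniform subset."""
    sizes = torch.arange(1, d)
    weights = (d - 1) / (sizes.float() * (d - sizes).float())
    k = sizes[torch.multinomial(weights / weights.sum(), 1)].item()
    s = torch.zeros(d)
    s[torch.randperm(d)[:k]] = 1.0
    return s

# Toy stand-ins for the surrogate f(x, s) and the explainer phi(x).
surrogate_head = torch.nn.Linear(d, K)  # masked features -> class logits
explainer = torch.nn.Linear(d, d * K)   # features -> per-patch, per-class values

def training_step(x):
    """One unbatched step: regress phi onto the surrogate's masked outputs."""
    s = sample_subset(d)
    phi = explainer(x).reshape(d, K)                     # Shapley value estimates
    v_s = torch.log_softmax(surrogate_head(x * s), -1)   # value of coalition s
    v_0 = torch.log_softmax(surrogate_head(x * 0), -1)   # value of empty coalition
    pred = v_0 + s @ phi                                 # additive approximation
    loss = ((v_s - pred) ** 2).sum()
    loss.backward()
    return loss
```

The full algorithm additionally minibatches over images, samples multiple subsets per image, and computes the loss jointly across all classes, as noted above.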

Ablation experiments for ViT Shapley explainer architecture on the ImageNette dataset, with and without fine-tuning.

The ImageNette dataset contains 9,469 training examples and 3,925 validation examples.

Performance metrics for target-class explanations with additional baselines. Methods that fail to outperform the random baseline are shown in gray, and the best results are shown in bold (accounting for 95% confidence intervals).

MURA non-target metrics for images that were predicted to be normal. Methods that fail to outperform the random baseline are shown in gray, and the best results are shown in bold (accounting for 95% confidence intervals).

Performance metrics for ViT-Base on Pets. Methods that fail to outperform the random baseline are shown in gray, and the best results are shown in bold (accounting for 95% confidence intervals).

Performance metrics for ViT-Tiny on ImageNette. Methods that fail to outperform the random baseline are shown in gray, and the best results are shown in bold (accounting for 95% confidence intervals).

Comparing the quality of Shapley value estimates obtained using ViT Shapley and KernelSHAP via insertion/deletion scores.

KERNELSHAP COMPARISONS

Here, we provide two results comparing ViT Shapley with KernelSHAP.

ACKNOWLEDGEMENTS

We thank Mukund Sudarshan, Neil Jethani, Chester Holtz and the Lee Lab for helpful discussions. This work was funded by NSF DBI-1552309 and DBI-1759487, NIH R35-GM-128638 and R01-NIA-AG-061132.

BIBLIOGRAPHY

Abnar, S. and Zuidema, W. (2020). Quantifying attention flow in transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4190-4197.
Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. (2018). Sanity checks for saliency maps. Advances in Neural Information Processing Systems, 31.
Agarwal, C. and Nguyen, A. (2020). Explaining image classifiers by removing input features using generative models. In Proceedings of the Asian Conference on Computer Vision.
Ancona, M., Ceolini, E., Öztireli, C., and Gross, M. (2018). Towards better understanding of gradient-based attribution methods for deep neural networks. In International Conference on Learning Representations.



