CONTRASTIVE CORPUS ATTRIBUTION FOR EXPLAINING REPRESENTATIONS

Abstract

Despite the widespread use of unsupervised models, very few methods are designed to explain them. Most explanation methods explain a scalar model output. However, unsupervised models output representation vectors, the elements of which are not good candidates to explain because they lack semantic meaning. To bridge this gap, recent works defined a scalar explanation output: a dot product-based similarity in the representation space to the sample being explained (i.e., an explicand). Although this enabled explanations of unsupervised models, the interpretation of this approach can still be opaque because similarity to the explicand's representation may not be meaningful to humans. To address this, we propose contrastive corpus similarity, a novel and semantically meaningful scalar explanation output based on a reference corpus and a contrasting foil set of samples. We demonstrate that contrastive corpus similarity is compatible with many post-hoc feature attribution methods to generate COntrastive COrpus Attributions (COCOA) and quantitatively verify that features important to the corpus are identified. We showcase the utility of COCOA in two ways: (i) we draw insights by explaining augmentations of the same image in a contrastive learning setting (SimCLR); and (ii) we perform zero-shot object localization by explaining the similarity of image representations to jointly learned text representations (CLIP).

1. INTRODUCTION

Machine learning models based on deep neural networks are increasingly used in a diverse set of tasks including chess (Silver et al., 2018), protein folding (Jumper et al., 2021), and language translation (Jean et al., 2014). The majority of neural networks have many parameters, which impede humans from understanding them (Lipton, 2018). To address this, many tools have been developed to understand supervised models in terms of their predictions (Lundberg & Lee, 2017; Wachter et al., 2017). In this supervised setting, the model maps features to labels (f : X → Y), and explanations aim to understand the model's prediction of a label of interest. These explanations are interpretable because the label of interest (e.g., mortality, an image class) is meaningful to humans (Figure 1a). In contrast, models trained in unsupervised settings map features to representations (f : X → H). Existing supervised explanation methods can be applied to understand an individual element (h_i) in the representation space, but such explanations are not useful to humans unless h_i has a natural semantic meaning. Unfortunately, the meaning of individual elements in the representation space is unknown in general. One possible solution is to enforce representations to have semantic meaning as in Koh et al. (2020), but this approach requires concept labels for every single training sample, which is typically impractical. Another solution is to enforce learned representations to be disentangled as in Tran et al. (2017) and then manually identify semantically meaningful elements to explain, but this approach is not post-hoc and requires potentially undesirable modifications to the training process.

Related work.

Rather than explaining a single element in the representation, approaches that explain the representation as a whole have recently been proposed, including RELAX (Wickstrøm et al., 2021) and label-free feature importance (Crabbé & van der Schaar, 2022) (Figure 1b; additional related work in Appendix A). These approaches both aim to identify features in the explicand (the sample to explain) that, when removed, point the altered representation away from the explicand's original representation. Although RELAX and label-free feature importance successfully extend existing explanation techniques to an unsupervised setting, they have two major limitations. First, they only consider similarity to the explicand's representation; however, many other meaningful questions can be asked by examining similarity to other samples' representations, such as "Why is my explicand similar to dog images?" or "How is my rotation-augmented image similar to my original image?". Second, RELAX and label-free importance find features which increase similarity to the explicand in representation space from any direction, but in practice some of these directions may not be meaningful. Instead, just as human perception often explains by comparing against a contrastive counterpart, i.e., a foil (Kahneman & Miller, 1986; Lipton, 1990; Miller, 2019), we may wish to find features that move toward the explicand relative to an explicit foil. As an example, RELAX and label-free importance may identify features which increase similarity to a dog explicand image relative to other dog images or even cat images; however, they may also identify features which increase similarity relative to noise in the representation space corresponding to unmeaningful out-of-distribution samples. In contrast, we can use foil samples to ask specific questions such as, "What features increase similarity to my explicand relative to cat images?".
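To make the scalar being explained concrete, the following is a minimal sketch of the explicand-similarity explanation target that RELAX-style and label-free methods use: the dot-product similarity between a perturbed input's representation and the explicand's original representation. The `encoder` argument and function name are illustrative, not from the paper.

```python
import numpy as np

def explicand_similarity(encoder, explicand, perturbed):
    """Dot-product similarity between a perturbed input's representation
    and the explicand's own representation -- the scalar explanation
    target used by RELAX-style / label-free methods (illustrative sketch)."""
    h_e = encoder(explicand)   # explicand representation
    h_p = encoder(perturbed)   # representation after masking/removing features
    return float(np.dot(h_p, h_e))
```

A feature attribution method would then score each explicand feature by how much its removal decreases this similarity.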

Contribution.

(1) To address the limitations of prior works on explaining unsupervised models, we introduce COntrastive COrpus Attribution (COCOA), which allows users to choose corpus and foil samples in order to ask, "What features make my explicand's representation similar to my corpus, but dissimilar to my foil?" (Figure 1c). (2) We apply COCOA to representations learned by a self-supervised contrastive learning model and observe class-preserving features in image augmentations. (3) We perform object localization by explaining a mixed-modality model with COCOA.

Motivation.

Unsupervised models are prevalent and can learn effective representations for downstream classification tasks. Notable examples include contrastive learning (Chen et al., 2020) and self-supervised learning (Grill et al., 2020). Despite their widespread use and applicability, unsupervised models are largely opaque. Explaining them can help researchers understand and therefore better develop and compare representation learning methods (Wickstrøm et al., 2021; Crabbé & van der Schaar, 2022). In deployment, explanations can help users better monitor and debug these models (Bhatt et al., 2020). Moreover, COCOA is beneficial even in a supervised setting. Existing feature attribution methods only explain the classes the model has been trained to predict, so they can only explain classes which are fully labeled in the training set. In contrast, COCOA requires only a few class labels after the training process is complete, so it can be used more flexibly. For instance, if we train a supervised model on CIFAR-10 (Krizhevsky et al., 2009), existing methods can explain only the ten classes the model was trained on, whereas we can collect new samples from an unseen class and apply COCOA to a representation layer of the trained model to understand this new class.

2. NOTATION

We consider an arbitrary input space X and output space Z. Given a model f : X → Z and an explicand x^e ∈ X to be explained, local feature attribution methods assign a score to each input feature based on the feature's importance to a scalar explanation target. Because a model's intermediate or final output is usually not a scalar, feature attribution methods require an explanation target function γ_{f,*} : X → R that transforms the model's behavior into a scalar explanation target. The subscript f indicates that the model is a fixed parameter of the explanation target function, and * denotes an arbitrary number of additional parameters. To make this concrete, consider the following example with a classifier f^class : X → [0, 1]^C, where C is the number of classes.

Example 2.1. Given an explicand x^e ∈ X, let the explanation target be the predicted probability of the explicand's predicted class. Then the explanation target function is γ_{f^class, x^e}(x) = f^class_{j*}(x) for all x ∈ X, where j* = arg max_{j=1,...,C} f^class_j(x^e) and f^class_i(·) denotes the ith element of f^class(·). Here, the explanation target function has the additional subscript x^e to indicate that the explicand is a fixed parameter.
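The explanation target function of Example 2.1 can be sketched as a small helper that fixes the classifier and the explicand and returns the scalar-valued γ. This is an illustrative sketch assuming `f_class` returns a probability vector over C classes; `make_explanation_target` is a hypothetical name, not from the paper.

```python
import numpy as np

def make_explanation_target(f_class, x_e):
    """Example 2.1: gamma(x) returns the predicted probability, for input x,
    of the class that the classifier predicts for the explicand x_e."""
    j_star = int(np.argmax(f_class(x_e)))  # explicand's predicted class
    def gamma(x):
        return float(f_class(x)[j_star])   # probability of that class for x
    return gamma
```

Any local feature attribution method can then be applied to `gamma` as if it were a scalar-output model.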


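The abstract describes contrastive corpus similarity as a dot-product-based score defined with respect to a corpus and a foil set; its exact definition is given later in the paper. As a rough sketch only, under the assumption that it contrasts an average cosine similarity to the corpus against one to the foil set (the function name and this specific form are assumptions, not the paper's definition):

```python
import numpy as np

def contrastive_corpus_similarity(h_x, corpus_reps, foil_reps):
    """Assumed sketch: average cosine similarity of representation h_x to a
    corpus of representations, minus its average cosine similarity to a foil
    set. Illustrates the corpus-vs-foil contrast; the paper's exact
    definition may differ."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    corpus_term = np.mean([cos(h_x, h_c) for h_c in corpus_reps])
    foil_term = np.mean([cos(h_x, h_f) for h_f in foil_reps])
    return float(corpus_term - foil_term)
```

Used as an explanation target function, this scalar lets a feature attribution method surface explicand features that pull its representation toward the corpus and away from the foil.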