DIAGNOSING AND RECTIFYING VISION MODELS USING LANGUAGE

Abstract

Recent multi-modal contrastive learning models have demonstrated the ability to learn an embedding space suitable for building strong vision classifiers, by leveraging the rich information in large-scale image-caption datasets. Our work highlights a distinct advantage of this multi-modal embedding space: the ability to diagnose vision classifiers through natural language. The traditional process of diagnosing model behaviors in deployment settings involves labor-intensive data acquisition and annotation. Our proposed method can discover high-error data slices, identify influential attributes and further rectify undesirable model behaviors, without requiring any visual data. Through a combination of theoretical explanation and empirical verification, we present conditions under which classifiers trained on embeddings from one modality can be equivalently applied to embeddings from another modality. On a range of image datasets with known error slices, we demonstrate that our method can effectively identify the error slices and influential attributes, and can further use language to rectify failure modes of the classifier.

1. INTRODUCTION

Recent models trained using multi-modal contrastive learning have leveraged large-scale datasets of aligned image-caption pairs to obtain shared embedding spaces that capture rich visual and textual features. The learned image and text encoders resulting from multi-modal contrastive learning have been demonstrated to be effective feature extractors that can be used to train strong single-modality classifiers (Radford et al., 2021; Jia et al., 2021; Yuan et al., 2021). In this work, we show how visual classification models obtained through multi-modal contrastive learning, as described above, offer a significant additional advantage: the ability to use language to probe and diagnose the behavior of the vision models. Model diagnosis aims to gain a systematic and comprehensive understanding of when and why models fail. This is a critical quality assurance process to prevent unexpected and catastrophic failures of models in high-stakes settings. A growing body of work has proposed methods for addressing this need. For example, error slice discovery methods aim to find subsets of inputs with similar characteristics where the model performs significantly worse (d'Eon et al., 2022; Eyuboglu et al., 2022). Interpretability methods aim to understand the black-box process of model prediction and thus the reasons why models fail for certain inputs (Ribeiro et al., 2016; Lundberg & Lee, 2017; Koh et al., 2020). In addition, model diagnosis is relevant to model auditing, an important topic that also deals with identifying model failures and sensitive attributes (Raji et al., 2020), and has a broad societal impact in terms of AI accountability and integration (Buolamwini & Gebru, 2018; Mitchell et al., 2019; Gebru et al., 2021). While these prior efforts have made progress in vision model diagnosis, they all share a critical Achilles' heel: they break down when visual data is lacking.
Curated training and test sets from the same data distribution are typically used to develop vision models. Even if models achieve perfect performance on these datasets, their performance can degrade drastically when deployed in the wild, due to distribution shifts (Koh et al., 2021; Wiles et al., 2022). Yet most existing model diagnosis methods require visual examples of failure modes (e.g., present in the test set) to discover them. As a result, using these methods relies on efforts to collect large-enough datasets to cover all data distributions and potential failure modes of interest, which is often impractical or infeasible. The goal of our work is to circumvent this need to collect test data representing all data distributions of interest, and instead use natural language inputs to diagnose vision classifiers. It is often easier to generate a set of diverse natural language inputs by combining known attributes and prompt generators than to collect a set of image inputs representing the same desired concepts.

Figure 1: Overview of our approach, DrML, which diagnoses and rectifies vision models using language. Our approach leverages the shared image and text representation space learned by multi-modal contrastive learning. We find that classifiers trained on embeddings from one modality can be equivalently applied to embeddings from another modality, despite the fact that embeddings from these two modalities are distinctly separated. This cross-modal transferability phenomenon enables us to diagnose a vision model by training it on the image embedding space and probing it with text embeddings. The use of language allows us to generate a large set of diverse and novel inputs to discover error slices, identify influential attributes, and rectify model misbehaviors. (Example text input shown in the figure: "Gadwall crow in bamboo.")
We observe that vision classifiers trained on image embeddings from a shared image-text embedding space open up the possibility of leveraging text embeddings as proxies for image embeddings. Multi-modal contrastive losses are frequently used to learn such shared embedding spaces. However, while these losses encourage image and text embeddings to be closer for aligned pairs than for mismatched pairs, there is no guarantee that, in practice, using text embeddings as input to a vision classifier trained on image embeddings will produce the same predictions. In this work, we first verify that text inputs can indeed serve as good proxies for image inputs to classifiers trained on a shared image-text embedding space obtained through contrastive learning. We refer to this as cross-modal transferability. Based on the phenomenon of cross-modal transferability, we then present DrML for Diagnosing and Rectifying Vision Models using Language. We show that DrML can use language to diagnose vision models in two different ways: discovering error slices, including concepts for which we have no visual data, and identifying attributes that have the greatest impact on model predictions. Finally, we present a method that uses language to rectify undesirable behaviors without requiring the collection of more visual data. Figure 1 illustrates our framework for diagnosing and rectifying vision models using language. On three image datasets representing the three most common types of model failure modes, we demonstrate that DrML can effectively identify error slices and influential attributes, and can further rectify these model failure modes using language. In summary, our contributions are: 1. We present a theoretical explanation of when cross-modal transferability happens (Section 2.1), and empirically verify that the assumptions required by the analysis hold true in practice across a range of multi-modal contrastive models and datasets (Section 3.2). 2.
We propose DrML, a framework for diagnosing vision models using natural language, including error slice discovery and influential attribute identification. We empirically validate DrML by simulating common types of failure modes using the Waterbirds (Sagawa et al., 2020), FairFace (Karkkainen & Joo, 2021), and dSpritesV (Matthey et al., 2017) datasets, and show the effectiveness of our method in identifying known error slices and influential attributes. 3. We show that model misbehaviors on discovered error slices can be rectified by continuing to train on language inputs, without requiring any visual data (Section 2.3).

2. APPROACH

We first define the basic notation used in this paper. Given a pre-trained multi-modal contrastive model, along with an image X ∈ X or text Y ∈ Y as input, we can obtain their ℓ2-normalized embeddings x or y from the image encoder f x ∶ X ↦ R d or the text encoder f y ∶ Y ↦ R d , respectively, where d is the dimension of the shared multi-modal embedding space. We can build classifiers h ∶ R d ↦ C, such as a linear layer or a multi-layer perceptron, on the shared embedding space to predict the label c ∈ C given an image embedding or a text embedding. We focus on the case of vision classifiers trained using image embeddings.
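To make the notation concrete, the following minimal sketch trains a linear classifier h on ℓ2-normalized embeddings. The synthetic Gaussian class clusters below are a hypothetical stand-in for the outputs of f x (e.g., CLIP's image encoder), and the training loop is plain multinomial logistic regression, not the paper's exact training setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(v):
    # Project embeddings onto the unit sphere, matching the l2-normalized
    # outputs of the encoders f_x and f_y described above.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hypothetical stand-ins for encoder outputs f_x(X): synthetic Gaussian
# class clusters, NOT real CLIP embeddings. d = 32, 3 classes, 200 samples.
d, n, n_classes = 32, 200, 3
labels = rng.integers(0, n_classes, size=n)
class_means = rng.normal(size=(n_classes, d))
x = l2_normalize(class_means[labels] + 0.3 * rng.normal(size=(n, d)))

def train_linear_classifier(x, labels, n_classes, lr=1.0, steps=300):
    # Plain multinomial logistic regression h: R^d -> C by gradient descent.
    W = np.zeros((x.shape[1], n_classes))
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = x @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * x.T @ (p - onehot) / len(x)
    return W

W = train_linear_classifier(x, labels, n_classes)
train_acc = (np.argmax(x @ W, axis=1) == labels).mean()
print(f"train accuracy: {train_acc:.2f}")
```

In the real pipeline, x would come from the frozen image encoder and the same trained W would later be probed with text embeddings y.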

2.1. TEXT EMBEDDINGS AS PROXIES FOR IMAGE EMBEDDINGS

The core of our work hinges on the ability to use text as a proxy for image inputs, thereby enabling us to use language to diagnose vision models. Here we describe our approach to analyzing whether this is feasible in practice: are text inputs good proxies for images in a contrastive representation space? Cross-modal Transferability. To answer this question, we first define cross-modal transferability. Let P D be the joint data distribution over image-text pairs. For X, Y ∼ P D , we denote by x = f x (X) and y = f y (Y) the corresponding image and text embeddings, respectively. We say that a vision classifier h achieves cross-modal transferability when it outputs similar predictions on x and y. In other words, the difference across the prediction pair is small: E x,y [D(h(x), h(y))] ≈ 0, where D(⋅, ⋅) measures the difference between predictions, e.g., the 0-1 loss D(u, v) = 1 u≠v . Modality Gap. While intuition suggests that the embeddings of a matched image-caption pair should be close, recent work shows instead that the embeddings are approximately clustered per modality (Liang et al., 2022). They refer to the distance between these clusters as the modality gap. We define the individual-level modality gap g as the difference between the image and text embeddings of a single pair, and the class-level gap g c as the average difference between image and text embeddings for a given class c ∈ C. Formally, the modality gap definitions are written as: g = x − y and g c = x c − y c , where x c = E X∼P D (X|c) [f x (X)] and y c = E Y ∼P D (Y |c) [f y (Y)]. Modality Gap Geometry. We take a closer look at the modality gap geometry across a range of multi-modal contrastive models and datasets, presented in detail in Section 3.2, and empirically find that the following hold true: 1. The modality gap between corresponding image and text embeddings can be approximated by a constant vector, particularly at the class level.
We verify this by computing distributions over ∥g∥ (magnitude) and cos(g, E g [g]) (direction). 2. The modality gap is orthogonal to the span of the image and text embeddings, and the image and text embeddings have zero mean in the subspace orthogonal to the modality gap. We verify this by computing distributions over cos(x − E x [x], E g [g]) (orthogonality) and E x [x − (x T g′)g′] i (center), where g′ = E g [g]/∥E g [g]∥ and i ∈ [d]. The subscript i denotes indexing the i-th dimension of the vector. Cross-modal Transferability under Modality Gap. The above findings on the geometry of the modality gap indicate that the classifier input between training and cross-modal evaluation differs only by a constant g, i.e., h(x) ≈ h(y + g). Intuitively, since the modality gap g is a constant vector orthogonal to the span of the embeddings, the weight matrix of the learned classifier should also be orthogonal to g. Hence the prediction of the classifier is not affected by g. This intuition explains why we observe strong cross-modal transferability under the modality gap in practice, across different multi-modal contrastive models trained on different datasets. These results are presented in Section 3.2. In Proposition 2.1 below, we further prove theoretically that a linear classifier trained with a regularized quadratic loss is guaranteed to be orthogonal to the modality gap and hence achieves cross-modal transferability. The formal statement and proof are in Appendix A.2. Proposition 2.1 (Informal version of Proposition A.1). Suppose there exists a gap vector g ∈ R d such that every pair of image embedding x and caption embedding y satisfies g = x − y, the gap g is orthogonal to the span of image embeddings, and the image embeddings have zero mean in the subspace orthogonal to g. Then, any linear function minimizing a regularized quadratic loss on image embeddings achieves the same loss on text embeddings, enabling cross-modal transferability.
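The geometry checks and the transferability intuition above can be illustrated end-to-end on synthetic data. The paired embeddings below are constructed with a built-in constant offset, mimicking (but not reproducing) the modality gap of a real contrastive model; all names and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Paired synthetic embeddings with a built-in constant offset along axis 0,
# a toy stand-in for CLIP image/text pairs (all constants illustrative).
d, n = 32, 1000
shared = rng.normal(size=(n, d))
# Remove shared variation along axis 0 so the gap is orthogonal to the span.
shared[:, 0] = 0.0
e0 = np.zeros(d)
e0[0] = 1.0
x = l2_normalize(shared + 0.5 * e0)   # "image" embeddings
y = l2_normalize(shared - 0.5 * e0)   # "text" embeddings

# Geometry checks from the numbered list above.
g = x - y
mean_g = g.mean(axis=0)
g_unit = mean_g / np.linalg.norm(mean_g)
dirs = g @ g_unit / np.linalg.norm(g, axis=1)    # cos(g, E[g]), near 1
xc = x - x.mean(axis=0)
orth = xc @ g_unit / np.linalg.norm(xc, axis=1)  # cos(x - E[x], E[g]), near 0
print(f"direction: {dirs.mean():.2f}, orthogonality: {np.abs(orth).mean():.2f}")

# Cross-modal transferability: train on "image" embeddings only,
# then evaluate the same classifier on the paired "text" embeddings.
labels = (shared[:, 1] > 0).astype(float)        # label independent of gap axis
W = np.zeros(d)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(x @ W)))
    W -= x.T @ (p - labels) / n
consistency = ((x @ W > 0) == (y @ W > 0)).mean()
print(f"cross-modal prediction consistency: {consistency:.2f}")
print(f"cos(W, E[g]): {abs(W @ g_unit) / np.linalg.norm(W):.2f}")
```

Because the learned weight vector ends up nearly orthogonal to the gap direction, predictions barely change when the classifier is fed embeddings from the other side of the gap, matching the intuition behind Proposition 2.1.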
Cross-modal Transferability by Closing the Modality Gap. The observation that the modality gap approximates a constant vector suggests another route to cross-modal transferability: closing the gap so that embeddings from the two modalities are interchangeable. We propose a simple mean-centering technique. During training, instead of feeding x to the model h, we feed it x − E x [x]. During cross-modal evaluation, we feed y − E y [y] instead of y. With this strategy, the gap between the modality centers becomes zero, and we observe additional improvements in cross-modal transferability compared to training with the gap.
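The gap-closing strategy is a two-line transformation. A minimal sketch, assuming the embeddings are given as NumPy arrays and each modality is centered by its own mean:

```python
import numpy as np

def close_modality_gap(train_x, eval_y):
    # Center each modality by its own mean, so the (approximately constant)
    # gap between modality centers cancels, as described above.
    return train_x - train_x.mean(axis=0), eval_y - eval_y.mean(axis=0)

# Toy usage with offset Gaussian clouds standing in for the two modalities.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 8)) + 1.0   # "image" embeddings, shifted one way
y = rng.normal(size=(100, 8)) - 1.0   # "text" embeddings, shifted the other way
xc, yc = close_modality_gap(x, y)
residual_gap = np.linalg.norm(xc.mean(axis=0) - yc.mean(axis=0))
print(f"gap between modality centers after centering: {residual_gap:.1e}")
```

In practice the modality means E x [x] and E y [y] would be estimated once from held-out embeddings of each modality and then reused at training and evaluation time.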

2.2. DIAGNOSING VISION MODELS USING LANGUAGE

Having established that text embeddings can be good proxies for image embeddings (Sections 2.1 and 3.2), we now describe DrML, which uses natural language inputs for diagnosing vision classifiers. Discovering Error Slices through Language. Deep learning models often make systematic errors on subgroups of inputs with similar attributes, referred to as error slices and formally defined as: S = {S ⊆ X | e(S) ≫ e(X )}, where X is a test set of images and e(⋅) is the model's error rate on a set of inputs. However, collecting a test set large enough to cover different image distributions is a fundamental challenge; the collected test set often covers only a small fraction of the model's failure modes (i.e., error slices) in the wild. In contrast, language inputs are easy to generate, and DrML discovers error slices through them as follows: 1. We define an attribute set A related to the task. 2. We combine these attributes with different prompt generators to produce a diverse and novel set of text inputs Y. 3. We evaluate the classifier on the text embeddings of Y and rank the attribute-defined slices by their error rates. The generated text set Y is typically much more diverse than the available image test set X , allowing the discovery of more comprehensive and unseen error slices. Importantly, DrML has two distinctive benefits over the typical approach of using an image test set. First, DrML requires only minimal effort to define a meaningful set of attributes to generate the input set, circumventing the human cost of data collection. Second, the combination of defined attributes naturally yields human-interpretable data slices, whereas image-based slice discovery methods do not directly provide a text summary of each error slice.
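The slice-discovery loop can be sketched in a few lines. The attribute values, the prompt template, and the `slice_error` stub (which stands in for embedding each prompt with f y and running the classifier h) are all illustrative, not the paper's actual implementation:

```python
import itertools

# Attribute set A and a prompt generator; values are illustrative only.
species = ["waterbird", "landbird"]
places = ["ocean", "lake", "forest", "bamboo"]

def make_prompts(sp, place):
    return [f"A photo of a {sp} on the {place}."]

def slice_error(prompts, sp, place):
    # Stub standing in for: embed `prompts` with f_y, feed the text
    # embeddings to the classifier h, and measure the error rate e(.).
    # Here we hard-code a waterbird/background spurious correlation.
    water = place in {"ocean", "lake"}
    return 0.6 if (sp == "waterbird") != water else 0.05

slices = []
for sp, pl in itertools.product(species, places):
    slices.append(((sp, pl), slice_error(make_prompts(sp, pl), sp, pl)))
slices.sort(key=lambda item: -item[1])       # rank slices by error rate
for (sp, pl), err in slices[:3]:
    print(f"error slice: {sp} / {pl} (error {err:.2f})")
```

With the spurious correlation hard-coded into the stub, the mismatched species/background combinations surface at the top of the ranking, which is exactly the shape of output the real text-based slice discovery produces.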

Identifying Influential Attributes through Language.

Interpreting which attributes influence model predictions is crucial for understanding why models fail. Since language is directly interpretable by humans, we perform counterfactual analysis using language to understand which attributes or concepts most impact model predictions. With A defined as the attribute set, we aim to identify a subset of attributes that significantly influences model predictions for any given class c: A c = {a ∈ A | s c (a) ≫ 0}, where s c (⋅) is the influence of an attribute on class c. We measure the influence by the Shapley value, a widely-used interpretation tool in machine learning (Lundberg & Lee, 2017; Ghorbani & Zou, 2019), which computes the average prediction change between the presence and absence of the attribute: s c (a) = ∑ F ⊆ A∖{a} (|F|! (|A| − |F| − 1)! / |A|!) (p c (F ∪ {a}) − p c (F)), where p c (⋅) is the average predicted probability of class c on a set of inputs with certain attributes. With natural language, we can easily compose a large set of inputs with and without the attribute and feed them to the model to calculate the influence. For example, to compute the influence of the attribute "ocean" on the class "waterbird", we can generate various text inputs such as "A photo of species on the ocean" and "A photo of species", and compute the average difference in the model's predicted probabilities of "waterbird". Note that it is particularly challenging to identify influential attributes using image inputs, because doing so requires an extensive collection of images with attribute annotations. Connection. Discovering error slices and identifying influential attributes are important complementary applications with the same ultimate goal: diagnosing the model. Error slice discovery finds specific subgroups on which the model fails, while influential attributes provide abstract explanations of why the model fails. Meanwhile, identifying influential attributes helps discover error slices, because attributes delineate the space of potential error slices, and vice versa.
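For a small attribute set, the Shapley influence s c (a) can be computed exactly by enumerating subsets. The probability function `p_waterbird` below is a hypothetical stub for the model's average predicted probability over generated prompts; its numbers are illustrative only:

```python
import itertools
from math import factorial

def shapley_influence(attr, attributes, p_c):
    # Exact Shapley value s_c(attr): weighted average change in the model's
    # predicted class probability when `attr` is added to a subset F.
    others = [a for a in attributes if a != attr]
    n, total = len(attributes), 0.0
    for r in range(len(others) + 1):
        for F in itertools.combinations(others, r):
            weight = factorial(len(F)) * factorial(n - len(F) - 1) / factorial(n)
            total += weight * (p_c(set(F) | {attr}) - p_c(set(F)))
    return total

def p_waterbird(attrs):
    # Hypothetical stub for the average predicted probability of "waterbird"
    # over prompts containing `attrs`; the numbers are illustrative only.
    p = 0.5
    if "ocean" in attrs:
        p += 0.3
    if "forest" in attrs:
        p -= 0.2
    return p

attributes = ["ocean", "forest", "red feathers"]
for a in attributes:
    print(f"s_waterbird({a}) = {shapley_influence(a, attributes, p_waterbird):+.2f}")
```

Because the stub is additive, the Shapley values recover exactly its +0.3 ("ocean") and −0.2 ("forest") contributions; for real models the probabilities interact, and the exact enumeration is replaced by sampling when |A| is large.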

2.3. RECTIFYING VISION MODELS USING LANGUAGE

In addition to discovering errors during model diagnosis, how to rectify these errors is a practical but challenging problem, and one that is seldom addressed in existing work on model diagnosis. Our finding about cross-modal transferability enables us to rectify undesirable behaviors of vision classifiers through language. Here we propose a simple solution: we use language to generate additional data on which the model fails, and continue training the model on these synthesized data. Given the discovered error slices S = {S ⊆ X | e(S) ≫ e(X )}, we aim to rectify model performance on these error slices by minimizing |S|. For each S ∈ S defined by a list of attributes, we generate a large set of natural language inputs Y S related to this slice through attribute composition and prompt manipulation (Appendix B), and continue training the model on these text inputs Y S . We continue training with the same hyperparameters as if the model were being trained on the corresponding images, since we have shown that texts are effective proxies for images. This simple strategy significantly improves model performance on the corresponding image error slices with minimal impact on other data, and has the distinct advantage that no visual data is required for rectification.
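The rectification recipe can be sketched on a toy version of the Waterbirds setup: train a classifier on spuriously correlated features, then continue training on balanced synthesized data standing in for the text embeddings of slice prompts (cross-modal transferability is what licenses this substitution). Everything below is synthetic and illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

def sample(n, label, water):
    # Toy embeddings: axis 1 carries the true class ("waterbird" vs
    # "landbird"), axis 0 carries the background ("water" vs "land").
    base = rng.normal(size=(n, d)) * 0.3
    base[:, 1] += 1.0 if label == 1 else -1.0
    base[:, 0] += 1.0 if water else -1.0
    return base, np.full(n, label)

def train(W, x, labels, steps=200, lr=0.5):
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ W)))
        W = W - lr * x.T @ (p - labels) / len(x)
    return W

# Biased training set: waterbirds only on water, landbirds only on land.
x_tr = np.concatenate([sample(200, 1, True)[0], sample(200, 0, False)[0]])
y_tr = np.concatenate([np.ones(200), np.zeros(200)])
W = train(np.zeros(d), x_tr, y_tr)

x_slice, y_slice = sample(200, 1, False)    # error slice: waterbird on land
acc_before = ((x_slice @ W > 0) == (y_slice == 1)).mean()

# "Rectify": continue training on synthesized balanced data, standing in for
# the text embeddings generated from error-slice prompts.
x_fix = np.concatenate([sample(100, 1, False)[0], sample(100, 0, True)[0]])
y_fix = np.concatenate([np.ones(100), np.zeros(100)])
W = train(W, x_fix, y_fix)
acc_after = ((x_slice @ W > 0) == (y_slice == 1)).mean()
print(f"error-slice accuracy: {acc_before:.2f} -> {acc_after:.2f}")
```

Continuing training on the balanced synthesized data shifts weight off the background feature and onto the class feature, which is the mechanism by which the language-generated data corrects the data bias.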

3. EXPERIMENTS

In this section, we first demonstrate that text embeddings are good proxies for image embeddings in multi-modal contrastive representation space (Section 3.2). Based on that, we demonstrate how DrML successfully discovers error slices (Section 3.3), identifies influential attributes (Section 3.4), and further rectifies model misbehaviors on three datasets (Section 3.5).

3.1. EXPERIMENTAL DETAILS

Model Architecture. We use CLIP (Radford et al., 2021) as the shared multi-modal embedding space. For classifiers built on CLIP's embeddings, we use linear layers and multi-layer perceptrons. Datasets. For cross-modal transferability (Section 3.2), we use the MS-COCO dataset (Lin et al., 2014), which includes both captions and object annotations for each image. The task is a multi-label classification problem of predicting the presence of 80 objects based on images or captions. For model diagnosis and rectification, we simulate the three common types of model failures. For spurious correlation, we use the Waterbirds dataset (Sagawa et al., 2020), which asks a model to classify whether a given bird image shows a waterbird or a landbird. The training data contains a spurious correlation between bird species and backgrounds: 95% of waterbirds appear in the water, and 95% of landbirds appear on the land. For underrepresented data, we use FairFace (Karkkainen & Joo, 2021), which contains face images from 9 age groups and 7 race groups. The task is gender classification. To simulate the underrepresentation of minority groups, we sample races in proportion to the demographics of the state of Montana for our training set. For unseen data, we use dSpritesV (Matthey et al., 2017), which contains images of shapes with different colors, sizes, and positions. The task is to classify the shape in an image. To simulate errors caused by unseen data, we only use images with orange triangles or green squares during training. More details are given in Appendix B.

3.2. ARE TEXT EMBEDDINGS GOOD PROXIES FOR IMAGE EMBEDDINGS?

We have provided theoretical explanations in Section 2.1 that a classifier's decision boundary is transferable across modalities if the modality gap satisfies certain geometric conditions. Here we first verify these conditions and then show empirically that closing the modality gap can improve transferability.

Table 1: Geometry analysis of the modality gap for various multi-modal contrastive representation spaces. The modality gap approximates a constant vector, indicated by the magnitude and direction distributions. The modality gap is also orthogonal to the span of the embeddings from the two modalities, and the embedding centers for both modalities are zero vectors in the subspace orthogonal to the gap, indicated by the orthogonality and center distributions. Based on our theoretical analysis, these findings suggest that cross-modal transferability holds broadly in multi-modal contrastive learning. Values are mean ± standard deviation. Detailed distributions are in Figure 3.

Table 2: Cross-modal transferability in multi-modal contrastive representation learning. We train a classifier using CLIP's image embeddings and test the trained classifier using text embeddings on the MS-COCO multi-label classification dataset. Despite the modality gap, classification boundaries learned from one modality are transferable to the other modality. Closing the modality gap further improves cross-modal transferability without harming in-modal evaluation. Notations: mF1 - Micro F1, MF1 - Macro F1, Random - a randomly initialized linear classifier.

Modality Gap Geometry.

In Table 1, we first show that the modality gap can be well approximated by a constant vector. For instance, on MS-COCO, the class-level gaps between image and text embeddings extracted from CLIP (ViT-B/32) have almost the same magnitude (0.88 ± 0.04) and direction (cosine similarity 0.94 ± 0.04). We then show that the modality gap is orthogonal to the span of the image and text embeddings, and that the embeddings have zero mean in the subspace orthogonal to the modality gap. This is supported by the near-zero means with low standard deviations in the "orthogonality" and "center" columns. Our findings here show that the assumptions required by our theory of cross-modal transferability (Section 2.1) hold true in practice across various datasets and contrastive multi-modal models, suggesting that cross-modal transferability should be a pervasive phenomenon in multi-modal contrastive learning. Cross-modal Transferability. Table 2 shows the image-to-text transfer results on the MS-COCO validation set. As our theory predicts, we indeed find that cross-modal transferability holds despite the modality gap. For instance, a linear classifier trained on image embeddings that achieves a 67.90% macro F1 score maintains a 54.29% macro F1 score when using text embeddings as inputs, and the consistency between predictions using images and texts is 96.37%. Similarly, text-to-image transfer is also possible, as shown in Appendix Table 7. While there is slight degradation in performance under cross-modal evaluation, the difference is relatively small, and the cross-modal transfer performance is much higher than random classification. The same finding is observed when using multi-layer perceptrons that learn non-linear features. As shown in the bottom half of Table 2, closing the modality gap further improves cross-modal transferability.
The linear classifier achieves 9.12%, 7.39%, and 2.05% absolute improvements on micro F1, macro F1, and prediction consistency for image-to-text transfer, without harm to in-modality evaluation. The improvements using an MLP are smaller but consistent. Are Generated Language Prompts Good Predictors of Error Slices? Here we further investigate whether our generated language prompts are good predictors of the error rate of a given data slice. We do so by examining the correlation between performances on generated prompts and the corresponding image slices. A strong correlation indicates that we can perform error slice discovery using text as a proxy, which circumvents the challenges of collecting image data. We treat each attribute subset F ⊆ A as a slice. For each slice, we generate a set of text inputs Y F using prompt generators P and select all the images X F with attributes F. We compute the Spearman and Pearson correlations between model performances on Y F and X F . Table 3 shows a strong correlation between image and text slices. Furthermore, the correlation can be improved by: 1) using the average probability of the label on text predictions instead of accuracy, 2) generating better text inputs via prompt engineering, which composes attributes into a more fluent sentence, and 3) prompt ensembling, which uses different prompts to generate more diverse inputs (details in Appendix B). As a baseline for comparison, we use a state-of-the-art text-to-image generation model (Rombach et al., 2022), t ∶ Y ↦ X, to generate a set of images X ′ F (1 or 20 per prompt in our experiments) from the text prompts Y F and compute correlations between X ′ F and X F . Our method outperforms this baseline by a large margin and avoids the significant computational time and cost typically required for image generation. Samples of the generated images are shown in Appendix C. Even though significant progress has been made in text-to-image generation, generating high-fidelity images that preserve the original semantics is still challenging. In summary, combining the empirical findings presented in this section with the theoretical results in Section 2.1, we show that text inputs can act as good proxies for image inputs, enabling us to diagnose vision classifiers using generated language prompts.
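The text-vs-image slice correlation check only needs per-slice performance lists and the two correlation coefficients. A self-contained sketch with hypothetical per-slice accuracies (scipy.stats offers equivalent functions; here they are written out for clarity):

```python
import numpy as np

def pearson(a, b):
    a = np.asarray(a, float)
    b = np.asarray(b, float)
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def spearman(a, b):
    # Spearman = Pearson on ranks (ties absent in this toy example).
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(a), rank(b))

# Hypothetical per-slice accuracies on generated text prompts vs. real images.
text_acc = [0.95, 0.40, 0.88, 0.35, 0.70, 0.91]
image_acc = [0.92, 0.45, 0.85, 0.30, 0.65, 0.89]
print(f"Pearson: {pearson(text_acc, image_acc):.2f}")
print(f"Spearman: {spearman(text_acc, image_acc):.2f}")
```

A high coefficient on such paired per-slice lists is what licenses substituting text slices for image slices in the discovery step.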

3.3. DISCOVERED ERROR SLICES

The strong correlation between the performances on text and image slices allows us to confidently run slice discovery using text inputs. In this study, we use a simple error slice discovery method: sorting slices by their performances. We further marginalize attributes by merging similar slices into larger slices. In Table 4, we summarize the most essential error slices discovered by our language-based approach on the three datasets, each representing one of the three typical model failure patterns under distribution shifts (Wiles et al., 2022). For Waterbirds, the top identified error slices are waterbirds on land and landbirds in water, which correctly correspond to the errors caused by the spurious correlations present in the dataset. For FairFace, the African American population is among the top identified error slices, which reflects its underrepresentation in our training set. For dSpritesV, our method correctly identifies green triangles and orange squares as critical error slices. Additionally, pink triangle slices are also correctly identified, since they were never seen in the training data. By using images to verify our discovered slices, we find that our method not only correctly identifies the most critical error slices but also accurately predicts the slice performances on images. In Appendix C, we report results from the state-of-the-art slice discovery baseline DOMINO (Eyuboglu et al., 2022). When evaluated using datasets with the same distribution as the training set, DOMINO can only discover slices present in the dataset, and cannot discover errors caused by distribution shifts.

3.4. IDENTIFIED INFLUENTIAL ATTRIBUTES

In Table 5, we report the most influential attributes for a specific class on the same three datasets. These attributes provide a high-quality interpretation of how models predict and why they fail. For example, one of the most influential attributes for waterbird classification is "ocean", with an influence value of 0.3062, indicating that the model's predicted probability of waterbird increases by 0.3 on average when "ocean" is present in a bird image. Since the attribute "place" should not affect predictions, this reveals an obvious error of the model. Similar findings apply to the attribute "color" for dSpritesV. More interestingly, the color "pink" is never seen during training but biases the model toward predicting "square" with 0.1 higher probability. On FairFace, no attribute is found to significantly influence model predictions; thus, no obvious spurious correlations were learned.

Table 6: Rectified model performances on discovered error slices. We continue training models on language inputs corresponding to the error slices (bolded) and observe significant performance improvements on these slices. GDRO is not directly comparable because it requires attribute annotations.

3.5. RECTIFIED MODEL MISBEHAVIORS

In Table 6, we report the performances of original models and rectified models. On both the Waterbirds and FairFace datasets, our simple method of continuing to train the model on text inputs significantly improves model performance on error slices, with minimal impact on other slices. We also perform an ablation that trains the model on all the language inputs from scratch (L-only), and find that continuing to train the pre-trained image model achieves better results, though even training with language alone works reasonably well. Our approach rectifies model misbehaviors caused by spurious correlations and underrepresented data by correcting the data bias. Another line of methods that tackles these errors is robust training, such as GDRO (Sagawa et al., 2020) and JTT (Liu et al., 2021), which explicitly optimize per-slice performance during training. Our method outperforms JTT. While GDRO performs similarly to ours, it requires attribute annotations on images, which is highly time-consuming and cost-prohibitive for most real-world applications. Moreover, GDRO and JTT cannot fix errors on unseen data, while ours can, because our rectifying process requires no visual data.

4. RELATED WORK & DISCUSSION

Multi-modal Contrastive Learning. Many recent works in vision-language contrastive learning, such as CLIP (Radford et al., 2021), ALIGN (Jia et al., 2021), and Florence (Yuan et al., 2021), have leveraged large image-caption datasets to obtain embedding spaces that capture rich visual and textual features. As a result, the learned image and text encoders have been demonstrated to yield strong uni-modal classifiers. In this work, we show how vision models obtained through multi-modal contrastive learning offer another significant advantage: model diagnosis and rectification. Multi-modal Contrastive Representation Space Geometry. Although multi-modal contrastive learning minimizes the distance between embeddings of matched pairs, prior work has shown that embeddings from the two modalities are distinctly separated in the embedding space, which is referred to as the modality gap (Liang et al., 2022). In this work, we further analyze the modality gap geometry and connect it to the cross-modal transferability phenomenon. Our finding is related to several recent works built on multi-modal contrastive representation spaces, such as DALL-E 2 (Ramesh et al., 2022), ClipCap (Mokady et al., 2021), and others (Cohen et al., 2022; Gal et al., 2022). These works found that trained models can directly take cross-modal embeddings as input, albeit with worse performance than with same-modal embeddings. We not only explain this phenomenon but also provide a straightforward technique to improve transferability, which can be applied to future work built upon multi-modal embeddings. Slice Discovery. Many recent works aim to understand systematic model errors by finding subsets of inputs with similar characteristics where the model performs significantly worse. This is referred to as slice discovery (Chung et al., 2019; Singla et al., 2021; d'Eon et al., 2022; Eyuboglu et al., 2022; Jain et al., 2022a). However, these algorithms fail to address the most fundamental challenge for slice discovery: the lack of data.
These works are only able to find errors that exist in the dataset. Our work circumvents the data challenge by performing slice discovery in the text space. Interpretation. Many model interpretation methods have been proposed, including attribution-based (Ribeiro et al., 2016; Lundberg & Lee, 2017; Shrikumar et al., 2017) and concept-based (Ghorbani et al., 2019b; Koh et al., 2020) approaches. While these methods help in understanding the model prediction process, their outputs are complicated for humans to understand and inconsistent across models and algorithms (Ghorbani et al., 2019a; Jain et al., 2022b; Joshi et al., 2021). Others require modifications to model architectures or complex post-processing (Ghorbani et al., 2019b; Koh et al., 2020). In contrast, language is inherently understandable by humans and simple to construct. In this work, we interpret the model prediction process by identifying the most influential attributes using language, which provides meaningful interpretations without pre-processing or post-processing. Algorithmic Fairness. Ensuring algorithmic fairness is key to avoiding potential harm to our society (Hovy & Spruit, 2016; Zou & Schiebinger, 2018). Methods for improving the fairness of machine learning algorithms are an active area of ongoing work (Bolukbasi et al., 2016; Sagawa et al., 2020; Sohoni et al., 2020; Ramaswamy et al., 2021; Liu et al., 2021). Among these, a notable solution is to correct data bias, as model bias stems from data bias. In this work, we show that language can be used to correct data bias by generating additional data, hence improving model fairness. Limitations. While our work introduces a novel and effective approach for diagnosing and rectifying visual classifiers, there remain important areas for future work.
First, since we assume vision classifiers are built using an image-text embedding space trained through multi-modal contrastive learning, our method can also inherit limitations from the contrastive model and the pre-training dataset. For example, although we aim to leverage large and general-purpose image-caption datasets in pre-training, the encoders may still not appropriately embed out-of-distribution examples far from what the contrastive model was trained on. Misaligned or inaccurate pre-training data can also affect encoder quality. Additionally, it is challenging to diagnose low-level visual attributes that are difficult to describe in words, such as texture or object orientation (Leclerc et al., 2021). We consider these fruitful directions for future work. Our method will also benefit as multi-modal contrastive pre-training methods continue to improve.

5. CONCLUSION

Our work reveals a valuable advantage of using vision classifiers built on top of multi-modal embedding spaces learned through contrastive learning -the ability to diagnose and rectify the vision classifiers using natural language inputs. We first use a combination of theoretical analysis and experimental findings to verify that cross-modal transferability exists; namely, that text inputs can act as good proxies for image inputs. This then allows us to propose and validate a framework for diagnosing and rectifying vision classifiers using natural language inputs. Our work suggests promising new directions both for achieving reliable and trustworthy computer vision models, and for the use of cross-modal transferability in other problem domains.

ETHICS STATEMENT

One of the main contributions of our work is an approach for diagnosing and rectifying vision classifiers trained using embeddings from a multi-modal contrastive model. We showcase experimental results on identifying error slices and influential attributes. For example, our method can detect failures caused by the lack of representation of certain races in the training set. In our FairFace experiments, the prediction of gender (i.e., the label "female") given an image was affected by race (e.g., the race "black"). We further show that we can rectify this behavior using our approach. Hence, we see our work as a contribution to the broader community concerned with model accountability and model auditing, and to improving the responsible integration of AI into society. However, it is also important to be aware of potential negative impacts brought about by our findings. One can imagine an adversary who extends our approach and uses it to their advantage, perhaps reinforcing racial or gender biases by fine-tuning a vision model using biased language prompts. Our work also inherits limitations from the contrastive model and pre-training datasets used to obtain the image and text encoders, as described in the Discussion section of our paper. We hope that this statement raises awareness both of the importance of better model diagnosis and rectification methods and of future directions of work to address limitations and potential negative impacts.

REPRODUCIBILITY STATEMENT

We provide an open-source implementation of our work at https://github.com/yuhui-zh15/drml. The implementation will enable researchers to reproduce all the experiments described here as well as run their own analyses on additional multi-modal models and datasets.

OVERVIEW OF APPENDIX

In this appendix, we supplement additional details of the theory, datasets, experiments, and baselines.

• In Appendix A, we provide more details about the modality gap geometry, a theoretical proof of cross-modal transferability given the modality gap, and additional cross-modal transferability results on MS-COCO and ImageNet.

• In Appendix B, we provide details of the four datasets (MS-COCO, Waterbirds, FairFace, and dSpritesV) used in our experiments, including data preprocessing, attributes, and prompts. We also provide the model and experimental details.

• In Appendix C, we provide two baseline methods. First, we present results using text-to-image generation for model diagnosis, which sometimes fails to generate high-fidelity images given text prompts. Second, we present the baseline method for slice discovery using DOMINO, which fails when error slices are absent in the dataset.

A CROSS-MODAL TRANSFERABILITY

A.1 MODALITY GAP GEOMETRY

Figure 2 shows the modality gap phenomenon in various multi-modal contrastive learning models, where inputs from the two modalities are embedded at arm's length in their shared representation space. This phenomenon is caused by the combined effect of model initialization and optimization. Deep neural networks exhibit the cone effect: an encoder maps inputs to only a small cone of the entire representation space. Therefore, two cones are created for a multi-modal model with two encoders, and as a consequence the modality gap arises at initialization. During optimization, the contrastive loss preserves the gap due to mismatched data (Liang et al., 2022).

Figure 3 shows four statistics that reveal important properties of the modality gap geometry.

• The modality gap approximates a constant vector, particularly at the class level. We verify this by computing distributions over ∥g∥ (magnitude) and cos(g, E_g[g]) (direction), where g is the gap between embeddings of paired data from the two modalities.

• The modality gap is orthogonal to the span of embeddings, and embeddings have zero mean in the subspace orthogonal to the modality gap. We verify this by computing distributions over cos(x − E_x[x], E_g[g]) (orthogonality) and E_x[x − (x^T g′)g′]_i (center), where g′ = E_g[g]/∥E_g[g]∥ and i ∈ [d] denotes the i-th dimension of the vector.

Based on our theoretical analysis in the next section, these findings suggest that cross-modal transferability is widely established in multi-modal contrastive learning.

Figure 3: Geometry analysis of the modality gap for various multi-modal contrastive representation spaces. The modality gap approximates a constant vector, as indicated by the magnitude and direction distributions. The modality gap is also orthogonal to the span of embeddings from the two modalities, and the embedding centers for both modalities are zero vectors in the subspace orthogonal to the gap, as indicated by the orthogonality and centering distributions.
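The four gap statistics above can be computed directly from paired embeddings. The following is a minimal numpy sketch (our own illustration with made-up variable names, not the released code):

```python
import numpy as np

def gap_statistics(img, txt):
    """Four modality-gap statistics for paired image/text embeddings (n, d)."""
    g = img - txt                                    # per-pair gap vectors
    g_dir = g.mean(axis=0) / np.linalg.norm(g.mean(axis=0))  # unit E[g]

    magnitude = np.linalg.norm(g, axis=1)            # ||g||: should concentrate
    direction = g @ g_dir / magnitude                # cos(g, E[g]): should be ~1

    def per_modality(x):
        xc = x - x.mean(axis=0)
        ortho = xc @ g_dir / np.linalg.norm(xc, axis=1)      # cos(x - E[x], E[g])
        center = (x - np.outer(x @ g_dir, g_dir)).mean(axis=0)  # E[x - (x^T g')g']
        return ortho, center

    o_img, c_img = per_modality(img)
    o_txt, c_txt = per_modality(txt)
    return magnitude, direction, np.concatenate([o_img, o_txt]), (c_img, c_txt)
```

If the gap geometry holds, the magnitudes concentrate around one value, the direction cosines are near 1, and the orthogonality values and centers are near 0.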

A.2 THEORETICAL PROOF FOR CROSS-MODAL TRANSFERABILITY

In this section, we expand on and formally discuss the analysis in Section 2.1. We theoretically explain the intriguing cross-modal transferability phenomenon: the modality gap in the multi-modal representation space does not prevent cross-modal transferability because of the unique geometry of the gap.

For class c ∈ [|C|], let e_c ∈ {0,1}^{|C|} be a one-hot vector whose c-th dimension is 1 and all other dimensions are 0. We define the balanced target label vector ẽ_c := e_c − E_{c′}[e_{c′}], where the expectation is over the distribution of classes on the image domain. We consider learning a linear function h_W(u) = Wu, where W ∈ R^{|C|×d} is the weight matrix and u ∈ R^d is the image or text embedding. Given h_W(u) and a label c, we consider the quadratic loss L_quad(h_W(u), c) = ∥h_W(u) − ẽ_c∥².

The following proposition shows that when the gap between image and caption embeddings is the same for all image-caption pairs and is orthogonal to the embedding span for each modality, a linear model trained to minimize the quadratic loss on one modality transfers to the other modality without loss of accuracy.

Proposition A.1. Suppose there exists a gap vector g ∈ R^d such that every pair of image embedding x and caption embedding y satisfies g = x − y. Suppose the gap g is orthogonal to the span of image features (i.e., g^T x = g^T x′ for any two image embeddings x and x′), and the image features have zero mean in the subspace orthogonal to g (i.e., E_x[Π_g(x)] = 0, where Π_g(x) projects the vector x onto the subspace orthogonal to g). Then, for any λ > 0 and any linear function h_W(u) that minimizes the regularized quadratic loss E_{x,c}[L_quad(h_W(x), c)] + λ∥W∥²_F, we have h_W(x) = h_W(y). Thus, cross-modal transferability happens.

Proof of Proposition A.1. Since g^T x = g^T x′ for all image features x and x′, we can find a τ ∈ R such that x = Π_g(x) + τg for every image embedding x. Notice that

E_{x,c}[L_quad(h_W(x), c)] = E_{x,c}[∥Wx − ẽ_c∥²]
= ∥E_x[Wx] − E_c[ẽ_c]∥² + E_{x,c}[∥(Wx − ẽ_c) − (E_x[Wx] − E_c[ẽ_c])∥²]
= ∥E_x[Wx] − E_c[ẽ_c]∥² + E_{x,c}[∥WΠ_g(x) − ẽ_c∥²]
= ∥W E_x[Π_g(x)] + τWg − E_c[ẽ_c]∥² + E_{x,c}[∥WΠ_g(x) − ẽ_c∥²].

Since E_x[Π_g(x)] = 0 and E_c[ẽ_c] = 0, the first term reduces to τ²∥Wg∥². Notice that the second term in the loss decomposition only involves the components of W that are orthogonal to g, so its minimization is independent of the minimization of the first term. As a result, any W that minimizes the regularized quadratic loss must satisfy Wg = 0. For any pair of image and text features x, y, since x − y = g and Wg = 0, we have h_W(x) = h_W(y), which finishes the proof.
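The proposition can be checked numerically. The sketch below (our illustration, not the paper's code) constructs synthetic embeddings satisfying the two geometric conditions, solves the regularized quadratic loss in closed form, and verifies both that Wg = 0 and that paired image and text inputs receive identical predictions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, C, lam, tau = 200, 6, 3, 0.1, 0.7

# Synthetic setup satisfying the proposition's conditions (illustrative only):
# gap g along the first axis, images x = Pi_g(x) + tau*g, captions y = x - g.
g = 1.5 * np.eye(d)[0]
Pi = rng.normal(size=(n, d))
Pi[:, 0] = 0.0
Pi -= Pi.mean(axis=0)            # E[Pi_g(x)] = 0
X = Pi + tau * g                 # image embeddings, rows = samples
Ytxt = X - g                     # paired caption embeddings

labels = rng.integers(0, C, size=n)
E = np.eye(C)[labels]
T = (E - E.mean(axis=0)).T       # balanced targets e_c - E[e_c'], shape (C, n)

# Closed-form minimizer of (1/n) sum ||W x - t||^2 + lam * ||W||_F^2
W = (T @ X) @ np.linalg.inv(X.T @ X + n * lam * np.eye(d))

assert np.abs(W @ g).max() < 1e-8                      # W annihilates the gap
assert np.abs(W @ X.T - W @ Ytxt.T).max() < 1e-8       # identical predictions
```

The closed-form ridge solution is used in place of iterative training purely for the numerical check; the conclusion matches the proof above.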

A.3 ADDITIONAL CROSS-MODAL TRANSFERABILITY RESULTS

MS-COCO.

In the main paper, we only report image-to-text transfer, where we train a classifier on image embeddings and test on text embeddings. Here we report the full results, including text-to-image transfer, in Table 7.

ImageNet. In the main paper, we report cross-modal transferability on the MS-COCO dataset. Here we report cross-modal transferability results using the ImageNet dataset (Deng et al., 2009). We split the ImageNet validation set into 40K / 10K images for training / evaluation. We apply OpenAI

A.4 THEORETICAL INTUITION FOR MODALITY GAP AND CROSS-MODAL TRANSFERABILITY

In this section, we provide theoretical insights about the intriguing modality gap and cross-modal transferability phenomena. We show that after optimizing the multi-modal contrastive loss, there is a modality gap between image embeddings and text embeddings, and there is a linear classifier trained on image embeddings that is guaranteed to generalize to text embeddings regardless of the modality gap.

Basic Notations. Given an image x and a text z, we use an image encoder f(⋅) and a text encoder g(⋅) to map them to the shared D-dimensional representation space. We denote f(x) ∈ R^D as the image embedding and g(z) ∈ R^D as the text embedding. Given N images and M texts, we denote the concatenated image embedding matrix as F ∈ R^{N×D} and the concatenated text embedding matrix as G ∈ R^{M×D}:

F = [f(x_1), ..., f(x_N)]^T ∈ R^{N×D},  G = [g(z_1), ..., g(z_M)]^T ∈ R^{M×D}.  (1)

Image-Text Connection Graph. Given an image x and a text z, we denote the probability of (x, z) being an image-text pair as p(x, z), so that ∑_{x,z} p(x, z) = 1. Given N images and M texts, we denote the probability matrix as P ∈ R^{N×M}. Note that P is a sparse matrix with most elements zero, because most image-text pairs are mismatched and cannot be collected (e.g., a cat image with a dog caption). P can be viewed as the adjacency matrix of a bipartite graph G = ({x, z}, {p(x, z)}), where all the images and texts are the vertices and their connection probabilities are the edges:

P = [p(x_i, z_j)]_{i∈[N], j∈[M]} ∈ R^{N×M}.  (2)

We have the following theorem, which shows the connection between multi-modal contrastive learning and the partitioning of the connection graph defined above. This result can be viewed as a simple generalization of the results in HaoChen et al. (2021) to the multi-modal setting.

Theorem 1 (Equivalence of Multi-modal Contrastive Loss and Graph Partitioning). Under the mild assumption that every image and every text has equal presence probability, i.e., ∀x: p_x = ∑_z p(x, z) = 1/N and ∀z: p_z = ∑_x p(x, z) = 1/M, minimizing the multi-modal contrastive loss in Equation 3 is equivalent to minimizing ∥P − FG^T∥²_F:

L = −2 E_{x,z}[f(x)^T g(z)] + NM E_{x∼P_x, z∼P_z}[(f(x)^T g(z))²].  (3)

Proof of Theorem 1. The following derivation proves Theorem 1, where the term ∑_{x,z} p(x, z)² is dropped because it is a constant independent of the encoders:

min ∥P − FG^T∥²_F
= min ∑_{x,z} −2 p(x, z) f(x)^T g(z) + ∑_{x,z} (f(x)^T g(z))² + ∑_{x,z} p(x, z)²
= min ∑_{x,z} −2 p(x, z) f(x)^T g(z) + ∑_{x,z} (f(x)^T g(z))²
= min ∑_{x,z} −2 p(x, z) f(x)^T g(z) + NM ∑_{x,z} (1/N)(1/M)(f(x)^T g(z))²
= min −2 E_{x,z}[f(x)^T g(z)] + NM E_{x∼P_x, z∼P_z}[(f(x)^T g(z))²].  (4)

Connection to the CLIP Contrastive Loss. Given N images and N texts, with the assumption ∀i, j: p(x_i, z_j) = (1/N) 1[i = j], the CLIP contrastive loss shown in Equation 5 is very similar to the contrastive loss in Equation 3:

L_CLIP = (1/N) ∑_i [−log (exp(f(x_i)^T g(z_i)) / ∑_j exp(f(x_i)^T g(z_j))) − log (exp(f(x_i)^T g(z_i)) / ∑_j exp(f(x_j)^T g(z_i)))]
= (1/N) ∑_i [−2 f(x_i)^T g(z_i) + log ∑_j exp(f(x_i)^T g(z_j)) + log ∑_j exp(f(x_j)^T g(z_i))]
= −(2/N) ∑_i f(x_i)^T g(z_i) + (1/N)(∑_i log ∑_j exp(f(x_i)^T g(z_j)) + ∑_i log ∑_j exp(f(x_j)^T g(z_i)))
= −2 E_{x,z}[f(x)^T g(z)] + E_{x∼P_x, z∼P_z}[log ∑_{z′} exp(f(x)^T g(z′)) + log ∑_{x′} exp(f(x′)^T g(z))].  (5)

Let us now consider the modality gap phenomenon in the above-mentioned contrastive learned representation space.

Proposition 1 (Modality Gap).
After optimizing the multi-modal contrastive loss defined in Equation 3, image embeddings and text embeddings will be separated in the shared representation space, causing the modality gap phenomenon.

Proof of Proposition 1. Since optimizing the multi-modal contrastive loss defined in Equation 3 is equivalent to minimizing ∥P − FG^T∥²_F, achieving the minimum of ∥P − FG^T∥²_F does not imply that the image embeddings F and the text embeddings G are close in the embedding space. For instance, since for any scalar c > 0 we can scale F by a factor of c and G by a factor of 1/c without changing the contrastive loss, there exist many solutions that both achieve the minimal loss and also exhibit a modality gap.

Now we consider the cross-modal transferability phenomenon on downstream classification problems using the above-mentioned contrastive learned representations. We make the following assumption, which says that the label of a text can be predicted from the labels of the images it is paired with (here Y_x and Y_z denote the one-hot label matrices of images and texts, respectively).

Assumption 1. With the definitions of the connection probability matrix P, the image label matrix Y_x, and the text label matrix Y_z, we have P^T Y_x = (1/M) Y_z.

Intuition of Assumption 1. In most realistic settings, the connection probability satisfies p(x, z) > 0 if y_x = y_z and p(x, z) = 0 if y_x ≠ y_z. Therefore, ∀z: ∑_x p(x, z) e_{y_x} = (1/M) e_{y_z}. In matrix form, the j-th row of P^T Y_x is ∑_i p(x_i, z_j) e_{y_{x_i}}^T = (1/M) e_{y_{z_j}}^T, which gives P^T Y_x = (1/M) Y_z.  (6)

Proposition 2 (Cross-modal Transferability). After optimizing the multi-modal contrastive loss introduced in Equation 3 with infinite encoder dimension D = ∞, we can find a linear classifier that only uses image representations but is transferable to text representations. More specifically, the weight of the linear layer is M F^T Y_x ∈ R^{D×C}, which can be viewed as a class-mean classifier.

Therefore, this process can be improved with human involvement to iteratively design better attributes based on the model feedback, which can be useful for future works. While attribute selection is important and may require human involvement, our method is still very useful because we provide an easy way to test the model under many cases. Like software testing, there is generally no free lunch for model diagnosis, and it is impossible to design a general diagnosis framework for any task without any prior knowledge. We have already significantly reduced the diagnosis cost compared to previous works. Previous works all assume a large collection of labeled images is available for model testing, which is unrealistic given the extreme difficulty of collecting diverse image inputs and the cost of data annotation. Our method instead provides a way to test sensitive attributes for which no image data may even be available.
For any specific task, it is always much easier to come up with a meaningful set of attributes and then generate a large collection of novel text inputs by combining different attributes than to collect corresponding images, thanks to the easy-to-manipulate and compositional nature of the language modality. More importantly, the combination of defined attributes naturally defines human-interpretable data slices, whereas image-based slice discovery methods do not directly provide a text summary of the error slice. Finally, we hope to clarify that one of the main contributions of our work is to theoretically and empirically demonstrate a pervasive phenomenon in multi-modal contrastive learning - cross-modal transferability - which allows texts to be effective proxies for images. Our method's performance, in terms of the correlation strength between model performance on images and on corresponding texts, is independent of attribute selection; it is just that more errors can be discovered with more human involvement in this process. Moreover, it is possible to collect a large set of text inputs in a different way to diagnose vision models instead of using the attribute-based combination. For example, one may be able to prompt large language models such as GPT-3 in a few-shot fashion to generate a large set of descriptions of certain classes, and then feed these inputs into vision models for diagnosis. We leave this to future work. Overall, diagnosing vision models using the text modality is much more desirable than using the image modality, because language enables us to easily generate realistic and diverse inputs with better control and manipulation.
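As a concrete sketch of this attribute-based diagnosis loop (the attribute values are made-up, and random arrays stand in for the CLIP text embeddings and the trained classifier head):

```python
import numpy as np
from itertools import product

# Hypothetical attribute values for a dSpritesV-style task (illustration only).
colors = ["red", "green", "orange", "blue"]
shapes = ["square", "triangle"]
combos = list(product(colors, shapes))
prompts = [f"{c} {s}." for c, s in combos]
labels = np.array([shapes.index(s) for _, s in combos])

# Stand-ins for CLIP text embeddings and a trained linear head (random here;
# in practice these come from the text encoder and the classifier under test).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(len(prompts), 512))
W = rng.normal(size=(len(shapes), 512))

pred = (text_emb @ W.T).argmax(axis=1)

# Accuracy per color slice; low-accuracy slices are candidate error slices.
slice_acc = {}
for color in colors:
    idx = [i for i, (c, _) in enumerate(combos) if c == color]
    slice_acc[color] = float((pred[idx] == labels[idx]).mean())
worst_slice = min(slice_acc, key=slice_acc.get)
```

With real embeddings and a real head, `worst_slice` surfaces the attribute value on which predicted accuracy is lowest, without requiring any images.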

C BASELINES C.1 LANGUAGE-BASED VISION MODEL DIAGNOSIS BASELINE: TEXT-TO-IMAGE GENERATION

In Figure 5, we show the baseline method to diagnose vision models using language: text-to-image generation. The method generates real images to test models using a text-to-image generation model. From the results, we can understand why this baseline is worse than our method, which does not require explicitly generating images. While significant progress has been made in the text-to-image generation field, state-of-the-art text-to-image generation models (Rombach et al., 2022) still fail to generate high-fidelity images. In addition, text-to-image generation is computationally expensive, requiring thousands of times more computation than our approach.

C.2 ERROR SLICE DISCOVERY BASELINE: DOMINO

In Figures 6 and 7, we show the error slices discovered using the baseline slice discovery method, DOMINO (Eyuboglu et al., 2022). In real-world applications, it is unrealistic to assume that a large set of labeled images from different distributions is available. Therefore, the most critical challenge for slice discovery is data. In this work, we circumvent the data challenge by using language to synthesize extensive test examples.

C.3 MODEL RECTIFICATION BASELINES: GDRO AND JTT

Our approach rectifies model misbehaviors by correcting the data bias. Another series of methods to tackle these errors are robust training techniques, such as GDRO (Sagawa et al., 2020) and JTT (Liu et al., 2021), which explicitly optimize each slice's performance during training. Compared to them, one of the distinct advantages of our approach is that we do not require any visual data during the rectification process. Both GDRO and JTT require the image data to be present in the training set; therefore, they cannot fix errors on unseen data. GDRO further requires attribute annotations on images, which is highly time-consuming and cost-prohibitive for most real-world applications. Moreover, our method can also be combined with robust training techniques when image data and attribute annotations are available, and we leave this to future work. Here we provide implementation details of GDRO and JTT:

GDRO. We reproduce GDRO on our datasets by adapting the official GDRO loss implementation to our code base. We use the same hyperparameters as in the original paper, where the important hyperparameters include the ℓ2 penalty strength α = 0.2 and the group adjustment γ = 0.1. We train a linear classifier for 25 epochs using the Adam optimizer with a fixed learning rate of 0.001. During training, CLIP's image encoder is fixed. We pick the best model based on the lowest validation loss.

JTT. We reproduce JTT on our datasets by implementing the algorithm ourselves. We use the same hyperparameters as in the original paper. We also perform a hyperparameter search on the upsampling weight λ_up ∈ {5, 20, 50}, which is a very important hyperparameter according to the paper. The best λ_up is 20 for Waterbirds and 5 for FairFace. We train a linear classifier for 25 epochs using the Adam optimizer with a fixed learning rate of 0.001 for both round 1 and round 2. During training, CLIP's image encoder is fixed. We pick the best model based on the lowest validation loss.
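For reference, the two-round JTT procedure can be sketched as follows (this is our own minimal sketch, not the official code: plain full-batch gradient descent on frozen embeddings stands in for Adam, and the function names are ours):

```python
import numpy as np

def train_linear(X, y, n_classes, epochs=100, lr=0.1):
    """Minimal full-batch softmax-regression trainer on frozen embeddings."""
    W = np.zeros((n_classes, X.shape[1]))
    Y = np.eye(n_classes)[y]
    for _ in range(epochs):
        logits = X @ W.T
        logits -= logits.max(axis=1, keepdims=True)
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * (p - Y).T @ X / len(X)
    return W

def jtt(X, y, n_classes, lam_up=20):
    """Two-round JTT sketch: train, collect round-1 errors, upsample, retrain."""
    W1 = train_linear(X, y, n_classes)
    errors = (X @ W1.T).argmax(axis=1) != y
    # Duplicate each round-1 error example (lam_up - 1) extra times.
    idx = np.concatenate([np.arange(len(X)),
                          np.repeat(np.nonzero(errors)[0], lam_up - 1)])
    return train_linear(X[idx], y[idx], n_classes)
```

The upsampling weight `lam_up` corresponds to λ_up above.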
Figure 5 (caption, continued): Well-generated examples (columns 2-3) are realistic and correctly reflect the ethnicity and age group described in the text prompts. Poorly generated images include more than one person (column 4), do not include the person's face (column 5, bottom), or are noticeably unrealistic (column 5, top). For instance, the top image in column 5 shows a woman with arms protruding out of her chest, which is rare in the real world. For dSpritesV, well-generated examples (columns 2-3) correctly reflect the shape and color described in the text prompt, while poorly generated images either include incorrect shapes (columns 4-5, top) or are misinterpreted due to polysemy (columns 4-5, bottom). For instance, the phrase "red square" was misinterpreted by the model as the historic site in Moscow.

Figure 6 (caption, continued): For the out-of-distribution dSpritesV, DOMINO is able to discover slices with spurious correlation (orange squares and green triangles). However, the generated text descriptions do not include the attributes that contribute to the spurious correlation. Additionally, DOMINO did not discover slices with unseen data (pink triangles).

Figure 7: Discovered error slices on the out-of-distribution Waterbirds, FairFace, and dSpritesV datasets using the state-of-the-art slice discovery method DOMINO (Eyuboglu et al., 2022). DOMINO was able to capture some, but not all, error slices. Furthermore, artificially generating out-of-distribution data for evaluation remains challenging in real-world settings.



We further demonstrate that DrML can rectify undesirable model behaviors and improve model performance with respect to the identified error slices and influential attributes, by fine-tuning the vision classifier using text embeddings constructed from the diagnosis process.



Given a specific attribute subset F ⊆ A, we use different prompt generators p ∈ P: 2^A ↦ Y to map attribute combinations to text inputs.

Figure 2: Modality gap for multi-modal contrastive learning. Embeddings from two modalities are visualized using UMAP and SVD. Figure credit: Liang et al. (2022).

Labels of Images and Texts. Given an image x with label y_x ∈ [C], where C is the number of classes, we denote its one-hot representation as e_{y_x} = [1[y_x = 1], ..., 1[y_x = C]]^T. Given N images, we denote the image label matrix as Y_x = [e_{y_{x_1}}, ..., e_{y_{x_N}}]^T ∈ {0,1}^{N×C}. We use the analogous notations y_z, e_{y_z}, Y_z for texts.

Figure 4: A subset of the 80 prompts from OpenAI that we use to augment text inputs for prompt ensembling.

Figure 5: Text-to-image generation results on Waterbirds, FairFace, and dSpritesV using the state-of-the-art generation model (Rombach et al., 2022).

Correlation analysis of model performance on image and text slices. Correlation can be improved by using label probability instead of label accuracy on text predictions, and by generating better text through prompt engineering and ensembling. Our approach outperforms the baseline text-to-image generation model by a large margin. The best or near-best results are bolded.

Discovered error slices using language. With the images used for validation, our method succeeds in discovering important error slices (bolded) and accurately predicts model performance on image slices. Notations: Image - model accuracy using real image inputs; Text - predicted accuracy using text inputs as a proxy; W - water; L - land.

Identified influential attributes using language. We show the top 2 most positively and negatively influential attributes, which provide insight into how models predict and why they fail.

Cross-modal transferability in multi-modal contrastive representation learning. We train a classifier using CLIP's image embeddings and test the trained classifier using text embeddings, and vice versa, on the MS-COCO multi-label classification dataset. Despite the modality gap, classification boundaries learned from one modality are transferable to the other modality. Closing the modality gap further improves cross-modal transferability without harming in-modal evaluation. Notations: I - Image, T - Text, M - Modality, mF1 - Micro F1, MF1 - Macro F1, Random - a randomly initialized linear model.

CLIP's 80 prompts to the 1000 ImageNet class names and get 80K texts, which we split into 64K / 16K for training / evaluation. All other experimental settings are the same as in the MS-COCO experiments. Results are shown in Table 8. Again, despite the modality gap, we find that the classification boundaries learned from one modality are transferable to the other modality. When a linear classifier is trained on image embeddings and achieves 70.86% image classification accuracy, directly feeding the text embeddings to the trained classifier achieves 85.24% accuracy. The transfer from text to image is much worse than from image to text, because the texts we used are generated from prompts and thus lack the diversity needed to train a classifier with good decision boundaries. Closing the modality gap improves transferability in most cases.

Cross-modal transferability in multi-modal contrastive representation learning using the ImageNet dataset. We split the 50K images of the ImageNet validation set into 40K / 10K for training and evaluation. Texts are generated by applying OpenAI's 80 prompts to the 1000 class names.


Proof of Proposition 2. After optimizing the contrastive loss, we have P = FG^T (Theorem 1). With Assumption 1, applying the classifier M F^T Y_x to the text embeddings gives G(M F^T Y_x) = M(FG^T)^T Y_x = M P^T Y_x = Y_z, so the classifier trained only on image representations recovers the text labels. M F^T Y_x is a class-mean classifier because its c-th column is M ∑_{i: y_{x_i} = c} f(x_i), i.e., proportional to the mean embedding of class c. Intuitively, the class-mean linear head is very similar to the trained linear head; for instance, learning the linear head with one step of gradient descent starting from zero initialization would recover the class-mean linear head. We study the class-mean linear head here because it is more amenable to theoretical analysis. This provides intuition for why cross-modal transferability can be achieved regardless of the modality gap.
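The mechanism can be illustrated numerically by taking Theorem 1's optimum P = FG^T and Assumption 1 as given (our illustration; the P below is not a valid probability matrix, the point is only the algebra):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, D, C = 12, 20, 16, 3

# Take Theorem 1's optimum as the definition of P, and Assumption 1 as the
# definition of the text labels Y_z (illustration only, not real data).
F = rng.normal(size=(N, D))
G = rng.normal(size=(M, D))
P = F @ G.T
Yx = np.eye(C)[rng.integers(0, C, size=N)]
Yz = M * P.T @ Yx                      # Assumption 1: P^T Y_x = (1/M) Y_z

W = M * F.T @ Yx                       # class-mean linear head, shape (D, C)
assert np.allclose(G @ W, Yz)          # text predictions recover text labels
```

The final assertion is exactly the chain G(M F^T Y_x) = M P^T Y_x = Y_z from the proof.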

B DATASETS AND EXPERIMENTAL DETAILS

In this section, we report details of four datasets: MS-COCO (Lin et al., 2014) , Waterbirds (Sagawa et al., 2020) , FairFace (Karkkainen & Joo, 2021) , and dSpritesV (Matthey et al., 2017) , and details of two major experiments.

B.1 DATA PRE-PROCESSING

MS-COCO. We follow the standard MS-COCO dataset split, which includes 118K / 5K images for training / validation. Each image is annotated with multiple objects from 80 categories and five human-written captions. We randomly select one caption from the five. Therefore, we have 118K / 5K image-caption pairs with multiple labels for training / validation.

Waterbirds. We follow the standard Waterbirds dataset split, which includes 4.8K / 1.2K images for training / validation. Data samples can be viewed in Figures 6 and 7.

FairFace. We resample the training set using the demographics of the state of Montana, which includes 92.8% White, 6.4% Indian, 0.5% Asian, and 0.3% Black. The final dataset contains 17K / 11K images for training / validation. Data samples can be viewed in Figures 6 and 7.

dSpritesV. We use our own scripts to reproduce a variant of the dSprites dataset, which we name dSpritesV. We use six colors (red, pink, orange, green, cyan, blue), four locations (upper left, upper right, lower left, lower right), and three sizes (small, medium, large) to create triangles and squares with a scale ranging from 0.8 to 1.2. Each attribute is uniformly sampled, and we synthesize 10K images. We only use 80% orange triangles and 80% green squares for training. The final dataset has 1.3K / 8.7K images for training / validation. Data samples can be viewed in Figures 6 and 7.
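Our dSpritesV generation scripts are not reproduced here, but the rendering step can be sketched roughly as follows (the canvas size, RGB values, and center coordinates are illustrative assumptions on our part; triangles are handled analogously):

```python
import numpy as np

# Illustrative constants; these are assumptions, not the exact values used by
# the actual generation scripts.
COLORS = {"red": (255, 0, 0), "pink": (255, 192, 203), "orange": (255, 165, 0),
          "green": (0, 255, 0), "cyan": (0, 255, 255), "blue": (0, 0, 255)}
SIZES = {"small": 8, "medium": 12, "large": 16}
CENTERS = {"upper left": (16, 16), "upper right": (16, 48),
           "lower left": (48, 16), "lower right": (48, 48)}

def render_square(color, size, location, canvas=64):
    """Rasterize one colored square on a black canvas (triangles analogous)."""
    img = np.zeros((canvas, canvas, 3), dtype=np.uint8)
    r, c = CENTERS[location]
    h = SIZES[size] // 2
    img[r - h:r + h, c - h:c + h] = COLORS[color]
    return img
```

Sampling each attribute uniformly and calling such a renderer per sample yields the synthetic dataset described above.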

B.2 ATTRIBUTES

Here we list the known attributes from the data collection process of the three datasets. We do not cherry-pick attributes and simply use these attributes for our experiments.

Waterbirds. Two attributes are used: species (200 values) and places (4 values).

FairFace. Three attributes are used: races (7 values), ages (9 values), and genders (2 values).

dSpritesV. Three attributes are used: colors (6 values), sizes (3 values), and shapes (2 values).

B.3 PROMPT ENGINEERING

We use the prompt engineering techniques proposed in CLIP (Radford et al., 2021) for our experiments.

Waterbirds. We use "{species}, {place}." as the raw prompt, and "a photo of a {species} in the {place}." as the engineered prompt. Therefore, we can generate 200 × 4 = 800 text inputs.

FairFace. We use "{age adjective}, {race}, {gender}." as the raw prompt, and "a photo of a {race} {age adjective} {gender}." as the engineered prompt. Therefore, we can generate 7 × 9 × 2 = 126 text inputs. The age adjectives are infant (0-2), little (3-9), teenage (10-19), young (20-29), adult (30-39), middle-aged (40-49), senior (50-59), elderly (60-69), and very old (more than 70).

dSpritesV. We use "{size}, {color}, {shape}." as the raw prompt, and "{size} {color} {shape}." as the engineered prompt. Therefore, we can generate 3 × 6 × 2 = 36 text inputs.
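The raw and engineered prompts are generated by enumerating the attribute product. A sketch for FairFace with an abbreviated attribute list (the full experiments use all 7 races and 9 age adjectives):

```python
from itertools import product

# Abbreviated attribute values for illustration; the full sets have
# 7 races, 9 age adjectives, and 2 genders (7 x 9 x 2 = 126 prompts).
races = ["White", "Black", "Indian", "Asian"]
ages = ["infant", "young", "elderly"]
genders = ["man", "woman"]

raw = ["{age}, {race}, {gender}.".format(age=a, race=r, gender=g)
       for r, a, g in product(races, ages, genders)]
engineered = ["a photo of a {race} {age} {gender}.".format(race=r, age=a, gender=g)
              for r, a, g in product(races, ages, genders)]
```

Each engineered prompt is then fed to the text encoder to obtain one diagnostic text input per attribute combination.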

B.4 PROMPT ENSEMBLE

We use OpenAI CLIP's 80 prompts (Radford et al., 2021) to augment text inputs by 80 times. A subset of them is shown in Figure 4.

B.5 EXPERIMENTAL DETAILS

Model Details. Unless explicitly stated otherwise, we use CLIP (ViT-B/32) for all experiments, which encodes images and texts in the same 512-dimensional space. The linear layer maps the 512-dimensional input to the number of classes; the multi-layer perceptron uses a hidden size of 512.

Cross-modal Transferability Training Details.

For each image-caption pair, we use CLIP's image and text encoders (Radford et al., 2021) to get its image embedding and text embedding. We do not use image augmentation techniques during training or inference. We train the linear model or multi-layer perceptron for 25 epochs using the Adam optimizer with a fixed learning rate of 0.001. During training, CLIP's image and text encoders are fixed. We pick the best model based on the lowest validation loss on the training modality.

Classifier Training Details. For each image in the dataset, we use CLIP's image encoder (Radford et al., 2021) to get its image embedding. We do not use image augmentation techniques during training or inference. We train the linear model or multi-layer perceptron for 25 epochs using the Adam optimizer with a fixed learning rate of 0.001. During training, CLIP's image encoder is fixed. We pick the best model based on the lowest validation loss on images.
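The classifier training loop can be sketched as follows (a numpy stand-in of our own: plain gradient descent instead of Adam, with the best-checkpoint-by-validation-loss selection described above):

```python
import numpy as np

def softmax_xent(W, X, y):
    """Cross-entropy loss and softmax probabilities for a linear head."""
    logits = X @ W.T
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p = p / p.sum(axis=1, keepdims=True)
    return -np.mean(np.log(p[np.arange(len(y)), y])), p

def train_probe(Xtr, ytr, Xva, yva, n_classes, epochs=25, lr=0.001):
    """Linear probe on frozen CLIP embeddings; keep the checkpoint with the
    lowest validation loss (plain gradient descent stands in for Adam)."""
    W = np.zeros((n_classes, Xtr.shape[1]))
    Y = np.eye(n_classes)[ytr]
    best_loss, best_W = np.inf, W.copy()
    for _ in range(epochs):
        _, p = softmax_xent(W, Xtr, ytr)
        W = W - lr * (p - Y).T @ Xtr / len(Xtr)
        val_loss, _ = softmax_xent(W, Xva, yva)
        if val_loss < best_loss:
            best_loss, best_W = val_loss, W.copy()
    return best_W
```

The encoders themselves stay frozen; only `W` is trained.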

Model Rectification Training Details.

For each text from the error slices generated by attribute composition and prompt engineering, we use CLIP's text encoder (Radford et al., 2021) to get its text embedding. We continue training the pre-trained linear model or multi-layer perceptron for 10 epochs using the Adam optimizer with a fixed learning rate of 0.001. During training, CLIP's text encoder is fixed. We pick the best model based on the lowest validation loss on texts.
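The rectification step then continues training the same head on text embeddings only; a minimal sketch of our own (gradient descent stands in for Adam), which requires no image data:

```python
import numpy as np

def rectify_with_text(W, text_emb, text_labels, epochs=10, lr=0.001):
    """Continue training a pre-trained linear head on text embeddings from the
    identified error slices; the text encoder stays frozen."""
    W = W.copy()
    Y = np.eye(W.shape[0])[text_labels]
    for _ in range(epochs):
        logits = text_emb @ W.T
        logits = logits - logits.max(axis=1, keepdims=True)
        p = np.exp(logits)
        p = p / p.sum(axis=1, keepdims=True)
        W = W - lr * (p - Y).T @ text_emb / len(text_emb)
    return W
```

By cross-modal transferability, reducing the loss on these text proxies also rectifies the model's behavior on the corresponding image slices.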

B.6 DISCUSSION: IS USING ATTRIBUTES AN APPROPRIATE CHOICE FOR MODEL DIAGNOSIS?

In this work, we rely heavily on attributes for model diagnosis and rectification. A natural question is what the rationale is for using attributes in these processes, and how much attribute selection affects the validity of our method. We first clarify that in our experiments, we did not cherry-pick attributes and simply used the known attributes from the data curation process of the three datasets. More broadly, we believe attributes are useful for model diagnosis because it is not hard to define a meaningful set of attributes given a specific task; there are many known attributes for any dataset, and sometimes there may be specific attributes of interest for which we wish to test model vulnerability. For example, for a self-driving car application, we can easily come up with attributes such as weather, traffic, pedestrians, buildings, etc. For any class in ImageNet classification, such as guitar, it is straightforward to think about its color, material, location, size, etc. We agree that an initial set of chosen attributes may not be perfect for reflecting all the essential errors, and better attribute selection can reveal more model vulnerabilities.

