CAPTION SUPERVISION ENABLES ROBUST LEARNERS

Abstract

Vision language (VL) models like CLIP are robust to natural distribution shifts, in part because CLIP learns on unstructured data using a technique called caption supervision; the model interprets image-linked texts as ground-truth labels. In a carefully controlled comparison study, we show that caption-supervised CNNs trained with a standard cross-entropy loss (with image labels assigned by scanning captions for class names) can exhibit greater distributional robustness than VL models trained on the same data. To facilitate future experiments with high-accuracy caption-supervised models, we introduce CaptionNet, which includes a class-balanced, fully supervised dataset with over 50,000 new human-labeled ImageNet-compliant samples, including web-scraped captions. In a series of experiments on CaptionNet, we show how the choice of loss function, data filtration and supervision strategy enables robust computer vision. We also provide the codebase necessary to reproduce our experiments at VL Hub.

Under review as a conference paper at ICLR 2023

Table 1: This table lists works which relate to our own, evaluating distributional robustness in computer vision. We catalog the contributions of each paper with respect to certain key factors. VL vs CE-loss indicates whether the paper conducted controlled comparisons on the effects of VL-loss (InfoNCE) and CE-loss. Captioning strategy indicates whether the study evaluated the effects of captioning strategy on model performance. Our paper is the first to compare CE-loss and VL-loss models trained and evaluated on multiple datasets at both low and high accuracies.

1. INTRODUCTION

Motivation. Real-world uses of deep learning require predictable model behavior under data distribution shift. Our paper deals with distributional robustness, the effect of distribution shift on image classifiers. For brevity's sake, we may simply refer to this as robustness. Since 2019, numerous works have studied effective robustness on ILSVRC-2012 ImageNet classification, quantifying the impact of various interventions (Taori et al., 2020; Miller et al., 2021; Fang et al., 2022; Radford et al., 2021). The majority of standard deep network models have been found to perform significantly worse under so-called natural distribution shifts, such as changes in lighting or stylized renderings of object classes (Hendrycks & Dietterich, 2019; Miller et al., 2021), leading many researchers to reconsider how close computer vision models had truly come to human-level accuracy, and underscoring the need for more robust models and diverse evaluation datasets. The now-popular ViT-L CLIP model by Radford et al. (2021) was the first vision language (VL) model to show natural distributional robustness comparable to humans across a wide range of ImageNet shifts, at the cost of some base-task accuracy. Subsequent work from Jia et al. (2021); Pham et al. (2021) showed that human-level distributional robustness is possible even as base accuracy approaches SOTA, as long as sufficient data is available for training. The gains are not limited to CLIP; other VL-loss functions also achieve strong distributional robustness (Yu et al., 2022). In our paper, we focus on CLIP, since it is publicly available and by far the most popular VL model. Since CLIP's design differs from typical models in several important ways (loss function, training dataset, the use of natural language captions as labels), it is of great interest to isolate the effect of these factors on model distributional robustness. Recent works have addressed this question, and have reached various interesting conclusions.
Fang et al. (2022) posit that the intrinsic diversity of training image data is the main cause of the distributional robustness gains of VL models in the zero-shot setting, with other factors such as language supervision contributing little to no distributional robustness. On the other hand, Santurkar et al. (2022) seemingly provide a counterpoint; given a sufficiently large pretraining dataset and descriptive, low-variability captions, contrastively trained VL models, AKA caption-supervised models, outperform models trained with the SimCLR loss in the transfer learning setting. Does caption supervision lead to models which perform better under distribution shift, or does it not? This question is difficult to answer conclusively, not only because of the aforementioned confounds, but also because of the often vast discrepancies in base accuracy (Taori et al., 2020).

Labeling strategies: For unsupervised datasets such as LAION and YFCC, ground-truth labels do not exist. Therefore, we must choose a labeling strategy by which the model associates labels with classes. VL-loss uses learned embeddings from natural language captions to associate text tokens and image features. But CE-loss models cannot directly learn all possible token combinations as unique classes; the set would be far too large. Instead, we employ a strategy we call subset matching, an extension of the "substring matching" used by Fang et al. (2022). This strategy, illustrated in detail in Fig. 1, labels samples as follows: if a sample caption contains a matching term, then the corresponding class label is applied. If the sample caption contains no matching terms for any class, then no label is applied (the sample is filtered out of the dataset; hence we match only a subset of the dataset). Why subset matching?
Because we aim to conduct a controlled comparison between loss functions, we wanted to choose a labeling strategy which was model-free, consistent, and whose error was approximately normally distributed; in Table 5, we experiment with a model generated by adding random Gaussian label noise to ground-truth labels, and find that its E.R.R. is very similar to that of the subset-matched models in the same table.

Captioning strategies: We explore multiple captioning techniques in this paper. Flickr captions refer to captions scraped from a combination of Flickr titles, tags and descriptions; they are generated using either the title of the image, the tags associated with the image, the prose description (descr) of the image, or some combination. ttd captions are the union of title, tags and description; this was our default strategy for Flickr-style captions. Alt-text captions are taken from alt-text image descriptions on websites. BLIP captions are generated using the BLIP model from Li et al. (2022). Finally, annotated captions refer to human-authored captions which describe an image.

Figure 1: Subset matching; an overview. Subset matching is a simple labeling strategy for unsupervised image-caption pairs. The caption is processed and converted to n-grams, which are then matched against a database of terms which point to integer-label classes.

Matching strategies: Subset matching requires a matching strategy, which amounts to a method for handling multiple labels on a single image. The basic idea of the three strategies, strict (strict), single-class (sc) and multi-class (mc), can be gleaned from Fig. 1; for additional details, please see Section E.

Matching terms: Subset matching depends on a database of matching terms associated with each integer label. Ours (ours), openai (oai) and default (def) are the three distinct sets of terms we used for the matching algorithm.
More details on these different matching term sets can be found in Section E. We also define a metric to compare filtration methods: the dataset utilization (Ds. Util) of a model on a dataset is the ratio of correctly labeled samples to correct + incorrect + unlabeled samples.
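As a concrete illustration, the subset matching step described above can be sketched as follows. The two-class term database here is hypothetical, standing in for the real term sets (ours/oai/def) released in VL Hub; the function names are ours, not the paper's.

```python
import re

# Hypothetical matching-term database: integer class label -> matching terms.
# The paper's real term sets ("ours", "oai", "def") live in the VL Hub repo.
MATCH_TERMS = {
    0: ["tench", "tinca tinca"],
    1: ["goldfish", "carassius auratus"],
}

def ngrams(tokens, n_max=3):
    """All 1..n_max-grams of a token list, joined by spaces."""
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    }

def subset_match(caption, match_terms=MATCH_TERMS, n_max=3):
    """Return the set of class labels whose terms appear in the caption.

    An empty set means the sample is unmatched and filtered out."""
    tokens = re.findall(r"[a-z0-9]+", caption.lower())
    grams = ngrams(tokens, n_max)
    return {label for label, terms in match_terms.items()
            if any(t in grams for t in terms)}

# A caption matching a class's term receives that integer label; a caption
# matching no terms is dropped from the training set entirely.
subset_match("my pet goldfish in its bowl")   # -> {1}
subset_match("sunset over the harbour")       # -> set()
```

How multi-class matches (more than one label in the returned set) are resolved is governed by the matching strategy (strict/sc/mc) described above.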

Measures of effective robustness.

There is no single ideal measure of effective robustness; but taken together, multiple measures give a complete picture. One common metric, and the one we use primarily, is average robustness (Avg. Rob.), the simple mean of accuracy over our distribution shifts. Although this metric is easy to interpret, it does tend to gloss over the often substantial differences between shifts; therefore, we also include shift-specific accuracy measures in Table D. In addition, we discuss some shift-specific findings in Section A. Another measure we use is effective robustness, introduced by Taori et al. (2020), primarily in order to compare our work to the existing literature. Effective robustness is described in Taori et al. (2020) as a measure of how robust a model is on a given shift, beyond what is expected from having higher accuracy on the original test set. It is generally represented as the deviation in performance with respect to a set of baseline models on ImageNet and its subsamples, or with respect to the performance of human evaluators on the shift, which is typically plotted as y = x for natural distribution shifts; typically this follows a linear trend. Finally, we leverage the Effective Robustness Ratio (E.R.R.) (Feuer et al., 2022), computed as average shift accuracy over base task accuracy. We find that this is an effective measure when we limit our comparisons to models with similar base accuracy, but do not recommend its use for comparing models whose base accuracy differs substantially.

VL-loss models are not always more robust than CE-loss models. One of the major themes in recent research into distributional robustness has been the seemingly unmatched performance of VL models (Radford et al., 2021); indeed, it is this performance which has driven much of the interest in these models. However, in Fig. 2, we present a novel finding; depending on the underlying choice of dataset, CE-loss models using subset matching can actually be more robust than VL models. Specifically: 1.
On LAION-15m and CC-12m, a simple subset matching labeling strategy produces CE-loss models which are sometimes more robust than VL models; 2. On YFCC, the same strategy produces CE-loss models which are always less robust than VL models. These results hold at both low and high accuracies and are unaffected by subsampling the dataset. Additional experiments in Table 2 show that even when we hold the loss function and dataset size constant, changing the label matching (filtration) strategy alone can affect distributional robustness by as much as 20 percent.

Table 2: Subset matching strategy can impact effective robustness. Here, we compare models with our labels to models using the default ImageNet class labels. We find that under a strict subset matching strategy, our labels perform better than default labels and a random subsample of YFCC.

Our findings highlight a fundamental challenge in evaluating results on distributional robustness; researchers cannot rely on observations made on logit-transformed linear fits (as in Recht et al. (2019)) to hold in both the low- and high-accuracy regimes unless model architecture, loss function, dataset, labeling strategy and evaluation metrics are fixed. Ratio-based measures such as the one described in Feuer et al. (2022) also cannot be used to effectively compare models whose base accuracy differs widely.

Table 3: An overview of the four main datasets in CaptionNet. Label sources lists the source(s) for integer labels in the dataset. Caption sources lists the sources for captions in the dataset. Supervised indicates whether ground-truth labels exist for the dataset; CE-loss models benefit most from supervised data. Filtered indicates whether the dataset contents were processed in some way prior to inclusion; VL-loss models struggle on unfiltered data. Balanced indicates whether the dataset is approximately class-balanced; CE-loss models struggle on unbalanced data.
The ideal comparison, then, would take place in a setting where both VL and CE-loss models were capable of achieving high base accuracy. Unfortunately, ImageNet does not provide such a setting; even VL models trained on the largest publicly available unsupervised dataset, LAION, have yet to match the performance of CLIP, and training even one such model is enormously demanding.
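The two aggregate metrics used throughout, average robustness and the Effective Robustness Ratio, reduce to simple arithmetic. A minimal sketch, with illustrative numbers only (not results from the paper):

```python
def avg_robustness(shift_accs):
    """Average robustness (Avg. Rob.): simple mean of accuracy
    across the distribution shifts."""
    return sum(shift_accs.values()) / len(shift_accs)

def err(shift_accs, base_acc):
    """Effective Robustness Ratio (E.R.R.), Feuer et al. (2022):
    average shift accuracy over base-task accuracy. Only meaningful
    when comparing models of similar base accuracy."""
    return avg_robustness(shift_accs) / base_acc

# Illustrative accuracies on the four ImageNet-100 shifts (made up):
shifts = {"in100-a": 0.21, "in100-r": 0.48, "in100-s": 0.35, "in100-v2": 0.61}
base = 0.72
round(avg_robustness(shifts), 4)  # -> 0.4125
round(err(shifts, base), 4)       # -> 0.5729
```

This makes the caveat above concrete: a low-accuracy model can post a high E.R.R. simply because the denominator is small, which is why the paper restricts E.R.R. comparisons to models with similar base accuracy.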

3. CAPTIONNET

To resolve this challenge, we introduce CaptionNet, a collection of four new training datasets designed to be evaluated on a subset of ImageNet. Each dataset in CaptionNet is either a subset or a superset of an existing dataset, and each is fully captioned and fully labeled, using either ground-truth or synthetic labels.

1. ImageNet-100 (in100): A superset of 100 ImageNet-Captions (Fang et al. (2022)) classes with over 50,000 new human-authored ground-truth labels, flickr-captions and blip-captions.
2. OpenImages-100 (oi100): A subset of the OpenImages (Kuznetsova et al. (2020)) dataset with restored original flickr-captions and new BLIP-captions; samples were selected by mapping human-labeled OpenImages-100 classnames to ImageNet-100 classnames.
3. LAION-100 (laion100): A subset of the unlabeled LAION (Schuhmann et al. (2021a)) dataset with samples selected via subset matching on ImageNet-100 classes.
4. YFCC-100 (yfcc100): A subset of the unlabeled YFCC dataset (Thomee et al. (2016)) with samples selected via subset matching on ImageNet-100 classes.

We compare some of the key properties of each component of CaptionNet in Table 3. More information on the process used to create CaptionNet is available in Section F. Additional details on the size and composition of each CaptionNet subset can be found in Section 4.2.

Training on CaptionNet. In order to minimize differences in model architecture, we train two families of models: a ResNet-50 for CE-loss models, and a VL-loss model with a ResNet-50 vision backbone. The only difference between the two architectures is that for CE models, we append a 1000-class linear head to the ResNet-50; we allow this since, as noted in (Radford et al., 2021; Santurkar et al., 2022), this does not seem to affect CLIP performance. In order to control for dataset size, we train models on various subsets of CaptionNet and measure base accuracy and distributional robustness.
CE-loss models are typically trained with full (32-bit floating point) precision, a batch size of 128, and gradient clipping. VL models are typically trained with mixed precision (32-bit for gradients and 16-bit for weights), a batch size of 256, and no gradient clipping. Models are typically distributed across a single node with 4 NVIDIA GPUs; our largest models were trained on 16 NVIDIA GPUs. We use the AMP library to implement the training process. We train our larger models for 32 or 64 epochs unless otherwise specified; all CaptionNet models are trained for 256 epochs. In the CaptionNet experiments, we also experimented with models trained on subset-matched datasets matching on all 1000 classes in ImageNet. We refer to these datasets as YFCC-2.2Mn (yfcc22m) and YFCC-3.9Mn (yfcc39m); models on these datasets were trained for 64 epochs owing to computational constraints.
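The mixed-precision setup described above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the VL Hub training loop: the tiny linear model stands in for the RN50 backbone, and the distributed, multi-GPU aspects are omitted.

```python
import torch

# Tiny stand-in for the RN50 backbone; purely illustrative.
model = torch.nn.Linear(512, 100)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
use_cuda = torch.cuda.is_available()
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

def train_step(images, labels, clip_grads=False):
    opt.zero_grad()
    # Autocast runs the forward pass in reduced precision where safe.
    with torch.autocast("cuda" if use_cuda else "cpu"):
        loss = torch.nn.functional.cross_entropy(model(images), labels)
    scaler.scale(loss).backward()
    if clip_grads:  # the CE-loss runs use gradient clipping; VL runs do not
        scaler.unscale_(opt)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(opt)
    scaler.update()
    return loss.item()

x = torch.randn(8, 512)
y = torch.randint(0, 100, (8,))
loss = train_step(x, y)  # scalar cross-entropy loss
```

On CPU-only machines the GradScaler is disabled and the step degrades gracefully to ordinary full-precision training.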

4. RESULTS ON CAPTIONNET

Our main experiments, shown in Table 4, evaluated each subset of CaptionNet separately, followed by combinations of those subsets. Since prior experiments controlled for base accuracy, model architecture, dataset composition and dataset size, we focused on comparing VL-loss and CE-loss models. Separately, we also performed ablations on the effects of subset matching as a labeling strategy; those results can be found in Section E.

ImageNet-100 performance. In order to ensure that the baseline performance of VL and CE models is essentially comparable on ImageNet-100 and the standard ImageNet despite the newly added images, we train a VL model (using "A photo of a $CLASSNAME" captions) and a CE model from scratch on ImageNet-100 and compare; we find that the evaluation accuracy of the smaller models is very similar to that of their larger counterparts; ImageNet-100 is a good proxy for ImageNet.

OpenImages-100 performance. The baseline performance of OpenImages-100 with CE-loss is considerably worse than ImageNet-100 with CE-loss, most likely because of the natural class imbalance in the dataset (arising from the fact that images were not filtered). The baseline performance of OpenImages-100 with VL-loss was highly dependent on captioning strategy; we discuss these results in Section 4.2.

LAION-100, YFCC-100 performance. CE-loss models failed to learn anything on the unsupervised datasets. VL-loss models learned nearly as much as they did from in100, and more than from oi100, indicating that dataset size, rather than dataset composition or label accuracy, is the most important factor for VL-loss.

ImageNet-100 + supervised data. ImageNet-100 + OpenImages-100 is about twice the size of ImageNet-100 alone, and every sample has a ground-truth label. We find that in VL, this combination adds a small amount of distributional robustness compared to ImageNet-100 alone; in CE, we gain base accuracy and distributional robustness.
From this, we conclude that CE-loss models benefit more than VL-loss models from adding supervised data. CE-loss models appear to be capable of learning very accurate representations with relatively little data; the base accuracy of the best-performing subset-matched model in Table 4, trained on fewer than 1 million samples, was higher than that of the best VL model, which trained on over 400 million samples.

ImageNet-100 + unsupervised data. When we scale up on noisily supervised data, we increase the dataset to between 4x and 10x its original size. In this paradigm, we find that VL-loss models improve fairly smoothly with data scale, whereas CE-loss models improve most on the cleaner laion-100 data.

ImageNet-100 + both. What happens when we combine all of our supervised data and the cleanest source of unsupervised data? VL performance actually degrades slightly; this result is quite surprising, indicating again that scale matters more than anything else in VL. CE-loss, on the other hand, improves in distributional robustness without losing base accuracy, and comes close to rivaling CLIP, which trained on 400M samples.

Training on out-of-distribution classes. To test the effects of scaling up on out-of-distribution classes, our final set of experiments on CaptionNet involved training on ImageNet-100 + YFCC-3.9Mn and ImageNet-100 + YFCC-2.2Mn. These datasets were filtered by subset matching on 1000 ImageNet classes; therefore, the vast majority of the labels in these datasets map to classes which are not evaluated by the ImageNet-100 validation set. We train on the original captions for VL models and on subset-matched labels for CE models, and find vast discrepancies in performance: the VL model improves in both distributional robustness and accuracy, and the CE-loss model gets worse. This fits with a larger finding: VL-loss models perform better when new tokens are seen, even if those tokens are not used during evaluation.
To verify this, in Table 6, we train a model on ImageNet-100 in which we convert all tokens which are not in the prompt or classnames to 0. In this situation, we find that both accuracy and distributional robustness decline substantially. We see the converse effect when we train VL-loss models on ImageNet-100 + YFCC-3.9Mn, which improve disproportionately under shift.

Caption supervision can be very noisy; how do VL and CE-loss compare in their handling of noise? In Table E.2, where we compare the accuracy of subset-matched labels to ground-truth labels on OpenImages-100, we find that even the best subset matching techniques apply correct labels to less than 10 percent of the available data, suggesting that the caption text and the ground-truth labels have very little overlap in this dataset. In this high-noise setting, the VL-loss model is able to reach 20 percent zero-shot accuracy with flickr-captions, whereas a subset-matched CE-loss model (not listed in the table) does not learn at all. When we add synthetic captions to OpenImages-100, the VL-loss model improves to nearly 30 percent.

Just how sensitive are CE-loss models to dataset noise? In Table 5, we see that performance degrades significantly when we train using subset matching techniques and 10 percent of the labels are noisy, even when we control for the difference in dataset size. We also observe that not all noise is created equal; two models with similar error rates can have very different performance, depending on what those errors are. We discuss this surprising result in greater detail in Section E.4.

Table 5: Not all errors are created equal; CLIP labels outperform subset-matched labels on ImageNet-100. This table compares six CE-loss models with different labeling strategies. Abbreviations are defined in Section 1.
We find that labels generated by a ViT-L CLIP model perform better on ImageNet-100 (in100) than subset-matched labels, even though the true accuracy (which we determine by comparing predicted to ground-truth labels) of each labeling method is very similar. Label accuracy and match count cannot fully explain differences in model performance on a dataset. (Table 5 columns: Label Source, True Acc., Ds. Util., Val. Acc. (in100), in100-a, in100-r, in100-s, in100-v2, Avg. Rob., E.R.R.)

Although the focus of our experiments was loss functions and labeling strategies, we also wanted to briefly address the difference of opinion in recent works on the importance of caption contents, which we describe in detail in Section 1.
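The random-label-noise probe discussed above can be sketched as follows. This helper is illustrative, not the paper's implementation; for simplicity it flips a fixed fraction of labels to uniform-random classes rather than applying the Gaussian perturbation the paper describes.

```python
import random

def add_label_noise(labels, noise_rate, num_classes, seed=0):
    """Replace a fraction of ground-truth labels with random labels.

    Mimics a labeling "strategy" whose errors are random, for comparison
    against the structured errors of subset matching (cf. Table 5)."""
    rng = random.Random(seed)
    noisy = list(labels)
    flip = rng.sample(range(len(noisy)), int(noise_rate * len(noisy)))
    for i in flip:
        noisy[i] = rng.randrange(num_classes)
    return noisy

clean = [i % 100 for i in range(1000)]
noisy = add_label_noise(clean, noise_rate=0.10, num_classes=100)
error_rate = sum(c != n for c, n in zip(clean, noisy)) / len(clean)
# error_rate is close to, but at most, 0.10: some flips land on the true label
```

The point of such a probe is that its error distribution is unstructured, whereas subset matching errors correlate with caption content; the paper finds the two can yield very different downstream performance at similar error rates.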


In our experiments on CaptionNet, which can be found in Table 11, we found that on ImageNet-100, there was no change in performance whatsoever when using BLIP+Title captioning. On OpenImages-100, model performance improved by 10% when using human-annotated+BLIP+Title captioning. The impact of synthetic captions, then, seems to be strongest when the pretraining dataset is labeled but unfiltered, suggesting that the machine captions act as a kind of pseudo machine-label.

What makes a caption more robust? In Table G.1, we observe that by many common measures, robust and non-robust captions are very similar; to the extent they are dissimilar, the measures would seem to favor the non-robust YFCC captions rather than the robust LAION captions. Measures such as length, language of origin (as estimated by a common Python language detection package), or token diversity are unlikely to explain differences in model performance.

5. DISCUSSION AND RECOMMENDATIONS

Caption supervision enables distributionally robust model training. It does so primarily by indirect means; captions serve as a dataset filtration method and are interpreted as labels by the model. However, in the VL-loss paradigm, the contents of the captions themselves can also affect both accuracy and distributional robustness, and not necessarily in equal proportion.

1. Loss function matters. In Section 2, we showed that the choice of VL- or CE-loss can have a substantial effect on model distributional robustness and base accuracy, even when dataset sizes are similar.
2. CE learns best when noise is low and classes are few. In Section 4 and its subsections, we showed that CE-loss models learn high-accuracy, robust representations from compact datasets, but are highly sensitive both to the amount and to the type of noise present in the labels. CE-loss models also tend to perform worse on in-distribution classes when training on out-of-distribution classes (classes which are not evaluated).
3. VL learns best when lots of data is available for many classes. Also in Section 4, we showed that VL-loss models tend to be more robust at low accuracy, and can learn a basic representation of a very large number of classes even when labels are less than 10 percent accurate; however, they do not benefit as much from ground-truth captions as CE-loss models benefit from ground-truth labels. Furthermore, they require far more data to attain high accuracy.
4. Image quality trumps caption quality. In Section 4.2, we showed that descriptive caption supervision can result in better performance when caption quality is low, but has no effect on accuracy or distributional robustness when caption quality is high.

Future research directions. Our controlled study has helped illuminate some of the meaningful differences between CE-loss and VL-loss models.
We believe that research into how to shore up the weaknesses of either approach, or discovery of a new method which blends their strengths, would be a useful direction for future efforts.

A DISTRIBUTION SHIFTS ON IMAGENET

ImageNet is a large-scale visual ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean, full-resolution images, making it a roughly class-balanced, fully supervised dataset (Deng et al., 2009).

There now exist a wide range of distribution shifts on ImageNet: novel test datasets designed to overcome some of the limitations of the original benchmark. While they cannot remedy issues with the labeling scheme, these datasets do provide challenging new contexts in which to analyze classifier performance. ImageNet-V2 was designed to duplicate, as closely as possible, the original ImageNet test set. It was intended to answer the question of whether ImageNet-trained classifiers could successfully generalize even to the mildest of distribution shifts.

Later works demonstrate the power of effective robustness as an explanatory tool for performance differences in VL models; Miller et al. (2021) showed that there exists a strong correlation between most models trained on random subsets of a data distribution and the fully trained model. However, these authors also caution that it has significant limitations: Taori et al. (2020) and Nguyen et al. (2022) show that models trained on more (or different) data can significantly change the effective robustness line of a particular model, and also that these changes were shift-specific, with stronger fits on shifts like ImageNet-V2 and weaker fits on shifts like ImageNet-A. In Section C, we extend these findings to almost 1000 publicly available vision models, including almost 100 VL models, demonstrating, in granular detail, the nonlinearities that appear in many common shifts as models train on larger datasets and at higher accuracies. While it is tempting to deal with distribution shifts as a kind of monolith, the truth is, perhaps unsurprisingly, more complex.
We found that ImageNet-V2 seemed to respond more to model architecture than other shifts, with the handful of non-ResNet models we evaluated outperforming nearly all other models, regardless of training objective. ImageNet-R and ImageNet-Sketch both showed high sensitivity to the training data, with the CC-12M and LAION-15m distributions considerably outperforming even the best YFCC-trained models; these types of shifts are particularly amenable to subset matching strategies (Fig. 6, Fig. 4). On ImageNet-A, CE models significantly underperformed compared to VL models regardless of the data, and all models significantly underperformed compared to the ViT-L CLIP (Fig. 5). We also note that there is no readily apparent logit-scaled linear trend in these distribution shifts when one considers models trained on a wide range of different datasets, underscoring the importance of a well-chosen baseline for comparison. We find that different shifts tend to disadvantage different kinds of models, which makes improving on all of them simultaneously very challenging. The fact that ViT-L CLIP was able to do so is both impressive and, given the vital importance of the underlying data distribution in such measures, a mystery which is unlikely to ever be solved: even massive public datasets such as LAION are unable to match the performance of the dataset CLIP was trained on, although other factors might also have played a role. A standardized benchmark of distribution shifts on ImageNet would be a welcome contribution to this area of research.

B PRETRAINING DATASETS

Today, many SOTA models are pretrained on web-scale unsupervised data. We utilized three such datasets in our experiments. One major challenge of conducting research on unsupervised datasets is that the links provided as part of the dataset fail more and more over time, leading to each group training on a different version of the dataset. Therefore, to the extent possible, we report the details of each dataset in the appendix, and encourage other researchers working with these datasets to do the same.

CC-12M is a lightly supervised web-scale dataset created by Google. The image-caption pairs in CC-12M were filtered and selected for the purpose of training models to caption images (Changpinyo et al., 2021). Our version of CC-12M contained 9,703,885 image-caption pairs.

YFCC-15M is a subset of YFCC-100M, which consists of 100M image-metadata pairs taken from Yahoo-Flickr in 2016. The subset was selected by OpenAI. This dataset contains images and metadata, including a "title" and a "description" field; these fields are combined and processed in various ways by researchers in order to generate captions for models to train on.

D COMPLETE TABLE OF MODEL RESULTS

In the tables of Section D, we present the collected results of all of our experiments, both on CaptionNet and on larger portions of YFCC, LAION and CC-12M.

E SUBSET MATCHING STRATEGIES

In this section, we define and describe some important variations on the basic subset matching strategy as described in the main paper. All of our subset matching experiments utilized one of three matching strategies:

Strict: Only match on samples which contain exactly one class. Strict matching degrades as matching strategies grow more aggressive, and is also strongly affected by the number of classes evaluated; however, it tends to be the least noisy method, making it useful in some contexts.

Single class: Greedily take the first matching class as the true class and ignore all others. As a general matter, we found that single-class matching struck the best balance between dataset utilization and accuracy.

Multi class: Match on all matching classes, up to 25 classes per sample. We found that this approach tended to decrease accuracy, and that under most strategies, multiclass matches were uncommon anyway.

Figure 3: The new picture of effective robustness. This plot, which contains inference results from nearly 1000 models, shows a more complicated landscape of effective robustness than previous investigations. Ablations on the caption space show that even when the information in captions is aggressively transformed or reduced, effective robustness is preserved. In the low-accuracy regime, RN50s trained on integer-captioned YFCC and LAION data are able to match or exceed VL effective robustness. In the high-accuracy regime, LiT-tuned VL models and wise-ft models approach, but do not reach, the VL-robust line. CE-loss models also approach the VL line at very high base accuracies.

All of the matching terms were chosen heuristically, to the best of our knowledge. We found that heuristic changes to the matching terms often had a substantial effect on accuracy; however, we leave the algorithmic discovery of optimal subset matching terms to future work. We have released all of our matching terms in VLHub (e.g., /vlhub/metadata/in1k ours.txt).
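The three strategies above can be sketched as a single resolution function over the set of matched classes. This is an illustrative reading of the strategies, not the VL Hub implementation; in particular, taking "first" to mean the lowest class id is our assumption.

```python
def resolve_matches(matched_labels, strategy="sc", max_multi=25):
    """Resolve a set of subset-matched class labels into final label(s).

    strict: keep the sample only if exactly one class matched.
    sc (single-class): greedily keep the first match, ignore the rest.
    mc (multi-class): keep all matches, up to max_multi per sample.
    Returns a list of labels; an empty list means the sample is filtered out.
    """
    # Assumption: "first" match is resolved by lowest class id.
    matches = sorted(matched_labels)
    if strategy == "strict":
        return matches if len(matches) == 1 else []
    if strategy == "sc":
        return matches[:1]
    if strategy == "mc":
        return matches[:max_multi]
    raise ValueError(f"unknown strategy: {strategy}")

resolve_matches({3, 17}, "strict")  # -> [] (ambiguous sample is dropped)
resolve_matches({3, 17}, "sc")      # -> [3]
resolve_matches({3, 17}, "mc")      # -> [3, 17]
```

The strict/sc trade-off is visible directly in the sketch: strict discards every multi-match sample (lower utilization, lower noise), while sc keeps it at the cost of possibly picking the wrong class.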
Figure 5: ImageNet-A is learnable by all models at extremely high base accuracy. Although VL models seem to learn ImageNet-A faster than CE models, CE models reach near-parity with VL models when base accuracy gets very high.

Figure 6: VL performance on ImageNet-R outstrips base accuracy. On ImageNet-R, which is a 200-class subset of ImageNet, VL models are able to achieve higher accuracy than on ImageNet itself. VL continues to outperform CE models on this dataset, even at very high accuracies.

E.1 DATASET UTILIZATION

Dataset utilization is defined in Section 1. We define accuracy as the number of correctly classified samples over correctly plus incorrectly classified samples. Because ImageNet-100 and OpenImages-100 have ground-truth labels, an optimal labeling strategy on those datasets would have a dataset utilization of 1.0 (100 percent).
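Both quantities are simple ratios over the labeling outcomes; a minimal sketch with illustrative counts:

```python
def dataset_utilization(correct, incorrect, unlabeled):
    """Ds. Util: correctly labeled samples over all samples
    (correct + incorrect + unlabeled)."""
    return correct / (correct + incorrect + unlabeled)

def label_accuracy(correct, incorrect):
    """Accuracy as defined here: correct over (correct + incorrect),
    i.e. ignoring samples the strategy left unlabeled."""
    return correct / (correct + incorrect)

# A strategy that labels 600 of 1000 samples, 480 of them correctly:
dataset_utilization(480, 120, 400)  # -> 0.48
label_accuracy(480, 120)            # -> 0.8
```

The example shows why the two metrics diverge: a conservative strategy can have high accuracy on the samples it does label while still wasting much of the dataset.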

E.2 PERFORMANCE OF SUBSET MATCHING STRATEGIES ON IMAGENET-100 AND OPENIMAGES-100

When ground-truth labels do not exist, it is impossible to know exactly how accurate any labeling strategy is. We therefore take advantage of the ground-truth labels for OpenImages-100 and ImageNet-100 to compute the accuracy of a wide range of strategies. In Table E.2, we list exact accuracy results for a range of subset matching strategies by comparing them directly to the ground-truth labels.

E.3 PERFORMANCE OF VARIATIONS ON SIMPLE SUBSET MATCHING

There are many ways one could conceivably improve on the simple strategies described above; we explore some of them here. Fuzzy subset matching uses the Levenshtein distance between tokens to locate potential matches. We used the standard Python library fuzzywuzzy with a match-score threshold of 55 (higher thresholds found few matches). We found that this approach was considerably slower than simple subset matching, but offered only limited benefits. We also explore synset matching: we match on noun synsets only, using the NLTK toolkit, and include all synonym nouns, hyponyms, hypernyms, also-sees and similar-tos, a broad matching strategy whose refinement we leave to future work. Synset matching is also much slower than simple subset matching and likewise yields only very small improvements; we therefore use simple subset matching for the majority of our experiments.
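The fuzzy variant can be sketched as follows. Our experiments used fuzzywuzzy with a threshold of 55; this sketch instead uses the stdlib `difflib.SequenceMatcher`, whose `ratio()` is a related (but not identical) normalized similarity score in [0, 1], so the threshold here is only an analogue of the paper's setting.

```python
from difflib import SequenceMatcher

def fuzzy_match(caption, class_terms, threshold=0.55):
    """Return class ids whose term is approximately present in the caption.

    For each class term, slide a window of the same word-length over the
    caption and accept the class if any window is similar enough."""
    words = caption.lower().split()
    hits = []
    for cid, term in class_terms.items():
        n = len(term.split())
        for i in range(len(words) - n + 1):
            window = " ".join(words[i:i + n])
            if SequenceMatcher(None, window, term).ratio() >= threshold:
                hits.append(cid)
                break
    return hits
```

This recovers near-miss spellings such as "lionness" for "lioness", at the cost of a full pass over every caption window per class term, which is why fuzzy matching is considerably slower than exact substring matching.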

E.4 PERFORMANCE COMPARISON OF SUBSET MATCHING STRATEGIES ON YFCC AND LAION

We train baseline 2M and 4M models on random subsamples of YFCC. We then apply subset matching strategies as described in Section E. We note that in1k-ours matching is more robust than in1k-default matching on YFCC, controlling for size. We also try the opposite strategy, targeting the 10M samples in YFCC which were NOT matched. Even under this rather punishing transformation, we find that a VL model is able to learn a low-level representation of ImageNet (around 12 percent accuracy and an average of 8 percent under shift, around one third of a baseline YFCC model). The performance of a subset-matched model under these circumstances would be zero, because it would have no samples to train on.

Does subset matching favor certain datasets?

Subset matching strategies only reach parity with VL models when we approximately control for dataset utilization (the number of samples the model evaluates), indicating that this factor is important to consider when attempting to answer this question. Could it be that subset matching underperforms on non-robust datasets because it simply matches fewer samples? This turns out not to be the case; we find that on YFCC and LAION, a subset match is found around 20 percent of the time, and on CC12M, around 25 percent of the time, resulting in roughly similar dataset utilization.

Single-class subset matching usually outperforms other filtering strategies

As shown earlier, the choice of subset matching strategy can impact both distributional robustness and accuracy during training, and changes do not necessarily affect both measures in equal proportion. We consider a range of variations on simple subset matching to see if they offer any improvement, such as fuzzy matching using Levenshtein distance and synset matching, but in our experiments these techniques offered few if any advantages. We also experiment with changing the term-matching dictionary and the matching strategy; for more information, please refer to Table E.2. Overall, we find that single-class, non-strict matching on a relatively limited set of terms provides the best balance of accuracy, speed and dataset utilization, and that all subset matching strategies perform worse than ground-truth labels, even when controlling for dataset size.

Prefiltering samples improves caption quality

We find in Table E.2 that the caption noise profiles of ImageNet-100 and OpenImages-100 differ dramatically:

• On ImageNet-100, simple subset matching on flickr-captions achieves high dataset utilization and accuracy
• On OpenImages-100, the same strategy shows lower accuracy and much lower dataset utilization

Since both datasets use the same caption source and the same image source, the data alone is unlikely to explain this discrepancy. Instead, we suspect the difference arises from the strategies used to assemble these datasets. OpenImages labelers applied labels to images which had not been selected with any particular objective in mind (Kuznetsova et al., 2020). ImageNet labelers applied labels to images which had been preselected with the intent of building a class-balanced dataset for 1000 classes chosen in advance (Deng et al., 2009). We cannot verify the accuracy of subset matching on YFCC-100 or LAION-100 because there are no human labels for these datasets. However, we know that LAION samples were prefiltered much more aggressively than YFCC samples (Schuhmann et al., 2021b; Thomee et al., 2016). Therefore, it is possible that subset matching's strong performance on LAION can be attributed to this difference alone.

Subset matching hit rate is a good signal for subset matching accuracy

On OpenImages-100, we observed that as the raw match count decreased dramatically, the accuracy of the matched samples also decreased (Table E.2). This fact offers one possible guideline for when subset matching is better than VL: it performs better when there are a relatively high number of subset matches, possibly because hit rate correlates with accuracy. The hit rate on YFCC-100 is around 2 percent, whereas the hit rate on LAION-100 is around 3.5 percent. This difference is substantial enough that it could signal that subset matching will be the more successful approach.

Machine labels complement subset-matched labels

When classes are known in advance, machine labeling can be a good strategy for learning from unsupervised image data. One advantage of machine labeling is that if the labels are accurate, dataset utilization has the potential to be much higher. In Table 5, we find that on ImageNet-100, 90 percent of the labels from a CLIP ViT-L model match the ground-truth label. This is very comparable to that model's zero-shot validation accuracy on ImageNet-val-subset. However, when we train a CE model on the CLIP labels, we find the machine labels outperform subset matching with size control, performing nearly as well as ground-truth labels. The most likely explanation for this somewhat puzzling result is that the distribution of CLIP error is 'helpful' to model learning; results in this vein have been shown by Goh et al. (2021). The distribution of error in subset-matched models, as we note in Section 2, closely approximates random noise. Why might CLIP error be more helpful? One possible explanation is that VL-loss models always have available a very large space of potential classes for matching. Consider examples like "lioness" for the class "lion", or "lamp shade" for "lampshade": subset matching approaches miss these positive matches (unless we use heuristic methods to correct them), but in a VL model, the most predictive token would likely remain the same. Since we know from Section G that bag-of-words contrastive-loss VL models are not sensitive to token position in the string, this could result in a correct classification where a subset-matching model would fail. Machine labeling can also be useful for estimating the accuracy of subset-matched labels. On LAION-100, we find that 60 percent of labels agree between subset matching and labels from a CLIP ViT-L model. We leave to future work the question of how models with a wider range of predefined classes would perform on such a task.
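The agreement rate used above (e.g., the 60 percent figure on LAION-100) reduces to a simple computation over the samples both strategies labeled; the sketch below uses our own names, with machine labels standing in for CLIP zero-shot predictions.

```python
def label_agreement(machine_labels, matched_labels):
    """Fraction of samples on which machine labels (e.g., CLIP zero-shot
    predictions) agree with subset-matched labels, computed over the
    samples that BOTH strategies assigned a label."""
    shared = machine_labels.keys() & matched_labels.keys()
    if not shared:
        return 0.0
    agree = sum(1 for sid in shared
                if machine_labels[sid] == matched_labels[sid])
    return agree / len(shared)
```

Restricting to the shared samples matters: subset matching labels only a small fraction of the dataset, so agreement over the full dataset would mostly measure utilization rather than label quality.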
Overall, we find that machine labeling and caption supervision serve complementary roles; machine labeling does not directly rely on captions, but can still benefit from them. In particular, subset matching can prefilter incoming image data to target classes of interest.

The use cases for subset matching

We find that simple subset matching is a powerful technique for learning on unsupervised image/caption data. We make the following recommendations for applying this technique:

1. Both flickr-style tags and descriptions and alt-text can be effective when used for subset matching
2. Subset matching techniques are much more effective when the image-caption data has been roughly filtered or supervised, even if it has not actually been labeled; for instance, all images that match terms in a search engine, or all images that maximize the dot product with a CLIP model
3. Subset matching relies on terms which are relatively common, and so works best with objects with unique names that often happen to be the subject of an image; English cocker spaniel, for instance
4. If the terms are uncommon or difficult to nail down, but the captions are still expected to contain them, then a VL model may perform better
5. When data is unfiltered, the hit rate of subset matching is a good barometer for how accurate the matches will be
6. If hit rate is low, so is accuracy, and VL will probably perform better than subset matching
7. Augmenting or ensembling with machine labels works well in conjunction with captioning; captions generated from machine labels can provide additional signal for VL and subset matching when captions are noisy
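Recommendation 2 mentions prefiltering by similarity score with a CLIP model. A minimal sketch of such a filter is below; the embeddings are assumed to be precomputed elsewhere (e.g., by a CLIP image and text tower) and are represented here as plain lists, with a threshold we chose arbitrarily for illustration.

```python
import math

def clip_score_prefilter(image_embs, text_emb, threshold=0.25):
    """Keep sample ids whose image embedding has cosine similarity with a
    class text embedding above `threshold`."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    return [sid for sid, emb in image_embs.items()
            if cos(emb, text_emb) >= threshold]
```

In practice CLIP embeddings are unit-normalized, so cosine similarity and the dot product referenced in the text coincide.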

F DETAILS ON CAPTIONNET

The most important new contribution in CaptionNet is ImageNet-100. To the best of our knowledge, ImageNet-100 is the only version of ImageNet which duplicates the original distribution's class balance and supervision properties (ImageNet is not perfectly class balanced, but it does not contain any long-tail classes; all classes in ImageNet have at least 750 samples), while also being fully captioned with original web-scraped labels. We find that both VL and CE models trained on relatively small amounts of data can achieve high base accuracy on some CaptionNet subsets, making it possible for the first time to compare model distributional robustness while controlling for base accuracy. In Table 10, we discuss in detail the supervision strategy used for CaptionNet, with a per-class breakdown. An overview of the supervision process follows:

• All samples were supervised by the authors of the paper
• Samples were sourced from flickr using the available API, sorted by 'interesting', with safesearch enabled, searching only samples with Creative Commons licenses
• Additional filtering terms were passed to the API in order to eliminate commonly encountered confounds in the search terms
• After the search term was selected, items were downloaded in bulk
• All downloaded samples were then individually tagged by the researchers as either "in-class" or "out-of-class", using reference photographs from each class as a baseline comparison

We found that classes varied widely along several vectors:

• Some classes had far greater availability than others (ranging from 450,000 down to 283 available samples)
• Some classes were much cleaner than others (ranging from 100 percent clean to around 25 percent)
• Some classes tended to be the 'subject' of photographs, such as dog breeds, while others, such as mashed potato, tended to be featured as secondary items in the background of a photograph of something else

F.1 DATASET CONSTRUCTION

The 100 classes in CaptionNet were selected randomly from a subset of all classes with more than 600 captions available in ImageNet-Captions (Fang et al., 2022). The list of classes selected is available in Section I. We note that this approach introduces a potential bias in class selection, since captions may still have been available for those images ten years after ImageNet was originally constructed for some reason that correlates with properties we are interested in studying; however, we feel that this risk is outweighed by the many benefits of having such a dataset available for study. Since we could not find human-authored captions for ImageNet, we used BLIP (Li et al., 2022) to generate descriptive captions for ImageNet-100. BLIP often uses word fragments to describe objects, so we used a spell checker as a simple intervention to improve the quality of BLIP captions. Finally, because BLIP's vocabulary does not include many of the specialized classes in ImageNet, we augmented the BLIP captions with Flickr image titles, the form of text most commonly available for an image. We generated BLIP captions with top_p=0.9, max_length=40, min_length=5, repetition_penalty=1.1.

In all of these ablations, models remain on the same trend line, indicating that the changes in effective robustness cannot be attributed to the properties of natural language which we ablate, such as the tokenizer, sentence-wise attention mechanisms, prompt ensembling or even class-independent caption content. In Table G.1, we observe that by many common measures, alt-text and flickr captions are quite similar; whatever explanation may exist for the performance differences, it is not easily summarized by these measures.

Table 12: Alt-text and flickr captions do not differ substantially by many measures. We observe that by many measures, alt-text captions are very similar to flickr captions, making it difficult to determine why models trained on alt-text captions tend to be more robust.

H ALTERNATE TRAINING SCHEMES

One fairly immediate explanation for the distributional robustness of VL models would be that we are witnessing a kind of overfitting which occurs whenever a model is fine-tuned on a training dataset. If this were the case, then a method for reverting the overfitting should shift models to a different line with respect to effective distributional robustness. In Fig. H, we explore this possibility, focusing on two alternative training schemes in particular: Wise-FT and LiT-tuning. Radford et al. (2021) ran fine-tuning experiments on certain datasets and found that robustness declined as base accuracy increased, indicating that fine-tuning makes CLIP models (somewhat) less robust, fitting a middle line in between VL and CE models. Wortsman et al. (2022) ran a series of experiments interpolating the weights of zero-shot CLIP with its fine-tuned counterparts and showed that for certain distribution shifts, it is possible to find a 'sweet spot' where both i.d. and o.o.d. accuracy increase. We test the Wise-FT method on ViT-L and find that under this intervention, distributional robustness does not scale proportionately with accuracy, instead holding essentially constant as base accuracy increases. We also evaluated the pre-trained LiT-tuned models released by Zhai et al. (2021), and trained several LiT-tuned models ourselves in order to compare their performance to fully-trained VL models, extending the results in Beyer et al. Our findings are as follows:

1. Like Wise-FT, LiT-tuning produces models whose i.d./o.o.d. accuracy trade-off fits a line between that of traditional models and VL models: more robust than the former, less robust than the latter. The only exception we found was when we LiT-tuned the vision tower of a ViT trained on the CLIP objective; in this case, LiT-tuning decreased base accuracy while holding effective robustness constant (the near-opposite effect of Wise-FT)
2. LiT-tuning offers negative benefit for fully trained VL models, suggesting that it can only hope to approach, rather than exceed, the accuracy of its baselines
3. LiT-tuning performance tends to correlate closely with the base accuracy of the underlying vision model
4. Intriguingly, we find that this is true regardless of the specific dataset used for LiT-tuning; LiT-tuned models trained on small amounts of data are able to recover accuracy on out-of-distribution tasks even when very little data from that distribution shift appears in the pretraining data
5. These experiments suggest that some degree of effective robustness is "locked away" in many vision models but is lost during the training process, and that certain techniques are able to increase effective robustness disproportionate to the loss in base accuracy, pushing the model 'above the line' we would normally expect. Furthermore, if the distribution shift of interest is known and well-defined, it is possible to select a tuning to optimize for that shift

Taken together, we can conclude that effective robustness cannot be explained by overfitting alone; if it were, then we would expect interpolation-type interventions to be capable of lifting models to the effective robustness line, which they do not.

clip idx are ImageNet labels chosen by a zero-shot CLIP ViT-L model from OpenAI. idx labels refer to labels generated using various subset-matching strategies. mc is multiclass, sc is single class, strict is strict. Ours, default, and openai refer to the three different sets of class labels we experimented with throughout this paper.
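The weight-space interpolation at the heart of Wise-FT can be sketched in a few lines (assuming both checkpoints share an architecture; tensors are simplified here to plain floats keyed by parameter name):

```python
def wise_ft_interpolate(zero_shot_weights, fine_tuned_weights, alpha=0.5):
    """Linearly interpolate between zero-shot and fine-tuned checkpoints,
    as in Wortsman et al. (2022). alpha=0 recovers the zero-shot model and
    alpha=1 the fine-tuned model; intermediate values trade off i.d. and
    o.o.d. accuracy."""
    return {name: (1 - alpha) * zero_shot_weights[name]
                  + alpha * fine_tuned_weights[name]
            for name in zero_shot_weights}
```

In the real method the interpolation is applied parameter tensor by parameter tensor; the 'sweet spot' alpha is chosen by sweeping this value and evaluating both in-distribution and shifted accuracy.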



Our version of YFCC (Thomee et al., 2016) contained 14,825,134 image-caption pairs. LAION is a 5B image-caption dataset recently created by LAION.ai; it is the first publicly available dataset which matches the scale of the datasets used by large companies to train their best models (Schuhmann et al., 2021a). The subset of LAION we refer to as LAION-15m contained 13,775,512 image-caption pairs.

C LARGE-SCALE EVALUATION OF CE VS VL DISTRIBUTIONAL ROBUSTNESS

In order to get a more complete picture of the current landscape of model distributional robustness, we evaluated nearly 1000 models on our suite of distribution shifts, including all of the models with metrics reported by Taori et al. (2020); Wightman (2019); Feuer et al. (2022), models trained with LiT and Wise-FT objectives as described in Wortsman et al. (2021); Zhai et al. (2021), and all of the models trained for this paper. The results are shown in Fig. C. We find that in the very high accuracy regime, the logit-transformed linear fit of CE models fails to hold, and CE distributional robustness increases faster than predicted, approaching VL distributional robustness.
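The logit-transformed linear fit referenced above, following the effective-robustness methodology of Taori et al. (2020), amounts to a least-squares line between logit-transformed i.d. and o.o.d. accuracies; a minimal sketch (our own function names):

```python
import math

def logit(p):
    """Logit transform applied to accuracies in effective-robustness plots."""
    return math.log(p / (1 - p))

def fit_logit_line(id_accs, ood_accs):
    """Least-squares fit of logit(ood accuracy) against logit(id accuracy)
    across a collection of models; returns (slope, intercept). A model lying
    above this line exhibits positive effective robustness."""
    xs = [logit(a) for a in id_accs]
    ys = [logit(a) for a in ood_accs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx
```

The finding in this section is that at very high base accuracy, CE models sit above the line fitted to the rest of the CE population, i.e., the linear fit in logit space stops being a good predictor.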

Figure 4: Non-linearities in ImageNet-Sketch. ImageNet-Sketch performance is not linear, with only the very largest VL models showing a reliable improvement over CE-trained models when controlling for dataset size.

Figure 7: ImageNet-100 samples from CaptionNet.

Figure 8: OpenImages-100 samples from CaptionNet.

Figure 9: LAION-100 samples from CaptionNet.

Figure 10: Wise-FT, optimized to balance id/ood accuracy, fits the LiT-tuned effective robustness line.

Figure 11: LiT-tuning on a VL-trained image tower reduces accuracy without altering effective robustness, suggesting that VL pretraining is at least as robust as LiT-tuning. Wise-FT tuning greatly increases base accuracy and slightly improves effective robustness, at the cost of zero-shot capability. CE from-scratch training matches Wise-FT accuracy, but sacrifices effective robustness and zero-shot capability.

A direct comparison of VL and CE-loss models on CaptionNet datasets highlights the importance of the loss function. The best-performing VL model is marked in boldface, and the best-performing CE model is in italics. CE-loss models are supervised using ground-truth labels where they are available, and subset-matched labels where ground-truth labels are unavailable. VL-loss models are supervised using the best available captions for the dataset. We find that unfiltered datasets with only subset-matched labels are unlearnable on their own, but when blended with ground-truth supervised data, the resulting model is more accurate and more robust than a VL model trained on the same data with captions. VL distributional robustness is proportionately higher in the low and medium regimes, but base task accuracy improves only after large amounts of additional data are added.

Eliminating all tokens not used in evaluation reduces evaluation accuracy. VL models perform worse on validation and under shift when all tokens which are not in the prompt or classnames are mapped to 0, showing that VL models learn about evaluation classes from non-evaluation classes.
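The token-mapping ablation described above can be sketched as follows (a hypothetical tokenizer interface; the real CLIP tokenizer produces BPE token ids, and the placeholder id is our choice):

```python
def mask_non_eval_tokens(token_ids, eval_vocab, placeholder_id=0):
    """Map every token id outside the evaluation vocabulary (the tokens
    appearing in the prompts and classnames) to a single placeholder id,
    so the model sees no information from non-evaluation tokens."""
    return [tid if tid in eval_vocab else placeholder_id
            for tid in token_ids]
```

Training on captions transformed this way removes any signal carried by non-evaluation tokens, which is how the ablation isolates what VL models learn about evaluation classes from the rest of the vocabulary.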

Recht et al. (2019). ImageNet-Sketch is a distribution shift covering sketches, paintings, drawings and illustrations of ImageNet classes; this test set is very large and comprehensive (Wang et al., 2019).

Complete results table: non-CaptionNet models. This table includes results for all of the larger models in our study, including performance broken down by specific distribution shift.

Complete Results table: CaptionNet models. This table includes results for all of the CaptionNet models in our study, including performance broken down by specific distribution shift. VL models perform better on OpenImages when flickr-captions are replaced with synthetic captions (BLIP+Title captions), but the same captioning method provides no benefits on ImageNet-100.

Complete Results table: Subset Matching Strategies. This table contains direct comparisons of many of the subset matching strategies evaluated in our study; we consider a range of metrics, in particular raw accuracy and dataset utilization. TotalMatch is a count of how many samples were matched to some label by MC matching. MaxCorrS is a count of how many samples were correctly matched to some label by the most successful strategy (typically SC). MaxDU is a measure of the highest dataset utilization possible using any strategy (typically SC).

CaptionNet Supervision: Search Terms and Sample Quality. Since many of the findings in our paper highlight the importance of both the amount and type of label noise, this table records statistics pertaining to our filtration process for the new samples in in100. In the search term field, a '-' symbol indicates that all samples which included that word in the title, tags or description were NOT matched. Boolean OR, AND, and "" symbols behave as they typically do.



Dataset | Size | Non-English Capt. Freq. | Avg. Capt. Len. | Std. Dev. Capt. Len. | Num. Uniq. Tokens | Supervision Strat. | Filtering Strat.


We repeated the process for OpenImages-100. However, we used human-authored captions sourced from Pont-Tuset et al. (2020) instead of BLIP whenever available; around 16,000 out of the 135,000 OpenImages-100 samples had human-authored captions. 

G CLIP DOES NOT LEARN A LANGUAGE MODEL

We show that, evaluated by the standard definition of a language model in the literature, as well as by our expectations as human beings, CLIP's text tower does not learn a language model. Fang et al. (2022) noted that on most natural distribution shifts, models trained with language information from a captioned subset of ImageNet follow the same trend as models trained without it, with neither coming close to the distributional robustness of VL models. We conducted a range of experiments in Table G to verify this:

1. Zero-shot and fully trained scrambling: we randomly scrambled the word order of captions and trained a model for the full training duration. We saw only a 1% loss in zero-shot accuracy when the model was trained in this fashion, and no loss at all when we scrambled the word order of the prompts used to generate the text embeddings. This finding shows that despite the existence of a positional encoder, CLIP's text tower is invariant w.r.t. word position in its input captions.
2. We train a "simple captions" model on YFCC, captioned "An image of a CLASSNAME", where the class name is all the wordnet-recognized nouns and adjectives in the captions, and see very little change in effective robustness. This indicates that the model pays very little attention to verbs, adverbs and non-wordnet tokens, at least for the purposes of ImageNet zero-shot accuracy.
3. We train a "simpler captions" model on YFCC, captioned "An image of a CLASSNAME", where the class name is a single version of an ImageNet-1k classname. We find that accuracy decreases, but even under this highly destructive transformation, effective robustness remains on the line.
4. We test prompt ensembling by eliminating all but one prompt during inference when computing CLIP's predictions. We try both scrambling this prompt and leaving it unscrambled. We find very little change in accuracy and none in effective robustness.
This indicates that prompt selection and prompting methods may improve accuracy, but do not affect effective robustness.

5. We fully trained a model with a shift cipher applied to all of the letters in the captions. While this has a significant effect on accuracy (a drop of 8%), effective robustness holds nearly constant. This shows that CLIP's tokenizer aids accuracy but offers no effective robustness benefit (since a shift cipher destroys nearly all word-wise tokenization and forces the model to tokenize one letter at a time), and that the model is not more robust because of some special understanding of the text data itself, such as learning the frequency distribution of letters or tokens as a way of guessing captions.

Throughout our experiments, we find that while interventions which affect the integrity of the caption space do impact the overall accuracy of the model, they do not change the effective robustness trend line.
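The two most destructive caption-space transforms used in these ablations, word-order scrambling and a letter-wise shift cipher, can be sketched as follows (hypothetical helper names; the shift amount is illustrative):

```python
import random

def scramble_words(caption, seed=0):
    """Randomly permute word order, as in the scrambling ablation."""
    words = caption.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def shift_cipher(caption, shift=3):
    """Apply a Caesar-style shift to alphabetic characters. This destroys
    word-wise BPE tokenization (forcing letter-at-a-time tokenization)
    while preserving letter frequencies up to relabeling."""
    out = []
    for ch in caption:
        if ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)
```

Applying `scramble_words` to every training caption reproduces the scrambling intervention; applying `shift_cipher` reproduces the cipher intervention, under which accuracy drops but effective robustness holds.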

L CAPTIONNET SPREADSHEET COLUMN EXPLANATIONS

CaptionNet contains many different kinds of metadata, and the meaning of some of the column labels used may not be immediately apparent to the reader. We do not provide explanations for metadata columns which are explained in one of the original dataset descriptions; for those, we recommend referring to the original authors of the datasets (Deng et al., 2009; Fang et al., 2022; Schuhmann et al., 2021a; Thomee et al., 2016; Kuznetsova et al., 2020). BLIPCaption refers to captions generated by us using a BLIP captioning model (Li et al., 2022). BLIPTitle captions are a combination of the BLIP caption and the title field of flickr captions. FlickrCaption refers to captions sourced from flickr. annot caption refers to OpenImages captions that were authored by human image annotators. prose caption combines BLIP and annotator captions, favoring the latter when available.

