CAPTION SUPERVISION ENABLES ROBUST LEARNERS

Abstract

Vision-language (VL) models like CLIP are robust to natural distribution shifts, in part because CLIP learns on unstructured data using a technique called caption supervision: the model interprets image-linked texts as ground-truth labels. In a carefully controlled comparison study, we show that caption-supervised CNNs trained with a standard cross-entropy loss (with image labels assigned by scanning captions for class names) can exhibit greater distributional robustness than VL models trained on the same data. To facilitate future experiments with high-accuracy caption-supervised models, we introduce CaptionNet, a class-balanced, fully supervised dataset with over 50,000 new human-labeled ImageNet-compliant samples paired with web-scraped captions. In a series of experiments on CaptionNet, we show how the choice of loss function, data filtration, and supervision strategy enables robust computer vision. We also provide the codebase necessary to reproduce our experiments at VL Hub.

Under review as a conference paper at ICLR 2023

Table 1: This table lists works related to our own that evaluate distributional robustness in computer vision. We catalog the contributions of each paper with respect to certain key factors. VL vs. CE-loss indicates whether the paper conducted controlled comparisons of the effects of VL-loss (InfoNCE) and CE-loss. Captioning strategy indicates whether the study evaluated the effects of captioning strategy on model performance. Our paper is the first to compare CE-loss and VL-loss models trained and evaluated on multiple datasets at both low and high accuracies.

1. INTRODUCTION

Motivation. Real-world uses of deep learning require predictable model behavior under data distribution shift. Our paper deals with distributional robustness, the effect of distribution shift on image classifiers; for brevity's sake, we may simply refer to this as robustness. Since 2019, numerous works have studied effective robustness on ILSVRC-2012 ImageNet classification, quantifying the impact of various interventions (Taori et al., 2020; Miller et al., 2021; Fang et al., 2022; Radford et al., 2021). The majority of standard deep network models have been found to perform significantly worse under so-called natural distribution shifts, such as changes in lighting or stylized renderings of object classes (Hendrycks & Dietterich, 2019; Miller et al., 2021), leading many researchers to reconsider how close computer vision models have truly come to human-level accuracy, and underscoring the need for more robust models and more diverse evaluation datasets.

The now-popular ViT-L CLIP model of Radford et al. (2021) was the first vision-language (VL) model to show natural distributional robustness comparable to humans across a wide range of ImageNet shifts, at the cost of some base-task accuracy. Subsequent work from Jia et al. (2021); Pham et al. (2021) showed that human-level distributional robustness is possible even as base accuracy approaches SOTA, as long as sufficient data is available for training. The gains are not limited to CLIP; other VL-loss functions also achieve strong distributional robustness (Yu et al., 2022). In our paper, we focus on CLIP, since it is publicly available and by far the most popular VL model. Since CLIP's design differs from that of typical models in several important ways (loss function, training dataset, the use of natural language captions as labels), it is of great interest to isolate the effect of each of these factors on distributional robustness. Recent works have addressed this question and reached various interesting conclusions.
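To make the two training objectives compared throughout this paper concrete, the following is a minimal NumPy sketch of the CLIP-style symmetric InfoNCE loss (VL-loss) alongside the standard cross-entropy loss (CE-loss). This is an illustrative simplification, not the implementation from our codebase; the temperature value and batch construction are assumptions for the example.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def cross_entropy_loss(logits, labels):
    # Standard CE-loss: negative log-likelihood of the ground-truth class.
    return -log_softmax(logits)[np.arange(len(labels)), labels].mean()

def clip_infonce_loss(img_emb, txt_emb, temperature=0.07):
    # CLIP-style symmetric InfoNCE: each image should be most similar to
    # its own caption (the diagonal of the similarity matrix), and vice versa.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (batch, batch) cosine similarities
    targets = np.arange(len(logits))          # matched pairs sit on the diagonal
    loss_i = cross_entropy_loss(logits, targets)     # image -> text direction
    loss_t = cross_entropy_loss(logits.T, targets)   # text -> image direction
    return (loss_i + loss_t) / 2
```

Note the structural difference: CE-loss scores each image against a fixed class vocabulary, while InfoNCE scores each image against the other captions in the batch, so the "label space" changes with every batch.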
Fang et al. (2022) posit that the intrinsic diversity of training image data is the main cause of the distributional robustness gains of VL models in the zero-shot setting, with other factors such as language supervision contributing little to no distributional robustness. On the other hand, Santurkar et al. (2022) seemingly provide a counterpoint: given a sufficiently large pretraining dataset and descriptive, low-variability captions, contrastively trained VL models (AKA caption-supervised models) outperform models trained with the SimCLR loss in the transfer learning setting.

Does caption supervision lead to models which perform better under distribution shift, or does it not? This question is difficult to answer conclusively, not only because of the aforementioned confounds, but also because of the often vast discrepancies in base accuracy (Taori et al., 2020). VL-loss models are typically trained on massive uncurated datasets scraped from the web, and perform poorly when restricted to training on standard benchmark datasets; conversely, collecting massive labeled datasets for CE-loss models is an expensive undertaking. It is therefore difficult to conduct comparisons that isolate the effects of dataset size, dataset composition, loss function, filtration method, and data supervision method.

Our contributions. This paper addresses the aforementioned challenges in several important ways:

1. Following the lead of Miller et al. (2021), some recent works (cf. Table 1) have extrapolated trends detected in low-accuracy models to high-accuracy regimes for which no models exist. Our experiments indicate that when the loss function is changed, these extrapolations may no longer hold, even when the dataset is the same. Given that several controlled studies are performed in the low-accuracy regime, their results must be interpreted with appropriate caveats in mind.

2. To enable high-accuracy comparisons between models, we introduce CaptionNet, a 100-class image classification dataset composed of images, human-authored captions, and human-annotated labels. We source these from four existing benchmark datasets and augment them with over 50,000 newly supervised Creative Commons samples sourced from Flickr.

3. We train cross-entropy (CE-loss) and VL-loss models on CaptionNet, and find that we are able to achieve high base accuracy with both, allowing us to better isolate the effects of labeling strategy, loss function, and architecture.

4. From our observations, we conclude that the interaction between loss function and data filtration strategy contributes much more to distributional robustness than has previously been shown.

5. We provide new insight into the impact of caption style on vision-language models, showing that improved caption supervision can have a positive impact on distributional robustness.

Related work. Nguyen et al. (2022) were an important precursor to the work we present here; their extensive experiments on different caption-supervised datasets in the low-accuracy regime made it evident that controlling for the pretraining dataset is essential for understanding distributional robustness. We extend this result and show that the loss function and labeling strategy can also play a pivotal role in distributional robustness, even when the size of the dataset and the source of the image data are the same. Unlike Nguyen et al. (2022), our results are also presented in a high-accuracy regime.

In addition to their work showing the importance of data to distributional robustness, Fang et al. (2022) introduced ImageNet-Captions, which adds Flickr captions to nearly 450,000 ImageNet images. We made use of this dataset when building CaptionNet, but added over 50,000 new human-supervised samples in order to rebalance the classes, since it has been shown that CE-loss models often struggle with class imbalance (Phan & Yamamoto, 2020).
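The CE-loss labeling strategy discussed above, assigning a class label by scanning an image's caption for a class name, can be sketched as follows. The class vocabulary and the single-match rule here are illustrative assumptions for the example, not the exact procedure used to build CaptionNet.

```python
import re

# Hypothetical class vocabulary; the real CaptionNet classes differ.
CLASS_NAMES = {"goldfish": 0, "tabby cat": 1, "pickup truck": 2}

def label_from_caption(caption):
    """Assign a CE-loss training label by scanning the caption for a
    known class name; return None when zero or multiple classes match,
    so ambiguous samples can be filtered out."""
    text = caption.lower()
    hits = [idx for name, idx in CLASS_NAMES.items()
            if re.search(r"\b" + re.escape(name) + r"\b", text)]
    return hits[0] if len(hits) == 1 else None
```

For example, "a goldfish in a bowl" maps to class 0, while a caption mentioning both a tabby cat and a pickup truck is discarded as ambiguous. This illustrates why the labeling strategy interacts with data filtration: the matching rule simultaneously decides which samples survive and which label they receive.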

