CAPTION SUPERVISION ENABLES ROBUST LEARNERS

Abstract

Vision-language (VL) models like CLIP are robust to natural distribution shifts, in part because CLIP learns from unstructured data using a technique called caption supervision: the model interprets image-linked texts as ground-truth labels. In a carefully controlled comparison study, we show that caption-supervised CNNs trained with a standard cross-entropy loss (with image labels assigned by scanning captions for class names) can exhibit greater distributional robustness than VL models trained on the same data. To facilitate future experiments with high-accuracy caption-supervised models, we introduce CaptionNet, which includes a class-balanced, fully supervised dataset with over 50,000 new human-labeled, ImageNet-compliant samples paired with web-scraped captions. In a series of experiments on CaptionNet, we show how the choice of loss function, data filtration, and supervision strategy enables robust computer vision. We also provide the codebase necessary to reproduce our experiments at VL Hub.
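The label-assignment step described above (scanning a caption for class names) can be sketched as follows. This is an illustrative stand-in, not the paper's exact matching procedure; the class list and the case-insensitive substring rule are assumptions.

```python
def caption_to_label(caption, class_names):
    """Assign an integer label by scanning a caption for class names.

    Returns the index of the first class name found (case-insensitive
    substring match), or None if no class name appears. Hypothetical
    sketch of a caption-scanning step, not the paper's exact rule.
    """
    text = caption.lower()
    for idx, name in enumerate(class_names):
        if name.lower() in text:
            return idx
    return None

classes = ["golden retriever", "tabby cat", "pickup truck"]
print(caption_to_label("My golden retriever at the beach", classes))  # 0
print(caption_to_label("sunset over the ocean", classes))  # None
```

Captions with no class-name match are typically discarded, so the resulting labeled set is a filtered subset of the raw image-text pairs.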

1. INTRODUCTION

Motivation. Real-world uses of deep learning require predictable model behavior under data distribution shift. Our paper deals with distributional robustness, the effect of distribution shift on image classifiers; for brevity, we may simply refer to this as robustness. Since 2019, numerous works have studied effective robustness on ILSVRC-2012 ImageNet classification, quantifying the impact of various interventions (Taori et al., 2020; Miller et al., 2021; Fang et al., 2022; Radford et al., 2021). The majority of standard deep network models have been found to perform significantly worse under so-called natural distribution shifts, such as changes in lighting or stylized renderings of object classes (Hendrycks & Dietterich, 2019; Miller et al., 2021), leading many researchers to reconsider how close computer vision models had truly come to human-level accuracy, and underscoring the need for more robust models and more diverse evaluation datasets. The now-popular ViT-L CLIP model of Radford et al. (2021) was the first vision-language (VL) model to show natural distributional robustness comparable to humans across a wide range of ImageNet shifts, at the cost of some base-task accuracy. Subsequent work from Jia et al. (2021) and Pham et al. (2021) showed that human-level distributional robustness is possible even as base accuracy approaches SOTA, as long as sufficient data is available for training. The gains are not limited to CLIP; other VL loss functions also achieve strong distributional robustness (Yu et al., 2022). In our paper, we focus on CLIP, since it is publicly available and by far the most popular VL model. Since CLIP's design differs from that of typical models in several important ways (loss function, training dataset, the use of natural-language captions as labels), it is of great interest to isolate the effect of these factors on distributional robustness. Recent works have addressed this question and reached various interesting conclusions.
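Effective robustness, as used in the works cited above, measures how far a model's accuracy under shift lies above the trend predicted from its in-distribution accuracy. A minimal sketch follows, assuming the common formulation of Taori et al. (2020) with a linear fit in logit-transformed accuracy space; the slope and intercept would normally be estimated from a pool of standard models, and the numbers below are illustrative only.

```python
import math

def logit(p):
    """Log-odds transform of an accuracy in (0, 1)."""
    return math.log(p / (1 - p))

def effective_robustness(acc_id, acc_shift, slope, intercept):
    """Out-of-distribution accuracy above the baseline trend.

    The baseline predicts shifted accuracy from in-distribution
    accuracy via a linear fit in logit space (slope, intercept);
    these fit parameters are assumed given, not computed here.
    """
    pred = 1 / (1 + math.exp(-(slope * logit(acc_id) + intercept)))
    return acc_shift - pred

# Illustrative numbers, not measurements from the paper:
print(round(effective_robustness(0.76, 0.60, slope=0.9, intercept=-0.8), 3))  # 0.041
```

A model exactly on the trend line has effective robustness zero; CLIP-style VL models are notable for sitting well above the line fit to standard ImageNet models.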
Fang et al. (2022) posit that the intrinsic diversity of the training image data is the main cause of the distributional robustness gains of VL models in the zero-shot setting, with other factors such as language supervision contributing little to no distributional robustness. On the other hand, Santurkar et al. (2022) seemingly provide a counterpoint: given a sufficiently large pretraining dataset and descriptive, low-variability captions, contrastively trained VL models (i.e., caption-supervised models) outperform models trained with the SimCLR loss in the transfer learning setting. Does caption supervision lead to models which perform better under distribution shift, or does it not? This question is difficult to answer conclusively, not only because of the aforementioned confounds, but also because of the often vast discrepancies in base accuracy (Taori et al., 2020). VL-loss models are
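The contrastive VL objective discussed above can be sketched as a symmetric cross-entropy over an image-text similarity matrix, in the style of CLIP. This is a minimal NumPy illustration on precomputed, random embeddings, assuming a fixed temperature; real implementations use learned encoders and a learnable temperature.

```python
import numpy as np

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over an image-text batch.

    L2-normalizes both embedding sets, computes all pairwise cosine
    similarities, and applies cross-entropy in both directions with
    the matched pairs (the diagonal) as targets. A sketch of the
    CLIP-style objective, not CLIP's actual implementation.
    """
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # (N, N) similarity matrix
    labels = np.arange(len(logits))         # matched pairs on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # image-to-text and text-to-image directions, averaged
    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
print(clip_style_loss(rng.normal(size=(4, 8)), rng.normal(size=(4, 8))))
```

By contrast, SimCLR's loss contrasts two augmented views of the same image, with no text tower; the cross-entropy models discussed in this paper instead consume the scalar labels extracted from captions.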

