IS A CAPTION WORTH A THOUSAND IMAGES? A STUDY ON REPRESENTATION LEARNING

Abstract

The development of CLIP (Radford et al., 2021) has sparked a debate on whether adding language supervision can yield vision models with more transferable representations than traditional image-only methods. Our work studies this question through a carefully controlled comparison of the two approaches, in terms of their ability to learn representations that generalize to downstream classification tasks. We find that when the pre-training data meets certain criteria, namely that it is sufficiently large and contains descriptive captions with low variability, image-only methods do not match CLIP's performance even when they are trained with more image data. However, contrary to what one might expect, there are practical settings in which these criteria are not met, and in such settings added supervision through captions is actually detrimental. Motivated by our findings, we devise simple data and algorithmic interventions to improve the transfer performance of CLIP-style models.

1. INTRODUCTION

Image-based contrastive learning approaches have shown promise in building models that generalize beyond the data distributions they are trained on (Wu et al., 2018; He et al., 2020; Chen et al., 2020a; Caron et al., 2020; Chen et al., 2020b; Caron et al., 2021). By leveraging large (unlabelled) data sources via self-supervised training, these models learn representations that transfer to diverse image classification tasks, more so than their supervised counterparts (Ericsson et al., 2021).

Recently, Radford et al. (2021) showed that a different approach, contrastive learning with language supervision, can yield models (CLIP) with remarkable transfer capabilities. This development has garnered significant interest in the vision and natural language processing communities alike, leading to a debate on the utility of multi-modality in visual representation learning (Zhai et al., 2022; Devillers et al., 2021; Fang et al., 2022). Our work focuses on a specific question within this debate: Does added language supervision lead to more transferable visual representations than using images alone?

It might seem like the answer to this question is obvious. After all, CLIP utilized caption information unavailable to traditional image-based approaches and showed substantial gains over them (Radford et al., 2021). However, CLIP is drastically different from these approaches in many ways, from training data to fine-grained implementation choices, which makes it difficult to isolate the contribution of language supervision (see Section 5). Further, recent studies on CLIP's zero-shot classification and robustness properties cast doubt on whether adding language supervision is always beneficial (Fang et al., 2022). Resolving the aforementioned debate thus requires a carefully controlled comparison of the two approaches in which the only difference is the form of supervision.

Our contributions.
We devise a methodology to assess the utility of language supervision in CLIP¹ from a visual representation learning standpoint.


Both methods optimize the same self-supervised objective, enforcing consistency between two views of an example obtained via image augmentations or image-caption matching (see Figure 1), and this choice of supervision is precisely the quantity we want to study. Thus, we can disentangle the effect of language supervision on visual representations by comparing matched versions of SimCLR and CLIP (trained from scratch). Our focus, in particular, is on how well the learned representations transfer to varied image classification tasks. We find that the picture is nuanced and depends on three properties of the pre-training data:

1. When the scale of the dataset is sufficiently large, CLIP's visual representations indeed transfer better than their matched image-only SimCLR counterparts. In fact, this gap is not bridged by training SimCLR with more (image) data, suggesting that a caption can be worth more than any number of images. However, in the low-data regime, language supervision actually hurts model performance both in- and out-of-distribution.

2. The descriptiveness (Kreiss et al., 2021) of captions, i.e., the extent to which they refer to what is contained in an image, directly determines how well CLIP models transfer. In fact, we find that a single descriptive image-caption pair (e.g., from COCO (Lin et al., 2014)) is worth five less descriptive, uncurated captions (e.g., from YFCC (Thomee et al., 2016)).

3. The variability of captions (e.g., stylistic or lexical) within a dataset can impair CLIP's performance. We find that a modification to standard CLIP training, performing text augmentations by sampling from a pool of captions for each image, can alleviate this drop.

These properties have intertwined effects on CLIP's performance: e.g., dataset scale can, to some extent, compensate for less descriptive and/or varied captions.
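The text augmentation above (sampling from a pool of captions for each image) amounts to a change in how training pairs are drawn each epoch. A minimal sketch, where the function name and dictionary-based caption pool are illustrative rather than the paper's actual implementation:

```python
import random

def build_epoch_pairs(caption_pool, rng=None):
    """One epoch of (image, caption) training pairs with text augmentation.

    `caption_pool` maps an image id to its list of candidate captions
    (e.g., the five reference captions per image in COCO). Rather than
    always pairing an image with one fixed caption, each epoch draws a
    caption uniformly at random from the image's pool, so the model sees
    the caption variability instead of one arbitrary choice.
    """
    rng = rng or random.Random()
    return [(img, rng.choice(caps)) for img, caps in caption_pool.items()]
```

Re-running this at the start of each epoch yields different image-caption pairings over the course of training, analogous to re-sampling image augmentations in SimCLR.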
Guided by our findings, we devise simple dataset interventions that can lead to more transferable CLIP models: (i) filtering out low-quality captions with a text-based classifier, and (ii) applying data augmentation to captions by paraphrasing them using pre-trained language models.
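Schematically, the two interventions compose as a filter followed by a pool expansion. In the sketch below, `score_fn` and `paraphrase_fn` are hypothetical stand-ins for the trained caption-quality classifier and the pre-trained paraphrasing model, not the models used in the paper:

```python
def filter_captions(pairs, score_fn, threshold=0.5):
    """Intervention (i): keep only image-caption pairs whose caption the
    quality classifier scores at or above `threshold`."""
    return [(img, cap) for img, cap in pairs if score_fn(cap) >= threshold]

def expand_caption_pool(pairs, paraphrase_fn, n_paraphrases=2):
    """Intervention (ii): grow each image's caption pool with paraphrases,
    enabling caption sampling during training."""
    return {img: [cap] + [paraphrase_fn(cap) for _ in range(n_paraphrases)]
            for img, cap in pairs}
```

Filtering shrinks the dataset but raises average caption descriptiveness; paraphrasing leaves the image set fixed while supplying the caption variability that per-image sampling needs.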

2. AN APPLES-TO-APPLES COMPARISON

Prior works have studied image-only and image-language pre-training methods in isolation (Wu et al., 2018; He et al., 2020; Chen et al., 2020a; Caron et al., 2020; Chen et al., 2020b; Chen & He, 2021; Caron et al., 2021; Radford et al., 2021) and side-by-side (Desai & Johnson, 2021; Devillers et al., 2021; Fang et al., 2022). Yet, they provide incomplete (and often contradictory) answers to our motivating question of the value of language supervision relative to using images alone (Section 5). Crucially, this is due to various confounders, such as: (i) bespoke algorithmic optimizations within the two methods, and (ii) differing pre-training datasets. In this section, we outline a series of steps that we take to mitigate these confounders and compare the two methods on equal footing.

2.1. FINDING COMMON GROUND

Our approach for studying the value of language supervision is guided by the following insight: CLIP pre-training is strikingly similar to the popular image-only SimCLR method (Chen et al., 2020a).



¹We use CLIP to refer to models trained with Radford et al. (2021)'s approach, not their pre-trained model.



Figure 1: A conceptual view of contrastive image-only and image-language pre-training. Both methods rely on the same self-supervised objective: aligning the representations of positive examples (x, x⁺) while distinguishing them from negative ones (xₙ). The transformation T(·) used to obtain x⁺ ∼ T(x) (augmented image or caption) encodes the equivalences the model must satisfy.
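The shared objective is an InfoNCE-style contrastive loss over a batch of paired views. A minimal NumPy sketch, assuming pre-computed embeddings; the function name and temperature value are illustrative. For SimCLR, the two views are embeddings of two augmentations of each image; for CLIP, they are an image embedding and its caption's embedding from a separate text encoder:

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired views.

    z_a, z_b: (batch, dim) embeddings of the two views of each example.
    Row i of z_a and row i of z_b form the positive pair (x, x+); all
    other rows in the batch act as negatives (x_n).
    """
    # L2-normalize so dot products are cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature   # (batch, batch) similarity matrix
    labels = np.arange(len(z_a))         # positives lie on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Cross-entropy in both directions (a -> b and b -> a), averaged.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Under this view, swapping between image-only and image-language supervision changes only how z_b is produced, leaving the loss untouched.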

