IS A CAPTION WORTH A THOUSAND IMAGES? A STUDY ON REPRESENTATION LEARNING

Abstract

The development of CLIP (Radford et al., 2021) has sparked a debate on whether adding language supervision yields vision models with more transferable representations than traditional image-only methods. Our work studies this question through a carefully controlled comparison of the two approaches in terms of their ability to learn representations that generalize to downstream classification tasks. We find that when the pre-training data meets certain criteria (it is sufficiently large and contains descriptive captions with low variability), image-only methods do not match CLIP's performance, even when they are trained with more image data. However, contrary to what one might expect, there are practical settings in which these criteria are not met, and in these settings the added supervision through captions is actually detrimental. Motivated by our findings, we devise simple data and algorithmic interventions to improve the transfer performance of CLIP-style models.

1. INTRODUCTION

Image-based contrastive learning approaches have shown promise in building models that generalize beyond the data distributions they are trained on (Wu et al., 2018; He et al., 2020; Chen et al., 2020a; Caron et al., 2020; Chen et al., 2020b; Caron et al., 2021). By leveraging large (unlabelled) data sources via self-supervised training, these models learn representations that transfer to diverse image classification tasks, more so than their supervised counterparts (Ericsson et al., 2021). Recently, Radford et al. (2021) showed that a different approach, contrastive learning with language supervision, can yield models (CLIP) with remarkable transfer capabilities. This development has garnered significant interest in the vision and natural language processing communities alike, leading to a debate on the utility of multi-modality in visual representation learning (Zhai et al., 2022; Devillers et al., 2021; Fang et al., 2022). Our work focuses on a specific question within this debate: does added language supervision lead to more transferable visual representations than using images alone?

It might seem like the answer to this question is obvious. After all, CLIP utilized caption information unavailable to traditional image-based approaches and showed substantial gains over them (Radford et al., 2021). However, CLIP differs drastically from these approaches in many ways, from training data to fine-grained implementation choices, which makes it difficult to isolate the contribution of language supervision (see Section 5). Further, recent studies of CLIP's zero-shot classification and robustness properties cast doubt on whether adding language supervision is always beneficial (Fang et al., 2022). Resolving the aforementioned debate thus requires a carefully controlled comparison of the two approaches in which the only difference is the form of supervision.

Our contributions.
We devise a methodology to assess the utility of language supervision in CLIP¹ from a visual representation learning standpoint. To do so, we recognize that CLIP pre-training and popular image-based methods share the same underlying primitive of contrastive learning. Specifically, Radford et al. (2021)'s approach is strikingly similar to SimCLR (Chen et al., 2020a). The only irreducible difference between them is whether supervision is provided to the



¹ We use CLIP to refer to models trained with Radford et al. (2021)'s approach, not their pre-trained model.
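The shared primitive referred to above, a symmetric contrastive (InfoNCE-style) objective over a batch of paired embeddings, can be illustrated with a minimal NumPy sketch. This is our own illustrative code, not the paper's or OpenAI's implementation; the function name, temperature value, and embedding shapes are assumptions for exposition.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_emb, text_emb: (batch, dim) arrays; row i of each is a matched pair.
    In CLIP-style training the two towers embed images and captions; in a
    SimCLR-style setup both would embed two augmented views of the same image.
    """
    # L2-normalize so dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix; matched pairs lie on the diagonal.
    logits = image_emb @ text_emb.T / temperature
    labels = np.arange(len(logits))

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Under this view, the form of supervision only changes what populates the second tower's batch (captions versus augmented views); the loss itself is identical, which is what makes the controlled comparison possible.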

