WHEN AND WHY VISION-LANGUAGE MODELS BEHAVE LIKE BAGS-OF-WORDS, AND WHAT TO DO ABOUT IT?

Abstract

Despite the use of large vision and language models (VLMs) in many downstream applications, it is unclear how well they encode the compositional relationships between objects and attributes. Here, we create the Attribution, Relation, and Order (ARO) benchmark to systematically evaluate the ability of VLMs to understand different types of relationships, attributes, and order information. ARO consists of Visual Genome Attribution, to test the understanding of objects' properties; Visual Genome Relation, to test for relational understanding; and COCO-Order & Flickr30k-Order, to test for order sensitivity in VLMs. ARO is orders of magnitude larger than previous benchmarks of compositionality, with more than 50,000 test cases. We present the settings in which state-of-the-art VLMs behave like bags-of-words, i.e., they have poor relational understanding, can blunder when linking objects to their attributes, and demonstrate a severe lack of order sensitivity. VLMs are predominantly trained and evaluated on large-scale datasets with rich compositional structure in the images and captions. Yet, training on these datasets has not been enough to address the lack of compositional understanding, and evaluating on these datasets has failed to surface this deficiency. To understand why these limitations emerge and are not reflected in the standard tests, we zoom into the training and evaluation procedures. We demonstrate that it is possible to perform well on image-text retrieval over existing datasets without using composition and order information. This further motivates the value of using ARO to benchmark VLMs. Given that contrastive pretraining optimizes for retrieval on large datasets with similar shortcuts, we hypothesize that this can explain why the models do not need to learn to represent compositional information. This finding suggests a natural solution: composition-aware hard negative mining.
We show that a simple-to-implement modification of contrastive learning significantly improves the performance on tasks requiring an understanding of order and compositionality.

1. INTRODUCTION

Vision and language models (VLMs) have demonstrated high performance on dozens of well-established benchmarks (Radford et al., 2021; Li et al., 2022; Singh et al., 2022; Alayrac et al., 2022; Wang et al., 2022a;b; Zhai et al., 2022). Yet it is unclear whether performance on these benchmarks indicates rich compositional understanding of either text or images. For example, does CLIP distinguish between "the horse is eating the grass" and "the grass is eating the horse"? Natural scenes are complex, composed of many objects and attributes, in relationships with one another. While there have been important efforts to test compositional representations of objects, attributes, and relations (Thrush et al., 2022), such efforts are based on small sets of hand-crafted examples, often combined with testing many other types of knowledge. This makes it hard to evaluate the role of relational and attributional knowledge in isolation and lacks the statistical power to quantify how well VLMs perform on granular subtypes of compositions. Here, we provide a large-scale test bed to evaluate VLMs' attribution, relation, and order understanding. Using the test bed we create, we find significant deficiencies: many models fail to perform beyond chance level at simple tasks requiring compositional understanding. Many VLMs are pretrained and tested on large datasets with complex scenes and detailed captions with rich compositional structure. Yet, training on these datasets has not been enough to address the lack of compositional understanding, and evaluating on these datasets has failed to surface this deficiency. In the recent literature, the dominant VLM training paradigm is image-text contrastive pretraining (Jia et al., 2021; Radford et al., 2021; Zhang et al., 2020) over these large pretraining datasets. Contrastive pretraining optimizes for the task of image-text retrieval, and naturally many VLMs are tested on the retrieval task.
In this work, we provide an analysis of retrieval as both an evaluation task and a training objective. We propose experiments that analyze how these models are evaluated and trained, to understand the underlying issues.¹ Our main contributions are three-fold:

1. Introducing the Attribution, Relation, and Order benchmark (ARO) for fine-grained evaluation of VLMs' relation, attribution, and order understanding. We present four new tasks: Visual Genome Attribution and Visual Genome Relation, to test the understanding of objects' attributes and relations in complex natural scenes; and COCO Order and Flickr30k Order, to test the models' ability to identify the correct ordering of the words in a caption (Section 2). Using these evaluations, we show that state-of-the-art VLMs fail to represent simple relations such as "to the right of" and "behind", fail to represent the attributive difference between "the black jacket and the blue sky" versus "the blue jacket and the black sky", and fail to represent the difference between correct and permuted captions. We provide fine-grained insights into the types of attributions and relations that models most frequently fail to understand.

2. A critique of retrieval and contrastive pretraining. Given that we find VLMs exhibit poor compositional understanding, why have these issues not surfaced in many previous evaluations? Existing retrieval datasets feature complex scenes and detailed captions, typically full of rich compositional structure. Intriguingly, models can perform well on retrieval without having a good compositional understanding. Our experiments (Section 3) show that models can achieve high retrieval performance even when the order and composition cues are removed from captions or images. Hence, it is natural that models with compositional deficiencies can still perform well on the standard evaluations.
This suggests that standard retrieval tasks are limited in their ability to assess a model's compositional understanding, further motivating the need for our comprehensive ARO benchmark. Since contrastive pretraining optimizes for retrieval, our findings also show that models can perform well under contrastive pretraining without learning compositional information. Given our results, we argue that not learning compositional information is a valid shortcut strategy (Geirhos et al., 2020), and that VLMs have little incentive to learn to encode compositionality during contrastive pretraining.

3. Composition-aware hard negatives can go a long way. We propose a simple fix: mining composition-aware hard negatives (Section 4). First, we introduce the nearest-neighboring images as hard negatives into each batch, forcing models to represent fine-grained differences between very similar scenes. Second, we introduce hard negative captions into each batch, consisting of the true captions with their word order perturbed, forcing models to distinguish between correct and incorrect order. Finally, we show that this simple finetuning modification provides significant improvements in model understanding of attributes and relations.
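To make the caption side of this procedure concrete, the sketch below generates an order-perturbed hard negative per caption and adds the resulting embeddings as extra negative columns in a CLIP-style InfoNCE loss. This is a minimal NumPy sketch under simplifying assumptions: random word shuffling stands in for the perturbation strategy, only the image-to-text direction of the loss is shown, and the function names are illustrative rather than our actual implementation.

```python
import random
import numpy as np

def perturb_caption(caption, rng):
    """Return an order-perturbed hard negative: same words, different order."""
    words = caption.split()
    if len(set(words)) < 2:          # nothing to permute
        return caption
    shuffled = words[:]
    while shuffled == words:
        rng.shuffle(shuffled)
    return " ".join(shuffled)

def nce_loss_with_caption_negatives(img_emb, txt_emb, neg_txt_emb, temperature=0.07):
    """Image-to-text InfoNCE loss where image i's positive is row i of txt_emb,
    and the rows of neg_txt_emb (order-perturbed captions) join the batch as
    additional hard negatives. All inputs are (N, d) embedding arrays."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    img = normalize(img_emb)
    txt = normalize(np.concatenate([txt_emb, neg_txt_emb], axis=0))
    logits = img @ txt.T / temperature           # (N, 2N) similarity logits
    # Stable log-softmax; the target for image i is column i (its true caption).
    m = logits.max(axis=1, keepdims=True)
    log_probs = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    n = img.shape[0]
    return float(-log_probs[np.arange(n), np.arange(n)].mean())

rng = random.Random(0)
caption = "the horse is eating the grass"
hard_negative = perturb_caption(caption, rng)  # same bag of words, wrong order
```

Because the perturbed caption contains exactly the same bag of words as the true one, a model can only push its score below the positive's by encoding word order, which is the incentive the standard in-batch negatives fail to provide.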

2. ATTRIBUTION, RELATION, AND ORDER (ARO) BENCHMARK: WHEN DO MODELS BEHAVE LIKE A BAG-OF-WORDS?

Whereas humans effortlessly parse natural scenes containing many objects in relation to one another, it is unclear whether machines understand the complexity of these scenes. To do so, models must be able to correctly represent objects, their attributes, and the relations between objects. Recent research has started to probe VLMs for such information.
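Each test case in a benchmark of this kind can be scored as a forced choice: the model passes if the true caption receives the highest image-text similarity among the candidates, so accuracy at chance level is one over the number of candidates. A minimal sketch of this scoring, assuming precomputed similarity scores (the numbers below are hypothetical, not actual model outputs):

```python
import numpy as np

def forced_choice_accuracy(scores):
    """Fraction of test cases where the true caption outscores the alternatives.

    `scores` has shape (n_cases, n_captions); by convention column 0 holds the
    image-text similarity of the true caption and the remaining columns hold
    the perturbed alternatives (e.g., "the grass is eating the horse").
    Chance level is 1 / n_captions.
    """
    return float((scores.argmax(axis=1) == 0).mean())

# Toy example with hypothetical similarity scores:
scores = np.array([
    [0.31, 0.28],   # true caption wins
    [0.22, 0.25],   # perturbed caption wins: bag-of-words behavior
    [0.40, 0.12],   # true caption wins
])
accuracy = forced_choice_accuracy(scores)  # 2 of 3 cases correct; chance is 0.5
```

A model that ignores word order scores the true and permuted captions identically and cannot do better than chance under this protocol.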



Code is available at github.com/mertyg/vision-language-models-are-bows.



Thrush et al. (2022) proposed Winoground, a dataset of test cases documenting a clear lack of compositional and pragmatic understanding in VLMs. The dataset is high quality but relatively small scale: its 400 test cases cover a wide range of linguistic phenomena (e.g., relation, pragmatics, world knowledge), making it hard to obtain statistically significant results about fine-grained relational and attributive abilities. In concurrent work, Diwan et al. (2022) suggest that Winoground poses further challenges beyond compositionality.

