WHEN AND WHY VISION-LANGUAGE MODELS BEHAVE LIKE BAGS-OF-WORDS, AND WHAT TO DO ABOUT IT?

Abstract

Despite the use of large vision and language models (VLMs) in many downstream applications, it is unclear how well they encode the compositional relationships between objects and attributes. Here, we create the Attribution, Relation, and Order (ARO) benchmark to systematically evaluate the ability of VLMs to understand different types of relationships, attributes, and order information. ARO consists of Visual Genome Attribution, to test the understanding of objects' properties; Visual Genome Relation, to test for relational understanding; and COCO-Order & Flickr30k-Order, to test for order sensitivity in VLMs. ARO is orders of magnitude larger than previous benchmarks of compositionality, with more than 50,000 test cases. We present the settings in which state-of-the-art VLMs behave like bags-of-words, i.e., they have poor relational understanding, can blunder when linking objects to their attributes, and demonstrate a severe lack of order sensitivity. VLMs are predominantly trained and evaluated on large-scale datasets with rich compositional structure in the images and captions. Yet, training on these datasets has not been enough to address the lack of compositional understanding, and evaluating on these datasets has failed to surface this deficiency. To understand why these limitations emerge and are not reflected in standard tests, we zoom into the training and evaluation procedures. We demonstrate that it is possible to perform well on image-text retrieval over existing datasets without using composition and order information. This further motivates the value of using ARO to benchmark VLMs. Given that contrastive pretraining optimizes for retrieval on large datasets with similar shortcuts, we hypothesize that this can explain why the models do not need to learn to represent compositional information. This finding suggests a natural solution: composition-aware hard negative mining.
We show that a simple-to-implement modification of contrastive learning significantly improves performance on tasks requiring an understanding of order and compositionality.
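The core idea behind composition-aware hard negatives can be illustrated with a minimal sketch: for each caption in a training batch, generate a perturbed version that preserves the bag of words but scrambles their order, and treat it as an additional negative for the contrastive loss. The function below is a hypothetical illustration, not the paper's exact implementation; the `rng` parameter and function name are our own choices for reproducibility.

```python
import random


def make_order_negative(caption, rng=None):
    """Create a hard negative caption by shuffling word order.

    The negative contains exactly the same words as the original,
    so a bag-of-words model cannot distinguish the two, while an
    order-sensitive model should score the original higher.
    """
    rng = rng or random.Random(0)  # fixed seed for reproducibility (illustrative)
    words = caption.split()
    shuffled = words[:]
    # Reshuffle until the order actually changes (possible whenever
    # the caption has at least two distinct word positions).
    while len(words) > 1 and shuffled == words:
        rng.shuffle(shuffled)
    return " ".join(shuffled)
```

During training, such negatives would be appended to the batch's text candidates, so the contrastive objective explicitly penalizes matching an image to an order-scrambled caption, e.g. pairing "the horse is eating the grass" against a shuffle like "grass the eating horse is the".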

1. INTRODUCTION

Vision and language models (VLMs) have demonstrated high performance on dozens of well-established benchmarks (Radford et al., 2021; Li et al., 2022; Singh et al., 2022; Alayrac et al., 2022; Wang et al., 2022a;b; Zhai et al., 2022). Yet it is unclear whether performance on these benchmarks indicates rich compositional understanding of either text or images. For example, does CLIP distinguish between "the horse is eating the grass" and "the grass is eating the horse"? Natural scenes are complex, composed of many objects and attributes in relationships with one another. While there have been important efforts to test compositional representations of objects, attributes, and relations (Thrush et al., 2022), such efforts are based on small sets of hand-crafted examples, often combined with testing many other types of knowledge. This makes it hard to evaluate the role of relational and attributional knowledge in isolation and lacks the statistical power to quantify how well VLMs perform on granular subtypes of compositions. Here, we provide a large-scale test bed to evaluate VLMs' attribution, relation, and order understanding. Using this test bed, we find significant deficiencies: many models fail to perform beyond chance level at simple tasks requiring compositional understanding.

