DELVING INTO THE OPENNESS OF CLIP

Anonymous authors
Paper under double-blind review

Abstract

Contrastive Language-Image Pre-training (CLIP) has demonstrated great potential for open-vocabulary visual recognition in a matching style, owing to its holistic use of natural language supervision covering unconstrained real-world visual concepts. However, this same property makes the openness of CLIP-like models difficult to evaluate and analyze: in theory they are open to any vocabulary, but their actual accuracy varies. To address the insufficiency of conventional studies on openness, we take an incremental perspective and define extensibility, which approximates a model's ability to handle new visual concepts by evaluating openness through vocabulary expansions. Our evaluation based on extensibility shows that CLIP-like models are hardly truly open: their performance degrades to varying degrees as the vocabulary expands. Further analysis reveals that this over-estimation of openness arises not because CLIP-like models fail to capture the general similarity between image and text features of novel visual concepts, but because of confusion among competing text features; that is, the models are not stable with respect to the vocabulary. In light of this, we propose to improve the openness of CLIP in feature space by enforcing the distinguishability of text features. Our method retrieves relevant texts from the pre-training corpus to enhance prompts for inference, which boosts the extensibility and stability of CLIP even without fine-tuning.

1. INTRODUCTION

The search for an intrinsically open mechanism of visual recognition (Deng et al., 2009; He et al., 2016) has long been a shared goal in the computer vision community (Scheirer et al., 2013; Geng et al., 2021; Bendale & Boult, 2015). It requires models to remain flexible as the recognition target scales, with both input images and the corresponding classes expanding dynamically according to actual needs. For example, in medical diagnosis new diseases emerge constantly (Razzak et al., 2017), and in e-commerce new categories of products appear daily (Xu et al., 2019); such classes cannot be predefined in a finite set that remains fixed during inference. Faced with this challenging open-world recognition problem, traditional supervised classifiers struggle, as they only learn to discriminate a limited set of classes in a closed set and cannot adapt to the scaling of target classes. The emergence of Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) and its open-vocabulary learning paradigm has reversed this situation. CLIP models visual recognition as image-text matching rather than classic image classification. It is pre-trained on web-scale collections of image-text pairs, learning unconstrained visual concepts from natural language supervision with contrastive learning. During inference, it uses a textual prompt such as "a photo of a [CLASS]", where the class token can be replaced by any potential class name from a vocabulary. The prompt-formed class description with the highest similarity to the input image yields the predicted class. This modeling paradigm makes CLIP operationally suitable for open tasks in the real world.
When input images and target classes change, CLIP can still conduct zero-shot inference by adaptively adjusting the class names in the vocabulary and modifying the corresponding class descriptions for matching, sparing the re-training on new data that traditional classification-based methods require. Nevertheless, contrary to the note that "CLIP has a wide range of capabilities due to its ability to carry out arbitrary image classification tasks" (Radford et al., 2021), previous evaluations of CLIP remain limited to closed sets, leaving its actual performance on open tasks unexamined.
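The matching paradigm described above can be sketched in a few lines. The snippet below illustrates only the inference-time matching step (L2-normalize features, take cosine similarity, pick the argmax); the toy vectors stand in for the outputs of CLIP's image and text encoders, and all names and values are illustrative, not the paper's implementation. Note that extending the vocabulary amounts to appending rows to the text-feature matrix, with no re-training.

```python
import numpy as np

def zero_shot_classify(image_feat, text_feats, class_names):
    """Predict the class whose prompt embedding best matches the image.

    image_feat: (d,) image embedding.
    text_feats: (n, d) embeddings of prompts like "a photo of a [CLASS]",
                one row per entry of class_names.
    """
    # L2-normalize so that a dot product equals cosine similarity.
    img = image_feat / np.linalg.norm(image_feat)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sims = txt @ img  # cosine similarity between the image and each prompt
    return class_names[int(np.argmax(sims))], sims

# Toy embeddings standing in for real encoder outputs (hypothetical values).
classes = ["cat", "dog", "car"]
text_feats = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [0.7, 0.7]])
image_feat = np.array([0.9, 0.1])  # points mostly along the "cat" direction

pred, _ = zero_shot_classify(image_feat, text_feats, classes)
print(pred)  # → cat
```

Adding a new class under this scheme only requires embedding one more prompt and appending it to `text_feats`; the paper's analysis concerns precisely how stable the prediction remains as such rows are added.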

