DELVING INTO THE OPENNESS OF CLIP

Anonymous authors
Paper under double-blind review

Abstract

Contrastive Language-Image Pre-training (CLIP) has demonstrated great potential for realizing open-vocabulary visual recognition in a matching style, owing to its holistic use of natural language supervision that covers unconstrained real-world visual concepts. However, it is, in turn, also difficult to evaluate and analyze the openness of CLIP-like models, since they are in theory open to any vocabulary while their actual accuracy varies. To address the insufficiency of conventional studies of openness, we resort to an incremental perspective and define extensibility, which essentially approximates a model's ability to deal with new visual concepts, by evaluating openness through vocabulary expansion. Our evaluation based on extensibility shows that CLIP-like models are hardly truly open, and their performance degrades to varying degrees as the vocabulary expands. Further analysis reveals that the over-estimation of openness arises not because CLIP-like models fail to capture the general similarity between image and text features of novel visual concepts, but because of confusion among competing text features; that is, the models are not stable with respect to the vocabulary. In light of this, we propose to improve the openness of CLIP in feature space by enforcing the distinguishability of text features. Our method retrieves relevant texts from the pre-training corpus to enhance prompts for inference, which boosts the extensibility and stability of CLIP even without fine-tuning.

1. INTRODUCTION

The quest for an intrinsically open mechanism of visual recognition (Deng et al., 2009; He et al., 2016) has always been a shared goal in the computer vision community (Scheirer et al., 2013; Geng et al., 2021; Bendale & Boult, 2015). It requires models to remain flexible to the scaling of the recognition target, where both input images and the corresponding classes dynamically expand according to actual needs. For example, in medical diagnosis (Razzak et al., 2017), new diseases emerge constantly, and in e-commerce, new categories of products appear daily (Xu et al., 2019); such classes cannot be predefined in a finite set that remains fixed during inference. Faced with this challenging open-world recognition problem, traditional supervised classifiers have struggled, as they only learn to discriminate a limited number of classes in a closed set and cannot adapt to the scaling of target classes. However, the emergence of Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) and its open-vocabulary learning paradigm has reversed this situation. CLIP models visual recognition as an image-text matching task rather than classic image classification. It is pre-trained on web-scale collections of image-text pairs, learning unconstrained visual concepts from natural language supervision with contrastive learning. During inference, it devises a textual prompt like "a photo of a [CLASS]", where the class token can be replaced by any potential class name from a vocabulary. The prompt-formed class description with the highest similarity to the input image is predicted as the target class. This modeling paradigm makes CLIP operationally suitable for open tasks in the real world.
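The matching-style inference above can be sketched in a few lines. This is a minimal illustration, not CLIP's actual implementation: `text_encoder` stands in for CLIP's text tower and is supplied by the caller, and the toy two-dimensional features in the demo are invented for the example.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_predict(image_feat, class_names, text_encoder):
    """Return the class whose prompt embedding best matches the image feature."""
    prompts = [f"a photo of a {name}" for name in class_names]
    sims = [cosine_sim(text_encoder(p), image_feat) for p in prompts]
    return class_names[max(range(len(sims)), key=sims.__getitem__)]

# Toy demo: hand-crafted 2-D "embeddings" in place of real CLIP features.
toy_text_feats = {"a photo of a cat": [1.0, 0.0], "a photo of a dog": [0.0, 1.0]}
print(zero_shot_predict([0.9, 0.1], ["cat", "dog"], toy_text_feats.__getitem__))
```

Because the vocabulary is just a list of strings, swapping in new class names requires no re-training, only new prompt embeddings.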
When the input images and target classes change, CLIP can still conduct zero-shot inference by adaptively adjusting the class names in the vocabulary and modifying the corresponding class descriptions for matching, sparing the re-training on new data that traditional classification-based methods require. Nevertheless, contrary to the note "CLIP has a wide range of capabilities due to its ability to carry out arbitrary image classification tasks" in (Radford et al., 2021), previous evaluation of CLIP is still limited to the closed set, leaving its actual performance on open tasks unexamined. In this work, we rethink openness, this intriguing but under-explored property of CLIP, and present a protocol for evaluating the openness of CLIP-like models (Radford et al., 2021; Li et al., 2021b; Mu et al., 2021; Yao et al., 2021; Zhou et al., 2021) from an incremental view. Specifically, we define extensibility, which essentially approximates a model's ability to deal with new visual concepts through vocabulary expansion. Our experimental results based on extensibility show that CLIP and its variants suffer a significant drop in accuracy, e.g., 12.9% for CLIP (RN101) on CIFAR100 as the vocabulary expands from 5 to 100 classes, indicating that the limited zero-shot capability of CLIP-like models is not sufficient to support their deployment in the open world. Different from previous openness-related work, we focus on analyzing how the new class descriptions introduced with vocabulary expansion affect the stability of classification on the old input images. Our investigation reveals that the small margin between text features of different classes leads to the prediction shift.
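The evaluation idea can be made concrete with a small sketch: measure top-1 accuracy on the same test samples as the candidate vocabulary grows. The helper names (`accuracy_under_vocab`, `extensibility_curve`) and the toy ranked-prediction oracle in the test are our own illustrative assumptions, not the paper's exact protocol; the point is only that adding non-target classes can flip previously correct predictions.

```python
def accuracy_under_vocab(samples, vocab, predict):
    """Top-1 accuracy of `predict` when restricted to candidate set `vocab`.

    `samples` is a list of (image_feature, label) pairs; `predict` maps
    (image_feature, vocab) to a class name drawn from `vocab`.
    """
    correct = sum(1 for feat, label in samples if predict(feat, vocab) == label)
    return correct / len(samples)

def extensibility_curve(samples, expansions, predict):
    """Accuracy at each vocabulary stage as new classes are introduced.

    `expansions` is a sequence of vocabularies of increasing size; a truly
    open model would keep this curve flat.
    """
    return [accuracy_under_vocab(samples, vocab, predict) for vocab in expansions]
```

A declining curve indicates the prediction shift described above: old images get re-assigned to newly added competing classes.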
To improve the distinguishability of text features and the semantic alignment between images and their textual descriptions, we propose a non-parametric method named Retrieval-enhanced Prompt Engineering (REPE), which retrieves relevant captions from the pre-training corpus to customize the prompt for each class during zero-shot inference. To summarize, our contribution is three-fold: (1) To the best of our knowledge, we are the first to systematically investigate the openness of CLIP, for which we design an evaluation protocol and two indicators, extensibility and stability. Through an analysis of the prediction shift during vocabulary expansion, we find that the performance of CLIP is greatly reduced by adding a small number of adversarial non-target classes, exposing the substantial risk of its deployment in the open world. (2) We further dissect the feature space of CLIP from the perspectives of representation alignment and uniformity, observing that the uniformity of the textual space is critical for better extensibility. (3) We propose a simple yet effective method, REPE, to improve the extensibility and stability of CLIP without fine-tuning.
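The retrieval-then-prompt idea behind REPE can be sketched as follows. Everything here is a simplified stand-in: real REPE would retrieve by embedding similarity from the actual pre-training corpus, whereas this toy version matches captions by substring and averages toy embeddings; `encode` is a caller-supplied placeholder for CLIP's text encoder.

```python
def retrieve_captions(corpus, class_name, k=3):
    """Pick up to k corpus captions mentioning the class (toy substring retrieval)."""
    hits = [c for c in corpus if class_name in c.lower()]
    return hits[:k]

def repe_text_feature(class_name, corpus, encode, k=3):
    """Build a class text feature by averaging the base prompt embedding
    with embeddings of retrieved captions, enriching the class description."""
    texts = [f"a photo of a {class_name}"] + retrieve_captions(corpus, class_name, k)
    feats = [encode(t) for t in texts]
    dim = len(feats[0])
    return [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
```

Since the enriched features come purely from retrieval and averaging at inference time, no parameters are updated, which is consistent with the "without fine-tuning" claim.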

2. OPENNESS, EXTENSIBILITY, AND STABILITY

In this section, we first review CLIP's visual recognition paradigm based on image-text matching, and then demonstrate how it realizes open-vocabulary image classification in theory via vocabulary expansion ( § 2.1). To quantify the actual performance of CLIP-like models as the vocabulary expands, we define the metric of extensibility and propose a systematic evaluation protocol ( § 2.2). The experimental results and further analysis reveal that, as the vocabulary expands, the predictions of CLIP are unstable and prone to drift toward newly introduced competing class descriptions, which limits its extensibility and poses a substantial security risk when deployed in real-world applications ( § 2.3).

2.1. OPENNESS OF CLIP

In contrast to classic supervised methods (He et al., 2016; Dosovitskiy et al., 2021), CLIP (Radford et al., 2021) models visual recognition as an image-text matching task with self-supervised contrastive pre-training. Formally, let $f$ be the CLIP model; it takes an image $x$ and a target vocabulary $V^{(T)} = \{w_i\}$ of class names $w_i$ as inputs, and outputs the predicted label $\hat{y}$ of the image as:

$$\hat{y} = f\left(x, V^{(T)}\right) = \arg\max_i P(y = i \mid x) = \arg\max_i \frac{\exp\left(\mathrm{sim}\left(f_T(t_i), f_I(x)\right)\right)}{\sum_{j=1}^{|V^{(T)}|} \exp\left(\mathrm{sim}\left(f_T(t_j), f_I(x)\right)\right)},$$

where $t_i$ is the textual description of the class name $w_i$ in a prompt format, e.g., "a photo of a $w_i$", $\mathrm{sim}(\cdot, \cdot)$ denotes cosine similarity, and $f_T$ and $f_I$ are the text and image encoders of CLIP, respectively. Such a modeling paradigm can in theory realize open-world image classification by extending the target vocabulary $V^{(T)}$ to arbitrary degrees. However, in most previous work (Radford et al., 2021; Li et al., 2021b; Mu et al., 2021; Yao et al., 2021; Zhou et al., 2021), CLIP is evaluated with a fixed $V^{(T)}$ determined by the target classes of the downstream dataset $D^{(T)}$:

$$\mathrm{Acc}\left(V^{(T)}\right) = \frac{1}{|D^{(T)}|} \sum_{(x, y) \in D^{(T)}} \mathbb{I}\left[f\left(x, V^{(T)}\right) = y\right],$$

where $|D^{(T)}|$ denotes the size of the dataset, and $\mathbb{I}(\cdot)$ is the indicator function. This vanilla evaluation setting with restricted input images and classes is insufficient for open recognition tasks, as it
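The two equations above can be implemented directly. This is a minimal sketch under toy assumptions: `predict` is any function returning a class name for an (image, vocabulary) pair, and the similarity scores are plain floats rather than real CLIP features.

```python
import math

def class_probabilities(sims):
    """Softmax over image-text similarities: P(y = i | x) from the prediction rule."""
    exps = [math.exp(s) for s in sims]
    z = sum(exps)
    return [e / z for e in exps]

def accuracy(dataset, predict, vocab):
    """Acc(V^(T)): average of the indicator that the prediction matches the label.

    `dataset` is a list of (x, y) pairs; `predict(x, vocab)` returns a class name.
    """
    return sum(predict(x, vocab) == y for x, y in dataset) / len(dataset)
```

Note that since softmax is monotone, the arg-max prediction depends only on the raw similarities; the probabilities matter when analyzing the margins between competing classes.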

