LEARNING TO DECOMPOSE VISUAL FEATURES WITH LATENT TEXTUAL PROMPTS

Abstract

Recent advances in pre-trained vision-language models such as CLIP (Radford et al., 2021) have shown great potential in learning transferable visual representations. Nonetheless, for downstream inference, CLIP-like models suffer from either 1) degraded accuracy and robustness when inference is performed by retrieving textual class names (the zero-shot protocol); or 2) a broken vision-language alignment (linear probing). To combine the best of both worlds, we propose Decomposed Feature Prompting (DeFo). DeFo maintains the dual-model architecture yet leverages learnable embeddings as textual input and performs classification with an additional linear layer. As a result, DeFo is able to extract decomposed visual features with the help of textual prompts, and it allows the size of the language input to scale. Our empirical study shows DeFo's significance in improving vision-language models. For example, DeFo obtains 73.2% test accuracy on ImageNet with a ResNet-50 backbone without tuning any pretrained weights of either the vision or the language encoder, outperforming zero-shot CLIP by a large margin of 15.0% and outperforming state-of-the-art vision-language prompt tuning by 7.6%.

1. INTRODUCTION

Language-guided visual pretraining has gained a lot of attention and shows great promise in learning transferable image representations. By establishing a connection between images and natural language, recent vision-language models are able to turn visual inference over a restricted number of classes into zero-shot open-vocabulary inference (Radford et al., 2021; Jia et al., 2021; Pham et al., 2021). One of the recent successes for zero-shot inference is the contrastive language-image pretraining (CLIP) model (Radford et al., 2021). It uses 400 million image-text pairs to learn an alignment between visual and textual representations obtained from a vision encoder and a language encoder, respectively. In downstream applications, CLIP-like models (Radford et al., 2021; Jia et al., 2021; Pham et al., 2021) then perform zero-shot inference by hard-target retrieval, i.e., they directly compute the distance between a vectorial image representation obtained from the vision encoder and representations of text prompts (e.g., "a photo of an airplane" or "a photo of an automobile") obtained from the language encoder. The target class (e.g., "airplane" or "automobile") corresponding to the text prompt with the smallest distance to the vector representing the image constitutes the zero-shot inference result. When annotations are given, simple linear probing (i.e., removing the language encoder, freezing the vision encoder, and training a linear classifier on top of it) further improves the results (Radford et al., 2021). Moreover, context optimization (CoOp) (Zhou et al., 2021) replaces the hand-crafted prefix or suffix of the text prompts (e.g., "a photo of a") with trainable embedding vectors. However, zero-shot CLIP and CoOp infer using hard textual targets, i.e., the class names, which results in two main challenges.
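The hard-target retrieval protocol described above can be sketched in a few lines. The following is a minimal illustration, not CLIP's actual implementation: the toy NumPy vectors stand in for the outputs of the (hypothetical) vision and language encoders, and classification reduces to picking the class whose prompt embedding has the highest cosine similarity to the image embedding.

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    return v / np.linalg.norm(v)

def zero_shot_classify(image_feature, prompt_features, class_names):
    """Return the class whose prompt embedding is closest (in cosine similarity)
    to the image embedding -- the hard-target retrieval used by zero-shot CLIP."""
    img = l2_normalize(image_feature)
    sims = [float(img @ l2_normalize(p)) for p in prompt_features]
    return class_names[int(np.argmax(sims))]

# Toy embeddings standing in for encoder outputs (hypothetical values);
# each prompt vector represents e.g. "a photo of an <class name>".
image_feature = np.array([0.9, 0.1, 0.0])
prompts = {
    "airplane":   np.array([1.0, 0.0, 0.1]),
    "automobile": np.array([0.0, 1.0, 0.1]),
}
pred = zero_shot_classify(image_feature,
                          list(prompts.values()), list(prompts.keys()))
print(pred)  # -> airplane
```

Note that the prediction hinges entirely on the embeddings of the class-name prompts, which is exactly what makes this protocol sensitive to the wording of the class names.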
First, class names in text prompts (e.g., "airplane" or "automobile"), as used in zero-shot CLIP and CoOp inference, cannot accurately summarize the semantic information of an image. Inference is therefore very sensitive to the particular words chosen as class names. We refer to this challenge as expressive sensitivity. Empirically, this challenge causes zero-shot CLIP and CoOp to struggle to achieve as competitive results as linear probing with the same image encoder when downstream training data is available (e.g., 58.2% accuracy vs. 72.3%

