LEARNING TO DECOMPOSE VISUAL FEATURES WITH LATENT TEXTUAL PROMPTS

Abstract

Recent advances in pre-training vision-language models like CLIP (Radford et al., 2021) have shown great potential in learning transferable visual representations. Nonetheless, for downstream inference, CLIP-like models suffer from either 1) degraded accuracy and robustness when inferring by retrieving textual class names (the zero-shot protocol); or 2) a broken vision-language alignment (linear probing). To combine the best of both worlds, we propose Decomposed Feature Prompting (DeFo). DeFo maintains the dual-model architecture yet leverages learnable embeddings as textual input and performs classification with an additional linear layer. As a result, we find that DeFo is able to extract decomposed visual features with the help of textual prompts and supports a scalable number of language inputs. Our empirical study shows DeFo's significance in improving vision-language models. For example, DeFo obtains 73.2% test accuracy on ImageNet with a ResNet-50 backbone without tuning any pretrained weights of either the vision or the language encoder, outperforming zero-shot CLIP by a large margin of 15.0%, and outperforming state-of-the-art vision-language prompt tuning by 7.6%.

1. INTRODUCTION

Language-guided visual pretraining has gained a lot of attention and shows great promise in learning transferable image representations. By establishing a connection between images and natural language, recent vision-language models are able to turn visual inference over a restricted number of classes into zero-shot open-vocabulary inference (Radford et al., 2021; Jia et al., 2021; Pham et al., 2021). One of the recent successes in zero-shot inference is the contrastive language-image pretraining (CLIP) model (Radford et al., 2021). It uses 400 million image-text pairs to learn an alignment between visual and textual representations obtained from a vision encoder and a language encoder, respectively. In downstream applications, CLIP-like models (Radford et al., 2021; Jia et al., 2021; Pham et al., 2021) then perform zero-shot inference by hard-target retrieval, i.e., they directly compute the distance between a vectorial image representation obtained from the vision encoder and the representations of text prompts (e.g., "a photo of an airplane" or "a photo of an automobile") obtained from the language encoder. The target class (e.g., "airplane" or "automobile") corresponding to the text prompt with the smallest distance to the vector representing the image constitutes the zero-shot inference result. When annotations are given, simple linear probing (i.e., removing the language encoder and training a linear classifier on top of the frozen vision encoder) further improves the results (Radford et al., 2021). Moreover, context optimization (CoOp) (Zhou et al., 2021) replaces the hand-crafted prefix or suffix of the text prompts (e.g., "a photo of a") with trainable embedding vectors. However, zero-shot CLIP and CoOp infer using hard textual targets, i.e., the class names, which results in two main challenges.
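The hard-target-retrieval protocol described above can be sketched in a few lines. This is a minimal illustration, not CLIP's actual implementation: the random vectors below stand in for embeddings that, in practice, come from CLIP's pretrained vision and language encoders, and we use cosine similarity (highest similarity = smallest distance) as in CLIP.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """Zero-shot retrieval: return the index of the class whose prompt
    embedding has the highest cosine similarity to the image embedding."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img  # one cosine-similarity score per class prompt
    return int(np.argmax(sims))

# Toy example with random stand-in embeddings (real ones would come from
# the pretrained encoders).
rng = np.random.default_rng(0)
prompts = ["a photo of an airplane", "a photo of an automobile"]
text_embs = rng.normal(size=(len(prompts), 512))
# An "image" embedding lying close to the "automobile" prompt embedding:
image_emb = text_embs[1] + 0.1 * rng.normal(size=512)
print(prompts[zero_shot_classify(image_emb, text_embs)])
# → "a photo of an automobile"
```

Note that the class name is baked into the prompt string itself, which is precisely what makes this protocol sensitive to the wording of class names, as discussed next.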
First, class names in text prompts (e.g., "airplane" or "automobile"), as used in zero-shot CLIP and CoOp inference, often fail to accurately summarize the semantic information of an image. Therefore, inference is very sensitive to the words chosen as class names. We refer to this challenge as expressive sensitivity. Empirically, this challenge causes zero-shot CLIP and CoOp to struggle to achieve results as competitive as linear probing with the same image encoder when downstream training data is available (e.g., 58.2% vs. 72.3% accuracy on ImageNet (Deng et al., 2009)). Moreover, this sensitivity can be observed by modifying class names. For example, for zero-shot inference on CIFAR-10 (Krizhevsky et al., 2009), CLIP obtains an accuracy of 63.7% when the original class names are used. Notably, simply replacing or extending the class names with suitable synonyms found using WordNet (Fellbaum, 2010) (e.g., "plane" and "car" rather than "airplane" and "automobile") can improve accuracy to 79.6%, which highlights the challenge of expressive sensitivity. Second, despite the fact that hundreds of millions of pretraining samples cover a large number of concepts that can possibly appear in downstream datasets, zero-shot inference continues to struggle to recognize rare objects. We refer to this as conceptual sensitivity. For example, zero-shot CLIP is only 38.5% accurate when classifying EuroSAT satellite images (Helber et al., 2019), which is much lower than the result of a supervised ResNet-50 (He et al., 2016) encoder (93.4%). Also, zero-shot CLIP with a ResNet-50 encoder achieves less than 90% accuracy on MNIST (LeCun, 1998), where it can even be outperformed by a simple logistic regression model. While linear probing is a straightforward way to improve results, removing the language encoder breaks the vision-language alignment learned from the pretraining data, and therefore degrades few-shot and transfer learning performance.
In this paper, we propose Decomposed Feature Prompting (DeFo), which turns the hard-target-retrieval paradigm of CLIP and CoOp into dual-model feature prompting. Specifically, DeFo 1) provides the language encoder with a set of learnable embedding sequences that are independent of the hard semantic targets; and 2) performs classification by tuning an additional linear layer. As a result, DeFo does not rely on the textual representations of class names as classification targets, which addresses the issues of expressive sensitivity and conceptual sensitivity. Meanwhile, DeFo maintains the dual-model architecture, which enables the model to leverage the language information, so that few-shot and transfer learning performance can be boosted. DeFo's results show the significance of addressing the sensitivity challenges of CLIP-like models. For example, with a ResNet-50 backbone, DeFo achieves 73.2% test accuracy on ImageNet without modifying any pretrained weights of the image and text encoders, outperforming vanilla CLIP by a large margin of 15.0% and outperforming CoOp by 7.6%. Across a variety of visual contexts, DeFo attains an average accuracy of 79.9% over 11 image classification benchmarks, which is 21.0% higher than zero-shot CLIP and 6.2% higher than CoOp.

2. RELATED WORK

Pretraining-finetuning has long been a dominant paradigm of transfer learning in machine learning, computer vision, and natural language processing. Generally, pretraining a vision encoder with generative objectives (Bao et al., 2021; He et al., 2022) or discriminative objectives (He et al., 2020; Chen et al., 2020; Grill et al., 2020; Caron et al., 2021) at the scale of one to ten million images (Deng et al., 2009) is sufficient to yield good visual representations and strong predictive performance on downstream visual tasks. However, without supervision from other modalities, such pretrained models require task-specific finetuning (Bao et al., 2021; He et al., 2022; O Pinheiro et al., 2020; Wang et al., 2022a; Lin et al., 2022a) or linear probing (He et al., 2020; Chen et al., 2020) for reasonably domain-adapted predictions. The contrastive language-image pretraining (CLIP) (Radford et al., 2021) method instead jointly pretrains a vision encoder and a text encoder on 400 million curated image-text pairs, with a contrastive objective (Gutmann & Hyvärinen, 2010) that matches the visual and textual representations. In downstream applications, CLIP achieves competitive results in various vision and vision-language tasks such as image classification (Zhou et al., 2021; Gao et al., 2021), dense prediction (Rao et al., 2022), video-language tasks (Luo et al., 2021; Lin et al., 2022b; Wang et al., 2022b), image manipulation (Patashnik et al., 2021), and multimedia event extraction (Li et al., 2022). Following the success of CLIP, the ALIGN (Jia et al., 2021) model leverages a noisy dataset of 1.8 billion image-text pairs to scale up vision-language representation learning, and the BASIC (Pham et al., 2021) model further scales up this approach in terms of data and model size. Based on the success of CLIP-like vision-language pretraining, a series of follow-up inference approaches have been proposed to improve classification results. For example, Zhou et al. (2021) propose CoOp to learn







