F-VLM: OPEN-VOCABULARY OBJECT DETECTION UPON FROZEN VISION AND LANGUAGE MODELS

Abstract

We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models. F-VLM simplifies the current multi-stage training pipeline by eliminating the need for knowledge distillation or detection-tailored pretraining. Surprisingly, we observe that a frozen VLM: 1) retains the locality-sensitive features necessary for detection, and 2) is a strong region classifier. We finetune only the detector head and combine the detector and VLM outputs for each region at inference time. F-VLM shows compelling scaling behavior and achieves a +6.5 mask AP improvement over the previous state-of-the-art on the LVIS open-vocabulary detection benchmark at the system level. In addition, we demonstrate very competitive results on the COCO open-vocabulary detection benchmark and in cross-dataset transfer detection, along with significant training speedups and compute savings. The code will be released.

1. INTRODUCTION

Detection is a fundamental vision task that aims to localize and recognize objects in an image. However, the data collection process of manually annotating bounding boxes or instance masks is tedious and costly, which limits the modern detection vocabulary size to an order of 10^3. This is orders of magnitude smaller than the vocabulary humans use to describe the visual world. To overcome this limitation, we focus on open-vocabulary object detection (Zareian et al., 2021; Gu et al., 2022) to take detection beyond a fixed vocabulary.

Recently, vision and language models (VLMs) have gained strong open-vocabulary visual recognition capability by learning from Internet-scale image-text pairs (Radford et al., 2021; Jia et al., 2021). They are typically applied to zero-shot classification (e.g., on ImageNet) using frozen weights without finetuning, which stands in stark contrast to the existing paradigms of retraining or finetuning when applying VLMs to open-vocabulary detection. Intuitively, in order to align the image content with the text description during training, VLMs may learn locality-sensitive and discriminative features that are transferable to object detection. Observations in Figure 1 support our intuition. Surprisingly, features of a frozen VLM contain rich information that is both locality sensitive for describing object shapes (col. 2) and discriminative for region classification (col. 3). This motivates us to explore using frozen VLM features for open-vocabulary detection, which entails accurate localization and classification of objects in the wild.

We propose F-VLM -- a simple and scalable open-vocabulary detection approach built upon frozen VLMs. For localization, we simply attach a detector head to predict object regions. For open-vocabulary recognition, we apply the VLM feature pooler (e.g., a self-attention layer) on the region features from the frozen backbone at test time.
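The open-vocabulary recognition step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes region features have already been pooled from the frozen backbone (e.g., via ROI-Align followed by the VLM's attention pooler), and compares them against category text embeddings by cosine similarity, as in CLIP-style zero-shot classification. The `temperature` value is an illustrative choice.

```python
import numpy as np

def classify_regions(region_feats, text_embeds, temperature=0.01):
    """Zero-shot region classification with frozen VLM features.

    region_feats: (R, D) pooled region features from the frozen backbone.
    text_embeds:  (C, D) category text embeddings from the VLM text encoder.
    Returns softmax probabilities over the C categories for each region.
    """
    # L2-normalize so the dot product becomes cosine similarity.
    region_feats = region_feats / np.linalg.norm(region_feats, axis=-1, keepdims=True)
    text_embeds = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    logits = region_feats @ text_embeds.T / temperature
    # Numerically stable softmax over categories.
    logits -= logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)
```

Because the vocabulary enters only through `text_embeds`, new categories can be added at test time by embedding their names with the frozen text encoder, with no retraining.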
We train only the detector head upon a frozen VLM backbone, and combine the detection scores with the corresponding VLM predictions at test time. Our recipe reduces the training complexity of an open-vocabulary detector to below that of a standard detector, obviating the need for knowledge distillation, detection-tailored pretraining, or weakly supervised learning. By completely preserving the knowledge of the pretrained VLM, F-VLM follows a philosophy similar to ViTDet (Li et al., 2022c): decouple the detector-specific learning from the more task-agnostic vision knowledge in the backbone. We hope these findings will encourage the research community to further explore frozen VLMs for a broader range of computer vision tasks.
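One plausible way to combine detection scores with the corresponding VLM predictions, as described above, is a weighted geometric mean per region and category. This is a hedged sketch rather than the paper's exact inference rule; the fusion weight `alpha` is a hypothetical parameter introduced here for illustration.

```python
import numpy as np

def combine_scores(det_scores, vlm_scores, alpha=0.35):
    """Fuse per-region detector scores with frozen-VLM classification scores.

    det_scores, vlm_scores: arrays of shape (R, C) with values in [0, 1].
    alpha: illustrative weight on the VLM term (not a value from the paper).
    """
    # Clip to avoid 0 ** p edge cases before taking fractional powers.
    det_scores = np.clip(det_scores, 1e-8, 1.0)
    vlm_scores = np.clip(vlm_scores, 1e-8, 1.0)
    # Weighted geometric mean: the trained detector head contributes
    # localization confidence, while the frozen VLM supplies
    # open-vocabulary recognition.
    return det_scores ** (1.0 - alpha) * vlm_scores ** alpha
```

A geometric mean keeps the fused score low unless both sources agree, which is a common design choice when ensembling a closed-vocabulary detector with an open-vocabulary classifier.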

2. RELATED WORK

Zero-shot/Open-vocabulary visual recognition and representation learning. Zero-shot and open-vocabulary recognition has been a long-standing problem in computer vision. Earlier works use visual attributes to represent categories as binary codebooks and learn to predict the attributes for novel categories (Jayaraman & Grauman, 2014; Rohrbach et al., 2011). DeViSE (Frome et al., 2013) and ConSE (Norouzi et al., 2014) pioneered learning a joint image-text embedding space using deep learning. Many works have shown the promise of representation learning from natural language associated with images, such as image tags (Chen & Gupta, 2015; Divvala et al., 2014; Joulin et al., 2016) or text descriptions (Desai & Johnson, 2021; He & Peng, 2017; Sariyildiz et al., 2020; Wang et al., 2009; Zhong et al., 2021). Recently, popular large VLMs have scaled up by training on billions of image-text pairs and acquired strong image-text representations through contrastive learning (Radford et al., 2021; Jia et al., 2021; Pham et al., 2021; Zhai et al., 2022). These models achieve strong zero-shot performance on many classification benchmarks and show clear benefits from scaling model capacity.



Project page: https://sites.google.com/view/f-vlm/home



Figure 1: We explore the potential of frozen VLM (e.g., CLIP) features for open-vocabulary detection. The feature grouping reveals rich semantic and locality-sensitive information in which object boundaries are nicely delineated (col. 2, see Appendix C for more details). The same frozen features can classify ground-truth regions well without finetuning (col. 3). Therefore, we propose to build an open-vocabulary detector on top of a frozen VLM (col. 4) without the need for knowledge distillation, detection-tailored pretraining, or weakly supervised learning. F-VLM significantly reduces training complexity and compute requirements, and achieves state-of-the-art performance at the system level.

We demonstrate the efficacy of F-VLM on LVIS (Gupta et al., 2019), COCO (Lin et al., 2014) and Objects365 (Shao et al., 2019). Here is a summary of our contributions and observations:

• We propose F-VLM -- a simple open-vocabulary detection method upon frozen VLMs without knowledge distillation, detection-tailored pretraining, or weakly supervised learning.

• Despite its simplicity, F-VLM achieves strong performance, surpassing the previous state-of-the-art on the LVIS open-vocabulary detection benchmark by 6.5 mask AP_r at the system level and outperforming existing approaches in cross-dataset transfer (COCO, Objects365).

• F-VLM shows compelling scaling behavior with consistent performance improvements from increasing the backbone capacity (e.g., +14.2 LVIS mask AP_r with our largest backbone).

• F-VLM has far fewer trainable parameters, allowing it to train significantly faster. Compared with a strong open-vocabulary detection method, ViLD (Gu et al., 2022), F-VLM not only achieves better performance, but also provides up to 200× training compute savings.

