IMAGINARYNET: LEARNING OBJECT DETECTORS WITHOUT REAL IMAGES AND ANNOTATIONS

Abstract

Without training on real examples, humans are able to detect a new category of object simply from a language description of its visual characteristics. Empowering deep learning with this ability would enable neural networks to handle complex vision tasks, e.g., object detection, without collecting and annotating real images. To this end, this paper introduces a novel and challenging learning paradigm, Imaginary-Supervised Object Detection (ISOD), in which neither real images nor manual annotations are allowed for training object detectors. To resolve this challenge, we propose IMAGINARYNET, a framework that synthesizes images by combining a pretrained language model with a text-to-image synthesis model. Given a class label, the language model generates a full description of a scene containing a target object, and the text-to-image model then generates a photo-realistic image. With the synthesized images and their class labels, weakly supervised object detection can be leveraged to accomplish ISOD. By gradually introducing real images and manual annotations, IMAGINARYNET can collaborate with other supervision settings to further boost detection performance. Experiments show that IMAGINARYNET can (i) obtain about 75% of the performance of its weakly supervised counterpart with the same backbone trained on real data, and (ii) significantly improve the baseline while achieving state-of-the-art or comparable performance when incorporated with other supervision settings. Our code will be publicly available at https://github.com/kodenii/ImaginaryNet.
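To make the two-stage pipeline concrete, the sketch below mocks the generation loop: a class label is expanded into a scene description (standing in for the pretrained language model), the description is handed to a text-to-image model, and each synthesized sample is paired with its class label only, as weakly supervised detection requires. All function names here are illustrative assumptions, not the paper's actual implementation.

```python
def generate_description(class_label: str) -> str:
    # Hypothetical stand-in for a pretrained language model: expand a bare
    # class label into a full scene description. A real LM would complete
    # the prompt; here we use a fixed template for illustration.
    prompt = f"A photo of a {class_label}"
    return prompt + " in a natural outdoor scene."

def synthesize_image(description: str):
    # Hypothetical stand-in for a pretrained text-to-image model
    # (e.g., a diffusion model) that returns a photo-realistic image.
    raise NotImplementedError("plug in a pretrained text-to-image model here")

def imagine_dataset(class_labels, images_per_class=1):
    # Build an "imaginary" dataset: each sample carries only the class label
    # (image-level supervision), so no manual bounding-box annotation is needed.
    dataset = []
    for label in class_labels:
        for _ in range(images_per_class):
            desc = generate_description(label)
            dataset.append({"label": label, "description": desc})
    return dataset
```

In a full system, `synthesize_image` would be called on each description and a weakly supervised detector would then be trained on the resulting image–label pairs.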

1. INTRODUCTION

Without training on real examples, humans are able to detect a new category of object simply from a language description of its visual characteristics. Equipping deep learning with this ability may allow neural networks to handle complex vision tasks, e.g., object detection, without real images and annotations. Recently, we have witnessed the rise of Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021), where general knowledge is learned by pre-training and then applied to various downstream tasks via zero-shot learning or task-specific fine-tuning. Unlike image classification, object detection is more challenging and has a larger gap from the pretraining tasks. Several methods, such as RegionCLIP (Zhong et al., 2022a), ViLD (Gu et al., 2021), and Detic (Zhou et al., 2022), have been proposed to transfer knowledge from pre-trained CLIP (Radford et al., 2021) to some modules of detectors. However, real images and annotations are still required for key modules of the object detectors, such as the Region Proposal Network (RPN) or Region of Interest (RoI) heads.

In this work, we aim to raise and answer a question: given suitable pre-trained models, can we learn object detectors without real images and manual annotations? To this end, we introduce a novel learning paradigm, i.e., Imaginary-Supervised Object Detection (ISOD), in which neither real images nor manual annotations can be used for training object detectors. Fortunately, benefiting from the progress in vision-language pre-training, ISOD is practically feasible. Here we propose IMAGINARYNET, a framework to learn object detectors by combining a pretrained language model as well

