IMAGINARYNET: LEARNING OBJECT DETECTORS WITHOUT REAL IMAGES AND ANNOTATIONS

Abstract

Without training on real data, humans are able to detect a new category of objects simply from a language description of its visual characteristics. Empowering deep learning with this ability would undoubtedly enable neural networks to handle complex vision tasks, e.g., object detection, without collecting and annotating real images. To this end, this paper introduces a novel and challenging learning paradigm, Imaginary-Supervised Object Detection (ISOD), where neither real images nor manual annotations are allowed for training object detectors. To resolve this challenge, we propose IMAGINARYNET, a framework that synthesizes images by combining a pretrained language model and a text-to-image synthesis model. Given a class label, the language model generates a full description of a scene containing the target object, and the text-to-image model then generates a photo-realistic image. With the synthesized images and class labels, weakly supervised object detection can be leveraged to accomplish ISOD. By gradually introducing real images and manual annotations, IMAGINARYNET can collaborate with other supervision settings to further boost detection performance. Experiments show that IMAGINARYNET can (i) obtain about 75% of the performance in ISOD of a weakly supervised counterpart with the same backbone trained on real data, and (ii) significantly improve the baseline while achieving state-of-the-art or comparable performance when incorporated with other supervision settings. Our code will be publicly available at https://github.com/kodenii/ImaginaryNet.

1. INTRODUCTION

Without training on real data, humans are able to detect a new category of objects simply from a language description of its visual characteristics. Equipping deep learning with this ability may allow neural networks to handle complex vision tasks, e.g., object detection, without real images and annotations. Recently, we have witnessed the rise of Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021), where general knowledge can be learned by pre-training and then applied to various downstream tasks via zero-shot learning or task-specific fine-tuning. Unlike image classification, object detection is more challenging and has a larger gap from the pre-training task. Several methods, such as RegionCLIP (Zhong et al., 2022a), ViLD (Gu et al., 2021), and Detic (Zhou et al., 2022), have been proposed to transfer knowledge from pre-trained CLIP (Radford et al., 2021) to some modules of detectors. However, real images and annotations are still required for key modules of the object detectors, such as the Region Proposal Network (RPN) or Region of Interest (RoI) heads. In this work, we aim to raise and answer a question: given suitable pre-trained models, can we learn object detectors without real images and manual annotations? To this end, we introduce a novel learning paradigm, i.e., Imaginary-Supervised Object Detection (ISOD), where no real images or manual annotations can be used for training object detectors. Fortunately, benefiting from the progress in vision-language pre-training, ISOD is practically feasible. Here we propose IMAGINARYNET, a framework to learn object detectors by combining a pretrained language model with a text-to-image synthesis model. In particular, the text-to-image synthesis model is adopted to generate photo-realistic images, and the language model is used to improve the diversity of, and provide class labels for, the synthesized images.
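The two-stage synthesis pipeline described above can be sketched in a few lines of Python. Note that the prompt template, the scene contexts, and the `text_to_image` callable below are illustrative placeholders, not the actual models or prompts used by IMAGINARYNET:

```python
import random

# Hypothetical scene contexts; in IMAGINARYNET the language model, not a
# fixed template, expands a class label into a full scene description.
SCENE_CONTEXTS = [
    "in a city street", "on a wooden table", "next to a lake",
]

def imagine_description(class_label, rng=random):
    """Expand a bare class label into a full scene description.

    In the actual framework this step is performed by a pretrained
    language model; a random template stands in for it here.
    """
    context = rng.choice(SCENE_CONTEXTS)
    return f"A photo of a {class_label} {context}."

def synthesize_sample(class_label, text_to_image, rng=random):
    """Produce one (image, weak label) training pair for ISOD.

    `text_to_image` is any callable mapping a description string to an
    image (e.g. a diffusion model). The class label used for prompting
    doubles as the image-level label for weakly supervised training,
    so no manual annotation is ever needed.
    """
    description = imagine_description(class_label, rng)
    image = text_to_image(description)
    return image, class_label, description
```

Because the class label is known before the image is generated, every synthesized image arrives with its image-level supervision for free.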
Then, ISOD can be conducted by applying weakly supervised object detection (WSOD) algorithms to the synthesized images with class labels to learn object detectors. We set up a strong CLIP-based model as the baseline to verify the effectiveness of IMAGINARYNET. Experiments show that IMAGINARYNET outperforms the CLIP-based model by a large margin. Moreover, IMAGINARYNET obtains about 75% of the performance of a weakly supervised model with the same backbone trained on real data, clearly showing the feasibility of learning object detection without any real images or manual annotations. By gradually introducing real images and manual annotations, IMAGINARYNET can collaborate with other supervision settings to further boost detection performance. It is worth noting that the performance of existing object detection models may be constrained by the limited amount of training data. As a result, we use IMAGINARYNET as a dataset expansion approach to incorporate real images and manual annotations. Further experiments show that IMAGINARYNET significantly improves the performance of the baselines while achieving state-of-the-art or comparable performance in these supervision settings. To sum up, the contributions of this work are as follows:
• We propose IMAGINARYNET, a framework to generate synthesized images as well as supervision information for training object detectors. To the best of our knowledge, this is among the first work to train deep object detectors solely on synthesized images.
• We propose a novel paradigm of object detection, Imaginary-Supervised Object Detection (ISOD), where no real images or annotations can be used for training object detectors. We set up a benchmark for ISOD and obtain about 75% of the performance of the WSOD model with the same backbone trained on real data.
• By incorporating real images and manual annotations, IMAGINARYNET significantly improves the baseline model while achieving state-of-the-art or comparable performance.
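The weakly supervised training step that ISOD relies on can be illustrated with a WSDDN-style score aggregation (Bilen & Vedaldi, 2016): proposal scores from two parallel heads are normalized over classes and over proposals, and their product is summed into image-level class scores that need only image-level labels. This is a generic NumPy sketch of that idea, not IMAGINARYNET's exact detector:

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def image_level_scores(cls_logits, det_logits):
    """WSDDN-style aggregation of proposal scores.

    cls_logits, det_logits: (num_proposals, num_classes) arrays from
    two parallel heads. Classification is normalized over classes,
    detection over proposals; their elementwise product, summed over
    proposals, yields image-level class scores that can be trained
    with only image-level labels -- exactly the supervision that
    synthesized images with class labels provide.
    """
    cls = softmax(cls_logits, axis=1)   # which class, per proposal
    det = softmax(det_logits, axis=0)   # which proposal, per class
    proposal_scores = cls * det         # (num_proposals, num_classes)
    return proposal_scores.sum(axis=0)  # (num_classes,)
```

At test time the per-proposal scores (before summation) serve as detection scores, which is how image-level supervision is turned into box-level predictions.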

2. RELATED WORK

2.1. OBJECT DETECTION

Most fully-supervised object detection (FSOD) methods (Ren et al., 2015; Redmon et al., 2016; Tian et al., 2019; Carion et al., 2020) rely on a large amount of training data with box-level annotations. To reduce annotation costs, some works attempt to train a detector with incompletely supervised training data. For example, weakly-supervised object detection (WSOD) (Huang et al., 2022; Dong et al., 2021; Bilen & Vedaldi, 2016; Tang et al., 2017) requires only image-level labels, while semi-supervised object detection (SSOD) (Liu et al., 2021; Xu et al., 2021; Chen et al., 2022) leverages unlabeled data combined with box-level labeled data. Although these works use less or weaker supervision, all of them still rely on real images and manual annotations. In this paper, we propose ISOD, where no real images or manual annotations can be used for training object detectors, thereby eliminating data acquisition and annotation costs.

2.2. SIM2REAL

Sim2real aims to use simulator engines to render images for model training. Some works (Akhyani et al., 2022; Sadeghi & Levine, 2016; Wang et al., 2018) train models on such simulated images and then transfer them to real-world tasks.

Because of the large domain gap caused by the limitations of simulation engines, Sim2real and subsequent domain adaptation methods focus on reducing this gap. However, with the progress in text-to-image synthesis, the domain gap of images generated by such models has been largely reduced. The key problem is how to effectively employ text-to-image synthesis models to generate diverse images with proper content and quality. To this end, we propose IMAGINARYNET, which achieves about 75% of the performance in ISOD of the weakly supervised counterpart with the same backbone.

