CROSS-SUPERVISED OBJECT DETECTION

Abstract

After learning a new object category from image-level annotations (with no object bounding boxes), humans are remarkably good at precisely localizing those objects. However, building good object localizers (i.e., detectors) currently requires expensive instance-level annotations. While some work has been done on learning detectors from weakly labeled samples (with only class labels), these detectors do poorly at localization. In this work, we show how to build better object detectors from weakly labeled images of new categories by leveraging knowledge learned from fully labeled base categories. We call this learning paradigm cross-supervised object detection. While earlier works investigated this paradigm, they did not apply it to realistic complex images (e.g., COCO), and their performance was poor. We propose a unified framework that combines a detection head trained from instance-level annotations and a recognition head learned from image-level annotations, together with a spatial correlation module that bridges the gap between detection and recognition. These contributions enable us to better detect novel objects with image-level annotations in complex multi-object scenes such as the COCO dataset.

1. INTRODUCTION

Deep architectures have achieved great success in many computer vision tasks, including object recognition and the closely related problem of object detection. Modern detectors, such as Faster R-CNN (Ren et al., 2015), YOLO (Redmon et al., 2016), and RetinaNet (Lin et al., 2017), use the same network backbones as popular recognition models. However, even with the same backbone architectures, detection and recognition models require different types of supervision. A good detector relies heavily on precise bounding boxes and labels for each instance (we shall refer to these as instance-level annotations), whereas a recognition model needs only image-level labels. Needless to say, it is more time consuming and expensive to obtain high-quality bounding box annotations than class labels. As a result, current detectors are limited to a small set of categories relative to their object recognition counterparts. To address this limitation, it is natural to ask, "Is it possible to learn detectors with only class labels?" This problem is commonly referred to as weakly supervised object detection (WSOD). Early WSOD work (Hoffman et al., 2014) showed fair performance by directly applying recognition networks to object detection. More recently, researchers have used multiple instance learning methods (Dietterich et al., 1997) to recast WSOD as a multi-label classification problem (Bilen & Vedaldi, 2016). However, these weakly supervised detectors perform poorly at localization. Most WSOD experiments have been conducted on the ILSVRC (Russakovsky et al., 2015) dataset, in which images contain only a single object, or on the PASCAL VOC (Everingham et al., 2010) dataset, which has only 20 categories. The simplicity of these datasets limits the number and types of distractors in an image, making localization substantially easier. With only class labels to learn from, it is challenging to detect objects at different scales in an image that contains many distractors.
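The multiple-instance-learning recasting mentioned above can be made concrete with a short sketch. The following is a minimal, illustrative implementation of WSDDN-style scoring (Bilen & Vedaldi, 2016), where a classification stream and a detection stream over region proposals are combined and aggregated into image-level class scores; the function and variable names are ours, and the weights would normally be learned end-to-end rather than supplied directly.

```python
import numpy as np

def mil_image_scores(region_feats, w_cls, w_det):
    """WSDDN-style multiple-instance scoring (illustrative sketch).

    region_feats: (R, D) features for R region proposals.
    w_cls, w_det: (D, C) weights for the classification and detection streams.
    Returns image-level class scores (C,) and per-region scores (R, C).
    """
    def softmax(x, axis):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    cls = softmax(region_feats @ w_cls, axis=1)  # softmax over classes, per region
    det = softmax(region_feats @ w_det, axis=0)  # softmax over regions, per class
    region_scores = cls * det                    # element-wise product of streams
    image_scores = region_scores.sum(axis=0)     # aggregate regions to image level
    return image_scores, region_scores
```

The image-level scores can be trained against the image's class labels with a multi-label loss, which is how MIL turns WSOD into a classification problem; the per-region scores then double as detection scores at test time.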
In particular, as shown in our experiments, weakly supervised object detectors do not work well in complex multi-object scenes, such as the COCO dataset (Lin et al., 2014). To address this challenge, we focus on a form of learning in which the localization of classes with only object labels (weakly labeled classes) can benefit from other classes that have ground truth bounding boxes (fully labeled classes). We refer to this interesting learning paradigm as cross-supervised object detection (CSOD). While several works (Hoffman et al., 2014; Tang et al., 2016; Yang et al., 2019a; Redmon & Farhadi, 2017) have explored this problem before, they share the same limitation as the WSOD work mentioned above. Those cross-supervised object detectors work under simplified scenarios (e.g., the ILSVRC dataset) where images contain single, centered objects. They struggle to learn under more complex and realistic scenarios, where there are multiple objects from potentially very different classes, and objects can be small and appear anywhere in the image. In this work, we show that by doing multi-task learning on both fully-supervised base classes and weakly-supervised novel classes, our model is able to learn a good detector under the CSOD setting. More formally, we define CSOD as follows. At training time, we are given 1) images containing objects from both base and novel classes, 2) both class labels and ground truth bounding boxes for base objects, and 3) only class labels for novel objects. Our goal is to detect novel objects. In CSOD, base classes and novel classes are disjoint. Thus, CSOD can be seen as performing fully-supervised detection on the base classes and weakly supervised detection on the novel classes. It has similarities to both transfer learning and semi-supervised learning, since it transfers knowledge from base classes to novel classes and has more information about some instances than others.
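The supervision split in the CSOD definition above can be sketched as follows. This is a hypothetical illustration (class names and function names are ours, not from the paper) of how each training image's annotations are routed: box targets for base classes feed the fully-supervised branch, while novel classes contribute only image-level labels.

```python
# Hypothetical example class sets; in CSOD these must be disjoint.
BASE_CLASSES = {"person", "horse"}      # fully labeled: class labels + boxes
NOVEL_CLASSES = {"zebra", "umbrella"}   # weakly labeled: class labels only

def route_supervision(image_labels, gt_boxes):
    """Split one image's annotations into CSOD supervision targets.

    image_labels: set of class names present in the image.
    gt_boxes: dict mapping class name -> list of [x1, y1, x2, y2] boxes
              (available for base classes only).
    Returns (box targets for fully-supervised detection,
             label targets for weakly-supervised detection).
    """
    assert BASE_CLASSES.isdisjoint(NOVEL_CLASSES)
    det_targets = {c: gt_boxes[c] for c in image_labels & BASE_CLASSES}
    weak_targets = image_labels & NOVEL_CLASSES
    return det_targets, weak_targets
```

For an image containing a person (with a ground-truth box) and a zebra (label only), the function returns the person's boxes for the detection loss and the label "zebra" for the image-level loss, matching the multi-task setup described above.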
However, CSOD represents a distinct and novel paradigm for learning. Current weakly-supervised methods have several drawbacks when learning from multi-object images. As shown in Fig. 1, a weakly supervised object detector tends to detect only the most discriminative part of a novel object instead of the whole object. Notice how only the head of the person, and not the whole body, is detected. Another issue is that the localizer for one object (e.g., the horse) may be confused by the occurrence of another object, such as the person on the horse. This example illustrates the gap between detection and recognition: without ground truth bounding boxes, the detector acts like a standard recognition model, focusing on discriminating rather than detecting. In this paper, we explore two major mechanisms for improving on this. Our first mechanism unifies detection and recognition. Using the same network backbone architecture, recognition and detection can be seen as image-level classification and region-level classification respectively, suggesting a strong relation between them. In particular, it suggests a shared training framework in which the same backbone is used with different heads for detection and recognition. Thus, we combine a detection head learned from ground truth bounding boxes and a recognition head learned in a weakly supervised fashion from class labels. Unlike a traditional recognition head, our recognition head produces a class score for each of multiple proposals and is thus capable of detecting objects. The second mechanism is a learned spatial correlation module that reduces the gap between detection and recognition. It takes several high-confidence bounding boxes produced by the recognition head as input, and learns to regress ground truth bounding boxes. By combining these mechanisms, our model outperforms all previous models when all novel objects are weakly labeled. In summary, our contributions are three-fold.
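The input/output contract of the spatial correlation module can be sketched with a toy stand-in. The real module is learned from the fully labeled base classes; the sketch below merely shows its interface (several high-confidence boxes in, one refined box out) using a score-weighted average as a hypothetical, non-learned aggregation. All names are ours, not the paper's.

```python
import numpy as np

def spatial_correlation_regress(boxes, scores, k=5):
    """Toy stand-in for a spatial correlation module.

    boxes: (N, 4) candidate boxes [x1, y1, x2, y2] from the recognition head.
    scores: (N,) confidence scores for those boxes.
    Takes the k highest-confidence boxes and aggregates them into a single
    refined box via a score-weighted average. A learned module would instead
    regress the refinement from the boxes' spatial configuration.
    """
    order = np.argsort(scores)[::-1][:k]       # indices of top-k scores
    top_boxes = boxes[order]
    weights = scores[order] / scores[order].sum()
    return (weights[:, None] * top_boxes).sum(axis=0)  # (4,) refined box
```

In practice, averaging discriminative-part boxes (e.g., several boxes around a person's head) cannot recover the full extent of the object, which is precisely why the module is trained on base-class ground truth rather than hand-crafted as above.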
First, we define a new task, cross-supervised object detection, which enables us to leverage knowledge from fully labeled base categories to help learn a robust detector from novel-object class labels alone. Second, we propose a unified framework in which two heads are learned from class labels and detection labels respectively, along with a spatial correlation module bridging the gap between recognition and detection. Third, we significantly outperform existing methods (Zhang et al., 2018a; Tang et al., 2017; 2018) on PASCAL VOC and COCO, suggesting that CSOD could be a promising approach for expanding object detection to a much larger number of categories.

2. RELATED WORK
Weakly supervised object detection. WSOD (Kosugi et al., 2019; Zeng et al., 2019; Yang et al., 2019b; Wan et al., 2019; Arun et al., 2019; Wan et al., 2018; Zhang et al., 2018b; Ren et al., 2020; Zhang et al., 2018c; Li et al., 2019; Gao et al., 2019b) attempts to learn a detector with only image-level category labels. Most of these methods adopt the idea of Multiple Instance Learning (Dietterich et al., 1997) to recast WSOD as a multi-label classification task. Bilen & Vedaldi (2016) propose an end-to-end network that modifies a classifier to operate at the level of image regions, serving as a region selector and a classifier simultaneously. Tang et al. (2017) and Tang et al. (2018) find that several iterations of online refinement based on the outputs of previous iterations boost performance. Wei et al. (2018) and Diba et al. (2017) use semantic segmentation based on class activation maps (Zhou et al., 2016) to help generate tight bounding boxes. However,

