CROSS-SUPERVISED OBJECT DETECTION

Abstract

After learning a new object category from image-level annotations (with no object bounding boxes), humans are remarkably good at precisely localizing those objects. However, building good object localizers (i.e., detectors) currently requires expensive instance-level annotations. While some work has been done on learning detectors from weakly labeled samples (with only class labels), these detectors do poorly at localization. In this work, we show how to build better object detectors from weakly labeled images of new categories by leveraging knowledge learned from fully labeled base categories. We call this learning paradigm cross-supervised object detection. While earlier works investigated this paradigm, they did not apply it to realistic complex images (e.g., COCO), and their performance was poor. We propose a unified framework that combines a detection head trained from instance-level annotations and a recognition head learned from image-level annotations, together with a spatial correlation module that bridges the gap between detection and recognition. These contributions enable us to better detect novel objects with image-level annotations in complex multi-object scenes such as the COCO dataset.

1. INTRODUCTION

Deep architectures have achieved great success in many computer vision tasks, including object recognition and the closely related problem of object detection. Modern detectors, such as Faster R-CNN (Ren et al., 2015), YOLO (Redmon et al., 2016), and RetinaNet (Lin et al., 2017), use the same network backbones as popular recognition models. However, even with the same backbone architectures, detection and recognition models require different types of supervision. A good detector relies heavily on precise bounding boxes and labels for each instance (we shall refer to these as instance-level annotations), whereas a recognition model needs only image-level labels. Needless to say, it is more time-consuming and expensive to obtain high-quality bounding box annotations than class labels. As a result, current detectors are limited to a small set of categories relative to their object recognition counterparts. To address this limitation, it is natural to ask, "Is it possible to learn detectors with only class labels?" This problem is commonly referred to as weakly supervised object detection (WSOD). Early WSOD work (Hoffman et al., 2014) showed fair performance by directly applying recognition networks to object detection. More recently, researchers have used multiple instance learning methods (Dietterich et al., 1997) to recast WSOD as a multi-label classification problem (Bilen & Vedaldi, 2016). However, these weakly supervised detectors perform poorly at localization. Most WSOD experiments have been conducted on the ILSVRC dataset (Russakovsky et al., 2015), in which images have only a single object, or on the PASCAL VOC dataset (Everingham et al., 2010), which has only 20 categories. The simplicity of these datasets limits the number and types of distractors in an image, making localization substantially easier. When learning from only class labels, it is challenging to detect objects at different scales in an image that contains many distractors.
In particular, as shown in our experiments, weakly supervised object detectors do not work well in complex multi-object scenes, such as the COCO dataset (Lin et al., 2014). To address this challenge, we focus on a form of learning in which the localization of classes with only object labels (weakly labeled classes) can benefit from other classes that have ground truth bounding boxes (fully labeled classes). We refer to this learning paradigm as cross-supervised object detection (CSOD). While several works (Hoffman et al., 2014; Tang et al., 2016; Yang et al., 2019a; Redmon & Farhadi, 2017) have explored this problem before, they share the limitations of the WSOD work mentioned above. Those cross-supervised object detectors work under simplified scenarios (e.g., the ILSVRC dataset), where images contain single, centered objects. They struggle to learn under more complex and realistic scenarios, where there are multiple objects from potentially very different classes, and objects can be small and appear anywhere in the image. In this work, we show that by doing multi-task learning on both fully-supervised base classes and weakly-supervised novel classes, our model is able to learn a good detector under the CSOD setting. More formally, we define CSOD as follows. At training time, we are given 1) images containing objects from both base and novel classes, 2) both class labels and ground truth bounding boxes for base objects, and 3) only class labels for novel objects. Our goal is to detect novel objects. In CSOD, base classes and novel classes are disjoint. Thus, it can be seen as performing fully-supervised detection on the base classes and weakly supervised detection on the novel classes. It has similarities to both transfer learning and semi-supervised learning, since it transfers knowledge from base classes to novel classes and has more information about some instances than others.
However, CSOD represents a distinct and novel paradigm for learning. Current weakly-supervised methods have several drawbacks when learning from multi-object images. As shown in Fig. 1, a weakly supervised object detector tends to detect only the most discriminating part of novel objects instead of the whole object. Notice how only the head of the person, and not the whole body, is detected. Another issue is that the localizer for one object (e.g., the horse) may be confused by the occurrence of another object, such as the person on the horse. This example illustrates the gap between detection and recognition: without ground truth bounding boxes, the detector acts like a standard recognition model, focusing on discriminating rather than detecting. In this paper, we explore two major mechanisms for improving on this. Our first mechanism is unifying detection and recognition. Using the same network backbone architecture, recognition and detection can be seen as image-level classification and region-level classification respectively, suggesting a strong relation between them. In particular, it suggests a shared training framework in which the same backbone is used with different heads for detection and recognition. Thus, we combine a detection head learned from ground truth bounding boxes and a recognition head learned in a weakly supervised fashion from class labels. Unlike a traditional recognition head, our recognition head produces a class score for multiple proposals and is capable of detecting objects. The second mechanism is learning a spatial correlation module to reduce the gap between detection and recognition. It takes several high-confidence bounding boxes produced by the recognition head as input, and learns to regress ground truth bounding boxes. By combining these mechanisms, our model outperforms all previous models when all novel objects are weakly labeled. In summary, our contributions are three-fold.
First, we define a new task, cross-supervised object detection, which enables us to leverage knowledge from fully labeled base categories to help learn a robust detector from novel object class labels only. Second, we propose a unified framework in which two heads are learned from class labels and detection labels respectively, along with a spatial correlation module bridging the gap between recognition and detection. Third, we significantly outperform existing methods (Zhang et al., 2018a; Tang et al., 2017; 2018) on PASCAL VOC and COCO, suggesting that CSOD could be a promising approach for expanding object detection to a much larger number of categories.

2. RELATED WORK

Weakly supervised object detection. WSOD (Bilen & Vedaldi, 2016; Tang et al., 2017; 2018; Wei et al., 2018; Diba et al., 2017; Zeng et al., 2019; Yang et al., 2019b; Wan et al., 2018; 2019; Arun et al., 2019; Zhang et al., 2018b; 2018c; Ren et al., 2020; Li et al., 2019; Gao et al., 2019b; Kosugi et al., 2019) attempts to learn a detector with only image category labels. Most of these methods adopt the idea of Multiple Instance Learning (Dietterich et al., 1997) to recast WSOD as a multi-label classification task. Bilen & Vedaldi (2016) propose an end-to-end network by modifying a classifier to operate at the level of image regions, serving simultaneously as a region selector and a classifier. Tang et al. (2017) and Tang et al. (2018) find that several iterations of online refinement based on the outputs of previous iterations boost performance. Wei et al. (2018) and Diba et al. (2017) use semantic segmentation based on class activation maps (Zhou et al., 2016) to help generate tight bounding boxes.


Cross-supervised object detection. Earlier work (2015) designs a three-step framework to learn a feature representation from weakly supervised classes and strongly supervised classes jointly. However, these methods can only perform object localization in single-object scenes such as ILSVRC, whereas our method can also perform object detection in complex multi-object scenes, e.g., COCO. It is also worth noting that we do multi-task learning, jointly learning from base and novel classes. In comparison, some works (Uijlings et al., 2018) do transfer learning: they first learn a model on base classes and then transfer and fine-tune the model on novel classes. Gao et al. (2019a) use a few instance-level labels and a large number of image-level labels for each category in a training-mining framework, which is referred to as semi-supervised detection. Zhang et al. (2018a) propose a framework named MSD that learns objectness on base categories and uses it to reject distractors when learning novel objects. In comparison, our spatial correlation module not only learns objectness but also refines coarse bounding boxes. Further, our model learns from both base and novel classes instead of only novel classes.

3. CROSS-SUPERVISED OBJECT DETECTION

CSOD requires us to learn from instance-level annotations (detection labels) and image-level annotations (recognition labels). In this section, we explain the unification of detection and recognition and introduce our framework. In the next section, we describe our novel spatial correlation module.

3.1. UNIFYING DETECTION AND RECOGNITION

How can we learn a detector from both instance-level and image-level annotations? Since detection and recognition can be seen as region-level and image-level classification respectively, a natural choice is to design a unified framework that combines a detection head and a recognition head, which learn from instance-level and image-level annotations respectively. We explore several baselines for unifying the detection and recognition heads. (1) Finetune. We first learn through the detection head on base classes with fully labeled samples. Then, we finetune our model using the recognition head on novel classes with only class labels. (2) Two Head. We simultaneously learn the detection and recognition heads on base and novel classes respectively, sharing the weights of the backbone. Prior work (Tang et al., 2018) finds that re-training a new detector, taking the top-scoring bounding boxes from a weakly supervised object detector as ground truth, marginally improves performance. Even with coarse and noisy pseudo bounding boxes, a standard object detector produces better detection results than a weakly supervised object detector. Keeping this observation in mind, we introduce guidance from the recognition head to the detection head: for each novel category present in a training sample, the recognition head outputs the top-scoring bounding box, which is then used by the detection head as supervision in that sample.
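The guidance step above can be sketched as follows. This is a minimal NumPy sketch, not the paper's code; the function and argument names are our own illustrative choices.

```python
import numpy as np

def select_pseudo_boxes(class_scores, proposals, image_labels):
    """For each novel category present in the image, take the proposal
    with the highest recognition-head score as a pseudo ground-truth
    box for the detection head.

    class_scores: (C, R) array of per-class proposal scores
    proposals:    (R, 4) array of proposal boxes (x1, y1, x2, y2)
    image_labels: (C,) binary vector of image-level labels
    """
    pseudo = {}
    for c in np.flatnonzero(image_labels):
        top = int(np.argmax(class_scores[c]))  # top-scoring proposal for class c
        pseudo[int(c)] = proposals[top]
    return pseudo
```

At training time the returned boxes would simply replace ground-truth boxes in the detection head's loss for the novel categories.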

3.2. DETECTION-RECOGNITION NETWORK

The structure of our Detection-Recognition Network (DRN) is shown in Fig. 2. Given an image, we first generate 2000 object proposals by Selective Search (Uijlings et al., 2013) or an RPN (Ren et al., 2015) trained on base classes. The image and proposals are fed into several convolutional (conv) layers followed by a region-of-interest (RoI) pooling layer (Girshick, 2015) to output fixed-size feature maps. These feature maps are fed into two fully connected (fc) layers to produce a collection of proposal features, which then branch into the recognition and detection heads.

Recognition Head. We follow previous WSOD methods to design our recognition head. Since OICR (Tang et al., 2017) is simple, neat, and commonly used, we make our recognition head the same as OICR, but with fewer refinement branches to reduce the computation cost. However, our recognition head can be replaced by any WSOD structure, as shown in § 5.3. Within the recognition head, as shown in Fig. 2, the proposal features branch into three streams producing three matrices x^c, x^d ∈ R^{C×|R|} and x^e ∈ R^{(C+1)×|R|} (with an extra background class), where C is the number of novel classes and |R| is the number of proposals. The matrices x^c and x^d are passed through a softmax over classes and a softmax over proposals respectively; their element-wise product, summed over proposals, gives the image-level class scores φ_c. We then calculate a standard multi-class cross-entropy loss, shown as the first term of Eq. 1. The matrix x^e is passed through a softmax over classes, and the result enters a weighted multi-class cross-entropy loss, shown as the second term of Eq. 1. We set the pseudo label for each proposal r based on its IoU (overlap) with the top-scoring proposal of the c-th class: y_cr = 1 if IoU > 0.5 and y_cr = 0 otherwise. The weight w_r for each proposal r is its IoU with the top-scoring proposal. The total loss for the recognition head is

$$L_{rec} = -\sum_{c=1}^{C}\left[y_c \log\phi_c + (1-y_c)\log(1-\phi_c)\right] - \frac{1}{|R|}\sum_{r=1}^{|R|}\sum_{c=1}^{C+1} w_r y_{cr} \log x^e_{cr} \quad (1)$$

Supervision from our recognition head.
We use the matrix x^e to propose pseudo bounding boxes to guide the detection head. Specifically, we select the top-scoring proposal for each object category that appears in the image as a pseudo bounding box, as done in OICR. We introduce the spatial correlation module in § 4 to further refine this pseudo ground truth.

Detection Head. Now that we have pseudo bounding boxes for novel objects and ground truth bounding boxes for base objects, we train our detection head like a standard detector. For simplicity and efficiency, our detection head uses the same structure as Faster R-CNN (Ren et al., 2015). At inference time, the detection head produces detection results for both base and novel categories.
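The recognition head's pseudo-label rule and its two loss terms can be sketched in NumPy as below. This is a sketch under our own assumptions (WSDDN-style aggregation for φ_c, as used by OICR; all names and shapes are illustrative), not the paper's implementation.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def box_iou(a, boxes):
    """IoU between one box a and an (R, 4) array of boxes (x1, y1, x2, y2)."""
    x1 = np.maximum(a[0], boxes[:, 0]); y1 = np.maximum(a[1], boxes[:, 1])
    x2 = np.minimum(a[2], boxes[:, 2]); y2 = np.minimum(a[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def pseudo_labels(top_box, proposals, thresh=0.5):
    """y_cr = 1 iff IoU with the top-scoring proposal exceeds thresh;
    the IoU itself serves as the per-proposal weight w_r."""
    iou = box_iou(top_box, proposals)
    return (iou > thresh).astype(np.float32), iou

def recognition_loss(x_c, x_d, x_e, y, y_pseudo, w, eps=1e-8):
    """x_c, x_d: (C, R) streams; x_e: (C+1, R) refinement stream.
    y: (C,) image labels; y_pseudo: (C+1, R) one-hot proposal labels;
    w: (R,) IoU-based weights."""
    # Image-level scores: softmax over classes times softmax over proposals,
    # summed over proposals (WSDDN-style aggregation).
    phi = (softmax(x_c, axis=0) * softmax(x_d, axis=1)).sum(axis=1)
    phi = np.clip(phi, eps, 1.0 - eps)
    l_img = -np.sum(y * np.log(phi) + (1 - y) * np.log(1 - phi))
    # Weighted cross-entropy over proposals for the refinement stream.
    p = softmax(x_e, axis=0)
    l_ref = -np.mean(w * (y_pseudo * np.log(p + eps)).sum(axis=0))
    return l_img + l_ref
```

The first term drives image-level classification; the second propagates the top-scoring box's label to overlapping proposals, weighted by their IoU with it.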

4. LEARNING TO MODEL SPATIAL CORRELATION

Our intuition is that there exists spatial correlation among high-confidence bounding boxes, and that this correlation can be captured to predict ground truth bounding boxes. By representing the spatial correlation in a class-agnostic heatmap, we can easily learn a mapping from recognition-based bounding boxes to ground truth bounding boxes on base categories, and then transfer this mapping to novel categories. Thus, we propose a spatial correlation module (SCM). SCM is used as a guidance refinement technique: it takes sets of high-confidence bounding boxes from the recognition head and returns corresponding pseudo ground truth bounding boxes to the detection head. These pseudo ground truth boxes act as supervision while training on novel categories. The framework of SCM is shown in Fig. 3. Within this module, we first generate a class-agnostic heatmap based on the high-confidence bounding boxes predicted by our recognition head, and then we perform detection on top of the heatmap.

Heatmap synthesis. We want to capture how the high-confidence bounding boxes interact among themselves. We introduce a simple way of achieving this using a class-agnostic heatmap. For each category c present in the image (y_c = 1), we first threshold and select high-confidence bounding boxes of class c. Then we synthesize a corresponding class-agnostic heatmap, which is essentially a two-channel feature map of the same size as the original image. The values at each pixel are the sum and the maximum of confidence over all selected bounding boxes covering that pixel.

Heatmap detection. We consider each class-agnostic heatmap as a two-channel image and perform detection on it. Specifically, we learn a class-agnostic detector on base classes, which we then use to produce pseudo ground truth bounding boxes for novel objects. For this task, we use a lightweight one-stage detector, consisting of only five convolutional layers.
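The heatmap synthesis step can be sketched as follows; a minimal NumPy sketch assuming boxes are given in integer pixel coordinates (the function name is ours).

```python
import numpy as np

def synthesize_heatmap(boxes, confidences, height, width):
    """Two-channel class-agnostic heatmap: at each pixel, the sum and
    the max of confidences over all selected boxes covering that pixel."""
    heat = np.zeros((2, height, width), dtype=np.float32)
    for (x1, y1, x2, y2), s in zip(boxes, confidences):
        heat[0, y1:y2, x1:x2] += s                                   # sum channel
        heat[1, y1:y2, x1:x2] = np.maximum(heat[1, y1:y2, x1:x2], s)  # max channel
    return heat
```

The resulting (2, H, W) map is what the lightweight heatmap detector consumes in place of an RGB image.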
We follow the same network architecture and loss as FCOS (Tian et al., 2019), replacing the backbone and feature pyramid network with five max-pooling layers. In our experiments, we also compare this tiny detector to a baseline: three fully-connected layers that regress the ground-truth location, taking the coordinates of high-confidence bounding boxes as input.

Loss of DRN. With the SCM in place, we can formulate the full loss function for DRN. Let L_rec, L_det, and L_scm denote the losses from our recognition head, detection head, and spatial correlation module respectively, and let λ_rec, λ_det, and λ_scm be hyperparameters balancing the three losses. We train our DRN using the following loss:

$$L = \lambda_{rec} L_{rec} + \lambda_{det} L_{det} + \lambda_{scm} L_{scm}$$

To evaluate our methods, we calculate mean Average Precision (mAP) based on the PASCAL criterion, i.e., IoU > 0.5 between predicted boxes and ground truths.

Implementation details. All our baselines, competitors, and our framework are based on VGG16 (Simonyan & Zisserman, 2015), following most weakly supervised object detection methods. We set λ_rec = 1, λ_det = 10, and λ_scm = 10. We train the whole framework for 20 epochs using SGD with a momentum of 0.9, a weight decay of 0.0005, and a learning rate of 0.001, which is reduced by a factor of 10 at the 14th epoch. For a stable learning process, we do not provide supervision from the recognition head to the detection head in the first 9 epochs.

Baselines and competitors. We compare against several baselines mentioned in § 3.1, two WSOD methods: OICR (Tang et al., 2017) and PCL (Tang et al., 2018), and two cross-supervised object detectors: MSD (Zhang et al., 2018a) and weight transfer (Kuen et al., 2019).

Results. As shown in Table 1, our method outperforms all other approaches by a large margin (over 7% relative increase in mAP on novel classes). The results are consistent with our discussion in § 3.1. We note that (1) sharing the backbone between the recognition and detection heads learns a more discriminative embedding for novel objects: in Table 1, Two Head* boosts performance by 5 points compared to only using the recognition head (OICR); (2) supervision from the recognition head to the detection head exploits the full potential of a detection model: adding the supervision (Ours* w/o SCM) improves the result by 5 points compared to Two Head; (3) our spatial correlation module successfully captures the spatial correlation between high-confidence proposals, further boosting performance by 3 points.

Table 2: Results on COCO. We compare our method with several strong baselines from § 3.1 and competitors. Our method significantly outperforms these approaches, showing that our cross-supervised object detector is capable of detecting novel objects in complex multi-object scenes.

Implementation details. The implementation details are the same as § 5.1 by default. We train the whole framework for 13 epochs. There is no supervision from the recognition head to the detection head in the first 5 epochs. The learning rate is reduced by a factor of 10 at the 8th and 12th epochs.

Baselines and competitors. Most baselines and competitors are the same as in § 5.1. 'Rec. Head' represents using only our recognition head structure as a weakly supervised object detector.

Results. The results on COCO still support our discussion in § 5.1. Even in complex multi-object scenes, our DRN outperforms all baselines and competitors by a large margin.

5.3. ABLATION EXPERIMENTS

Heatmap synthesis. In Table 3a, we compare different methods of synthesizing the heatmaps in the spatial correlation module. For each position in the heatmap, we consider three kinds of values: the maximum of confidence, the sum of confidence, and the number of proposals covering the position. These results inform our choice of max and sum to create a two-channel heatmap.
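For concreteness, the three per-position statistics compared in this ablation can be computed as below; a NumPy sketch with illustrative names, assuming integer pixel box coordinates.

```python
import numpy as np

def position_statistics(boxes, confidences, height, width):
    """Per-pixel statistics over the boxes covering each position:
    channel 0 = max confidence, channel 1 = sum of confidences,
    channel 2 = count of covering proposals."""
    stats = np.zeros((3, height, width), dtype=np.float32)
    for (x1, y1, x2, y2), s in zip(boxes, confidences):
        region = stats[:, y1:y2, x1:x2]  # view into stats
        region[0] = np.maximum(region[0], s)
        region[1] += s
        region[2] += 1.0
    return stats
```

Keeping only channels 0 and 1 reproduces the two-channel heatmap the ablation settles on.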

Structure of SCM.

In Table 3b, we compare different implementations of SCM. We compare FCOS (Tian et al., 2019) with 5 convolutional layers against the standard FCOS with a ResNet-50 (He et al., 2016) backbone, and also against the regression baseline mentioned in § 4. Considering the computation cost, we choose FCOS with 5 convolutional layers as our heatmap detector.

Structure of the recognition head. In Table 3c, we compare different structures for the recognition head: WSDDN (Bilen & Vedaldi, 2016) and OICR are compared to our structure. The results show that our model can benefit from a stronger recognition head.

Different proposal generation methods. Table 3d shows the ablation of different ways to generate proposals. On PASCAL VOC, with only 10 base classes, RPN performs worse than selective search; on COCO, with 60 base classes, RPN performs better than selective search.

Visualization. Fig. 4 shows detection results on novel objects. Images in the first, second, and third rows are detected by our model from the recognition head, the SCM, and the detection head respectively. The boxes in the first row tend to focus on the discriminating parts of objects, e.g., the first and second images contain only a part of the person. The recognition head also tends to detect co-occurring objects, e.g., in the fourth image it detects not only the horse but also a large part of the person. Our SCM alleviates these problems. It tends to focus on the whole object, e.g., the first and third samples detect the whole person instead of only the head. It can also correct unsatisfactory bounding boxes distracted by co-occurring objects, e.g., SCM correctly localizes the horse instead of both the person and the horse in the fourth example. The bounding boxes in the third row are clearly the best, indicating the efficacy of our framework.




Figure 1: A comparison between a weakly supervised object detector and our detector. A weakly supervised object detector detects only the most discriminating part of an object (e.g., focusing on the head of a person when detecting a person) or is distracted by co-occurring instances (e.g., by the person on the horse when detecting a horse). Our detector addresses these issues.

Figure 2: Our Detection-Recognition Network (DRN) without the spatial correlation module. In this illustration, Person belongs to novel classes and Boat belongs to base classes. The recognition head learns from the class label Person and outputs the top-scoring bounding box to help the detection head learn to detect the person. The spatial correlation module, discussed in § 4, can be added to further refine the top-scoring bounding boxes.

Figure 3: Our spatial correlation module (SCM). Our SCM learns to capture spatial correlation among high-confidence bounding boxes, generating a class-agnostic heatmap for the whole image. A heatmap detector is then trained to learn ground truth bounding boxes.

Setup. The PASCAL VOC 2007 and 2012 datasets contain 9,962 and 22,531 images respectively for 20 object classes. They are divided into train, val, and test sets. Here we follow previous work (Tang et al., 2017) in choosing the trainval set (5,011 images from 2007 and 11,540 images from 2012). We take the first 10 classes as base classes and the other 10 classes as novel classes.

Figure 4: Detection results on novel objects. The results are from our proposed model but with different heads. The first row shows the results of the recognition head. The second row lists the results from SCM. The third row displays the results from the detection head.

Table 1: Object detection performance (mAP %) on the PASCAL VOC 2007 test set. * indicates using the structure of OICR in the recognition head. "MSD-Ens" is the ensemble of AlexNet and VGG16. "MSD-Ens+FRCN" indicates using an ensemble model to predict pseudo ground truths and then learning a Fast R-CNN (Girshick, 2015) with VGG16.

Table 3: Ablation study of our method.

6. CONCLUSION

In this paper, we have focused on cross-supervised object detection in realistic settings with complex imagery. We explore two major ways to build a good cross-supervised object detector: sharing network backbone between a recognition head and a detection head, and learning a spatial correlation module to bridge the gap between recognition and detection. Significant improvement on PASCAL VOC and COCO suggests a novel and promising approach for expanding object detection to a much larger number of categories.

