CROSS-SUPERVISED OBJECT DETECTION

Abstract

After learning a new object category from image-level annotations (with no object bounding boxes), humans are remarkably good at precisely localizing those objects. However, building good object localizers (i.e., detectors) currently requires expensive instance-level annotations. While some work has been done on learning detectors from weakly labeled samples (with only class labels), these detectors do poorly at localization. In this work, we show how to build better object detectors from weakly labeled images of new categories by leveraging knowledge learned from fully labeled base categories. We call this learning paradigm cross-supervised object detection. While earlier works investigated this paradigm, they did not apply it to realistic complex images (e.g., COCO), and their performance was poor. We propose a unified framework that combines a detection head trained from instance-level annotations and a recognition head learned from image-level annotations, together with a spatial correlation module that bridges the gap between detection and recognition. These contributions enable us to better detect novel objects with image-level annotations in complex multi-object scenes such as the COCO dataset.

1. INTRODUCTION

Deep architectures have achieved great success in many computer vision tasks including object recognition and the closely related problem of object detection. Modern detectors, such as Faster R-CNN (Ren et al., 2015), YOLO (Redmon et al., 2016), and RetinaNet (Lin et al., 2017), use the same network backbone as popular recognition models. However, even with the same backbone architectures, detection and recognition models require different types of supervision. A good detector relies heavily on precise bounding boxes and labels for each instance (we shall refer to these as instance-level annotations), whereas a recognition model needs only image-level labels. Needless to say, it is more time-consuming and expensive to obtain high-quality bounding box annotations than class labels. As a result, current detectors are limited to a small set of categories relative to their object recognition counterparts. To address this limitation, it is natural to ask, "Is it possible to learn detectors with only class labels?" This problem is commonly referred to as weakly supervised object detection (WSOD). Early WSOD work (Hoffman et al., 2014) showed fair performance by directly applying recognition networks to object detection. More recently, researchers have used multiple instance learning methods (Dietterich et al., 1997) to recast WSOD as a multi-label classification problem (Bilen & Vedaldi, 2016). However, these weakly supervised detectors perform poorly at localization. Most WSOD experiments have been conducted on the ILSVRC (Russakovsky et al., 2015) dataset, in which images have only a single object, or on the PASCAL VOC (Everingham et al., 2010) dataset, which has only 20 categories. The simplicity of these datasets limits the number and types of distractors in an image, making localization substantially easier. Learning from only class labels, it is challenging to detect objects at different scales in an image that contains many distractors.
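To make the MIL-based formulation above concrete, the following is a minimal sketch (not the paper's method) of the WSDDN-style two-stream scoring of Bilen & Vedaldi (2016): per-proposal class scores are softmax-normalized across classes in one stream and across proposals in the other, and their product is summed over proposals to produce image-level scores that can be trained with only image labels. The linear heads `W_cls` and `W_det` are hypothetical stand-ins for the learned fully connected layers.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def wsddn_image_scores(proposal_feats, W_cls, W_det):
    """WSDDN-style two-stream scoring (sketch).

    proposal_feats: (R, D) features for R region proposals.
    W_cls, W_det:   (D, C) hypothetical linear heads for the
                    classification and detection streams.
    Returns per-proposal scores (R, C) and image-level scores (C,).
    """
    cls_stream = softmax(proposal_feats @ W_cls, axis=1)  # compete across classes
    det_stream = softmax(proposal_feats @ W_det, axis=0)  # compete across proposals
    per_proposal = cls_stream * det_stream                # (R, C)
    image_scores = per_proposal.sum(axis=0)               # (C,), each in [0, 1]
    return per_proposal, image_scores

rng = np.random.default_rng(0)
R, D, C = 50, 16, 20  # proposals, feature dim, classes
_, scores = wsddn_image_scores(rng.normal(size=(R, D)),
                               rng.normal(size=(D, C)),
                               rng.normal(size=(D, C)))
```

Because each image-level score lies in [0, 1], the model can be trained with a multi-label binary cross-entropy loss against the image's class labels, while the per-proposal scores serve as (weak) localization evidence at test time.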
In particular, as shown in our experiments, weakly supervised object detectors do not work well in complex multi-object scenes, such as the COCO dataset (Lin et al., 2014). To address this challenge, we focus on a form of learning in which the localization of classes with only object labels (weakly labeled classes) can benefit from other classes that have ground-truth bounding boxes (fully labeled classes). We refer to this interesting learning paradigm as cross-supervised object detection (CSOD). While several works (Hoffman et al., 2014; Tang et al., 2016;

