COOPERATING RPN'S IMPROVE FEW-SHOT OBJECT DETECTION

Abstract

Learning to detect an object in an image from very few training examples - few-shot object detection - is challenging, because the classifier that sees proposal boxes has very little training data. A particularly challenging training regime occurs when there are one or two training examples. In this case, if the region proposal network (RPN) misses even one high intersection-over-union (IOU) training box, the classifier's model of how object appearance varies can be severely impacted. We use multiple distinct yet cooperating RPN's. Our RPN's are trained to be different, but not too different; doing so yields significant performance improvements over the state of the art on COCO and PASCAL VOC in the very few-shot setting. This effect appears to be independent of the choice of classifier or dataset.

1. INTRODUCTION

Achieving accurate few-shot object detection is difficult, because one must rely on a classifier building a useful model of variation in appearance with very few examples. This paper identifies an important effect that causes existing detectors to have weaker than necessary performance in the few-shot regime. By remediating this difficulty, we obtain substantial improvements in performance with current architectures.

The effect is most easily explained by looking at the "script" that modern object detectors mostly follow. As one would expect, there are variations in detector structure, but these do not mitigate the effect. A modern object detector will first find promising image locations; these are usually, but not always, boxes. We describe the effect in terms of boxes reported by a region proposal network (RPN) (Ren et al., 2015), but expect that it applies to other representations of location, too. The detector then passes the promising locations through a classifier to determine what, if any, object is present. Finally, it performs various cleanup operations (non-maximum suppression, bounding box regression, etc.), aimed at avoiding multiple predictions in the same location and improving localization. The evaluation procedure for reported labeled boxes uses an intersection-over-union (IOU) test as part of determining whether a box is relevant.

A detector that is trained for few-shot detection is trained on two types of categories. Base categories have many training examples, and are used to train an RPN and the classifier. Novel categories have one (or three, or five, etc.) examples per category. The existing RPN is used to find boxes, and the classifier is fine-tuned to handle novel categories. Now assume that the detector must learn to detect a category from a single example. The RPN is already trained on other examples. It produces a collection of relevant boxes, which are used to train the classifier.
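The IOU test mentioned above can be made concrete. The following is a minimal sketch, assuming the common [x1, y1, x2, y2] corner format for boxes (the box representation and threshold are conventions of standard detection benchmarks, not something this paper fixes):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes.

    Boxes are [x1, y1, x2, y2] with x1 < x2 and y1 < y2
    (a common convention; other representations exist).
    """
    # Intersection rectangle; empty intersections clamp to zero area.
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A proposal counts as "high IOU" for a ground-truth box when
# iou(proposal, gt) clears a threshold, commonly 0.5.
```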
The only way that the classifier can build a model of the category's variation in appearance is by having multiple high IOU boxes reported by the RPN. In turn, this means that an RPN that behaves well on base categories may create serious problems for novel categories. Imagine that the RPN reports only a few of the available high IOU boxes in training data. For base categories, this is not a problem; many high IOU boxes will pass to the classifier because there is a lot of training data, and so the classifier will be able to build a model of the category's variation in appearance. This variation will be caused by effects like aspect, in-class variation, and the particular RPN's choice of boxes. But for novel categories, an RPN must report as many high IOU boxes as possible, because otherwise the classifier will have too weak a model of appearance variation - for example, it might think that the object must be centered in the box. This will significantly depress accuracy.

As Figure 1 and our results illustrate, this effect (which we call proposal neglect) is present in state-of-the-art few-shot detectors. One cannot escape this effect by simply reporting lots of boxes, because doing so will require the classifier to be very good at rejecting false positives. Instead, one wants the box proposal process to not miss high IOU boxes, without wild speculation. We offer a relatively straightforward strategy: we train multiple RPN's to be somewhat redundant (so that if one RPN misses a high IOU box, another will get it), without overpredicting.
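Proposal neglect can be quantified as proposal recall: the fraction of ground-truth boxes covered by at least one high IOU proposal. The sketch below is illustrative; the function names and the per-image list-of-boxes data layout are assumptions, not the paper's evaluation code:

```python
def _iou(a, b):
    # Minimal IoU for [x1, y1, x2, y2] boxes (a common convention).
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def proposal_recall(proposals_per_image, gt_per_image, thresh=0.5):
    """Fraction of ground-truth boxes hit by at least one proposal
    with IoU >= thresh.

    When this is low on the novel classes, the classifier rarely
    (or never) sees a positive box during fine-tuning -- the
    proposal neglect effect.
    """
    hit = total = 0
    for proposals, gt_boxes in zip(proposals_per_image, gt_per_image):
        for gt in gt_boxes:
            total += 1
            if any(_iou(p, gt) >= thresh for p in proposals):
                hit += 1
    return hit / total if total else 0.0
```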
In what follows, we demonstrate how to do so and show how to balance redundancy against overprediction. Our contributions are three-fold: (1) We identify an important effect in few-shot object detection that causes existing detectors to have weaker than necessary performance in the few-shot regime. (2) We propose to overcome the proposal neglect effect by utilizing RPN redundancy. (3) We design an RPN ensemble mechanism that trains multiple RPN's simultaneously while enforcing diversity and cooperation among RPN's. We achieve state-of-the-art performance on COCO and PASCAL VOC in very few-shot settings.
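The redundancy-without-overprediction idea can be sketched at inference time as pooling proposals from several RPN's and suppressing near-duplicates before passing a fixed budget of boxes to the classifier. This is a simplified illustration under assumed conventions (greedy NMS, [x1, y1, x2, y2] boxes, hypothetical function names); it is not the paper's exact mechanism, which also shapes how the RPN's are trained:

```python
def _iou(a, b):
    # Minimal IoU for [x1, y1, x2, y2] boxes (a common convention).
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.7):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(_iou(boxes[i], boxes[k]) < iou_thresh for k in keep):
            keep.append(i)
    return keep

def merge_rpn_proposals(per_rpn_boxes, per_rpn_scores, top_n=10):
    """Pool proposals from several RPN's, suppress near-duplicates,
    and keep at most top_n survivors.

    Redundancy: a high IOU box missed by one RPN can survive from
    another. No overprediction: the classifier still sees a fixed
    budget of top_n boxes.
    """
    boxes = [b for rpn in per_rpn_boxes for b in rpn]
    scores = [s for rpn in per_rpn_scores for s in rpn]
    keep = nms(boxes, scores)[:top_n]
    return [boxes[i] for i in keep], [scores[i] for i in keep]
```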

2. BACKGROUND

Object Detection with Abundant Data. The best-performing modern detectors are based on convolutional neural networks. There are two families of architecture, both relying on the remarkable fact that one can quite reliably tell whether an image region contains an object independent of category (Endres & Hoiem, 2010; van de Sande et al., 2011). In serial detection, a proposal process (RPN's in what follows) offers the classifier a selection of locations likely to contain objects, and the classifier labels them, with the advantage that the classifier "knows" the likely support of the object fairly accurately. This family includes R-CNN and its variants (R-CNN (Girshick et al., 2014); Fast R-CNN (Girshick, 2015); Faster R-CNN (Ren et al., 2015); Mask R-CNN (He et al., 2017); SPP-Net (He et al., 2015); FPN (Lin et al., 2017); and DCN (Dai et al., 2017)). In parallel detection, the proposal process and the classification process are independent; these methods can be faster, but the classifier "knows" very little about the likely support of the object, which can affect accuracy. This family includes YOLO and its variants (YOLO versions (Redmon et al., 2016; Redmon & Farhadi, 2017; 2018; Bochkovskiy et al., 2020); SSD (Liu et al., 2016); CornerNet (Law & Deng, 2018); and ExtremeNet (Zhou et al., 2019)). This paper identifies an issue with the proposal process that can impede strong performance when there is very little training data (the few-shot case). The effect is described in the context of serial detection, but likely occurs in parallel detection too.

Few-Shot Object Detection. Few-shot object detection involves detecting objects for which there are very few training examples. There is a rich few-shot classification literature (with roots in (Thrun, 1998; Fei-Fei et al., 2006)). Dvornik et al. (2019) uses ensemble procedures for few-shot classifi-



Figure 1: Few-shot detectors are subject to the proposal neglect effect. On the left: an image (from the PASCAL VOC split 3 novel classes test set) showing the top 10 proposals from a state-of-the-art few-shot detector, TFA (Wang et al., 2020). On the right: the top 10 proposals from our cooperating RPN's. Cat is in the novel classes {boat, cat, mbike, sheep, sofa} (i.e., both models are cat detectors). Note that the classifier in TFA will not see a cat box, and so it cannot detect the cat. This is a disaster when this image is used in training. For our approach, the cat box is found and is high in the top 10 proposals.

