COOPERATING RPN'S IMPROVE FEW-SHOT OBJECT DETECTION

Abstract

Learning to detect an object in an image from very few training examples (few-shot object detection) is challenging, because the classifier that sees proposal boxes has very little training data. A particularly challenging training regime occurs when there are one or two training examples. In this case, if the region proposal network (RPN) misses even one high intersection-over-union (IOU) training box, the classifier's model of how object appearance varies can be severely impacted. We use multiple distinct yet cooperating RPN's. Our RPN's are trained to be different, but not too different; doing so yields significant performance improvements over the state of the art for COCO and PASCAL VOC in the very few-shot setting. This effect appears to be independent of the choice of classifier or dataset.

1. INTRODUCTION

Achieving accurate few-shot object detection is difficult, because one must rely on a classifier building a useful model of variation in appearance from very few examples. This paper identifies an important effect that causes existing detectors to have weaker than necessary performance in the few-shot regime. By remediating this difficulty, we obtain substantial improvements in performance with current architectures.

The effect is most easily explained by looking at the "script" that modern object detectors mostly follow. As one would expect, there are variations in detector structure, but these do not mitigate the effect. A modern object detector will first find promising image locations; these are usually, but not always, boxes. We describe the effect in terms of boxes reported by a region proposal network (RPN) (Ren et al., 2015), but expect that it applies to other representations of location, too. The detector then passes the promising locations through a classifier to determine what, if any, object is present. Finally, it performs various cleanup operations (non-maximum suppression, bounding box regression, etc.) aimed at avoiding multiple predictions in the same location and at improving localization. The evaluation procedure for reported labeled boxes uses an intersection-over-union (IOU) test as part of determining whether a box is relevant.

A detector that is trained for few-shot detection is trained on two types of categories. Base categories have many training examples, and are used to train an RPN and the classifier. Novel categories have one (or three, or five, etc.) examples per category. The existing RPN is used to find boxes, and the classifier is fine-tuned to handle the novel categories. Now assume that the detector must learn to detect a category from a single example. The RPN is already trained on other examples. It produces a collection of relevant boxes, which are used to train the classifier.
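The IOU test mentioned above is central to the effect this paper studies, so it is worth being concrete. The following minimal sketch (our own illustration, not code from the paper; the function names and the corner-coordinate box convention are assumptions) shows how IOU is computed for a pair of boxes and how a proposal is judged a "high IOU" match; 0.5 is the conventional cutoff in detection benchmarks.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) corner coordinates with x1 < x2, y1 < y2.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle; width/height clamp to 0 when there is no overlap.
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_high_iou(proposal, gt_box, threshold=0.5):
    """A proposal counts as a match for a ground-truth box if IOU >= threshold."""
    return iou(proposal, gt_box) >= threshold
```

With one novel-category example, the classifier's positive training boxes are exactly the proposals that pass this test against the single ground-truth box, which is why an RPN that reports few such boxes starves the classifier.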
The only way that the classifier can build a model of the category's variation in appearance is by having multiple high IOU boxes reported by the RPN. In turn, this means that an RPN that behaves well on base categories may create serious problems for novel categories. Imagine that the RPN reports only a few of the available high IOU boxes in training data. For base categories, this is not a problem; many high IOU boxes will pass to the classifier because there is a lot of training data, and so the classifier will be able to build a model of the category's variation in appearance. This variation will be caused by effects like aspect, in-class variation, and the particular RPN's choice of boxes. But for novel categories, an RPN must report as many high IOU boxes as possible, because otherwise the classifier will have too weak a model of appearance variation; for example, it might think that the object must be centered in the box. This will significantly depress

