IMPROVING ASPECT RATIO DISTRIBUTION FAIRNESS IN FEW-SHOT DETECTOR PRETRAINING VIA COOPERATING RPN'S

Abstract

Region proposal networks (RPNs) are a key component of modern object detectors. An RPN identifies image boxes likely to contain objects, and so worth further investigation. An RPN false negative is unrecoverable, so the performance of an object detector can be significantly affected by RPN behavior, particularly in low-data regimes. The RPN of a few-shot detector is trained on base classes. Our experiments demonstrate that, if the distribution of box aspect ratios for base classes differs from that for novel classes, errors caused by the RPN's failure to propose a good box become significant. This is predictable: for example, an RPN trained on base classes that are mostly square will tend to miss short, wide boxes. It has not been noticed to date because the (relatively few) standard base/novel class splits on current datasets do not display this effect, but changing the base/novel split highlights the problem. We describe benchmarks built on the PASCAL VOC, COCO, and LVIS datasets where the distribution shift is severe. We show that the effect can be mitigated by training multiple distinct but cooperating specialized RPNs. Each specializes in a different aspect ratio, but cooperation constraints limit the extent to which the RPNs are tuned, so if a box is missed by one RPN, it has a good chance of being picked up by another. Experimental evaluation confirms this approach results in substantial improvements in performance on the ARShift benchmarks, while remaining comparable to SOTA on conventional splits. Our approach applies to any few-shot detector and consistently improves performance.

1. INTRODUCTION

Most state-of-the-art object detectors follow a two-stage detection paradigm. A region proposal network (RPN) finds promising locations, and these are passed through a classifier to determine what, if any, object is present. In this architecture, if the RPN makes no proposal around an object, the object will not be detected. For a few-shot detector, one splits the classes into base and novel, then trains the RPN and classifier on base classes, fixes the RPN, and finally fine-tunes the classifier on novel classes using the RPN's predictions. Objects in large-scale object detection datasets (e.g. COCO (Lin et al., 2014); LVIS (Gupta et al., 2019)) have typical aspect ratios that vary somewhat from instance to instance, and often differ sharply from category to category. As a result, the few-shot training procedure has a built-in distribution-shift problem. This phenomenon is illustrated in Figure 1. Imagine all base classes are roughly square, and all novel classes are either short and wide, or tall and narrow. The RPN trained on the base classes will miss some novel class boxes. These misses will have two effects: the training data the classifier sees will be biased against the correct box shape; and, at run time, the detector may miss objects because of RPN failures. We refer to this problem as the bias (the RPN does not deal fairly with different aspect ratios). The bias occurs because the RPN sees few or no examples of the novel classes during training (Kang et al., 2019; Wang et al., 2020; Yan et al., 2019). To date, this bias has not been remarked on. This is an accident of dataset construction: the standard base/novel splits in standard datasets do not result in a distribution shift. But other base/novel splits do result in a distribution shift large enough to have notable effects, and Section 3 presents our evidence that this effect occurs in practice.
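The aspect ratio distribution shift described above can be measured directly from box annotations. A minimal sketch, assuming COCO-style width/height box annotations and a hypothetical base/novel class split (the toy data and the use of mean log-aspect-ratio as a shift statistic are illustrative choices, not the paper's exact measurement):

```python
import math
from collections import defaultdict

def log_aspect_ratios(annotations, class_split):
    """Collect log(width / height) of boxes, grouped by split.

    annotations: iterable of (class_id, width, height) box annotations.
    class_split: dict mapping class_id -> 'base' or 'novel' (hypothetical split).
    """
    ratios = defaultdict(list)
    for class_id, w, h in annotations:
        if w > 0 and h > 0:
            ratios[class_split[class_id]].append(math.log(w / h))
    return ratios

def mean(xs):
    return sum(xs) / len(xs)

# Toy example: base classes are roughly square, novel classes are elongated.
boxes = [(0, 100, 100), (0, 90, 110), (1, 300, 60), (1, 280, 70)]
split = {0: "base", 1: "novel"}
r = log_aspect_ratios(boxes, split)
shift = abs(mean(r["novel"]) - mean(r["base"]))
print(f"mean log-aspect-ratio shift: {shift:.2f}")
```

A split with a shift near zero (like the conventional benchmarks) hides the bias; a split with a large shift (like the hypothetical one above) exposes it.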
In particular, we describe ARShift benchmarks that simulate the real-world scenario where the aspect ratio distribution shift is severe. RPNs in state-of-the-art few-shot detectors are heavily biased towards familiar aspect ratio distributions, and so have weaker than necessary performance on non-standard splits because their RPNs are unfair. Evaluation practice should focus on performance under hard splits: in few-shot detection applications, a more robust RPN will be more reliable, because applications typically offer no guarantees about the aspect ratio of novel classes. We show how to build a more robust RPN by training multiple RPN classifiers to be specialized but cooperative. Our CoRPN's can specialize (and a degree of specialization emerges naturally), but our cooperation constraints discourage individual RPN classifiers from overspecializing and so facing generalization problems. CoRPN's are competitive with SOTA on widely used conventional few-shot detection benchmarks, using conventional splits. But on our ARShift benchmarks with hard splits based on PASCAL VOC, MS-COCO, and LVIS (Everingham et al., 2010; Lin et al., 2014; Kang et al., 2019; Wang et al., 2020), they beat SOTA, because they are more robust to shifts in aspect ratio distribution. Our contributions: (1) We show the bias has severe effects on detector performance, and describe ARShift benchmarks that evaluate these effects. (2) We describe a general approach to improving RPN robustness to distribution shifts; our CoRPN construction works with many types of few-shot detectors. (3) We show that the performance improvements from CoRPN's result from improved fairness. (4) Our CoRPN's are competitive with SOTA on widely used conventional benchmarks, but on the hard splits in ARShift they beat SOTA, because they are fair.
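The interplay of specialization and cooperation can be sketched as a per-anchor loss over the objectness scores of N RPN classifiers. This is a simplified illustration, not the paper's exact formulation: the winner-take-all cross-entropy term, the cooperation floor `tau`, and the hinge penalty are all assumptions made for exposition.

```python
import math

def corpn_loss(scores, is_foreground, tau=0.3):
    """Toy CoRPN-style loss for a single anchor.

    scores: objectness probabilities in (0, 1), one per RPN classifier.
    is_foreground: whether the anchor overlaps a ground-truth box.

    Specialization: only the most confident classifier pays the usual
    binary cross-entropy for this anchor, so classifiers drift towards
    different box shapes. Cooperation: on foreground anchors, every
    other classifier is penalized if its score falls below the floor
    tau, so no classifier learns to fully ignore a box shape.
    """
    best = max(range(len(scores)), key=lambda i: scores[i])
    target = 1.0 if is_foreground else 0.0
    p = scores[best]
    bce = -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))
    coop = 0.0
    if is_foreground:
        coop = sum(max(0.0, tau - s) for i, s in enumerate(scores) if i != best)
    return bce + coop

# A foreground anchor where one classifier is confident and another has
# collapsed to near zero: the cooperation term penalizes the collapse.
print(corpn_loss([0.9, 0.05, 0.4], is_foreground=True))
```

The design intuition: the winner-take-all term allows specialization to emerge, while the floor term is the cooperation constraint that keeps a backup classifier available for boxes the winner might miss.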

2. RELATED WORK

Object Detection with Abundant Data. There are two families of detector architecture, both relying on the fact that one can quite reliably tell whether an image region contains an object, independent of category (Endres & Hoiem, 2010; van de Sande et al., 2011). In serial detection, a proposal process (RPN in what follows) offers the classifier a selection of locations likely to contain objects, and the classifier labels them. This family includes R-CNN and its variants (Girshick, 2015; Girshick et al., 2014; He et al., 2017; Ren et al., 2015). In parallel detection, there is no explicit proposal step; these methods can be faster, but the accuracy may be lower. This family includes YOLO and its variants (Bochkovskiy et al., 2020; Redmon & Farhadi, 2017; Redmon et al., 2016; Redmon & Farhadi, 2018), SSD (Liu et al., 2016), point-based detectors such as CornerNet (Law & Deng, 2018) and ExtremeNet (Zhou et al., 2019), and emerging transformer-based methods exemplified by DETR (Carion et al., 2020). This paper identifies an issue with the proposal process that can impede strong performance when there is very little training data (the few-shot case). The effect is described in the context of two-stage detection, but likely occurs in one-stage detection too.



Figure 1: The RPN is severely affected by the shift in object aspect ratio distribution from base to novel classes, leading to degraded few-shot detection performance. After training on the base classes, which are mostly boxy objects (bike, chair, table, tv, animals, etc.), (Left) the state-of-the-art few-shot detector DeFRCN (Qiao et al., 2021), built on the conventional RPN, misses the elongated novel-class (train) object at the proposal stage and generates no proposal with IoU_gt > 0.7 - this is a disaster (the classifier of DeFRCN will not see a train box proposal, and so it cannot detect the train). By contrast, (Right) our CoRPN's remedy this issue and thus improve few-shot detection for DeFRCN. The red box is the ground-truth box of the novel-class train object; green boxes are box proposals output by the model. We plot positive proposals with IoU_gt > 0.7, following Qiao et al. (2021).
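The IoU_gt > 0.7 criterion in the caption is the standard intersection-over-union between a proposal and the ground-truth box. A minimal sketch, with boxes as `(x1, y1, x2, y2)` corners (the toy "train" box is a made-up illustration):

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# An elongated ground-truth box vs. a centered square proposal of the
# same area: the aspect ratio mismatch alone keeps IoU well below 0.7,
# so an RPN that only proposes squarish boxes cannot clear the bar.
gt = (0, 0, 400, 100)          # wide "train"-like ground-truth box
square = (100, -50, 300, 150)  # square proposal with the same area
print(f"IoU = {iou(gt, square):.2f}")
```

This is why a shape-biased RPN fails hard rather than gracefully: no amount of confidence in a wrongly shaped proposal can push its IoU past the positive threshold.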

