CLOSING THE GENERALIZATION GAP IN ONE-SHOT OBJECT DETECTION

Anonymous

Abstract

Despite substantial progress in object detection and few-shot learning, detecting objects based on a single example (one-shot object detection) remains a challenge. A central problem is the generalization gap: object categories used during training are detected much more reliably than novel ones. Here we show that this generalization gap can be nearly closed by increasing the number of object categories used during training. Doing so allows us to beat the state of the art on COCO by 5.4% AP50 (from 22.0 to 27.5) and to improve generalization from seen to unseen classes from 45% to 89%. We verify that the effect is caused by the number of categories rather than the amount of data, and that it holds for different models, backbones and datasets. This result suggests that the key to strong few-shot detection models may not lie in sophisticated metric learning approaches, but instead simply in scaling the number of categories. We hope that our findings will help to better understand the challenges of few-shot learning and encourage future data annotation efforts to focus on wider datasets with a broader set of categories rather than on gathering more samples per category.

[Figure 1: A. Search Task. B. Generalization to Novel Categories.]

1. INTRODUCTION

It is January 2021 and your long-awaited household robot finally arrives. Equipped with the latest "Deep Learning Technology", it can recognize over 21,000 objects. Your initial excitement quickly vanishes as you realize that your casserole is not one of them. When you contact customer service, they ask you to send some pictures of the casserole so they can fix this. The fix will take some time, though, as they need to collect about a thousand images of casseroles to retrain the neural network. While you are making the call, your robot knocks over the olive oil because the steam coming from the pot of boiling water confused it. You start filling out the return form ...

While not 100% realistic, the story above highlights an important obstacle towards truly autonomous agents such as household robots: such systems should be able to detect novel, previously unseen objects and learn to recognize them based on (ideally) a single example. Solving this one-shot object detection problem can be decomposed into three subproblems: (1) designing a class-agnostic object proposal mechanism that detects both known and previously unseen objects; (2) learning a suitably general visual representation (metric) that supports recognition of the detected objects; (3) continuously updating the classifier to accommodate new object classes or new training examples of existing classes. In this paper, we focus on the detection and representation learning parts of the pipeline, and we ask: what does it take to learn a visual representation that allows detection and recognition of previously unseen object categories based on a single example? We operationalize this question using an example-based visual search task (Fig. 1) that has been investigated before using handwritten characters and real-world image datasets.

To test the hypothesis that wider datasets improve generalization, we increase the number of object categories during training by using datasets (LVIS, Objects365) that have a larger number of annotated categories. Our experiments support this hypothesis and suggest the following conclusions:
• The generalization gap between training and novel categories is a key problem in one-shot object detection.
• This generalization gap can be almost closed by increasing the number of categories used for training: going from 80 classes in COCO to 1,200 in LVIS improves relative performance from 45% to 89%.
• The number of categories, not the amount of data, is the driving force behind this effect.
• Closing the generalization gap allows us to use established methods from the object detection community (e.g. stronger backbones) to make further progress.
• We use these insights to improve state-of-the-art performance on COCO by 5.4% AP50 (from 22.0% to 27.5% AP50) using annotations from LVIS.
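The "relative performance" figures above can be illustrated with a small sketch. The function name and the per-category scores below are made up for illustration and are not taken from the paper; the metric is understood here as the ratio of mean AP50 on novel categories to mean AP50 on seen categories:

```python
# Hypothetical illustration of "relative performance": mean AP50 on novel
# (unseen) categories divided by mean AP50 on categories seen during training.
# All numbers are toy values, not results from the paper.

def relative_performance(ap50_seen, ap50_novel):
    """Ratio of novel-category to seen-category detection quality, in percent."""
    mean = lambda xs: sum(xs) / len(xs)
    return 100.0 * mean(ap50_novel) / mean(ap50_seen)

# Toy per-category AP50 scores (fractions).
seen = [0.55, 0.60, 0.65]   # categories annotated in the training set
novel = [0.25, 0.30, 0.26]  # categories only shown at test time

print(f"relative performance: {relative_performance(seen, novel):.0f}%")
```

A value of 100% would mean the generalization gap is fully closed; lower values indicate that novel categories are detected less reliably than seen ones.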

2. RELATED WORK

Object detection. Object detection has seen huge progress since the widespread adoption of DNNs (Girshick et al., 2014; Ren et al., 2015; He et al., 2017; Lin et al., 2017a; Chen et al., 2019a; Wu et al., 2019b; Carion et al., 2020). Similarly, the number of datasets has grown steadily, fueled by the importance of this task for computer vision applications (Everingham et al., 2010; Russakovsky et al., 2015; Lin et al., 2014; Zhou et al., 2017; Neuhold et al., 2017; Krasin et al., 2017; Gupta et al., 2019; Shao et al., 2019). However, most models and datasets focus on scenarios where abundant examples per category are available.

Few-shot learning. The two most common approaches to few-shot learning have been, broadly speaking, based on metric learning (Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017) and on meta learning, i.e. learning a good way to learn a new task (Finn et al., 2017; Rusu et al., 2018), or combinations thereof (Sun et al., 2019). However, recent work has shown that much simpler approaches based on transfer learning achieve competitive performance (Chen et al., 2019b; Nakamura & Harada, 2019; Dhillon et al., 2019). A particularly impressive example of this line of work is Big Transfer (Kolesnikov et al., 2019), which uses transfer learning from a huge architecture trained on a huge dataset to perform one-shot ImageNet classification.



Figure 1: A. One-shot object detection: identify and localize all objects of a certain category within a scene based on a (single) instructive example. B. Increasing the number of categories used during training reduces the generalization gap to novel categories presented at test time (in parentheses: number of categories in each dataset).

This task has been studied both on handwritten characters (Omniglot; Michaelis et al., 2018a) and on real-world image datasets (Pascal VOC, COCO; Michaelis et al., 2018b; Hsieh et al., 2019; Zhang et al., 2019; Fan et al., 2020; Li et al., 2020). Our central hypothesis is that scaling up the number of object categories used for training should improve the generalization capabilities of the learned representation. This hypothesis is motivated by the following observations. On (cluttered) Omniglot (Michaelis et al., 2018a), recognition of novel characters works almost as well as recognition of characters seen during training. In this case, sampling enough categories during training relative to the visual complexity of the objects is sufficient to learn a metric that generalizes to novel categories. In contrast, models trained on visually more complex datasets like Pascal VOC and COCO exhibit a large generalization gap: novel categories are detected much less reliably than categories seen during training. This result suggests that on natural image datasets the number of categories is too small given the visual complexity of the objects, and the models retreat to a shortcut (Geirhos et al., 2020): memorizing the training categories.
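To make the notion of "learning a metric" concrete, here is a minimal sketch, not the authors' model, of how a metric-based one-shot matcher scores candidate detections: each proposal is accepted if its feature embedding is sufficiently similar to the embedding of the single example. The embeddings, threshold, and function names are illustrative assumptions:

```python
# Minimal sketch of metric-based one-shot matching (illustrative, not the
# paper's architecture): score each candidate box by the cosine similarity
# between its feature embedding and the embedding of the single example.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def match_proposals(example_emb, proposal_embs, threshold=0.5):
    """Return indices of proposals whose similarity to the example passes the threshold."""
    return [i for i, emb in enumerate(proposal_embs)
            if cosine_similarity(example_emb, emb) >= threshold]

# Toy embeddings: proposal 0 resembles the example, proposal 1 does not.
example = [1.0, 0.0, 1.0]
proposals = [[0.9, 0.1, 1.1], [0.0, 1.0, 0.0]]
print(match_proposals(example, proposals))  # [0]
```

Under this view, the generalization gap is a property of the learned embedding: if the metric only separates training categories, similarity scores for novel categories become unreliable.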

