BREAKING BEYOND COCO OBJECT DETECTION

Abstract

The COCO dataset has become the de facto standard for training and evaluating object detectors. According to recent benchmarks, however, performance on this dataset is still far from perfect, which raises the following questions: a) how far can we improve accuracy on this dataset using deep learning, b) what is holding us back from making progress in object detection, and c) what are the limitations of the COCO dataset and how can they be mitigated? To answer these questions, first, we propose a systematic approach to determine the empirical upper bound in AP over COCOval2017, and show that this upper bound is significantly higher than the state-of-the-art mAP (78.2% vs. 58.8%). Second, we introduce two datasets complementary to COCO: i) COCO_OI, composed of images from COCO and OpenImages (from the 80 classes in common) with 1,418,978 training bounding boxes over 380,111 images and 41,893 validation bounding boxes over 18,299 images, and ii) ObjectNet_D, containing objects in daily-life situations (originally created for object recognition and known as ObjectNet; 29 categories in common with COCO). We evaluate models on these datasets and pinpoint annotation errors in the COCO validation set. Third, we characterize the sources of errors in modern object detectors using a recently proposed error analysis tool (TIDE) and find that models behave differently on these datasets compared to COCO. For instance, missing objects are more frequent in the new datasets. We also find that models lack out-of-distribution generalization. Code and data will be shared.

1. INTRODUCTION

Object recognition is believed, although this is debatable, to be solved in computer vision, as witnessed by the "superhuman" performance of state-of-the-art (SOTA) models (∼3% vs. ∼5% top-5 error rate for machine vs. human on ImageNet (He et al., 2016; Dosovitskiy et al., 2020; Russakovsky et al., 2015)). Unlike object recognition, however, object detection remains largely unsolved, and models perform far below the theoretical upper bound (mAP=1). The best performances on COCOval2017 and COCOtest-dev2017 are 58.8% and 61%, respectively, based on the COCO detection leaderboard and CodaLab results. According to the results of the 2019 OpenImages detection challenge, the best mAP on that dataset is 65.9%. Inspired by the recent stream of work examining the accuracy and out-of-distribution generalization of recognition models (Recht et al., 2019; Shankar et al., 2020; Beyer et al., 2020), we strive to understand why detection performance is poor and to study the limitations of object detectors and the ways models and datasets can be improved. Several years of extensive research in object detection have resulted in the accumulation of an overwhelming amount of knowledge regarding model backbones, tips and tricks for model training, optimization, data collection, augmentation, annotation, model evaluation, and comparison, to the point that separating the wheat from the chaff is very difficult (Zou et al., 2019; Zhang et al., 2019). For example, getting all the details right in implementing average precision (AP) is frustratingly difficult (see supplement); a quick Google search returns several blogs and code repositories with discrepant explanations of AP. In addition, it is not clear whether AP has started to saturate, whether a small improvement in AP (e.g., 56.1 vs. 56 mAP) is meaningful, and, more importantly, how much we can improve by following the current trend, making one wonder whether we have reached the peak of detection performance achievable with deep learning¹.
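Given how often AP implementations disagree, it may help to make the definition concrete. The sketch below computes COCO-style 101-point interpolated AP for a single class at a single IoU threshold, assuming the greedy matching of detections to ground truth has already produced the `is_tp` flags; the function name and interface are illustrative, not taken from any particular codebase.

```python
def average_precision(scores, is_tp, num_gt):
    """COCO-style 101-point interpolated AP for one class at one IoU threshold.

    scores: detection confidences; is_tp[i]: whether detection i was greedily
    matched to a previously unmatched ground-truth box; num_gt: number of
    ground-truth boxes for this class.
    """
    # Rank detections by confidence, then sweep to build the PR curve.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    recall, precision, tp = [], [], 0
    for rank, i in enumerate(order, start=1):
        tp += 1 if is_tp[i] else 0
        recall.append(tp / max(num_gt, 1))
        precision.append(tp / rank)
    # Average the interpolated precision at 101 equally spaced recall points;
    # interpolation takes the maximum precision at any recall >= r.
    ap = 0.0
    for k in range(101):
        r = k / 100
        ps = [p for p, rc in zip(precision, recall) if rc >= r]
        ap += max(ps) if ps else 0.0
    return ap / 101
```

For three detections with confidences (0.9, 0.8, 0.7), outcomes (TP, FP, TP), and two ground-truth boxes, this yields ≈0.835; small implementation choices (11-point vs. 101-point sampling, direction of interpolation) are exactly where published AP numbers diverge.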
A critical concern here is that detection datasets may not be big enough to capture variations in object size², viewpoint, occlusion, and spatial relationships among objects. In other words, scaling object detection seems to be much more challenging than scaling object recognition. For these reasons, object detection can be considered a key task for assessing the promises and limits of deep learning in computer vision. Contributions. To shed light on blind spots that could be holding back progress, we carefully and systematically approximate the empirical upper bound in AP (UAP). We define UAP as the score of an object detector with access to the ground-truth bounding boxes: an object recognition model is trained on the training target bounding boxes and is then used to label the test target boxes (i.e., localization is assumed to be solved). In a nutshell, we find that there is a large gap between the best model mAP and the empirical upper bound, as shown in Fig. 1. The gap is wider at higher IOUs and over small objects. Using the latest results on the COCO dataset (Fig. 1), the gap is narrow at IOU=0.5 (∼2 points). The computed UAP suggests that there is hope to reach this peak with current tools, if we can find smarter ways to adapt object recognition models and backbones for object detection, but going beyond it may require major breakthroughs. The current trend in deep learning, in particular in recognition and detection, has been to scale up the datasets. Less effort, however, has been spent on systematically inspecting errors in datasets. Here, we perform such an inspection and investigate the out-of-distribution generalization of object detectors by introducing new validation sets. We also introduce an extension of COCO by integrating it with OpenImages. Finally, we identify the bottlenecks in detectors and characterize the types of errors they make on these new sets.

¹ This is also known as ceiling analysis.
² The median scale of the object relative to the image in ImageNet vs. COCO is 554 and 106, respectively. Therefore, most object instances in COCO are smaller than 1% of the image area (Singh & Davis, 2018).

Figure 1: Upper bound AP (UAP), shown in red, and scores of the best models using the COCO evaluation tool (shown in black; numbers come from different models). Results over COCOval2017 are compiled from the top entries of the latest challenge on COCO, available at the COCO detection leaderboard; see also paperswithcode.com coco_minival. Notice that the gap at AP50 is almost closed on COCO. There is, however, still a large gap between the best models and UAP. The gap is wider over higher IOUs and small objects. Note: higher mAPs have been reported over the VOC benchmark (paperswithcode.com VOC) using the VOC evaluation tool (89.1 AP50, Ghiasi et al. (2020), which is close to the 89.5 UAP computed by us using the VOC tool; please see also Appx. A).

Two lines of work relate to our study. The first one includes works that strive to understand object detectors, identify their shortcomings, and pinpoint where more research is needed. Parikh & Zitnick (2011) aimed to find the weakest links in person detectors by replacing different components in a pipeline (e.g., part detection, non-maximum suppression) with human annotations. Mottaghi et al. (2015) proposed human-machine CRFs for identifying bottlenecks in scene understanding models. Hoiem et al. (2012) inspected detection models in terms of their localization errors and confusion with other classes and the background over the PASCAL VOC dataset. They also conducted a meta-analysis to measure the impact of object properties such as color, texture, and real-world size on detection performance. To overcome the shortcomings of Hoiem et al.'s approach and the COCO analysis tools, Bolya et al. (2020) recently proposed to analyze models by emphasizing the order in which errors are considered. Russakovsky et al. (2013) analyzed the ImageNet localization task with an emphasis on fine-grained recognition. Zhang et al. (2016) measured how far we are from solving pedestrian detection. Vondrick et al. (2013) proposed a method for visualizing object detection features to gain insights into their functioning. Other related works in this regard include Li et al. (2019), Zhu et al. (2012), Zhang et al. (2014), Goldman et al. (2019), and Petsiuk et al. (2020). The second line of work concerns research comparing object detection models. Some works have analyzed and reported statistics and performances over benchmark datasets such as PASCAL VOC (Everingham et al., 2010; 2015), COCO (Lin et al., 2014), CityScapes (Cordts et al., 2016), and OpenImages (Kuznetsova et al., 2018). Recently, Huang et al. (2017) performed a speed/accuracy trade-off analysis of modern object detectors. Dollar et al. (2011) and Borji et al. (2015) compared
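The upper-bound computation described in the introduction is simple once localization is granted: every ground-truth box becomes a detection with IoU of 1 to its own ground truth, so a detection is a true positive exactly when the recognition model's label matches. A minimal sketch under that assumption follows; the function and argument names are ours, purely illustrative, not from the released code.

```python
def class_upper_bound_ap(pred_labels, confidences, gt_labels, cls):
    """Empirical upper-bound AP for one class: ground-truth boxes themselves
    serve as detections (localization assumed solved), so a detection is a
    true positive iff the classifier's label matches the ground truth.

    pred_labels/confidences: per-box outputs of a recognition model run on
    ground-truth crops; gt_labels: the true label of each box.
    """
    # Detections for this class: the boxes the classifier labeled `cls`,
    # ranked by confidence; each is a hit iff its true label is `cls`.
    dets = sorted(
        ((c, g == cls) for p, c, g in zip(pred_labels, confidences, gt_labels)
         if p == cls),
        key=lambda d: -d[0],
    )
    num_gt = sum(g == cls for g in gt_labels)
    if num_gt == 0:
        return 0.0
    recall, precision, tp = [], [], 0
    for rank, (_, hit) in enumerate(dets, start=1):
        tp += 1 if hit else 0
        recall.append(tp / num_gt)
        precision.append(tp / rank)
    # 101-point interpolated AP, as in the COCO evaluation protocol.
    ap = 0.0
    for k in range(101):
        r = k / 100
        ps = [p for p, rc in zip(precision, recall) if rc >= r]
        ap += max(ps) if ps else 0.0
    return ap / 101
```

Averaged over classes (every IoU threshold is trivially satisfied here), this approximates the UAP plotted in red in Fig. 1, modulo the exact evaluation settings used in our experiments.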

