BREAKING BEYOND COCO OBJECT DETECTION

Abstract

The COCO dataset has become the de facto standard for training and evaluating object detectors. According to recent benchmarks, however, performance on this dataset is still far from perfect, which raises the following questions: a) how far can we improve accuracy on this dataset using deep learning, b) what is holding us back from making progress in object detection, and c) what are the limitations of the COCO dataset and how can they be mitigated? To answer these questions, first, we propose a systematic approach to determine the empirical upper bound in AP over COCOval2017, and show that this upper bound is significantly higher than the state-of-the-art mAP (78.2% vs. 58.8%). Second, we introduce two datasets complementary to COCO: i) COCO_OI, composed of images from COCO and OpenImages (from the 80 classes in common), with 1,418,978 training bounding boxes over 380,111 images and 41,893 validation bounding boxes over 18,299 images, and ii) ObjectNet_D, containing objects in daily-life situations (originally created for object recognition, known as ObjectNet; 29 categories in common with COCO). We evaluate models on these datasets and pinpoint annotation errors in the COCO validation set. Third, we characterize the sources of error in modern object detectors using a recently proposed error analysis tool (TIDE) and find that models behave differently on these datasets than on COCO; for instance, missing objects are more frequent in the new datasets. We also find that models lack out-of-distribution generalization. Code and data will be shared.

1. INTRODUCTION

Object recognition is widely believed, though debatably, to be solved in computer vision, as witnessed by the "superhuman" performance of state-of-the-art (SOTA) models (∼3% machine vs. ∼5% human top-5 error rate on ImageNet (He et al., 2016; Dosovitskiy et al., 2020; Russakovsky et al., 2015)). Unlike object recognition, however, object detection remains largely unsolved, and models perform far below the theoretical upper bound (mAP = 1). The best performance on COCOval2017 and COCOtest-dev2017 is 58.8% and 61%, respectively, based on the COCO detection leaderboard and CodaLab results. According to the results of the 2019 OpenImages detection challenge, the best mAP on that dataset is 65.9%. Inspired by the recent stream of work examining the accuracy and out-of-distribution generalization of recognition models (Recht et al., 2019; Shankar et al., 2020; Beyer et al., 2020), we strive to understand why detection performance is poor and to study the limitations of object detectors and the ways models and datasets can be improved. Several years of extensive research in object detection have produced an overwhelming amount of knowledge about model backbones; tips and tricks for model training; optimization; data collection, augmentation, and annotation; and model evaluation and comparison, to the point that separating the wheat from the chaff is very difficult (Zou et al., 2019; Zhang et al., 2019). For example, getting all the details right when implementing average precision (AP) is frustratingly difficult (see supplement). A quick Google search returns several blogs and code repositories with conflicting explanations of AP. In addition, it is not clear whether AP has started to saturate, whether a small improvement in AP (e.g., 56.1 vs. 56 mAP) is meaningful, and, more importantly, how much we can improve by following the current trend, making one wonder whether we have reached the peak of detection performance using deep learning¹.
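As an illustration of the subtleties involved, a minimal sketch of all-point interpolated AP for a single class at a fixed IoU threshold might look as follows. The function and variable names are ours, not from the official COCO API; a real implementation must additionally handle score ties, per-image greedy matching against ground truth, and "crowd"/ignore regions, which is where many re-implementations diverge.

```python
def average_precision(scores, is_true_positive, num_gt):
    """All-point interpolated AP for one class at one IoU threshold.

    scores: confidence of each detection.
    is_true_positive: whether each detection matched a ground-truth box.
    num_gt: total number of ground-truth boxes for this class.
    """
    # Rank detections by descending confidence.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Make the precision curve monotonically non-increasing
    # (the "precision envelope").
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Integrate precision over recall.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap
```

COCO mAP then averages this quantity over classes and over IoU thresholds from 0.5 to 0.95 in steps of 0.05, which is a further point on which informal descriptions often disagree.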
A critical concern here is that detection datasets may not be big enough to capture variations in object size², viewpoint, occlusion, and spatial relationships



¹ This is also known as ceiling analysis.
² The median scale of an object relative to the image is 0.554 in ImageNet vs. 0.106 in COCO; hence, most object instances in COCO are smaller than 1% of the image area (Singh & Davis, 2018).
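For concreteness, the relative scale in this statistic is the square root of the object's area fraction (per the definition in Singh & Davis, 2018), so a median scale of 0.106 corresponds to roughly 1.1% of the image area. A small sketch, with illustrative names of our own:

```python
import math

def relative_scale(box_w, box_h, img_w, img_h):
    """Relative scale of an object: sqrt of its fraction of the image area."""
    return math.sqrt((box_w * box_h) / (img_w * img_h))

# A box whose relative scale is 0.106 covers 0.106**2 ~ 1.1% of the image,
# so COCO's median object occupies only about 1% of its image area.
```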




