TOWARDS NOISE-RESISTANT OBJECT DETECTION WITH NOISY ANNOTATIONS

Abstract

Training deep object detectors requires large amounts of human-annotated images with accurate object labels and bounding box coordinates, which are extremely expensive to acquire. Noisy annotations are much more easily accessible, but they can be detrimental to learning. We address the challenging problem of training object detectors with noisy annotations, where the noise contains a mixture of label noise and bounding box noise. We propose a learning framework that jointly optimizes object labels, bounding box coordinates, and model parameters by alternating between noise correction and model training. To disentangle label noise and bounding box noise, we propose a two-step noise correction method. The first step performs class-agnostic bounding box correction, and the second step performs label correction and class-specific bounding box refinement. We conduct experiments on the PASCAL VOC and MS-COCO datasets with both synthetic noise and machine-generated noise. Our method achieves state-of-the-art performance by effectively cleaning both label noise and bounding box noise.¹

1. INTRODUCTION

The remarkable success of modern object detectors largely relies on large-scale datasets with extensive bounding box annotations. However, it is extremely expensive and time-consuming to acquire high-quality human annotations. For example, annotating each bounding box in ILSVRC requires 42 seconds on Mechanical Turk (Su et al., 2012), whereas the recent OpenImagesV4 (Kuznetsova et al., 2018) reports 7.4 seconds with extreme clicking (Papadopoulos et al., 2017b). On the other hand, there are ways to acquire annotations at lower cost, such as limiting the annotation time, reducing the number of annotators, or using machine-generated annotations. However, these methods yield annotations with both label noise (i.e. wrong classes) and bounding box noise (i.e. inaccurate locations), which can be detrimental to learning.

Learning with label noise has been an active area of research. Some methods perform label correction using the predictions from the model and modify the loss accordingly (Reed et al., 2015; Tanaka et al., 2018). Other methods treat samples with small loss as those with clean labels, and only allow clean samples to contribute to the loss (Jiang et al., 2018b; Han et al., 2018). However, most of these methods focus on the image classification task, where the existence of an object is guaranteed.

Several recent works have studied object detection with noisy annotations. Zhang et al. (2019) focus on the weakly-supervised (WS) setting where only image-level labels are available, and find reliable bounding box instances as those with low classification loss. Gao et al. (2019) study a semi-supervised (SS) setting where the training data contains a small amount of fully-labeled bounding boxes and a large amount of image-level labels, and propose to distill knowledge from a detector pretrained on clean annotations. However, these methods require access to some clean annotations.
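
To make the small-loss criterion mentioned above concrete, the following is a minimal sketch for the image classification setting. It is not the implementation of any of the cited methods: the function names, the toy predictions, and the `keep_ratio` parameter are hypothetical choices made only for illustration. Samples are ranked by their cross-entropy loss under the current model, and only the lowest-loss fraction is kept as presumed-clean.

```python
import math


def cross_entropy(probs, label):
    """Cross-entropy of one sample given its predicted class probabilities."""
    return -math.log(max(probs[label], 1e-12))


def select_small_loss(pred_probs, noisy_labels, keep_ratio):
    """Return (kept_indices, losses), keeping the keep_ratio fraction of
    samples with the smallest loss. keep_ratio is a hypothetical knob."""
    losses = [cross_entropy(p, y) for p, y in zip(pred_probs, noisy_labels)]
    num_keep = max(1, int(round(keep_ratio * len(losses))))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:num_keep]), losses


def mean_loss_over(losses, kept):
    """Average loss over the kept samples only; the rest contribute nothing."""
    return sum(losses[i] for i in kept) / len(kept)


# Toy batch (assumed values): four samples, three classes.
pred_probs = [
    [0.9, 0.05, 0.05],  # model agrees with the given label 0
    [0.1, 0.8, 0.1],    # model agrees with the given label 1
    [0.7, 0.2, 0.1],    # given label is 2, model prefers 0: large loss
    [0.2, 0.2, 0.6],    # model agrees with the given label 2
]
noisy_labels = [0, 1, 2, 2]

kept, losses = select_small_loss(pred_probs, noisy_labels, keep_ratio=0.75)
batch_loss = mean_loss_over(losses, kept)
```

In this toy batch the third sample has by far the largest loss and is the one excluded. For detection, a large classification loss on a box could equally come from a wrong class or from a badly placed box, so this criterion alone does not say which kind of noise is present.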

In this work, we address a more challenging and practical problem, where the annotation contains an unknown mixture of label noise and bounding box noise. Furthermore, we do not assume access to any clean annotations. The entanglement of label noise and bounding box noise makes noise correction more difficult. A commonly used noise indicator, namely the classification loss, cannot distinguish label noise from bounding box noise. Furthermore, it is problematic to correct noise directly using the model predictions, because label correction requires accurate

¹ Code will be released.

