TOWARDS NOISE-RESISTANT OBJECT DETECTION WITH NOISY ANNOTATIONS

Abstract

Training deep object detectors requires large amounts of human-annotated images with accurate object labels and bounding box coordinates, which are extremely expensive to acquire. Noisy annotations are much more easily accessible, but they can be detrimental to learning. We address the challenging problem of training object detectors with noisy annotations, where the noise contains a mixture of label noise and bounding box noise. We propose a learning framework that jointly optimizes object labels, bounding box coordinates, and model parameters by performing alternating noise correction and model training. To disentangle label noise and bounding box noise, we propose a two-step noise correction method. The first step performs class-agnostic bounding box correction, and the second step performs label correction and class-specific bounding box refinement. We conduct experiments on the PASCAL VOC and MS-COCO datasets with both synthetic noise and machine-generated noise. Our method achieves state-of-the-art performance by effectively cleaning both label noise and bounding box noise.

1. INTRODUCTION

The remarkable success of modern object detectors largely relies on large-scale datasets with extensive bounding box annotations. However, it is extremely expensive and time-consuming to acquire high-quality human annotations. For example, annotating each bounding box in ILSVRC requires 42s on Mechanical Turk (Su et al., 2012), whereas the recent OpenImagesV4 (Kuznetsova et al., 2018) reports 7.4 seconds with extreme clicking (Papadopoulos et al., 2017b). On the other hand, there are ways to acquire annotations at lower cost, such as limiting the annotation time, reducing the number of annotators, or using machine-generated annotations. However, these methods yield annotations with both label noise (i.e., wrong classes) and bounding box noise (i.e., inaccurate locations), which can be detrimental to learning.

Learning with label noise has been an active area of research. Some methods perform label correction using the predictions from the model and modify the loss accordingly (Reed et al., 2015; Tanaka et al., 2018). Other methods treat samples with small loss as those with clean labels, and only allow clean samples to contribute to the loss (Jiang et al., 2018b; Han et al., 2018). However, most of these methods focus on the image classification task, where the existence of an object in the image is guaranteed.

Several recent works have studied object detection with noisy annotations. Zhang et al. (2019) focus on the weakly-supervised (WS) setting where only image-level labels are available, and identify reliable bounding box instances as those with low classification loss. Gao et al. (2019) study a semi-supervised (SS) setting where the training data contains a small amount of fully-labeled bounding boxes and a large amount of image-level labels, and propose to distill knowledge from a detector pretrained on clean annotations. However, these methods require access to some clean annotations.
In this work, we address a more challenging and practical problem, where the annotations contain an unknown mixture of label noise and bounding box noise. Furthermore, we do not assume access to any clean annotations. The entanglement of label noise and bounding box noise makes noise correction more difficult. A commonly used noise indicator, the classification loss, cannot distinguish label noise from bounding box noise. Furthermore, it is problematic to correct noise directly using the model predictions, because label correction requires accurate bounding box coordinates to crop the object, whereas bounding box correction requires an accurate class label to produce the regression offset.

To overcome these difficulties, we propose a two-step noise correction procedure. In the first step, we perform class-agnostic bounding box correction (CA-BBC), which seeks to decouple bounding box noise from label noise by optimizing the noisy ground-truth (GT) bounding box regardless of its class label. An illustration of CA-BBC is shown in Figure 1. It is based on the following intuition: if a bounding box tightly covers an object, then two diverged classifiers would agree with each other and produce the same prediction. Furthermore, both classifiers would assign a low score to the background class, i.e., a high objectness score. Therefore, we directly regress the noisy GT bounding box to minimize both the classifier discrepancy and the background scores. CA-BBC also has the option to reject a bounding box as a false positive if the objectness score is too low. In the second step, we leverage the model's output for label noise correction and class-specific bounding box refinement. It has been shown that co-training two models can filter different types of noise and help each model learn (Blum & Mitchell, 1998; Han et al., 2018; Yu et al., 2019; Chadwick & Newman, 2019).
Therefore, we distill knowledge from the ensemble of dual detection heads for noise correction, by generating soft labels and bounding box offsets. We show that soft labels with a well-adjusted temperature lead to better performance even on a clean dataset.

To summarize, this paper proposes a noise-resistant learning framework to train object detectors with noisy annotations. The proposed framework jointly optimizes object labels, bounding box coordinates, and model parameters by performing alternating noise correction and model training. We conduct experiments on two benchmarks, PASCAL VOC and MS-COCO, with different levels of synthetic noise as well as machine-generated noise. The proposed method outperforms previous methods by a large margin. We also provide qualitative results to demonstrate the efficacy of the two-step noise correction, and ablation studies to examine the effect of each component.
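The temperature-scaled soft labels from the dual-head ensemble can be sketched as follows. This is a minimal illustration, not the paper's implementation: the equal-weight averaging rule, the logit values, and the default temperature `T=2.0` are assumptions made for the example; the paper only states that soft labels are distilled from the ensemble of dual detection heads with a tuned temperature.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax: larger T yields a softer distribution."""
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def soft_label(logits_a, logits_b, T=2.0):
    """Distill a soft label from two detection heads by averaging their
    temperature-scaled class distributions (the averaging rule is an
    assumption for this sketch)."""
    return 0.5 * (softmax(logits_a, T) + softmax(logits_b, T))
```

As T grows, the generated label moves away from a hard one-hot target toward a flatter distribution; tuning T controls how much of the heads' relative confidence is preserved in the corrected label.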

2. RELATED WORK

2.1. CROWDSOURCING FOR OBJECT DETECTION

Crowdsourcing platforms such as Amazon Mechanical Turk (AMT) have enabled the collection of large-scale datasets. Due to the formidable cost of human annotation, many efforts have been devoted to reducing the annotation cost. However, even an efficient protocol still reports 42.4s to annotate one object in an image (Su et al., 2012). Other methods trade off annotation quality for lower cost, by using click supervision (Papadopoulos et al., 2017a), human-in-the-loop labeling (Russakovsky et al., 2015; Papadopoulos et al., 2016; Konyushkova et al., 2018), or exploiting eye-tracking data (Papadopoulos et al., 2014). These methods focus on reducing human effort, rather than combating annotation noise as our method does.



Code will be released.



Figure 1: Our Class-Agnostic Bounding Box Correction (CA-BBC) disentangles bounding box (bbox) noise from label noise, by directly optimizing the noisy bbox coordinates regardless of the class label. We use two diverged classifiers to predict on the same image region, and update the bbox b to b* by minimizing classifier discrepancy and maximizing region objectness. Boxes with very low objectness are rejected as false positives.
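The b-to-b* update described in the caption can be sketched on a toy problem. Everything below is a stand-in assumption for illustration: the two "heads" are synthetic functions of box/object overlap rather than real detector classifiers, the loss weight `lam`, learning rate, and step count are arbitrary, and numerical gradients replace the backpropagation through the detector that the actual method would use. The sketch only shows the objective: descend on the box coordinates to minimize classifier discrepancy plus background scores.

```python
import numpy as np

TRUE_BOX = np.array([30.0, 30.0, 70.0, 70.0])  # hidden well-localized box

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def head(box, bias):
    """Toy stand-in for one detection head: softmax over [object, background].
    The better the box covers the object, the higher the object score;
    `bias` makes the two heads diverge when the box is off the object."""
    o = iou(box, TRUE_BOX)
    logits = np.array([4.0 * o + bias * (1.0 - o), 2.0 * (1.0 - o)])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def cabbc_loss(box, lam=1.0):
    # classifier discrepancy + both heads' background scores
    p1, p2 = head(box, bias=1.0), head(box, bias=-1.0)
    return np.sum((p1 - p2) ** 2) + lam * (p1[-1] + p2[-1])

def correct_box(box, steps=300, lr=20.0, eps=1e-3):
    """Gradient-descend the noisy box coordinates (central differences here;
    the real method would backpropagate through the detector)."""
    box = np.asarray(box, dtype=float).copy()
    for _ in range(steps):
        grad = np.zeros(4)
        for i in range(4):
            d = np.zeros(4)
            d[i] = eps
            grad[i] = (cabbc_loss(box + d) - cabbc_loss(box - d)) / (2 * eps)
        box -= lr * grad
    return box
```

Running `correct_box` on a noisy box such as `[10, 10, 50, 50]` pulls it toward the well-localized box, since both the discrepancy and background terms shrink as the box covers the object more tightly; a rejection rule (discarding boxes whose final objectness is still too low) would sit on top of this loop.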

