ACTIVE LEARNING FOR OBJECT DETECTION WITH EVIDENTIAL DEEP LEARNING AND HIERARCHICAL UNCERTAINTY AGGREGATION

Abstract

Despite the huge success of object detection, the training process still requires an immense amount of labeled data. Although various active learning solutions for object detection have been proposed, most existing works do not take advantage of epistemic uncertainty, which is an important metric for capturing the usefulness of a sample. In addition, previous works pay little attention to the attributes of each bounding box (e.g., nearest object, box size) when computing the informativeness of an image. In this paper, we propose a new active learning strategy for object detection that overcomes these shortcomings. To make use of epistemic uncertainty, we adopt evidential deep learning (EDL) and propose a new module, termed Model Evidence Head (MEH), that makes EDL highly compatible with object detection. Based on the computed epistemic uncertainty of each bounding box, we propose Hierarchical Uncertainty Aggregation (HUA) for obtaining the informativeness of an image. HUA realigns all bounding boxes into multiple levels based on their attributes and aggregates uncertainties in a bottom-up order to effectively capture the context within the image. Experimental results show that our solution outperforms existing state-of-the-art methods by a considerable margin.

1. INTRODUCTION

Deep learning has driven major successes in computer vision problems such as semantic segmentation (Long et al., 2015; Ronneberger et al., 2015; Chen et al., 2018) and object detection (Liu et al., 2016; Lin et al., 2017; Redmon et al., 2016). However, training a deep neural network typically requires a large labeled dataset. Labeling data for complex vision problems demands intensive work from human experts, which makes preparing for practical applications challenging. Active learning, which gradually labels a set of samples based on their informativeness (e.g., uncertainty), is a promising solution to this problem due to its simplicity and high performance.

Although active learning has been extensively studied for image classification, only a few prior works have focused on object detection (Yuan et al., 2021; Su et al., 2020; Haussmann et al., 2020; Yu et al., 2021) despite its practical importance. Furthermore, existing works on active learning for object detection have two limitations. First, when computing the informativeness of an image, most previous works use only the aleatoric uncertainty and do not take the epistemic uncertainty into account. Epistemic uncertainty, also known as knowledge uncertainty, captures the lack of knowledge of a model (caused by a lack of data) and can be reduced when large amounts of data are available. Aleatoric uncertainty, on the other hand, captures the noise inherent in the observed data and is irreducible. As stated by Nguyen et al. (2022), Hafner et al. (2018), and Hüllermeier & Waegeman (2021), epistemic uncertainty can reflect the usefulness of samples and support active learning better than aleatoric uncertainty. Second, previous works on active learning for object detection generally ignore the attributes of bounding boxes (e.g., nearest object, box size) when computing the informativeness of an image: informativeness is often defined as the maximum or mean of the uncertainty values of all bounding boxes in the image.
This can be a problem because a cluttered image with many objects from various categories can end up with an uncertainty value similar to that of a simple image with only a few objects from a single category.

Goal and challenge. The general goal of this paper is to propose an active learning strategy for object detection that handles the above limitations of existing works. First, we aim to build an algorithm that can compute epistemic uncertainty quickly yet correctly in object detection. Some existing works on object detection (Haussmann et al., 2020; Feng et al., 2019) calculate epistemic uncertainty using multi-model based methods (e.g., model ensembles (Beluch et al., 2018), Monte Carlo (MC) dropout (Gal & Ghahramani, 2016)). However, these methods require multiple models or repeated forward propagations for MC integration, which makes practical application difficult. Second, we aim to design an uncertainty aggregation scheme that considers the attributes of bounding boxes and understands the context within images. This goal is meaningful because the aggregation schemes of previous works (Yuan et al., 2021; Roy et al., 2018; Choi et al., 2021), which simply rely on the maximum or mean over all bounding boxes, struggle to reflect the context in images.

Main contributions. Our first key idea is to adopt Evidential Deep Learning (EDL) to effectively compute epistemic uncertainty, and to propose a new module that makes EDL highly compatible with object detection. Introduced by Sensoy et al. (2018) and Amini et al. (2020), EDL is a useful tool for computing epistemic uncertainty to detect unfamiliar data (e.g., unseen unlabeled data), since it samples model ensembles almost instantly. However, previous works on EDL have mainly focused on image classification, and EDL leads to unconfident predictions and unstable training when applied naively to object detection.
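
As a point of reference, the classification form of EDL from Sensoy et al. (2018) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the MEH-based formulation proposed in this paper: it assumes a network has already produced non-negative per-class evidence, and the function name `dirichlet_uncertainty` is hypothetical. The sketch maps evidence to Dirichlet parameters and reports the uncertainty mass u = K / S, which is commonly read as a proxy for epistemic uncertainty.

```python
from typing import List, Tuple


def dirichlet_uncertainty(evidence: List[float]) -> Tuple[List[float], float]:
    """Return (expected class probabilities, uncertainty mass) for one prediction.

    Follows the classification-style EDL of Sensoy et al. (2018);
    the function name and interface are hypothetical.
    """
    if not evidence:
        raise ValueError("evidence must contain at least one class")
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")

    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]    # Dirichlet parameters: alpha_k = e_k + 1
    strength = sum(alpha)                  # Dirichlet strength: S = sum_k alpha_k
    probs = [a / strength for a in alpha]  # expected class probabilities: alpha_k / S
    uncertainty = num_classes / strength   # uncertainty mass: u = K / S, in (0, 1]
    return probs, uncertainty


# No evidence at all versus strong evidence for the first class.
print(dirichlet_uncertainty([0.0, 0.0, 0.0]))
print(dirichlet_uncertainty([27.0, 0.0, 0.0]))
```

With zero evidence the uncertainty mass reaches its maximum of 1, and it shrinks as evidence accumulates, without an ensemble or repeated forward passes. This classification-style sketch does not by itself address the unconfident predictions and unstable training that arise in object detection.
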
To this end, we propose a new module named Model Evidence Head (MEH) to enable confident prediction and stable training of EDL. Specifically, MEH predicts the expected difficulty, or model evidence, and is optimized independently of the object detector. To our knowledge, this is the first work to make EDL compatible with object detection on 2D images. Another key ingredient of our solution is Hierarchical Uncertainty Aggregation (HUA), which makes use of the attributes of bounding boxes when computing the informativeness of an image. According to these attributes, HUA realigns boxes into multiple levels and aggregates uncertainties in a bottom-up order. This helps to capture the context within the image and improves the quality of the expected informativeness of images.

Overall, our main contributions are summarized as follows:
• We make use of EDL to effectively compute epistemic uncertainty in object detection, and design a new module termed Model Evidence Head (MEH), which predicts only the model evidence, independently of the class confidence, to make EDL adaptable to object detection.
• We propose Hierarchical Uncertainty Aggregation (HUA), which reorganizes all bounding boxes into several levels and aggregates the uncertainties of each level in a bottom-up manner, to better capture the context within the image.

We validate the efficacy of the proposed methods using RetinaNet and SSD as base models on two well-known datasets: PASCAL VOC and MS-COCO. Extensive experiments demonstrate that the proposed methods significantly improve performance, achieving new state-of-the-art results.
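
The limitation of maximum/mean aggregation that motivates HUA can be illustrated with a small sketch. The function `image_score` and the per-box uncertainty values are hypothetical, chosen only for illustration; they are not taken from the experiments, and the sketch shows the baseline aggregation rather than HUA itself.

```python
from statistics import mean
from typing import List


def image_score(box_uncertainties: List[float], mode: str = "max") -> float:
    """Collapse per-box uncertainties into a single image-level score."""
    if not box_uncertainties:
        return 0.0  # assumption: an image with no detections scores zero
    if mode == "max":
        return max(box_uncertainties)
    if mode == "mean":
        return mean(box_uncertainties)
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical per-box uncertainties (illustrative numbers only).
simple_image = [0.8, 0.6]                                    # two objects, one category
cluttered_image = [0.8, 0.6, 0.8, 0.6, 0.8, 0.6, 0.8, 0.6]   # many objects, many categories

for mode in ("max", "mean"):
    print(mode, image_score(simple_image, mode), image_score(cluttered_image, mode))
```

Both modes give the two images the same score even though the cluttered image contains four times as many uncertain boxes. Neither score reflects how many objects are present or how they relate to one another, which is the kind of context HUA aims to capture.
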

2. RELATED WORKS

Active learning for object detection. Active learning (Sinha et al., 2019; Yoo & Kweon, 2019; Sener & Savarese, 2017; Gal et al., 2017; Wang et al., 2016) aims to select a small subset of informative unlabeled samples that is expected to be most effective once labeled. Although active learning has been extensively studied for the classification problem, only a few works focus on object detection (Yuan et al., 2021; Su et al., 2020; Haussmann et al., 2020; Yu et al., 2021), despite its practical importance. Yuan et al. (2021) and Su et al. (2020) train discriminators using unlabeled data to predict whether an image comes from the labeled set or the unlabeled set. LL4AL (Yoo & Kweon, 2019) trains an auxiliary module to predict the loss of data samples, where a sample with a high predicted loss is considered highly informative. The concurrent work CDAL (Agarwal et al., 2020) introduced a distance measure to select diverse samples in semantic and spatial context. Recently, Choi et al. (2021) predict the parameters of Gaussian mixture models (GMMs) and compute epistemic uncertainty as the variance of the Gaussian modes. However, there is no guarantee that model uncertainty leads to variance in the GMM, and the size of the model ensemble is fixed to be small since the number of Gaussian modes cannot change after training. Overall, existing approaches do not fully take advantage of epistemic uncertainty and pay little attention to the attributes of bounding boxes during uncertainty aggregation. Our work addresses both limitations using novel and effective methods.

Bayesian deep learning. The goal of Bayesian deep learning is to build a reliable model that also measures the uncertainty of its decisions. In contrast to the frequentist approach where model

