LA-BALD: AN INFORMATION-THEORETIC IMAGE LABELING TASK SAMPLER

Abstract

Large-scale visual recognition datasets with high-quality labels enable many computer vision applications, but also come with enormous annotation costs, especially since multiple annotators are typically queried per image to obtain a more reliable label. Recent work in label aggregation consolidates human annotations by combining them with the predictions of an online-learned predictive model. In this work, we devise an image labeling task sampler that actively selects image-worker pairs to efficiently reduce the noise in the human annotations and improve the predictive model at the same time. We propose an information-theoretic task sampler, Label Aggregation BALD (LA-BALD), to maximize the information contributed to the labeled dataset by the human annotations and the model. Simulated experiments on ImageNet100-sandbox show that LA-BALD reduces the number of annotations by 19% and 12% on average compared to the two types of baselines. Our analysis shows that LA-BALD yields both more accurate annotations and a better online-learned predictive model, leading to higher labeling efficiency than the baselines.

1. INTRODUCTION

Machine learning has led to large advances in a wide range of applications such as machine translation, early cancer detection, virtual reality, and autonomous driving. Large-scale labeled datasets play a vital role in the success of modern ML. The CheXNet dataset Rajpurkar et al. (2017) benefits automatic chest radiograph interpretation by providing clinical decision support. The Waymo Open dataset Sun et al. (2020) advances machine perception for self-driving vehicles by collecting data from diverse geographic locations. Diverse animal datasets Beery et al. (2018); Swanson et al. (2015) facilitate automatic animal monitoring.

Human annotators play an essential role in creating large-scale, high-quality datasets Vaughan (2018). Inter-human variability, such as a worker's interest in and familiarity with the topic and the perceived task difficulty Kazai et al. (2012), is a major factor governing dataset quality Giuffrida et al. (2018); Jungo et al. (2018). However, inter-human variability is a double-edged sword for data labeling. With an unlimited monetary budget, we can approximate the true label distributions by sampling multiple diverse workers per example Peterson et al. (2019). Under a limited budget, on the other hand, we must trust a single or very few annotators per example and may obtain high-variance, noisy labels.

Given the collected human annotations, label aggregation is a common way to infer the latent true labels Zheng et al. (2017). Usually, the aggregator captures each annotation's quality by estimating the data difficulty and the workers' competencies. Recent success in label aggregation for image labeling Branson et al. (2017); Liao et al. (2021) leverages data similarity via an online-learned predictive model to infer the true labels, increasing labeling efficiency by a large margin. Under this framework, an aggregator's label quality is limited by both the quality of the individual annotations and the performance of the predictive model. In this work, adopting the state-of-the-art label aggregator of Liao et al. (2021), we propose an information-theoretic image labeling task sampler, Label Aggregation BALD (LA-BALD), that targets both bottlenecks: i) which image-worker pair provides the best expected annotation quality, and ii) which image labels benefit the predictive model the most. We formulate these two goals as a single information-maximization problem. As shown in Fig. 1, each annotation provides information to the labeled dataset both locally (via the individual human annotation) and globally (via the online-learned predictive model).

The code is released in the anonymized repository at https://anonymous.4open.science/r/LA-BALD-8B7D/README.md
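To make the information-theoretic objective concrete, the following is a minimal sketch of the classic BALD acquisition score (Bayesian Active Learning by Disagreement, Houlsby et al., 2011) that LA-BALD builds on: the mutual information between a prediction and the model parameters, estimated from Monte Carlo samples of the predictive class distribution. The function name and the sampling setup are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def bald_score(mc_probs):
    """BALD mutual-information score for one image.

    mc_probs: array of shape (S, K) -- S Monte Carlo samples (e.g. from
    dropout or an ensemble) of the predictive distribution over K classes.
    Returns H[E[p(y|x, theta)]] - E[H[p(y|x, theta)]], which is high when
    the samples are individually confident but disagree with each other.
    """
    mc_probs = np.asarray(mc_probs, dtype=float)
    eps = 1e-12  # guard against log(0)
    mean_p = mc_probs.mean(axis=0)
    entropy_of_mean = -np.sum(mean_p * np.log(mean_p + eps))
    mean_entropy = np.mean(-np.sum(mc_probs * np.log(mc_probs + eps), axis=1))
    return entropy_of_mean - mean_entropy
```

The score is zero when all samples agree (the model is uncertain in the same way everywhere) and large when confident samples disagree, which is exactly the disagreement signal an active sampler can exploit.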

Figure 1: Goal of an Image Labeling Task Sampler. The state-of-the-art label aggregator of Liao et al. (2021) leverages an online-learned predictive model to increase labeling efficiency. In this framework, each annotation influences the labeled dataset both locally (via the individual human annotation) and globally (via the online-learned predictive model). The image labeling task sampler therefore needs to maximize the information contributed to the labeled dataset through these local and global influences by choosing image-worker pairs in an iterative fashion.
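The iterative selection process described in the caption can be sketched schematically as follows. All components are stubbed and the names are hypothetical: in the actual system the score would be an information-theoretic quantity that changes as the aggregator and predictive model are updated, whereas here it is a fixed callback.

```python
import itertools

def sample_tasks(images, workers, score_fn, query_fn, update_fn, budget):
    """Schematic image-worker pair selection loop (names illustrative).

    Each round: score every candidate image-worker pair under the current
    state, query the highest-scoring pair for an annotation, and hand the
    new label to the aggregator / predictive-model update.
    """
    annotations = []
    for _ in range(budget):
        # Greedily pick the pair with the highest current score.
        pair = max(itertools.product(images, workers),
                   key=lambda iw: score_fn(*iw))
        label = query_fn(*pair)           # ask the chosen worker for a label
        annotations.append((pair, label))
        update_fn(pair, label)            # refresh aggregator and model
    return annotations
```

In a real deployment `score_fn` would re-rank pairs after every update, so the loop trades annotation cost against the information each new label adds.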

