MCAL: MINIMUM COST HUMAN-MACHINE ACTIVE LABELING

Abstract

Today, ground-truth generation commonly relies on cloud-based annotation services. These services rely on human annotation, which can be prohibitively expensive. In this paper, we consider the problem of hybrid human-machine labeling, which trains a classifier to accurately auto-label part of the data set. However, training the classifier can be expensive too. We propose an iterative approach that minimizes overall cost by, at each step, jointly determining which samples humans should label and which the trained classifier should label. We validate our approach on well-known public data sets such as Fashion-MNIST, CIFAR-10, CIFAR-100, and ImageNet. In some cases, our approach achieves 6× lower overall cost than human-labeling the entire data set, and it is always cheaper than the cheapest competing strategy.

1. INTRODUCTION

Ground-truth is crucial for training and testing ML models. Generating accurate ground-truth was cumbersome until the recent emergence of cloud-based human annotation services (SageMaker (2021); Google (2021); Figure-Eight (2021)). Users of these services submit data sets and receive, in return, annotations on each data item in the data set. Because these services typically employ humans to generate ground-truth, annotation costs can be prohibitively high, especially for large data sets.

Hybrid Human-machine Annotations. In this paper, we explore a hybrid human-machine approach to reducing annotation costs (in $): humans annotate only a subset of the data items, and a machine learning model trained on this annotated data annotates the rest. The accuracy of a model trained on a subset of the data set will typically be lower than that of human annotators. However, a user of an annotation service might accept this trade-off if (a) targeting a slightly lower annotation quality can significantly reduce costs, or (b) the cost of training a model to a higher accuracy is itself prohibitive. Consequently, this paper focuses on the design of a hybrid human-machine annotation scheme that minimizes the overall cost of annotating the entire data set (including the cost of training the model) while ensuring that the overall annotation accuracy, relative to human annotations, exceeds a pre-specified target (e.g., 95%).

Challenges. In this paper, we consider a specific annotation task: multi-class labeling. We assume that the user of an annotation service provides a set X of data to be labeled and a classifier D to use for machine labeling. The goal is then to find a subset B ⊂ X of samples to human-label and use to train D, then use the trained classifier to label the rest, minimizing total cost while ensuring the target accuracy. A strawman approach might seek to predict the human-labeled subset B in a single shot. This is hard because B depends on several factors: (a) the classifier architecture and how much accuracy it can achieve, (b) how "hard" the data set is, (c) the cost of training and labeling, and (d) the target accuracy. While complex models may achieve high accuracy, their training costs may be high enough to offset the gains obtained through machine-generated annotations. Moreover, some data points in a data set are more informative than others from a model-training perspective, so identifying the "right" subset for human- vs. machine-labeling is key to minimizing the total labeling cost.

Approach. In this paper we propose a novel technique, MCAL¹ (Minimum Cost Active Labeling), that addresses these challenges and minimizes annotation cost across diverse data sets. At its
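To make this setting concrete: if c_h is the per-sample human labeling cost and C_train the (non-negligible) cost of training D, the problem above can be read as choosing B to minimize c_h·|B| + C_train subject to the accuracy of D on X \ B staying above the target. The sketch below is a minimal instantiation of such an iterative hybrid labeling loop, assuming a scikit-learn-style classifier interface (fit, predict, predict_proba) and a paid human_label callable. The fixed batch size, confidence threshold, holdout-based accuracy estimate, and illustrative dollar costs are simplifying assumptions for exposition, not MCAL's actual policy, which jointly chooses the human- and machine-labeled sets at each step to minimize total cost.

    # Minimal sketch (not MCAL itself): iteratively buy human labels,
    # retrain the user-supplied classifier D, and machine-label the
    # samples D is confident about once it meets the accuracy target.
    def hybrid_label(X, classifier, human_label, target_acc=0.95,
                     batch_size=1000, confidence=0.95,
                     cost_per_label=0.04, cost_per_round=1.0):
        """Return (labels, total cost in $) for every sample in X."""
        human, machine = {}, {}        # sample index -> label
        pool = list(range(len(X)))     # indices still awaiting a label
        cost = 0.0
        while pool:
            # 1. Pay humans to label the next batch. (An active-learning
            #    selection rule could replace this sequential choice.)
            batch, pool = pool[:batch_size], pool[batch_size:]
            for i, y in zip(batch, human_label([X[i] for i in batch])):
                human[i] = y
            cost += cost_per_label * len(batch)
            # 2. Retrain D on all human labels so far, holding out 10%
            #    of them to estimate its accuracy. Training is not free.
            idx = sorted(human)
            cut = max(1, int(0.9 * len(idx)))
            classifier.fit([X[i] for i in idx[:cut]],
                           [human[i] for i in idx[:cut]])
            cost += cost_per_round
            holdout = idx[cut:]
            preds = classifier.predict([X[i] for i in holdout]) if holdout else []
            acc = (sum(p == human[i] for p, i in zip(preds, holdout))
                   / len(holdout)) if holdout else 0.0
            # 3. Once estimated accuracy meets the target, machine-label
            #    the samples D is confident about; humans get the rest.
            if acc >= target_acc and pool:
                probs = classifier.predict_proba([X[i] for i in pool])
                preds = classifier.predict([X[i] for i in pool])
                keep = []
                for i, p, yhat in zip(pool, probs, preds):
                    if max(p) >= confidence:
                        machine[i] = yhat
                    else:
                        keep.append(i)
                pool = keep
        return {**human, **machine}, cost

Even this simplified loop exposes the trade-off MCAL optimizes: each additional round of human labels and retraining adds cost, but a more accurate classifier lets a larger fraction of the pool be machine-labeled essentially for free.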

¹ MCAL is available at https://github.com/hangqiu/MCAL

