MCAL: MINIMUM COST HUMAN-MACHINE ACTIVE LABELING

Abstract

Today, ground-truth generation relies on data sets annotated by cloud-based annotation services. These services depend on human annotators, which can be prohibitively expensive. In this paper, we consider the problem of hybrid human-machine labeling, which trains a classifier to accurately auto-label part of the data set. However, training the classifier can be expensive too. We propose an iterative approach that minimizes total overall cost by, at each step, jointly determining which samples to label using humans and which to label using the trained classifier. We validate our approach on well-known public data sets such as Fashion-MNIST, CIFAR-10, CIFAR-100, and ImageNet. In some cases, our approach has 6× lower overall cost relative to human-labeling the entire data set, and it is always cheaper than the cheapest competing strategy.

1. INTRODUCTION

Ground-truth is crucial for training and testing ML models. Generating accurate ground-truth was cumbersome until the recent emergence of cloud-based human annotation services (SageMaker (2021); Google (2021); Figure-Eight (2021)). Users of these services submit data sets and receive, in return, annotations on each data item in the data set. Because these services typically employ humans to generate ground-truth, annotation costs can be prohibitively high, especially for large data sets.

Hybrid Human-machine Annotations. In this paper, we explore a hybrid human-machine approach to reducing annotation costs (in $), in which humans annotate only a subset of the data items and a machine learning model trained on this annotated subset annotates the rest. The accuracy of a model trained on a subset of the data set will typically be lower than that of human annotators. However, a user of an annotation service might accept this trade-off if (a) targeting a slightly lower annotation quality can significantly reduce costs, or (b) the cost of training a model to a higher accuracy is itself prohibitive. Consequently, this paper focuses on the design of a hybrid human-machine annotation scheme that minimizes the overall cost of annotating the entire data set (including the cost of training the model) while ensuring that the overall annotation accuracy, relative to human annotations, exceeds a pre-specified target (e.g., 95%).

Challenges. In this paper, we consider a specific annotation task: multi-class labeling. We assume that the user of an annotation service provides a set X of data to be labeled and a classifier D to use for machine labeling. The goal is then to find a subset B ⊂ X of samples to human-label and use to train D, and to use the trained classifier to label the rest, minimizing total cost while ensuring the target accuracy. A straw-man approach might seek to predict the human-labeled subset B in a single shot.
This is hard to do because B depends on several factors: (a) the classifier architecture and how much accuracy it can achieve, (b) how "hard" the data set is, (c) the costs of training and labeling, and (d) the target accuracy. While complex models may provide high accuracy, their training costs may be high enough to offset the gains obtained through machine-generated annotations. Moreover, some data points are more informative than others from a model-training perspective, so identifying the "right" subsets for human- vs. machine-labeling can minimize the total labeling cost.

Approach. In this paper we propose a novel technique, MCAL* (Minimum Cost Active Labeling), that addresses these challenges and minimizes annotation cost across diverse data sets. At its core, MCAL learns on-the-fly an accuracy model that, given a number of samples to human-label and a number to machine-label, predicts the overall accuracy of the resulting set of labeled samples. Intuitively, this model implicitly captures the complexity of the classifier and the data set. MCAL also uses a cost model for training and labeling. MCAL proceeds iteratively. At each step, it uses the accuracy model and the cost model to search for the combination of human-labeled and machine-labeled sample counts that minimizes total cost. It obtains the human labels, trains the classifier D with them, dynamically updates the accuracy model, and machine-labels the remaining unlabeled samples once it determines that additional training cannot further reduce cost.

MCAL resembles active learning in determining which samples in the data set to select for human-labeling. However, it differs from active learning (Fig. 1) in its goals: active learning seeks to train a classifier to a given target accuracy, while MCAL attempts to label a complete data set within a given error bound. In addition, active learning does not consider training costs, as MCAL does.
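To make the iteration concrete, the per-step budget search can be sketched as follows. This is our simplification, not the authors' implementation: a toy power-law error model and a linear cost model stand in for the ones MCAL learns, and every numeric parameter below is invented purely for illustration.

```python
def predicted_error(n_human, a=0.9, b=0.35, floor=0.02):
    """Toy accuracy model: classifier error falls as a * n^-b,
    truncated at a floor (all parameters here are made up)."""
    return max(floor, a * n_human ** -b)

def total_cost(n_human, cost_per_label=1.0, train_cost_per_sample=0.001):
    """Human-labeling cost plus training cost, the latter assumed
    proportional to training-set size."""
    return (cost_per_label + train_cost_per_sample) * n_human

def min_cost_budget(n_total, target_err, step=1000):
    """Search candidate human-label budgets; return the cheapest one whose
    predicted overall error meets the target. Human labels are assumed
    correct, so only the machine-labeled remainder contributes error."""
    best = None
    for n_human in range(step, n_total + 1, step):
        n_machine = n_total - n_human
        overall_err = predicted_error(n_human) * n_machine / n_total
        if overall_err <= target_err:
            cost = total_cost(n_human)
            if best is None or cost < best[1]:
                best = (n_human, cost)
    return best
```

With a 50,000-sample data set and a 5% overall error target, this toy search settles on a few thousand human labels; real numbers depend entirely on the fitted accuracy and cost models.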
This paper makes the following contributions:

• It casts the minimum-cost labeling problem in an optimization framework (§2) that minimizes total cost by jointly selecting which samples to human-label and which to machine-label. This framing requires a cost model and an accuracy model, as discussed above (§3). For the former, MCAL assumes that total training cost at each step is proportional to training-set size (and derives the cost model parameters by profiling on real hardware). For the latter, MCAL leverages the growing body of literature suggesting that a truncated power-law governs the relationship between model error and training-set size (Cho et al. (2015); Hestness et al. (2017); Sala (2019)).

• The MCAL algorithm (§4) refines the power-law parameters, then performs a fast search for the combination of human- and machine-labeled samples that minimizes the total cost. MCAL uses an active learning metric to select samples to human-label, but because it includes a machine-labeling step, not all metrics work well for it. Specifically, core-set based sample selection is not the best choice for MCAL, because the resulting classifier machine-labels fewer samples.

• MCAL extends easily to the case where the user supplies multiple candidate architectures for the classifier. It trains each classifier only up to the point where it can confidently predict which architecture will achieve the lowest overall cost.

Evaluations (§5) on several popular benchmark data sets show that MCAL achieves lower cost than the lowest-cost labeling achieved by an oracle active learning strategy. It automatically adapts its strategy to the complexity of the data set. For example, it labels most of the Fashion-MNIST data set using a trained classifier. At the other extreme, it labels CIFAR-100 mostly using humans, since it estimates training costs to be prohibitive. Finally, it labels a little over half of CIFAR-10 using a classifier.
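A power-law accuracy model of the kind cited above can be fit on the fly from a handful of (training-set size, observed error) points, since error ≈ a·n^b is linear in log-log space. Below is a sketch; the data points are synthetic, not taken from the paper.

```python
import numpy as np

# Synthetic (training-set size, validation error) pairs, standing in for
# measurements from early labeling iterations.
sizes = np.array([500.0, 1000.0, 2000.0, 4000.0])
errors = np.array([0.30, 0.24, 0.19, 0.15])

# Fit error ≈ a * n^b via linear regression in log-log space:
# log(err) = b * log(n) + log(a), so b should come out negative.
b, log_a = np.polyfit(np.log(sizes), np.log(errors), 1)
a = float(np.exp(log_a))

def extrapolated_error(n):
    """Predict classifier error at a larger human-label budget n."""
    return a * n ** b
```

Note that this simple fit ignores the truncation: near the error floor, the truncated form MCAL assumes would cap the prediction rather than let it decay indefinitely.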
MCAL is up to 6× cheaper than human-labeling all images. It achieves these savings, in part, by carefully determining the training-set size while accounting for training costs; cost savings due to active learning range from 20–32% for Fashion-MNIST and CIFAR-10.
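In symbols, the optimization framing amounts to the following sketch (our notation, which may differ from the paper's exact formulation; human labels are assumed correct, so only machine-labeled samples contribute error):

```latex
\min_{B \subset X} \;
\underbrace{c_h\,|B|}_{\text{human labeling}}
\;+\;
\underbrace{C_{\mathrm{train}}(|B|)}_{\text{training cost}}
\quad \text{subject to} \quad
\frac{|X \setminus B|}{|X|}\,\mathrm{err}\big(D(B)\big) \;\le\; \varepsilon,
```

where $c_h$ is the per-sample human-labeling cost, $C_{\mathrm{train}}$ is the (size-proportional) training cost, and $\mathrm{err}(D(B))$ is the error rate of the classifier trained on $B$ over the remaining samples.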

2. PROBLEM FORMULATION

In this section, we formalize the intuitions presented in §1. The input to MCAL is an unlabeled data set X and a target error-rate bound ε. Suppose that MCAL trains a classifier D(B) using human-generated labels for some B ⊂ X. Let the error rate of the classifier D(B) over the remaining unlabeled data



* MCAL is available at https://github.com/hangqiu/MCAL



FIGURE 1: Differences between MCAL and Active Learning. Active learning outputs an ML model trained using a few samples from the data set. MCAL annotates and outputs the complete data set; it must also use the ML model to annotate samples reliably (red arrow).

