MCAL: MINIMUM COST HUMAN-MACHINE ACTIVE LABELING

Abstract

Today, ground-truth generation uses data sets annotated by cloud-based annotation services. These services rely on human annotation, which can be prohibitively expensive. In this paper, we consider the problem of hybrid human-machine labeling, which trains a classifier to accurately auto-label part of the data set. However, training the classifier can be expensive too. We propose an iterative approach that minimizes total overall cost by, at each step, jointly determining which samples to label using humans and which to label using the trained classifier. We validate our approach on well known public data sets such as Fashion-MNIST, CIFAR-10, CIFAR-100, and ImageNet. In some cases, our approach has 6× lower overall cost relative to human labeling the entire data set, and is always cheaper than the cheapest competing strategy. Ground-truth is crucial for training and testing ML models. Generating accurate ground-truth was cumbersome until the recent emergence of cloud-based human annotation services (SageMaker (2021); Google (2021); Figure-Eight (2021)). Users of these services submit data sets and receive, in return, annotations on each data item in the data set. Because these services typically employ humans to generate ground-truth, annotation costs can be prohibitively high especially for large data sets. Hybrid Human-machine Annotations. In this paper, we explore using a hybrid human-machine approach to reduce annotation costs (in $) where humans only annotate a subset of the data items; a machine learning model trained on this annotated data annotates the rest. The accuracy of a model trained on a subset of the data set will typically be lower than that of human annotators. However, a user of an annotation service might choose to avail of this trade-off if (a) targeting a slightly lower annotation quality can significantly reduce costs, or (b) the cost of training a model to a higher accuracy is itself prohibitive. Consequently, this paper focuses on the design of a hybrid human-machine annotation scheme that minimizes the overall cost of annotating the entire data set (including the cost of training the model) while ensuring that the overall annotation accuracy, relative to human annotations, is higher than a pre-specified target (e.g., 95%). Challenges. In this paper, we consider a specific annotation task, multi-class labeling. We assume that the user of an annotation service provides a set X of data to be labeled and a classifier D to use for machine labeling. Then, the goal is to find a subset B ⊂ X human-labeled samples to train D, and use the classifier to label the rest, minimizing total cost while ensuring the target accuracy. A straw man approach might seek to predict human-labeled subset B in a single shot. This is hard to do because it depends on several factors: (a) the classifier architecture and how much accuracy it can achieve, (b) how "hard" the dataset is, (c) the cost of training and labeling, and (d) the target accuracy. Complex models may provide a high accuracy, their training costs may be too high and potentially offset the gains obtained through machine-generated annotations. Some data-points in a dataset are more informative as compared to the rest from a model training perspective. Identifying the "right" data subset for human-vs. machine-labeling can minimize the total labeling cost. Approach. In this paper we propose a novel technique, MCAL 1 (Minimum Cost Active Labeling), that addresses these challenges and is able to minimize annotation cost across diverse data sets. At its



FIGURE 1: Differences between MCAL and Active Learning. Active learning outputs an ML model using few samples from the data set. MCAL completely annotates and outputs the dataset. MCAL must also use the ML model to annotate samples reliably (red arrow). core, MCAL learns on-the-fly an accuracy model that, given a number of samples to human-label, and a number to machine label, can predict the overall accuracy of the resultant set of labeled samples. Intuitively, this model implicitly captures the complexity of the classifier and the data set. It also uses a cost model for training and labeling. MCAL proceeds iteratively. At each step, it uses the accuracy model and the cost model to search for the best combination of the number of samples to human-label and to machine-label that would minimize total cost. It obtains the human-labels, trains the classifier D with these, dynamically updates the accuracy model, and machine-labels un-labeled samples if it determines that additional training cannot further reduce cost. MCAL resembles active learning in determining which samples in the data set to select for humanlabeling. However, it differs from active learning (Fig. 1 ) in its goals; active learning seeks to train a classifier with a given target accuracy, while MCAL attempts to label a complete data set within a given error bound. In addition, active learning does not consider training costs, as MCAL does. This paper makes the following contributions: • It casts the minimum-cost labeling problem in an optimization framework ( §2) that minimizes total cost by jointly selecting which samples to human label and which to machine label. This framing requires a cost model and an accuracy model, as discussed above ( §3). For the former, MCAL assumes that total training cost at each step is proportional to training set size (and derives the cost model parameters using profiling on real hardware). For the latter, MCAL leverages the growing body of literature suggesting that a truncated power-law governs the relationship between model error and training set size (Cho et al. (2015) ; Hestness et al. (2017) ; Sala (2019) ). • The MCAL algorithm ( §4) refines the power-law parameters, then does a fast search for the combination of human-and machine-labeled samples that minimizes the total cost. MCAL uses an active learning metric to select samples to human-label. But because it includes a machine-labeling step, not all metrics work well for MCAL. Specifically, core-set based sample selection is not the best choice for MCAL; the resulting classifier machine-labels fewer samples. • MCAL extends easily to the case where the user supplies multiple candidate architectures for the classifier. It trains each classifier up to the point where it is able to confidently predict which architecture can achieve the lowest overall cost. Evaluations ( §5) on various popular benchmark data sets show that MCAL achieves lower than the lowest-cost labeling achieved by an oracle active learning strategy. It automatically adapts its strategy to match the complexity of the data set. For example, it labels most of the Fashion data set using a trained classifier. At the other end, it chooses to label CIFAR-100 mostly using humans, since it estimates training costs to be prohibitive. Finally, it labels a little over half of CIFAR-10 using a classifier. MCAL is up to 6× cheaper compared to human labeling all images. It is able to achieve these savings, in part, by carefully determining the training size while accounting for training costs; cost savings due to active learning range from 20-32% for Fashion and CIFAR-10.

2. PROBLEM FORMULATION

In this section, we formalize the intuitions presented in §1. The input to MCAL is an unlabeled data set X and a target error rate bound ε. Suppose that MCAL trains a classifier D(B) using human generated labels for some B ⊂ X. Let the error rate of the classifier D(B) over the remaining unlabeled data using D(B) be ϵ(X B). If D(B) machine-generatedfoot_1 labels for this remaining data, the overall ground-truth error rate for X would be, ϵ(X) = (1 -|B|/|X|) ϵ(X B). If ϵ(X) ≥ ε, then, this would violate the maximum error rate requirement. However, D(B) is still able to generate accurate labels for a carefully chosen subset S(D, B) ⊂ X B (e.g., comprising only those that D(B) is very confident about). After generating labels for S(D, B) using D(B), labels for the remaining X B \ S(D, B) can be once again generated by humans. The overall error rate of the generated ground-truth then would be {(|S(D, B)|)/(|X|)}ϵ(S(D, B)). ϵ(S(D, B)) is the error rate of generating labels over S(D, B) using D(B) and is, in general, higher for larger |S(D, B)|. Let S ⋆ (D, B) be the largest possible S(D, B) that ensures that the overall error rate is less than ε. Then, overall cost of generating labels is: C = (|X S ⋆ (D, B)|) • C h + Ct(D(B)) (1) where, C h is the cost of human labeling for a single data item and C t (D(B)) is the total cost of generating D(B) including the cost of optimizing B and training D(B). Given this, MCAL achieves the minimum cost by jointly selecting S ⋆ (D, B) and B subject to the accuracy constraint:  C ⋆ = argmin S ⋆ (D,B),B C, s.t. (|S ⋆ (D, B)|)/(|X|)ϵ(S ⋆ (D, B)) < ε

3.1. ESTIMATING MACHINE-LABELING PERFORMANCE

To maximize total cost reduction without violating the overall accuracy constraint (Eqn. 1), MCAL must determine a maximal fraction of samples θ ⋆ selected by L(.) for machine labeling. To predict the model error ϵ(S θ (D(B))), we leverage recent work ( §6) that observes that, for many tasks and many models, the generalization error vs. training set size is well-modeled by a power-law (Hestness et al. (2017) ; Sala ( 2019)) ( ϵ(S θ (D(B))) = α θ |B| -γ θ ). However, it is well-known that most power-laws experience a fall-off (Burroughs (2001) ) at high values of the independent variable. To model this, we use an upper-truncated power-law (Burroughs (2001)): ϵ(S θ (D(B))) = α θ |B| -γ θ e - |B| k θ (3) where α θ , γ θ , k θ are the power-law parameters. Eqn. 3 can better predict the generalization error (Fig. 2 ) based on the training size than a power-law, especially at larger values of |B|. Fig. 3 shows that more samples help calibrate the truncated power law to better predict the falloff. Other models and data sets show similar results (see Appendix §F). ϵ(S θ (D(B))) will increase monotonically with θ since increasing θ has the effect of adding data that D is progressively less confident about (ranked by L(.)). Lacking a parametric model for this dependence, to find θ ⋆ , MCAL generates power-law models ϵ(S θ (D(B))) for various discrete θ ∈ (0, 1) ( §4). MCAL obtains θ ⋆ for a given B by searching across the predicted ϵ(S θ (D(B))). While ϵ(S θ (D(B))) also depends on the data acquisition batch size (δ, see §3.2), when the total accumulated training sample size is large enough, the error rate starts to converge, hence the dependence is insignificant. Fig. 4 shows this variation is < 1% especially for smaller values of θ.

3.2. ESTIMATING ACTIVE LEARNING TRAINING COSTS

Active learning iteratively obtains human labels for a batch size of δ items ranked using the sample selection function M (.) and adds them to the training set B to retrain the classifier D. A smaller δ typically makes the active learning more effective: it allows for achieving a lower error for a potentially smaller B through more frequent sampling, but also significantly increases the training cost due to frequent re-training. Choosing an appropriate training sample acquisition batch size δ is thus an important aspect of minimizing overall cost. The training cost depends on the training time, which in turn is proportional to the data size (|B|) and the number of epochs. A common strategy is to use a fixed number of epochs per iteration, so the training cost in each iteration is proportional to |B|. Assuming δ new data samples added in each iteration, the total training cost follows Ct(D(B)) = 1 2 |B| (|B|/δ + 1) MCAL can accommodate other training cost modelsfoot_2 .

3.3. SAMPLE SELECTION

The sample selection functions M (.) and L(.) are qualitatively different. The former selects samples to obtain an accurate classifier, the latter determines which samples a given classifier is most confident about (so they can be machine-labeled). For L(.), we simply use the margin metric (Scheffer et al. (2001) ), the score difference between the highest and second highest ranked labels (see Fig. 5 ). 2001)) based approaches. The k-center selection method (Wolf (2011) ), an instance of the former, uses the feature-space distance between the last activation layer output to iteratively select data points farthest from existing centers. Uncertainty-based approaches include max-entropy (Dagan & Engelson (1995) ), which samples data with highest entropy in activation output, least-confidence (Culotta & McCallum (2005) ), and margin. We evaluate MCAL using these different choices for M (.), but find that ( §5), for a subtle reason, uncertainty-based approaches perform better. Unlike with active learning, where the goal is to select samples to match the generalization error of the entire data set, MCAL's goal is to select samples to train a classifier that can machine label the largest number of samples possible. When MCAL uses margin or least confidence for L(.), the samples selected have high accuracy (close to 100%, Fig. 5 )foot_3 . The samples selected by a core-set based algorithm such as k-centers is poorly correlated with accuracy (Fig. 5 ) and margin (Fig. 6 ).

4. THE MCAL ALGORITHM

The MCAL algorithm (Alg. 1, see appendix §A) takes as input an active learning metric M (.), the data set X, the classifier D (e.g., RESNET18) and parametric models for training cost (e.g., Eqn. 4), and for error rate as a function of training size (e.g., the truncated power law in Eqn. 3). The algorithm operates in two phases. In the first phase, it uses estimates obtained during active learning to learn the parameters of one truncated power-law model for each discrete machine-label fraction θ, and uses the cost measurements to learn the parameters for the training-cost model. In the second phase, having the models, it can estimate and refine the optimal machine-labeling subset S ⋆ (D, B) and the corresponding training set B that produce the optimal cost C ⋆ . It can also estimate the optimal batch size δ opt for this cost. It terminates when adding more samples to B is counter productive. It trains a classifier to label S ⋆ (D, B) and use human labels for the remaining unlabeled samples. To start with, MCAL randomly selects a test set T (e.g., |T | = 5% of |X|) and obtains human labels to test and measure the performance of D. Next, MCAL initializes B = B 0 by randomly selecting δ 0 samples from X (e.g., 1% of X in our implementation) and obtaining human labels for these. Then, MCAL trains D using B 0 and uses T to estimate the generalization errors ϵ T S θ (D(B 0 )) for various values of θ ∈ (0, 1) (we chose in increments of 0.05 {0.05, 0.1, • • • , 1}), using T and M (.). Next, the main loop of MCAL begins. Similar with active learning, MCAL selects, in each step, δ samples, ranked by M (.), obtains their labels and adds them to B, then trains D on them. The primary difference with active learning is that MCAL, in every iteration, estimates the model parameters for machine-labeling error ϵ(S θ (D(B))) for each θ with various training size B, then uses these to estimate the best combination of B and θ that will achieve the lowest corresponding overall cost C ⋆ . At the end of this iteration, MCAL can answer the question: "How many human generated labels must be obtained into B to train D, in order to minimize C?" ( §3). The estimated model parameters for the training cost C t (D(B)) and the machine-label error ϵ(S θ (D(B))) may not be stable in the first few iterations given limited data for the fit. To determine if the model parameters are stable, MCAL compares the estimated C ⋆ obtained from the previous iteration to the current. If the difference is small (≤ 5%, in our implementation), the model is considered to be stable for use. After the predictive models have stabilized, we can rely on the estimates of the optimal training size B opt , to calculate the final number of labels to be obtained into B. At this point MCAL adjusts δ to reduce the training cost when it is possible to do so. MCAL can do this because it targets relatively high accuracy for D(B). For these high targets, it is important to continue to improve model parameter estimates (e.g., the parameters for the truncated power law), and active learning can help achieve this. Fig. 3 shows how the fit to the truncated power law improves as more points are added. Finally, unlike active learning, MCAL adapts δ, proceeding faster to B opt to reduce training cost. Fig. 4 shows that, for most values of θ, the choice of δ does not affect the final classifier accuracy much. However, it can significantly impact training cost ( §3). This loop terminates when total cost obtained in a step is higher than that obtained in the previous step. At this point, MCAL simply trains the classifier using the last predicted optimal training size B opt , then human labels any remaining unlabeled samples. Extending MCAL to selecting the cheapest DNN architecture. In what we have described so far, we have assumed that MCAL is given a candidate classifier D. However, it is trivial to extend MCAL to the case when the data set curator supplies a small number (typically 2-4) of candidate classifiers {D 1 , D 2 , • • • }. In this case, MCAL can generate separate prediction models for each of the classifiers and pick the one that minimizes C once the model parameters have stabilized. This does not inflate the cost significantly since the training costs until this time are over small sizes of B. Accommodating a budget constraint. Instead of a constraint of labeling error, MCAL can be modified to accommodate other constraints, such as a limited total budget. Its algorithm can search for the lowest error satisfying a given budget constraint. Specifically, with the same model for estimation of network error and total cost ( §3), instead of searching the S ⋆ (D, B) for minimum total cost while error constraint is satisfied (line 18 of Alg. 1, Appendix §A), we can search for minimum estimated error while the total cost is within budget. The difference is: in the former case, we can always resort to human labeling when error constraint cannot be satisfied; in the latter case, we can only sacrifice the total accuracy by stopping the training process and taking the model's output when the money budget is too low.

5. EVALUATION

In this section, we evaluate the performance of MCAL over popular classification data sets: Fashion- MCAL evaluates different sampling methods discussed (Fig. 6 ) and uses margin to rank and select samples. At each active learning iteration, it trains the model over 200 epochs with a 10× learning rate reduction at 80, 120, 160, 180 epochs, and a mini-batch-size of 256 samples (Keras (2021) ). Training is performed on virtual machines with 4 NVIDIA K80 GPUs at a cost of 3.6 USD/hr. In all experiments, unless otherwise specified, the labeling accuracy requirement ε was set at 5%.  and c ) selects the best DNN architecture from among a set of candidates. In this section, we demonstrate that MCAL provides significant overall cost benefits at the expense of ε (5%) degradation in label quality. Further, it even outperforms active learning assisted by an oracle with optimal δ value.

5.1. REDUCTION IN LABELING COSTS USING MCAL

Fig. 7 depicts the total labeling costs incurred when using Amazon labeling services for three different schemes: i) when the entire data set is labeled by Amazon labeling services, ii) MCAL for ε = 5%, and iii) active learning with an oracle to choose δ for each DNN model. Tbl. 1 lists the numerical values of the costs (in $) for human labeling and MCAL. To calculate the total labeling error, we compare the machine labeling results on S ⋆ (D, B opt ) and human labeling results on X \ S ⋆ (D, B opt ) against the ground-truth. The human labeling costs are calculated based on the prices of Amazon labeling services SageMaker (2021) and Satyam (Qiu et al. (2018) ). Cost Saving Compared to Human Labeling. From Fig. 7 and Tbl. 1, MCAL provides an overall cost saving of 86%, 67% and 30% for Fashion, CIFAR-10 and CIFAR-100 respectively. As expected, the savings depend on the difficulty of the classification task. The "harder" the dataset the higher the savings. Tbl. 1 also shows the number of samples in B used to train D, as well as the number of samples |S| labeled using D. For Fashion, MCAL labels only 6.1% of the data to train the classifier and uses it to label 85% of the data set. For CIFAR-10, it trains using 22% of the data set and labels about 65% of the data using the classifier. CIFAR-100 requires more data to train the classifier to a high accuracy so is able to label only 10% of the data using the classifier. Cost Savings from Active Learning. In the absence of MCAL, how much cost savings would one obtain using naive Active Learning? As described in §3, the overall cost of AL depends on the batch size δ, the DNN architecture used for classification, as well as how "hard" it is to classify the data set. In order to examine these dependencies, we performed active learning using values of δ between 1% to 20% of |X|, to label each data set using the different DNN architectures until the desired overall labeling error constraint was met. The contribution to overall labeling error is zero for human annotated images and dictated by the classifier performance for machine labeled images. Dependence on sample selection method M (.). Fig. 11 shows the total cost of MCAL using different sample selection functions M (.). As §3.3 describes, k-center is qualitatively different from uncertainty-based methods in the way it selects samples. In Fig. 11 , for CIFAR10 on RESNET18 (other results are similar, omitted for brevity), while all metrics reduce cost relative to human-labeling the entire data set, uncertainty-based metrics have 25% lower cost compared to k-center because the latter machine-labels fewer samples. Thus, for active labeling, an uncertainty-based metric is better for selecting samples for classifier training. Dependence on batch size δ. Fig. 12 depicts the fraction of images that were machine labeled for the different data sets and DNN architectures trained with different batch sizes. As seen in Fig. 12 , lower δ values allow AL to adapt at a finer granularity, resulting in a higher number of machine labeled images. Increasing δ from 1% to 15% results in a 10-15% fewer images being machine labeled. Dependence on data set complexity. As seen from Figures 8-10, and 12, the training costs as well as potential gains from machine labeling depend on the complexity of the data set. While, using a small δ (2%) is beneficial for Fashion, the number is larger for CIFAR-10. For CIFAR-100, using active learning does not help at all as the high training costs overwhelm benefits from machine labeling. A second dimension of complexity is the number of samples per class. CIFAR-100 has 600 per class, CIFAR-10 has 6000. To demonstrate that MCAL adapts to varying numbers of samples per class, Fig. 13 shows an experiment in which we ran MCAL on subsets of CIFAR-10, with 1000-5000 Dependence on DNN architecture. While a larger DNN has the potential for higher accuracy, its training costs may be significantly higher and may potentially offset savings due to machine labeling. As seen from Figures 8-10, and 12, even though RESNET50 is able to achieve a higher prediction quality and machine-labels more images, its high training cost offsets these gains. CNN18 on the other hand incurs much lower training costs, however, its poor performance leads to few images being machine-labeled. RESNET18 provides for a better compromise resulting in overall lower cost. MCAL on Imagenet. We have explored applying MCAL on Imagenet using an EfficientNetB0 (Tan & Le (2019) ). Relative to CIFAR-10 data set, Imagenet (with over 1.2 M images) is challenging for MCAL because: it has more classes (1000), fewer samples per class (avg. 1200), and the training cost is 60-200× higher than that for RESNET-18. For these reasons, MCAL trains the network to over 80% accuracy up to 454K images, and decides, because it cannot machine label any, to human-label the entire data set. As with CIFAR-100, for complicated data sets, MCAL still makes the correct decision.foot_4  Summary. Tbl. 2 provides the optimal choices for δ and the optimal cost savings obtained for various DNNs. Comparing these values with Tbl. 1, we conclude that: MCAL outperforms Naive AL across various choices of DNN architectures and δ by automatically picking the right architecture, adapting δ suitably, and selecting the right subset of images to be human labeled.

5.2. GAINS FROM ACTIVE LEARNING

Figures 14 and 15 ( see appendix §B) show the overall labeling cost with and without AL for the three data sets using Amazon and Satyam labeling services. The percentage cost gains are in brackets. While Fashion and CIFAR-10 show a gain of about 20% for both data sets, the gains are low in the case of CIFAR-100 because most of the images in that data set were labeled by humans and active learning did not have an opportunity to improve significantly. The gains are higher for Satyam, since training costs are relatively higher in that cost model: active learning accounted for 25-31% for Fashion and CIFAR-10's costs, and even CIFAR-100 benefited from this.

5.3. EFFECT OF RELAXING OTHER ASSUMPTIONS

Cheaper Labeling Cost. Intuitively, with cheaper labeling costs MCAL should use more human labeling to train the classifier. This in turn should enable a larger fraction of data to be labeled by the classifier. To validate this, we used the Satyam (Qiu et al. (2018) ) labeling service (with 10× lower labeling cost). The effect of this reduction is most evident for CIFAR-100 in Tbl. 1 as MCAL chooses to train the classifier using 57.6% of the data (instead of 32% using Amazon labeling service). This increases the classifier's accuracy allowing it to machine-label 10% more of the data set. For other data sets, the differences are less dramatic (they use 2.5-5% more data to train the classifier, more details in appendix §C, Figures 16, 17, and 18) . The numerical values of the optimal δ as well as the corresponding cost savings are provided in Table 2 : MCAL achieves a lower overall cost compared to all these possible choices in this case as well. Relaxing Accuracy Requirement. In appendix §D, we examine the quality-cost trade-off by relaxing the accuracy target from 95% to 90% to quantify its impact on additional cost savings. Fashion achieves 30% cost reductions; many more images are labeled by the classifier. CIFAR-10 and CIFAR-100 also show 10-15% gains.

6. RELATED WORK

Active learning (Settles ( 2010)) aims to reduce labeling cost in training a model, by iteratively selecting the most informative or representative samples for labeling. Early work focused on designing metrics for sample selection based on coreset selection (Sener & Savarese (2017) ), margin sampling (Scheffer et al. (2001) ; Jiang & Gupta ( 2019)), region-based sampling (Cortes et al. (2020) ), max entropy (Dagan & Engelson (1995) ) and least confidence (Culotta & McCallum (2005) ). Recent work has focused on developing metrics tailored to specific tasks, such as classification (Coleman et al. (2020) ), detection (Brust et al. (2019) ), and segmentation (Yang et al. (2017) ), or for specialized settings such as when costs depend upon the label (Krishnamurthy et al. (2017) ), or for a hierarchy of labels (Hu et al. (2019) ). Other work in this area has explored variants of the problem of sample selection: leveraging selection-via-proxy models (Coleman et al. (2020) ), model assertions (Kang et al. (2020) ), model structure (Wang et al. (2017) ), using model ensembles to improve sampling efficacy (Beluch et al. (2018) ), incorporating both uncertainty and diversity (Ash et al. (2020) ), or using self-supervised mining of samples for active learning to avoid data set skew (Wang et al. (2018) ). MCAL uses active learning and can accommodate multiple sample selection metrics. Training cost figures prominently in the literature on hyper-parameter tuning, especially for architecture search. Prior work has attempted to predict learning curves to prune hyper-parameter search (Klein et al. (2017) ), develop effective search strategy within a given budget (Lu et al. (2019) ), or build a model to characterize maximum achievable accuracy to enable fast triage during architecture search (Istrate et al. (2019) ), all with the goal of reducing training cost. The literature on active learning has recognized the high cost of training for large datasets and has explored using cheaper models (Coleman et al. (2020) ) to select samples. MCAL solves a different problem (dataset labeling) and explicitly incorporates training cost in reasoning about which samples to select. Also relevant is the empirical work that has observed a power-law relationship between generalization error and training set size (Beery et al. (2018) ; Johnson et al. (2018) ; Figueroa et al. (2012) ) across a wide variety of tasks and models. MCAL builds upon this observation, and learns the parameters of a truncated power-law model with as few samples as possible.

7. CONCLUSIONS

Motivated by the need of data engineering for increasingly advanced ML applications as well as the the prohibitive human labeling cost, this paper asks: "How to label a data set at minimum cost"? To do this, it trains a classifier using a set B from the data set to label S samples, and uses humans to label the rest. The key challenge is to balance human-(B) vs. machine-labels (S) that minimizes total cost. MCAL jointly optimizes the selection by modeling the active learning accuracy vs. training size as a truncated power-law, and search for minimum cost strategy while respecting error constraints. The evaluation shows that it can achieve up to 6× lower cost than using humans, and is always cheaper than active learning with the lowest-cost batch size.

A MCAL ALGORITHM

Algorithm 1 MCAL Inputs: An active learning metric M (.), a classifier D, set of unlabeled images (X), a parametric model to predict Ct(D(B)) and a parametric model to predict ϵ(S θ (D(B))) 1: Obtain human generated labels for a randomly sampled test set T ⊂ X, and let X = X T . 2: Obtain human generated labels for a randomly sampled data items B0 ⊂ X, |B0| = δ0. Estimate ϵ T S θ (D(Bi)) using T and M (.)

16:

Estimate and update the error model parameters (α θ , γ θ , k θ from Eqn. 3) for ϵ(S θ (D(B))) using and 18 depict the effect of using Satyam as the labeling service. As seen in these figures, the lower labeling cost alters the tradeoff curves. The figures also depict the corresponding MCAL cost as well as the human labeling cost for reference. MCAL achieves a lower overall cost compared to all these possible choices in this case as well. ⟨|B k | , ϵ T S θ (D(B k )) ⟩, ∀k 17: end for 18: Find C ⋆ new = C ⋆ , Bopt as described in Section 3 19: if (|C ⋆ new -C ⋆ old |)/|C ⋆ new | < ∆ then 20: argmin N δopt = (|Bopt| -|Bi|)/N, s.t. C < C ⋆ (1 + β) 21: δ = δopt 22: end if 23: C ⋆ old = C ⋆

D EFFECT OF RELAXING ACCURACY REQUIREMENT

Tbl. 3 depicts the fraction of images that were machine-labeled by the classifier (|S|/|X|) and the number of samples used to train it (|B|/|X|) for each of the data sets. Comparing with Tbl. 1, for Fashion MCAL predicts more images by using a smaller number of training images. For CIFAR-10 and CIFAR-100, it makes a different decision and uses more training images to increase the classifier accuracy to enable more images to be machine-labeled. Further, RESNET18 continues to be the optimal architecture for all three data sets. As seen from Tbl. 3, MCAL ensures the accuracy target of 90% for all the data sets. Tbl. 3 also captures savings with respect to human labeling while using an accuracy guarantee of 90%. However, the savings do not dramatically increase indicating that most of the cost gain comes in reducing the accuracy requirement to 95% from 100%. As an example, we show the fitting results on the error profile of θ = 50% in all figures in this section. In all combinations of datasets and models, while using more points gives higher accuracy and better prediction, both power-law and truncated power-law can get stable and precise prediction using a very limited number of small sample points. 



MCAL is available at https://github.com/hangqiu/MCAL Evaluation of mixed-labels assumes perfect human-labels(B). For example, if the number of epochs is proportional to |B| in which case Ct(D(B)) can have a cubic dependency on |B| The results are from Res18 trained over 8K samples (CIFAR10); actual numbers will vary by dataset, training size, and model used. To make this decision, MCAL terminates when it has spent more than x% (x =10 in our evaluations) of the human labeling cost to train the classifier before reaching the desired accuracy. So, for any highly complex data set for which it decides to human-label all samples, MCAL pays a small "exploration tax" before termination.



2) 3 COST PREDICTION MODELS MCAL must determine the optimal value of the training size B and its corresponding maximum machine-labeled set S ⋆ (D, B) that minimizes C. In order to make optimal choices, MCAL must be able to predict C as a function of the choices. C in turn depends on |S ⋆ (D, B)| and the training cost C t (D(B)) (Eqn. 1). Thus, MCAL actually constructs two predictors, one each for |S ⋆ (D, B)| ( §3.1) and C t (D(B)) ( §3.2). These predictors rely on two sample selection functions ( §3.3): from among the remaining un-labeled samples, M (.) selects, which to human-label for training; L(.) selects those that the classifier D(B) can machine-label within the error constraint ε.

FIGURE 5: Machine labeling accuracy of samples ranked by L(.) FIGURE 6: Sampling metric (M (.)) comparison M (.) is similar to sample selection for active learning. Prior work has considered coreset (Sener & Savarese (2017)) and uncertainty (Scheffer et al. (2001)) based approaches. The k-center selection method(Wolf (2011)), an instance of the former, uses the feature-space distance between the last activation layer output to iteratively select data points farthest from existing centers. Uncertainty-based approaches include max-entropy(Dagan & Engelson (1995)), which samples data with highest entropy in activation output, least-confidence(Culotta & McCallum (2005)), and margin.

FIGURE 7: Total cost of labeling for various data sets, for i) Human labeling, ii) MCAL, and iii) Oracle assisted AL with various DNN architectures.

MCAL v.s. AL. Figures 8, 9 and 10 show the overall cost for each of the three data sets using different DNN architectures. The optimal value of δ is indicated by a circle in each of the figures. Further, the cost of labeling using humans only as well as MCAL cost is indicated using dashed lines for reference. As shown in the figure, MCAL outperforms AL even with optimal choice of δ. The choice of δ can significantly affect the overall cost, by up to 4-5× for hard data sets. Training cost. Figures 19, 20, and 21 (see appendix §E) depict the AL training costs for each of the data sets using three different DNNs. The training cost can vary significantly with δ. While for Fashion, there is a 2× reduction in training costs, it is about 5× for CIFAR-10 and CIFAR-100.

FIGURE 11: Total cost and machine-labeling fraction of MCAL using different sampling methods (RESNET18, CIFAR-10)

3: Train D(B0) and test the classifier over T 4: Record training cost Ct(D(B0)) 5: for θ ∈ {θmin, . . . , θmax} do 6: Estimate ϵ T S θ (D(B0)) using T and M (.) 7: end for 8: Initialization: C ⋆ new = 0, C ⋆ old = 0, δ = δ0, i = 1, Bopt = B0 9: while C ⋆ < C(Bopt + δ) do 10: Obtain human generated labels for bi ⊂ X Bi-1 comprising |bi| = δ samples ranked using M (.) 11: Bi = Bi-1 ∪ bi 12: Train D(Bi) and test the classifier over T 13: Record Ct(D(Bi)) and estimate Ct(D(B)) using ⟨|B k | , Ct(D(B k ))⟩, ∀k 14: for θ ∈ [θmin, • • • , θmax] do 15:

Figures 14 and 15 show the overall labeling cost with and without AL for the three data sets using Amazon and Satyam respectively The percentage cost gains are in brackets. While Fashion and CIFAR-10 show a gain of about 20% for both data sets, the gains are low in the case of CIFAR-100 because most of the images in that data set were labeled by humans and active learning did not have an opportunity to improve significantly. The gains are higher for Satyam, since training costs are relatively higher in that cost model: active learning accounted for 25-31% for Fashion and CIFAR-10's costs, and even CIFAR-100 benefited from this.

Figures 16, 17, and 18  depict the effect of using Satyam as the labeling service. As seen in these figures, the lower labeling cost alters the tradeoff curves. The figures also depict the corresponding MCAL cost as well as the human labeling cost for reference. MCAL achieves a lower overall cost compared to all these possible choices in this case as well.

Fig. 19, Fig. 20, Fig. 21 shows the training cost portion from the total cost (Fig. 8,Fig. 9, Fig. 10).The training cost can vary significantly with δ. While for Fashion, there is a 2× reduction in training costs, it is about 5× for CIFAR-10 and CIFAR-100 data sets.

FIGURE 16: Performance of MCAL compared to active learning with different batch sizes δ on Fashion using Satyam labeling

FIGURE 22: Power-law and Truncated Power-law fits on CIFAR-10 using CNN18 FIGURE 23: Power-law and Truncated Power-law fits on CIFAR-10 using RESNET18 FIGURE 24: Power-law and Truncated Power-law fits on CIFAR-10 using RESNET50



MNIST, CIFAR-10, CIFAR-100, and ImageNet. We chose these data sets to demonstrate that MCAL can work effectively across different difficulty levels, We use three popular DNN architectures RESNET50, RESNET18(He et al. (2016)), and CNN18 (RESNET18 without the skip connections). These architectures span the range of architectural complexity with differing training costs and achievable accuracy. This allows us to demonstrate how MCAL can effectively select the most cost efficient architecture among available choices. We also use two different labeling services: Amazon SageMaker (2021) at $0.04/image and Satyam(Qiu et al. (2018)) at $0.003/image. This allows us to demonstrate how MCAL adapts to changing human labeling costs. Finally, parametric model fitting costs are negligible compared to training, metric profiling, and human labeling costs.

Oracle Assisted Active Learning randomly selected samples per class. With 1000 samples per class, the majority of samples were used for training so MCAL can machine-label only 30%. With 5000 samples per class, this fraction goes up to 65%, resulting in increased cost savings relative to human-labeling the entire data set.

Relaxing Error Constraints to ε = 10%

