IS MARGIN ALL YOU NEED? AN EXTENSIVE EMPIRICAL STUDY OF DEEP ACTIVE LEARNING ON TABULAR DATA

Anonymous authors
Paper under double-blind review

Abstract

Given a labeled training set and a collection of unlabeled data, the goal of active learning (AL) is to identify the best unlabeled points to label. In this comprehensive study, we analyze the performance of a variety of AL algorithms on deep neural networks trained on 69 real-world tabular classification datasets from the OpenML-CC18 benchmark. We consider different data regimes and the effect of self-supervised model pre-training. Surprisingly, we find that the classical margin sampling technique matches or outperforms all others, including the current state-of-the-art, in a wide range of experimental settings. We hope to encourage researchers to benchmark rigorously against margin, and to convince practitioners facing tabular data labeling constraints that the hyper-parameter-free margin method may often be all they need.

1. INTRODUCTION

Active learning (AL), the problem of identifying examples to label, is an important problem in machine learning, since obtaining labels for data is oftentimes a costly manual process. Being able to efficiently select which points to label can reduce the cost of model learning tremendously. High-quality data is a key component in any machine learning system and has a very large influence on the results of that system (Cortes et al., 1994; Gudivada et al., 2017; Willemink et al., 2020); thus, improving data curation can potentially benefit the entire ML pipeline.

Margin sampling, also referred to as uncertainty sampling (Lewis et al., 1996; MacKay, 1992), is a classical active learning technique that chooses the classifier's most uncertain examples to label. In the context of modern deep neural networks, the margin method scores each example by the difference between the top two confidence (e.g. softmax) scores of the model's prediction (see the code sketch below). In practical and industrial settings, margin is used extensively in a wide range of areas including computational drug discovery (Reker & Schneider, 2015; Warmuth et al., 2001), magnetic resonance imaging (Liebgott et al., 2016), named entity recognition (Shen et al., 2017), as well as predictive models for weather (Chen et al., 2012), autonomous driving (Hussein et al., 2016), network traffic (Shahraki et al., 2021), and financial fraud prediction (Karlos et al., 2017).

Since the margin sampling method is very simple, it seems particularly appealing to try to modify and improve on it, or even to develop more complex AL methods to replace it. Indeed, many papers in the literature have proposed such methods that, at least in the particular settings considered, consistently outperform margin. In this paper, we put this intuition to the test with a head-to-head comparison of margin against a number of recently proposed state-of-the-art active learning methods across a variety of tabular datasets. We show that, in the end, margin matches or outperforms all other methods consistently in almost all situations. Thus, our results suggest that practitioners of active learning working with tabular datasets, similar to the ones we consider here, should keep things simple and stick to the good old margin method.

In many previous AL studies, the improvements over margin appear only in settings that are not representative of all practical use cases. One such scenario is the large-batch case, where the number of examples to be labeled at once is large. It is often argued that margin is not the optimal strategy in this situation because it exhausts the labeling budget on a very narrow set of points close to the decision boundary of the model, and that introducing more diversity would help (Huo & Tang, 2014; Sener & Savarese, 2017; Cai et al., 2021). However, some studies find that the number of examples to be labeled at once has to be very high before there is an advantage over margin (Jiang & Gupta, 2021). Moreover, in practice a large batch of examples usually does not need to be labeled at once, and it is to the learner's advantage to use smaller batch sizes so that, as datapoints get labeled, this information can be incorporated to re-train the model and thus choose the next examples in a more informed way. It is important to point out, however, that in some cases re-training the model is very costly (Citovsky et al., 2021); in that case, gathering a larger batch could be beneficial.
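As a concrete illustration, the following is a minimal sketch of margin scoring and batch selection as described above, assuming `probs` holds the model's softmax outputs over the unlabeled pool; the function names are ours for illustration, not taken from any particular AL library.

```python
import numpy as np

def margin_scores(probs: np.ndarray) -> np.ndarray:
    """Difference between the top-two class probabilities per example.

    `probs` has shape (n_unlabeled, n_classes). Smaller margins mean the
    model is less certain between its top two classes, so those examples
    are the ones margin sampling selects for labeling.
    """
    # Partition so the two largest probabilities occupy the last two columns.
    top2 = np.partition(probs, -2, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]  # max minus second max, always >= 0

def select_batch(probs: np.ndarray, batch_size: int) -> np.ndarray:
    """Indices of the `batch_size` pool examples with the smallest margins."""
    return np.argsort(margin_scores(probs))[:batch_size]
```

Note that the method requires nothing beyond the model's predicted class probabilities, which is what makes it hyper-parameter-free.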
In this study, we focus on the practically more common setting of AL that allows frequent re-training of the model. Many papers also restrict their study to only a couple of benchmark datasets, and while their proposals may outperform margin there, the results do not necessarily carry over to a broader collection of datasets; such studies may therefore have the unintended consequence of overfitting to the chosen datasets. In a real-world, live active learning setting, examples are sent to human labelers, so we do not have the luxury of comparing multiple active learning methods, or even of tuning the hyper-parameters of a single method, without incurring significantly higher labeling cost. Instead, we must commit to a single active learning method, oftentimes without much information. Our results on the OpenML-CC18 benchmark suggest that in almost all cases when training with tabular data, it is safe for practitioners to commit to margin sampling (which comes with the welcome property of having no additional hyper-parameters) and have the peace of mind that other alternatives would not have performed better in a statistically significant way.
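To make the frequent-retraining setting concrete, here is a hypothetical pool-based loop built around the `select_batch` sketch above. The callables `train_model` (returning a model with a scikit-learn-style `predict_proba`) and `label` (a stand-in for human labelers) are illustrative assumptions, not components of our experimental pipeline.

```python
import numpy as np

def active_learning_loop(X_labeled, y_labeled, X_pool,
                         train_model, label, batch_size=20, rounds=10):
    for _ in range(rounds):
        model = train_model(X_labeled, y_labeled)  # re-train every round
        probs = model.predict_proba(X_pool)        # softmax over the pool
        idx = select_batch(probs, batch_size)      # smallest margins first
        # Send the selected examples to the labelers and grow the training set.
        X_labeled = np.concatenate([X_labeled, X_pool[idx]])
        y_labeled = np.concatenate([y_labeled, label(X_pool[idx])])
        X_pool = np.delete(X_pool, idx, axis=0)    # remove newly labeled points
    return train_model(X_labeled, y_labeled)
```

The key design point is that the model is re-trained after every small batch, so each selection round is informed by all labels gathered so far; with a large batch and no re-training, the same loop degenerates into the setting where diversity-seeking methods are usually motivated.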

2. RELATED WORK

There have been a number of works in the literature providing an empirical analysis of active learning procedures in the context of tabular data. Schein & Ungar (2007) study active learning procedures for logistic regression and show that margin sampling performs most favorably. Ramirez-Loaiza et al. (2017) show that with simple datasets and models, margin sampling performs better than random sampling and Query-by-Committee when accuracy is the metric, while random performs best under the AUC metric. Pereira-Santos et al. (2019) provide an investigation of the performance of active learning strategies with various models, including SVMs, random forests, and nearest neighbors, and find that margin combined with random forests is the strongest combination. Our study also focuses on the accuracy metric and likewise shows that margin is the strongest baseline, but it is much more relevant to the modern deep learning setting and compares against a much larger set of baselines and datasets. Our focus on neural networks is timely, as recent work (Bahri et al., 2021) showed that neural networks often outperform traditional approaches for modeling tabular data, such as Gradient Boosted Decision Trees (Chen & Guestrin, 2016), particularly when they are pre-trained in the way we explore here. To our knowledge, we provide the most comprehensive and practically relevant empirical study of active learning baselines on neural networks thus far.

There have also been empirical evaluations of active learning procedures in the non-tabular case. Hu et al. (2021) showed that margin attained the best average performance of the baselines tested on two image and three text classification tasks across a variety of neural network architectures and labeling budgets. Munjal et al. (2022) showed that on the image classification benchmarks CIFAR-10, CIFAR-100, and ImageNet, under strong regularization, none of the numerous active learning baselines they tested had a meaningful advantage over random sampling. We hypothesize that this may be because the initial network has too little information (i.e., no pre-training and a small initial seed set) for active learning to be effective, and that conclusions may differ otherwise. It is also worth noting that many active learning studies in computer vision only present results on a few benchmark datasets (Munjal et al., 2022; Sener & Savarese, 2017; Beluch et al., 2018; Emam et al., 2021; Mottaghi & Yeung, 2019; Hu et al., 2018), and while they may show promising results on those datasets, it is unclear how the conclusions translate to a wider set of computer vision datasets. We show that many of these ideas do not perform well when put to the test in our extensive tabular setting. Dor et al. (2020) evaluated various active learning baselines for BERT and showed that in most cases, margin provided the most statistically significant advantage over passive learning. One useful direction for future work is establishing a similarly extensive empirical study for computer vision and NLP.

While our study is empirical, it is worth mentioning that despite being such a simple and classical baseline, margin is difficult to analyze theoretically, and there remains little theoretical understanding of the method. Balcan et al. (2007) provide learning bounds for a modification of margin where examples are labeled in batches whose sizes depend on predetermined thresholds, and assume the data are distributed uniformly over the unit sphere and labeled by a linear separator.

