MACHINE LEARNING ALGORITHMS FOR DATA LABELING: AN EMPIRICAL EVALUATION

Abstract

The lack of labeled data is a major problem in both research and industrial settings, since obtaining labels is often expensive and time-consuming. In recent years, several machine learning algorithms have been developed to assist with and perform automated labeling of partially labeled datasets. While many of these algorithms are available in open-source packages, no research has investigated how they compare to each other on different types of datasets and with different percentages of available labels. To address this problem, this paper empirically evaluates and compares seven algorithms for automated labeling in terms of accuracy. We investigate how these algorithms perform on six different, well-known datasets covering three types of data: images, texts, and numerical values. We evaluate the algorithms under two experimental conditions, with 10% and 50% of the labels available in the dataset. Each algorithm, on each dataset and under each experimental condition, is evaluated independently ten times with different random seeds. The results are analyzed and the algorithms are compared using a Bayesian Bradley-Terry model. The results indicate that while label spreading with K-nearest neighbors performs better in the aggregated results, the active learning algorithms query by instance QBC and query instance uncertainty sampling perform better when only 10% of the labels are available. These results can help machine learning practitioners choose suitable machine learning algorithms to label their data.

1. INTRODUCTION

Supervised learning is the most commonly used machine learning paradigm. However, supervised learning, and machine learning in general, comes with problems. The first problem is that machine learning requires huge amounts of data. The second is that supervised learning needs labels in the data. In a case study performed with industry, several labeling issues were found (Anonymous, 2020a). A recent systematic literature review investigated which machine learning algorithms exist to make labeling easier, focusing on the use of semi-supervised learning and active learning for automatic labeling of data (Anonymous, 2020b). From those results, the authors concluded which active and semi-supervised learning algorithms were the most popular and which data types they can be used on. However, even though work has been done on active and semi-supervised learning, these learning paradigms are still very new to many companies and consequently seldom used. In a simulation study, we evaluated seven semi-supervised and active learning algorithms on six datasets of different types: numerical, text, and image data. Using a Bayesian Bradley-Terry model, we ranked the algorithms according to accuracy and effort.

The contribution of this paper is to provide a taxonomy of automatic labeling algorithms and an empirical evaluation of the algorithms in the taxonomy across two dimensions: Performance, how accurate the algorithm is, and Effort, how much manual work has to be done by the data scientist.

The remainder of this paper is organized as follows. In the upcoming section, we provide an overview of semi-supervised and active learning algorithms and how they work. In section 3, we describe our study, how we performed the simulations, what datasets and source code we used,

