LEAST PROBABLE DISAGREEMENT REGION FOR ACTIVE LEARNING

Abstract

The active learning strategy of querying unlabeled samples near the estimated decision boundary at each step is known to be effective when the distance from a sample to the decision boundary can be explicitly evaluated. In many machine learning settings, however, especially those involving deep learning, a conventional distance such as the lp distance from a sample to the decision boundary is not readily measurable. This paper defines a theoretical distance from an unlabeled sample to the decision boundary as the least probable disagreement region (LPDR) containing that sample, and discusses how this theoretical distance can be empirically evaluated with a lower order of time complexity. Monte Carlo sampling of hypotheses is performed to approximate the theoretically defined distance. Experimental results on various datasets show that the proposed algorithm consistently outperforms other high-performing uncertainty-based active learning algorithms and achieves state-of-the-art active learning performance on the CIFAR10, CIFAR100, Tiny ImageNet, and Food101 datasets. On the CIFAR100 dataset with K-CNN, only the proposed algorithm outperforms random sampling, while all other algorithms fail to do so.

1. INTRODUCTION

Active learning (Cohn et al., 1996) is a subfield of machine learning that seeks data efficiency, attaining high accuracy with fewer labeled training data when the learner is allowed to choose the data from which it learns. For many real-world learning problems, a large collection of unlabeled samples is assumed to be available, and based on a certain query strategy, the labels of the most informative samples are iteratively queried from an oracle and used to retrain the model (Bouneffouf et al., 2014; Roy & McCallum, 2001; Sener & Savarese, 2017b; Settles et al., 2008; Sinha et al., 2019; Sener & Savarese, 2017a; Pinsler et al., 2019; Shi & Yu, 2019; Gudovskiy et al., 2020). Active learning attempts to achieve high accuracy using as few labeled samples as possible (Settles, 2009). Among the possible query strategies, uncertainty-based sampling (Culotta & McCallum, 2005; Scheffer et al., 2001; Mussmann & Liang, 2018), which improves the current model by labeling unlabeled samples that the model finds difficult to predict, is a simple strategy commonly used in pool-based active learning (Lewis & Gale, 1994). Nevertheless, many existing uncertainty-based algorithms have their own limitations. Entropy-based (Shannon, 1948) uncertainty sampling can query unlabeled samples near the decision boundary in binary classification, but it does not perform well in multiclass classification because entropy does not equate well with the distance to a complex decision boundary (Joshi et al., 2009). Another approach, based on MC-dropout sampling (Gal et al., 2017) with the mutual-information-based BALD (Houlsby et al., 2011) as its uncertainty measure, identifies unlabeled samples that are individually informative; such samples, however, are not necessarily informative when considered jointly with other samples in a batch acquisition. BatchBALD (Kirsch et al., 2019) was introduced to address this problem.
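To make the entropy-based baseline concrete, the following is a minimal NumPy sketch (not from the paper; the function name and toy probabilities are illustrative) of pool-based uncertainty sampling that queries the samples with the highest predictive entropy:

```python
import numpy as np

def entropy_query(probs, k):
    """Select the k unlabeled samples with the highest predictive entropy.

    probs: (n_samples, n_classes) array of predicted class probabilities.
    Returns the indices of the k most uncertain samples, most uncertain first.
    """
    eps = 1e-12  # guard against log(0)
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    return np.argsort(-entropy)[:k]

# Toy pool of three samples over three classes.
p = np.array([[0.98, 0.01, 0.01],   # confident -> low entropy
              [0.40, 0.35, 0.25],   # near-uniform -> high entropy
              [0.70, 0.20, 0.10]])
print(entropy_query(p, 1))  # queries the near-uniform sample (index 1)
```

In binary classification, high entropy coincides with proximity to the decision boundary, which is why this heuristic works there; in the multiclass case this correspondence breaks down, as noted above.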
However, BatchBALD must in principle compute the joint mutual information over all possible batches, which is infeasible for large query sizes. The ensemble method (Beluch et al., 2018), an instance of the query-by-committee (QBC) algorithms (Seung et al., 1992), has been shown to perform well in many cases. The fundamental premise behind QBC is minimizing the version space (Mitchell, 1982), the set of hypotheses that are consistent with the labeled samples. However, the ensemble method imposes a high computational load because every network in the ensemble must be trained.
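A common concrete form of the QBC principle is vote-entropy disagreement: samples on which the committee members disagree most are the ones that most shrink the version space when labeled. The sketch below is illustrative only (the function name and toy predictions are assumptions, not from the paper), with the committee standing in for an ensemble of trained networks:

```python
import numpy as np

def vote_entropy_query(committee_preds, n_classes, k):
    """Query-by-committee: pick the k samples with the highest vote entropy.

    committee_preds: (n_members, n_samples) array of hard class predictions,
    one row per committee model (e.g., each network of an ensemble).
    """
    n_members, n_samples = committee_preds.shape
    votes = np.zeros((n_samples, n_classes))
    for m in range(n_members):
        votes[np.arange(n_samples), committee_preds[m]] += 1
    vote_frac = votes / n_members          # per-sample vote distribution
    eps = 1e-12                            # guard against log(0)
    disagreement = -np.sum(vote_frac * np.log(vote_frac + eps), axis=1)
    return np.argsort(-disagreement)[:k]

# Committee of 4 models, pool of 2 samples: all members agree on sample 0,
# but split their votes on sample 1, so sample 1 is queried.
preds = np.array([[0, 1],
                  [0, 2],
                  [0, 1],
                  [0, 0]])
print(vote_entropy_query(preds, n_classes=3, k=1))  # -> [1]
```

The cost noted above is visible here: every row of `committee_preds` requires a fully trained model, which is what makes deep ensembles expensive as a query strategy.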

