LEAST PROBABLE DISAGREEMENT REGION FOR ACTIVE LEARNING

Abstract

The active learning strategy of querying unlabeled samples near the estimated decision boundary at each step is known to be effective when the distance from a sample to the decision boundary can be explicitly evaluated; however, in many machine learning settings, especially those involving deep learning, a conventional distance such as the ℓp distance from a sample to the decision boundary is not readily measurable. This paper defines a theoretical distance from an unlabeled sample to the decision boundary as the least probable disagreement region (LPDR) containing the unlabeled sample, and discusses how this theoretical distance can be empirically evaluated with a lower order of time complexity. Monte Carlo sampling of hypotheses is performed to approximate the theoretically defined distance. Experimental results on various datasets show that the proposed algorithm consistently outperforms other high-performing uncertainty-based active learning algorithms and leads to state-of-the-art active learning performance on the CIFAR10, CIFAR100, Tiny ImageNet and Food101 datasets. On CIFAR100 with K-CNN, only the proposed algorithm outperforms random sampling, while all other algorithms fail to do so.

1. INTRODUCTION

Active learning (Cohn et al., 1996) is a subfield of machine learning that attains data efficiency with fewer labeled training samples when the learner is allowed to choose the data from which it learns. For many real-world learning problems, a large collection of unlabeled samples is assumed to be available, and based on a certain query strategy, the labels of the most informative samples are iteratively queried from an oracle and used to retrain the model (Bouneffouf et al., 2014; Roy & McCallum, 2001; Sener & Savarese, 2017b; Settles et al., 2008; Sinha et al., 2019; Sener & Savarese, 2017a; Pinsler et al., 2019; Shi & Yu, 2019; Gudovskiy et al., 2020). Active learning attempts to achieve high accuracy using as few labeled samples as possible (Settles, 2009). Among possible query strategies, uncertainty-based sampling (Culotta & McCallum, 2005; Scheffer et al., 2001; Mussmann & Liang, 2018), which enhances the current model by labeling unlabeled samples that are difficult for the model to predict, is a simple strategy commonly used in pool-based active learning (Lewis & Gale, 1994). Nevertheless, many existing uncertainty-based algorithms have their own limitations. Entropy-based (Shannon, 1948) uncertainty sampling can query unlabeled samples near the decision boundary in binary classification, but it does not perform well in multiclass classification because entropy does not correspond well to the distance to a complex decision boundary (Joshi et al., 2009). Another approach, based on MC-dropout sampling (Gal et al., 2017) with the mutual-information-based BALD (Houlsby et al., 2011) as an uncertainty measure, identifies unlabeled samples that are individually informative. Such samples, however, are not necessarily informative when considered jointly with other samples for label acquisition. BatchBALD (Kirsch et al., 2019) was introduced to address this problem.
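As a concrete illustration of the entropy-based uncertainty sampling discussed above, the following minimal sketch selects pool samples with the highest predictive entropy. The function name and toy pool are ours, for illustration only, and are not from any particular library.

```python
import numpy as np

def entropy_query(probs, k):
    """Select the k unlabeled samples with the highest predictive entropy.

    probs: (n_samples, n_classes) array of softmax outputs.
    Returns indices of the k most uncertain samples, most uncertain first.
    """
    eps = 1e-12  # avoids log(0) for confident predictions
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    # Highest-entropy samples are the most uncertain under this measure
    return np.argsort(entropy)[-k:][::-1]

# Toy pool: two confident predictions and one near-uniform one
pool = np.array([[0.90, 0.05, 0.05],
                 [0.34, 0.33, 0.33],
                 [0.10, 0.80, 0.10]])
print(entropy_query(pool, 1))  # the near-uniform row (index 1) is queried first
```

In binary classification, high entropy coincides with proximity to the decision boundary; in the multiclass case this correspondence breaks down, which is the limitation noted above.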
However, BatchBALD computes, in principle, the joint mutual information of all possible batches, which is infeasible for large query sizes. The ensemble method (Beluch et al., 2018), one of the query-by-committee (QBC) algorithms (Seung et al., 1992), has been shown to perform well in many cases. The fundamental premise behind QBC is minimizing the version space (Mitchell, 1982), which is the set of hypotheses consistent with the labeled samples. However, the ensemble method incurs a high computational load because every network in the ensemble must be trained.

This paper defines a theoretical distance, referred to as the least probable disagreement region (LPDR), from a sample to the estimated decision boundary. In each step of active learning, the labels of the unlabeled samples nearest to the decision boundary in terms of LPDR are obtained and used to retrain the classifier, improving the accuracy of the estimated decision boundary. It is generally understood that labels of samples near the decision boundary are the most informative, as such samples are the most uncertain. Indeed, Balcan et al. (2007) show that selecting the unlabeled samples with the smallest margin to a linear decision boundary, and thereby minimal certainty, attains exponential improvement over random sampling in terms of sample complexity. In deep learning, however, it is difficult to identify the samples nearest to the decision boundary because the sample distance to the decision boundary is hard to evaluate. An adversarial approach (Ducoffe & Precioso, 2018) to approximating this distance has been studied, but it is not shown to preserve the ordering of sample distances and requires considerable computation.
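The QBC premise that the ensemble method instantiates can be sketched with vote entropy, one standard committee disagreement measure: samples on which trained committee members disagree most are the ones that shrink the version space fastest. The function names below are illustrative, not from the cited works.

```python
import numpy as np

def vote_entropy_query(committee_preds, n_classes, k):
    """Query-by-committee with vote entropy as the disagreement measure.

    committee_preds: (n_models, n_samples) array of hard class predictions.
    Returns indices of the k samples the committee disagrees on most.
    """
    n_models, n_samples = committee_preds.shape
    disagreement = np.empty(n_samples)
    for i in range(n_samples):
        votes = np.bincount(committee_preds[:, i], minlength=n_classes)
        p = votes / n_models          # empirical vote distribution
        p = p[p > 0]                  # drop zero entries before taking log
        disagreement[i] = -np.sum(p * np.log(p))
    return np.argsort(disagreement)[-k:][::-1]

# Three committee members, four pool samples: only sample 2 splits the vote
preds = np.array([[0, 1, 0, 2],
                  [0, 1, 1, 2],
                  [0, 1, 2, 2]])
print(vote_entropy_query(preds, 3, 1))  # -> [2]
```

The computational objection above applies directly: producing `committee_preds` requires training every member network, which is what the proposed LPDR approach avoids.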

2. DISTANCE: LEAST PROBABLE DISAGREEMENT REGION (LPDR)

This paper proposes an algorithm for selecting unlabeled samples that are close to the decision boundary, which in many cases cannot be explicitly defined. Let X, Y, H and D be the instance space, the label space, the set of hypotheses h : x → y and the joint distribution over (x, y) ∈ X × Y, respectively. The distance between two hypotheses ĥ and h is defined as the probability of the disagreement region for ĥ and h. This distance was originally defined in Hanneke et al. (2014) and Hsu (2010):

ρ(ĥ, h) := P_D[ĥ(X) ≠ h(X)].   (1)

This paper defines the sample distance d of x to the hypothesis ĥ ∈ H based on ρ as the least probable disagreement region (LPDR) that contains x:

d(x, ĥ) := inf_{h ∈ H(x, ĥ)} ρ(ĥ, h)  where  H(x, ĥ) = {h ∈ H : ĥ(x) ≠ h(x)}.

Here, the sample distribution D is unknown, and H(x, ĥ) may be uncountably infinite. Therefore, a systematic and empirical method for evaluating the distance is required. One might consider the procedure below: sample hypothesis sets H_ρ' = {h' : ρ(ĥ, h') ≤ ρ'} in terms of ρ', and perform a grid search to determine the smallest ρ' such that there exists h' ∈ H_ρ' satisfying ĥ(x) ≠ h'(x) for a given x. Sampling the hypotheses within the ball can be performed by sampling the corresponding parameters, under the assumption that the expected hypothesis distance is monotonically increasing in the expected distance between the corresponding parameters (see Assumption 1). This scheme relies on a grid search over ρ' and is therefore computationally inefficient. However, unlabeled samples can be ordered according to d without a grid search under the assumption that there exists an H_ρ' such that the variation ratio V(x) = 1 − f_m(x)/|H_ρ'| and d(x, ĥ) have a strong negative correlation, where f_m(x) = max_c Σ_{h' ∈ H_ρ'} I[h'(x) = c] (see Assumption 2).

Figure 1: An example of LPDR between a sample x = x_0 and a hypothesis ĥ = h_a in binary classification using h_θ(x) = I[x > θ] on input x ∼ U[0, 1].

Figure 1 shows an example of LPDR. Define H = {h_θ : h_θ(x) = I[x > θ]} on input x sampled from the uniform distribution D = U[0, 1], where I[·] is an indicator function. Suppose x = x_0 and ĥ = h_a ∈ H with a < x_0. Then H(x_0, h_a) consists of all hypotheses whose prediction on x_0 disagrees with h_a(x_0) = 1, i.e., H(x_0, h_a) = {h_b ∈ H : h_b(x_0) = 0} = {h_b ∈ H : b > x_0}. The LPDR between x_0 and h_a is d(x_0, h_a) = x_0 − a, as the infimum of the distance between h_a and h_b ∈ H(x_0, h_a) is ρ(h_a, h_{x_0}) = x_0 − a.
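The one-dimensional threshold example can be verified numerically. In this hypothesis class the disagreement region between thresholds a and b is the interval between them, so ρ(h_a, h_b) = |a − b| under U[0, 1]; the sketch below approximates the infimum over a dense grid of thresholds (the grid and function names are ours, for illustration).

```python
import numpy as np

# Hypotheses h_theta(x) = I[x > theta] on x ~ U[0, 1]; for two thresholds
# a and b, the disagreement region is the interval between them, so
# rho(h_a, h_b) = |a - b| under the uniform distribution.
def rho(a, b):
    return abs(a - b)

def lpdr(x0, a, grid):
    """Approximate d(x0, h_a): the smallest rho(h_a, h_b) over thresholds
    b whose prediction on x0 disagrees with h_a(x0)."""
    h = lambda theta, x: int(x > theta)
    disagree = [b for b in grid if h(b, x0) != h(a, x0)]
    return min(rho(a, b) for b in disagree)

grid = np.linspace(0.0, 1.0, 10001)  # dense grid of candidate thresholds
print(lpdr(x0=0.7, a=0.3, grid=grid))  # approximately x0 - a = 0.4
```

The disagreeing thresholds are exactly those with b ≥ x0, and the nearest of them to a gives the infimum x0 − a, matching the closed-form value in the example above.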


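The grid-search-free procedure described above, Monte Carlo sampling hypotheses by perturbing the estimated parameters and ranking pool samples by the variation ratio, can be sketched as follows. The linear model, noise scale and toy pool are ours, for illustration under Assumptions 1 and 2; the paper's experiments use deep networks instead.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(W, X):
    """Multiclass linear hypothesis: argmax over class scores."""
    return np.argmax(X @ W, axis=1)

def variation_ratio_order(W_hat, X_pool, n_hyp=100, sigma=0.1):
    """Order pool samples by decreasing variation ratio V(x).

    Hypotheses are Monte Carlo sampled by perturbing the parameters W_hat
    with Gaussian noise (assuming hypothesis distance grows with parameter
    distance, as in Assumption 1). Under Assumption 2, large V(x) indicates
    small LPDR d(x, h_hat), so samples are returned nearest-first.
    """
    n = X_pool.shape[0]
    votes = np.zeros((n, W_hat.shape[1]), dtype=int)
    for _ in range(n_hyp):
        W = W_hat + sigma * rng.standard_normal(W_hat.shape)
        preds = predict(W, X_pool)
        votes[np.arange(n), preds] += 1
    f_m = votes.max(axis=1)    # modal vote count f_m(x) per sample
    V = 1.0 - f_m / n_hyp      # variation ratio V(x)
    return np.argsort(-V)      # most uncertain (nearest) first

# Two classes separated at x0 = 0; the second feature is a constant bias term
W_hat = np.array([[1.0, -1.0], [0.0, 0.0]])
X_pool = np.array([[2.0, 1.0], [0.05, 1.0], [-1.5, 1.0]])
print(variation_ratio_order(W_hat, X_pool))  # index 1 (near the boundary) comes first
```

The sample closest to the decision boundary flips prediction under the most parameter perturbations, giving it the largest variation ratio; samples far from the boundary are never flipped and tie at V(x) = 0, which is exactly the ordering behavior the negative correlation in Assumption 2 requires.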