LEAST PROBABLE DISAGREEMENT REGION FOR ACTIVE LEARNING

Abstract

The active learning strategy of querying unlabeled samples nearer the estimated decision boundary at each step is known to be effective when the distance from the sample to the decision boundary can be explicitly evaluated; however, in numerous cases in machine learning, especially those involving deep learning, a conventional distance such as the $\ell_p$ distance from a sample to the decision boundary is not readily measurable. This paper defines a theoretical distance of an unlabeled sample to the decision boundary as the least probable disagreement region (LPDR) containing the unlabeled sample, and it discusses how this theoretical distance can be empirically evaluated with a lower order of time complexity. Monte Carlo sampling of the hypothesis is performed to approximate the theoretically defined distance. Experimental results on various datasets show that the proposed algorithm consistently outperforms all other high performing uncertainty based active learning algorithms and leads to state-of-the-art active learning performance on the CIFAR10, CIFAR100, Tiny ImageNet and Food101 datasets. Only the proposed algorithm outperforms random sampling on the CIFAR100 dataset using K-CNN, while all other algorithms fail to do so.

1. INTRODUCTION

Active learning (Cohn et al., 1996) is a subfield of machine learning that aims to attain data efficiency with fewer labeled training data when the learner is allowed to choose the training data from which it learns. For many real-world learning problems, large collections of unlabeled samples are assumed to be available, and based on a certain query strategy, the labels of the most informative data are iteratively queried from an oracle to be used in retraining the model (Bouneffouf et al., 2014; Roy & McCallum, 2001; Sener & Savarese, 2017b; Settles et al., 2008; Sinha et al., 2019; Sener & Savarese, 2017a; Pinsler et al., 2019; Shi & Yu, 2019; Gudovskiy et al., 2020). Active learning attempts to achieve high accuracy using as few labeled samples as possible (Settles, 2009). Of the possible query strategies, uncertainty-based sampling (Culotta & McCallum, 2005; Scheffer et al., 2001; Mussmann & Liang, 2018), which enhances the current model by labeling unlabeled samples that are difficult for the model to predict, is a simple strategy commonly used in pool-based active learning (Lewis & Gale, 1994). Nevertheless, many existing uncertainty-based algorithms have their own limitations. Entropy-based (Shannon, 1948) uncertainty sampling can query unlabeled samples near the decision boundary for binary classification, but it does not perform well in multiclass classification because entropy does not equate well with the distance to a complex decision boundary (Joshi et al., 2009). Another approach based on MC-dropout sampling (Gal et al., 2017), which uses the mutual information based BALD (Houlsby et al., 2011) as an uncertainty measure, identifies unlabeled samples that are individually informative. Such a sample, however, is not necessarily informative when it is considered jointly with other samples for label acquisition. To address this problem, BatchBALD (Kirsch et al., 2019) was introduced.
However, BatchBALD theoretically computes the joint mutual information of all possible batches, and is infeasible for large query sizes. The ensemble method (Beluch et al., 2018), one of the query by committee (QBC) algorithms (Seung et al., 1992), has been shown to perform well in many cases. The fundamental premise behind QBC is minimizing the version space (Mitchell, 1982), which is the set of hypotheses that are consistent with the labeled samples. However, the ensemble method incurs a high computational load because all networks that make up the ensemble must be trained.

This paper defines a theoretical distance, referred to as the least probable disagreement region (LPDR), from a sample to the estimated decision boundary. In each step of active learning, the labels of the unlabeled samples nearest to the decision boundary in terms of LPDR are obtained and used for retraining the classifier to improve the accuracy of the estimated decision boundary. It is generally understood that the labels of samples near the decision boundary are the most informative, as these samples are uncertain. Indeed, in Balcan et al. (2007), selecting unlabeled samples with the smallest margin to the linear decision boundary, and thereby minimal certainty, attains an exponential improvement over random sampling in terms of sample complexity. In deep learning, it is difficult to identify the samples nearest to the decision boundary because the sample distance to the decision boundary is difficult to evaluate. An adversarial approach (Ducoffe & Precioso, 2018) to approximating the sample distance to the decision boundary has been studied, but this method does not show preservation of the order of the sample distances and requires considerable computation to obtain the distance.

2. DISTANCE: LEAST PROBABLE DISAGREEMENT REGION (LPDR)

This paper proposes an algorithm for selecting unlabeled data that are close to the decision boundary, which cannot be explicitly defined in many cases. Let $\mathcal{X}$, $\mathcal{Y}$, $\mathcal{H}$ and $\mathcal{D}$ be the instance space, the label space, the set of hypotheses $h : \mathcal{X} \to \mathcal{Y}$ and the joint distribution over $(x, y) \in \mathcal{X} \times \mathcal{Y}$. The distance between two hypotheses $\hat{h}$ and $h$ is defined as the probability of the disagreement region for $\hat{h}$ and $h$. This distance was originally defined in Hanneke et al. (2014) and Hsu (2010):

$$\rho(\hat{h}, h) := P_{\mathcal{D}}[\hat{h}(X) \neq h(X)]. \tag{1}$$

This paper defines the sample distance $d$ of $x$ to the hypothesis $\hat{h} \in \mathcal{H}$ based on $\rho$ as the least probable disagreement region (LPDR) that contains $x$:

$$d(x, \hat{h}) := \inf_{h \in \mathcal{H}(x, \hat{h})} \rho(\hat{h}, h) \quad \text{where} \quad \mathcal{H}(x, \hat{h}) = \{h \in \mathcal{H} : \hat{h}(x) \neq h(x)\}.$$

In the example of Figure 1, where $h_\theta(x) = I[x > \theta]$, $x \sim U[0, 1]$ and $a < x_0$, $\mathcal{H}(x_0, h_a)$ consists of all hypotheses whose prediction on $x_0$ is in disagreement with $h_a(x_0) = 1$, i.e., $\mathcal{H}(x_0, h_a) = \{h_b \in \mathcal{H} : h_b(x_0) = 0\} = \{h_b \in \mathcal{H} : b > x_0\}$. Then, the LPDR between $x_0$ and $h_a$ is $d(x_0, h_a) = x_0 - a$, as the infimum of the distance between $h_a$ and $h_b \in \mathcal{H}(x_0, h_a)$ is $\rho(h_a, h_{x_0}) = x_0 - a$.

Here, the sample distribution $\mathcal{D}$ is unknown, and $\mathcal{H}(x, \hat{h})$ may be uncountably infinite. Therefore, a systematic and empirical method for evaluating the distance is required. One might consider the procedure below: sample hypothesis sets $\mathcal{H}' = \{h' : \rho(\hat{h}, h') \le \rho'\}$ in terms of $\rho'$, and perform grid search to determine the smallest $\rho'$ such that there exists $h' \in \mathcal{H}'$ satisfying $\hat{h}(x) \neq h'(x)$ for a given $x$. Sampling the hypotheses within the ball can be performed by sampling the corresponding parameters under the assumption that the expected hypothesis distance is monotonically increasing in the expected distance between the corresponding parameters (see Assumption 1). This scheme is based on performing grid search on $\rho'$ and is therefore computationally inefficient.
However, unlabeled samples can be ordered according to $d$ without grid search under the assumption that there exists an $\mathcal{H}'$ such that the variation ratio $V(x) = 1 - f_m^{(x)} / |\mathcal{H}'|$ and $d(x, \hat{h})$ have a strong negative correlation, where $f_m^{(x)} = \max_c \sum_{h' \in \mathcal{H}'} I[h'(x) = c]$ (see Assumption 2).

Assumption 1. The expected distance between $\hat{h}$ and a randomly sampled $h$ is monotonically increasing in the expected distance between the corresponding $\hat{w}$ and $w$, i.e., $E[\|\hat{w} - w_1\| \mid \hat{w}] \le E[\|\hat{w} - w_2\| \mid \hat{w}]$ implies that $E[\rho(\hat{h}, h_1) \mid \hat{h}] \le E[\rho(\hat{h}, h_2) \mid \hat{h}]$, where $\hat{w}$, $w_1$ and $w_2$ are the parameters pertaining to $\hat{h}$, $h_1$ and $h_2$ respectively.

Assumption 2. There exists a hypothesis set $\mathcal{H}'$ sampled around $\hat{h}$ having the property that a large variation ratio for a given sample implies a small sample distance to $\hat{h}$ with high probability, i.e., there exists $\mathcal{H}'$ such that $V(x_1) \ge V(x_2)$ implies that $d(x_1, \hat{h}) \le d(x_2, \hat{h})$ with high probability.
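The variation ratio that Assumption 2 relies on is simple to compute once the predictions of the sampled hypotheses are available. The following is a minimal sketch, not the authors' implementation; the function name, the sample names and the toy predictions are hypothetical and serve only to illustrate the ordering.

```python
from collections import Counter

def variation_ratio(predictions):
    """V(x) = 1 - f_m / |H'|, where f_m is the count of the modal predicted class."""
    modal_count = Counter(predictions).most_common(1)[0][1]
    return 1.0 - modal_count / len(predictions)

# Hypothetical predictions of 5 sampled hypotheses on three unlabeled samples.
preds = {
    "x1": [0, 0, 0, 0, 0],   # all hypotheses agree
    "x2": [0, 1, 0, 0, 0],   # one hypothesis disagrees
    "x3": [0, 1, 1, 0, 2],   # strong disagreement
}
scores = {name: variation_ratio(p) for name, p in preds.items()}

# Under Assumption 2, a larger variation ratio suggests a smaller distance d,
# so samples are ordered from the highest to the lowest variation ratio.
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

In this toy case the sample on which the hypotheses disagree most is ranked first, which is the ordering Assumption 2 associates with the smallest sample distance.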

3.1. HYPOTHESES AND PARAMETERS IN DEEP NETWORKS: ASSUMPTION 1

The distance between two hypotheses can be approximated by comparing the vectors of labels that the hypotheses predict on random samples:

$$\rho(\hat{h}, h) \approx \rho_e(\hat{h}, h) = \frac{1}{m} \sum_{i=1}^{m} I\left[\hat{h}(x^{(i)}) \neq h(x^{(i)})\right]$$

where $x^{(i)}$ is the $i$-th sample for $i \in [m]$. The hypothesis $h$ is sampled by sampling the model parameter $w \sim \mathcal{N}(\hat{w}, I\sigma^2)$, where $\hat{w}$ is the model parameter of $\hat{h}$, and the expected distance between $w$ and $\hat{w}$ depends on $\sigma$. For a fixed $\sigma$, $\rho_e$ is obtained as the average over 100 repetitions. The left-hand side of Figure 2 shows the relationship between $\rho_e$ and $\sigma$ on various datasets and deep networks. $\rho_e$ increases almost monotonically as $\sigma$ increases. This implies that the order is preserved between $\sigma$ and $\rho_e$. Furthermore, $\rho_e$ is almost linearly proportional to $\log(\sigma)$ in the ascending part of the graph, i.e., $\sigma \propto e^{\beta \rho_e}$ for some $\beta > 0$.

The right-hand side of Figure 2 shows $V$ with respect to $\sigma$ for each unlabeled sample on MNIST. The sample distance to the decision boundary can be expressed as the $\sigma$ at which the variation ratio is not zero for the first time (white arrow), where the indices of unlabeled samples on the y-axis are ordered by LPDR. The variation ratio increases as $\sigma$ increases, and it is expected that a data point with a short distance has a larger variation ratio than a data point with a long distance over a certain range of $\sigma$.
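The empirical hypothesis distance and its dependence on $\sigma$ can be illustrated on a toy problem. The sketch below is an assumption-laden stand-in for the deep networks of Figure 2: it uses a hypothetical 5-dimensional linear classifier, Gaussian inputs and a small grid of $\sigma$ values, and averages $\rho_e$ over 100 sampled hypotheses per $\sigma$.

```python
import random

def predict(w, x):
    """Binary linear classifier: class 1 if the inner product is non-negative."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0

def rho_e(w_hat, w, xs):
    """Fraction of samples on which the two hypotheses disagree."""
    return sum(predict(w_hat, x) != predict(w, x) for x in xs) / len(xs)

def mean_rho_e(w_hat, sigma, xs, rng, repeats=100):
    """Average rho_e over hypotheses sampled as w ~ N(w_hat, I * sigma^2)."""
    total = 0.0
    for _ in range(repeats):
        w = [wi + rng.gauss(0.0, sigma) for wi in w_hat]
        total += rho_e(w_hat, w, xs)
    return total / repeats

rng = random.Random(0)
dim = 5                                   # hypothetical toy dimension
w_hat = [1.0] * dim                       # hypothetical trained parameters
xs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(500)]
sigmas = [0.01, 0.1, 1.0, 10.0]
curve = [mean_rho_e(w_hat, s, xs, rng) for s in sigmas]
for s, r in zip(sigmas, curve):
    print(f"sigma={s}: mean rho_e={r:.3f}")
```

On this toy model the averaged $\rho_e$ grows with $\sigma$ and stays below roughly 1/2, which mirrors the qualitative behavior described for Figure 2 without reproducing its numbers.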

3.2. LPDR VS VARIATION RATIO: ASSUMPTION 2

The left-hand side of Figure 3 shows Spearman's rank correlation coefficient (Spearman, 1904) between LPDR and the variation ratio with respect to $\sigma$. The correlation is calculated using only unlabeled samples whose variation ratio is not 0. A strong rank correlation is observed when $\sigma$ has an appropriate value. Too large a value of $\sigma$ generates hypotheses too far away from $\hat{h}$, which is not helpful for measuring the distance. The right-hand side of Figure 3 shows an example of $\sigma$ ($\log(\sigma) = -5.0$) for which LPDR and the variation ratio have a strong negative correlation on MNIST, that is, a data point with a larger variation ratio is closer to the decision boundary. Results for various datasets and networks are presented in Appendix C.

The time complexity is discussed to validate the efficiency of using the variation ratio. Let $m$, $N$ and $n_\sigma$ be the unlabeled sample size, $|\mathcal{H}'|$ and the number of grid points for $\sigma$, respectively. Ordering unlabeled samples in terms of LPDR by grid search with respect to $\sigma$ requires a time complexity of $m \times N \times n_\sigma$ (see the right-hand side of Figure 2). However, using the variation ratio for ordering unlabeled samples reduces the time complexity to $m \times N$. In the case of $n_\sigma = cN$ for some $c > 0$, the time complexity is reduced from $O(mN^2)$ to $O(mN)$.
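The rank correlation used to check Assumption 2 can be computed as the Pearson correlation of the rank vectors. The following is a minimal pure-Python sketch; the LPDR values and variation ratios are hypothetical numbers chosen for illustration, not results from the paper, and samples with zero variation ratio are assumed to have been removed beforehand.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(a, b):
    """Spearman's coefficient as the Pearson correlation of the two rank vectors."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5

# Hypothetical LPDR values and variation ratios for five unlabeled samples.
lpdr_values = [0.02, 0.05, 0.11, 0.20, 0.35]
var_ratios = [0.48, 0.40, 0.22, 0.25, 0.05]
r = spearman(lpdr_values, var_ratios)
print(round(r, 3))
```

A coefficient close to -1 on real data would correspond to the strong negative rank correlation that Assumption 2 requires.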

4. ALGORITHM FOR LPDR

4.1. FRAMEWORK

Let $L_t$ and $U_t$ be the labeled and unlabeled samples at step $t$. At step $t$, LPDR trains model parameters $\hat{w}_t$ using labeled samples $L_t$, and constructs $\mathcal{H}'$ by sampling the model parameters $w_n \sim \mathcal{N}(\hat{w}_t, I\sigma^2)$ for $n \in [N]$. Then, LPDR queries the top $q$ unlabeled samples having the highest variation ratio from the pool data $P_t \subset U_t$ of size $m$.

4.2. CONSTRUCTION OF SAMPLED HYPOTHESIS SET

It is important to set an appropriate $\sigma$ when constructing $\mathcal{H}'$, as the variation ratios go to 0 with decreasing $\sigma$ (see the right-hand side of Figure 2) and the rank correlation goes to zero with increasing $\sigma$ (see the left-hand side of Figure 3). For a theoretical analysis, consider binary classification with logistic regression where the predicted label is defined as $y = \mathrm{sgn}(x^T w)$ and $\sup_{x \in \mathcal{X}} \|x\|_\infty < \infty$. Then the following theorem holds; the proof is given in Appendix A.

Theorem 1. Suppose that $w_n$ for $n = 1, \ldots, N$ are generated with the variance of $\sigma^2$. For all $x$, the following hold: 1) as $N \to \infty$, $1 - f_m^{(x)}/N$ goes to 0 in probability as $\sigma^2$ goes to 0; 2) as $N \to \infty$, $1 - f_m^{(x)}/N$ goes to 1/2 in probability for binary classification using logistic regression as $\sigma^2$ goes to $\infty$.

The implication of Theorem 1 is that when $\sigma$ is too small or too large, it is difficult to compare the sample distances of unlabeled samples. In this active learning task, at least the $q$ most informative unlabeled samples must be identified. To meet this condition, it is reasonable to set the target for $\rho_n$ in Algorithm 1 to $\rho^* = q/m$, which is not very small and is less than 1/2 in general, for the $N$ hypotheses. This can be attained by updating $\sigma_n$ as $\sigma_{n+1} = \sigma_n e^{-\beta(\rho_n - \rho^*)}$ where $\beta > 0$ (see Appendix D). Figure 10 in Appendix E shows the final test accuracy with respect to the target $\rho_n$ on the MNIST dataset. LPDR performs best when the target $\rho_n$ is roughly $\rho^*$. In addition, the range of target $\rho_n$ associated with the best performance is wide; thus, LPDR is relatively robust against the target $\rho_n$. Furthermore, LPDR is robust against the hyperparameters $\beta$, $N$ and the sampling layers (see Appendix F).
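The two limits in Theorem 1 can be checked numerically on a toy linear classifier. The sketch below is only an illustration under stated assumptions: the parameters `w_hat`, the input `x`, the three values of $\sigma$ and the number of sampled hypotheses are hypothetical, and the reference value $1 - \Phi(a(x, \hat{w})/\sigma)$ with $a(x, \hat{w}) = |x^T \hat{w}| / \|x\|$ is the limit derived in Appendix A.

```python
import math
import random

def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def variation_ratio(x, w_hat, sigma, n_hyp, rng):
    """Binary variation ratio 1 - f_m / N over N perturbed linear classifiers."""
    positives = 0
    for _ in range(n_hyp):
        w = [wi + rng.gauss(0.0, sigma) for wi in w_hat]
        if sum(wi * xi for wi, xi in zip(w, x)) >= 0:
            positives += 1
    return 1.0 - max(positives, n_hyp - positives) / n_hyp

rng = random.Random(0)
w_hat = [1.0, -0.5]                       # hypothetical trained parameters
x = [0.8, 0.4]                            # hypothetical input
a = abs(sum(wi * xi for wi, xi in zip(w_hat, x))) / math.sqrt(sum(xi * xi for xi in x))

results = {}
for sigma in (0.01, 1.0, 100.0):
    empirical = variation_ratio(x, w_hat, sigma, 20000, rng)
    limit = 1.0 - normal_cdf(a / sigma)
    results[sigma] = (empirical, limit)
    print(f"sigma={sigma}: empirical={empirical:.3f}, limit={limit:.3f}")
```

For a very small $\sigma$ the empirical variation ratio is essentially 0, and for a very large $\sigma$ it approaches 1/2, so in both regimes samples become hard to distinguish by their variation ratio.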

Algorithm 1 Least Probable Disagreement Region (LPDR)

Input:
    $L_0$, $U_0$: initial labeled and unlabeled samples
    $m$, $q$: size of pool data and number of queries
    $\sigma_0^2$: initial variance for sampling
    $\rho^*$: target hypothesis distance ($= q/m$)
Procedure:
    for step $t = 0, 1, 2, \ldots, T-1$
        Train parameters $\hat{w}_t$ with $L_t$, then evaluate its empirical error $\hat{\epsilon}_t$ on $L_t$
        Set $\sigma_1 \leftarrow \sigma_t$
        for $n = 1, 2, \ldots, N$
            Sample parameters $w_n \sim \mathcal{N}(\hat{w}_t, I\sigma_n^2)$ for $h_n$
            Compute $\gamma_n = e^{-(\epsilon_n - \hat{\epsilon}_t)_+}$ where $\epsilon_n$ is the empirical error of $w_n$ on $L_t$
            Compute $\rho_n = \rho_e(\hat{h}_t, h_n)$
            Update $\sigma_{n+1} = \sigma_n e^{-\beta(\rho_n - \rho^*)}$ where $\beta > 0$
        end for
        Set $\sigma_{t+1} \leftarrow \sigma_{N+1}$
        Compute $V_w(x^{(i)}) = 1 - f_w^{(i)} / \sum_{n=1}^{N} \gamma_n$ where $f_w^{(i)} = \max_c \sum_{n=1}^{N} \gamma_n I[h_n(x^{(i)}) = c]$
        Get $I^* = \arg\max_{I \subset I_{P_t},\, |I| = q} \sum_{i \in I} V_w(x^{(i)})$ where $I_{P_t} = \{j : x^{(j)} \in P_t \subseteq U_t\}$
        Update $L_{t+1} = L_t \cup \{(x^{(i)}, y^{(i)})\}_{i \in I^*}$ and $U_{t+1} = U_t \setminus \{x^{(i)}\}_{i \in I^*}$
    end for

Meanwhile, the efficiency of querying samples in the disagreement region of the version space is well known both theoretically (Hanneke et al., 2014) and empirically (Beluch et al., 2018). When the trained hypothesis $\hat{h}_t$ is in the version space, the sampled hypotheses $h_n$ are in the version space with high probability, but there are cases where they are outside the version space (see Appendix G). Thus, LPDR puts a weight $\gamma_n$ on the prediction of the sampled hypothesis $h_n$, where $\gamma_n = e^{-(\epsilon_n - \hat{\epsilon}_t)_+}$ is a function of $\hat{\epsilon}_t = \mathrm{err}_{L_t}(\hat{h}_t)$ and $\epsilon_n = \mathrm{err}_{L_t}(h_n)$. Here, $(\cdot)_+$ is $\max\{0, \cdot\}$ and $\mathrm{err}_L(h)$ is the empirical error of $h$ on $L$. Then, LPDR uses the weighted variation ratio $V_w$ as a function of the weighted frequency of the modal class $f_w$, as defined below:

$$V_w(x^{(i)}) = 1 - \frac{f_w^{(i)}}{\sum_{n=1}^{N} \gamma_n} \tag{4}$$

where $f_w^{(i)} = \max_c \sum_{n=1}^{N} \gamma_n I[h_n(x^{(i)}) = c]$ and $x^{(i)} \in P_t \subseteq U_t$. If $\mathcal{H}'$ is a subset of the version space in the realizable case, the sample complexity of LPDR follows Hanneke's theorem (Hanneke et al., 2014).
Let $\Lambda$ be the sample complexity defined as the smallest integer $t$ such that for all $t' \ge t$, $\mathrm{err}(h_{t'}) \le \epsilon$ with probability at least $1 - \delta$, where $\mathrm{err}(h) := P_{\mathcal{D}}[h(X) \neq Y]$. Then, LPDR achieves a sample complexity $\Lambda$ such that, for $\mathcal{D}$ in the realizable case, $\forall \epsilon, \delta \in (0, 1)$,

$$\Lambda(\epsilon, \delta, \mathcal{D}) \lesssim \xi \cdot \left( D \log \xi + \log \frac{\log(1/\epsilon)}{\delta} \right) \cdot \log \frac{1}{\epsilon}$$

where $D$ and $\xi$ are the VC dimension of $\mathcal{H}$ and the disagreement coefficient with respect to $\mathcal{H}$ and $\mathcal{D}$. When $\xi = O(1)$, in terms of $\epsilon$, the number of labeled samples required by LPDR is just $O(\log(1/\epsilon) \cdot \log\log(1/\epsilon))$, while the number of labeled samples required by passive learning is $\Omega(1/\epsilon)$. Therefore, in this case, LPDR provides an exponential improvement over passive learning in sample complexity (Hsu, 2010).
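The weighting and selection steps of Algorithm 1 can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function names, the empirical errors and the pool predictions are hypothetical. Because the objective in the selection step is a sum of per-sample scores, maximizing it over subsets of size $q$ amounts to taking the $q$ samples with the highest weighted variation ratio.

```python
import math

def weights(err_sampled, err_trained):
    """gamma_n = exp(-(eps_n - eps_hat)_+) for each sampled hypothesis."""
    return [math.exp(-max(0.0, e - err_trained)) for e in err_sampled]

def weighted_variation_ratio(preds, gammas):
    """V_w = 1 - (largest weighted class vote) / (sum of weights)."""
    votes = {}
    for label, g in zip(preds, gammas):
        votes[label] = votes.get(label, 0.0) + g
    return 1.0 - max(votes.values()) / sum(gammas)

def select_queries(pool_preds, gammas, q):
    """Indices of the q pool samples with the highest weighted variation ratio."""
    scores = [weighted_variation_ratio(p, gammas) for p in pool_preds]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:q]

# Hypothetical toy values: 4 sampled hypotheses and 3 pool samples.
gammas = weights(err_sampled=[0.00, 0.00, 0.10, 0.50], err_trained=0.00)
pool_preds = [
    [1, 1, 1, 1],   # predictions of h_1..h_4 on pool sample 0
    [1, 0, 1, 0],   # pool sample 1
    [1, 1, 1, 0],   # pool sample 2
]
chosen = select_queries(pool_preds, gammas, q=2)
print(gammas, chosen)
```

Sampled hypotheses whose empirical error exceeds that of the trained hypothesis receive a weight below 1, so their votes count less when the modal class is determined.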

5. EXPERIMENTS

This section discusses experimental results on 8 benchmark datasets: MNIST (LeCun et al., 1998) , CIFAR10 (Krizhevsky et al., 2009) , SVHN (Netzer et al., 2019) , EMNIST (Cohen et al., 2017) , CIFAR100 (Krizhevsky et al., 2009) , Tiny ImageNet (subset of the ILSVRC dataset containing 

5.2. RESULTS FOR MNIST, CIFAR10, SVHN AND EMNIST

A number of experiments are conducted to compare the performance of LPDR with other high performing uncertainty based active learning algorithms on 8 datasets. Figure 4 shows the test accuracy with respect to the number of labeled samples on the MNIST, CIFAR10, SVHN and EMNIST datasets. The algorithms are denoted as follows: 'LPDR': the proposed algorithm, 'Random': random sampling, 'Entropy': entropy based uncertainty sampling, 'MC-BALD': MC dropout sampling using BALD, 'MC-VarR': MC dropout sampling using variation ratio (Ducoffe & Precioso, 2015) and 'ENS-VarR': ensemble method. Overall, LPDR either performs best or comparably to all other algorithms. Its performance is consistent regardless of the benchmark dataset. In the early steps, LPDR significantly outperforms all other algorithms on the MNIST and CIFAR10 datasets. Of all the algorithms compared, Entropy performs the worst. MC-BALD performs well only on the SVHN dataset: it seems that the performance of BALD is highly dependent on the dataset. With the query size set to 1, LPDR outperforms BatchBALD on the MNIST dataset (see Appendix I). Although MC-VarR and ENS-VarR are based on different sampling methods, both perform similarly, outperforming all others on the EMNIST dataset while showing a significant drop in performance compared to LPDR on the SVHN and CIFAR10 datasets. It is observed that the performances of the other algorithms have a relatively strong data dependency compared to LPDR. On the CIFAR10 dataset, the performances of MC-VarR and ENS-VarR are no better than that of Random, and Entropy and MC-BALD have lower performance than Random. These results can be attributed to the low network capacity compared to the data complexity. This issue is discussed in the next section.
Figure 5 : Performance comparison with respect to the network capacity on CIFAR100 dataset. The performances of all algorithms except LPDR are much worse than that of Random when using K-CNN, which has a relatively smaller network capacity than that of WRN-16-8. LPDR is able to perform consistently better than Random regardless of the network capacity.

5.3. RESULTS FOR CIFAR100 WITH K-CNN AND WIDE-RESNET

In order to compare the performance of the algorithms with respect to the network capacity, experiments are conducted using networks of different capacity on the same dataset. Figure 5 shows the test accuracy with respect to the number of labeled samples on the CIFAR100 dataset with K-CNN and WRN-16-8. The left-hand figure shows the results of using K-CNN, which has a relatively smaller network capacity than WRN-16-8. With the exception of LPDR, the performances of all algorithms are much worse than that of Random. The right-hand figure shows the results of using WRN-16-8, which has a relatively larger network capacity. In contrast to the results for K-CNN, most algorithms outperform Random. With a large network capacity, the performance gap between LPDR and the other algorithms is reduced, but LPDR still outperforms the others. LPDR is able to perform consistently better than Random regardless of the network capacity, and it seems to be particularly effective with low capacity networks.

5.4. RESULTS FOR TINY IMAGENET AND FOOD101

Experiments on a more difficult task are conducted. Figure 6 shows test accuracy with respect to the number of labeled samples on Tiny ImageNet and Food101 datasets with WRN-16-8. Tiny ImageNet and Food101 are considered to be more difficult than CIFAR100. Even on more difficult tasks, LPDR outperforms all other algorithms.

5.5. RESULTS FOR HAM10000

Figure 7: The performance comparison on the HAM10000 dataset with WRN-16-8. LPDR outperforms all other algorithms on the imbalanced dataset.

Additional experiments are conducted to compare the performance of the algorithms on the imbalanced HAM10000 dataset with WRN-16-8. Figure 7 shows the test accuracy with respect to the number of labeled samples. LPDR outperforms all other algorithms compared. Figure 15 in Appendix J shows the AUC with respect to the number of labeled samples. LPDR performs comparably to all other algorithms.

To summarize the comparison of the algorithms across all experimental settings and repetitions, rank and Dolan-More curves are presented in Appendix K. LPDR consistently achieves the top rank for all steps and significantly outperforms the other algorithms in all experimental settings.

6. RELATED WORK

Other than the uncertainty-based sampling framework (Culotta & McCallum, 2005; Scheffer et al., 2001; Mussmann & Liang, 2018; Lewis & Gale, 1994; Gal et al., 2017; Kirsch et al., 2019; Beluch et al., 2018) for active learning, methods based on the decision-theoretic framework, such as expected model change (Settles et al., 2008), have certain relevance to the proposed LPDR, as the unlabeled samples nearer the decision boundary that LPDR attempts to identify have larger gradients, leading to larger model change. Recently, adversarial approaches have been proposed to discriminate between labeled and unlabeled samples (Gissin & Shalev-Shwartz, 2019; Sinha et al., 2019; Zhang et al., 2020); after performing adversarial learning, the unlabeled samples that are most confidently predicted as unlabeled are queried and used to retrain the network. Here adversarial learning is used to indirectly identify samples near the decision boundary.

7. CONCLUSION

This paper defines a theoretical distance of an unlabeled sample to the decision boundary, referred to as the least probable disagreement region (LPDR) containing the unlabeled sample, for active learning. LPDR can be evaluated empirically with low computational load by making two assumptions regarding the parameters of the hypothesis space, the variation ratio and the LPDR. The two assumptions are empirically verified. Experimental results on various datasets show that LPDR consistently outperforms all other high performing uncertainty based active learning algorithms and leads to state-of-the-art active learning performance on the CIFAR10, CIFAR100, Tiny ImageNet and Food101 datasets. In addition, LPDR is able to perform consistently better than random sampling regardless of the network capacity, while all other algorithms compared fail to do so. LPDR is simple enough to be applied to various classification tasks with deep networks: the implementation requires only sampling a subset of parameters (the parameters in the last FC layer of the deep network). Additionally, LPDR is capable of quick and reliable performance in a variety of different settings with a computational load that is not much higher than that of other uncertainty sampling methods. In conclusion, LPDR is an effective uncertainty based sampling algorithm in pool-based active learning.

and the $Z_{nk}$ are independent random variables from $\mathcal{N}(0, 1)$. The event $\{\mathrm{sgn}(x^T w_n) \neq \mathrm{sgn}(x^T \hat{w}_t)\}$ is equal to $E_1 \cup E_2$ where $E_1 = \{\sigma x^T e_n \ge 0,\ x^T \hat{w}_t < 0\}$ and $E_2 = \{\sigma x^T e_n < 0,\ x^T \hat{w}_t \ge 0\}$. Thus, the proof has two parts: the cases of 1) $E_1$ and 2) $E_2$. In the first part,

$$P[E_1] = P\left[\sigma x^T e_n \ge |x^T \hat{w}_t|\right] = P\left[\sigma \|x\| Z \ge |x^T \hat{w}_t|\right] = 1 - \Phi\left(\frac{a(x, \hat{w}_t)}{\sigma}\right)$$

where $Z \sim \mathcal{N}(0, 1)$, $\Phi$ is the cumulative distribution function of the standard normal distribution, and $a(x, \hat{w}_t) = |x^T \hat{w}_t| / \|x\|$. Note that $\sigma x^T e_n \sim \mathcal{N}(0, \sigma^2 \|x\|^2)$. Consequently, $P[E_1] < 1/2$ due to $a(x, \hat{w}_t) > 0$. Hence,

$$\frac{f_m^{(x)}}{N} = \sum_{n=1}^{N} \frac{1}{N} I\left[\hat{h}_t(x) = h_n(x)\right]$$

goes to a value greater than 1/2 in probability as $N \to \infty$ because $\mathrm{Var}(f_m^{(x)}/N) \to 0$ as $N \to \infty$. Therefore, as $N \to \infty$, $\forall x$, the variation ratio is

$$1 - \frac{f_m^{(x)}}{N} = 1 - \sum_{n=1}^{N} \frac{1}{N} I\left[\hat{h}_t(x) = h_n(x)\right] \to 1 - \Phi\left(\frac{a(x, \hat{w}_t)}{\sigma}\right)$$

in probability. This is because $f_m^{(x)}$ is the frequency of the mode class with probability tending to 1 as $N \to \infty$. By the smoothness of $\Phi$, $1 - f_m^{(x)}/N \to 1 - \Phi(\infty) = 0$ as $\sigma^2 \to 0$ and $1 - f_m^{(x)}/N \to 1 - \Phi(0) = 1/2$ as $\sigma^2 \to \infty$.

Next, in the second part,

$$P[E_2] = P\left[\sigma x^T e_n < -|x^T \hat{w}_t|\right] = P\left[\sigma \|x\| Z < -|x^T \hat{w}_t|\right] = \Phi\left(-\frac{a(x, \hat{w}_t)}{\sigma}\right).$$

Consequently, $P[E_2] < 1/2$. Hence,

$$\frac{f_m^{(x)}}{N} = \sum_{n=1}^{N} \frac{1}{N} I\left[\hat{h}_t(x) = h_n(x)\right]$$

goes to a value greater than 1/2 in probability as $N \to \infty$ because $\mathrm{Var}(f_m^{(x)}/N) \to 0$ as $N \to \infty$. Therefore, as $N \to \infty$, $\forall x$, the variation ratio is

$$1 - \frac{f_m^{(x)}}{N} = 1 - \sum_{n=1}^{N} \frac{1}{N} I\left[\hat{h}_t(x) = h_n(x)\right] \to \Phi\left(-\frac{a(x, \hat{w}_t)}{\sigma}\right) = 1 - \Phi\left(\frac{a(x, \hat{w}_t)}{\sigma}\right)$$

in probability. This is because $f_m^{(x)}$ is the frequency of the mode class with probability tending to 1 as $N \to \infty$. By the smoothness of $\Phi$, $1 - f_m^{(x)}/N \to 1 - \Phi(\infty) = 0$ as $\sigma^2 \to 0$ and $1 - f_m^{(x)}/N \to 1 - \Phi(0) = 1/2$ as $\sigma^2 \to \infty$. This completes the proof.

B EXPERIMENTAL SETTINGS

All experiments are run for a fixed number of acquisition steps until a certain amount of training data is labeled. Results are averaged over 5 repetitions. For all datasets, the initial labeled samples for each repetition are randomly sampled according to the distribution of the training set. For MC dropout we use 100 forward passes, and the ensemble consists of 5 networks of identical architecture but different random initialization and random batches. For LPDR, we set $\sigma_0 = 0.01$, $\beta = 1$ and $N = 100$, and parameter sampling is applied to the last dense layer of each network.

log(σ) with respect to the active learning progress. For all experiments, the variance of sampling increases as the labeling proceeds.
This is because a larger variance is required to attain $\rho_n = \rho^*$, since unlabeled samples move away from the decision boundary learned from the labeled samples due to an increase in network confidence as the number of labeled samples increases.

Figure 9: $\rho_n$ and $\sigma$ with respect to the labeling progress for all experimental settings. LPDR reliably guides $\rho_n$ to the target value by increasing the variance of sampling as the number of labeled samples increases.
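The way the update rule $\sigma_{n+1} = \sigma_n e^{-\beta(\rho_n - \rho^*)}$ steers $\rho_n$ toward the target can be seen on a toy model. In the sketch below, `toy_rho` is a hypothetical monotone stand-in for the empirical hypothesis distance (it is not the behavior of any network in the paper), and the number of iterations is arbitrary; $\sigma_0 = 0.01$, $\beta = 1$, $q = 20$ and $m = 1000$ follow the settings reported in the appendices.

```python
import math

def update_sigma(sigma, rho_n, rho_target, beta=1.0):
    """sigma_{n+1} = sigma_n * exp(-beta * (rho_n - rho_target))."""
    return sigma * math.exp(-beta * (rho_n - rho_target))

def toy_rho(sigma):
    """Hypothetical stand-in for rho_e: increasing in sigma, saturating at 1/2."""
    return 0.5 * sigma / (1.0 + sigma)

q, m = 20, 1000
rho_target = q / m                        # rho* = q/m
sigma = 0.01                              # sigma_0
for n in range(2000):                     # arbitrary number of updates for the toy model
    sigma = update_sigma(sigma, toy_rho(sigma), rho_target, beta=1.0)
final_rho = toy_rho(sigma)
print(f"sigma={sigma:.5f}, rho={final_rho:.5f}, target={rho_target}")
```

When $\rho_n$ is below the target the exponent is positive and $\sigma$ grows, and when it is above the target $\sigma$ shrinks, so the iteration settles where the toy distance equals $\rho^*$.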

E FINAL TEST ACCURACY VS TARGET $\rho_n$

Figure 10 shows the final test accuracy with respect to the target $\rho_n$ on the MNIST dataset. The results show that LPDR performs best at around $\rho^*$ ($= 0.02$) for $q = 20$ and $m = 1000$. In addition, the range of target $\rho_n$ associated with the best performance is wide (0.01 to 0.1); thus, LPDR is robust against the target $\rho_n$ over a wide range.

F ROBUSTNESS OF LPDR AGAINST HYPERPARAMETERS

LPDR has four hyperparameters: 1) the initial variance of sampling $\sigma_0$; 2) the positive hyperparameter for regulating the variance of sampling $\beta$; 3) the number of sampled hypotheses $N$; and 4) the layer index of the network to which sampling is applied. $\sigma_0$ has no significant effect on the performance of LPDR since $\sigma$ is adaptively regulated based on $\rho_n$ while the hypotheses are sampled. Thus, $\sigma_0$ is not examined in detail. Figure 11 shows the performance comparison with respect to the hyperparameters of LPDR on the MNIST and CIFAR10 datasets. The left figures show that there is no significant difference in the performance of LPDR for various $\beta \in \{0.1, 1, 10\}$ on both datasets. The robustness of LPDR against $\beta$ is based on the sufficient buffer for regulating $\sigma$, since the range of target $\rho_n$ associated with the best performance is wide. The middle figures show that there is no significant difference in the performance of LPDR for various $N \in \{5, 10, 20, 50, 100, 200\}$ on both datasets. The robustness of LPDR against $N$ is based on the sufficient discrimination in the variation ratio for identifying the $q$ most informative unlabeled samples with a small number of sampled hypotheses by setting $\rho^* = q/m$. The right figures show that there is no significant difference in the performance of LPDR between sampling the parameters of the last layer and sampling the parameters of all layers of the networks on both datasets.

Figure 12: The empirical errors of the learned and the sampled hypotheses with respect to the acquisition step for all experimental settings. It is observed that the empirical error of the learned hypothesis or the sampled hypothesis is not zero.

G EMPIRICAL ERRORS OF LEARNED AND SAMPLED HYPOTHESES

Figure 12 shows the empirical error of the learned and the sampled hypotheses with respect to the acquisition step for all experimental settings. In many cases, the empirical error of the learned hypothesis becomes zero, thus it is placed in the version space, while the sampled hypothesis is often placed outside the version space, e.g., in SVHN dataset. Even in the cases of EMNIST and CIFAR100 with K-CNN datasets, as the number of labeled samples increases, the empirical error of the learned and the sampled hypothesis increases. To address this situation, LPDR incorporates the weighted hypotheses based on the prediction error difference between the learned and the sampled hypotheses, and it works well empirically.

H PLOTS FOR TEST ACCURACY

Figure 13 shows the test accuracy with respect to the number of labeled samples from initial to final step for all experimental settings.

I LPDR VS MC-BATCHBALD

Figure 14 shows the performance comparison between LPDR and MC-BALD on the MNIST dataset using S-CNN when the query size is 1 or 20. LPDR significantly outperforms MC-BatchBALD on the MNIST dataset when $q = 1$, in which case MC-BatchBALD is completely identical to MC-BALD. LPDR is also expected to outperform MC-BatchBALD even when $q > 1$: LPDR with $q > 1$ performs better than MC-BALD with $q = 1$, which MC-BatchBALD with $q > 1$ does not exceed (Kirsch et al., 2019).

Figure 14: The comparison of performance between LPDR and MC-BALD on the MNIST dataset where the query size is 1 or 20. The performance of BatchBALD with $q > 1$ does not exceed that of MC-BALD ($q = 1$), and LPDR ($q = 20$) outperforms MC-BALD ($q = 1$).

J AUC OF HAM10000 DATASET

On an imbalanced dataset, the performance comparison is performed not only for accuracy but also for AUC. Figure 15 shows the AUC with respect to the number of labeled samples on the HAM10000 dataset. LPDR performs comparably to Entropy and ENS-VarR, which perform better than the other algorithms.

Rank curves and Dolan-More curves are used to compare the performance of the algorithms across all experimental settings and repetitions. Figure 16 shows the rank and Dolan-More curves for all



Figure 1: An example of LPDR between a sample $x = x_0$ and a hypothesis $\hat{h} = h_a$ in binary classification using $h_\theta(x) = I[x > \theta]$ on input $x \sim U[0, 1]$.

Figure 1 shows an example of LPDR. Define $\mathcal{H} = \{h_\theta : h_\theta(x) = I[x > \theta]\}$ on input $x$ sampled from the uniform distribution $\mathcal{D} = U[0, 1]$, where $I[\cdot]$ is an indicator function. Suppose $x = x_0$ and $\hat{h} = h_a \in \mathcal{H}$ with $a < x_0$. Here, $\mathcal{H}(x_0, h_a)$ consists of all hypotheses whose prediction on $x_0$ is in disagreement with $h_a(x_0) = 1$, i.e., $\mathcal{H}(x_0, h_a) = \{h_b \in \mathcal{H} : h_b(x_0) = 0\} = \{h_b \in \mathcal{H} : b > x_0\}$. Then, the LPDR between $x_0$ and $h_a$ is $d(x_0, h_a) = x_0 - a$, as the infimum of the distance between $h_a$ and $h_b \in \mathcal{H}(x_0, h_a)$ is $\rho(h_a, h_{x_0}) = x_0 - a$.
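The threshold example can be checked numerically by estimating the disagreement probability on uniform samples and minimizing it over candidate thresholds that disagree with $h_a$ on $x_0$. The sketch below is only an illustration: the values $x_0 = 0.7$ and $a = 0.4$, the grid of thresholds and the number of Monte Carlo samples are hypothetical choices.

```python
import random

def h(theta, x):
    """Threshold hypothesis h_theta(x) = I[x > theta]."""
    return 1 if x > theta else 0

def rho(theta_a, theta_b, samples):
    """Monte Carlo estimate of P_D[h_a(X) != h_b(X)] on the given samples."""
    disagreements = sum(h(theta_a, x) != h(theta_b, x) for x in samples)
    return disagreements / len(samples)

def lpdr(x0, a, thetas, samples):
    """Smallest estimated rho(h_a, h_b) over thresholds b that disagree with h_a on x0."""
    candidates = [b for b in thetas if h(b, x0) != h(a, x0)]
    return min(rho(a, b, samples) for b in candidates)

rng = random.Random(0)
samples = [rng.random() for _ in range(20000)]   # X ~ U[0, 1]
thetas = [k / 100 for k in range(101)]           # hypothetical grid of thresholds
x0, a = 0.7, 0.4
estimate = lpdr(x0, a, thetas, samples)
print(f"estimated d(x0, h_a) = {estimate:.3f}, x0 - a = {x0 - a:.3f}")
```

Up to Monte Carlo error, the estimate should be close to $x_0 - a = 0.3$, in agreement with the closed-form value of the example.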

Figure 2: Empirical validation of Assumption 1. Left figure: Relationship between approximated hypothesis distance and σ at step t = 0. Hypothesis distance is almost linearly proportional to log(σ) in the ascension. Right figure: Relationship between variation ratio and σ (MNIST). Sample distance to the decision boundary can be expressed as σ at which the variation ratio is not zero for the first time (white arrow). The unlabeled samples are ordered in terms of LPDR.

Figure 3: Empirical validation of Assumption 2. Left figure: Spearman's rank correlation coefficient between LPDR and the variation ratio in terms of σ showing that there exists a σ such that LPDR and the variation ratio have a strong rank correlation. Right figure: An example of strong negative correlation between both ranks when log(σ) = -5.0. Samples with increasing LPDR or variation ratio are ranked from high to low.

Figure 4: The performance comparison of LPDR with the uncertainty based active learning algorithms on the MNIST, CIFAR10, SVHN and EMNIST datasets (Random: random sampling, Entropy: entropy based uncertainty sampling, MC-BALD: MC dropout sampling with BALD, MC-VarR: MC dropout sampling with variation ratio, ENS-VarR: ensemble network with variation ratio). Overall, LPDR consistently either performs best or comparably to all other algorithms regardless of dataset. The performance of all algorithms except LPDR tends to be data dependent.

Figure 6: The performance comparison on Tiny ImageNet and Food101 datasets with WRN-16-8. LPDR outperforms all other algorithms in more difficult tasks.

Figure 10: The final accuracy with respect to the target ρ n on MNIST dataset. LPDR performs best in a wide range of the target ρ n .

Figure 11: The performance comparison with respect to the hyperparameters of LPDR on MNIST and CIFAR10 datasets. LPDR is robust against β and N, and shows no significant performance difference whether the sampling is applied to the parameters of the last layer or of all layers.

Figure 13: The test accuracy with respect to the number of labeled samples from initial to final step for all experimental settings.

Figure 15: The comparison of AUC on HAM10000 dataset. LPDR performs comparably to the best performing algorithms.

Experimental settings for comparing the performance on various datasets are summarized. Epochs is the maximum number of training epochs. Data size denotes the sizes of datasets for training / validation / test. Acquisition size denotes the number of samples for the initial model + number of samples acquired in each step (from the number of samples in the pool data) → Maximum number of samples acquired during training.

B.1 DATASETS

MNIST (LeCun et al., 1998) is a dataset of handwritten digits which has a training set of 60,000 samples and a test set of 10,000 samples in 10 classes. Each sample is a black and white image and 28 × 28 in size.

CIFAR10 and CIFAR100 (Krizhevsky et al., 2009) are labeled subsets of the 80 million tiny images dataset which have a training set of 50,000 samples and a test set of 10,000 samples in 10 and 100 classes respectively. Each sample is a color image and 32 × 32 in size.

SVHN (Netzer et al., 2019) is a real-world digit image dataset which has a training set of 73,257 samples and a test set of 26,032 samples in 10 classes. Each sample is a color image and 32 × 32 in size.

EMNIST (Cohen et al., 2017) is a dataset of handwritten character digits which has a training set of 80,000 samples and a test set of 10,000 samples in 47 classes. Each sample is a black and white image and 28 × 28 in size.

Tiny ImageNet is a subset of the ILSVRC (Russakovsky et al., 2015) dataset which has 100,000 samples in 200 classes. Each sample is a color image and 64 × 64 in size. In experiments, Tiny ImageNet is split into two parts: a training set of 90,000 samples and a test set of 10,000 samples.

Food101 (Bossard et al., 2014) is a fine grained dataset which has a training set of 75,750 samples and a test set of 25,250 samples in 101 classes. Each sample is a color image and resized to 75 × 75.

HAM10000 (Tschandl et al., 2018) is an imbalanced dataset which has 10,015 samples in 7 classes. Each sample is a color image and resized to 75 × 75. In experiments, HAM10000 is split into two parts: a training set of 8,515 samples and a test set of 1,500 samples.

The optimizer, initial learning rate, learning rate schedule and batch size for each experimental setting are described in Table 2. He normal initialization is used for all models.

The mean (± standard deviation) of performance gap from the best competitor for all steps of each algorithm on each dataset. LPDR significantly outperforms the other algorithms on all datasets.

APPENDIX A PROOF OF THEOREM 1

Assume that ‖ŵ_t‖ = 1 without loss of generality, and x ≠ 0 to avoid the null case. The predicted label of x by w_n disagrees with that by ŵ_t if sgn(x^T w_n) ≠ sgn(x^T ŵ_t), where sgn(0) = 1. Note that x^T w_n = x^T ŵ_t + σ x^T e_n where e_n = (Z_n1, ..., Z_n|w|)^T,

C RANK CORRELATION BETWEEN LPDR AND VARIATION RATIO

Figure 8 shows an example of negative Spearman's rank correlation between LPDR and the variation ratio for each experimental setting. Samples with increasing LPDR or variation ratio are ranked from high to low. The σ is selected to satisfy ρ_n = ρ* = q/m at the initial step.

where n_p is the total number of evaluations for the problem p. Thus, R_a(τ) is the ratio of problems for which the performance gap between algorithm a and the best performing competitor is not more than τ. Note that R_a(0) is the ratio of problems on which algorithm a performs the best. LPDR has the highest value R_a(0) = 43.3%, and LPDR maintains the highest R_a(τ) for all τ.

Table 3 presents the mean and the standard deviation of the performance gap from the best competitor for all steps of each algorithm on each dataset. Consistent with all the results so far, LPDR significantly outperforms the other algorithms in all experimental settings.
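
The profile R_a(τ) described above can be computed along the following lines. This is a hedged sketch: the exact formula, including the role of n_p, is not recoverable from the text here, so the code assumes the gap is the best competitor's accuracy minus the accuracy of algorithm a on each problem, with ties counted as performing best. The function name, algorithm labels, and accuracy values are made up for illustration and are not results from the paper.

```python
import numpy as np

def performance_profile(acc, algo, taus):
    """Share of problems on which `algo` trails its best competitor by at most tau.

    acc: dict mapping algorithm name -> accuracies, one entry per problem.
    """
    others = np.max([v for k, v in acc.items() if k != algo], axis=0)
    gap = others - np.asarray(acc[algo])   # negative when algo is strictly best
    return np.array([np.mean(gap <= tau) for tau in taus])

# Hypothetical accuracies on four problems (illustration only).
acc = {
    "algo_a": np.array([0.90, 0.80, 0.70, 0.60]),
    "algo_b": np.array([0.85, 0.82, 0.70, 0.65]),
    "algo_c": np.array([0.80, 0.81, 0.60, 0.50]),
}
taus = [0.0, 0.03, 0.1]
profile = performance_profile(acc, "algo_a", taus)
print(profile)  # one ratio per tau, non-decreasing in tau
```

Because the gap is measured against the best of the other algorithms, the first entry of the profile matches the description of R_a(0) as the share of problems on which the algorithm performs best.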

