UNCERTAINTY-AWARE ACTIVE LEARNING FOR OPTIMAL BAYESIAN CLASSIFIER

Abstract

For pool-based active learning, in each iteration a candidate training sample is chosen for labeling by optimizing an acquisition function. In Bayesian classification, Expected Loss Reduction (ELR) methods maximize the expected reduction in the classification error given a new labeled candidate, based on a one-step-lookahead strategy. ELR is the optimal strategy with a single query; however, since such myopic strategies cannot identify the long-term effect of a query on the classification error, ELR may get stuck before reaching the optimal classifier. In this paper, inspired by the mean objective cost of uncertainty (MOCU), a metric quantifying the uncertainty that directly affects the classification error, we propose an acquisition function based on a weighted form of MOCU. Similar to ELR, the proposed method focuses on reducing the uncertainty that pertains to the classification error. But unlike any existing scheme, it provides the critical advantage that the resulting Bayesian active learning algorithm is guaranteed to converge to the optimal classifier of the true model. We demonstrate its performance with both synthetic and real-world datasets.

1. INTRODUCTION

In supervised learning, labeling data is often expensive and time-consuming. Active learning is a field of research that aims to address this problem and has been shown to enable sample-efficient learning with fewer labeled samples (Gal et al., 2017; Tran et al., 2019; Sinha et al., 2019). In this paper, we focus on pool-based Bayesian active learning for classification with the 0-1 loss function. Bayesian active learning starts from prior knowledge of the uncertain models. By optimizing an acquisition function, it chooses the next candidate training sample to query for labeling, and then, based on the acquired data, updates the belief over uncertain models through Bayes' rule to approach the optimal classifier of the true model, which minimizes the classification error. In active learning, maximizing the performance of the model trained on the queried candidates is the ultimate objective. However, most existing methods do not directly target this learning objective. For example, Maximum Entropy Sampling (MES), also known as Uncertainty Sampling, simply queries the candidate with the maximum predictive entropy (Lewis & Gale, 1994; Sebastiani & Wynn, 2000; Mussmann & Liang, 2018); but the method fails to differentiate between the model uncertainty and the observation uncertainty. Bayesian Active Learning by Disagreement (BALD) seeks the data point that maximizes the mutual information between the observation and the model parameters (Houlsby et al., 2011; Kirsch et al., 2019). Besides BALD, there are other methods that reduce the model uncertainty in different forms (Golovin et al., 2010; Cuong et al., 2013). However, not all model uncertainty affects the performance of the learning task of interest. Without identifying whether the uncertainty is related to the classification error, these methods can be inefficient in the sense that they may query candidates that do not directly help improve prediction performance.
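The distinction between total predictive uncertainty (targeted by MES) and model uncertainty (targeted by BALD) can be made concrete with a small numerical sketch. The setup below is purely illustrative, not taken from the paper's experiments: we assume a finite set of three model hypotheses with made-up posterior weights, and two candidate points, one where all hypotheses agree and one where they disagree.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a categorical distribution (natural log), last axis."""
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

# Illustrative posterior over 3 model hypotheses (weights sum to 1).
w = np.array([0.5, 0.3, 0.2])
# p(y=1 | x, theta_k) for two candidate points under each hypothesis.
p_y1 = np.array([
    [0.9, 0.8, 0.95],   # x_a: all hypotheses agree that y=1 is likely
    [0.2, 0.8, 0.5],    # x_b: hypotheses disagree about the optimal label
])
p = np.stack([1 - p_y1, p_y1], axis=-1)   # shape (2 points, 3 models, 2 classes)

marginal = np.einsum('k,nkc->nc', w, p)   # E_theta[p(y|x,theta)]
mes = entropy(marginal)                   # MES: total predictive entropy
# BALD: mutual information I[y; theta | x] = H[E_theta p] - E_theta H[p]
bald = mes - np.einsum('k,nk->n', w, entropy(p))

print("MES :", mes)    # high wherever the marginal is near uniform
print("BALD:", bald)   # high only where the hypotheses disagree
```

At x_a the hypotheses all place y = 1 well above 0.5, so BALD is near zero even though the predictive entropy is nonzero; at x_b the disagreement makes BALD large. As the paper notes, neither score distinguishes whether that disagreement actually changes the classification decision.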
In this paper, we focus on active learning methods that directly maximize the learning model's performance. Expected Loss Reduction (ELR) methods aim to maximize the expected reduction in loss in a one-step-lookahead manner (Roy & McCallum, 2001; Zhu et al., 2003; Kapoor et al., 2007). ELR methods can focus only on the uncertainty related to the loss function to achieve sample-efficient learning. In fact, ELR is the optimal strategy for active learning with a single query (Roy & McCallum, 2001). However, a critical shortcoming of previous ELR schemes is that none of them provide any theoretical guarantee regarding their long-term performance. Since these methods are myopic and cannot identify the long-term effect of a query on the loss function, without special design of the loss function they may get stuck before reaching the optimal classifier. To the best of our knowledge, there is currently no method that directly maximizes the model performance while simultaneously guaranteeing convergence to the optimal classifier.

Fig. 1a provides an example of binary classification with one feature where both BALD and ELR methods fail. In the figure, the red lines indicate the upper and lower bounds of the predictive probability of class 1, illustrating a model with higher probability uncertainty on the sides (x → ±4) than in the middle (x = 0). Querying candidates on the sides provides more information about the model parameters, and is therefore preferred by BALD. However, since the possible probabilities on the sides lie entirely above or entirely below 0.5, querying candidates there will not help reduce the classification error. On the other hand, ELR queries the candidates that help reduce the classification error the most, so it prefers data in the middle, whose optimal labels are uncertain given the prior knowledge. The performance shown in Fig. 1b agrees with our analysis. Fig. 1b shows the performance averaged over 1000 runs, with more details and discussion of the example included in Appendix C. BALD performs inefficiently at the beginning by querying points on both sides. The ELR method performs best at the beginning, but becomes inefficient after some iterations (∼100), indicating that some of its runs get stuck before reaching the optimal classifier. In this paper, we consider the algorithm to "get stuck" when the acquisition function value is 0 for all candidates in the pool and the algorithm degenerates to uniform random sampling.

In this paper, we analyze why ELR methods may get stuck before reaching the optimal classifier, and propose a new strategy to solve this problem. Our contributions are fourfold:
1. We show that ELR methods may get stuck, preventing active learning from reaching the optimal classifier efficiently.
2. We propose a novel weighted-MOCU active learning method that focuses only on the uncertainty related to the loss for efficient active learning and is guaranteed to converge to the optimal classifier of the true model.
3. We provide the convergence proof of the weighted-MOCU method.
4. We demonstrate the sample-efficiency of our weighted-MOCU method with both synthetic and real-world datasets.
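The one-step-lookahead ELR computation described above can be sketched on a finite hypothesis space. This is a minimal illustration under assumed inputs (the weights and probabilities below are made up, and the classification error is averaged over a tiny pool); it is not the paper's weighted-MOCU method. For each candidate and each possible label, the posterior weights are updated by Bayes' rule and the resulting change in the optimal Bayesian classifier's expected error is averaged over the label's predictive probability.

```python
import numpy as np

def obc_error(w, p):
    """Expected 0-1 error of the optimal Bayesian classifier over a pool.

    w: posterior weights over K model hypotheses, shape (K,)
    p: p(y=1 | x, theta_k) for each pool point, shape (N, K)
    """
    marg = p @ w                              # marginal p(y=1|x), shape (N,)
    return np.mean(np.minimum(marg, 1 - marg))

def elr_acquisition(w, p):
    """One-step-lookahead expected error reduction for each candidate."""
    base = obc_error(w, p)
    scores = np.zeros(p.shape[0])
    for i in range(p.shape[0]):
        for y in (0, 1):
            p_y = p[i] if y == 1 else 1 - p[i]   # per-hypothesis likelihood of y
            evidence = w @ p_y                    # predictive probability of y
            if evidence <= 0:
                continue
            w_post = w * p_y / evidence           # Bayes update of the weights
            scores[i] += evidence * (base - obc_error(w_post, p))
    return scores

# Illustrative setup: 3 hypotheses, 4 pool points (all numbers made up).
w = np.array([0.4, 0.4, 0.2])
p = np.array([[0.9, 0.8, 0.7],    # all hypotheses agree: y=1 likely
              [0.3, 0.7, 0.5],    # hypotheses disagree about the optimal label
              [0.1, 0.2, 0.3],    # all agree: y=0 likely
              [0.6, 0.4, 0.5]])   # disagreement straddling 0.5
scores = elr_acquisition(w, p)
print("ELR scores:", scores)
print("query index:", np.argmax(scores))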

2. BACKGROUND

Optimal Bayesian classifier. Consider a classification problem with candidates x ∈ X and class labels y ∈ Y = {0, 1, . . . , M − 1}. The predictive probability p(y|x, θ) is modeled with parameters

Figure 1: (a) Predictive probability of class 1 under uncertainty: the red lines indicate the upper and lower bounds of the predictive probability; the blue dashed line is the mean of the predictive probability; the green dashed line marks where the probability equals 0.5. (b) Active learning performance comparison.

