ACTIVE LEARNING IN BAYESIAN NEURAL NETWORKS WITH BALANCED ENTROPY LEARNING PRINCIPLE

Abstract

Acquiring labeled data is challenging in many machine learning applications with limited budgets. Active learning provides a procedure for selecting the most informative data points, improving data efficiency by reducing the cost of labeling. The infomax learning principle of maximizing mutual information, exemplified by BALD, has been successful and widely adopted in various active learning applications. However, this pool-based objective inherently introduces redundant selection and further incurs a high computational cost for batch selection. In this paper, we design and propose a new uncertainty measure, Balanced Entropy Acquisition (BalEntAcq), which captures the information balance between the uncertainty of the underlying softmax probability and the label variable. To do this, we approximate each marginal distribution by a Beta distribution. The Beta approximation enables us to formulate BalEntAcq as a ratio between an augmented entropy and the marginalized joint entropy. The closed-form expression of BalEntAcq facilitates parallelization, requiring only the estimation of the two parameters of each marginal Beta distribution. BalEntAcq is a purely standalone measure that requires no relational computation with other data points. Nevertheless, BalEntAcq captures a well-diversified selection near the decision boundary with a margin, unlike other existing uncertainty measures such as BALD, Entropy, or Mean Standard Deviation (MeanSD). Finally, we demonstrate that our balanced entropy learning principle with BalEntAcq consistently outperforms well-known linearly scalable active learning methods, including the recently proposed PowerBALD, a simple but diversified version of BALD, through experimental results on the MNIST, CIFAR-100, SVHN, and TinyImageNet datasets.

1. INTRODUCTION

Acquiring labeled data is challenging in many machine learning applications with limited budgets. As the datasets needed to train complex models grow, human labeling becomes increasingly expensive. Active learning provides a procedure for selecting the most informative data points, improving data efficiency by reducing the cost of labeling. The active learning problem is closely aligned with the subset selection problem of finding the most efficient, minimal subset of the data pool (Hochbaum, 1996; Nemhauser et al., 1978; Dvoretzky, 1961; Milman, 1971; Spielman & Teng, 2014; Spielman & Woo, 2009; Batson et al., 2009; Spielman & Srivastava, 2011). The difference is that active learning is typically an iterative process in which a model is trained and a collection of data points is selected for labeling from an unlabeled data pool. It is well known that, in general, no active learning method can achieve better label complexity than passive learning (random acquisition) (Vapnik & Chervonenkis, 1974; Kääriäinen, 2006; Castro & Nowak, 2008). Under certain conditions on labels or models, exponential savings are possible (Balcan et al., 2007; Hanneke, 2007; Dasgupta et al., 2005; Hsu, 2010; Dekel et al., 2012; Hanneke, 2014; Zhang & Chaudhuri, 2014; Krishnamurthy et al., 2017; Shekhar et al., 2021; Puchkin & Zhivotovskiy, 2021). Zhu & Nowak (2022b; a) recently proposed an active learning algorithm with abstention that is provably exponentially efficient with high probability, but it is limited to binary classification. On the other hand, although numerous practically successful active learning methods have been proposed, no algorithm has proven efficient and linearly scalable enough to guarantee exponential label savings in general. Therefore, significantly improving data efficiency remains theoretically challenging but important.

It is now commonly accepted that standard deep learning models do not capture model uncertainty correctly.
Simple predictive probabilities are often erroneously interpreted as model confidence (Hein et al., 2019), so there is a risk that a model misdirects its outputs with high confidence. However, the predictive distribution generated by Bayesian deep learning models better captures the uncertainty in the data (Gal & Ghahramani, 2016; Kristiadi et al., 2020; Mukhoti et al., 2021; Daxberger et al., 2021). Therefore, we focus on developing an active learning framework for Bayesian deep neural networks by leveraging the Monte-Carlo (MC) dropout method as a proxy for a Gaussian process (Gal & Ghahramani, 2016), which facilitates further analysis.

1.1 OUR CONTRIBUTIONS

Our proposed active learning method is well aligned with Bayesian experimental design (Verdinelli & Kadane, 1992; Cohn et al., 1996; Sebastiani & Wynn, 2000; Malinin & Gales, 2018; Foster et al., 2019), under the assumption that the forward active learning iterative process follows the Bayesian prior-posterior framework. Our approach also aligns with Bayesian uncertainty quantification methods (Houlsby et al., 2011; Kandasamy et al., 2015; Kampffmeyer et al., 2016; Gal & Ghahramani, 2016; Alex Kendall & Cipolla, 2017; Gal et al., 2017; Kirsch et al., 2019; Mukhoti et al., 2021; Kirsch et al., 2021), under the assumption that the working neural network model is a Bayesian network (Koller & Friedman, 2009). In this paper, we extend and improve recent advances in both Bayesian experimental design and Bayesian uncertainty quantification. We investigate a generalized notion of the joint entropy between the model parameters and the predictive outputs by leveraging point process entropy (McFadden, 1965; Fritz, 1973; Papangelou, 1978; Daley & Vere-Jones, 2007; Baccelli & Woo, 2016).
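For reference, the standard MC-dropout estimates of the classical uncertainty measures mentioned here (predictive entropy, BALD's mutual information, and MeanSD) can be sketched as follows. This is a minimal illustration, not the paper's method: `acquisition_scores` is a hypothetical helper name, and Dirichlet draws stand in for actual MC-dropout softmax samples.

```python
import numpy as np

def acquisition_scores(probs, eps=1e-12):
    """Standard uncertainty measures from MC-dropout samples.

    probs: array of shape (K, C), K posterior samples of the softmax output.
    Returns predictive entropy H[y|x,D], BALD I[y; omega | x, D], and MeanSD.
    """
    mean = probs.mean(axis=0)                                        # E_omega[P(y|x,omega)]
    pred_entropy = -(mean * np.log(mean + eps)).sum()                # H[y|x,D]
    exp_entropy = -(probs * np.log(probs + eps)).sum(axis=1).mean()  # E_omega H[y|x,omega]
    bald = pred_entropy - exp_entropy                                # mutual information
    mean_sd = probs.std(axis=0).mean()                               # MeanSD
    return pred_entropy, bald, mean_sd

rng = np.random.default_rng(1)
# Hypothetical stand-in for K=100 MC-dropout softmax samples over C=3 classes.
samples = rng.dirichlet([2.0, 2.0, 1.0], size=100)
h, mi, sd = acquisition_scores(samples)
```

By Jensen's inequality, the predictive entropy always upper-bounds the BALD score, so `h >= mi >= 0` up to floating-point error.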
By approximating the marginals with Beta distributions, we derive an explicit formula for the marginalized joint entropy in terms of Beta parameters estimated from a Bayesian deep learning model. From the Bayesian experimental design perspective, we revisit the well-known entropy and mutual information measures under the expected cross-entropy loss, and show that these well-known acquisition measures are functions of the marginal distributions through analytical formulas. We then propose a new uncertainty measure, Balanced Entropy Acquisition (BalEntAcq), which captures the information balance between the uncertainty of the underlying softmax probability and the label variable. Finally, we demonstrate that our balanced entropy learning principle with BalEntAcq consistently outperforms well-known linearly scalable active learning methods, including the recently proposed PowerBALD (Kirsch et al., 2021), which mitigates the redundant selection in BALD (Gal et al., 2017), through experimental results on the MNIST, CIFAR-100, SVHN, and TinyImageNet datasets.
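To give a concrete sense of estimating the two parameters of a marginal Beta distribution, one simple estimator is moment matching on the MC samples of a single class probability. This is only an illustrative sketch: `beta_moment_match` is a hypothetical helper, and the paper's actual estimator may differ.

```python
import numpy as np

def beta_moment_match(p_samples):
    """Fit Beta(alpha, beta) to samples of one marginal softmax probability
    by matching the first two moments (an illustrative stand-in for the
    per-class marginal Beta approximation).

    For Beta(a, b): mean m = a/(a+b) and variance v = m(1-m)/(a+b+1),
    so nu := m(1-m)/v - 1 recovers a+b, giving alpha = m*nu, beta = (1-m)*nu.
    """
    m = p_samples.mean()
    v = p_samples.var()
    nu = m * (1.0 - m) / v - 1.0
    return m * nu, (1.0 - m) * nu

rng = np.random.default_rng(2)
true_a, true_b = 3.0, 7.0
samples = rng.beta(true_a, true_b, size=50_000)
a_hat, b_hat = beta_moment_match(samples)
```

With enough MC samples the estimator recovers the underlying parameters closely, and since it needs only a mean and a variance per class, it parallelizes trivially across data points and classes.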



Code is available at https://github.com/jaeohwoo/BalancedEntropy.



2. PROBLEM FORMULATION

We write $\mathcal{D}_{\mathrm{pool}}$ for the unlabeled data pool and $\mathcal{D}_{\mathrm{training}} \subseteq \mathcal{D}_{\mathrm{pool}}$ for the labeled training set at each active learning iteration, writing $\mathcal{D}^{(n)}_{\mathrm{training}}$ when the specific $n$-th iteration step must be indicated. Given $\mathcal{D}_{\mathrm{training}}$, we train a Bayesian deep neural network model $\Phi$ with model parameters $\omega \sim p(\omega)$. Then, for a data point $x$ given $\mathcal{D}_{\mathrm{training}}$, the Bayesian deep neural network $\Phi$ produces the prediction probability

$$\Phi(x, \omega) := (P_1(x, \omega), \ldots, P_C(x, \omega)) \in \Delta^C,$$

where $\Delta^C = \{(p_1, \ldots, p_C) : p_1 + \cdots + p_C = 1,\; p_i \geq 0 \text{ for each } i\}$ and $C$ is the number of classes. The final class output $Y$ is assumed to follow a multinoulli (categorical) distribution:

$$Y(x, \omega) := \begin{cases} 1 & \text{with probability } P_1(x, \omega), \\ \;\vdots & \\ C & \text{with probability } P_C(x, \omega). \end{cases}$$
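The sampling of $\Phi(x, \omega)$ over $\omega \sim p(\omega)$ via MC dropout can be sketched as below. This is a toy stand-in, assuming a two-layer network with inverted dropout kept active at inference; the weights, layer sizes, and function names are hypothetical, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy weights: 4 inputs, 8 hidden units, C = 3 classes.
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 3))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mc_dropout_forward(x, k=100, p_drop=0.5):
    """Run k stochastic forward passes with dropout active at inference.

    Each pass corresponds to one draw omega ~ p(omega); the returned array
    of shape (k, C) collects the softmax probabilities Phi(x, omega).
    """
    probs = []
    for _ in range(k):
        h = np.maximum(x @ W1, 0.0)          # ReLU hidden layer
        mask = rng.random(h.shape) > p_drop  # Bernoulli dropout mask
        h = h * mask / (1.0 - p_drop)        # inverted-dropout scaling
        probs.append(softmax(h @ W2))
    return np.stack(probs)

x = rng.normal(size=4)
samples = mc_dropout_forward(x, k=200)  # shape (200, 3), rows in the simplex
```

Each row of `samples` lies in the simplex $\Delta^C$, and the empirical per-class marginals of these rows are what the Beta approximation later summarizes.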

