ACTIVE LEARNING IN BAYESIAN NEURAL NETWORKS WITH BALANCED ENTROPY LEARNING PRINCIPLE

Abstract

Acquiring labeled data is challenging in many machine learning applications with limited budgets. Active learning provides a procedure to select the most informative data points and improve data efficiency by reducing the cost of labeling. The infomax learning principle of maximizing mutual information, such as BALD, has been successful and widely adopted in various active learning applications. However, this pool-based objective inherently introduces redundant selection and further requires a high computational cost for batch selection. In this paper, we design and propose a new uncertainty measure, Balanced Entropy Acquisition (BalEntAcq), which captures the information balance between the uncertainty of the underlying softmax probability and the label variable. To do this, we approximate each marginal distribution by a Beta distribution. The Beta approximation enables us to formulate BalEntAcq as a ratio between an augmented entropy and the marginalized joint entropy. The closed-form expression of BalEntAcq facilitates parallelization, since it only requires estimating the two parameters of each marginal Beta distribution. BalEntAcq is a purely standalone measure that requires no relational computation with other data points. Nevertheless, BalEntAcq captures a well-diversified selection near the decision boundary with a margin, unlike other existing uncertainty measures such as BALD, Entropy, or Mean Standard Deviation (MeanSD). Finally, we demonstrate that our balanced entropy learning principle with BalEntAcq consistently outperforms well-known linearly scalable active learning methods, including the recently proposed PowerBALD, a simple but diversified version of BALD, with experimental results obtained from the MNIST, CIFAR-100, SVHN, and TinyImageNet datasets.

1. INTRODUCTION

Acquiring labeled data is challenging in many machine learning applications with limited budgets. As dataset sizes grow for training complex models, labeling data by humans becomes more expensive. Active learning provides a procedure to select the most informative data points and improve data efficiency by reducing the cost of labeling. The active learning problem is well-aligned with the subset selection problem of finding the most efficient, minimal subset from the data pool (Hochbaum, 1996; Nemhauser et al., 1978; Dvoretzky, 1961; Milman, 1971; Spielman & Teng, 2014; Spielman & Woo, 2009; Batson et al., 2009; Spielman & Srivastava, 2011). The difference is that active learning is typically an iterative process in which a model is trained and a collection of data points is selected to be labeled from an unlabeled data pool. It is well-known that, in general, no active learning method can improve the label complexity beyond passive learning (random acquisition) (Vapnik & Chervonenkis, 1974; Kääriäinen, 2006; Castro & Nowak, 2008). Under some conditions on labels or models, it is possible to achieve exponential savings (Balcan et al., 2007; Hanneke, 2007; Dasgupta et al., 2005; Hsu, 2010; Dekel et al., 2012; Hanneke, 2014; Zhang & Chaudhuri, 2014; Krishnamurthy et al., 2017; Shekhar et al., 2021; Puchkin & Zhivotovskiy, 2021). Zhu & Nowak (2022b; a) recently proposed a provably exponentially efficient active learning algorithm with abstention, with high probability, but limited to binary classification. On the other hand, although numerous practically successful active learning methods have been proposed, no algorithm has been proven efficient and linearly scalable enough to guarantee exponential label savings in general. Therefore, it remains theoretically challenging but important to improve data efficiency significantly. It is now commonly accepted that standard deep learning models do not capture model uncertainty correctly.
The simple predictive probabilities are often erroneously described as model confidence (Hein et al., 2019), so there is a risk that a model misdirects its outputs with high confidence. However, the predictive distribution generated from Bayesian deep learning models better captures the uncertainty in the data (Gal & Ghahramani, 2016; Kristiadi et al., 2020; Mukhoti et al., 2021; Daxberger et al., 2021). Therefore, we focus on developing an active learning framework for Bayesian deep neural network models by leveraging the Monte-Carlo (MC) dropout method as a proxy of the Gaussian process (Gal & Ghahramani, 2016), which facilitates further analysis.

1.1. OUR CONTRIBUTIONS

Our proposed active learning method is well-aligned with Bayesian experimental design (Verdinelli & Kadane, 1992; Cohn et al., 1996; Sebastiani & Wynn, 2000; Malinin & Gales, 2018; Foster et al., 2019) under the assumption that the forward active learning iterative process follows the Bayesian prior-posterior framework. Furthermore, our approach is also aligned with Bayesian uncertainty quantification methods (Houlsby et al., 2011; Kandasamy et al., 2015; Kampffmeyer et al., 2016; Gal & Ghahramani, 2016; Alex Kendall & Cipolla, 2017; Gal et al., 2017; Kirsch et al., 2019; Mukhoti et al., 2021; Kirsch et al., 2021) under the assumption that the working neural network model is a Bayesian network (Koller & Friedman, 2009). In this paper, we extend and improve recent advances in both Bayesian experimental design and Bayesian uncertainty quantification. We investigate a generalized notion of the joint entropy between the model parameters and the predictive outputs by leveraging the point process entropy (McFadden, 1965; Fritz, 1973; Papangelou, 1978; Daley & Vere-Jones, 2007; Baccelli & Woo, 2016).
By approximating the marginals using Beta distributions, we then derive an explicit formula for the marginalized joint entropy by estimating the Beta parameters from Bayesian deep learning models. As a Bayesian experimental design, we revisit the well-known entropy and mutual information measures in terms of the expected cross-entropy loss. We show that well-known acquisition measures are functions of the marginal distributions through analytical formulas. We propose a new uncertainty measure, Balanced Entropy Acquisition (BalEntAcq), which captures the information balance between the uncertainty of the underlying softmax probability and the label variable. Finally, we demonstrate that our balanced entropy learning principle with BalEntAcq consistently outperforms well-known linearly scalable active learning methods, including the recently proposed PowerBALD (Kirsch et al., 2021), which mitigates the redundant selection in BALD (Gal et al., 2017), with experimental results obtained from the MNIST, CIFAR-100, SVHN, and TinyImageNet datasets.

2. BACKGROUND

For the sake of brevity, we sometimes omit x or ω by writing Φ(ω), P_i(ω), Y(ω) or Φ, P_i, Y unless we need further clarification for a specific data point x. Under this formulation, the oracle (active learning algorithm) selects a subset of data points to add to the next training set, i.e., at the (n+1)-th iteration, the training set is determined by D_training^(n+1) = D_training^(n) ∪ {next training batch from the oracle}. Once the next training batch is selected, it is labeled; this means that the ground-truth label information of the selected data is added to the training set D_training^(n+1) for the next round. The goal of active learning is then to minimize the number of selected data points needed to reach a given level of prediction accuracy.

2.2. EXAMPLES OF UNCERTAINTY BASED ACTIVE LEARNING METHODS

In this section, we list well-known uncertainty measures suitable for Bayesian active learning.

1. Random: Rand[x] := U(ω′), where U(•) is a uniform distribution independent of ω. The random acquisition function assigns a uniform random value on [0, 1] to each data point.

2. BALD (Bayesian active learning by disagreement) (Lindley, 1956; Houlsby et al., 2011; Gal et al., 2017): BALD[x] := I(ω, Y(x, ω)), where I(•, •) denotes the mutual information between random measures. BALD captures the mutual information between the model parameters and the predictive output of the data point. In practice, we calculate the mutual information between the predictive output and the predictive probabilities.

3. Entropy (Shannon, 1948): Ent[x] := -∑_i (E P_i) log(E P_i). Entropy is the Shannon entropy of the expected predictive probability, i.e., the uncertainty of the predictive distribution. Moreover, under the cross-entropy loss, we may also interpret the entropy measure as an expected loss gain, since -log(E P_i) is the cross-entropy loss when the ground-truth label is class i.

4. Mean standard deviation (MeanSD) (Cohn et al., 1996; Kampffmeyer et al., 2016; Alex Kendall & Cipolla, 2017): MeanSD[x] := (1/C) ∑_i √(E[P_i²] - (E P_i)²). MeanSD is the average of the standard deviations of the marginal distributions.

5. PowerBALD (Farquhar et al., 2021; Kirsch et al., 2021): PowerBALD[x] := log BALD[x] + Z, where Z is an independent sample from the Gumbel distribution with location µ = 0 and scale β = 1; see Wikipedia (2023). We use β = 1 as the default choice suggested by Kirsch et al. (2021). The motivation of this randomized acquisition is to mitigate redundant selection by diversifying the selected multi-batch points. In general, we do not know in advance which parameter β is optimal.
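For concreteness, the measures above can all be computed directly from MC-dropout softmax samples. The sketch below is our own illustration (the function name `acquisition_scores` and the sample layout are our assumptions, not the authors' code); it estimates Entropy, BALD, MeanSD, and PowerBALD for a single data point from a (K, C) array of K sampled softmax vectors:

```python
import numpy as np

def acquisition_scores(probs, rng=None):
    """probs: (K, C) array of K MC-dropout softmax samples for one point x.
    Returns the standalone uncertainty measures listed above (illustrative)."""
    if rng is None:
        rng = np.random.default_rng(0)
    mean_p = probs.mean(axis=0)                          # E[P_i]
    ent = -np.sum(mean_p * np.log(mean_p))               # Ent[x]
    cond_ent = -np.sum(probs * np.log(probs), axis=1)    # H(Y | omega) per sample
    bald = ent - cond_ent.mean()                         # I(omega, Y)
    mean_sd = np.mean(np.sqrt((probs ** 2).mean(axis=0) - mean_p ** 2))
    power_bald = np.log(bald) + rng.gumbel(loc=0.0, scale=1.0)  # Gumbel perturbation
    return {"Ent": ent, "BALD": bald, "MeanSD": mean_sd, "PowerBALD": power_bald}
```

By the concavity of entropy, BALD is non-negative and never exceeds Ent, which the measures computed this way respect.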
In a multiple acquisition scenario, we simply add the above uncertainty values for each data point x_i: AcqFunc[x_1, …, x_n] := ∑_{i=1}^n AcqFunc[x_i], where AcqFunc ∈ {Rand, BALD, Ent, MeanSD, PowerBALD}.

Cohn et al. (1996) provided one of the first statistical analyses of active learning, establishing how to synthesize queries that reduce the model's forward-looking error by minimizing its variance, leveraging MacKay's closed-form variance approximation (MacKay, 1992). In this fashion, there exists a line of work in Bayesian experimental design (Chaloner & Verdinelli, 1995; Lindley, 1956; Verdinelli & Kadane, 1992; Cohn et al., 1996; Sebastiani & Wynn, 2000; Roy & McCallum, 2001; Yoon et al., 2013; Vincent & Rainforth, 2017; Foster et al., 2019; 2021; Jha et al., 2022) with the assumption that the forward active learning iterative process follows the Bayesian prior-posterior framework.

On the other hand, in active learning, accommodating both the information uncertainty and the diversification of the acquired samples is essential to improve performance under multi-batch acquisition scenarios. From a theoretical perspective, the most natural way to combine uncertainty and diversification is to leverage reasonable sub-modular functions, e.g., the nearest-neighbor set function (Wei et al., 2015), BatchBALD (Kirsch et al., 2019), determinantal point processes (Bıyık et al., 2019), and SIMILAR (Kothawade et al., 2021) with sub-modular information measures, and then apply a fast linear-time algorithm to find a diversified multi-batch with a provable performance guarantee (Nemhauser & Wolsey, 1978; Nemhauser et al., 1978; Ene & Nguyen, 2017; Yaroslavtsev et al., 2020; Schreiber et al., 2020; Iyer et al., 2021a; b; Li et al., 2022). Although fast linear-time solvers are available for general sub-modular functions, a gap with practical implementation remains, such as high memory requirements, which makes the computation unscalable for identifying multi-batch acquisition points, e.g., in BatchBALD (Kirsch et al., 2019). Similar to sub-modular function optimization, there exist many customized optimization approaches, e.g., CoreSet (Sener & Savarese, 2018) and others (Guo, 2010; Joshi et al., 2010; Elhamifar et al., 2013; Yang et al., 2015; Wang & Ye, 2015). Another recent approach is to look at the parameters of the neural network and diversify points accordingly, such as BADGE (Ash et al., 2020) with gradients and BAIT (Ash et al., 2021) with Fisher information. There also exist network-architecture-focused approaches, such as learning loss by designing loss prediction layers (Yoo & Kweon, 2019), UncertainGCN and CoreGCN (Caramalau et al., 2021) with graph neural networks, and VAAL (Sinha et al., 2019) and TA-VAAL (Kim et al., 2021), which apply adversarial learning methods.

3. BAYESIAN NEURAL NETWORK MODEL

We adopt the Bayesian neural network framework introduced by Gal & Ghahramani (2016). The core idea of the Bayesian neural network is to leverage the MC dropout feature to generate a distribution of the predictive probability as an output at inference time. Under mild assumptions, this turns out to be equivalent to an approximation of a Gaussian process (Rasmussen & Williams, 2006; Neal, 1996; Williams, 1997; Gal & Ghahramani, 2016; Lee et al., 2017).

3.1. SOFTMAX PROBABILITY MARGINAL APPROXIMATELY FOLLOWS BETA DISTRIBUTION

We may consider a Bayesian neural network model Φ as a random measure, i.e., a stochastic process parametrized by D_training over the data set D_pool. Given a data point x ∈ D_pool, Φ(x, ω) produces a random probability distribution in the simplex ∆_C. This analogy has a close connection with the construction of random discrete distributions, originally introduced by Kingman (1975). Since then, random measure construction has been extensively developed in Bayesian nonparametrics, and it is well-known that the Dirichlet probability, which has Beta marginals, plays the central role in the construction of random discrete distributions (Kingman, 1977; Ferguson, 1973; Pitman & Yor, 1997; Pitman et al., 2002; Broderick et al., 2012; Orlitsky et al., 2004; Santhanam et al., 2014). This is the main motivation for the Beta approximation. Many prior works similarly assume a Dirichlet distribution after the softmax in a Bayesian neural network. As illustrated by Milios et al. (2018), we may follow the construction of the Dirichlet distribution. Following the approach of Ferguson (1973), a Dirichlet probability can be constructed from a collection of independent Gamma distributions. On the other hand, each marginal of the Gaussian process (approximated by the Bayesian neural network), after the exponentiation in the softmax but before the normalization, follows a log-normal distribution with dependent components. Then, by the shape similarity between the log-normal and Gamma distributions, the construction of a random probability from log-normal distributions produces an approximate Dirichlet distribution. Therefore we may assume that each marginal distribution approximately follows a Beta distribution. Alternatively, as an analytical approach, the Beta approximation can be justified through the Laplace approximation (MacKay, 1998; Hennig et al., 2012; Hobbhahn et al., 2020; Daxberger et al., 2021).
There exists a mapping between the multivariate Gaussian distribution and the Dirichlet distribution under a softmax basis; the Beta distribution then follows as a marginal distribution of the Dirichlet distribution. Therefore we may assume that a Beta approximation exists through the Laplace approximation, under the assumption that the Bayesian neural network produces a multivariate Gaussian distribution (as a marginalized Gaussian process over a finite-rank covariance function) before the softmax layer (Neal, 1996; Williams, 1997; Gal & Ghahramani, 2016; Lee et al., 2017). In practice, once we estimate the sample mean and sample variance of each marginal of Φ(x, ω), we can estimate the two parameters of the Beta distribution as follows. Assume that P_i ∼ Beta(α_i, β_i). If E P_i = m_i and Var P_i = σ_i², then α_i = m_i²(1 - m_i)/σ_i² - m_i and β_i = (1/m_i - 1) α_i. Indeed, when P_i ∼ Beta(α_i, β_i), E P_i = α_i/(α_i + β_i) = m_i and Var P_i = α_i β_i / ((α_i + β_i)²(α_i + β_i + 1)) = σ_i²; solving these equations for α_i and β_i gives the expressions above.
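The moment-matching step above is a two-line computation per class. A minimal sketch (the function name `beta_from_moments` is ours) that recovers (α_i, β_i) from the sample mean and variance of the MC-dropout marginals:

```python
import numpy as np

def beta_from_moments(m, s2):
    """Moment-match Beta(alpha, beta) to mean m and variance s2, elementwise:
    alpha = m^2 (1 - m) / s2 - m,  beta = (1/m - 1) alpha."""
    m, s2 = np.asarray(m, dtype=float), np.asarray(s2, dtype=float)
    alpha = m ** 2 * (1.0 - m) / s2 - m
    beta = (1.0 / m - 1.0) * alpha
    return alpha, beta

# Sanity check: Beta(2, 5) has mean 2/7 and variance (2*5)/(7^2 * 8).
a, b = beta_from_moments(2.0 / 7.0, 10.0 / (49.0 * 8.0))
```

Note that valid (positive) parameters require σ_i² < m_i(1 - m_i), which holds for any distribution supported on [0, 1].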

3.2. MARGINALIZED JOINT ENTROPY IN BAYESIAN NEURAL NETWORK

Assume that each P_i ∼ Beta(α_i, β_i) by the Beta approximation. We may define a marginalized joint entropy (see Appendix A.1 and A.2) and find an equivalent formulation as follows:

MJEnt[x] := -∑_i E_{P_i}[P_i log(P_i f(P_i))] = ∑_i (E P_i) h(P_i^+) [posterior uncertainty] + I(ω, Y) [epistemic uncertainty] + E_ω[H(Y | ω)] [aleatoric uncertainty],   (3)

where h(•) is the differential entropy, I(•, •) represents the mutual information between two quantities, H(•) is the Shannon entropy, and P_i^+ is the conjugate Beta posterior of P_i, which follows P_i^+ ∼ Beta(α_i + 1, β_i). We call the first term in equation 3 the posterior uncertainty. We may interpret the posterior uncertainty as an expected posterior entropy, assuming that we observed a positive sample of the class toward P_i for each i without knowing the true class label. The first term is always non-positive, and is maximized (equal to 0) when each P_i^+ is Beta(1, 1), i.e., uniform on [0, 1]. So -∞ < MJEnt[x] ≤ H(Y), where we note that H(Y) = I(ω, Y) + E_ω[H(Y | ω)]. The epistemic uncertainty captures the model uncertainty (as in BALD), and the aleatoric uncertainty captures the data uncertainty (Matthies, 2007). Therefore the marginalized joint entropy MJEnt[x] decomposes into three types of uncertainty values.
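Since I(ω, Y) + E_ω[H(Y | ω)] = H(Y), equation 3 collapses under the Beta approximation to MJEnt[x] = ∑_i (E P_i) h(P_i^+) + H(Y). A sketch of this computation (our own illustration; `mjent` is our name, and we use SciPy's differential entropy of a Beta distribution, treating E[P_i] as a proxy for P(Y = i)):

```python
import numpy as np
from scipy.stats import beta as beta_dist

def mjent(alphas, betas):
    """Marginalized joint entropy from Beta-marginal parameters (one point):
    MJEnt[x] = sum_i E[P_i] * h(Beta(alpha_i + 1, beta_i)) + H(Y)."""
    a, b = np.asarray(alphas, dtype=float), np.asarray(betas, dtype=float)
    mean_p = a / (a + b)                                  # E[P_i]
    # posterior uncertainty: non-positive, 0 only when each P_i^+ is Beta(1, 1)
    posterior_unc = np.sum(mean_p * beta_dist(a + 1.0, b).entropy())
    h_y = -np.sum(mean_p * np.log(mean_p))                # H(Y) from E[P_i]
    return posterior_unc + h_y
```

Because a differential entropy on [0, 1] is at most 0 (the uniform case), this construction respects MJEnt[x] ≤ H(Y) as stated above.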

3.3. ENTROPY IS FOR MAXIMIZING AN EXPECTED CROSS-ENTROPY LOSS

Given a ground-truth label {Y = i}, the cross-entropy loss of the neural network is loss(Φ(x, ω), Y = i) = -log E P_i. Therefore, inspired by the expected loss in risk management (Jorion, 2000), we can calculate the expected cross-entropy loss without knowing the true label: ExpectedLoss[x] := ∑_{i=1}^C P[Y = i] loss(Φ(x, ω), Y = i) = -∑_i (E P_i) log(E P_i) = Ent[x]. Based on this re-formulation, we may interpret entropy acquisition as maximizing the expected cross-entropy loss of the selected acquisition points, aligning the idea with the learning loss (Yoo & Kweon, 2019). The natural question is: once we acquire a data point that maximizes the entropy acquisition, can we remove (or learn) this expected amount of cross-entropy loss at a future stage of active learning? The answer could be "No." Such exhaustive loss acquisition could only happen if the neural network perfectly over-fits the training data. Therefore, there exists a gap between a realistic neural network training scenario and the objective of the entropy acquisition. Our equivalent loss interpretation, which introduces neither epistemic nor aleatoric uncertainty, confirms the common perception of why the entropy acquisition might not be successful in practice, even in the single-point acquisition scenario.

3.4. BALD IS A FUNCTION OF MARGINALS AND IS STRONGLY ALIGNED WITH MAXIMIZING AN EXPECTED CROSS-ENTROPY LOSS DIFFERENCE UP TO THE NEXT ITERATION

The mutual information between ω and Y is the same as the mutual information between the encoded message and the channel output, since Y depends only on Φ(x, ω) (Gal et al., 2017): BALD[x] := I(ω, Y) = I(Φ(x, ω), Y(x, ω)). By assuming that Φ(x, ω) follows a Dirichlet distribution, we can calculate the mutual information analytically (Woo, 2022). By investigating the analytical mutual information formula further, we see that the marginal distributions P_i of Φ(x, ω) are sufficient to estimate BALD. This marginal representation is an important phenomenon, as the dependency between all coordinates can be removed at the predictive stage (Bobkov & Madiman, 2011, see Conjecture V.5). Therefore we can represent BALD through the Beta marginal distributions as follows; see Appendix A.4 for more details.

Theorem 3.1. Under the Beta marginal distribution approximation, let P_i ∼ Beta(α_i, β_i) in Φ(x, ω). Then the mutual information BALD[x] can be estimated as

BetaMarginalBALD[x] := ∑_{i=1}^C (α_i - 1) Ψ(α_i + β_i) - ∑_{i=1}^C (α_i/(α_i + β_i)) log(α_i/(α_i + β_i)) - ∑_{i=1}^C (α_i(α_i - 1)/(α_i + β_i)) Ψ(α_i) - ∑_{i=1}^C (β_i(α_i - 1)/(α_i + β_i)) Ψ(α_i + β_i + 1) + ∑_{i=1}^C (α_i²/(α_i + β_i)) [Ψ(α_i + 1) - Ψ(α_i + β_i + 1)],

where Ψ(•) is the digamma function.

As a Bayesian experimental design process, we may assume that each Beta marginal distribution P_i with the ground-truth label {Y = i} of the next trained model would follow the Beta posterior distribution P_i^+. Without this assumption, existing choices of acquisition functions such as BALD or MeanSD might not be well-justified. For example, what is the implication of maximizing mutual information through the active learning process with a Bayesian neural network? How is it different from maximizing the entropy acquisition?
To answer these questions, leveraging our Beta marginalization and considering an idea similar to the expected information gain (Foster et al., 2019), we may consider the expected cross-entropy loss difference between the current-stage model and the next-stage model:

ExpectedEffectiveLoss[x] := ∑_{i=1}^C (E P_i) [(-log E P_i) - (-log E P_i^+)] = ∑_{i=1}^C (α_i/(α_i + β_i)) [log((α_i + 1)/(α_i + β_i + 1)) - log(α_i/(α_i + β_i))].

ExpectedEffectiveLoss captures the effective amount of cross-entropy loss for the model to learn after the acquisition. By definition, ExpectedEffectiveLoss aims to exclude the undesirable over-fitting scenario assumption, unlike the entropy acquisition. Since the digamma function satisfies Ψ(x) ∼ log x - 1/(2x), where f(x) ∼ g(x) means lim_{x→∞} f(x)/g(x) = 1, we may expect BetaMarginalBALD[x] and ExpectedEffectiveLoss[x] to behave similarly. Figure 1 shows the Spearman's rank correlations among different acquisition measures up to a class dimension C = 10,000. We observe that BetaMarginalBALD behaves identically to the original BALD, and we confirm that BALD and MeanSD are strongly aligned with maximizing ExpectedEffectiveLoss. Therefore, acquiring points through BALD or MeanSD could be a better strategy than Entropy, because BALD and MeanSD take into account the effective loss acquisition instead of the unrealistic full amount of loss acquisition.
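ExpectedEffectiveLoss has the closed form above in the Beta parameters, using E[P_i] = α_i/(α_i + β_i) and E[P_i^+] = (α_i + 1)/(α_i + β_i + 1). A small sketch (our own illustration; the function name is hypothetical):

```python
import numpy as np

def expected_effective_loss(alphas, betas):
    """sum_i E[P_i] * (log E[P_i^+] - log E[P_i]), where
    E[P_i] = a_i/(a_i+b_i) and E[P_i^+] = (a_i+1)/(a_i+b_i+1)."""
    a, b = np.asarray(alphas, dtype=float), np.asarray(betas, dtype=float)
    mean_p = a / (a + b)
    mean_p_plus = (a + 1.0) / (a + b + 1.0)
    return np.sum(mean_p * (np.log(mean_p_plus) - np.log(mean_p)))
```

Since E[P_i^+] > E[P_i] always holds, the measure is positive, and it shrinks as the marginals concentrate (α_i, β_i large), unlike Entropy, which reflects the effective rather than the full loss acquisition.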

4. BALANCED ENTROPY LEARNING PRINCIPLE

The previous section shows that well-known acquisition measures have an objective toward the cross-entropy loss and are closely related to the marginal distributions. However, Farquhar et al. (2021) hypothesize that, to be successful in active learning, it is crucial to find a good balance between the active learning bias and the over-fitting bias under over-parametrized neural networks. Although they claim that it might not be possible to achieve the ultimate active learning goal without having the full information, we may still define the balanced entropy (BalEnt) as the ratio between the marginalized joint entropy in equation 3 and an augmented entropy:

BalEnt[x] := MJEnt[x] / (Ent[x] + log 2) = (∑_i (E P_i) h(P_i^+) + H(Y)) / (H(Y) + log 2).

Recall that we call the first term in MJEnt[x] the posterior uncertainty, an expected posterior entropy of the underlying marginals. BalEnt captures the information balance between the posterior uncertainty of the model Φ and the entropy of the label variable Y.

4.1. IMPLICATIONS OF BALANCED ENTROPY

To understand the implication of BalEnt[x], we can prove the following Theorem 4.1.

Theorem 4.1. Let ∆^{-1} := ⌊2 e^{H(Y)}⌋ and let Υ := {I_n} be a collection of evenly divided intervals of [0, 1], where I_n := ((n - 1)∆, n∆] for n = 1, …, ∆^{-1} - 1 and I_{∆^{-1}} := [1 - ∆, 1]. Let P̄_i be the discretization over Υ of P_i from Φ(x, ω). For any estimator P̂_i of P̄_i given the label {Y = i}, we have

E[P(P̂_i ≠ P̄_i | Y = i)] ≥ (∑_i (E P_i) h(P_i^+) + H(Y)) / (H(Y) + log 2) · (1 + ε_1) - ε_2 = BalEnt[x] (1 + ε_1) - ε_2,

where ε_1, ε_2 ≥ 0 are adjustment terms depending on ∆ such that ε_1 → 0 and ε_2 → 0 as ∆ → 0.

Theorem 4.1 tries to answer the following inverse problem. For an unlabeled data point x, if we know the information of the label {Y = i}, how reliably can we estimate the underlying probability P_i from the model Φ? As -log P_i is the cross-entropy loss of the trained model given Y, the theorem equivalently bounds the error probability of predicting the loss at a unit precision of up to -log ∆. For the precision level, we assume it carries -log ∆ ≈ H(Y) + log 2 nats (the natural unit of information, a re-scaled amount of bits), matching the numerator with the MJEnt[x] term. It is not clear how to determine a better choice of the precision level -log ∆, but we may understand the denominator H(Y) + log 2 as normalizing BalEnt[x] ≤ 1 so that it reads as a probability. The sign of BalEnt[x] then becomes very important. BalEnt[x] ≥ 0 implies that it could be impossible to perfectly predict the loss -log P_i given the currently available information, i.e., an information imbalance between the model and the label could exist approximately starting from BalEnt[x] = 0. Therefore, the insight from Theorem 4.1 suggests a new direction for our main active learning principle.
We define our primary acquisition function, the balanced entropy learning acquisition (BalEntAcq), as follows:

BalEntAcq[x] := BalEnt[x]^{-1} if BalEnt[x] ≥ 0, and BalEnt[x] if BalEnt[x] < 0.

Since the information imbalance exists at least from BalEnt[x] = 0, we prioritize filling the information gap from BalEnt[x] = 0 toward the positively increasing direction, which aligns with choosing the entropy-increasing contours. If we instead tried to fill the information imbalance gap starting from the highest BalEnt[x], the imbalance would still exist around the BalEnt[x] = 0 area, so it might not improve the active learning performance much. See Appendix A.13.2 and A.13.3 for results with different prioritizations and precision levels. That is the motivation for taking the reciprocal of BalEnt[x] when BalEnt[x] ≥ 0.
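Putting the pieces together, BalEntAcq for one point is a closed-form function of the 2C Beta parameters, so a pool can be scored in parallel. A sketch under our assumptions (function names are ours; SciPy supplies the Beta differential entropy, and E[P_i] stands in for P(Y = i)):

```python
import numpy as np
from scipy.stats import beta as beta_dist

def balent_acq(alphas, betas):
    """BalEnt[x] = (sum_i E[P_i] h(Beta(a_i+1, b_i)) + H(Y)) / (H(Y) + log 2);
    BalEntAcq[x] = 1 / BalEnt[x] if BalEnt[x] >= 0, else BalEnt[x]."""
    a, b = np.asarray(alphas, dtype=float), np.asarray(betas, dtype=float)
    mean_p = a / (a + b)
    h_y = -np.sum(mean_p * np.log(mean_p))
    posterior_unc = np.sum(mean_p * beta_dist(a + 1.0, b).entropy())
    bal_ent = (posterior_unc + h_y) / (h_y + np.log(2.0))
    return 1.0 / bal_ent if bal_ent >= 0.0 else bal_ent
```

Selecting the pool points with the largest balent_acq scores then favors points with BalEnt[x] slightly above 0, matching the prioritization argued above.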

4.2. TOY EXAMPLE ILLUSTRATION

To illustrate the behavior of BalEntAcq and its relationship with other uncertainty measures, we train a simple Bayesian neural network on a 3-class moon dataset in R². We then calculate each acquisition measure on a fixed lattice of points in a square domain, assuming that the unlabeled pool is highly regular (or uniform); i.e., by evenly discretizing the domain, we obtain an uncertainty value for each lattice point. The total number of lattice points is around 0.6 million.

5. EXPERIMENTAL RESULTS

In this section, we demonstrate the performance of BalEntAcq on the MNIST (LeCun & Cortes, 2010), CIFAR-100 (Krizhevsky et al., 2012), SVHN (Netzer et al., 2011), and TinyImageNet (Le & Yang, 2015) datasets under various scenarios. We used a single NVIDIA A100 GPU for each experiment, and details about the experiments are explained in Appendix A.13. We test the Random, BALD, Entropy, MeanSD, PowerBALD, and BalEntAcq measures, and add BADGE as an additional baseline. Note that all acquisition measures except BADGE in our experiments are standalone quantities, so they can be easily parallelized, i.e., they are linearly scalable. Single acquisition active learning with MNIST. MNIST is the most popular elementary dataset for the initial validation of image-based deep learning models. We use a simple convolutional neural network (CNN) model with dropouts applied to all layers and a single acquisition size. The primary purpose of this single acquisition experiment is to validate our proposed balanced entropy approach by removing the contribution of diversification, unlike the multi-batch acquisition scenario. Fixed features with CIFAR-100 and 3×CIFAR-100. In recent years, significant efforts have been made to build efficient frameworks for unsupervised or self-supervised feature learning, such as SimCLR (Chen et al., 2020a; b), MoCo (He et al., 2020), BYOL (Grill et al., 2020), SwAV (Caron et al., 2020), DINO (Caron et al., 2021), etc. As an application in active learning, we may leverage the feature space from unsupervised feature learning, which constructs a good representation space without explicitly knowing the true labels. In our experiments, we adopt SimCLR for simplicity, with ResNet-50, to build a feature space for CIFAR-100. With the 3×CIFAR-100 dataset, constructed by adding three identical copies of each point, we observe the effect of redundant information on each method. We use the same fixed feature space obtained from SimCLR on CIFAR-100.
We may observe how effectively each method diversifies the selection under a redundant data pool scenario by fixing the feature space. Pre-trained backbone with SVHN and strong data augmentation with TinyImageNet. In this experiment, we follow a typical practical image classification scenario. We use a ResNet-18 backbone for SVHN and a ResNet-50 backbone for TinyImageNet, initialized from an ImageNet pre-trained model, and replace the last linear classification layer with a simple Bayesian neural network with dropouts. In the TinyImageNet iterations, we re-use the previously trained model for the next training, so the pre-trained ImageNet weights are only used at the initial iteration. We apply strong data augmentations for TinyImageNet, including random crop, random flip, random color jitter, and random grayscale. Under this scenario, the feature space from the backbone continuously evolves and remains unsettled as the training and active learning process proceeds. Because of the strong data augmentation and the batch normalization in ResNet-18 and ResNet-50, the decision boundary keeps shifting, implying that the Bayesian experimental design assumption might not hold. However, we still want to observe the general behavior of each measure and how to improve the accuracy under a more dynamic feature space. Discussion. BalEntAcq consistently outperforms the other linearly scalable baselines on all datasets, as shown in Table 1. BADGE performs similarly to Entropy under the single acquisition scenario in MNIST, because BADGE focuses on maximizing the loss gradient, similar to Entropy, as we explained in Section 3.3. BADGE performs better at first when we fix the feature space, but our BalEntAcq eventually catches up with the performance of BADGE. We also note that BADGE is not a linearly scalable method. Under the dynamic feature scenarios of SVHN and TinyImageNet, we observe that our BalEntAcq performs better.
Considering the acquisition calculation time (see Appendix A.18), our BalEntAcq should be the better choice. Figure 3 shows the full active learning curves. For the CIFAR-100 and 3×CIFAR-100 cases, fixing the features controls for or removes all other effects that could affect the model's performance, such as data augmentation or the role of the backbone in classification. As demonstrated in Figure 2, BalEntAcq is very efficient at selecting diversified points along the decision boundary. In contrast, PowerBALD struggles to improve accuracy because it focuses more on diversification/randomization, missing the information near the decision boundary. For SVHN and TinyImageNet, BalEntAcq again shows better performance. We suppose that the diversification near the decision boundary in BalEntAcq also serves as data exploration, because the representation space keeps evolving as the backbone is trained.

6. CONCLUSION

In this paper, we designed and proposed a new uncertainty measure, Balanced Entropy Acquisition (BalEntAcq), which captures the information balance between the underlying probability and the label variable through a Beta approximation within a Bayesian neural network. BalEntAcq offers a diversified selection and is unique compared to other uncertainty measures. We expect that our proposed balanced entropy measure need not be confined to active learning problems in general; BalEntAcq could improve the diversified selection process in many other Bayesian frameworks. Therefore, we look forward to further follow-up studies with broad applications beyond active learning problems.

Following (Gal et al., 2017; Kingma & Welling, 2014; Tzikas et al., 2008), we can easily prove that the mutual information between ω and Y is the same as the mutual information between the encoded Φ(x, ω) and the predictive output Y, since Y depends only on Φ(x, ω):

BALD[x] := I(ω, Y(x, ω)) = H(Y(x, ω)) - E_ω[H(Y(x, ω) | ω)]   (5)
= H(Y(x, ω)) - E_Φ[H(Y(x, ω) | Φ(x, ω))] = I(Φ(x, ω), Y(x, ω)),   (6)

where H(Y(x, ω)) represents the Shannon entropy after marginalizing out the randomness of ω in Y(x, ω), and I(•, •) represents the mutual information between two quantities. The formulations of the mutual information in equations 5-6 look natural, but we note that ω and Φ(x, ω) live on a continuous domain while Y(x, ω) lives on a discrete domain. This combined domain implies that we cannot directly apply the Shannon entropy and differential entropy notions (Cover, 1999). One immediate question is what the joint entropy of Φ(x, ω) and Y(x, ω) is. For this, we can leverage the point process entropy (McFadden, 1965; Fritz, 1973; Papangelou, 1978; Daley & Vere-Jones, 2007; Baccelli & Woo, 2016), which generalizes the notion of entropy to this combined domain. We consider the joint entropy of Φ(x, ω) and Y(x, ω), denoted H(Φ(x, ω), Y(x, ω)), through the point process entropy.
We write a Janossy density function (Daley & Vere-Jones, 2007) j(p, y = i) of (Φ(x, ω), Y(x, ω)) on Δ_C × [C] as follows:

$$j(\mathbf{p}, y = i) = p_i f(\mathbf{p}), \qquad (7)$$

where p := (p_1, ⋯, p_C) and f(·) is a density function of Φ(x, ω). Then the joint entropy of Φ(x, ω) and Y(x, ω) can be defined as

$$H(\Phi(x,\omega), Y(x,\omega)) = -\sum_{i=1}^{C} \int_{\Delta_C} j(\mathbf{p}, y=i) \log j(\mathbf{p}, y=i)\, d\mathbf{p}. \qquad (8)$$

By plugging equation 7 into equation 8, we have the following identity:

$$H(\Phi(x,\omega), Y(x,\omega)) = H(Y(x,\omega)) + \mathbb{E}_Y\left[h(\Phi(x,\omega) \mid Y(x,\omega))\right],$$

where H(·) represents the usual Shannon entropy and h(·) represents the usual differential entropy. By applying Jensen's inequality, we may derive a marginalized joint entropy as an upper bound of the joint entropy (see Appendix A.2):

$$H(\Phi(x,\omega), Y(x,\omega)) \le -\sum_i \mathbb{E}_{P_i}\left[P_i \log(P_i f(P_i))\right],$$

where, with a slight abuse of notation, we also write f(·) for the density function of each marginal P_i.
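Anticipating the Beta marginal approximation of Appendix A.3, this upper bound has a closed form per marginal: if P_i ∼ Beta(α_i, β_i), then −E[P_i log(P_i f(P_i))] = (EP_i)[h(P_i^+) − log EP_i] with P_i^+ ∼ Beta(α_i + 1, β_i). A small numerical sketch checking this identity (our own illustration, not the paper's code):

```python
import numpy as np
from scipy.stats import beta

def marginalized_joint_entropy(alphas, betas):
    """MJEnt[x] = -sum_i E[P_i log(P_i f(P_i))] under Beta marginals,
    evaluated through the closed form sum_i (E P_i)(h(P_i^+) - log E P_i)."""
    a = np.asarray(alphas, dtype=float)
    b = np.asarray(betas, dtype=float)
    mean = a / (a + b)                       # E P_i
    h_plus = beta(a + 1.0, b).entropy()      # differential entropy of Beta(a+1, b)
    return float(np.sum(mean * (h_plus - np.log(mean))))
```

The identity holds because p f(p)/EP is exactly the Beta(α+1, β) density, so the closed form can be validated against direct numerical integration of −∫ p f(p) log(p f(p)) dp.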

A.2 DERIVATION OF MARGINALIZED JOINT ENTROPY WITH THE POINT PROCESS ENTROPY

The Janossy density function resides on a combination of continuous and discrete domains (Daley & Vere-Jones, 2007). For the Janossy density of (Φ(x, ω), Y(x, ω)) on Δ_C × [C], we may follow the classical approach:

$$\begin{aligned}
&P\left(P_1 \in [p_1, p_1 + dp_1], \cdots, P_C \in [p_C, p_C + dp_C], Y = i\right) \\
&\approx P\left(Y = i \mid P_1 = p_1, \cdots, P_C = p_C\right) P\left(P_1 \in [p_1, p_1 + dp_1], \cdots, P_C \in [p_C, p_C + dp_C]\right) \\
&\approx p_i f(p_1, \cdots, p_C)\, dp_1 \cdots dp_C,
\end{aligned}$$

where f(·) is a density function of Φ(x, ω). So we may write the Janossy density of (Φ(x, ω), Y(x, ω)) as follows:

$$j(p_1, \cdots, p_C, y = i) = p_i f(p_1, \cdots, p_C).$$

Following the point process entropy (McFadden, 1965; Fritz, 1973; Papangelou, 1978; Daley & Vere-Jones, 2007), the joint entropy of Φ(x, ω) and Y(x, ω) can be defined as

$$H(\Phi(x,\omega), Y(x,\omega)) = -\sum_{i=1}^{C} \int_{\Delta_C} j(p_1,\cdots,p_C, y=i) \log j(p_1,\cdots,p_C, y=i)\, dp_1 \cdots dp_C. \qquad (13)$$

We note that

$$\int_{\Delta_C} p_i f(p_1,\cdots,p_C)\, dp_1 \cdots dp_C = \int_{[0,1]} p_i f(p_i)\, dp_i = \mathbb{E}P_i. \qquad (14)$$

We may split the Janossy density into two pieces:

$$j(p_1,\cdots,p_C, y=i) = (\mathbb{E}P_i)\, \frac{p_i}{\mathbb{E}P_i} f(p_1,\cdots,p_C). \qquad (15)$$

Plugging equation 15 into equation 13, we have H(Φ(x, ω), Y(x, ω)) = H(Y(x, ω)) + E_Y[h(Φ(x, ω) | Y)]. On the other hand,

$$\begin{aligned}
\text{equation 13} &= -\sum_{i=1}^{C} \int_{\Delta_C} (\mathbb{E}P_i) \frac{p_i}{\mathbb{E}P_i} f(p_1,\cdots,p_C) \log\left[(\mathbb{E}P_i) \frac{p_i}{\mathbb{E}P_i} f(p_1,\cdots,p_C)\right] dp_1 \cdots dp_C \\
&= -\sum_{i=1}^{C} (\mathbb{E}P_i)\log(\mathbb{E}P_i) - \sum_{i=1}^{C} \int_{\Delta_C} (\mathbb{E}P_i) \frac{p_i}{\mathbb{E}P_i} f(p_1,\cdots,p_C) \log\left[\frac{p_i}{\mathbb{E}P_i} f(p_1,\cdots,p_C)\right] dp_1 \cdots dp_C.
\end{aligned}$$

We apply Jensen's inequality to the second term (focusing on each summand).
For each i ∈ {1, ⋯, C},

$$\begin{aligned}
&-(\mathbb{E}P_i) \int_{\Delta_C} \frac{p_i}{\mathbb{E}P_i} f(p_1,\cdots,p_C) \log\left[\frac{p_i}{\mathbb{E}P_i} f(p_1,\cdots,p_C)\right] dp_1 \cdots dp_C \\
&= -(\mathbb{E}P_i) \int_{p_i} \int_{\Delta_C \setminus \{p_i\}} \frac{p_i}{\mathbb{E}P_i} f(p_1,\cdots,p_C) \log\left[\frac{p_i}{\mathbb{E}P_i} f(p_1,\cdots,p_C)\right] dp^{-i}_{1\cdots C}\, dp_i \\
&\le -(\mathbb{E}P_i) \int_{p_i} \left[\int_{\Delta_C \setminus \{p_i\}} \frac{p_i}{\mathbb{E}P_i} f(p_1,\cdots,p_C)\, dp^{-i}_{1\cdots C}\right] \log\left[\int_{\Delta_C \setminus \{p_i\}} \frac{p_i}{\mathbb{E}P_i} f(p_1,\cdots,p_C)\, dp^{-i}_{1\cdots C}\right] dp_i \\
&= -\int_{p_i} p_i f(p_i) \log\left[\frac{p_i}{\mathbb{E}P_i} f(p_i)\right] dp_i = -\mathbb{E}_{P_i}\left[P_i \log\left(\frac{P_i}{\mathbb{E}P_i} f(P_i)\right)\right],
\end{aligned}$$

where dp^{-i}_{1⋯C} indicates dp_1 ⋯ dp_C except dp_i. By combining all terms together, we have

$$\text{equation 13} \le -\sum_{i=1}^{C} (\mathbb{E}P_i)\log(\mathbb{E}P_i) - \sum_{i=1}^{C} \mathbb{E}_{P_i}\left[P_i \log\left(\frac{P_i}{\mathbb{E}P_i} f(P_i)\right)\right] = -\sum_i \mathbb{E}_{P_i}\left[P_i \log(P_i f(P_i))\right].$$

A.3 EQUIVALENT FORMULATION OF MARGINALIZED JOINT ENTROPY

Let us assume that P_i ∼ Beta(α_i, β_i) and P_i^+ ∼ Beta(α_i + 1, β_i). Then

$$\begin{aligned}
\mathrm{MJEnt}[x] &= -\sum_i \mathbb{E}_{P_i}\left[P_i \log(P_i f(P_i))\right] = -\sum_i \int_0^1 p_i f(p_i) \log(p_i f(p_i))\, dp_i \\
&= -\sum_i (\mathbb{E}P_i) \int_0^1 \frac{p_i f(p_i)}{\mathbb{E}P_i} \log\left(\frac{p_i f(p_i)}{\mathbb{E}P_i}\right) dp_i - \sum_i (\mathbb{E}P_i) \log(\mathbb{E}P_i) \qquad (20) \\
&= \sum_i (\mathbb{E}P_i)\left[h(P_i^+) - \log(\mathbb{E}P_i)\right],
\end{aligned}$$

where h(P_i^+) is the differential entropy of P_i^+; the key observation is that p_i f(p_i)/EP_i is exactly the density of P_i^+ ∼ Beta(α_i + 1, β_i).

A.4 PROOF OF THEOREM 3.1

Let η = (η_1, ⋯, η_C) and η(i, ++) = (η_1, ⋯, η_{i−1}, η_i + 1, η_{i+1}, ⋯, η_C), and let B(η) = Γ(η_1) ⋯ Γ(η_C) / Γ(Σ_{k=1}^C η_k), where Γ(·) is the Gamma function.

Theorem A.1 (Woo, 2022, Theorem III.1). Assume that Φ(x, ω) := (P_1, ⋯, P_C) ∼ Dirichlet(η_1, ⋯, η_C). Then the analytical formula of the mutual information BALD[x] is the following:

$$\begin{aligned}
\mathrm{DirichletBALD}[x] :=\; & \left(\sum_{k=1}^C \eta_k - C\right)\Psi\!\left(\sum_{k=1}^C \eta_k\right) - \sum_{i=1}^C (\eta_i - 1)\Psi(\eta_i) - \sum_{i=1}^C \frac{\eta_i}{\sum_{k=1}^C \eta_k} \log \frac{\eta_i}{\sum_{k=1}^C \eta_k} \\
& + \sum_{i=1}^C \sum_{j\ne i} (\eta_j - 1) \frac{B(\eta(i,++))}{B(\eta)} \left[\Psi(\eta_j) - \Psi\!\left(\sum_{k=1}^C \eta_k + 1\right)\right] \\
& + \sum_{i=1}^C \eta_i \frac{B(\eta(i,++))}{B(\eta)} \left[\Psi(\eta_i + 1) - \Psi\!\left(\sum_{k=1}^C \eta_k + 1\right)\right],
\end{aligned}$$

where Ψ(·) is the Digamma function.

Given the above theorem, since B(η(i,++))/B(η) = η_i / Σ_{k=1}^C η_k, we can simplify the formula further:

$$\begin{aligned}
\mathrm{DirichletBALD}[x] =\; & \left(\sum_{k=1}^C \eta_k - C\right)\Psi\!\left(\sum_{k=1}^C \eta_k\right) - \sum_{i=1}^C (\eta_i - 1)\Psi(\eta_i) - \sum_{i=1}^C \frac{\eta_i}{\sum_k \eta_k} \log \frac{\eta_i}{\sum_k \eta_k} \\
& + \sum_{i=1}^C \sum_{j\ne i} \frac{\eta_i(\eta_j - 1)}{\sum_k \eta_k} \left[\Psi(\eta_j) - \Psi\!\left(\sum_k \eta_k + 1\right)\right] + \sum_{i=1}^C \frac{\eta_i^2}{\sum_k \eta_k} \left[\Psi(\eta_i + 1) - \Psi\!\left(\sum_k \eta_k + 1\right)\right] \\
=\; & \left(\sum_{k=1}^C \eta_k - C\right)\Psi\!\left(\sum_{k=1}^C \eta_k\right) - \sum_{i=1}^C \frac{\eta_i}{\sum_k \eta_k} \log \frac{\eta_i}{\sum_k \eta_k} - \sum_{i=1}^C \frac{\eta_i(\eta_i - 1)}{\sum_k \eta_k} \Psi(\eta_i) \\
& - \sum_{i=1}^C (\eta_i - 1)\left(1 - \frac{\eta_i}{\sum_k \eta_k}\right)\Psi\!\left(\sum_k \eta_k + 1\right) + \sum_{i=1}^C \frac{\eta_i^2}{\sum_k \eta_k} \left[\Psi(\eta_i + 1) - \Psi\!\left(\sum_k \eta_k + 1\right)\right].
\end{aligned}$$

Therefore, DirichletBALD is a function only of the marginals of Φ(x, ω) under the Dirichlet distribution, i.e., of the parameters η_i and Σ_{i=1}^C η_i.
Under the Beta marginal distribution assumption, since each marginal of a Dirichlet distribution follows a Beta distribution, we may set η_i = α_i and Σ_{k=1}^C η_k = α_i + β_i for each i, which gives

$$\begin{aligned}
\mathrm{BetaMarginalBALD}[x] :=\; & \sum_{i=1}^C (\alpha_i - 1)\Psi(\alpha_i + \beta_i) - \sum_{i=1}^C \frac{\alpha_i}{\alpha_i + \beta_i} \log \frac{\alpha_i}{\alpha_i + \beta_i} - \sum_{i=1}^C \frac{\alpha_i(\alpha_i - 1)}{\alpha_i + \beta_i}\Psi(\alpha_i) \\
& - \sum_{i=1}^C \frac{\beta_i(\alpha_i - 1)}{\alpha_i + \beta_i}\Psi(\alpha_i + \beta_i + 1) + \sum_{i=1}^C \frac{\alpha_i^2}{\alpha_i + \beta_i}\left[\Psi(\alpha_i + 1) - \Psi(\alpha_i + \beta_i + 1)\right].
\end{aligned}$$

Therefore Theorem 3.1 follows. On the other hand, since Beta marginal distributions are sufficient to calculate the mutual information, the same idea can be applied to the aleatoric uncertainty.

Corollary 1. Under the Beta marginal distribution approximation, let P_i ∼ Beta(α_i, β_i) in Φ(x, ω). Then the aleatoric uncertainty can be estimated as follows:

$$\begin{aligned}
\mathrm{BetaMarginalAleatoricUncertainty}[x] :=\; & -\sum_{i=1}^C (\alpha_i - 1)\Psi(\alpha_i + \beta_i) + \sum_{i=1}^C \frac{\alpha_i(\alpha_i - 1)}{\alpha_i + \beta_i}\Psi(\alpha_i) \\
& + \sum_{i=1}^C \frac{\beta_i(\alpha_i - 1)}{\alpha_i + \beta_i}\Psi(\alpha_i + \beta_i + 1) - \sum_{i=1}^C \frac{\alpha_i^2}{\alpha_i + \beta_i}\left[\Psi(\alpha_i + 1) - \Psi(\alpha_i + \beta_i + 1)\right].
\end{aligned}$$

A.5 PROOF OF THEOREM 4.1

First, let a positive integer Δ^{-1} > 0 be given and let Υ := {I_n} be a collection of evenly divided intervals of [0, 1], where I_n := [(n−1)Δ, nΔ) for n = 1, ⋯, Δ^{-1} − 1 and I_{Δ^{-1}} := [1−Δ, 1]. Let P̃_i be the discretized random variable over Υ of P_i from Φ(x, ω), i.e., P̃_i = (n − ½)Δ if P_i ∈ I_n, so that P(P̃_i = (n − ½)Δ) = P[P_i ∈ I_n]. For any estimator P̂_i of P̃_i given the label {Y = i}, by applying Fano's inequality (Fano, 1961; Anantharam & Verdú, 1996; Cover, 1999, Theorem 2.10.1), we have (note that our log has base e)

$$P\left(\hat P_i \ne \tilde P_i \mid Y = i\right) \ge \frac{H(\tilde P_i \mid Y = i) - \log 2}{\log \Delta^{-1}} = \frac{H(\tilde P_i \mid Y = i) - \log 2}{-\log \Delta}.$$

We note that the Shannon entropy and the differential entropy have the following connection (Cover, 1999, Theorem 8.3.1):

$$H(\tilde P_i \mid Y = i) + \log \Delta = h(P_i \mid Y = i) + \epsilon_i = h(P_i^+) + \epsilon_i,$$

where ε_i is an adjustment constant depending on Δ such that ε_i → 0 as Δ → 0. Note that ε_i does not have to be non-negative.
Then we can rewrite the inequality as follows:

$$P\left(\hat P_i \ne \tilde P_i \mid Y = i\right) \ge \frac{h(P_i^+) - \log\Delta - \log 2}{-\log\Delta} + \frac{\epsilon_i}{-\log\Delta}.$$

Taking the expectation with respect to Y, we have

$$\mathbb{E}\left[P\left(\hat P_i \ne \tilde P_i \mid Y = i\right)\right] \ge \frac{\sum_i (\mathbb{E}P_i)\, h(P_i^+) - \log\Delta - \log 2}{-\log\Delta} + \frac{\sum_i (\mathbb{E}P_i)\, \epsilon_i}{-\log\Delta} =: (**).$$

If we let Δ^{-1} = ⌊2e^{H(Y)}⌋, there exists a δ ≥ 0 such that

$$H(Y) + \log 2 - \delta = -\log\Delta = \log\left\lfloor 2e^{H(Y)} \right\rfloor \le H(Y) + \log 2.$$

We also note that δ → 0 as H(Y) → ∞ (or equivalently Δ → 0). Therefore, when Δ^{-1} = ⌊2e^{H(Y)}⌋, we have

$$\begin{aligned}
(**) &= \frac{\sum_i (\mathbb{E}P_i)\, h(P_i^+) + H(Y) - \delta}{H(Y) + \log 2 - \delta} + \frac{\sum_i (\mathbb{E}P_i)\, \epsilon_i}{H(Y) + \log 2 - \delta} \\
&\ge \frac{\sum_i (\mathbb{E}P_i)\, h(P_i^+) + H(Y)}{H(Y) + \log 2}\left(1 + \frac{\delta}{H(Y) + \log 2 - \delta}\right) - \frac{\sum_i (\mathbb{E}P_i)\, |\epsilon_i| + \delta}{H(Y) + \log 2}.
\end{aligned}$$

Let ε_1 = δ / (H(Y) + log 2 − δ) and ε_2 = [Σ_i (EP_i)|ε_i| + δ] / (H(Y) + log 2) ≥ 0. Since ε_i → 0 and δ → 0 as Δ → 0, both ε_1 and ε_2 → 0 as Δ → 0. Therefore Theorem 4.1 follows. As a final remark, we note that Δ^{-1} can be regarded as a variant of the discrete entropy power (Wang et al., 2014; Woo & Madiman, 2015; Madiman et al., 2019; 2021; Haghighatshoar et al., 2014; Jog & Anantharam, 2014).
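The closed form of BetaMarginalBALD in Theorem 3.1 (Appendix A.4) is straightforward to implement. A minimal sketch (our own code, not the paper's release): for C = 2 the Beta marginals determine the full distribution, so the formula can be checked against the exact binary BALD obtained from the identity E[P log P] = (α/(α+β))(Ψ(α+1) − Ψ(α+β+1)) for P ∼ Beta(α, β).

```python
import numpy as np
from scipy.special import digamma

def beta_marginal_bald(alphas, betas):
    """BetaMarginalBALD[x] of Theorem 3.1 with P_i ~ Beta(a_i, b_i)."""
    a = np.asarray(alphas, dtype=float)
    b = np.asarray(betas, dtype=float)
    s = a + b
    mean = a / s
    return float(
        np.sum((a - 1.0) * digamma(s))                    # (alpha_i - 1) Psi(alpha_i + beta_i)
        - np.sum(mean * np.log(mean))                     # marginal entropy H(Y)
        - np.sum(a * (a - 1.0) / s * digamma(a))
        - np.sum(b * (a - 1.0) / s * digamma(s + 1.0))
        + np.sum(a ** 2 / s * (digamma(a + 1.0) - digamma(s + 1.0))))
```

Unlike a Monte Carlo BALD estimate, this evaluates in closed form per point, which is what makes the acquisition linearly scalable and parallelizable.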

A.6 MORE BETA MARGINAL FORMULATIONS

With the Beta approximation, we are able to describe the Beta marginal formulation of MeanSD. Since we match the variance of each marginal distribution, the empirical value of MeanSD is identical to BetaMarginalMeanSD:

$$\mathrm{BetaMarginalMeanSD}[x] := \frac{1}{C}\sum_{i=1}^C \sqrt{\frac{\alpha_i \beta_i}{(\alpha_i + \beta_i)^2 (\alpha_i + \beta_i + 1)}} = \mathrm{MeanSD}[x].$$

In Foster et al. (2019), the expected information gain (EIG) has been proposed and studied. We may also formulate the expected information gain with Beta marginal distributions:

$$\mathrm{BetaMarginalEIG}[x] := H(Y) - \mathbb{E}\,H(Y^+ \mid Y = i) = \sum_{i=1}^C \frac{\alpha_i}{\alpha_i + \beta_i} \left[\sum_{j=1}^C \frac{\alpha_j + \delta_i(j)}{\alpha_j + \beta_j + 1} \log \frac{\alpha_j + \delta_i(j)}{\alpha_j + \beta_j + 1} - \log \frac{\alpha_i}{\alpha_i + \beta_i}\right],$$

where Y^+ is a categorical random variable over the posterior probability given Y = i, with δ_i(j) = 1 if i = j and δ_i(j) = 0 otherwise.

Figures 5 and 6 show examples of Beta approximations obtained from the MNIST and CIFAR-10 datasets. P_1, ⋯, P_10 show the marginal distributions of the predictive probability of each class. We observe that the Beta approximation is a reasonable approximation.

Figure 6: An example of Beta approximations (red lines) for each marginal distribution after applying the softmax layer on the CIFAR-10 dataset.
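The Beta fits in Figures 5 and 6 are obtained by matching the sample mean and variance of each marginal's dropout histogram (see the Figure 5 caption). A minimal moment-matching sketch (our own code, not the paper's release):

```python
import numpy as np

def fit_beta_by_moments(samples):
    """Moment-matched Beta(alpha, beta) fit for one marginal P_i.
    With sample mean m and variance v, set k = m(1-m)/v - 1,
    then alpha = m*k and beta = (1-m)*k."""
    m = float(np.mean(samples))
    v = float(np.var(samples))
    k = m * (1.0 - m) / v - 1.0
    return m * k, (1.0 - m) * k
```

Estimating only these two parameters per class marginal is what allows the whole acquisition to be computed in parallel across the pool.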

A.8 RANK CORRELATION STUDY WITH BETAMARGINALEIG

A good advantage of an explicit formula is that we can study the behavior of each measure directly. For example, if C = 2 and Φ(x, ω) ∼ Dirichlet(α, β), so that P_1 ∼ Beta(α, β) and P_2 ∼ Beta(β, α), we are able to plot the behavior of each Beta marginal measure. With BetaMarginalEIG, we are able to generate the same type of plot shown in Figure 1. EIG shows positive correlations with BALD and MeanSD, but the correlation is only around 70%, implying that EIG may show more variation.

Figure 8: Same experiments with BetaMarginalEIG as described in Figure 1 of the main article. This is another independent experiment (as a validation), so the captured correlation values are slightly different.

Figure 9: Pairwise scatter plot at C = 10,000, companion to Figure 8.

A.9 BALD AND BETAMARGINALBALD

Although BALD and BetaMarginalBALD show a high rank correlation (under the softmax-applied Gaussian distribution assumption), we might wonder how different their values are. We plot the RMSE (root mean square error) between the two measures. Under the Dirichlet distribution assumption, the RMSE between BALD and BetaMarginalBALD is < 0.002 up to C ≤ 1,000. However, under the softmax-applied Gaussian distribution assumption, the RMSE between BALD and BetaMarginalBALD is < 0.07 up to C ≤ 1,000. This implies that the Beta marginal approximation still preserves a high rank correlation, but the absolute values are slightly shifted, e.g., BALD[x] ≈ BetaMarginalBALD[x] + err for some constant err ∈ ℝ. This study also implies that the Beta marginal approximation is a reasonable assumption.

Figure 12 shows the active learning curves for ExpectedEffectiveLoss and BetaMarginalEIG with MNIST and 3×CIFAR-100. This experiment also confirms that BALD and ExpectedEffectiveLoss are tightly aligned, as we have shown that both are highly correlated. In MNIST, BetaMarginalEIG performs similarly to BALD and ExpectedEffectiveLoss.
However, in 3×CIFAR-100, BetaMarginalEIG performs similarly to BALD at first but eventually performs better than BALD and similarly to the random case. Recall that the rank correlation between BALD and BetaMarginalEIG is around 70%.

In this section, we study the non-negative region of BalEntAcq[x]. BalEntAcq[x] is non-negative when MJEnt[x] ≥ 0. Under the Beta marginal distribution approximation, let P_i ∼ Beta(α_i, β_i) in Φ(x, ω). Then we can fully write MJEnt[x] as follows:

$$\mathrm{MJEnt}[x] = \sum_i (\mathbb{E}P_i)\, h(P_i^+) + H(Y) = \sum_{i=1}^C \frac{\alpha_i}{\alpha_i + \beta_i}\left[\log B(\alpha_i + 1, \beta_i) - \alpha_i \Psi(\alpha_i + 1) - (\beta_i - 1)\Psi(\beta_i) + (\alpha_i + \beta_i - 1)\Psi(\alpha_i + \beta_i + 1) - \log\frac{\alpha_i}{\alpha_i + \beta_i}\right].$$

Then we are able to generate a 3D plot and a contour plot of BalEnt[x] when C = 2, i.e., Φ(x, ω) ∼ Dirichlet(α, β) such that P_1 ∼ Beta(α, β) and P_2 ∼ Beta(β, α).

A fully dropout-applied Bayesian neural network (Gal & Ghahramani, 2016) would capture model uncertainty at every layer, but this requires a high computational cost. Therefore, we adopt an additional last-layer dropout architecture to build a Bayesian neural network equipped with the Beta approximation. There exist several different lines of work justifying the effectiveness of this simple last-layer modification (Snoek et al., 2015; Wilson et al., 2016; Brosse et al., 2020; Kristiadi et al., 2020; Hobbhahn et al., 2020). More precisely, similar to the Laplace approximation applied at the last layer (Kristiadi et al., 2020, Theorem 2.4; Hobbhahn et al., 2020), we replace the last several linear layers with dropout-applied, ReLU-activated linear layers. For example, we add two or more dropout layers after the fixed ResNet-50 backbone in our CIFAR-100 experiments to avoid pathological cases (Foong et al., 2020). In practice, we observe that a single dropout layer is sufficient to achieve our Beta-approximated marginals, as shown below. We note that we use a 50% dropout rate in the MNIST experiment and a 20% dropout rate in all other experiments.

A.13.2 CHOICES OF PRIORITIZATION IN BALENTACQ

In this section, we study the impact of the prioritization in BalEnt[x].

P1. P1[x] = −BalEnt[x]. This is the case where we put higher priority on points whose posterior uncertainty takes very small values. Note that this also includes the high epistemic uncertainty (BALD) case: for example, when C = 2 with Φ(x, ω) ∼ Dirichlet(α, β), the posterior uncertainty goes to −∞ as α → 0 and β → 0, so BalEnt[x] → −∞, yet this case also achieves the highest epistemic uncertainty.

P2. P2[x] = BalEnt[x]^{−1} if BalEnt[x] ≥ 0, and P2[x] = BalEnt[x] if BalEnt[x] < 0. This is our proposed acquisition measure.

P3. P3[x] = BalEnt[x]. This is the case where we put higher priority on points whose posterior uncertainty takes very high values (close to zero).

As discussed in Section 4.1, we want to prioritize points where the information imbalance gap is higher. Figure 15 and Table 3 show that selecting points near BalEnt[x] ≈ 0 is a better way to improve the accuracy, as we discussed in Section 4.1. When we prioritize the small posterior uncertainty case, P1 shows very poor performance in the fixed-feature scenario. However, under the backbone-with-augmentation scenario, the performance of P1 is similar to that of the high posterior uncertainty case P3. This could be because of the evolution of the feature space and the batch normalization during the active learning process; i.e., previously captured uncertainty values are not preserved under the backbone-with-augmentation scenario.

As shown in the proof of Theorem 4.1, we have some freedom in choosing the precision level of the P_i estimation. Therefore, we report the active learning behavior for other precision choices. It is not clear which precision level achieves the optimal performance, but our preference is the level H(Y) + log 2 used in our main experiments.

Figure 17 shows the full active learning curves, the negative log-likelihood, the average epistemic uncertainty of the selected samples, and the average aleatoric uncertainty of the selected samples. We note that our proposed method selects neither high epistemic uncertainty (= model uncertainty) samples nor high aleatoric uncertainty (= data uncertainty) samples. Nevertheless, BalEntAcq shows a good performance improvement during the active learning iterations. Furthermore, we observe that BalEntAcq keeps choosing points with low aleatoric uncertainty but increasing epistemic uncertainty.

In this section, we report the active learning result on TinyImageNet without a pretrained model. By doing so, we can also observe the effect of the model's pre-knowledge. We use the same setting as in the main experiment.
We observe that the accuracy progression is slower, so more samples are required, but the observed behavior of each method is the same as before.

Figure 18: Active learning curves of TinyImageNet without a pretrained model.

The key insight of Algorithm 1 in achieving exponential label savings is to abstain from points very close to the decision boundary. Similarly, as we demonstrated in A.11, our BalEntAcq[x] finds such a margin near the decision boundary (see Section A.15).

The original CoreMSE requires calculating the entire correlational effect on the loss from all data points, so we cannot linearly scale up this computation. Even if we try to relax the memory issue by fixing the number of points of interest for correlation instead of using the entire unlabelled dataset, we still experience a memory deficiency. Figure 21 confirms how the acquisition times of our BalEntAcq and BADGE depend on the unlabelled dataset size and the acquisition size, obtained by augmenting the number of images from CIFAR-10 and CIFAR-100. Our BalEntAcq does not depend on the acquisition size, whereas BADGE scales linearly with it. In the unlabelled dataset size, both methods' acquisition times increase linearly, but the slope of BADGE is much steeper.

In this section, we observe the active learning performance under various redundancy levels with the τ×CIFAR-100 dataset, where τ ∈ {1, 3, 5, 7, 10, 20, 50}. Active learning with BalEntAcq shows the best accuracy when no redundant images exist. With 50×CIFAR-100, our BalEntAcq requires only up to 1.25% of the data points among 2.5 million images to achieve the fully supervised accuracy (= 66%). We do not observe any significant performance deterioration as the pool size increases, even in highly redundant image scenarios. This demonstrates the possibility of exponential label savings, as discussed in Section A.15. Refer to Table 9 for the specific accuracy values.
As discussed in A.13.3, MJEnt[x] itself results from the derivation applying Jensen's inequality shown in Appendix A.2. In other words, MJEnt[x] is the maximally achievable joint entropy between Φ(x, ω) and Y(x, ω). So we may expect that there exists an entropy-increasing physical dynamical process, something similar to a diffusion process (Oksendal, 2013), from the state (Φ(x, ω), Y(x, ω)) to another maximally marginalized state S′ whose joint entropy is MJEnt[x]. Because of our limited understanding of the point process entropy, we do not yet know the full specification of the existence of these physical dynamics.

In this section, we report our active learning experiment when we train a Bayesian neural network with variational dropout (Kingma et al., 2015) on 3×CIFAR-10 with acquisition size 50 and on 3×CIFAR-100 with acquisition size 500 under a fixed-feature scenario.



Code is available at https://github.com/jaeohwoo/BalancedEntropy.



PROBLEM FORMULATION

We write D_pool for the unlabeled dataset and D_training ⊆ D_pool for the labeled training set in each active learning iteration. We denote it by D_training^(n) when it is necessary to indicate the specific n-th iteration step. Given D_training, we train a Bayesian deep neural network model Φ with model parameters ω ∼ p(ω). Then, for a data point x given D_training, the Bayesian deep neural network Φ produces the prediction probability Φ(x, ω) := (P_1(x, ω), ⋯, P_C(x, ω)) ∈ Δ_C, where Δ_C = {(p_1, ⋯, p_C) : p_1 + ⋯ + p_C = 1, p_i ≥ 0 for each i} and C is the number of classes. The final class output Y(x, ω) is assumed to follow a multinoulli (categorical) distribution:

$$Y(x,\omega) := \begin{cases} 1 & \text{with probability } P_1(x,\omega), \\ \;\vdots & \\ C & \text{with probability } P_C(x,\omega). \end{cases}$$
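For concreteness, Φ(x, ω) is realized by repeated stochastic forward passes (e.g., MC dropout), each producing a softmax vector on Δ_C. A minimal sketch with sampled last-layer weights standing in for ω (shapes and names are our own illustration, not the paper's code):

```python
import numpy as np

def predictive_samples(weight_samples, features):
    """Samples Phi(x, w_1), ..., Phi(x, w_T) on the simplex Delta_C.
    weight_samples: (T, C, d) sampled last-layer weights; features: (d,)."""
    logits = weight_samples @ features               # (T, C)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    expl = np.exp(logits)
    return expl / expl.sum(axis=1, keepdims=True)    # each row lies in Delta_C
```

Each column of the returned array is then the empirical distribution of one marginal P_i(x, ω), which is what the Beta approximation is fitted to.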

Figure 1: Scatter plot at C = 10,000 between BALD and ExpectedEffectiveLoss (left), Spearman's rank correlations over various class dimensions (middle), and Spearman's rank correlation matrix at C = 10,000 (right). The relationship between BetaMarginalBALD and ExpectedEffectiveLoss consistently captures a high rank correlation of > 99.6% regardless of the class dimension. BALD and ExpectedEffectiveLoss show > 97.5% rank correlation. We randomly generate 100 softmax-applied C-dimensional Gaussian samples and repeat the process 10 times. The shaded band shows the standard deviation.

Figure 2: Top-K selected points are marked by red color. The first row shows the top K = 25 points. The second row shows the top K = 500 point selections among around 0.6 million grid points.

Figure 3: Active learning accuracy curves obtained from various scenarios. Our proposed BalEntAcq outperforms well-known acquisition measures, and we repeated the experiment 3 times.

Figure 5: An example of Beta approximations (red lines) for each marginal distribution after applying the softmax layer on the MNIST dataset. Each Beta distribution is estimated by calculating the sample mean and sample variance of the histogram generated by the Bayesian deep learning model.

Figure 7: 3D plot of each uncertainty measure when Beta marginal assumption holds.

Figure 10: Scatter plot at C = 1,000 between BALD and BetaMarginalBALD (left), RMSE between BALD and BetaMarginalBALD over various class dimensions (middle), and Spearman's rank correlations over various class dimensions (right). The first row is the result from 100 random C-dimensional Dirichlet samples. The second row is obtained from 100 softmax-applied random Gaussian samples. We then repeat the process 10 times. In both cases, we observe > 96% rank correlations as well.

Figure 11: Top-K selected points are marked by red color. The first row shows the top K = 25 point selections. The second row shows the top K = 500 point selections among around 0.6 million uniform grid points. The same experiments shown in Figure 2 in the main article.

Figure 12: Active learning curves for ExpectedEffectiveLoss and BetaMarginalEIG with MNIST and 3×CIFAR-100.

Figure 13: BalEnt 3D plot (left) and positive BalEnt contour plot (right) over parameters (α, β). For the contour plot, starting from the outside, contours are generated at BalEnt[x] ∈ {−3, −2, −1, 0, 0.1, 0.2, 0.3}.

Figure 13 suggests that in the Dirichlet distribution's parameter space, there exist (uncountably) infinitely many parameters that produce non-negative BalEnt[x] values. We then also plot the non-negative region (red shaded) of BalEntAcq in our toy example.

Figure 14 illustrates that there exist infinitely many points that produce the same BalEntAcq[x] value. Therefore, we may imagine conducting a uniform sampling on each contour surface {BalEnt[x] = λ} for each λ ≥ 0, then moving across the surfaces. This observation also explains how BalEnt[x] diversifies the selection near the decision boundary.

Figure 14: Non-negative BalEntAcq region illustration from the toy example

Figure 15: Active learning curves depending on different prioritization.

Figure 16: Active learning curves depending on different precision levels.

Figure 17: Full active learning curves obtained from different scenarios. From the top row, each row represents a result of MNIST, CIFAR-100, 3×CIFAR-100, and TinyImageNet.

Figure 21: Acquisition time under different unlabelled dataset sizes (first row) and under different acquisition sizes (second row).

Figure 22: Active learning curves under various redundancy level scenarios

Figure 23 and Table 10 show the active learning performances on CIFAR-100 and 3×CIFAR-100. We observe that MJEntAcq[x] performs well, reaching the fully supervised accuracy. However, BalEntAcq[x] performs better than MJEntAcq[x]. This suggests that the slight distortion of MJEnt[x] into BalEnt[x] could help improve active learning performance.



Yinglun Zhu and Robert Nowak. Active learning with neural networks: Insights from nonparametric statistics. Advances in Neural Information Processing Systems, 2022a.

Yinglun Zhu and Robert Nowak. Efficient active learning with abstention. Advances in Neural Information Processing Systems, 2022b.

A summary of the datasets, configurations, and hyperparameters used in our experiments. For each experiment, we repeat 3 times to generate the full active learning accuracy curve.

Detailed configurations used in our experiments. In SimCLR (Chen et al., 2020a) feature training, we trained ResNet-50 with 224 × 224 image size, batch size 192, 500 epochs, and learning rate 0.0003 with the Adam optimizer for CIFAR-10/CIFAR-100.

Selected accuracy table depending on different prioritization. Mean and standard deviation are from 3 repeated experiments. The best performance in each column is shown in bold.

Selected accuracy table depending on different precision levels. Mean and standard deviation are from 3 repeated experiments. The best performance in each column is shown in bold.

Selected accuracy table. Mean and standard deviation are from 3 repeated experiments. At each iteration m, we add a geometrically increasing number of 2^m queried points.

The first two rows show the theoretical time and space complexity. The remaining rows present the average calculation time that we observed in our experiments.

Selected accuracy table for various redundancy levels. Mean and standard deviation are from 3 repeated experiments.

Selected accuracy table for various redundancy levels. Mean and standard deviation are from 3 repeated experiments.


BalEntAcq[x] finds a margin by focusing on the positive sign of BalEntAcq[x], which corresponds to finding x outside the abstention region such that |x − 1/2| > γ near the decision boundary. Then, following the positive BalEntAcq[x] values, we acquire points toward the decision boundary direction, which corresponds to the condition 1/2 ∈ UB[x]. We know that a point near the decision boundary should have high aleatoric uncertainty (so the acquisition is possibly noise-seeking). On the other hand, Corollary 1 implies that the aleatoric uncertainty increases as α, β → +∞, so MJEnt[x] → −∞ and hence BalEntAcq[x] → −∞. Therefore, our BalEntAcq[x] will acquire points near the decision boundary but will not acquire a point that is too close to the decision boundary. This strategy in our BalEntAcq[x] exactly matches the key insight of Algorithm 1. So we may be able to theoretically argue that our proposed acquisition function BalEntAcq[x] could be a universally working active learning algorithm achieving exponential label savings.
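The sign behavior used in this argument can be made concrete under the Beta marginal closed forms (MJEnt[x] = Σ_i (EP_i) h(P_i^+) + H(Y) from Appendix A.3, and BalEnt[x] = MJEnt[x]/(H(Y) + log 2) from Appendix A.20). A sketch (our own code, not the paper's release) showing that a diffuse posterior at the boundary gives a positive value while a concentrated posterior exactly at the boundary gives a negative one:

```python
import numpy as np
from scipy.stats import beta

def balent(alphas, betas):
    """BalEnt[x] = MJEnt[x] / (H(Y) + log 2) under Beta marginals
    P_i ~ Beta(a_i, b_i), with P_i^+ ~ Beta(a_i + 1, b_i)."""
    a = np.asarray(alphas, dtype=float)
    b = np.asarray(betas, dtype=float)
    mean = a / (a + b)                                   # E P_i
    h_y = -float(np.sum(mean * np.log(mean)))            # Shannon entropy H(Y)
    mjent = float(np.sum(mean * beta(a + 1.0, b).entropy())) + h_y
    return mjent / (h_y + np.log(2.0))
```

Under these formulas, balent([1, 1], [1, 1]) is positive (an uncertain model at the decision boundary), while balent([100, 100], [100, 100]) is negative (a confident model sitting exactly on the boundary), matching the abstention behavior described above.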

A.16 MORE EXPERIMENTS WITH SMALLER ACQUISITION SIZE

In this section, we conduct more experiments with 3×MNIST and 3×CIFAR-10 by adding more baselines such as VarRatio[x] := 1 − max_i EP_i (Freeman, 1965), BatchBALD, and CoreSet. The main purpose of these experiments is to test a relatively smaller acquisition size: we acquire 25 points in each active learning iteration. For 3×MNIST, we use CNN architectures. For 3×CIFAR-10, we fix the feature space obtained from SimCLR (Chen et al., 2020a), the same setting used in our main experiments. Overall, the additional experimental results are well-aligned with our main results. We observe that BADGE is the best-performing baseline; however, we note that BADGE is not linearly scalable and requires more computational cost. Figure 19 and Table 6 show the full results.

An expected-loss-reduction (ELR) framework for the optimal Bayes classifier has been proposed (Zhao et al., 2021; Tan et al., 2021). Under this framework, they attempt to optimize the loss reduction holistically, accounting for the average loss reduction from all points. However, this non-parametric approach requires a very expensive computational cost. With a large dataset size, ELR (Zhao et al., 2021), wMOCU (Zhao et al., 2021), CoreLog (Tan et al., 2021), and CoreMSE (Tan et al., 2021) require a vast memory size unless we apply size reductions on the data space and the MC samples (Tan et al., 2021). If the number of classes is large, running the algorithm in practice is impossible. Therefore, the naive application of ELR-based algorithms is not scalable. Moreover, both works have pitfalls in their convergence proofs by assuming finite data and parameter spaces, so both proofs end up vacuous. Nevertheless, we tested the performance of CoreMSE (Tan et al., 2021), seemingly the best method under this framework, with MNIST. Figure 20 and Table 7 show the full active learning results.

In this section, we discuss the time and space complexity of the acquisition calculation in each active learning iteration.
We denote by N the number of unlabelled points, C the number of classes, and K the acquisition size. For BADGE, we use the last-layer feature vector and then apply k-means++ initialization (Vassilvitskii & Arthur, 2007). The main computational bottleneck of BADGE is the dependency on K in the multi-acquisition calculation: the K iterations cannot be parallelized, since the pairwise distance calculation depends on the previously selected centers in k-means++. On the other hand, the main computational bottleneck of CoreMSE is the N² dependency in both time and space complexity.

A.20 DISCUSSION ABOUT THE BALENTACQ FORM

In this section, we discuss further the underlying motivation of the ratio form of our BalEnt[x]. We design BalEnt[x] to be a generalized probability as a ratio between the marginalized joint entropy MJEnt[x] and the augmented Shannon entropy H(Y) + log 2. As shown in Appendix A.5 for the proof of Theorem 4.1, BalEnt[x] has a natural interpretation as a generalized estimation error probability leveraging Fano's inequality, i.e., −∞ < BalEnt[x] ≤ 1. The rigorous interpretation of the negative probability is an open problem. Still, we can intuitively understand that a negative probability implies that the model has sufficient knowledge to estimate the value of P_i^+ given the information of Y up to the specified precision level. It suggests that a phase transition in the model's current knowledge occurs near the zero-probability point of BalEnt[x]: H(Y) is the minimum number of nats (or bits) needed to encode the information of Y, which is the fundamental information limit when observing labels, and we require an additional log 2 nats given our choice of the precision level. That is why we call the quantity BalEnt[x] a balanced entropy: it captures the information balance between the model and the label.

We use an Adam optimizer with a learning rate of 0.0003 and 500 epochs in each experiment.
Compared to MC-dropout Bayesian neural network models, we observe that convergence with variational dropout is less stable, requiring many more epochs if we newly train the model at each active learning iteration in both cases. Therefore, we continue training the model from the previously trained model at each iteration (except the initial one) so that convergence is more stable. Below is the architecture used for our 3×CIFAR-10 experiment; for the 3×CIFAR-100 case, we simply change the output feature size to 100.

VariationalDropoutClassifier(
  (classifier): Sequential(
    (0): VariationalDropout(in_features=2048, out_features=1024)
    (1): VariationalDropout(in_features=1024, out_features=1024)
    (2): Linear(in_features=1024, out_features=10, bias=False)
  )
)

We observe a similar result to the MC-dropout Bayesian neural networks: our BalEntAcq consistently outperforms the other linearly scalable baselines, is eventually on par with BADGE on 3×CIFAR-10, and approaches close to BADGE on 3×CIFAR-100.

In this section, we report the active learning result on 3×CIFAR-10 when we train the model with an exact Gaussian process, which could be of independent interest. We use an Adam optimizer with a learning rate of 0.1 and 5,000 iterations in each experiment. We observe that the exact Gaussian process requires many more samples than the Bayesian neural network, but the observed behavior of each method is the same as before.

Figure 25: Active learning curves with the exact Gaussian process.

This section is moderately informal and a slight digression, but we think it is worth discussing. An active learning problem shares a weak similarity with a portfolio optimization problem (Markowitz, 1952). In portfolio optimization, we aim to minimize uncertainty while maximizing profit under correlated environments. Traditional mean-variance portfolio optimization is well known: the primary objective is minimizing the uncertainty (= risk), but the selection needs to be well diversified (Markowitz, 1952). In active learning, we typically aim to maximize the uncertainty while diversifying the selection under correlated data points, e.g., BADGE (Ash et al., 2020). In the modern dynamic financial system, the traditional way of optimization could still be exposed to a concentration risk (Levy & Zhang, 2019). To mitigate the concentration risk, equalizing the risk contribution, a.k.a.
the risk parity strategy, for each factor has been popularized after the financial crisis (Prince, 2011; Hurst et al., 2010; Chaves et al., 2011; Qian, 2011; Asness et al., 2012; Costa & Kwon, 2020). The critical idea of risk parity portfolio optimization is to find a good balance between different factors. If we look closely at the risk parity optimization equation (Costa & Kwon, 2020, e.g., see equation (3)), an entropy-like constraint plays a critical role, allowing the risk of each factor to be balanced. Although we cannot clearly state a tight relationship between our balanced entropy acquisition and the risk parity strategy, both goals are similar, so it would be interesting to explore the close relationship between our balanced entropy and the risk parity strategy.

