SCALABLE BATCH-MODE DEEP BAYESIAN ACTIVE LEARNING VIA EQUIVALENCE CLASS ANNEALING

Abstract

Active learning has demonstrated data efficiency in many fields. Existing active learning algorithms, especially in the context of batch-mode deep Bayesian active models, rely heavily on the quality of the model's uncertainty estimates and are often challenging to scale to large batches. In this paper, we propose Batch-BALANCE, a scalable batch-mode active learning algorithm that combines insights from decision-theoretic active learning, combinatorial information measures, and diversity sampling. At its core, Batch-BALANCE relies on a novel decision-theoretic acquisition function that facilitates differentiation among different equivalence classes. Intuitively, each equivalence class consists of hypotheses (e.g., posterior samples of deep neural networks) with similar predictions, and Batch-BALANCE adaptively adjusts the size of the equivalence classes as learning progresses. To scale up the computation of queries to large batches, we further propose an efficient batch-mode acquisition procedure, which aims to maximize a novel information measure defined through the acquisition function. We show that our algorithm can effectively handle realistic multi-class classification tasks and achieves compelling performance on several benchmark datasets for active learning under both small- and large-batch regimes. Reference code is released at https://github.com/zhangrenyuuchicago/BALanCe.

1. INTRODUCTION

Active learning (AL) (Settles, 2012) characterizes a collection of techniques that efficiently select data for training machine learning models. In the pool-based setting, an active learner selectively queries the labels of data points from a pool of unlabeled examples and incurs a certain cost for each label obtained. The goal is to minimize the total cost while achieving a target level of performance. A common practice for AL is to devise efficient surrogates, also known as acquisition functions, to assess the effectiveness of unlabeled data points in the pool. There has been a vast body of literature and empirical studies (Huang et al., 2010; Houlsby et al., 2011; Wang & Ye, 2015; Hsu & Lin, 2015; Huang et al., 2016; Sener & Savarese, 2017; Ducoffe & Precioso, 2018; Ash et al., 2019; Liu et al., 2020; Yan et al., 2020) suggesting a variety of heuristics as potential acquisition functions for AL. Among these methods, Bayesian Active Learning by Disagreement (BALD) (Houlsby et al., 2011) has attained notable success in the context of deep Bayesian AL, while maintaining the expressiveness of Bayesian models (Gal et al., 2017; Janz et al., 2017; Shen et al., 2017). Concretely, BALD relies on a most informative selection (MIS) strategy, a classical heuristic that dates back to Lindley (1956), which greedily queries the data point exhibiting the maximal mutual information with the model parameters at each iteration. Despite the overwhelming popularity of such heuristics due to their algorithmic simplicity (MacKay, 1992; Chen et al., 2015; Gal & Ghahramani, 2016), the performance of these AL algorithms is unfortunately sensitive to the quality of the uncertainty estimates of the underlying model, and accurately quantifying model uncertainty remains an open problem in deep AL, due to limited access to training data and the challenge of posterior estimation.
In figure 1, we demonstrate the potential issues of MIS-based strategies introduced by inaccurate posterior samples from a Bayesian Neural Network (BNN) on a multi-class classification dataset. Here, the samples (i.e., hypotheses) from the model posterior are grouped into equivalence classes (ECs) (Golovin et al., 2010) according to the Hamming distance between their predictions, as shown in figure 1a. Informally, an equivalence class contains hypotheses that are close in their predictions for a randomly selected set of examples (see section 2.2 for the formal definition). We note from figure 1b that the probability mass of the models sampled from the BNN is centered around the mode of the approximate posterior distribution, while little coverage is seen on models of higher accuracy. Consequently, MIS tends to select data points that reveal the maximal information w.r.t. the sampled distribution, rather than guiding the active learner towards learning high-accuracy models.

In addition to the robustness concern, another challenge for deep AL is scalability to large batches of queries. In many real-world applications, fully sequential data acquisition algorithms are often undesirable, especially for large models, as model retraining becomes the bottleneck of the learning system (Mittal et al., 2019; Ostapuk et al., 2019). Due to such concerns, batch-mode algorithms are designed to reduce the computational time spent on model retraining and to increase labeling efficiency. Unfortunately, for most acquisition functions, computing the optimal batch of queries is NP-hard (Chen & Krause, 2013); when the evaluation of the acquisition function is expensive or the pool of candidate queries is large, it is computationally challenging even to construct a batch greedily (Gal et al., 2017; Kirsch et al., 2019; Ash et al., 2019).
Recent efforts in scaling up batch-mode AL algorithms often involve diversity sampling strategies (Sener & Savarese, 2017; Ash et al., 2019; Citovsky et al., 2021; Kirsch et al., 2021a). Unfortunately, these diversity selection strategies either ignore the downstream learning objective (e.g., the clustering used by Citovsky et al. (2021)) or inherit the limitations of the sequential acquisition functions (e.g., the sensitivity to uncertainty estimates elaborated in figure 1 (Kirsch et al., 2021a)).

Motivated by these two challenges, this paper aims to simultaneously (1) mitigate the limitations of uncertainty-based deep AL heuristics due to inaccurate uncertainty estimation, and (2) enable efficient computation of batches of queries at scale. We propose Batch-BALANCE, an efficient batch-mode deep Bayesian AL framework, which employs a decision-theoretic acquisition function inspired by Golovin et al. (2010) and Chen et al. (2016). Concretely, Batch-BALANCE utilizes BNNs as the underlying hypotheses, and uses Monte Carlo (MC) dropout (Gal & Ghahramani, 2016; Kingma et al., 2015) or Stochastic gradient Markov Chain Monte Carlo (SG-MCMC) (Welling & Teh, 2011; Chen et al., 2014; Ding et al., 2014; Li et al., 2016a) to estimate the model posterior. It then selects points that can most effectively tell apart hypotheses from different equivalence classes (as illustrated in figure 1). Intuitively, such disagreement structure is induced by the pool of unlabeled data points; therefore our selection criterion takes into account the informativeness of a query with respect to the target models (as done in BALD), while putting less focus on differentiating models with little disagreement on the target data distribution. As learning progresses, Batch-BALANCE adaptively anneals the radii of the equivalence classes, resulting in the selection of more "difficult examples" that distinguish more similar hypotheses as the model accuracy improves (section 3.1).
When computing queries in small batches, Batch-BALANCE employs an importance sampling strategy to efficiently compute the expected gain in differentiating equivalence classes for a batch of examples and chooses samples within a batch in a greedy manner. To scale up the computation of queries to large batches, we further propose an efficient batch-mode acquisition procedure, which aims to maximize a novel combinatorial information measure (Kothawade et al., 2021) defined through our novel acquisition function. The resulting algorithm can efficiently scale to realistic batched learning tasks with reasonably large batch sizes (section 3.2, section 3.3, appendix B). Finally, we demonstrate the effectiveness of variants of Batch-BALANCE via an extensive empirical study, and show that they achieve compelling performance (sometimes by a large margin) on several benchmark datasets (section 4, appendix D) for both small and large batch settings.

2. BACKGROUND AND PROBLEM SETUP

In this section, we introduce useful notations and formally state the (deep) Bayesian AL problem. We then describe two important classes of existing AL algorithms along with their limitations, as a warm-up discussion before introducing our algorithm in section 3.

2.1. PROBLEM SETUP

Notations We consider pool-based Bayesian AL, where we are given an unlabeled dataset $D_{\text{pool}}$ drawn i.i.d. from some underlying data distribution. Further, assume a labeled dataset $D_{\text{train}}$ and a set of hypotheses $\mathcal{H} = \{h_1, \dots, h_n\}$. We would like to distinguish a set of (unknown) target hypotheses among the ground set of hypotheses $\mathcal{H}$. Let $H$ denote the random variable that represents the target hypotheses. Let $p(H)$ be a prior distribution over the hypotheses. In this paper, we resort to BNNs with parameters $\omega \sim p(\omega \mid D_{\text{train}})$.

Problem statement An AL algorithm selects samples from $D_{\text{pool}}$ and queries their labels from experts. The experts provide the label $y$ for a given query $x \in D_{\text{pool}}$. We assume labeling each query $x$ incurs a unit cost. Our goal is to find an adaptive policy for selecting samples that allows us to find a hypothesis with target error rate $\sigma \in [0, 1]$ while minimizing the total cost of the queries. Formally, a policy $\pi$ is a mapping from the labeled dataset $D_{\text{train}}$ to samples in $D_{\text{pool}}$. We use $D^{\pi}_{\text{train}}$ to denote the set of examples chosen by $\pi$. Given the labeled dataset $D^{\pi}_{\text{train}}$, we define $p_{\text{ERR}}(\pi)$ as the expected error probability w.r.t. the posterior $p(\omega \mid D^{\pi}_{\text{train}})$. Let the cost of a policy $\pi$ be $\text{cost}(\pi) \triangleq \max |D^{\pi}_{\text{train}}|$, i.e., the maximum number of queries made by policy $\pi$ over all possible realizations of the target hypothesis $H \in \mathcal{H}$. Given a tolerance parameter $\sigma \in [0, 1]$, we seek a policy with the minimal cost such that, upon termination, its expected error probability is at most $\sigma$. Formally, we seek

$$\arg\min_{\pi} \text{cost}(\pi), \quad \text{s.t. } p_{\text{ERR}}(\pi) \le \sigma.$$

2.2. THE EQUIVALENCE-CLASS-BASED SELECTION CRITERION

As alluded to in section 1 and figure 1, the MIS strategy can be ineffective when the samples from the model posterior are heavily biased and clustered around sub-optimal hypotheses. We refer the readers to appendix A.1 for details of a stylized example where a MIS-based strategy (such as BALD) can perform arbitrarily worse than the optimal policy. A "smarter" strategy would instead leverage the structure of the hypothesis space induced by the underlying (unlabeled) pool of data points. In fact, this idea connects to an important problem for approximate AL, which is often cast as learning equivalence classes (Golovin et al., 2010):

Definition 2.1 (Equivalence Class) Let $(\mathcal{H}, d)$ be a metric space where $\mathcal{H}$ is a hypothesis class and $d$ is a metric. For a given set $V \subseteq \mathcal{H}$ and centers $S = \{s_1, \dots, s_k\} \subseteq V$ of size $k$, let $r^S : V \to [k]$ be a partition function over $V$ and $D_i := \{h \in V \mid r^S(h) = i\}$, such that $\forall i, j \in [k]$, $r^S(s_i) = i$ and $\forall h \in D_i$, $d(h, s_i) \le d(h, s_j)$. Each $D_i \subseteq V$ is called an equivalence class induced by $s_i \in S$.

Consider a pool-based AL problem with hypothesis space $\mathcal{H}$, a sampled set $V \subseteq \mathcal{H}$, and an unlabeled dataset $\hat{D}_{\text{pool}}$ which is drawn i.i.d. from the underlying data distribution. Each hypothesis $h \in \mathcal{H}$ can be represented by a vector $v_h$ indicating its predictions on all samples in $\hat{D}_{\text{pool}}$. We can construct equivalence classes on the sampled hypotheses $V$ using the Hamming distance, denoted $d_H(h, h')$, and a given number $k$ of equivalence classes. Let

$$d^S_H(V) := \max_{h, h' \in V :\, r^S(h) = r^S(h')} d_H(h, h')$$

be the maximal diameter of the equivalence classes induced by $S$. The error rates of any unordered pair of hypotheses $\{h, h'\}$ that lie in the same equivalence class are therefore at most $d^S_H(V)$ away from each other. If we construct the $k$ equivalence-class-inducing centers (as in definition 2.1) as the solution of the max-diameter clustering problem, $C = \arg\min_{|S| = k} d^S_H(V)$, we obtain the minimal worst-case relative error (i.e., difference in error rate) between any hypothesis pair $\{h, h'\}$ that lies in the same equivalence class. We denote by $E = \{\{h, h'\} : r^C(h) \ne r^C(h')\}$ the set of all (unordered) pairs of hypotheses (i.e., undirected edges) corresponding to different equivalence classes with centers in $C$.

Existing EC-based AL algorithms (e.g., EC$^2$ (Golovin et al., 2010) as described in appendix A.2 and ECED (Chen et al., 2016) as in appendix A.3) are not directly applicable to deep Bayesian AL tasks. This is because computing the acquisition function (equation 4 and equation 5) requires integrating over the hypothesis space, which is intractable for large models (such as deep BNNs). Moreover, it is nontrivial to extend these algorithms to the batch-mode setting, since the number of possible candidate batches and the number of label configurations for a candidate batch grow exponentially with the batch size. Therefore, we need efficient approaches to approximate the ECED acquisition function when dealing with BNNs in both the fully sequential setting and the batch-mode setting.
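To make the construction in this subsection concrete, the following is a minimal sketch of how hypotheses can be compared by Hamming distance, partitioned around centers as in definition 2.1, and turned into a set of cross-class edges. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the Hamming distance is normalized by the number of points, and a greedy farthest-first traversal is swapped in as a heuristic stand-in for the exact max-diameter clustering.

```python
import numpy as np

def hamming_matrix(preds):
    """Pairwise normalized Hamming distance between hypotheses.

    preds: (n_hyp, n_points) array of predicted labels on the unlabeled set.
    """
    preds = np.asarray(preds)
    # Compare every pair of prediction vectors and average the disagreements.
    return (preds[:, None, :] != preds[None, :, :]).mean(axis=2)

def farthest_first_centers(dist, k):
    """Greedy farthest-first traversal (heuristic, NOT an exact solver
    for the max-diameter clustering problem)."""
    centers = [0]
    while len(centers) < k:
        d_to_centers = dist[:, centers].min(axis=1)
        centers.append(int(d_to_centers.argmax()))
    return centers

def equivalence_classes(dist, centers):
    """Assign each hypothesis to its nearest center (definition 2.1)."""
    return dist[:, centers].argmin(axis=1)

def cross_class_edges(assign):
    """Unordered pairs of hypotheses lying in different equivalence classes."""
    n = len(assign)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if assign[i] != assign[j]}

# Hypothetical toy data: 4 hypotheses, 6 unlabeled points.
preds = np.array([
    [0, 0, 1, 1, 2, 2],
    [0, 0, 1, 1, 2, 1],
    [1, 1, 0, 0, 2, 2],
    [1, 1, 0, 0, 2, 0],
])
dist = hamming_matrix(preds)
centers = farthest_first_centers(dist, k=2)
assign = equivalence_classes(dist, centers)
edges = cross_class_edges(assign)
```

In this toy example, hypotheses 0 and 1 end up in one class and hypotheses 2 and 3 in the other, so the edge set contains exactly the four cross-class pairs.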

3. OUR APPROACH

We first introduce our acquisition function for the sequential setting, namely BALANCE (as in Bayesian Active Learning via Equivalence Class Annealing), and then present the batch-mode extension under both small and large batch-mode AL settings.

3.1. THE BALANCE ACQUISITION FUNCTION

We resort to Monte Carlo methods to estimate the acquisition function. Given all available labeled samples $D_{\text{train}}$ at each iteration, hypotheses $\omega$ are sampled from the BNN posterior. We instantiate our methods with two different BNN posterior sampling approaches: MC dropout (Gal & Ghahramani, 2016) and cSG-MCMC (Zhang et al., 2019). MC dropout is easy to implement and scales efficiently to large models and datasets (Kirsch et al., 2019; Gal & Ghahramani, 2016; Gal et al., 2017). However, it is often poorly calibrated (Foong et al., 2020; Fortuin et al., 2021). cSG-MCMC is more practical and has high fidelity to the true posterior (Zhang et al., 2019; Fortuin et al., 2021; Wenzel et al., 2020).

In order to determine whether there is an edge $\{\hat\omega, \hat\omega'\}$ that connects a pair of sampled hypotheses $\hat\omega, \hat\omega'$ (i.e., whether they are in different equivalence classes), we calculate the Hamming distance $d_H(\hat\omega, \hat\omega')$ between the predictions of $\hat\omega$ and $\hat\omega'$ on the unlabeled dataset $\hat{D}_{\text{pool}}$. If the distance is greater than some threshold $\tau$, we consider the edge $\{\hat\omega, \hat\omega'\} \in \hat{E}$; otherwise not. We define the acquisition function of BALANCE for a set $x_{1:b} \triangleq \{x_1, \dots, x_b\}$ as:

$$\Delta_{\text{BALANCE}}(x_{1:b} \mid D_{\text{train}}) \triangleq \mathbb{E}_{y_{1:b}} \mathbb{E}_{\omega, \omega' \sim p(\omega \mid D_{\text{train}})} \, \mathbb{1}_{d_H(\omega, \omega') > \tau} \cdot \left(1 - \lambda_{\omega, y_{1:b}} \lambda_{\omega', y_{1:b}}\right) \quad (1)$$

where $\lambda_{\omega, y_{1:b}} \triangleq \frac{p(y_{1:b} \mid \omega, x_{1:b})}{\max_{y'_{1:b}} p(y'_{1:b} \mid \omega, x_{1:b})}$ is the likelihood ratio (Chen et al., 2016), and $\mathbb{1}_{d_H(\omega, \omega') > \tau}$ is the indicator function. We can adaptively anneal $\tau$ by setting $\tau$ proportional to the BNN's validation error rate $\varepsilon$ in each AL iteration.

In practice, we cannot directly compute equation 1; instead we estimate it with sampled BNN posteriors. We first acquire $K$ pairs of BNN posterior samples $\{\hat\omega_k, \hat\omega'_k\}$. The Hamming distances $d_H(\hat\omega_k, \hat\omega'_k)$ between these pairs of BNN posterior samples are computed. Next, we calculate the weight discount factor $1 - \lambda_{\hat\omega_k, y_{1:b}} \lambda_{\hat\omega'_k, y_{1:b}}$ for each possible label configuration $y_{1:b}$ and each pair $\{\hat\omega_k, \hat\omega'_k\}$ with $d_H(\hat\omega_k, \hat\omega'_k) > \tau$. Finally, we take the expectation of the discounted weight over all $y_{1:b}$ configurations. In summary, $\Delta_{\text{BALANCE}}(x_{1:b})$ is approximated as

$$\frac{1}{2K^2} \sum_{y_{1:b}} \left( \sum_{k=1}^{K} \left( p(y_{1:b} \mid \hat\omega_k) + p(y_{1:b} \mid \hat\omega'_k) \right) \right) \sum_{k=1}^{K} \mathbb{1}_{d_H(\hat\omega_k, \hat\omega'_k) > \tau} \left( 1 - \lambda_{\hat\omega_k, y_{1:b}} \lambda_{\hat\omega'_k, y_{1:b}} \right). \quad (2)$$

$D_{\text{train}}$ is omitted for simplicity of notation. Note that in our algorithms we never explicitly construct equivalence classes on BNN posterior samples, because (1) it is intractable to find the exact solution of the max-diameter clustering problem and (2) an explicit partitioning of the hypothesis samples tends to introduce "unnecessary" edges whose incident hypotheses are close to each other (e.g., if a pair of hypotheses lies on the adjacent edge between two hypothesis partitions), and may therefore overestimate the utility of a query. Nevertheless, we conducted an empirical study of a variant of BALANCE with explicit partitioning (which underperforms BALANCE). We defer a detailed discussion of this approach, as well as the empirical study, to appendix D.4.

Algorithm 1 Batch-BALANCE
[...]
4:     $A_B \leftarrow$ GreedySelection($D_{\text{pool}}$, $\hat{D}_{\text{pool}}$, $\{\hat\omega_k, \hat\omega'_k\}_{k=1}^{K}$, $\tau$, $B$)    # see section 3.2
5: else
6:     downsample subset $\mathcal{C} \subset D_{\text{pool}}$ with $p(x) \sim \Delta_{\text{BALANCE}}(x)^{\beta}$
7:     $S_{1:B}, \mu_{1:B} \leftarrow$ BALANCE-Clustering($\mathcal{C}$, $\hat{D}_{\text{pool}}$, $\{\hat\omega_k, \hat\omega'_k\}_{k=1}^{K}$, $\tau$, $\beta$, $B$)    # see section 3.3
8:     $A_B \leftarrow \mu_{1:B}$
9: output: $A_B$

In the fully sequential setting, we choose the one sample $x$ with the top $\Delta_{\text{BALANCE}}(x)$ in each AL iteration. In the batch-mode setting, we consider two strategies for selecting samples within a batch: a greedy selection strategy for small batches and an acquisition-function-driven clustering strategy for large batches. We refer to our full algorithm as Batch-BALANCE (algorithm 1) and expand on the batch-mode extensions in the following two subsections.
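The Monte Carlo estimate of equation 2 can be sketched for a single candidate point (b = 1) as follows. This is a minimal illustration rather than the reference implementation: the function name `balance_score` and its argument layout are assumptions, and the Hamming distances between paired posterior samples are assumed to have been computed beforehand on the unlabeled set.

```python
import numpy as np

def balance_score(p, p_prime, d_hamming, tau):
    """Estimate of equation 2 for one candidate x (b = 1).

    p, p_prime: (K, C) class probabilities p(y | w_k, x) and p(y | w'_k, x)
                for K pairs of posterior samples (hypothetical layout).
    d_hamming:  (K,) Hamming distance between the two members of each pair.
    tau:        edge threshold.
    """
    p = np.asarray(p, dtype=float)
    p_prime = np.asarray(p_prime, dtype=float)
    edge = np.asarray(d_hamming, dtype=float) > tau      # indicator per pair
    K = p.shape[0]
    # Likelihood ratios: probability of y divided by the max over labels.
    lam = p / p.max(axis=1, keepdims=True)
    lam_prime = p_prime / p_prime.max(axis=1, keepdims=True)
    # Sum over pairs of the edge-weighted discount, for each label y.
    discount = (edge[:, None] * (1.0 - lam * lam_prime)).sum(axis=0)
    # Sum over pairs of p(y | w_k) + p(y | w'_k), for each label y.
    label_weight = (p + p_prime).sum(axis=0)
    return float((label_weight * discount).sum() / (2 * K ** 2))

# Hypothetical example: the first pair disagrees completely and is an edge,
# the second pair is identical and is not an edge.
score = balance_score(
    p=[[1.0, 0.0], [0.5, 0.5]],
    p_prime=[[0.0, 1.0], [0.5, 0.5]],
    d_hamming=[0.5, 0.0],
    tau=0.1,
)
```

A point on which paired hypotheses from different equivalence classes agree receives a score of zero, as does any point when the threshold exceeds every pairwise distance.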

3.2. GREEDY SELECTION STRATEGY

To avoid the combinatorial explosion in the number of possible batches, the greedy selection strategy selects the sample $x$ with maximum $\Delta_{\text{BALANCE}}(x_{1:b-1} \cup \{x\})$ in the $b$-th step of a batch. However, the number of label configurations $y_{1:b}$ of a subset $x_{1:b}$ grows exponentially with the subset size $b$. In order to efficiently estimate $\Delta_{\text{BALANCE}}(x_{1:b})$, we employ an importance sampling method. The current $M$ configuration samples of $y_{1:b}$ are drawn by concatenating the previously drawn $M$ samples of $y_{1:b-1}$ with $M$ samples of $y_b$ (drawn from the proposal distribution). The pseudocode for the greedy selection strategy is provided in algorithm 2. We refer the readers to appendix B.2 for details of importance sampling and to appendix B.3 for details of the efficient implementation.

Algorithm 2 GreedySelection
1: input: a set of samples $D$, $\hat{D}_{\text{pool}}$, $\{\hat\omega_k, \hat\omega'_k\}_{k=1}^{K}$, threshold $\tau$, and $B$
2: $A_0 = \emptyset$
3: for $b \in [B]$ do
4:     for all $x \in D \setminus A_{b-1}$ do
5:         $s_x \leftarrow \Delta_{\text{BALANCE}}(A_{b-1} \cup \{x\})$
6:     $x_b \leftarrow \arg\max_{x \in D \setminus A_{b-1}} s_x$
7:     $A_b \leftarrow A_{b-1} \cup \{x_b\}$
8: output: batch $A_B = \{x_1, \dots, x_B\}$

Algorithm 3 BALANCE-Clustering
1: input: $\mathcal{C} \subset D_{\text{pool}}$, $\hat{D}_{\text{pool}}$, $\{\hat\omega_k, \hat\omega'_k\}_{k=1}^{K}$, threshold $\tau$, coldness parameter $\beta$, and cluster number $B$
2: sample initial centroids $O = \{\mu_j\}_{j=1}^{B} \subset \mathcal{C}$ with $p(x) \sim \Delta_{\text{BALANCE}}(x)^{\beta}$
3: while $O$ not converged do
4:     for all $x \in \mathcal{C}$ do
5:         $a_x \leftarrow \arg\max_j I_{\Delta_{\text{BALANCE}}}(x, \mu_j)$
6:     $S_j \leftarrow \{x \in \mathcal{C} : a_x = j\}$
7:     for all $j \in [B]$ do
8:         $\mu_j \leftarrow \arg\max_{y \in S_j} \sum_{x \in S_j} I_{\Delta_{\text{BALANCE}}}(x, y)$
9: output: $S_{1:B}, \mu_{1:B}$
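The control flow of algorithm 2 can be sketched generically as below. This is an assumption-laden illustration: `score_fn` stands in for any set-valued acquisition function such as the estimate of $\Delta_{\text{BALANCE}}$, and the toy coverage score used in the example is purely hypothetical and is not the paper's acquisition function. Importance sampling over label configurations is omitted.

```python
def greedy_selection(candidates, score_fn, batch_size):
    """Greedily build a batch: at step b, add the candidate x that
    maximizes score_fn(A_{b-1} + [x])."""
    batch = []
    remaining = list(candidates)
    for _ in range(min(batch_size, len(remaining))):
        best = max(remaining, key=lambda x: score_fn(batch + [x]))
        batch.append(best)
        remaining.remove(best)
    return batch

# Hypothetical stand-in score: number of distinct "edges" covered by the batch.
covers = {
    "a": {1, 2, 3},
    "b": {3, 4, 5},
    "c": {1, 2},
    "d": {5},
}

def coverage_score(batch):
    covered = set()
    for x in batch:
        covered |= covers[x]
    return len(covered)

batch = greedy_selection(covers.keys(), coverage_score, batch_size=2)
```

In the toy example, "a" is picked first because it covers the most edges on its own, and "b" is picked second because it adds the most new coverage given "a".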

3.3. STOCHASTIC BATCH SELECTION WITH POWER SAMPLING AND BALANCE-CLUSTERING

A simple approach to applying our new acquisition function to large batches is stochastic batch selection (Kirsch et al., 2021a), where we randomly select a batch according to the power distribution $p(x) \sim \Delta_{\text{BALANCE}}(x)^{\beta}$. We call this algorithm PowerBALANCE.

Next, we further improve PowerBALANCE through a novel acquisition-function-driven clustering procedure. Inspired by Kothawade et al. (2021), we define a novel information measure $I_{\Delta_{\text{BALANCE}}}(x, y)$ for any two data samples $x$ and $y$ based on our acquisition function:

$$I_{\Delta_{\text{BALANCE}}}(x, y) = \Delta_{\text{BALANCE}}(x) + \Delta_{\text{BALANCE}}(y) - \Delta_{\text{BALANCE}}(\{x, y\})$$

Intuitively, $I_{\Delta_{\text{BALANCE}}}(x, y)$ captures the amount of overlap between $x$ and $y$ w.r.t. $\Delta_{\text{BALANCE}}$. It is therefore natural to use it as a similarity measure for clustering, and to use the cluster centroids as candidate queries. The BALANCE-Clustering algorithm is given in algorithm 3. Concretely, we first sample a subset $\mathcal{C} \subset D_{\text{pool}}$ with $p(x) \sim \Delta_{\text{BALANCE}}(x)^{\beta}$, similar to Kirsch et al. (2021a). BALANCE-Clustering then runs Lloyd's algorithm (with a non-Euclidean metric) to find $B$ cluster centroids (see lines 3-8 in algorithm 3): it takes the subset $\mathcal{C}$, $\{\hat\omega_k, \hat\omega'_k\}_{k=1}^{K}$, threshold $\tau$, coldness parameter $\beta$, and cluster number $B$ as input. It first samples initial centroids $O$ with $p(x) \sim \Delta_{\text{BALANCE}}(x)^{\beta}$. Then, it alternates between adjusting the clusters and the centroids until convergence, and outputs the $B$ cluster centroids as candidate queries.
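The two ingredients of this subsection, power sampling and Lloyd-style clustering on a similarity measure, can be sketched as follows. This is a simplified illustration under assumptions: the pairwise similarities $I_{\Delta_{\text{BALANCE}}}(x, y)$ are assumed to be precomputed into a matrix `sim`, the function names are hypothetical, and the handling of empty clusters (keeping the old centroid) is a choice made here, not something specified in algorithm 3.

```python
import numpy as np

def power_sample(scores, size, beta, rng):
    """Sample `size` distinct indices with probability proportional to
    scores**beta. Requires at least `size` strictly positive scores."""
    w = np.asarray(scores, dtype=float) ** beta
    return rng.choice(len(w), size=size, replace=False, p=w / w.sum())

def similarity_clustering(sim, init_centroids, max_iter=100):
    """Lloyd-style iterations on a similarity matrix sim[x, y]."""
    centroids = list(init_centroids)
    for _ in range(max_iter):
        # Assignment step: each point joins its most similar centroid.
        assign = sim[:, centroids].argmax(axis=1)
        new_centroids = []
        for j, old in enumerate(centroids):
            members = np.flatnonzero(assign == j)
            if len(members) == 0:
                new_centroids.append(old)  # assumption: keep old centroid
                continue
            # Update step: the member with the largest total similarity
            # to all members of its cluster becomes the new centroid.
            within = sim[np.ix_(members, members)].sum(axis=0)
            new_centroids.append(int(members[within.argmax()]))
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return assign, centroids

# Hypothetical similarity matrix with two groups, {0, 1, 2} and {3, 4, 5}.
sim = np.array([
    [1.0, 0.8, 0.5, 0.1, 0.1, 0.1],
    [0.8, 1.0, 0.8, 0.1, 0.1, 0.1],
    [0.5, 0.8, 1.0, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.8, 0.5],
    [0.1, 0.1, 0.1, 0.8, 1.0, 0.8],
    [0.1, 0.1, 0.1, 0.5, 0.8, 1.0],
])
assign, centroids = similarity_clustering(sim, init_centroids=[0, 5])

rng = np.random.default_rng(0)
subset = power_sample([0.0, 1.0, 2.0, 3.0], size=2, beta=1.0, rng=rng)
```

Starting from the peripheral points 0 and 5, the centroids move to the most central member of each group (points 1 and 4), which are then returned as the candidate queries.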

4. EXPERIMENTS

In this section, we demonstrate the efficacy of Batch-BALANCE on several diverse datasets, under both small-batch and large-batch settings. In the main paper, we focus on accuracy as the key performance metric, as is common in the literature; supplemental results with different evaluation metrics, including macro-average AUC, F1, and NLL, are provided in appendix D.

Datasets

In the main paper, we consider four datasets (i.e., MNIST (LeCun et al., 1998), Repeated-MNIST (Kirsch et al., 2019), Fashion-MNIST (Xiao et al., 2017) and EMNIST (Cohen et al., 2017)) as benchmarks for the small-batch setting, and two datasets (i.e., SVHN (Netzer et al., 2011) and CIFAR-10 (Krizhevsky et al., 2009)) as benchmarks for the large-batch setting. The reason for this split is that, for the more challenging classification tasks on SVHN and CIFAR-10, the performance improvement of all baseline algorithms from a small batch (e.g., with batch size < 50) is hardly visible. We split each dataset into an unlabeled AL pool $D_{\text{pool}}$, an initial training dataset $D_{\text{train}}$, a validation dataset $D_{\text{val}}$, a test dataset $D_{\text{test}}$ and an unlabeled dataset $\hat{D}_{\text{pool}}$. $\hat{D}_{\text{pool}}$ is only used for calculating the Hamming distance between hypotheses and is never used for training BNNs. For more experimental details about the datasets, see appendix C.

BNN models

At each AL iteration, we sample BNN posteriors given the acquired training dataset and select samples from $D_{\text{pool}}$ to query labels according to the acquisition function of a chosen algorithm. To avoid overfitting, we train the BNNs with MC dropout at each iteration with early stopping. For MNIST, Repeated-MNIST, EMNIST, and Fashion-MNIST, we terminate the training of BNNs with a patience of 3 epochs. For SVHN and CIFAR-10, we terminate the training of BNNs with a patience of 20 epochs. The BNN with the highest validation accuracy is picked and used to calculate the acquisition functions. Additionally, we use a weighted cross-entropy loss for training the BNN to mitigate the bias introduced by imbalanced training data. The BNN models are reinitialized in each AL iteration, similar to Gal et al. (2017) and Kirsch et al. (2019). Reinitialization decorrelates subsequent acquisitions, since the final model performance depends on the particular initialization. We use the Adam optimizer (Kingma & Ba, 2017) for all the models in the experiments. For cSG-MCMC, we use ResNet-18 (He et al., 2016) and run 400 epochs in each AL iteration. We set the number of cycles to 8 and the initial step size to 0.5. In each cycle, 3 samples are collected.

Acquisition criterion for Batch-BALANCE under different batch sizes For small AL batches with $B < 50$, Batch-BALANCE takes the greedy selection approach. For large AL batches with $B \ge 50$, Batch-BALANCE takes the clustering approach described in section 3.3. In the small batch-mode setting, if $b < 4$, Batch-BALANCE enumerates all $y_{1:b}$ configurations to compute the acquisition function $\Delta_{\text{(Batch-)BALANCE}}$ according to equation 2; otherwise, it uses $M = 10{,}000$ MC samples of $y_{1:b}$ and importance sampling to estimate $\Delta_{\text{Batch-BALANCE}}$ according to equation 6. All our results report the median of 6 trials, with lower and upper quartiles.
Baselines For the small-batch setting, we compare Batch-BALANCE with Random, Variation Ratio (Freeman & Freeman, 1965), Mean STD (Kendall et al., 2015) and BatchBALD. To the best of the authors' knowledge, BatchBALD still achieves state-of-the-art performance for deep Bayesian AL with small batches. For the large-batch setting, it is no longer feasible to run BatchBALD (Citovsky et al., 2021); we consider other baseline models both in the Bayesian setting (e.g., PowerBALD) and in the non-Bayesian setting (e.g., CoreSet and BADGE).

4.2. COMPUTATIONAL COMPLEXITY ANALYSIS

Table 1 shows the computational complexity of the batch-mode AL algorithms evaluated in this paper. Here, $C$ denotes the number of classes, $B$ denotes the acquisition size, $K$ is the number of pairs of posterior samples and $M$ is the number of samples of $y_{1:b}$ configurations. We assume the number of hidden units is $H$. $T$ is the number of iterations for BALANCE-Clustering to converge and is usually less than 5. In figure 2 we plot the computation time for a single batch (in seconds) for different algorithms. As the batch size increases, the variants of Batch-BALANCE (Batch-BALANCE itself and its special case PowerBALANCE) both outperform CoreSet in run time. In later subsections, we will demonstrate that this gain in computational efficiency does not come at the cost of performance. We refer interested readers to appendix B.4 for an extended discussion of computational complexity.

We then compare Batch-BALANCE with other baseline methods on three datasets with balanced classes: Repeated-MNIST, Fashion-MNIST and EMNIST-Balanced. The acquisition size $B$ is 10 for Repeated-MNIST and Fashion-MNIST and 5 for the EMNIST-Balanced dataset. The threshold $\tau$ of Batch-BALANCE is annealed by setting $\tau = \varepsilon/4$. The learning curves of accuracy are shown in figure 3(d)-(f). On the Repeated-MNIST dataset, BALD performs poorly and is worse than random selection. BatchBALD is able to cope with the replication after a certain number of AL loops, which is aligned with the result shown in Kirsch et al. (2019). Batch-BALANCE beats all the other methods on this dataset. An ablation study on repetition number and performance can be found in appendix D.2. On the Fashion-MNIST dataset, Batch-BALANCE outperforms random selection whereas the other methods do not. On the EMNIST dataset, Batch-BALANCE is slightly better than BatchBALD.

AL algorithms                          Complexity
Mean STD                               $O(|D_{\text{pool}}|(CK + \log B))$
Variation Ratio                        $O(|D_{\text{pool}}|(CK + \log B))$
PowerBALD                              $O(|D_{\text{pool}}|(CK + \log B))$
BatchBALD                              $O(|D_{\text{pool}}| BMK)$
CoreSet (2-approx)                     $O(|D_{\text{pool}}| HB)$
BADGE                                  $O(|D_{\text{pool}}| HCB^2)$
PowerBALANCE                           $O(|D_{\text{pool}}|(C \cdot 2K + \log B))$
Batch-BALANCE (GreedySelection)        $O(|D_{\text{pool}}| BM \cdot 2K)$
Batch-BALANCE (BALANCE-Clustering)     $O(|D_{\text{pool}}| C \cdot 2K + |\mathcal{C}|^2 (C^2 \cdot 2K + T))$

We further compare different algorithms on two unbalanced datasets: EMNIST-ByMerge and EMNIST-ByClass. The threshold $\tau$ for Batch-BALANCE is set to $\varepsilon/4$ in each AL loop, with $B = 5$ and $K = 10$ for all the methods. As pointed out by Kirsch et al. (2019), BatchBALD performs poorly in unbalanced dataset settings. BALANCE and Batch-BALANCE can cope with the unbalanced data settings. The results are shown in figure 3(g) and (h). Further results on other datasets and under different metrics are provided in appendix D.
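The effect of annealing the threshold with the validation error, as in $\tau = \varepsilon/4$, can be illustrated with a small sketch. The function names and the pairwise distances below are hypothetical, and the distances are held fixed across error levels purely for illustration; in an actual AL run they would be recomputed from fresh posterior samples at every iteration.

```python
import numpy as np

def annealed_tau(val_error, divisor=4.0):
    """Threshold proportional to the validation error rate (tau = eps / divisor)."""
    return val_error / divisor

def edge_fraction(pair_distances, tau):
    """Fraction of posterior-sample pairs whose Hamming distance exceeds tau."""
    return float((np.asarray(pair_distances) > tau).mean())

# Hypothetical normalized Hamming distances between K = 5 pairs of samples.
pair_distances = np.array([0.01, 0.03, 0.06, 0.12, 0.20])

fractions = {
    eps: edge_fraction(pair_distances, annealed_tau(eps))
    for eps in (0.40, 0.20, 0.08)
}
```

As the validation error drops, the threshold shrinks and a larger share of pairs counts as edges, so the acquisition function starts to separate hypotheses that are increasingly similar.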

Batch-BALANCE with MC dropout

We test different AL models on two larger datasets with a larger batch size. The acquisition batch size $B$ is set to 1,000 and $\tau = \varepsilon/8$. We use VGG-11 as the BNN and train it on all the labeled data with a patience of 20 epochs in each AL iteration. The VGG-11 is trained using SGD with a fixed learning rate of 0.001 and momentum 0.9. The size of $\mathcal{C}$ for Batch-BALANCE is set to $2B$. Similar to PowerBALD (Kirsch et al., 2021a), we find that PowerBALANCE and Batch-BALANCE are insensitive to $\beta$ and that $\beta = 1$ works generally well. We thus set the coldness parameter $\beta = 1$ for all algorithms. The performance of different AL models on these two datasets is shown in figure 4(a) and (b). PowerBALD, PowerBALANCE, BADGE, and Batch-BALANCE achieve similar performance on the SVHN dataset. On the CIFAR-10 dataset, Batch-BALANCE shows compelling performance. Note that PowerBALANCE also performs well compared to the other methods.

Batch-BALANCE with cSG-MCMC We test different AL models with cSG-MCMC on CIFAR-10. The acquisition batch size $B$ is 5,000. The size of $\mathcal{C}$ for Batch-BALANCE is set to $3B$. In order to apply the CoreSet algorithm to a BNN, we use the average activations of all posterior samples' final fully-connected layers as the representations. For BADGE, we use the label with the maximum average predictive probability as the hallucinated label and use the average loss gradient of the last layer induced by the hallucinated label as the representation. We can see from figure 4(c) that Batch-BALANCE achieves the best performance.
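The way the CoreSet and BADGE baselines are adapted to a set of posterior samples can be sketched as below. This is one reading of the description above, with several assumptions made explicit: the function names and array layouts are hypothetical, the last layer is assumed to be linear with a softmax cross-entropy loss (so that the gradient with respect to its weights is the outer product of the prediction error and the layer input), and bias gradients are ignored.

```python
import numpy as np

def coreset_representation(activations):
    """Average final fully-connected activations over posterior samples.

    activations: (S, N, H) array for S posterior samples and N pool points.
    """
    return np.asarray(activations, dtype=float).mean(axis=0)

def badge_representation(probs, last_layer_inputs):
    """Average last-layer loss gradient induced by a hallucinated label.

    probs:             (S, N, C) predictive probabilities per posterior sample.
    last_layer_inputs: (S, N, H) inputs to the (assumed linear) last layer.
    Returns an (N, C * H) array of gradient embeddings.
    """
    probs = np.asarray(probs, dtype=float)
    h = np.asarray(last_layer_inputs, dtype=float)
    S, N, C = probs.shape
    # Hallucinated label: class with maximum average predictive probability.
    y_hat = probs.mean(axis=0).argmax(axis=1)
    onehot = np.eye(C)[y_hat]
    # Cross-entropy gradient w.r.t. last-layer weights: (p - onehot) outer h.
    grads = (probs - onehot[None])[:, :, :, None] * h[:, :, None, :]
    return grads.mean(axis=0).reshape(N, -1)

# Hypothetical example: S = 2 posterior samples, N = 1 point, C = 2, H = 2.
probs = np.array([[[0.8, 0.2]], [[0.6, 0.4]]])
hidden = np.array([[[1.0, 0.0]], [[0.0, 1.0]]])
core_rep = coreset_representation(hidden)
badge_rep = badge_representation(probs, hidden)
```

The resulting representations can then be passed to the usual CoreSet or BADGE selection routines in place of a single deterministic network's embeddings.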

5. RELATED WORK

Pool-based batch-mode active learning Batch-mode AL has shown promising performance for practical AL tasks. Recent works, including both Bayesian (Houlsby et al., 2011; Gal et al., 2017; Kirsch et al., 2019) and non-Bayesian approaches (Sener & Savarese, 2017; Ash et al., 2019; Citovsky et al., 2021; Kothawade et al., 2021; Hacohen et al., 2022; Karanam et al., 2022), are numerous, and we can hardly do them justice here. We mention those we believe are most relevant in the following. Among the Bayesian algorithms, [...] 2013) formalized a class of interactive optimization problems as adaptive submodular optimization problems and proved that a greedy batch-mode approach to these problems is near-optimal compared to the optimal batch selection policy. ELR (Roy & McCallum, 2001) focuses on a Bayesian estimate of the reduction in classification error and takes a one-step-look-ahead strategy. Inspired by ELR, WMOCU (Zhao et al., 2021) extends MOCU (Yoon et al., 2013) with a theoretical guarantee of convergence. However, none of these algorithms extend to the batch setting. Among the non-Bayesian approaches, Sener & Savarese (2017) proposed a CoreSet approach to select a subset of representative points as a batch. BADGE (Ash et al., 2019) selects samples by using the k-MEANS++ seeding algorithm on the $D_{\text{pool}}$ representations, which are the gradient embeddings of the DNN's last layer induced by hallucinated labels. Contemporary works propose AL algorithms for different settings, including text classification (Tan et al., 2021), domain shift and outliers (Kirsch et al., 2021b), the low-budget regime (Hacohen et al., 2022), very large batches (e.g., 100K or 1M) (Citovsky et al., 2021), and rare classes and OOD data (Kothawade et al., 2021).
Bayesian neural networks Bayesian methods have been shown to improve the generalization performance of DNNs (Hernández-Lobato & Adams, 2015; Blundell et al., 2015; Li et al., 2016b; Maddox et al., 2019), while providing principled representations of uncertainty. MCMC methods provide the gold standard of performance with smaller neural networks (Neal, 2012). SG-MCMC methods (Welling & Teh, 2011; Chen et al., 2014; Ding et al., 2014; Li et al., 2016a) provide a promising direction for sampling-based approaches in Bayesian deep learning. cSG-MCMC (Zhang et al., 2019) proposes a cyclical stepsize schedule, which generates samples with high fidelity to the true posterior (Fortuin et al., 2021; Izmailov et al., 2021). Another BNN posterior approximation is MC dropout (Gal & Ghahramani, 2016; Kingma et al., 2015). We investigate both the cSG-MCMC and MC dropout methods as representative BNN models in our empirical study.

Semi-supervised learning Semi-supervised learning leverages both unlabeled and labeled examples in the training process (Kingma et al., 2014; Rasmus et al., 2015). Some work has combined AL and semi-supervised learning (Wang et al., 2016; Sener & Savarese, 2017; Sinha et al., 2019). Our methods differ from these in that they never leverage unlabeled data to train the models, but rather use the unlabeled pool to inform the selection of data points for AL.

6. CONCLUSION AND DISCUSSION

We have proposed a scalable batch-mode deep Bayesian active learning framework, which leverages the hypothesis structure captured by equivalence classes without explicitly constructing them. Batch-BALANCE selects a batch of samples at each iteration which can reduce the overhead of retraining the model and save labeling effort. By combining insights from decision-theoretic active learning and diversity sampling, the proposed algorithms achieve compelling performance efficiently on active learning benchmarks both in small batch-and large batch-mode settings. Given the promising empirical results on the standard benchmark datasets explored in this paper, we are further interested in understanding the theoretical properties of the equivalence annealing algorithm under controlled studies as future work. (Shannon, 1948) . Kirsch et al. (2019) further proposed BatchBALD as an extension of BALD whereby the mutual information between a joint of multiple data points and the model parameters is estimated as

Appendices

$$\Delta_{\text{BatchBALD}}(x_{1:b} \mid D_{\text{train}}) \triangleq I(y_{1:b}; \omega \mid x_{1:b}, D_{\text{train}}).$$

Limitation of the BALD algorithm. BALD can be ineffective when the hypothesis samples are heavily biased and cluttered towards sub-optimal hypotheses. Below, we provide a concrete example where such a selection criterion may be undesirable.

Suppose the hypothesis class $\mathcal{H} = \{h_1, \dots, h_n\}$ is structured such that

$$d_{\mathrm{H}}(h_i, h_j) = \begin{cases} 2^{1-i} - 2^{1-j} & \text{if } i < j, \\ 2^{1-j} - 2^{1-i} & \text{o.w.} \end{cases}$$

where $d_{\mathrm{H}}(h_i, h_j)$ denotes the fraction of labels that $h_i$ and $h_j$ disagree upon when making predictions on i.i.d. samples of data points. We further assume that for any subset of hypotheses $S \subseteq \mathcal{H}$, there exists a data point whose label they agree upon. Assume each hypothesis $h_i$ has an equal probability and the target error rate is $\sigma$. On the one hand, note that BALD does not consider $d_{\mathrm{H}}(h_i, h_j)$, and therefore on average it requires $\log n$ examples to identify any target hypothesis. On the other hand, to achieve a target error rate of $\sigma$, one only needs to differentiate all pairs of hypotheses $h_i, h_j$ of distance $d_{\mathrm{H}}(h_i, h_j) > \sigma$ (i.e., by selecting training examples to rule out at least one of $h_i, h_j$). Therefore, a "smarter" AL policy could query examples to sequentially check the consistency of $h_1, h_2, \dots, h_n$ until all remaining hypotheses are within distance $\sigma$. It is easy to check that this requires $\log(1/\sigma)$ examples before reaching the error rate $\sigma$. The gap between BALD and the above policy, $\frac{\log n}{\log(1/\sigma)}$, could be large as $n$ increases.
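The gap in the stylized example can be checked numerically. The following is a minimal sketch under our own assumptions: logarithms are taken base 2, the sequential policy stops as soon as the diameter of the remaining set {h_k, ..., h_n} is at most σ, and the function names `d_h` and `sequential_checks` are hypothetical helpers introduced only for this illustration.

```python
import math


def d_h(i: int, j: int) -> float:
    """Disagreement rate between h_i and h_j in the stylized example (1-indexed)."""
    return abs(2.0 ** (1 - i) - 2.0 ** (1 - j))


def sequential_checks(n: int, sigma: float) -> int:
    """Number of hypotheses h_1, h_2, ... checked one by one before every
    remaining hypothesis is within distance sigma of the others."""
    for k in range(1, n + 1):
        # The remaining set {h_k, ..., h_n} has diameter d_h(k, n).
        if d_h(k, n) <= sigma:
            return k - 1
    return n - 1  # fallback, only reached for a negative sigma


n, sigma = 1024, 1 / 16
bisection_queries = math.log2(n)  # the log n rate attributed to BALD above
sequential_queries = sequential_checks(n, sigma)
print(bisection_queries, sequential_queries)
```

With n = 1024 and σ = 1/16, the sequential policy needs 4 checks, which matches log(1/σ), whereas log n is 10; increasing n leaves the first number unchanged.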

A.2 EQUIVALENCE CLASS EDGE CUTTING

Consider the problem statement in section 2.1. If $\sigma = 0$ and tests are noise-free, this problem can be solved near-optimally by the equivalence class edge cutting (EC$^2$) algorithm (Golovin et al., 2010). EC$^2$ employs an edge-cutting strategy based on a weighted graph $G = (\mathcal{H}, E)$, where vertices represent hypotheses and edges link hypotheses that we want to distinguish between. Here $E \triangleq \{\{h, h'\} : r(h) \neq r(h')\}$ contains all pairs of hypotheses that have different equivalence classes. We define a weight function $W : E \to \mathbb{R}_{\geq 0}$ by $W(\{h, h'\}) \triangleq p(h) \cdot p(h')$. A sample $x$ with label $y$ is said to "cut" an edge if at least one of the two hypotheses it links is inconsistent with $y$. Denote $E(x, y) \triangleq \{\{h, h'\} \in E : p(y \mid x, h) = 0 \lor p(y \mid x, h') = 0\}$ as the set of edges cut by labeling $x$ as $y$. The EC$^2$ objective is then defined as the total weight of edges cut by the current $D_{\text{train}}$:

$$f_{\text{EC}^2}(D_{\text{train}}) \triangleq W\Big(\bigcup_{(x,y) \in D_{\text{train}}} E(x, y)\Big).$$

The EC$^2$ algorithm greedily maximizes this objective per iteration. The acquisition function for EC$^2$ is

$$\Delta_{\text{EC}^2}(x \mid D_{\text{train}}) \triangleq \mathbb{E}_y\left[f_{\text{EC}^2}(D_{\text{train}} \cup \{(x, y)\}) - f_{\text{EC}^2}(D_{\text{train}}) \mid D_{\text{train}}\right]. \quad (4)$$
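To make the edge-cutting objective concrete, the following is a minimal sketch for a small, finite, noise-free hypothesis set. It is an illustration under our own assumptions rather than the implementation of Golovin et al. (2010): hypotheses are rows of a deterministic prediction table `preds`, the expectation over y is taken under the prior restricted to hypotheses consistent with the observations, and all function and variable names are hypothetical.

```python
import itertools

import numpy as np


def consistent_mask(preds, observed):
    """Boolean mask of hypotheses that agree with every observed (x, y) pair."""
    mask = np.ones(preds.shape[0], dtype=bool)
    for x, y in observed:
        mask &= preds[:, x] == y
    return mask


def ec2_objective(prior, classes, preds, observed):
    """Total weight p(h) * p(h') of between-class edges cut by the observations."""
    keep = consistent_mask(preds, observed)
    total = 0.0
    for h, g in itertools.combinations(range(len(prior)), 2):
        # An edge is cut once at least one endpoint is inconsistent.
        if classes[h] != classes[g] and not (keep[h] and keep[g]):
            total += prior[h] * prior[g]
    return total


def ec2_acquisition(prior, classes, preds, observed, x):
    """Expected gain in the EC^2 objective from querying data point x."""
    keep = consistent_mask(preds, observed)
    posterior = prior * keep
    posterior = posterior / posterior.sum()  # assumes some hypothesis is consistent
    base = ec2_objective(prior, classes, preds, observed)
    gain = 0.0
    for y in np.unique(preds[:, x]):
        p_y = posterior[preds[:, x] == y].sum()
        gain += p_y * (ec2_objective(prior, classes, preds, observed + [(x, y)]) - base)
    return gain


# Four equally likely hypotheses in two equivalence classes, two candidate points.
prior = np.full(4, 0.25)
classes = np.array([0, 0, 1, 1])
preds = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # preds[h, x] = label of x under h
print(ec2_acquisition(prior, classes, preds, [], 0))
print(ec2_acquisition(prior, classes, preds, [], 1))
```

In this toy table, point 0 separates the two equivalence classes exactly, so it cuts every between-class edge whichever label is observed, and it scores higher than point 1, which splits hypotheses within each class.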

A.3 THE EQUIVALENCE CLASS EDGE DISCOUNTING ALGORITHM

In the noisy setting, the acquisition function of the Equivalence Class Edge Discounting algorithm (ECED) (Chen et al., 2016) takes the undesired contribution of noise into account. Given a data point and its label $(x, y)$, ECED discounts all model parameters by their likelihood ratio:

$$\lambda_{h,y} \triangleq \frac{p(y \mid h, x)}{\max_{y'} p(y' \mid h, x)}.$$

After we get $D_{\text{train}}$, the value of assigning label $y$ to a data point $x$ is defined as the total amount of edge weight discounted:

$$\delta(x, y \mid D_{\text{train}}) \triangleq \sum_{\{h,h'\} \in E} p(h, D_{\text{train}})\, p(h', D_{\text{train}}) \cdot \left(1 - \lambda_{h,y} \lambda_{h',y}\right),$$

where $E = \{\{h, h'\} : r(h) \neq r(h')\}$ consists of all unordered pairs of hypotheses corresponding to different equivalence classes. Further, ECED augments the above value function $\delta$ with an offset value such that the value of a non-informative test is 0. The offset value of labeling $x$ as label $y$ is defined as:

$$\nu(x, y \mid D_{\text{train}}) \triangleq \sum_{\{h,h'\} \in E} p(h, D_{\text{train}})\, p(h', D_{\text{train}}) \cdot \left(1 - \max_h \lambda^2_{h,y}\right).$$

The overall acquisition function of ECED is:

$$\Delta_{\text{ECED}}(x \mid D_{\text{train}}) \triangleq \mathbb{E}_y\left[\delta(x, y \mid D_{\text{train}}) - \nu(x, y \mid D_{\text{train}})\right].$$

B ALGORITHMIC DETAILS

B.1 DERIVATION OF ACQUISITION FUNCTIONS OF BALANCE AND BATCH-BALANCE

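In each AL loop, the ECED algorithm selects a sample from the AL pool according to the acquisition function

$$\Delta_{\text{ECED}}(x \mid D_{\text{train}}) \triangleq \mathbb{E}_y\Big[\sum_{\{\omega,\omega'\} \in E} W_{\omega,\omega'} \Big(\big(1 - \lambda_{\omega,y} \lambda_{\omega',y}\big) - \big(1 - \max_{\omega} \lambda^2_{\omega,y}\big)\Big)\Big],$$

where $E$ is the set of all edges whose adjacent nodes lie in different equivalence classes and $\lambda_{\omega,y} = \frac{p(y \mid \omega)}{\max_{y'} p(y' \mid \omega)}$. $W_{\omega,\omega'}$ is the weight of edge $\{\omega, \omega'\}$, which is maintained by the ECED algorithm. After we observe the label $y$ of the selected $x$, we update the weights of all edges with $W_{\omega,\omega'} = W_{\omega,\omega'} \cdot p(y \mid \omega)\, p(y \mid \omega')$. In the deep Bayesian AL setting, the offset term $1 - \max_{\omega} \lambda^2_{\omega,y}$ can be removed when we use a deep BNN. However, we cannot enumerate all $\{\omega, \omega'\} \in E$ in this setting since there are an infinite number of hypotheses in the hypothesis space. Moreover, we cannot even estimate the acquisition function of ECED on a subset of sampled hypotheses by MC dropouts, since building equivalence classes with the best $\epsilon$ is NP-hard. If we sample $\{\omega, \omega'\}$ according to the posterior $p(\omega \mid D_{\text{train}})$ and check whether $\{\omega, \omega'\} \in \hat{E}$ by Hamming distance in the way we describe in section 3.1, we will get

$$\begin{aligned}
\Delta_{\text{ECED}}(x \mid D_{\text{train}}) &\approx \mathbb{E}_y\Big[\sum_{\{\omega,\omega'\} \in E} W_{\omega,\omega'} \left(1 - \lambda_{\omega,y} \lambda_{\omega',y}\right)\Big] \\
&\approx \mathbb{E}_y \mathbb{E}_{\omega,\omega' \sim p(\omega \mid D_{\text{train}})}\Big[\mathbb{1}_{d_{\mathrm{H}}(\omega,\omega') > \tau} \cdot \frac{W_{\omega,\omega'}}{p(\omega \mid D_{\text{train}})\, p(\omega' \mid D_{\text{train}})} \cdot \left(1 - \lambda_{\omega,y} \lambda_{\omega',y}\right)\Big] \\
&\propto \mathbb{E}_y \mathbb{E}_{\omega,\omega' \sim p(\omega \mid D_{\text{train}})}\Big[\mathbb{1}_{d_{\mathrm{H}}(\omega,\omega') > \tau} \cdot \left(1 - \lambda_{\omega,y} \lambda_{\omega',y}\right)\Big].
\end{aligned}$$

Inspired by the weight discounting mechanism of ECED, we define the acquisition function of BALANCE, $\Delta_{\text{BALANCE}}(x \mid D_{\text{train}})$, as

$$\Delta_{\text{BALANCE}}(x \mid D_{\text{train}}) \triangleq \mathbb{E}_y \mathbb{E}_{\omega,\omega' \sim p(\omega \mid D_{\text{train}})}\Big[\mathbb{1}_{d_{\mathrm{H}}(\omega,\omega') > \tau} \cdot \left(1 - \lambda_{\omega,y} \lambda_{\omega',y}\right)\Big].$$

After we get $K$ pairs of MC dropouts, the acquisition function $\Delta_{\text{BALANCE}}(x \mid D_{\text{train}})$ can be approximated as follows:

$$\begin{aligned}
\Delta_{\text{BALANCE}}(x \mid D_{\text{train}}) &= \mathbb{E}_{p(\omega \mid D_{\text{train}})} \mathbb{E}_{p(y \mid \omega)} \mathbb{E}_{\omega,\omega' \sim p(\omega \mid D_{\text{train}})}\Big[\mathbb{1}_{d_{\mathrm{H}}(\omega,\omega') > \tau} \left(1 - \lambda_{\omega,y} \lambda_{\omega',y}\right)\Big] \\
&\approx \sum_{\hat{y}} \Big(\frac{1}{2K} \sum_{k=1}^{K} \big[p(\hat{y} \mid \hat{\omega}_k) + p(\hat{y} \mid \hat{\omega}'_k)\big]\Big) \Big(\frac{1}{K} \sum_{k=1}^{K} \mathbb{1}_{d_{\mathrm{H}}(\hat{\omega}_k,\hat{\omega}'_k) > \tau} \big(1 - \lambda_{\hat{\omega}_k,\hat{y}} \lambda_{\hat{\omega}'_k,\hat{y}}\big)\Big).
\end{aligned}$$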
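The Monte Carlo approximation of the BALANCE acquisition function above can be sketched in a few lines of NumPy. This is a minimal illustration and not the released reference code: we assume the K paired posterior samples are already summarized by their class probabilities on the candidate point (`probs_a`, `probs_b`, each of shape K × C) and by their predicted labels on a held-out unlabeled set, from which the Hamming distances are computed. All names are hypothetical.

```python
import numpy as np


def hamming_distance(preds_a, preds_b):
    """Fraction of held-out points on which each pair of samples disagrees; shape (K,)."""
    return (np.asarray(preds_a) != np.asarray(preds_b)).mean(axis=1)


def balance_acquisition(probs_a, probs_b, dist, tau):
    """Monte Carlo estimate of the BALANCE acquisition value for one candidate point.

    probs_a, probs_b: (K, C) class probabilities under the paired samples.
    dist: (K,) Hamming distances between the paired samples.
    tau: distance threshold above which a pair counts as an edge.
    """
    probs_a = np.asarray(probs_a, dtype=float)
    probs_b = np.asarray(probs_b, dtype=float)
    num_pairs = probs_a.shape[0]
    # Likelihood ratios lambda = p(y | w) / max_y' p(y' | w).
    lam_a = probs_a / probs_a.max(axis=1, keepdims=True)
    lam_b = probs_b / probs_b.max(axis=1, keepdims=True)
    # Predictive label distribution averaged over all 2K samples.
    p_y = (probs_a + probs_b).sum(axis=0) / (2 * num_pairs)
    # Average discount over pairs, keeping only pairs farther apart than tau.
    far = (np.asarray(dist) > tau)[:, None]
    discount = (far * (1.0 - lam_a * lam_b)).mean(axis=0)
    return float(p_y @ discount)


probs_a = [[0.9, 0.1], [0.5, 0.5]]
probs_b = [[0.1, 0.9], [0.5, 0.5]]
dist = hamming_distance([[0, 1], [1, 1]], [[1, 1], [1, 1]])
print(balance_acquisition(probs_a, probs_b, dist, tau=0.1))
```

Raising `tau` above every pairwise distance sets all indicators to zero, so every candidate receives the same value of 0.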
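In the batch-mode setting, the acquisition function of Batch-BALANCE for a batch $x_{1:b}$ is

$$\Delta_{\text{Batch-BALANCE}}(x_{1:b} \mid D_{\text{train}}) \triangleq \mathbb{E}_{y_{1:b}} \mathbb{E}_{\omega,\omega' \sim p(\omega \mid D_{\text{train}})}\Big[\mathbb{1}_{d_{\mathrm{H}}(\omega,\omega') > \tau} \cdot \big(1 - \lambda_{\omega,y_{1:b}} \lambda_{\omega',y_{1:b}}\big)\Big].$$

Similar to the fully sequential setting, we can approximate $\Delta_{\text{Batch-BALANCE}}(x_{1:b} \mid D_{\text{train}})$ with $K$ pairs of MC dropouts. The $x_{1:b}$ are chosen in a greedy manner. For iteration $b$ inside a batch, the $x_{1:b-1}$ are fixed and $x_b$ is selected according to

$$\begin{aligned}
\Delta_{\text{Batch-BALANCE}}(x_{1:b} \mid D_{\text{train}}) &= \mathbb{E}_{p(\omega \mid D_{\text{train}})} \mathbb{E}_{p(y_{1:b} \mid \omega)} \mathbb{E}_{\omega,\omega' \sim p(\omega \mid D_{\text{train}})}\Big[\mathbb{1}_{d_{\mathrm{H}}(\omega,\omega') > \tau} \big(1 - \lambda_{\omega,y_{1:b}} \lambda_{\omega',y_{1:b}}\big)\Big] \\
&\approx \sum_{\hat{y}_{1:b}} \Big(\frac{1}{2K} \sum_{k=1}^{K} \big[p(\hat{y}_{1:b} \mid \hat{\omega}_k) + p(\hat{y}_{1:b} \mid \hat{\omega}'_k)\big]\Big) \Big(\frac{1}{K} \sum_{k=1}^{K} \mathbb{1}_{d_{\mathrm{H}}(\hat{\omega}_k,\hat{\omega}'_k) > \tau} \big(1 - \lambda_{\hat{\omega}_k,\hat{y}_{1:b}} \lambda_{\hat{\omega}'_k,\hat{y}_{1:b}}\big)\Big).
\end{aligned}$$

B.2 IMPORTANCE SAMPLING OF CONFIGURATIONS

When $b$ becomes large, it is infeasible to enumerate all label configurations $y_{1:b}$. We use $M$ MC samples of $y_{1:b}$ to estimate the acquisition function and importance sampling to further reduce the computational time. Given that $p(y_{1:b} \mid \omega)$ can be factorized as $p(y_{1:b-1} \mid \omega) \cdot p(y_b \mid \omega)$, the acquisition function can be written as:

$$\begin{aligned}
\Delta_{\text{Batch-BALANCE}}(x_{1:b} \mid D_{\text{train}}) &\triangleq \mathbb{E}_{y_{1:b}} \mathbb{E}_{\omega,\omega' \sim p(\omega \mid D_{\text{train}})}\Big[\mathbb{1}_{d_{\mathrm{H}}(\omega,\omega') > \tau} \big(1 - \lambda_{\omega,y_{1:b}} \lambda_{\omega',y_{1:b}}\big)\Big] \\
&= \mathbb{E}_{p(\omega \mid D_{\text{train}})} \mathbb{E}_{p(y_{1:b} \mid \omega)} \mathbb{E}_{\omega,\omega' \sim p(\omega \mid D_{\text{train}})}\Big[\mathbb{1}_{d_{\mathrm{H}}(\omega,\omega') > \tau} \big(1 - \lambda_{\omega,y_{1:b}} \lambda_{\omega',y_{1:b}}\big)\Big] \\
&= \mathbb{E}_{p(\omega \mid D_{\text{train}})} \mathbb{E}_{p(y_{1:b-1} \mid \omega)} \mathbb{E}_{p(y_b \mid \omega)} \mathbb{E}_{\omega,\omega' \sim p(\omega \mid D_{\text{train}})}\Big[\mathbb{1}_{d_{\mathrm{H}}(\omega,\omega') > \tau} \big(1 - \lambda_{\omega,y_{1:b}} \lambda_{\omega',y_{1:b}}\big)\Big].
\end{aligned}$$

Suppose we have $M$ samples of $y_{1:b-1}$ from $p(y_{1:b-1})$. We perform importance sampling using $p(y_{1:b-1})$ to estimate the acquisition function:

$$\begin{aligned}
\Delta_{\text{Batch-BALANCE}}(x_{1:b} \mid D_{\text{train}}) &= \mathbb{E}_{p(\omega \mid D_{\text{train}})} \mathbb{E}_{p(y_{1:b-1})}\Big[\frac{p(y_{1:b-1} \mid \omega)}{p(y_{1:b-1})}\, \mathbb{E}_{p(y_b \mid \omega)} \mathbb{E}_{\omega,\omega' \sim p(\omega \mid D_{\text{train}})}\big[\mathbb{1}_{d_{\mathrm{H}}(\omega,\omega') > \tau} \big(1 - \lambda_{\omega,y_{1:b}} \lambda_{\omega',y_{1:b}}\big)\big]\Big] \\
&= \mathbb{E}_{p(y_{1:b-1})} \mathbb{E}_{p(\omega \mid D_{\text{train}})} \mathbb{E}_{p(y_b \mid \omega)}\Big[\frac{p(y_{1:b-1} \mid \omega)}{p(y_{1:b-1})}\, \mathbb{E}_{\omega,\omega' \sim p(\omega \mid D_{\text{train}})}\big[\mathbb{1}_{d_{\mathrm{H}}(\omega,\omega') > \tau} \big(1 - \lambda_{\omega,y_{1:b}} \lambda_{\omega',y_{1:b}}\big)\big]\Big] \\
&\approx \frac{1}{M} \sum_{\hat{y}_{1:b-1}}^{M} \sum_{\hat{y}_b} \Big(\frac{1}{K} \sum_{k=1}^{K} \frac{p(\hat{y}_{1:b-1} \mid \hat{\omega}_k)\, p(\hat{y}_b \mid \hat{\omega}_k) + p(\hat{y}_{1:b-1} \mid \hat{\omega}'_k)\, p(\hat{y}_b \mid \hat{\omega}'_k)}{p(\hat{y}_{1:b-1})}\Big) \cdot \Big(\frac{1}{K} \sum_{k=1}^{K} \mathbb{1}_{d_{\mathrm{H}}(\hat{\omega}_k,\hat{\omega}'_k) > \tau} \big(1 - \lambda_{\hat{\omega}_k,\hat{y}_{1:b}} \lambda_{\hat{\omega}'_k,\hat{y}_{1:b}}\big)\Big) \\
&= \Big(\frac{1}{K}\, \mathbb{1}_{d_{\mathrm{H}}(\hat{\omega}_k,\hat{\omega}'_k) > \tau}^{\top} \Big(1 - \frac{\hat{P}_{1:b-1} \otimes \hat{P}_b}{\hat{A}_{1:b}} \odot \frac{\hat{P}'_{1:b-1} \otimes \hat{P}'_b}{\hat{A}'_{1:b}}\Big)\Big) \Big(\frac{1}{M}\, \frac{\hat{P}_{1:b-1}^{\top} \hat{P}_b + \hat{P}'^{\top}_{1:b-1} \hat{P}'_b}{\mathbf{1}^{\top} \big(\hat{P}_{1:b-1} + \hat{P}'_{1:b-1}\big)}\Big)^{\top}.
\end{aligned}$$

Here we save $p(\hat{y}_{1:b-1} \mid \hat{\omega}_k)$ and $p(\hat{y}_{1:b-1} \mid \hat{\omega}'_k)$ for the $M$ samples in $\hat{P}_{1:b-1}$ and $\hat{P}'_{1:b-1}$. The shape of $\hat{P}_{1:b-1}$ and $\hat{P}'_{1:b-1}$ is $K \times M$. $\odot$ is element-wise matrix multiplication and $\otimes$ is the outer-product operator along the first dimension. After the outer product operation, we can reshape the matrix by flattening all the dimensions after the 1st dimension. $\mathbf{1}$ is a matrix of 1s with shape $K \times 1$.

C.1 DATASETS USED IN THE MAIN PAPER

Kirsch et al. (2019). It consists of two blocks of [convolution, dropout, max-pooling, relu] followed by a two-layer MLP with one dropout layer between the two layers. The dropout probability is 0.5 in the dropout layers.

Repeated-MNIST. Kirsch et al. (2019) show that applying BALD to a dataset that contains many (near) replicated data points leads to poor performance. We again randomly split the MNIST training dataset, similar to the settings used on the MNIST dataset. We replicate all the samples in the AL pool two times and add isotropic Gaussian noise with a standard deviation of 0.1 after normalizing the dataset. The BNN architecture is the same as the one used on the MNIST dataset.

EMNIST. We further consider the EMNIST dataset under 3 different settings: EMNIST-Balanced, EMNIST-ByClass, and EMNIST-ByMerge. EMNIST-Balanced contains 47 classes with balanced digits and letters. EMNIST-ByMerge includes digits and letters for a total of 47 unbalanced classes. EMNIST-ByClass represents the most useful organization for classification as it contains the segmented digits and characters for 62 classes comprising [0-9], [a-z], and [A-Z]. We randomly split the training set into D val with 18,800 images, Dpool with 18,800 images and D pool with the rest of the samples. Similar to Kirsch et al. (2019), we do not use an initial dataset and instead perform the initial acquisition step with the randomly initialized model.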
The model architecture contains three blocks of [convolution, dropout, max-pooling, relu], with 32, 64, and 128 3x3 convolution filters and 2x2 max pooling. We add a two-layer MLP following the three blocks. There are 4 dropout layers in total, one in each block and one in the MLP, each with dropout probability 0.5.

Fashion-MNIST. Fashion-MNIST is a dataset of Zalando's article images that consists of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We randomly split the Fashion-MNIST training dataset into D val with 10,000 samples, Dpool with 10,000 samples, and D pool with the rest of the samples. We obtain an initial training dataset of 20 samples, with 2 samples per class randomly chosen from the AL pool. The model architecture is similar to the one used on the EMNIST dataset, with 10 units in the last MLP layer.

SVHN. We randomly select an initial training dataset with 5,000 samples, Dpool with 2,000 samples, and a validation dataset D val with 5,000 samples.

CIFAR-10. Similarly, for the CIFAR-10 dataset, we randomly select an initial training dataset with 5,000 samples, Dpool with 5,000 samples, and a validation dataset D val with 5,000 samples.

C.2 IMPLEMENTATION DETAILS ON THE EMPIRICAL EXAMPLE IN FIGURE 1

We show an empirical example in figure 1 to provide some intuition as to why BALANCE and Batch-BALANCE are effective in practice. We train a BNN with an imbalanced MNIST training subset that contains 28 images for each digit in [1-8] and 1 image each for digits 0 and 9. The cross-entropy loss is reweighted to balance the training dataset during training. We obtain 200 posterior samples of the BNN and use them to get the predictions on Dpool. We compute the Hamming distances between the predictions of all pairs of posterior samples and use these precomputed distances to plot the predictions with t-SNE (Van der Maaten & Hinton, 2008). The equivalence classes are approximated by the farthest-first traversal algorithm (FFT) (Gonzalez, 1985). In figure 1, the equivalence classes are highly imbalanced. The ground truth labels of the Dpool dataset represent the embedding of the target hypothesis. This figure highlights a scenario where the equivalence class-based methods, e.g., ECED and BALANCE, are better than BALD.

D SUPPLEMENTAL EMPIRICAL RESULTS

In this section, we provide additional experimental details and supplemental results to compare the competing algorithms.

D.1 EFFECT OF DIFFERENT CHOICES OF HYPERPARAMETERS

We compare BALD and BALANCE with batch size B = 1 and different K's on an imbalanced MNIST dataset, which is created by removing a random portion of images for each class in the training dataset. Figure 7 (a) shows that BALANCE performs the best, with a large margin over the curve of BALD. Note that BALANCE with K = 50 is also better than BALD with K = 100. BALANCE is robust to τ. However, when τ is set to 0.3 and the test accuracy gets around 0.88, the accuracy improvement becomes slow. The reason for this slow improvement is that the threshold τ is too large: all the pairs of posterior samples are treated as being in the same equivalence class, and the acquisition function values for all the samples in the AL pool are zero. In other words, BALANCE degrades to random selection when τ is too large. We further pick a data point from this imbalanced MNIST dataset and gradually increase the posterior sample number K to estimate the acquisition function value ∆ BALANCE for this data point. For each posterior sample number K, we estimate the acquisition function ∆ BALANCE 10 times with 10 sets of posterior sample pairs. The mean and std for each K are calculated and shown in figure 8.

D.2 EXPERIMENTS ON OTHER DATASETS

We compare different AL algorithms on tabular datasets including Human Activity Recognition Using Smartphones Data Set (Anguita et al., 2013) (HAR), Gas Sensor Array Drift (Vergara et al., 2012) (DRIFT), and Dry Bean Dataset (Koklu & Ozkan, 2020) , as well as a more difficult dataset CINIC-10 (Darlow et al., 2018) .

HAR, DRIFT and Dry Bean Dataset

We run 6 AL trials for each dataset and algorithm. In each iteration, the BNNs are trained with a learning rate of 0.01 and patience equal to 3 epochs. The BNNs all contain a three-layer MLP with ReLU activation and dropout layers in between. The learning curves of all 5 algorithms on these 3 tabular datasets are shown in figure 9. Batch-BALANCE outperforms all the other algorithms on these 3 datasets. For the HAR dataset, both Batch-BALANCE and BatchBALD work better than random selection. In figure 9 (b) and (c), Mean STD, Variation Ratio and BatchBALD perform worse than random selection. We find similar effects for some other imbalanced datasets.

CINIC-10. CINIC-10 is a large dataset with 270K images from two sources: CIFAR-10 (Krizhevsky et al., 2009) and ImageNet (Rasmus et al., 2015). The training set is split into an AL pool with 120K samples, 40K Dpool samples, 20K validation samples, and 200 starting training samples with 20 samples in each class. We use VGG-11 as the BNN. The number of sampled MC dropout pairs is 50 and the acquisition size is 10. We run 6 trials for this experiment. The learning curves of 5 algorithms are shown in figure 10. We can see from figure 10 that Batch-BALANCE performs better than all the other algorithms by a large margin in this setting.

Repeated-MNIST with different amounts of repetitions. In order to show the effect of redundant data points on BatchBALD and Batch-BALANCE, we ran experiments on Repeated-MNIST with an increasing number of repetitions. The learning curves of accuracy for Repeated-MNIST with different repetition numbers can be seen in figure 11. The detailed model accuracy on the test dataset when the acquired training dataset size is 130 is shown in table 3. Even though BatchBALD can improve data efficiency (Kirsch et al., 2019), there are still large gaps between the learning curves of BatchBALD and Batch-BALANCE, and the gaps become larger when the number of repetitions increases.
In order to compare our algorithms with other AL algorithms in this small batch size regime, we further run PowerBALANCE, PowerBALD, BADGE and CoreSet on Repeated-MNIST with repeat number 3. As shown in figure 12, Batch-BALANCE achieves the best performance. Note that both PowerBALD and PowerBALANCE select AL batches efficiently and show performance similar to that of the BADGE algorithm.

CIFAR-100

For CIFAR-100, we use 100 fine-grained labels. The dataset is split into an initial training dataset with 5,000 samples, Dpool with 5,000 samples, and a validation dataset D val with 5,000 samples. The experiment is conducted with batch size B = 5,000 and budget 25,000. cSG-MCMC is used for the BNN with epoch number 200, initial step size 0.5, and cycle number 4. We can see in figure 13 that both PowerBALANCE and Batch-BALANCE perform well on this dataset.

D.5 COEFFICIENT OF VARIATION

To gain more insight into why BALANCE and Batch-BALANCE work consistently better than BALD and BatchBALD, we further investigate the dispersion of the estimated acquisition function values for those methods. Since Batch-BALANCE and BatchBALD extend their fully sequential algorithms similarly in a greedy manner, we only compare the acquisition functions of BALANCE and BALD. The coefficient of variation (CV) is chosen for the comparison of dispersion. It is defined as the ratio of the standard deviation to the mean. CV is a standardized measure of the dispersion of a probability distribution or frequency distribution. The value of CV is independent of the unit in which it is taken. We conduct the experiment on the imbalanced MNIST dataset in the setting of appendix C.2. We estimate the acquisition function values of BALANCE and BALD 5 times with 5 sets of K MC dropouts for each sample in the AL pool. Then, the CVs are calculated for these estimations. In figure 16 , we show histograms of CVs for both methods. The estimated acquisition function values of BALANCE are less dispersed, which shows potential for better performance.
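For concreteness, the dispersion measure described above can be computed as follows. This is a small sketch under our own assumptions: the repeated estimates are stored as an array with one row per repetition and one column per pool sample, the population standard deviation (NumPy's default `ddof=0`) is used, and all per-sample means are nonzero. The function name is hypothetical.

```python
import numpy as np


def coefficient_of_variation(estimates):
    """Per-sample CV (std / mean) across repeated acquisition estimates.

    estimates: (R, N) array, R repeated estimates for each of N pool samples.
    Returns an (N,) array; assumes every per-sample mean is nonzero.
    """
    estimates = np.asarray(estimates, dtype=float)
    mean = estimates.mean(axis=0)
    std = estimates.std(axis=0)  # population standard deviation (ddof=0)
    return std / mean


estimates = np.array([[1.0, 2.0], [3.0, 2.0]])
cv = coefficient_of_variation(estimates)
cv_rescaled = coefficient_of_variation(10.0 * estimates)
print(cv, cv_rescaled)
```

Rescaling all estimates by a constant leaves the CV unchanged, which is why it allows acquisition functions with different units or ranges, such as those of BALANCE and BALD, to be compared.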

D.6 PREDICTIVE VARIANCE

In order to directly compare the accuracy improvement of batches selected by different algorithms, instead of along the course of an AL trial, we conduct experiments with training sets of various sizes and compare the accuracy improvement of batches selected by AL algorithms with the same training set. The initial training set has 10 samples drawn randomly from Repeated-MNIST. In each step, we select 10 random samples and add them to the training set. Hypotheses are drawn from the BNN posterior given the current training set. We run the different AL algorithms and select batches with batch size 20. After each batch is added to the training set, we can estimate the accuracy improvement of the batch. In each step, we run each AL algorithm 20 times and estimate the mean and std of the accuracy improvement. The mean and std of the BNNs' accuracy are shown in figure 17. We can see in figure 17 that our algorithms consistently select batches that have high accuracy improvement and low variance.

D.7 BATCH-BALANCE WITH MULTI-CHAIN CSG-MCMC

cSG-MCMC can be improved by sampling with multiple chains (Zhang et al., 2019) . In order to evaluate different AL algorithms with this improved parallel cSG-MCMC method, we conduct 



We use the conventional notation ω to represent the parameters of a BNN, and use ω and h interchangeably to denote a hypothesis.

The likelihood ratio is used here (instead of the likelihood) so that the contribution of "non-informative examples" (e.g., $p(y'_{1:b} \mid \omega, x_{1:b}) = \text{const}\ \forall y'_{1:b}, \omega$) is zeroed out.

Empirically we find that τ ∈ [ε/8, ε/2] works generally well for all datasets.

A similar importance sampling procedure was proposed in Kirsch et al. (2019) to estimate the mutual information. Here, we show how one can adapt the strategy to enable efficient estimation of $\Delta_{\text{Batch-BALANCE}}$.



Figure 1: (a) Samples from posterior BNN via MC dropout. The embeddings are generated by applying t-SNE on the hypotheses' predictions on a random hold-out dataset. Colorbar indicates the (approximate) test accuracy of the sampled neural networks on the MNIST dataset. See section C.2 for details of the experimental setup. (b) Probability mass (y-axis) of equivalence classes (sorted by the average accuracy of the enclosed hypotheses as the x-axis).

Figure 2: Run time vs. batch size.

4.3 BATCH-MODE DEEP BAYESIAN AL WITH SMALL BATCH SIZE

We compare 5 different models with acquisition sizes B = 1, B = 3 and B = 10 on the MNIST dataset. K = 100 for all the methods. The threshold τ for Batch-BALANCE is annealed by setting τ to ε/2 in each AL loop. Note that when B = 3, we can compute the acquisition function with all $y_{1:b}$ configurations for b = 1, 2, 3. When b ≥ 4, we approximate the acquisition function with importance sampling. Figure 3 (a)-(c) show that Batch-BALANCE is consistently better than the other baseline methods on the MNIST dataset.
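For small b, computing the acquisition function "with all configurations" amounts to a plain enumeration over the C^b joint labelings. The sketch below illustrates this exact variant only; it is not the importance-sampling estimator and not the released code. We assume the K paired posterior samples are summarized by per-point class probabilities of shape K × b × C and that labels factorize across batch points given a posterior sample. The function name and argument layout are hypothetical.

```python
import itertools

import numpy as np


def batch_balance_exact(probs_a, probs_b, far):
    """Exact enumeration of the Batch-BALANCE estimate for a small batch.

    probs_a, probs_b: (K, b, C) class probabilities of the paired samples
        on the b batch points.
    far: (K,) boolean, True where the pair's Hamming distance exceeds tau.
    """
    num_pairs, batch, num_classes = probs_a.shape
    idx = np.arange(batch)
    # The max over joint labelings factorizes into a product of per-point maxima.
    max_a = probs_a.max(axis=2).prod(axis=1)
    max_b = probs_b.max(axis=2).prod(axis=1)
    total = 0.0
    for config in itertools.product(range(num_classes), repeat=batch):
        labels = np.array(config)
        joint_a = probs_a[:, idx, labels].prod(axis=1)  # p(config | w_k), shape (K,)
        joint_b = probs_b[:, idx, labels].prod(axis=1)
        p_config = (joint_a + joint_b).sum() / (2 * num_pairs)
        discount = (far * (1.0 - (joint_a / max_a) * (joint_b / max_b))).mean()
        total += p_config * discount
    return total


pa = np.array([[0.9, 0.1], [0.5, 0.5]])
pb = np.array([[0.1, 0.9], [0.5, 0.5]])
far = np.array([True, False])
single = batch_balance_exact(pa[:, None, :], pb[:, None, :], far)
doubled = batch_balance_exact(np.repeat(pa[:, None, :], 2, axis=1),
                              np.repeat(pb[:, None, :], 2, axis=1), far)
print(single, doubled)
```

The loop visits C^b configurations, so its cost grows exponentially in b; this is the reason the text switches to importance sampling once b ≥ 4.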

Figure 3: Experimental results on MNIST, Repeated-MNIST, Fashion-MNIST, EMNIST-Balanced, EMNIST-ByClass and EMNIST-ByMerge datasets in the small-batch regime. For all plots, the y-axis represents accuracy and x-axis represents the number of queried examples.

Figure 4: Performance on SVHN and CIFAR-10 datasets in the large-batch regime.

B.3 Efficient implementation for greedy selection
B.4 Detailed computational complexity discussion
C Experimental setup: Datasets and implementation details
C.1 Datasets used in the main paper
C.2 Implementation details on the empirical example in figure 1
D Supplemental empirical results
D.1 Effect of different choices of hyperparameters
D.2 Experiments on other datasets
D.3 Additional evaluation metrics
D.4 BALANCE via explicit partitioning over the hypothesis posterior samples
D.5 Coefficient of variation
D.6 Predictive variance
D.7 Batch-BALANCE with multi-chain cSG-MCMC

A PRELIMINARY WORKS

A.1 THE MOST INFORMATIVE SELECTION CRITERION

BALD uses mutual information between the model prediction for each sample and parameters of the model as the acquisition function. It captures the reduction of model uncertainty by receiving a label $y$ of a data point $x$: $I(y; \omega \mid x, D_{\text{train}}) = H(y \mid x, D_{\text{train}}) - \mathbb{E}_{p(\omega \mid D_{\text{train}})}\left[H(y \mid x, \omega, D_{\text{train}})\right]$, where $H$ denotes the Shannon entropy

Figure 5: A stylized example where the most informative selection criterion underperforms the equivalence-class-based criterion.

Figure 6: Computation time (in seconds) vs. batch size for different AL algorithms

(a) ACC vs. # samples for different K's.

(b) ACC vs. # samples for different τ's (τ ∈ {0.05, 0.15, 0.3, ε/2, ε/4, ε/8}).

Figure 7: Learning curves of different K and τ for BALANCE.

Figure 8: Estimated acquisition function values ∆ BALANCE of BALANCE vs. posterior sample number K

Figure 9: Experimental results on 3 tabular datasets. For all plots, the y-axis represents accuracy and x-axis represents the number of queried examples.

Figure 10: ACC vs. # samples on the CINIC-10 dataset.

Figure 11: Performance of Random selection, BatchBALD, and Batch-BALANCE on Repeated-MNIST for an increasing number of repetitions. For all plots, the y-axis represents accuracy and x-axis represents the number of queried examples. We can see that BatchBALD also performs worse as the number of repetitions is increased. Batch-BALANCE outperforms BatchBALD by large margins and maintains similar performance across different numbers of repetitions.

Figure 12: ACC vs. # samples on Repeated-MNIST dataset with repeat number 3.

Figure 13: ACC vs. # samples, cSG-MCMC, CIFAR-100

Figure 15: ACC vs. # samples for BALANCE-Partition and BALANCE.

Figure 16: Histograms for coefficient of variation.

Figure 17: We empirically show AL algorithms' predictive variance.

Figure 18: ACC vs. # samples, multi-chain cSG-MCMC, CIFAR-10

Computational complexity of AL algorithms.

Gal et al. (2017) choose a batch of samples with top acquisition functions. These methods can potentially suffer from choosing similar and redundant samples inside each batch. Kirsch et al. (2019) extended Houlsby et al. (2011) and proposed a batch-mode deep Bayesian AL algorithm, namely BatchBALD. Chen & Krause (

A.1 The most informative selection criterion
A.2 Equivalence class edge cutting
A.3 The equivalence class edge discounting algorithm
B.1 Derivation of acquisition functions of BALANCE and Batch-BALANCE
B.2 Importance sampling of configurations

The datasets are all split into a starting training set, validation set, testing set, and AL pool. The AL pool is also used as Dpool. The τ for Batch-BALANCE is set to ε/4 in each AL loop. See table 2 for more experiment details of these 3 datasets.

Experiment details for HAR, DRIFT and Dry Bean Dataset


Acknowledgement. This work was supported in part by C3.ai DTI Research Award 049755, NSF award 2037026 and an NVIDIA GPU grant. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any funding agencies.

B.3 EFFICIENT IMPLEMENTATION FOR GREEDY SELECTION

In algorithm 2, we can store p(ŷ […] P′_b can be flattened to shape 1 × C^b after matrix multiplication. We store […] We can compute λ_{ω,ŷ_{1:b}} inside the edge weight discount expression by […], where ⊙ is element-wise matrix multiplication and ⊗ is the outer-product operator along the first dimension. After the outer product operation, we can reshape the matrix by flattening all the dimensions after the 1st dimension to maintain consistency. Similarly, we can compute Â′_{1:b}, p(ŷ_{1:b} | ω′_k) and λ_{ω′,ŷ_{1:b}} with matrix operations. The indicator function 1[d_H(ω_k, ω′_k) > τ] can be stored in a matrix with shape K × 1. The acquisition function can be computed with all matrix operations as follows:
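The outer product along the first dimension followed by flattening, as described in B.3, can be illustrated with a small sketch. It covers only that single step, not the full acquisition function. It assumes that labels are conditionally independent given a posterior sample, so that the probability of a joint configuration is the product of per-point class probabilities. The function name `rowwise_outer_flatten` and the toy probabilities are hypothetical, and plain Python lists stand in for the tensors a real implementation would use.

```python
from typing import List

Matrix = List[List[float]]


def rowwise_outer_flatten(joint: Matrix, new_probs: Matrix) -> Matrix:
    """Row-wise outer product followed by flattening.

    joint:     K rows of length C**(b-1), probabilities of partial configurations
    new_probs: K rows of length C, class probabilities of the newly added point
    returns:   K rows of length C**b, one entry per extended configuration
    """
    if len(joint) != len(new_probs):
        raise ValueError("both matrices need one row per posterior sample")
    return [
        # Outer product of the two rows, written out directly in flattened order.
        [p_prev * p_new for p_prev in row_prev for p_new in row_new]
        for row_prev, row_new in zip(joint, new_probs)
    ]


# Toy example (assumed values): K = 2 posterior samples, C = 2 classes.
probs_x1 = [[0.9, 0.1], [0.4, 0.6]]  # class probabilities for the first point
probs_x2 = [[0.2, 0.8], [0.5, 0.5]]  # class probabilities for the second point

joint_12 = rowwise_outer_flatten(probs_x1, probs_x2)  # K rows, C**2 columns
```

Each row of `joint_12` still sums to one, which is a quick sanity check that the flattened matrix remains a distribution over the C^b configurations for every posterior sample.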

B.4 DETAILED COMPUTATIONAL COMPLEXITY DISCUSSION

As demonstrated in figure 2, figure 6, and table 1, the computational complexity of our PowerBALANCE algorithm is comparable to that of PowerBALD. Both methods need to estimate the acquisition function value for each data point in the AL pool and then choose the top B data points after adding Gumbel-distributed noise to the log values. However, the power sampling-based methods have limited performance due to the lack of interaction between selected and non-selected samples during sampling. We can further improve on the performance of PowerBALANCE with Batch-BALANCE. In the large-batch setting, the computational complexity of Batch-BALANCE is proportional to B^2 when the pool is downsampled to a subset of size |C| = cB, where c is a small constant. This is similar to the computational complexity of BADGE and CoreSet.

We also evaluated the negative log-likelihood (NLL) for different AL algorithms. NLL is a popular metric for evaluating predictive uncertainty (Quinonero-Candela et al., 2005). As shown in figure 14, Batch-BALANCE maintains a better or comparable quality of predictive uncertainty over test data.
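The power-sampling step described in B.4 (perturb the log acquisition values with Gumbel noise, then keep the top B) can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function name `gumbel_top_b`, the clamping constant, and the example scores are hypothetical, and no temperature or power exponent is applied to the scores.

```python
import math
import random
from typing import List, Optional, Sequence


def gumbel_top_b(scores: Sequence[float], batch_size: int,
                 rng: Optional[random.Random] = None) -> List[int]:
    """Pick `batch_size` distinct indices by perturbing log-scores with Gumbel noise."""
    rng = rng or random.Random()
    if not 0 <= batch_size <= len(scores):
        raise ValueError("batch_size must be between 0 and len(scores)")
    perturbed = []
    for index, score in enumerate(scores):
        # A non-positive score gets log-value -inf and is ranked last.
        log_score = math.log(score) if score > 0 else float("-inf")
        # Clamp the uniform draw away from 0 and 1 so both logarithms are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # standard Gumbel(0, 1) sample
        perturbed.append((log_score + gumbel, index))
    # Keep the indices with the largest perturbed log-scores.
    perturbed.sort(reverse=True)
    return [index for _, index in perturbed[:batch_size]]


# Made-up acquisition values for a pool of five points.
acquisition_scores = [0.05, 0.40, 0.0, 0.30, 0.25]
batch = gumbel_top_b(acquisition_scores, batch_size=3, rng=random.Random(0))
```

Because each point is perturbed independently, the cost is one pass over the pool plus a sort. This is also why the selected points do not interact with one another during sampling.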

D.4 BALANCE VIA EXPLICIT PARTITIONING OVER THE HYPOTHESIS POSTERIOR SAMPLES

Another way of estimating the acquisition function is to construct the equivalence classes explicitly first (e.g., by partitioning the hypothesis space into k Voronoi cells via max-diameter clustering) and then calculate the weight discounts of edges that connect different equivalence classes. Intuitively, explicitly constructing equivalence classes may introduce unnecessary edges, as two nearby hypotheses can be partitioned into different equivalence classes, therefore leading to an overestimate of the discounted edge weight. We call this algorithm BALANCE-Partition.

In order to compare with BALANCE and Batch-BALANCE, we sampled K pairs of MC dropouts to estimate the acquisition function of BALANCE-Partition. All the representations of the 2K MC dropouts on Dpool are generated. We run FFT (Gonzalez, 1985) with Hamming distances and threshold τ on these representations to get approximated ECs. Each data point has at most τ Hamming distance to the corresponding cluster center. FFT is a 2-approximation algorithm, so the optimal solution with the same number of clusters has cluster diameter ≥ τ/2. After the equivalence classes are returned, BALANCE-Partition calculates the edge discounts of all edges that connect different equivalence classes and estimates the acquisition function value of each data sample in the AL pool.

Although a faster method that utilizes complete homogeneous symmetric polynomials (Javdani et al., 2014) can be implemented to estimate the acquisition function values for BALANCE-Partition, experiments in figure 15 show that BALANCE-Partition cannot achieve better performance than BALANCE, and increasing the MC dropout number does not improve performance significantly.
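The clustering step used by BALANCE-Partition can be sketched as a farthest-first traversal that keeps adding centers until every hypothesis is within Hamming distance τ of its nearest center. This is an illustrative sketch only. It assumes that each hypothesis is represented by its tuple of predicted labels on the pool and that the Hamming distance is an unnormalized count of disagreements. The function names, the choice of the first center, and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions where two prediction vectors disagree."""
    return sum(x != y for x, y in zip(a, b))


def farthest_first_traversal(points: Sequence[Sequence[int]],
                             tau: int) -> Tuple[List[int], List[int]]:
    """Return (center indices, cluster assignment per point).

    Centers are added greedily until every point lies within `tau`
    of its nearest center.
    """
    if tau < 0:
        raise ValueError("tau must be non-negative")
    if not points:
        return [], []
    centers = [0]  # arbitrary first center (assumption: the first point)
    dist = [hamming(p, points[0]) for p in points]
    assignment = [0] * len(points)
    while True:
        farthest = max(range(len(points)), key=dist.__getitem__)
        if dist[farthest] <= tau:
            break  # every point is covered by some center
        centers.append(farthest)
        for i, p in enumerate(points):
            d = hamming(p, points[farthest])
            if d < dist[i]:
                dist[i] = d
                assignment[i] = len(centers) - 1
    return centers, assignment


# Toy predictions of five hypotheses on six pool points (made-up values).
hypotheses = [
    (0, 0, 0, 0, 0, 0),
    (0, 0, 0, 0, 0, 1),
    (1, 1, 1, 1, 0, 0),
    (1, 1, 1, 1, 1, 0),
    (0, 1, 0, 0, 0, 0),
]
centers, assignment = farthest_first_traversal(hypotheses, tau=1)
```

With these toy inputs, hypotheses 0, 1 and 4 fall into one approximate equivalence class and hypotheses 2 and 3 into another. Edges between the two classes are the ones whose weights BALANCE-Partition would discount.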

Method           repeat 1 time    repeat 2 times   repeat 3 times   repeat 4 times
Random           0.887 ± 0.017    0.883 ± 0.012    0.881 ± 0.013    0.895 ± 0.009
BatchBALD        0.917 ± 0.005    0.892 ± 0.023    0.883 ± 0.025    0.881 ± 0.014
Batch-BALANCE    0.926 ± 0.008    0.923 ± 0.008    0.929 ± 0.004    0.927 ± 0.010

