SCALABLE BATCH-MODE DEEP BAYESIAN ACTIVE LEARNING VIA EQUIVALENCE CLASS ANNEALING

Abstract

Active learning has demonstrated data efficiency in many fields. Existing active learning algorithms, especially in the context of batch-mode deep Bayesian active learning, rely heavily on the quality of the model's uncertainty estimates, and are often challenging to scale to large batches. In this paper, we propose Batch-BALANCE, a scalable batch-mode active learning algorithm that combines insights from decision-theoretic active learning, combinatorial information measures, and diversity sampling. At its core, Batch-BALANCE relies on a novel decision-theoretic acquisition function that facilitates differentiation among different equivalence classes. Intuitively, each equivalence class consists of hypotheses (e.g., posterior samples of deep neural networks) with similar predictions, and Batch-BALANCE adaptively adjusts the size of the equivalence classes as learning progresses. To scale up the computation of queries to large batches, we further propose an efficient batch-mode acquisition procedure, which aims to maximize a novel information measure defined through the acquisition function. We show that our algorithm can effectively handle realistic multi-class classification tasks, and achieves compelling performance on several benchmark datasets for active learning under both low- and large-batch regimes. Reference code is released at https://github.com/zhangrenyuuchicago/BALanCe.

1. INTRODUCTION

Active learning (AL) (Settles, 2012) characterizes a collection of techniques that efficiently select data for training machine learning models. In the pool-based setting, an active learner selectively queries the labels of data points from a pool of unlabeled examples and incurs a certain cost for each label obtained. The goal is to minimize the total cost while achieving a target level of performance. A common practice in AL is to devise efficient surrogates, a.k.a. acquisition functions, to assess the effectiveness of unlabeled data points in the pool. A vast body of literature and empirical studies (Huang et al., 2010; Houlsby et al., 2011; Wang & Ye, 2015; Hsu & Lin, 2015; Huang et al., 2016; Sener & Savarese, 2017; Ducoffe & Precioso, 2018; Ash et al., 2019; Liu et al., 2020; Yan et al., 2020) suggests a variety of heuristics as potential acquisition functions for AL. Among these methods, Bayesian Active Learning by Disagreement (BALD) (Houlsby et al., 2011) has attained notable success in the context of deep Bayesian AL, while maintaining the expressiveness of Bayesian models (Gal et al., 2017; Janz et al., 2017; Shen et al., 2017). Concretely, BALD relies on a most informative selection (MIS) strategy, a classical heuristic dating back to Lindley (1956), which greedily queries the data point exhibiting the maximal mutual information with the model parameters at each iteration. Despite the overwhelming popularity of such heuristics due to their algorithmic simplicity (MacKay, 1992; Chen et al., 2015; Gal & Ghahramani, 2016), the performance of these AL algorithms is unfortunately sensitive to the quality of the underlying model's uncertainty estimates, and accurately quantifying model uncertainty remains an open problem in deep AL, due to limited access to training data and the challenge of posterior estimation.
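For reference, the BALD/MIS criterion described above scores each pool point by the mutual information between its label and the model parameters, estimated from Monte Carlo posterior samples. The following is a minimal NumPy sketch (the function name and array shapes are illustrative, not from the paper):

```python
import numpy as np

def bald_scores(probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """BALD mutual-information scores from Monte Carlo posterior samples.

    probs: array of shape (K, N, C) -- predictive probabilities from K
    posterior samples (e.g., MC-dropout forward passes) for N pool points
    over C classes.
    """
    mean = probs.mean(axis=0)  # (N, C), posterior-averaged prediction
    # H[ E_theta p(y|x, theta) ]: entropy of the mean prediction
    entropy_of_mean = -(mean * np.log(mean + eps)).sum(axis=-1)
    # E_theta H[ p(y|x, theta) ]: average entropy of each sample's prediction
    mean_entropy = -(probs * np.log(probs + eps)).sum(axis=-1).mean(axis=0)
    return entropy_of_mean - mean_entropy  # (N,) mutual-information scores

# MIS/BALD queries the pool point with maximal mutual information:
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=(8, 5))  # 8 samples, 5 points, 3 classes
query = int(np.argmax(bald_scores(probs)))
```

The score is zero when all posterior samples agree on a point (the model is confidently consistent) and large when individual samples are confident but mutually disagree, which is exactly the disagreement that MIS seeks out.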
We note from figure 1b that the probability mass of the models sampled from the BNN is centered around the mode of the approximate posterior distribution, while models of higher accuracy receive little coverage. Consequently, MIS tends to select data points that reveal the maximal information w.r.t. the sampled distribution, rather than guiding the active learner toward high-accuracy models. In addition to this robustness concern, another challenge for deep AL is scalability to large batches of queries. In many real-world applications, fully sequential data acquisition is often undesirable, especially for large models, as model retraining becomes the bottleneck of the learning system (Mittal et al., 2019; Ostapuk et al., 2019). Batch-mode algorithms are therefore designed to reduce the computational time spent on model retraining and to increase labeling efficiency. Unfortunately, for most acquisition functions, computing the optimal batch of queries is NP-hard (Chen & Krause, 2013); when the evaluation of the acquisition function is expensive or the pool of candidate queries is large, it is computationally challenging even to construct a batch greedily (Gal et al., 2017; Kirsch et al., 2019; Ash et al., 2019). Recent efforts to scale up batch-mode AL algorithms often involve diversity sampling strategies (Sener & Savarese, 2017; Ash et al., 2019; Citovsky et al., 2021; Kirsch et al., 2021a). Unfortunately, these diversity selection strategies either ignore the downstream learning objective (e.g., clustering as in Citovsky et al. (2021)) or inherit the limitations of the sequential acquisition functions (e.g., sensitivity to uncertainty estimates, as elaborated in figure 1) (Kirsch et al., 2021a).
Motivated by these two challenges, this paper aims to simultaneously (1) mitigate the limitations of uncertainty-based deep AL heuristics caused by inaccurate uncertainty estimation, and (2) enable efficient computation of batches of queries at scale. We propose Batch-BALANCE, an efficient batch-mode deep Bayesian AL framework, which employs a decision-theoretic acquisition function inspired by Golovin et al. (2010); Chen et al. (2016). Concretely, Batch-BALANCE utilizes BNNs as the underlying hypotheses, and uses Monte Carlo (MC) dropout (Gal & Ghahramani, 2016; Kingma et al., 2015) or stochastic gradient Markov chain Monte Carlo (SG-MCMC) (Welling & Teh, 2011; Chen et al., 2014; Ding et al., 2014; Li et al., 2016a) to estimate the model posterior. It then selects points that most effectively tell apart hypotheses from different equivalence classes (as illustrated in figure 1). Intuitively, such disagreement structure is induced by the pool of unlabeled data points; our selection criterion therefore accounts for the informativeness of a query with respect to the target models (as done in BALD), while putting less focus on differentiating models with little disagreement on the target data distribution. As learning progresses, Batch-BALANCE adaptively anneals the radii of the equivalence classes, selecting progressively more "difficult" examples that distinguish increasingly similar hypotheses as model accuracy improves (section 3.1). When computing queries in small batches, Batch-BALANCE employs an importance sampling strategy to efficiently compute the expected gain in differentiating equivalence classes for a batch of examples, and chooses samples within a batch greedily. To scale up the computation of queries to large batches, we further propose an efficient batch-mode acquisition procedure, which aims to maximize a novel combinatorial information measure (Kothawade et al., 2021) defined through our acquisition function.
The resulting algorithm can efficiently scale to realistic batched learning tasks with reasonably large batch sizes (section 3.2, section 3.3, appendix B).
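The greedy batch construction mentioned above can be sketched generically: repeatedly add the pool point with the largest marginal gain under a set-valued acquisition function. The sketch below is illustrative only, not the exact Batch-BALANCE procedure; the `acq` callable stands in for a set score such as the combinatorial information measure, and the helper name is our own.

```python
def greedy_batch(pool_ids, batch_size, acq):
    """Greedily grow a batch by marginal gain of a set acquisition function.

    acq: callable mapping a tuple of selected ids to a scalar score.
    Illustrative sketch of greedy batch-mode selection, not the exact
    Batch-BALANCE acquisition procedure.
    """
    batch = []
    for _ in range(batch_size):
        base = acq(tuple(batch))  # score of the current partial batch
        best, best_gain = None, float("-inf")
        for i in pool_ids:
            if i in batch:
                continue
            gain = acq(tuple(batch + [i])) - base  # marginal gain of adding i
            if gain > best_gain:
                best, best_gain = i, gain
        batch.append(best)
    return batch
```

Each round costs one acquisition evaluation per remaining pool point, which is why expensive acquisition functions or large pools make even greedy construction costly, as noted above.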



Figure 1: (a) Samples from the posterior of a BNN via MC dropout. The embeddings are generated by applying t-SNE to the hypotheses' predictions on a random hold-out dataset. The colorbar indicates the (approximate) test accuracy of the sampled neural networks on the MNIST dataset. See section C.2 for details of the experimental setup. (b) Probability mass (y-axis) of equivalence classes (sorted on the x-axis by the average accuracy of the enclosed hypotheses).

In figure 1, we demonstrate the potential issues of MIS-based strategies introduced by inaccurate posterior samples from a Bayesian neural network (BNN) on a multi-class classification dataset. Here, the samples (i.e., hypotheses) from the model posterior are grouped into equivalence classes (ECs) (Golovin et al., 2010) according to the Hamming distance between their predictions, as shown in figure 1a. Informally, an equivalence class contains hypotheses whose predictions are close on a randomly selected set of examples (see section 2.2 for the formal definition).
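A rough illustration of this grouping, assuming a simple leader-style clustering by Hamming radius (the exact EC construction in section 2.2 may differ; the function name and clustering rule here are our own): annealing then corresponds to shrinking `radius` as learning progresses, so that ever more similar hypotheses fall into distinct classes.

```python
import numpy as np

def equivalence_classes(preds: np.ndarray, radius: int):
    """Group hypotheses whose predictions are within a Hamming radius.

    preds: (K, N) integer predictions of K posterior samples on N held-out
    points. A leader-style clustering sketch: each hypothesis joins the
    first class whose representative is within `radius` disagreements,
    otherwise it starts a new class.
    """
    centers, classes = [], []
    for k, p in enumerate(preds):
        for c, center in enumerate(centers):
            if np.sum(p != center) <= radius:  # Hamming distance to center
                classes[c].append(k)
                break
        else:  # no existing class is close enough: open a new one
            centers.append(p)
            classes.append([k])
    return classes
```

With a large radius, many hypotheses collapse into few classes (coarse distinctions suffice early in learning); as the radius shrinks, the partition refines and the learner must query points that separate near-identical hypotheses.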

