INFORMATION CONDENSING ACTIVE LEARNING

Abstract

We introduce Information Condensing Active Learning (ICAL), a batch-mode, model-agnostic Active Learning (AL) method targeted at Deep Bayesian Active Learning that focuses on acquiring labels for points which carry as much information as possible about the still-unacquired points. ICAL uses the Hilbert-Schmidt Independence Criterion (HSIC) to measure the strength of the dependency between a candidate batch of points and the unlabeled set. We develop key optimizations that allow us to scale our method to large unlabeled sets. We show significant improvements in terms of model accuracy and negative log likelihood (NLL) on several image datasets compared to state-of-the-art batch-mode AL methods for deep learning.

1. Introduction

Machine learning models are widely used for a vast array of real-world problems. They have been applied successfully in a variety of areas including biology (Ching et al., 2018), chemistry (Sanchez-Lengeling and Aspuru-Guzik, 2018), physics (Guest et al., 2018), and materials engineering (Aspuru-Guzik and Persson, 2018). Key to the success of modern machine learning methods is access to high-quality data for training the model. However, such data can be expensive to collect for many problems. Active learning (Settles, 2009) is a popular methodology for intelligently selecting the fewest new data points to be labeled without sacrificing model accuracy. The usual setting is pool-based active learning, where one has access to a large unlabeled dataset $D_U$ and uses active learning to iteratively select new points from $D_U$ to label. Our goal in this paper is to develop an active learning acquisition function that selects points so as to maximize the eventual test accuracy, which is also one of the most popular criteria used to evaluate an acquisition function.

In active learning, an acquisition function is used to select which new points to label. A large number of acquisition functions have been developed over the years, mostly for classification (Settles, 2009). Acquisition functions use model predictions or point locations (in input feature or learned representation space) to decide which points would be most helpful to label to improve model accuracy. We then query for the labels of those points and add them to the training set. While the past focus for acquisition functions has been the acquisition of one point at a time, each round of label acquisition and retraining of the ML model, particularly in the case of deep neural networks, can be expensive. Furthermore, in several applications such as biology, it can be much faster to acquire a fixed number of points in parallel than sequentially.
There have been several papers, particularly in the past few years, that try to avoid this issue by acquiring points in batches. As our goal is to apply AL in the context of modern ML models and data, we focus in this paper on batch-mode AL.

Acquisition functions can be broadly thought of as belonging to two categories. Those in the first category directly focus on minimizing the error rate post-acquisition. A natural choice of such an acquisition function might be to acquire labels for points with the highest uncertainty or points closest to the decision boundary (uncertainty sampling can be directly linked to minimizing the error rate in the context of active learning; Mussmann and Liang, 2018). In the other category, the goal is to get as close as possible to the true underlying model. Here, acquisition functions select points which give the most knowledge about a model's parameters, where knowledge is defined as the statistical dependency between the parameters of the model and the predictions for the selected points. Mutual information (MI) is the usual choice for the dependency, though other choices are possible. For well-specified model spaces, e.g. in physics, such a strategy can identify the correct model. In machine learning, however, models are usually mis-specified, and thus the metric of evaluation even for model-identification acquisition functions is how successful they are at reducing test error.

Given this reality, we follow the viewpoint of trying to minimize the error rate of the model post-acquisition. Our strategy is to select points that we expect would provide substantial information about the labels of the rest of the unlabeled set, thus reducing model uncertainty. We propose acquiring a batch of points $B$ such that the model's predictions on $B$ have as high a statistical dependency as possible with the model's predictions on the entire unlabeled set $D_U$.
Thus we want a batch $B$ that condenses the most information about the model's predictions on $D_U$. We call our method Information Condensing Active Learning (ICAL). A key desideratum for our acquisition function is to be model agnostic. This is partly because the model distribution can be very heterogeneous. For example, ensembles, which are often used as a model distribution, can consist of just decision trees in a random forest or of different architectures for a neural network. This means we cannot assume any closed form for the model's predictive distribution, and have to resort to Monte Carlo sampling of the predictions from the model to estimate the dependency between the model's predictions on the query batch and the unlabeled set. MI, however, is known to be hard to approximate using just samples (Song and Ermon, 2019). Thus, to scale the method to larger batch sizes, we use the Hilbert-Schmidt Independence Criterion (HSIC), one of the most powerful extant statistical dependency measures for high-dimensional settings. Another advantage of HSIC is that it is differentiable, which, as we will discuss later, can allow applications of the acquisition function to areas where MI would be difficult to make work.

To summarize, we introduce Information Condensing Active Learning (ICAL), which maximizes the amount of information gained with respect to the model's predictions on the unlabeled set of points. ICAL is a batch-mode acquisition function that is model agnostic and can be applied to both classification and regression tasks, as it only needs samples from the posterior predictive distribution, which can be obtained for both. We then develop an algorithm that can scale ICAL to large batch sizes when using HSIC as the dependency measure between random variables.

2. Related work

A review of work on acquisition functions for active learning prior to the recent focus on deep learning is given by Settles (2009). The BALD (Bayesian Active Learning by Disagreement) acquisition function (Houlsby et al., 2011) chooses the query point which has the highest mutual information with the model parameters. This turns out to be the point on which individual models sampled from the model distribution are confident in their predictions but for which the overall predictive distribution has high entropy. In other words, this is the point on which the models are individually confident but disagree the most. Guo and Schuurmans (2008), building on Guo and Greiner (2007), formulate the problem as an integer program in which they select a batch such that the post-acquisition model is highly confident on the training set and has low uncertainty on the unlabeled set. While the latter aspect is related to what we do, they need to retrain their model for every candidate batch they search over in the course of trying to find the optimal batch. As the total number of possible batches is exponential in the size of the unlabeled set, this can get too computationally expensive for neural networks, limiting the applicability of this approach. As far as we know, Guo and Schuurmans (2008) has only been applied to logistic regression. BMDR (Wang and Ye, 2015) queries points that are as close to the classifier decision boundary as possible while still being representative of the overall sample distribution. The representativeness is measured using the maximum mean discrepancy (MMD) (Gretton et al., 2012) of the input features between the query batch and the set of all points, with a lower MMD indicating a more representative query batch. However, this approach is limited to classification problems, as it is based on a decision boundary.
BMAL (Hoi et al., 2006) selects a batch such that the Fisher information matrices for the total unlabeled set and the selected batch are as close as possible. The Fisher information matrix is, however, quadratic in the number of parameters and thus infeasible to compute for modern deep neural networks. FASS (Filtered Active Submodular Selection) (Wei et al., 2015) picks the most uncertain points and then selects a subset of those points that is as similar as possible to the whole candidate set, which favors points that can represent the diversity of the initial set of most uncertain points.

Recently, active learning methods have been extended to the deep learning setting. Gal et al. (2017) adapt BALD (Houlsby et al., 2011) to the deep learning setting by using Monte Carlo Dropout (Gal and Ghahramani, 2016) to do inference for their Bayesian neural network. BALD is extended to the batch setting for neural networks by BatchBALD (Kirsch et al., 2019). Pinsler et al. (2019) adapt the Bayesian Coreset (Campbell and Broderick, 2018) approach for active learning, though their approach requires a batch size that changes for every acquisition. As the neural network decision boundary is intractable, DeepFool (Ducoffe and Precioso, 2018) uses the concept of adversarial examples (Goodfellow et al., 2014) to find points close to the decision boundary. However, this approach is again limited to classification tasks. FF-Comp (Geifman and El-Yaniv, 2017), DAL (Gissin and Shalev-Shwartz, 2019), Sener and Savarese (2017), and BADGE (Ash et al., 2019) operate on the learned representation, as that is the only way these methods incorporate feedback from the training labels into the acquisition function. They are thus not model agnostic: they do not extend to model distributions for which it is difficult to define a common representation, such as random forests or ensembles, where the learned representation is a distribution and not a single point. This is also the case with the model distribution we use in this paper, MC-dropout.

There is also extensive prior work on exploiting Gaussian Processes (GPs) for active learning (Houlsby et al., 2011; Krause et al., 2008). However, GPs are hard to scale, especially for modern image datasets.

3. Background

Statistical background. The entropy of a distribution is defined as $H(X) = -\sum_{x \in \mathcal{X}} p_x \log(p_x)$, where $p_x$ is the probability of $x$. Mutual information (MI) between two random variables is defined as $I[X; Y] = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log\left(\frac{p(x, y)}{p(x) p(y)}\right)$, where $p(x, y)$ is the joint probability of $x, y$. Note that $I[X; Y] = H(Y) - H(Y|X) = H(X) - H(X|Y)$. By posterior predictive distribution $y_x$ we mean $\int_\theta p(y|x, \theta) p(\theta|D) d\theta$, where $y$ is the prediction, $x$ the input point, $\theta$ the model parameters, and $D$ the training data. $\mathcal{M}$ is the distribution of models (parametrized by $\theta$) we wish to choose from via active learning. As mentioned before, we use MC-dropout for our model distribution: we sample random dropout masks and use the same set of dropout masks across points to generate joint predictions.

Hilbert-Schmidt Independence Criterion (HSIC). Suppose we have two (possibly multivariate) distributions $X$, $Y$ and we want to measure the dependence between them. A well-known way to measure it is distance covariance, which, intuitively, measures the covariance between the distances of pairs of samples from the joint distribution $P_{XY}$ and from the product of the marginal distributions $P(X)$, $P(Y)$ (Székely et al., 2007). HSIC can simply be thought of as distance covariance in a kernel space (Sejdinovic et al., 2013b). A (sample) kernel matrix $k^X$ is a matrix whose $ij$th element is $k(x_i, x_j)$, where $k$ is the kernel function and $x_i, x_j$ are the $i$th and $j$th samples from $X$. Further details are in the Appendix.
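To make the kernel-matrix view of HSIC concrete, the following is a minimal sketch of a standard biased sample estimator, $\mathrm{trace}(K H L H) / (m-1)^2$ with centering matrix $H$. It is an illustration under assumptions rather than the estimator defined in the Appendix: the Gaussian kernel, the length scale, the toy data, and all function names are chosen here for the example only.

```python
import math
import random


def rbf_kernel_matrix(samples, length_scale=1.0):
    """Kernel matrix whose (i, j) entry is k(x_i, x_j) for a Gaussian kernel."""
    def k(a, b):
        sq_dist = sum((u - v) ** 2 for u, v in zip(a, b))
        return math.exp(-sq_dist / (2.0 * length_scale ** 2))
    return [[k(a, b) for b in samples] for a in samples]


def center(K):
    """Double-center K, i.e. compute H K H with H = I - (1/m) 1 1^T."""
    m = len(K)
    row_mean = [sum(row) / m for row in K]
    col_mean = [sum(K[i][j] for i in range(m)) / m for j in range(m)]
    total_mean = sum(row_mean) / m
    return [[K[i][j] - row_mean[i] - col_mean[j] + total_mean
             for j in range(m)] for i in range(m)]


def hsic_biased(K, L):
    """Biased sample HSIC: trace(K H L H) / (m - 1)^2."""
    m = len(K)
    Kc = center(K)
    # trace(K H L H) = trace((H K H) L) = sum_ij (HKH)_ij * L_ji
    trace = sum(Kc[i][j] * L[j][i] for i in range(m) for j in range(m))
    return trace / (m - 1) ** 2


# Toy data (assumed for illustration): y_dependent is a noisy copy of x,
# y_independent is drawn without reference to x.
rng = random.Random(0)
x = [[rng.gauss(0.0, 1.0)] for _ in range(100)]
y_dependent = [[xi[0] + 0.1 * rng.gauss(0.0, 1.0)] for xi in x]
y_independent = [[rng.gauss(0.0, 1.0)] for _ in range(100)]

K_x = rbf_kernel_matrix(x)
hsic_dep = hsic_biased(K_x, rbf_kernel_matrix(y_dependent))
hsic_ind = hsic_biased(K_x, rbf_kernel_matrix(y_independent))
print(hsic_dep > hsic_ind)
```

The estimate depends on the data only through the two kernel matrices, which is the property the scaling argument in Section 5 relies on.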

4. Motivation

As mentioned previously, our goal is to acquire points that will give us as much information as possible about the still-unlabeled points, thereby increasing the confidence of the model's predictions. As we will demonstrate shortly, there are situations where modern active learning methods do not select the points that optimally decrease the uncertainty of prediction on the unlabeled data. More formally, the examples below show that the choice of $x \in U$ that optimizes oft-used acquisition functions may not be optimal for decreasing the entropy of predictions ($\sum_{x' \in U, x' \neq x} H(y_{x'})$) over the remaining points post-acquisition. If we wish to optimize test-set accuracy, this can be problematic: for well-calibrated models, we should expect worse average entropy (uncertainty) to roughly correspond to an increase in the number of errors. This is similar to cross-entropy loss being a good proxy for 0-1 loss. Below we illustrate our points with two examples and with results on EMNIST.

Example 1. Suppose we have an image dataset which is highly imbalanced, with 90% cars, 9% planes, and 1% ships. Then a small increase in accuracy for the car category would lead to a much larger reduction in the overall error rate than a large increase in accuracy for the ships category. However, given the dominance of the cars category in the loss, the uncertainty of prediction on the ships category is likely to be much higher. Thus the max-entropy criterion is more likely to choose points from the pool set that turn out to be ships.

Example 2. Similar to the previous example, here we demonstrate that picking the point with the most information with respect to the model parameters is not optimal for decreasing the prediction uncertainty on the still-unlabeled data.
The main idea behind this example is that if you have points which form a non-trivial fraction of the dataset and have a lot of correlation between their predictive distributions, then while any of the points may not give a lot of information about which underlying model is the best one, getting the labels for one of the points will greatly reduce the predictive uncertainty for the labels for the other points given the predictive distribution correlation. As these points are a non-trivial fraction of the dataset, reducing the predictive uncertainty on them will have a big impact on the error rate. The example in the Appendix formalizes this intuition. These observations motivate our formulation of the Information Condensing Active Learning (ICAL) acquisition function that selects the set of points whose acquisition would maximize the information gained about the predictive distribution on the unlabeled set. As posterior prediction entropy should be minimized by maximizing Mutual Information (MI) between predictions for unlabeled points and prediction for selected points, ideally ICAL would use MI or related criteria to select points.

EMNIST results

In Figure 1, we show the average posterior entropy of the model's predictions for our method compared to BatchBALD, BayesCoreset, and Random acquisition. As can be seen from the figure, ICAL reduces the average posterior entropy much more effectively than the other methods. Details of this experiment are in Section 6.2.

5. Information Condensing Active Learning (ICAL)

In this section we present our acquisition function. As before, let $D_{train}$ be the training points, $D_U$ the unlabeled points, $y_x$ the random variable denoting the prediction for $x$ by the model trained on $D_{train}$, and $d$ the dependency measure being used. Then

$$\alpha_{ICAL}(\{x_1, \ldots, x_B\}, d) = \frac{1}{|D_U|} \sum_{x' \in D_U} d(y_{x'}, \{y_{x_1}, \ldots, y_{x_B}\}),$$

that is, we try to find the batch that has the highest average dependency with respect to the unlabeled points' marginal predictive distributions.
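The definition of $\alpha_{ICAL}$ can be sketched directly from Monte Carlo samples of the predictions. The code below is a naive illustration under assumptions, not the scaled algorithm of the next subsection: it uses a Gaussian kernel with unit length scale and a biased HSIC estimate as the dependency measure $d$, forms the joint batch prediction by concatenating per-point predictions, and runs on synthetic scalar predictions. The names `alpha_ical`, `preds`, and the toy pool are hypothetical.

```python
import math
import random


def kernel_matrix(samples):
    """Gaussian kernel matrix over m sampled prediction vectors."""
    return [[math.exp(-0.5 * sum((u - v) ** 2 for u, v in zip(a, b)))
             for b in samples] for a in samples]


def hsic(K, L):
    """Biased sample HSIC, trace(K H L H) / (m - 1)^2, via double centering."""
    m = len(K)
    rows = [sum(r) / m for r in K]
    cols = [sum(K[i][j] for i in range(m)) / m for j in range(m)]
    total = sum(rows) / m
    return sum((K[i][j] - rows[i] - cols[j] + total) * L[j][i]
               for i in range(m) for j in range(m)) / (m - 1) ** 2


def alpha_ical(batch, pool, preds, dependency=hsic):
    """Average dependency between each pool point and the joint batch prediction.

    preds[x][s] is the prediction vector for point x under model sample s;
    index s must refer to the same sampled model for every x.
    """
    m = len(preds[pool[0]])
    joint = [[v for x in batch for v in preds[x][s]] for s in range(m)]
    K_batch = kernel_matrix(joint)
    total = sum(dependency(kernel_matrix(preds[x]), K_batch) for x in pool)
    return total / len(pool)


# Toy pool: a, b, c share a latent factor across model samples; d does not.
rng = random.Random(0)
m = 60
latent = [rng.gauss(0.0, 1.0) for _ in range(m)]
preds = {
    "a": [[z + 0.1 * rng.gauss(0.0, 1.0)] for z in latent],
    "b": [[z + 0.1 * rng.gauss(0.0, 1.0)] for z in latent],
    "c": [[z + 0.1 * rng.gauss(0.0, 1.0)] for z in latent],
    "d": [[rng.gauss(0.0, 1.0)] for _ in range(m)],
}
pool = ["a", "b", "c", "d"]
score_a = alpha_ical(["a"], pool, preds)
score_d = alpha_ical(["d"], pool, preds)
print(score_a > score_d)
```

In this toy pool, labeling `a` is informative about `b` and `c` as well, so it scores higher than the isolated point `d`, which mirrors the intuition of Example 2.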

Scaling α ICAL estimation

As we mentioned in the introduction, we can use MI as the dependency measure $d$, but it is tricky to estimate MI using just samples from the distribution, particularly for high-dimensional or continuous variables. Furthermore, MI estimators are usually not differentiable. Thus if we wanted to apply ICAL to domains where the pool set is continuous and infinite (for example, if we wanted to query gene expression perturbations for a cell), we would run into obstacles. This motivates our choice of HSIC as the dependency measure. In addition to being differentiable, HSIC has better empirical sample complexity for measuring dependency than estimators for MI. Indeed, popular MI estimators have been found to have variance with respect to ground-truth MI that increases exponentially with the MI value (Song and Ermon, 2019). HSIC has also been successfully used in the related context of feature selection via dependency maximization (Da Veiga, 2015; Song et al., 2012). Furthermore, HSIC is the Maximum Mean Discrepancy (MMD) between the joint distribution and the product of marginals. MMD is known to be $\leq \frac{1}{2}$ KL-divergence (Ramdas et al., 2015), and thus HSIC $\leq \frac{1}{2}$ MI. We therefore use HSIC as the dependency measure for the rest of the paper.

Naively implementing $\alpha_{ICAL}(B, \mathrm{HSIC})$ would require $O(|D_U| m^2 B \cdot C)$ steps per candidate batch being evaluated, where $C$ is the number of classes and $m$ is the number of samples taken from $p(y_{1:B})$ ($O(m^2 B)$ to estimate HSIC, which we need to do $|D_U|$ times). However, recall that HSIC is a function solely of the kernel matrices $k^x$ corresponding to the random variables (Appendix), in this case $y_x$, $x \in D_U$. Now one can define the matrix $k^* = \frac{1}{|D_U|} \sum_{x \in D_U} k^x$. We can then prove the following propositions (proofs are in the Appendix).

Proposition 1. $k^*$ is a valid kernel matrix.

Proposition 2. $\sum_{x' \in D_U} \widehat{\mathrm{HSIC}}(k^{x'}, k^{x \in B}) = \widehat{\mathrm{HSIC}}(\sum_{x \in D_U} k^x, k^{x \in B})$, where $k^{x \in B} = k^{x_1}, \ldots, k^{x_B}$, $x_i \in B$, and $\widehat{\mathrm{HSIC}}$ denotes the sample estimate of HSIC.

Using this reformulation, we only have to compute $k^* = \frac{1}{|D_U|} \sum_{x \in D_U} k^x$ once per acquisition round. This lowers the computation cost to $O(|D_U| m^2 \cdot C + m^2 B \cdot C)$. Estimating HSIC would still require $m$ to increase very rapidly with $B$ (proportional to the dimension of the joint distribution). To get around this but still maintain batch diversity, we try two strategies.

For regular ICAL, we average the kernel matrices of the points in the candidate batch. We then subsample $r$ points from $D_U$ every time a point is added to the batch and only compare the dependency with those. This effectively introduces noise into the HSIC estimation. We find in practice that this is sufficient to acquire a diverse batch, as evidenced by Figure 3. This seems to be the case even for very large batches, and has the added benefit of further lowering the computational cost of evaluating a candidate batch to $O(r m^2 \cdot C + 2 \cdot m^2 \cdot C)$. We use $r = 200$ for all our experiments.

We develop another strategy, which we call ICAL-pointwise, that computes the marginal increase in dependence as a result of adding a point to the batch. If a point is highly correlated with elements of the current batch, the marginal increase would be negligible, making the point much less likely to be selected. The two variants perform very similarly despite ICAL-pointwise's slight advantage in the early acquisitions. ICAL-pointwise, however, requires much less time for equivalent performance, which we discuss briefly in Section 5.2 and more fully in the Appendix. For ease of presentation, we use ICAL in the Results section and defer the full description and evaluation of ICAL-pointwise to the Appendix.

As there are an exponential number of candidate batches, an exhaustive search to find the optimal batch is infeasible. For ICAL we use a greedy forward selection strategy to build the batch and find that it performs well empirically.
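Proposition 2 rests on the sample HSIC being linear in its first kernel matrix, which can be checked numerically. The sketch below assumes the standard biased estimator $\mathrm{trace}(K H L H)/(m-1)^2$ and Gaussian kernels on random scalar samples; the estimator actually used is specified in the Appendix, and the names here are illustrative only.

```python
import math
import random


def kernel_matrix(samples):
    """Gaussian kernel matrix with unit length scale (an assumption for this check)."""
    return [[math.exp(-0.5 * sum((u - v) ** 2 for u, v in zip(a, b)))
             for b in samples] for a in samples]


def hsic(K, L):
    """Biased sample HSIC, trace(K H L H) / (m - 1)^2, via double centering of K."""
    m = len(K)
    rows = [sum(r) / m for r in K]
    cols = [sum(K[i][j] for i in range(m)) / m for j in range(m)]
    total = sum(rows) / m
    return sum((K[i][j] - rows[i] - cols[j] + total) * L[j][i]
               for i in range(m) for j in range(m)) / (m - 1) ** 2


rng = random.Random(1)
m, pool_size = 30, 5
pool_kernels = [kernel_matrix([[rng.gauss(0.0, 1.0)] for _ in range(m)])
                for _ in range(pool_size)]
K_batch = kernel_matrix([[rng.gauss(0.0, 1.0)] for _ in range(m)])

# Left-hand side: one HSIC evaluation per pool point.
lhs = sum(hsic(K, K_batch) for K in pool_kernels)

# Right-hand side: sum the kernel matrices once, then a single HSIC evaluation.
K_sum = [[sum(K[i][j] for K in pool_kernels) for j in range(m)]
         for i in range(m)]
rhs = hsic(K_sum, K_batch)

print(abs(lhs - rhs) < 1e-9)
```

Because the summed (or averaged) matrix does not depend on the candidate batch, it can be computed once per acquisition round and reused for every candidate.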
As the $\arg\max$ over all of $D_U$ has to be computed every time a new point is selected for the batch, and we have to perform this operation for each point that is added to the batch, this gives a computation cost of $O((r^2 m^2 + |D_U| m^2 B + m^2 B) \cdot C) = O(|D_U| m^2 B \cdot C)$. It is possible that global nonlinear optimization of the batch ICAL criterion would work even better than greedy optimization already does relative to state-of-the-art methods. Efficient techniques for doing this optimization are, however, not obvious and are beyond the scope of this work. Even if we used gradient-based techniques to construct the batch, gradient-based optimization for nonlinear problems usually only leads to local and not global optima. We note, however, that greedy forward selection is a popular technique that has been successfully used in a large variety of contexts (Da Veiga, 2015; Blanchet et al., 2008). Optimizations to scale ICAL even further, as well as the full algorithm, are detailed in the Appendix.
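The greedy forward selection loop itself is generic and can be sketched independently of the scoring criterion. In the sketch below, `greedy_batch` and the `coverage` score are hypothetical: the score is a simple set-coverage stand-in chosen so the result is easy to verify by hand, not the HSIC-based ICAL criterion, and the per-step subsampling of $r$ pool points is omitted.

```python
def greedy_batch(pool, batch_size, score):
    """Greedy forward selection: repeatedly add the candidate that maximizes score(batch)."""
    batch, remaining = [], list(pool)
    for _ in range(min(batch_size, len(remaining))):
        # One arg max over all remaining candidates per added point.
        best = max(remaining, key=lambda x: score(batch + [x]))
        batch.append(best)
        remaining.remove(best)
    return batch


# Hypothetical stand-in score (NOT the ICAL criterion): number of distinct items covered.
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {1, 2}, "d": {5, 6}}


def coverage(batch):
    return len(set().union(*(covers[x] for x in batch)))


selected = greedy_batch(list(covers), 2, coverage)
print(selected)  # ['a', 'd']
```

The first step picks `a` (three items covered); the second picks `d`, which adds two new items, rather than `b` or `c`, which overlap with `a`. The same overlap-avoiding behavior is what makes a dependency-based score favor diverse batches.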

6. Results

We demonstrate the effectiveness of ICAL using standard image datasets including MNIST (LeCun et al., 1998), Repeated MNIST (Kirsch et al., 2019), Extended MNIST (EMNIST) (Cohen et al., 2017), fashion-MNIST, and CIFAR-10 (Krizhevsky et al., 2009). We compare ICAL with three state-of-the-art methods for batched active learning acquisition: BatchBALD, FASS, and BayesCoreset. We also compare against BALD and Max Entropy (MaxEnt), which are not explicitly designed for batched selection, as well as against a Random acquisition baseline. Details of the acquisition functions are in the Appendix. ICAL consistently outperforms BatchBALD, FASS, and BayesCoreset on accuracy and negative log likelihood (NLL).

Throughout our experiments, for each dataset we hold out a fixed test set for evaluating model performance after training and a fixed validation set for training purposes. We retrain the model from the beginning after each acquisition to avoid correlation of subsequently trained models, and we use early stopping after 3 (6 for ResNet18) consecutive epochs of validation accuracy drop. Following Gal et al. (2017), we use neural networks with MC dropout (Gal and Ghahramani, 2016) as a variational approximation for Bayesian neural networks. We use a mixture of rational quadratic kernels for HSIC, which has been used successfully with kernel-based statistical dependency measures in the past, with mixture length scales of $\{0.2, 0.5, 1, 2, 5\}$ as in Bińkowski et al. (2018). All models are optimized with the Adam optimizer (Kingma and Ba, 2014) using a learning rate of 0.001 and betas (0.9, 0.999). The small batch size experiments are repeated 6 times with different seeds and a different initial training set for each run, with a balanced label distribution across all classes. The same set of seeds is used for different methods on the same task. Eight different seeds are used for the large batch size experiments on the CIFAR datasets.
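A mixture of rational quadratic kernels can be sketched in a few lines. The parametrization below, $k_\alpha(x, y) = (1 + \|x - y\|^2 / (2\alpha))^{-\alpha}$ summed over the listed values, is an assumption: it treats the five mixture values as the $\alpha$ parameter of each component, and the exact form and weighting used in the experiments are not stated in the text. The function name is illustrative.

```python
def rq_mixture_kernel(x, y, alphas=(0.2, 0.5, 1.0, 2.0, 5.0)):
    """Sum of rational quadratic kernels (1 + ||x - y||^2 / (2 * alpha))^(-alpha)."""
    sq_dist = sum((u - v) ** 2 for u, v in zip(x, y))
    return sum((1.0 + sq_dist / (2.0 * a)) ** (-a) for a in alphas)


# Example inputs: class-probability vectors predicted for one point under
# different sampled models (values chosen for illustration).
k_same = rq_mixture_kernel([0.1, 0.9], [0.1, 0.9])  # every component equals 1
k_near = rq_mixture_kernel([0.1, 0.9], [0.2, 0.8])
k_far = rq_mixture_kernel([0.1, 0.9], [0.9, 0.1])
print(k_same > k_near > k_far > 0)
```

Mixing several values gives the kernel both heavy-tailed and fast-decaying components, so it stays sensitive to differences between prediction samples at several scales without tuning a single bandwidth.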

6.1. MNIST and Repeated MNIST

We first examine ICAL's performance on MNIST, a standard image dataset of handwritten digits. We further test the scenario where duplicated data points exist (repeated MNIST), as proposed in Kirsch et al. (2019). Each data point in MNIST is replicated three times in repeated-MNIST, and isotropic Gaussian noise with std=0.1 is added after normalizing the image. We use a CNN consisting of two convolutional layers with 32 and 64 5x5 convolution filters, each followed by MC dropout, max-pooling and ReLU. One fully connected layer with 128 hidden units and MC dropout is used after the convolutional layers, and the output soft-max layer has dimension 10. All dropout uses a probability of 0.5, and the architecture achieves over 99% accuracy on full MNIST. We use a validation set of size 1024 for MNIST and 3072 for repeated-MNIST, and a balanced test set of size 10,000 for both datasets. All models are trained for up to 30 epochs for MNIST and up to 40 epochs for repeated-MNIST. We sample an initial training set of size 20 (2 per class) and conduct 30 acquisitions of batch size 10 on both datasets, and we use 50 MC dropout samples to estimate the posterior.

The test accuracy and negative log-likelihood (NLL) are shown in Figure 2. ICAL significantly improves the NLL and outperforms all other baselines on accuracy, with higher margins in the earlier acquisition rounds. The performance is consistent across all runs (the variance is smaller than for other baselines), and is robust even in the repeated-MNIST setup, where all the other greedy methods show worse performance. We check how often replicas of a single sample were included in an acquired batch and, as shown in Appendix Figure 8, our method (as well as BatchBALD, BayesCoreset and Random) acquired no redundant samples, whereas FASS and Max Entropy acquired up to 3 copies of some samples.
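The repeated-MNIST construction described above can be sketched as follows. This is a minimal illustration under assumptions: images are taken to be already-normalized flat lists of floats, "replicated three times" is read as three noisy copies in total, and noise is drawn independently for every copy. The function name and the tiny stand-in images are hypothetical.

```python
import random


def make_repeated(images, labels, copies=3, noise_std=0.1, seed=0):
    """Replicate each (already normalized) image and add isotropic Gaussian noise."""
    rng = random.Random(seed)
    out_images, out_labels = [], []
    for image, label in zip(images, labels):
        for _ in range(copies):
            # Independent noise per pixel and per copy (an assumption).
            out_images.append([p + rng.gauss(0.0, noise_std) for p in image])
            out_labels.append(label)
    return out_images, out_labels


# Two tiny stand-in "images" with 4 pixels each.
images = [[0.0, 0.5, 1.0, 0.5], [1.0, 0.0, 0.0, 1.0]]
labels = [7, 3]
rep_images, rep_labels = make_repeated(images, labels)
print(len(rep_images), rep_labels)  # 6 [7, 7, 7, 3, 3, 3]
```

Near-duplicates of this kind are what penalize acquisition functions that score points independently, since the top-scoring points tend to be copies of one another.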

6.2. EMNIST

We then extend the task to a more sophisticated dataset, Extended-MNIST, which consists of 47 classes of 28x28 images of both digits and letters. We use the balanced EMNIST, where each class has 2400 training examples. We use a validation set of size 16384 and a test set of size 18800 (400 per class), and train for up to 40 epochs. We use a CNN consisting of three convolutional layers with 32, 64, and 128 3x3 convolution filters, each followed by MC dropout, 2x2 max-pooling and ReLU. A fully connected layer with 512 hidden units and MC dropout is used after the convolutional layers. We use an initial training set of 47 (1 per class) and make 60 acquisitions of batch size 5. 50 MC dropout samples are used to estimate the posterior. The results are in Figure 4. We do substantially better in terms of both accuracy and NLL compared to all other methods. A clue as to why our method outperforms on EMNIST can be found in Figure 3. ICAL is able to acquire more diverse and balanced batches, while all other methods have over- or under-represented classes (note that BatchBALD, Random and MaxEnt each entirely miss examples from one of the classes). This indicates that our method is much more robust in terms of performance even when the number of classes increases, whereas the alternatives degrade.

6.3. Fashion-MNIST

We also examine ICAL's performance on fashion-MNIST, which consists of 10 classes of 28x28 images of Zalando articles. We use a validation set of size 3072 and a test set of size 10000 (1000 per class), and train for up to 40 epochs. The network architecture is the same as the one used in the MNIST task. We use an initial training set of 20 (2 per class) and make 30 acquisitions of batch size 10. 100 MC dropout samples are used to estimate the posterior. As shown in Figure 4, we again do significantly better in terms of both accuracy and NLL compared to all other methods. Note that almost all baselines other than ICAL were inferior to the Random baseline, showing the robustness of our method.

6.4. CIFAR

Finally, we test our method on the CIFAR-10 and CIFAR-100 datasets (Krizhevsky et al., 2009) in a large batch size setting. CIFAR-10 consists of 10 classes with 6000 images per class, whereas CIFAR-100 has 100 classes with 600 images per class. We use a validation set of size 1024 and a balanced test set of size 10,000 for both datasets. For CIFAR-10, we start with an initial training set of 10000 examples (1000 per class), while for CIFAR-100 we start with 20000 examples (200 per class). We do 10 acquisitions on CIFAR-10 and 7 acquisitions on CIFAR-100 with a batch size of 3000. We use a ResNet18 with 2 additional fully connected layers with MC dropout, and train for up to 60 epochs with learning rate 0.1 (with early stopping allowed). We run with 8 different seeds. The results are in Figure 5. Note that we are unable to compare against BatchBALD for either CIFAR dataset, as it runs out of memory.

For CIFAR-10, ICAL dominates all other methods for all acquisitions except two: when the acquired dataset size is 19000 and when it is 28000. ICAL also has the highest area under the curve (AUC) for accuracy among all methods, with p-value $\leq 0.007$ except against BALD and Max Entropy, for which our AUC is better with p-values of 0.24 and 0.15 respectively. ICAL also achieves the highest accuracy at the end of all 10 acquisitions. With CIFAR-100, ICAL outperforms a majority of the methods on all acquisitions. Furthermore, ICAL again finishes with the highest accuracy by a significant margin at the end of the acquisition rounds, and it again has the highest AUC among all methods. Detailed comparison results are in Appendix Table 2.

7. Conclusion

We develop a novel batch-mode active learning acquisition function, ICAL, that is model agnostic and applicable to both classification and regression tasks (as it relies only on samples from the posterior predictive distribution). We develop key optimizations that enable us to scale our method to large acquisition batch and unlabeled set sizes. We show that we are robustly able to outperform state-of-the-art methods for batch-mode active learning on a variety of image classification tasks in a deep neural network setting.

Here $p^k_{ij}$ is read as the probability that model $\omega_k$ assigns label $j$ to point $x_i$, consistent with Table 1:

$$p^k_{1j} = 1; \quad j = k,\ 1 \leq k \leq 3$$
$$p^k_{14} = 1; \quad 4 \leq k \leq 10$$
$$p^k_{i1} = 1,\ p^{10}_{i2} = 1; \quad 1 \leq k \leq 9,\ 2 \leq i \leq L$$

| | $\omega_1$ | $\omega_2$ | $\omega_3$ | $\omega_4$ | $\omega_5$ | $\omega_6$ | $\omega_7$ | $\omega_8$ | $\omega_9$ | $\omega_{10}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| $x_1$ | 1 | 2 | 3 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| $x_2 \ldots x_L$ | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 |

Table 1: Labels that the different points $x_i$ take with probability 1 under different models. The columns are the different models $\omega_k$, and the rows are the different points.

Given that we have no other information about the models, we update the posterior probabilities for the models as follows: if a model $\omega_k$ outputs label $l$ for a point $x$ but, after acquisition, the label for $x$ is not $l$, then we know that it is not the correct model and thus its posterior probability is 0 (so it is eliminated). Otherwise we have no way of distinguishing between the remaining models, so they all have equal posterior probability. Then for $x_1$ the mutual information is

$$I[y_1, \omega | x_1, D_{train}] = H[y_1 | x_1] - E_{p(\omega | D_{train})}[H[y_1 | x_1, \omega]] = 0.94.$$

For $x_2 \ldots x_L$, $I[y_{2 \ldots L}, \omega | x_{2 \ldots L}, D_{train}] = 0.325$. However, selecting $x_1$ would decrease the expected posterior entropy $H[y_{2 \ldots L} | x_{2 \ldots L}, x_1, y_1, D_{train}]$ from 0.325 to only 0.287. Acquiring any of $x_{2 \ldots L}$ instead of $x_1$, however, would decrease that entropy to 0, which would cause a much larger decrease in the expected posterior entropy averaged over $x_{1 \ldots L}$ if $L$ is large enough. The detailed calculations are given in the subsection "Derivation for Example 2".
While $x_{2 \ldots L}$ may not contribute much to the entropy of the joint predictive distribution or to the MI with respect to the model parameters compared to $x_1$, collectively they will be weighted $L - 1$ times more than $x_1$ when looking at the accuracy. We should thus expect a well-calibrated model to have a higher uncertainty, and thus make a lot more errors, on $x_{2 \ldots L}$ if $x_1$ is acquired than if any of $x_{2 \ldots L}$ is acquired. For instance, in the above example, as $L$ increases, the expected error rate would approach $\approx 0.7 \times (1/7 \times 6/7) \times 2 = 0.17$ if $x_1$ is acquired, as the errors for $x_{2 \ldots L}$ are correlated (the factor 0.7 arises because, 0.3 of the time, the label of $x_1$ also fixes the true model, reducing the error rate on all $x$ to 0), whereas the rate would approach 0 were any of $x_{2 \ldots L}$ to be acquired.

Derivation for Example 2

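For $x_1$, the mutual information between the predicted label $y_1$ and the model parameters is

$$I[y_1, \omega | x_1, D_{train}] = H[y_1 | x_1] - E_{p(\omega | D_{train})}[H[y_1 | x_1, \omega]] = 0.94,$$

and for $x_2 \ldots x_L$ the corresponding value is $I[y_{2 \ldots L}, \omega | x_{2 \ldots L}, D_{train}] = 0.325$.

After acquiring $x_1$, assuming the true label for $x_1$ is 1, we update the posterior over the model parameters such that $p'(\omega_1)|_{y_1=1} = 1$ and $p'(\omega_k)|_{y_1=1} = 0$ for $1 < k \leq 10$. Then the expected averaged posterior entropy for $x_{2 \ldots L}$ is:

$$\frac{1}{L-1} \sum_{i=2}^{L} H[y_i | x_i]\big|_{y_1=1} = \frac{1}{L-1} \sum_{i=2}^{L} H\left[\sum_{k=1}^{10} p(y_i | x_i, \omega_k)\, p'(\omega_k)|_{y_1=1}\right] = \frac{1}{L-1} \times (L-1) \times \left(-(1 \times \log(1) + 0 \times \log(0))\right) = 0$$

Similarly, we can compute the cases where the true label for $x_1$ is 2, 3, or 4:

$$\frac{1}{L-1} \sum_{i=2}^{L} H[y_i | x_i]\big|_{y_1=2} = 0$$

$$\frac{1}{L-1} \sum_{i=2}^{L} H[y_i | x_i]\big|_{y_1=3} = 0$$

$$\frac{1}{L-1} \sum_{i=2}^{L} H[y_i | x_i]\big|_{y_1=4} = \frac{1}{L-1} \times (L-1) \times \left(-\left(\frac{6}{7} \log\left(\frac{6}{7}\right) + \frac{1}{7} \log\left(\frac{1}{7}\right)\right)\right) = 0.41$$

The expectation of the averaged posterior entropy with respect to the predicted label $y_1$ (since we do not know the true label) is:

$$H[y_{2 \ldots L} | x_{2 \ldots L}, x_1, y_1, D_{train}] = E_{y_1 \sim p(y_1 | D_{train})}\left[\frac{1}{L-1} \sum_{i=2}^{L} H[y_i | x_i]\big|_{y_1}\right] = \frac{1}{10} \times 0 + \frac{1}{10} \times 0 + \frac{1}{10} \times 0 + \frac{7}{10} \times 0.41 = 0.287$$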
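The numbers in this derivation can be reproduced with a short script. The sketch below encodes Table 1 directly, uses natural logarithms, and assumes a uniform prior of 1/10 over the ten models as in the text; the helper names are illustrative.

```python
import math
from collections import Counter


def entropy(probs):
    """Shannon entropy in nats, with 0 * log(0) treated as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def predictive(labels, weights):
    """Label distribution when model k predicts labels[k] with weight weights[k]."""
    dist = Counter()
    for label, w in zip(labels, weights):
        dist[label] += w
    return list(dist.values())


# Table 1: label of x_1 and of every x_i (i >= 2) under models omega_1..omega_10.
x1_labels = [1, 2, 3, 4, 4, 4, 4, 4, 4, 4]
xi_labels = [1, 1, 1, 1, 1, 1, 1, 1, 1, 2]
prior = [0.1] * 10

# Each model is deterministic, so E[H[y | x, omega]] = 0 and MI = predictive entropy.
mi_x1 = entropy(predictive(x1_labels, prior))
mi_xi = entropy(predictive(xi_labels, prior))

# Expected entropy of y_i (i >= 2) after observing y_1: models inconsistent with
# the observed label are eliminated, the survivors are equally likely.
expected_entropy = 0.0
for label in sorted(set(x1_labels)):
    survivors = [k for k in range(10) if x1_labels[k] == label]
    p_label = sum(prior[k] for k in survivors)
    posterior = [1.0 / len(survivors)] * len(survivors)
    expected_entropy += p_label * entropy(
        predictive([xi_labels[k] for k in survivors], posterior))

print(round(mi_x1, 2), round(mi_xi, 3), round(expected_entropy, 3))
# 0.94 0.325 0.287
```

The point $x_1$ has the largest mutual information with the model parameters (0.94 versus 0.325), yet acquiring it leaves an expected entropy of 0.287 on each of the remaining $L - 1$ points, whereas acquiring any $x_i$ with $i \geq 2$ resolves all of them.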

Baseline acquisition function details

Max entropy selects the points that maximize the predictive entropy

$$\alpha(x, M) = H(y \mid x, D_{\text{train}}) = -\sum_{c} p(y = c \mid x, D_{\text{train}}) \log p(y = c \mid x, D_{\text{train}})$$

BatchBALD. BatchBALD (Kirsch et al., 2019) tries to find a batch of points that has the highest mutual information with respect to the model parameters. BALD is the non-batched version of BatchBALD. Formally,

$$\alpha_{\text{BatchBALD}}(\{x_1, \ldots, x_B\}, p(\omega)) = H(y_1, \ldots, y_B) - E_{p(\omega)}\left[H(y_1, \ldots, y_B \mid \omega)\right]$$

Filtered active submodular selection (FASS). FASS (Wei et al., 2015) selects the $\beta \times B$ most uncertain points, denoted $\mathcal{B}'$, and then subselects $B$ points that are as representative of $\mathcal{B}'$ as possible. For the measure of uncertainty, FASS uses the entropy $H(y \mid x, D_{\text{train}})$. To measure the representativeness of $\mathcal{B}$ with respect to $\mathcal{B}'$, FASS tries to choose $\mathcal{B}$ to maximize the following function:

$$f(\mathcal{B}) = \sum_{y \in Y} \sum_{i \in V^y} \max_{s \in \mathcal{B} \cap V^y} w(i, s)$$

Here $V^y \subseteq \mathcal{B}'$ is the set of points in $\mathcal{B}'$ with predicted label $y$, and $w(i, s) = d - \|x_i - x_s\|_2^2$ is the similarity function between points indexed by $i, s$, where $x_i, x_s \in X$ and $d$ is the maximum distance between two points. The idea here is that if a point in $\mathcal{B}$ already exists that is close to some point $x' \in \mathcal{B}'$, then $f(\mathcal{B})$ will favor adding points to the batch that are close to points other than $x'$, thus increasing the batch diversity. Note that FASS is equivalent to Max Entropy if $\beta = 1$.
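To make these baselines concrete, the sketch below computes the max entropy score, the pointwise BALD score, and the FASS objective $f(\mathcal{B})$ from Monte Carlo samples of the predictive distribution. It is an illustration rather than the implementation used in the experiments: all function names are hypothetical, a small constant is added inside the logarithm for numerical safety, $d$ is taken to be the largest squared distance among the candidates so that $w \ge 0$, and a class with no batch member is assumed to contribute 0 to $f$.

```python
import numpy as np

def predictive_entropy(probs):
    """probs: (m, N, C) class probabilities from m posterior samples.
    Returns the (N,) entropy of the averaged prediction (max entropy score)."""
    mean = probs.mean(axis=0)
    return -(mean * np.log(mean + 1e-12)).sum(axis=1)

def bald(probs):
    """Pointwise BALD: predictive entropy minus expected per-model entropy."""
    per_model = -(probs * np.log(probs + 1e-12)).sum(axis=2)   # (m, N)
    return predictive_entropy(probs) - per_model.mean(axis=0)

def max_entropy_batch(probs, B):
    """Indices of the B points with the highest predictive entropy."""
    return np.argsort(-predictive_entropy(probs))[:B]

def fass_objective(batch, candidates, X, predicted):
    """f(B) over the candidate set B', with w(i, s) = d - ||x_i - x_s||^2."""
    cand = np.asarray(candidates)
    sq = ((X[cand, None, :] - X[None, cand, :]) ** 2).sum(axis=-1)
    w = sq.max() - sq                       # assumption: d = max squared distance
    in_batch = np.isin(cand, batch)
    total = 0.0
    for y in np.unique(predicted[cand]):
        members = predicted[cand] == y      # V^y
        chosen = members & in_batch         # B intersected with V^y
        if chosen.any():
            total += w[np.ix_(members, chosen)].max(axis=1).sum()
    return float(total)

# Toy data: point 0 has model disagreement, point 1 is uniformly uncertain
# under every model, point 2 is confidently predicted.
probs = np.array([[[1.0, 0.0], [0.5, 0.5], [0.9, 0.1]],
                  [[0.0, 1.0], [0.5, 0.5], [0.9, 0.1]]])
X = np.array([[0.0], [1.0], [10.0]])
predicted = np.array([0, 0, 1])

print(max_entropy_batch(probs, 2), bald(probs))
print(fass_objective([0, 2], [0, 1, 2], X, predicted),
      fass_objective([0, 1], [0, 1, 2], X, predicted))
```

In the toy data, max entropy cannot distinguish points 0 and 1, while BALD assigns a high score only to point 0, where the models disagree. The FASS objective is larger for the batch that covers both predicted classes than for the batch of two nearby points.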

Bayesian Coresets

Pinsler et al. (2019) try to build a batch such that the log posterior after acquiring that batch best approximates the complete data log posterior (i.e. the log posterior after acquiring the entire pool set). Their approach closely follows the general Bayesian Coreset (Campbell and Broderick, 2018) approach, which constructs a weighted subset of data that approximates the full dataset. Crucially, Pinsler et al. (2019) assume that the posterior predictive distribution $Y_p$ of a point $p$ is independent of the corresponding distribution $Y_{p'}$ of another point $p'$, an assumption we do not make. We show in the next section why avoiding such an assumption lets us more effectively minimize the error with respect to the test distribution versus just maximizing information gain for the model posterior. As the method of Pinsler et al. (2019) requires a variable batch size whereas all other methods (including ours) use a fixed batch size, for fairness of comparison, if the batch for this approach is smaller than the batch size being used, we fill the rest of the batch with random points. In practice, we only observe this being necessary for CIFAR.

Random. The points are selected uniformly at random from the unlabeled pool. Thus $\alpha(x, M)$ is the uniform distribution.

Further statistical background

A divergence $\Lambda$ between two distributions $P, Q$ is a measure of the discrepancy or difference between them. A key property of a divergence is that it is 0 if and only if $P, Q$ are the same distribution. In this paper, we will be using the KL divergence and the MMD, which are respectively defined as

$$D_{KL}(P \| Q) = -\sum_{x \in X} P(x) \log\left(\frac{Q(x)}{P(x)}\right)$$
$$MMD_k^2(P, Q) = \|\mu_k(P) - \mu_k(Q)\|_{H}^2 = E\left[k(X, X') + k(Y, Y') - 2k(X, Y)\right]$$

where $X, X'$ are independent draws from $P$, $Y, Y'$ are independent draws from $Q$, $k$ is a kernel in the Reproducing Kernel Hilbert Space (RKHS) $H$, and $\mu_k$ is the mean embedding of the distribution into $H$ as per the kernel $k$. We can then use the notion of divergence to define the dependency $d$ between a set of random variables $X_{1:n}$ as follows:

$$d(X_{1:n}) = \Lambda(P_{1:n}, \otimes_i P_i)$$
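Both divergences are straightforward to estimate. The sketch below is a minimal illustration, not the code used in the paper: `kl_divergence` works on discrete distributions given as probability vectors, and `mmd_squared` is the biased (V-statistic) sample estimate of $MMD_k^2$ with a Gaussian kernel. The kernel choice, the `bandwidth` parameter and all function names are assumptions.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions, using 0 * log(0) = 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(-(p[mask] * np.log(q[mask] / p[mask])).sum())

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a (n, d) and b (m, d)."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def mmd_squared(x, y, bandwidth=1.0):
    """Biased estimate of MMD^2: mean k(X, X') + mean k(Y, Y') - 2 mean k(X, Y)."""
    return float(rbf_kernel(x, x, bandwidth).mean()
                 + rbf_kernel(y, y, bandwidth).mean()
                 - 2.0 * rbf_kernel(x, y, bandwidth).mean())

x = np.array([[0.0], [1.0], [2.0]])
same = mmd_squared(x, x)            # identical samples
shifted = mmd_squared(x, x + 3.0)   # samples shifted by 3
kl_same = kl_divergence([0.5, 0.5], [0.5, 0.5])
kl_diff = kl_divergence([0.5, 0.5], [0.9, 0.1])
print(same, shifted, kl_same, kl_diff)
```

Both quantities vanish when the two arguments coincide and are positive otherwise, matching the defining property of a divergence stated above.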

7.1. Further scaling to large batch sizes

To scale to large batch sizes, instead of adding points to the batch to be acquired one at a time, we can add points in minibatches of size $L$. While this comes at a possible cost to the diversity of the batch, we find that the tradeoff is acceptable for the datasets we experimented with. This gives a final computation cost of $O\left(\frac{|D_U| \, m^2 \, B \cdot C}{L}\right)$ where $C$ is the number of classes. By contrast, the corresponding runtime for BatchBALD is $O(|D_U| \cdot B \cdot C \cdot m \cdot m')$ where $m'$ is the number of sampled configurations of $y_{1:n-1}$. For all experiments with ICAL, we were able to use $L = 1$ without any scaling difficulties. For ICAL-pointwise, we used $L = B/15$ only for CIFAR-10 and CIFAR-100. As alluded to previously, ICAL-pointwise can accommodate much larger $L$ compared to ICAL before its performance degrades, allowing for much greater scaling. We evaluate this aspect of ICAL-pointwise in the Appendix. The final algorithm is given in Algorithm 1.

ICAL-pointwise

To evaluate the marginal dependency increase if a candidate point $x$ is added to batch $\mathcal{B}$, we sample a set $R$ from the pool set $D_U$ and compute the pairwise dHSIC of both $\mathcal{B}$ and $\mathcal{B}' = \mathcal{B} \cup \{x\}$ with respect to each point in $R$. Let the resulting vectors (each of length $|R|$) with the dHSIC scores be $d_{\mathcal{B}}$ and $d_{\mathcal{B}'}$. Then the marginal dependency increase statistic $M_x$ for point $x$ is

$$M_x = \frac{1}{|R|} \sum_i \max\left(\frac{d^i_{\mathcal{B}'}}{d^i_{\mathcal{B}}}, 1\right)$$

where $i$ indexes the $i$th element of the vector. We then modify $\alpha_{ICAL}$ as follows:

$$\alpha'_{ICAL}(\mathcal{B} \cup \{x\}) = \alpha_{ICAL}(\mathcal{B} \cup \{x\}) \cdot (M_x - 1)$$

and use the point with the highest value of $\alpha'_{ICAL}$ as the point to acquire. Note that as we want to get as accurate an estimate of $M_x$ as possible, we ideally want to choose as large a set $R$ as possible. In general, we also want to choose $|R|$ to be greater than the number of classes. This makes ICAL-pointwise more memory intensive compared to ICAL. We also tried another criterion for batch selection based on minimal-redundancy-maximal-relevance (Peng et al., 2005), but that had significantly worse performance compared to ICAL and ICAL-pointwise.

In Figure 6, we analyze the performance of ICAL versus ICAL-pointwise when their parameters are set such that the computational cost is about the same. As can be seen, they are broadly similar, with ICAL-pointwise having a slight advantage in earlier acquisitions and ICAL being slightly better in later ones. We also analyze the relative performance as the mini-batch size $L$ changes in Figure 7. In that figure, iter $= B/L$ is the number of iterations taken to build the entire acquisition batch (note that the actual acquisition happens after the entire batch has been built).

ICAL-pointwise requires more computation time than ICAL in the small-$L$ setup. However, if time is the major constraint, ICAL-pointwise is to be preferred, as its performance degrades more slowly as $L$, the size of the minibatch, increases. As the performance usually peaks at $L = 1$, if one is trying to get the best performance or if memory is a constraint, then ICAL is to be preferred.

Figure 6: Relative performance of ICAL and ICAL-pointwise on smaller datasets (EMNIST, FashionMNIST, MNIST and CIFAR10) with parameters set to equivalent computation cost.
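The marginal dependency increase statistic and the modified acquisition score used by ICAL-pointwise amount to a few array operations. The sketch below assumes the two dHSIC score vectors have already been computed and that every entry of $d_{\mathcal{B}}$ is strictly positive; the function names and the example numbers are hypothetical.

```python
import numpy as np

def marginal_dependency_increase(d_batch, d_batch_plus_x):
    """M_x: mean over the reference set R of max(d_{B'}^i / d_B^i, 1).

    Assumes all entries of d_batch are strictly positive.
    """
    ratio = np.asarray(d_batch_plus_x, dtype=float) / np.asarray(d_batch, dtype=float)
    return float(np.maximum(ratio, 1.0).mean())

def pointwise_score(alpha_ical, d_batch, d_batch_plus_x):
    """alpha'_ICAL = alpha_ICAL * (M_x - 1)."""
    return alpha_ical * (marginal_dependency_increase(d_batch, d_batch_plus_x) - 1.0)

# Hypothetical dHSIC scores against |R| = 2 reference points.
d_B = [0.2, 0.4]
d_B_plus_x = [0.3, 0.2]      # dependency rises for the first point, falls for the second
m_x = marginal_dependency_increase(d_B, d_B_plus_x)
score = pointwise_score(2.0, d_B, d_B_plus_x)
print(m_x, score)
```

Clipping each ratio at 1 means a candidate is never penalised for lowering the dependency with some reference point; a candidate that raises no pairwise dependency gets $M_x = 1$ and therefore a modified score of 0.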



Let the batch to acquire be denoted by $\mathcal{B}$ with $B = |\mathcal{B}|$. Given a model distribution $M$, training data $D_{\text{train}}$, unlabeled data $D_U$, input space $X$, set of labels $Y$ and an acquisition function $\alpha(x, M)$, we decide which batch of points to query next via:

$$\mathcal{B}^* = \arg\max_{\mathcal{B}} \alpha(\mathcal{B}, M)$$

Figure 1: Mean posterior entropy of the predictions after each acquisition on EMNIST.

Figure 2: Performance on MNIST and repeated-MNIST. Accuracy and NLL after each acquisition.

Figure 3: Histogram of the labels of all acquired points using different active learning methods on EMNIST (47 classes). ICAL acquires more diverse and balanced batches, while all other methods have over- or under-represented classes.

Figure 4: Performance on EMNIST and Fashion-MNIST. ICAL significantly improves the accuracy and NLL.

Figure 5: Performance on CIFAR-10 and CIFAR-100 with batch size=3000 using 8 seeds

Algorithm 1: Information Condensing Active Learning (ICAL)
Input: $M$, $T$, $D_{\text{train}}$, $D_U$, $B$, $K$, $r$, $L$
    Train $M$ on $D_{\text{train}}$
    repeat
        $\mathcal{B} = \{\}$
        while $|\mathcal{B}| < B$ do
            $Y_U$ = the predictive distribution for $x \in D_U$ according to $M$
            $R$ = set of $r$ randomly selected points from $D_U$
            $x' = \arg\max_x \alpha_{ICAL}(\mathcal{B} \cup \{x\}, \text{HSIC})$, with the optimizations as specified in Sections 5.1 and 5.2
            $\mathcal{B} = \mathcal{B} \cup \{x'\}$
        end while
        $D_{\text{train}} = D_{\text{train}} \cup \mathcal{B}$
        Retrain $M$ on $D_{\text{train}}$
    until $T$ iterations reached
    Return $M$
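The inner loop of the algorithm, including the option of adding points in minibatches of size $L$, can be sketched generically. The code below is not the authors' implementation: `build_batch`, its `step` argument (playing the role of $L$) and the `toy_score` function standing in for $\alpha_{ICAL}$ are all hypothetical, and model training, sampling of $R$ and the HSIC computation are deliberately left out.

```python
from typing import Callable, List, Sequence

def build_batch(pool: Sequence[int], batch_size: int, step: int,
                score: Callable[[List[int], int], float]) -> List[int]:
    """Greedily build a batch, adding `step` points per scoring pass.

    score(batch, x) is the acquisition value of adding candidate x to the
    current batch; scores are recomputed after every `step` additions.
    """
    batch: List[int] = []
    remaining = list(pool)
    while len(batch) < batch_size and remaining:
        take = min(step, batch_size - len(batch))
        ranked = sorted(remaining, key=lambda x: score(batch, x), reverse=True)
        chosen = ranked[:take]
        batch.extend(chosen)
        chosen_set = set(chosen)
        remaining = [x for x in remaining if x not in chosen_set]
    return batch

# Toy stand-in for the acquisition function: prefer points far from the batch.
positions = [0.0, 0.1, 5.0, 6.0, 10.0, 10.1]

def toy_score(batch: List[int], x: int) -> float:
    if not batch:
        return positions[x]
    return min(abs(positions[x] - positions[b]) for b in batch)

greedy = build_batch(range(6), batch_size=3, step=1, score=toy_score)   # B / L = 3 passes
chunked = build_batch(range(6), batch_size=3, step=3, score=toy_score)  # B / L = 1 pass
print(greedy, chunked)
```

With `step=1` the toy batch spreads across the pool, while with `step=3` all three points are chosen in one pass from the same region, which illustrates the possible loss of diversity when $L$ grows in exchange for fewer scoring passes.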

Figure 7: Relative performance of ICAL and ICAL-pointwise on CIFAR100 with different mini-batch size L.

Figure 9: CIFAR10 performance with different $L$. iter $= B/L$ is the number of iterations taken to build the entire acquisition batch of size $B$ (note that the actual acquisition happens after the entire batch has been built).

Appendix Motivating example 2

Suppose we have a model distribution with 10 possible models $\omega_1, \ldots, \omega_{10}$ with equal prior probability of being the true model ($p(\omega_i) = 0.1$ for all $i$). Let the datapoints be $x_1, \ldots, x_L$ with their labels taking 4 possible values. We define $p^k_{ij} = p(y_i = j \mid x_i, \omega_k)$ as the probability of the $j$th class for the $i$th datapoint given by the $k$th model. Let the values of $p^k_{ij}$ be as specified with Table 1.

In the dependency measure $d(X_{1:n}) = \Lambda(P_{1:n}, \otimes_i P_i)$, $P_{1:n}$ is the joint distribution of $X_{1:n}$ and $P_i$ is the marginal of $X_i$, with $\otimes_i P_i$ being the product of the marginals. For $D_{KL}$ the dependency is exactly MI as defined above. For MMD the dependency is the Hilbert-Schmidt Independence Criterion (HSIC).

Hilbert-Schmidt Independence Criterion (HSIC)

Formally, if $X, Y$ are drawn from the joint distribution $P_{XY}$, then their HSIC is defined as

$$\text{HSIC}(P_{XY}, k, l) = E_{x,x',y,y'}\left[k(x, x')\, l(y, y')\right] + E_{x,x'}\left[k(x, x')\right] E_{y,y'}\left[l(y, y')\right] - 2 E_{x,y}\left[E_{x'}\left[k(x, x')\right] E_{y'}\left[l(y, y')\right]\right]$$

where $(x, y)$ and $(x', y')$ are independent pairs drawn from $P_{XY}$. Note that, for characteristic kernels $k$ and $l$, $\text{HSIC}(P_{XY}) = 0$ if and only if $P_{XY} = P_X P_Y$, that is, if $X, Y$ are independent. For the case where we are measuring the joint dependence between $d$ variables, we can use the dHSIC statistic (Sejdinovic et al., 2013a; Pfister et al., 2018). The computational complexity of HSIC is bounded by the time taken to compute the kernel matrix, which is $O(m^2 d)$ where $m$ is the number of samples and $d$ the number of random variables. We use $\widehat{\text{HSIC}}$ to denote the empirical estimator of the HSIC statistic.
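For intuition, an empirical HSIC can be computed directly from two kernel matrices. The sketch below uses the widely used biased estimator $\operatorname{tr}(KHLH)/m^2$ with centering matrix $H = I - \frac{1}{m}\mathbf{1}\mathbf{1}^T$; whether this is exactly the estimator used in the experiments is an assumption, as are the Gaussian kernel, the bandwidth and the function names.

```python
import numpy as np

def rbf_gram(z, bandwidth=1.0):
    """Gaussian kernel Gram matrix for samples z of shape (m, d)."""
    sq = ((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def hsic_biased(K, L):
    """Biased empirical HSIC, tr(K H L H) / m^2, from two (m, m) Gram matrices."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m     # centering matrix
    return float(np.trace(K @ H @ L @ H) / m ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y_dependent = x + 0.1 * rng.normal(size=(200, 1))   # strongly dependent on x
y_independent = rng.normal(size=(200, 1))           # independent of x

hsic_dep = hsic_biased(rbf_gram(x), rbf_gram(y_dependent))
hsic_ind = hsic_biased(rbf_gram(x), rbf_gram(y_independent))
print(hsic_dep, hsic_ind)
```

The dependent pair yields a clearly larger value than the independent pair, whose estimate is close to zero but not exactly zero because the estimator is biased at finite $m$. Building each Gram matrix costs $O(m^2)$ kernel evaluations, in line with the complexity stated above.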

Proof of Proposition 1

$k^*$ is positive semidefinite (psd) and symmetric, as the sum of psd symmetric matrices is also psd symmetric.

Proof of Proposition 2

We show here that [...], but the extension to arbitrary sums is straightforward. Here $\widehat{\text{dHSIC}}$ is the estimator for dHSIC, which is the $d$-variable version of HSIC. It is defined as [...], where $k_j$ is the kernel of the $j$th random variable and $X^j_i$ is the $i$th observation for the $j$th random variable. The estimator $\widehat{\text{dHSIC}}$ is defined as (Sejdinovic et al., 2013a) [...]. As dHSIC reduces to HSIC when $d = 2$, the proof for HSIC also follows. Using the definition of $\widehat{\text{dHSIC}}$ above, [...]
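As a hedged illustration of the quantity discussed in this proof, the sketch below implements the V-statistic form of the dHSIC estimator as it is commonly stated in Pfister et al. (2018): the mean of the elementwise product of the Gram matrices, plus the product of their overall means, minus twice the mean over samples of the product of their row means. Whether this matches the exact definition used here is an assumption, and the function names and Gaussian kernel are illustrative choices.

```python
import numpy as np

def rbf_gram(z, bandwidth=1.0):
    """Gaussian kernel Gram matrix for samples z of shape (n, p)."""
    sq = ((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def dhsic_vstat(grams):
    """V-statistic dHSIC estimate from a list of d Gram matrices of shape (n, n)."""
    g = np.stack(grams)                                     # (d, n, n)
    term1 = np.prod(g, axis=0).mean()                       # mean of elementwise product
    term2 = np.prod(g.mean(axis=(1, 2)))                    # product of overall means
    term3 = 2.0 * np.prod(g.mean(axis=2), axis=0).mean()    # row means multiplied, then averaged
    return float(term1 + term2 - term3)

def hsic_biased(K, L):
    """Two-variable reference value: tr(K H L H) / n^2."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return float(np.trace(K @ H @ L @ H) / n ** 2)

rng = np.random.default_rng(1)
a = rng.normal(size=(50, 1))
b = a + 0.5 * rng.normal(size=(50, 1))
c = rng.normal(size=(50, 1))
Ka, Kb, Kc = rbf_gram(a), rbf_gram(b), rbf_gram(c)

two_var = dhsic_vstat([Ka, Kb])
reference = hsic_biased(Ka, Kb)
three_var = dhsic_vstat([Ka, Kb, Kc])
print(two_var, reference, three_var)
```

For $d = 2$ the value agrees with the biased two-variable HSIC estimate up to floating-point error, consistent with the remark that dHSIC reduces to HSIC in that case.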

Diversity of acquired samples in repeated-MNIST

To check whether ICAL's acquisition batches are diverse enough, we plot how often each method acquires one, two, or three copies of the same sample. As shown in Figure 8, our method (as well as BatchBALD, BayesCoreset and Random) successfully avoided acquiring redundant copies of the same sample, whereas FASS and Max Entropy acquired up to 3 copies of the same sample in most acquisitions. This suggests that batch-aware acquisition strategies yield more diverse batches.

Figure 8: Frequencies with which different numbers of copies (1-3) of the same sample were acquired by each method.

Further CIFAR-10 and CIFAR-100 results

Further CIFAR results are in Table 2. For CIFAR-100, Random has a high p-value, but that is mainly because it performs a bit better than all other methods in the beginning; its performance quickly degrades and it is far below ICAL in the final iteration.

Runtime and memory considerations

BatchBALD runs out of memory on CIFAR-10 and CIFAR-100 and thus we are unable to compare against it for those two datasets. For the MNIST-variant datasets, ICAL takes about a minute for building the batch to acquire (batch sizes of 5 and 10). For CIFAR-10 (batch size 3000), with $L = 1$, the runtime is about 20 minutes but it scales linearly with $1/L$ (Figure 10). Thus it is only 5 minutes for $L = 30$ (iter $= 100$), which is already sufficient to give comparable performance to $L = 1$ (Figure 9). For CIFAR-100 (batch size 3000), the performance does degrade with high $L$, but as we mentioned previously, ICAL-pointwise holds up a lot better in terms of performance with high $L$ (Figure 7), and thus if time is a strong consideration, that variant should be used instead.

