INFORMATION CONDENSING ACTIVE LEARNING

Abstract

We introduce Information Condensing Active Learning (ICAL), a batch-mode, model-agnostic Active Learning (AL) method, targeted at Deep Bayesian Active Learning, that focuses on acquiring labels for the points that carry as much information as possible about the still-unacquired points. ICAL uses the Hilbert-Schmidt Independence Criterion (HSIC) to measure the strength of the dependency between a candidate batch of points and the unlabeled set. We develop key optimizations that allow us to scale the method to large unlabeled sets. We show significant improvements in model accuracy and negative log-likelihood (NLL) on several image datasets compared to state-of-the-art batch-mode AL methods for deep learning.

1. Introduction

Machine learning models are widely used for a vast array of real-world problems. They have been applied successfully in a variety of areas including biology (Ching et al., 2018), chemistry (Sanchez-Lengeling and Aspuru-Guzik, 2018), physics (Guest et al., 2018), and materials engineering (Aspuru-Guzik and Persson, 2018). Key to the success of modern machine learning methods is access to high-quality data for training the model. However, such data can be expensive to collect for many problems. Active learning (Settles, 2009) is a popular methodology for intelligently selecting the fewest new data points to be labeled without sacrificing model accuracy. The usual setting is pool-based active learning, where one has access to a large unlabeled dataset D_U and uses active learning to iteratively select new points from D_U to label. Our goal in this paper is to develop an active learning acquisition function that selects points to maximize the eventual test accuracy, which is also one of the most popular criteria for evaluating an acquisition function. In active learning, an acquisition function is used to select which new points to label. A large number of acquisition functions have been developed over the years, mostly for classification (Settles, 2009). Acquisition functions use model predictions or point locations (in input-feature or learned-representation space) to decide which points would be most helpful to label to improve model accuracy. We then query for the labels of those points and add them to the training set. While the past focus for acquisition functions has been on acquiring one point at a time, each round of label acquisition and retraining of the ML model, particularly in the case of deep neural networks, can be expensive. Furthermore, in several applications such as biology, it can be much faster to acquire a fixed number of points in parallel rather than sequentially.
There have been several papers, particularly in the past few years, that avoid this issue by acquiring points in batch. As our goal is to apply AL in the context of modern ML models and data, we focus in this paper on batch-mode AL. Acquisition functions can broadly be divided into two categories. Those in the first category directly focus on minimizing the error rate post-acquisition. A natural choice of such an acquisition function is to acquire labels for the points with the highest uncertainty, or the points closest to the decision boundary (uncertainty sampling can be directly linked to minimizing the error rate in the context of active learning (Mussmann and Liang, 2018)). In the second category, the goal is to get as close as possible to the true underlying model. Here, acquisition functions select points that give the most knowledge about a model's parameters, where knowledge is defined as the statistical dependency between the parameters of the model and the predictions for the selected points. Mutual information (MI) is the usual choice of dependency, though other choices are possible. For well-specified model spaces, e.g. in physics, such a strategy can identify the correct model. In machine learning, however, models are usually misspecified, and thus even model-identification acquisition functions are ultimately evaluated by how much they reduce test error. Given this reality, we adopt the viewpoint of minimizing the error rate of the model post-acquisition. Our strategy is to select points that we expect to provide substantial information about the labels of the rest of the unlabeled set, thereby reducing model uncertainty. We propose acquiring a batch of points B such that the model's predictions on B have as high a statistical dependency as possible with the model's predictions on the entire unlabeled set D_U.
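As an illustration only (not the paper's actual algorithm, which comes later), batch construction of this kind can be sketched as a greedy loop over candidate points, with the dependency measure left pluggable; the helper `mean_abs_corr` below is a hypothetical stand-in for a proper measure such as HSIC:

```python
import numpy as np

def mean_abs_corr(A, B):
    # Toy dependency measure: mean |Pearson correlation| over column pairs.
    A = (A - A.mean(0)) / A.std(0)
    B = (B - B.mean(0)) / B.std(0)
    return np.abs(A.T @ B / len(A)).mean()

def greedy_batch(pred_samples, pool_idx, batch_size, dependency):
    """Greedily grow a batch B maximizing dependency between the model's
    prediction samples on B and on the remaining pool.

    pred_samples: (T, N) array of T Monte Carlo prediction samples for N
    pool points. Assumes batch_size < len(pool_idx)."""
    batch, candidates = [], list(pool_idx)
    for _ in range(batch_size):
        best, best_score = None, -np.inf
        for c in candidates:
            trial = batch + [c]
            rest = [i for i in candidates if i != c]
            score = dependency(pred_samples[:, trial], pred_samples[:, rest])
            if score > best_score:
                best, best_score = c, score
        batch.append(best)
        candidates.remove(best)
    return batch
```

A point whose predictions co-vary strongly with the rest of the pool is picked first, since labeling it is expected to tell the model the most about the remaining points.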
Thus we want a batch B that condenses the most information about the model's predictions on D_U. We call our method Information Condensing Active Learning (ICAL). A key desideratum for our acquisition function is that it be model agnostic. This is partly because the model distribution can be very heterogeneous; for example, ensembles, which are often used as a model distribution, can consist of decision trees in a random forest or of different architectures for a neural network. This means we cannot assume any closed form for the model's predictive distribution, and have to resort to Monte Carlo sampling of the predictions from the model to estimate the dependency between the model's predictions on the query batch and on the unlabeled set. MI, however, is known to be hard to approximate from samples alone (Song and Ermon, 2019). Thus, to scale the method to larger batch sizes, we use the Hilbert-Schmidt Independence Criterion (HSIC), one of the most powerful extant statistical dependency measures for high-dimensional settings. Another advantage of HSIC is that it is differentiable, which, as we discuss later, opens up applications of the acquisition function where MI would be difficult to make work. To summarize, we introduce Information Condensing Active Learning (ICAL), which maximizes the information gained with respect to the model's predictions on the unlabeled set of points. ICAL is a batch-mode, model-agnostic acquisition function; as it only needs samples from the posterior predictive distribution, it applies to both classification and regression tasks. We then develop an algorithm that scales ICAL to large batch sizes when using HSIC as the dependency measure between random variables.
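For concreteness, here is a minimal NumPy sketch of the standard biased empirical HSIC estimator with an RBF kernel; the kernel choice and the fixed bandwidth `sigma` are illustrative assumptions rather than the paper's exact configuration:

```python
import numpy as np

def rbf_kernel(Z, sigma=1.0):
    # Gram matrix K[i, j] = exp(-||z_i - z_j||^2 / (2 sigma^2)).
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * Z @ Z.T
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(X, Y, sigma=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2, where K, L are
    kernel Gram matrices of the two samples and H is the centering matrix."""
    n = X.shape[0]
    K = rbf_kernel(X, sigma)
    L = rbf_kernel(Y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```

In practice the bandwidth is commonly set by the median heuristic (the median pairwise distance in the sample) rather than fixed in advance; strongly dependent samples yield a much larger HSIC value than independent ones.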

2. Related work

A review of work on acquisition functions for active learning prior to the recent focus on deep learning is given by Settles (2009). The BALD (Bayesian Active Learning by Disagreement) acquisition function (Houlsby et al., 2011) chooses the query point with the highest mutual information with the model parameters. This turns out to be the point on which individual models sampled from the model distribution are confident in their predictions, while the overall predictive distribution for that point has high entropy; in other words, the point on which the models are individually confident but disagree the most. Guo and Schuurmans (2008), building on Guo and Greiner (2007), formulate the problem as an integer program, selecting a batch such that the post-acquisition model is highly confident on the training set and has low uncertainty on the unlabeled set. While the latter aspect is related to what we do, they need to retrain their model for every candidate batch searched over in the course of finding the optimal batch. As the total number of possible batches is exponential in the size of the unlabeled set, this can become too computationally expensive for neural networks, limiting the applicability of the approach; as far as we know, Guo and Schuurmans (2008) has only been applied to logistic regression. BMDR (Wang and Ye, 2015) queries points that are as close to the classifier's decision boundary as possible while still being representative of the overall sample distribution. Representativeness is measured using the maximum mean discrepancy (MMD) (Gretton et al., 2012) of the input features between the query batch and the set of all points, with a lower MMD indicating a more representative query batch. However, this approach is limited to classification problems, as it relies on a decision boundary.
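The BALD score described above, predictive entropy minus expected per-model entropy (an estimate of the mutual information between a point's label and the model parameters), can be sketched directly from Monte Carlo prediction samples, e.g. stochastic forward passes with MC dropout:

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    # Shannon entropy in nats; eps guards against log(0).
    return -np.sum(p * np.log(p + eps), axis=axis)

def bald_scores(probs):
    """probs: (T, N, C) array of class probabilities from T stochastic
    forward passes for N candidate points and C classes.
    Returns the BALD score (mutual information estimate) per point."""
    mean_p = probs.mean(axis=0)                      # predictive distribution
    predictive_entropy = entropy(mean_p)             # H[E_t p_t(y|x)]
    expected_entropy = entropy(probs).mean(axis=0)   # E_t H[p_t(y|x)]
    return predictive_entropy - expected_entropy
```

A point where every sampled model is confident but the models disagree gets a high score; a point where they all agree scores near zero, matching the intuition in the text.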
BMAL (Hoi et al., 2006) selects a batch such that the Fisher information matrices for the full unlabeled set and the selected batch are as close as possible. The Fisher information matrix, however, is quadratic in the number of parameters and thus infeasible to compute for modern deep neural networks. FASS (Filtered Active Subset Selection) (Wei et al., 2015) picks the most uncertain points and then selects a subset of those points

