INFORMATION CONDENSING ACTIVE LEARNING

Abstract

We introduce Information Condensing Active Learning (ICAL), a batch-mode, model-agnostic active learning (AL) method aimed at deep Bayesian active learning, which acquires labels for the points that carry as much information as possible about the still-unlabeled points. ICAL uses the Hilbert-Schmidt Independence Criterion (HSIC) to measure the strength of the dependency between a candidate batch of points and the unlabeled set. We develop key optimizations that allow us to scale our method to large unlabeled sets. We show significant improvements in model accuracy and negative log-likelihood (NLL) on several image datasets compared to state-of-the-art batch-mode AL methods for deep learning.
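To make the dependency measure concrete, the following is a minimal NumPy sketch of the standard (biased) empirical HSIC estimator, tr(KHLH)/(n-1)^2, with Gaussian kernels. The fixed bandwidth and the choice of the biased estimator are illustrative assumptions here, not necessarily the exact variant used by ICAL.

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    """Gaussian (RBF) kernel matrix for rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(X, Y, sigma=1.0):
    """Biased empirical HSIC estimate between samples X and Y (paired rows)."""
    n = X.shape[0]
    K = rbf_kernel(X, sigma)
    L = rbf_kernel(Y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```

HSIC is zero (in population) if and only if the two variables are independent, for characteristic kernels such as the Gaussian, which is what makes it usable as a dependency score between a candidate batch's predictions and those on the remaining unlabeled points.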

1. Introduction

Machine learning models are widely used for a vast array of real-world problems. They have been applied successfully in a variety of areas, including biology (Ching et al., 2018), chemistry (Sanchez-Lengeling and Aspuru-Guzik, 2018), physics (Guest et al., 2018), and materials engineering (Aspuru-Guzik and Persson, 2018). Key to the success of modern machine learning methods is access to high-quality training data. However, such data can be expensive to collect for many problems. Active learning (Settles, 2009) is a popular methodology for intelligently selecting the fewest new data points to be labeled while not sacrificing model accuracy. The usual setting is pool-based active learning, where one has access to a large unlabeled dataset D_U and iteratively selects new points from D_U to label. Our goal in this paper is to develop an acquisition function that selects points so as to maximize the eventual test accuracy, which is also one of the most popular criteria for evaluating an acquisition function. In active learning, an acquisition function is used to select which new points to label. A large number of acquisition functions have been developed over the years, mostly for classification (Settles, 2009). Acquisition functions use model predictions or point locations (in input-feature or learned-representation space) to decide which points would be most helpful to label in order to improve model accuracy. We then query for the labels of those points and add them to the training set. While the focus has historically been on acquiring one point at a time, each round of label acquisition and model retraining, particularly for deep neural networks, can be expensive. Furthermore, in several applications such as biology, it can be much faster to acquire a fixed number of points in parallel rather than sequentially.
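The pool-based loop described above can be sketched generically as follows. The `acquisition_score` and `oracle` callables are hypothetical placeholders standing in for any scoring rule (e.g. uncertainty, or a dependency-based score like ICAL's) and the labeling process, respectively; this is an illustration of the setting, not the paper's method.

```python
import numpy as np

def active_learning_loop(model, X_train, y_train, X_pool, oracle,
                         acquisition_score, batch_size=10, rounds=5):
    """Generic pool-based batch active learning: score the pool, label the
    top-scoring batch, move it into the training set, retrain."""
    for _ in range(rounds):
        model.fit(X_train, y_train)
        scores = acquisition_score(model, X_pool)    # higher = more informative
        idx = np.argsort(scores)[-batch_size:]       # top-scoring batch
        X_new, y_new = X_pool[idx], oracle(X_pool[idx])
        X_train = np.vstack([X_train, X_new])
        y_train = np.concatenate([y_train, y_new])
        X_pool = np.delete(X_pool, idx, axis=0)      # remove acquired points
    return model, X_train, y_train
```

The batch structure is what makes the selection problem hard: a good batch must be jointly informative, not just a collection of individually informative (and possibly redundant) points.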
There have been several papers, particularly in the past few years, that address this issue by acquiring points in batches. As our goal is to apply AL to modern ML models and data, we focus in this paper on batch-mode AL. Acquisition functions can be broadly divided into two categories. Those in the first category directly aim to minimize the error rate post-acquisition. A natural choice of such an acquisition function is to acquire labels for the points with the highest uncertainty, or the points closest to the decision boundary (uncertainty sampling can be directly linked to minimizing the error rate in the context of active learning (Mussmann and Liang, 2018)). In the second category, the goal is to get as close as possible to the true underlying model. Here, acquisition functions select points that give the most knowledge about a model's parameters, where knowledge is defined as the statistical dependency between the model's parameters and the predictions for the selected points. Mutual information (MI) is the usual choice of dependency measure, though other choices are possible. For well-specified model spaces, e.g. in physics, such a strategy can identify the correct model. In machine learning, however, models are usually mis-specified, and thus the

