ACTIVE LEARNING IN CNNS VIA EXPECTED IMPROVEMENT MAXIMIZATION

Abstract

Deep learning models such as Convolutional Neural Networks (CNNs) have demonstrated high levels of effectiveness in a variety of domains, including computer vision and, more recently, computational biology. However, training effective models often requires assembling and/or labeling large datasets, which may be prohibitively time-consuming or costly. Pool-based active learning techniques have the potential to mitigate these issues, leveraging models trained on limited data to selectively query unlabeled data points from a pool in an attempt to expedite the learning process. Here we present "Dropout-based Expected IMprOvementS" (DEIMOS), a flexible and computationally efficient approach to active learning that queries points expected to maximize the model's improvement across a representative sample of points. The proposed framework enables us to maintain a prediction covariance matrix capturing model uncertainty, and to dynamically update this matrix in order to generate diverse batches of points in the batch-mode setting. Our active learning results demonstrate that DEIMOS outperforms several existing baselines across multiple regression and classification tasks taken from computer vision and genomics.

1. INTRODUCTION

Deep learning models (LeCun et al., 2015) have achieved remarkable performance on many challenging prediction tasks, with applications spanning computer vision (Voulodimos et al., 2018), computational biology (Angermueller et al., 2016), and natural language processing (Socher et al., 2012). However, training effective deep learning models often requires a large dataset, and assembling such a dataset may be difficult given limited resources and time. Active learning (AL) addresses this issue by providing a framework in which training begins with a small initial dataset and the model, guided by an objective function known as an acquisition function, chooses which data would be most useful to have labeled (Settles, 2009). AL has successfully streamlined and economized data collection across many disciplines (Warmuth et al., 2003; Tong & Koller, 2001; Danziger et al., 2009; Tuia et al., 2009; Hoi et al., 2006; Thompson et al., 1999). In particular, pool-based AL selects points from a given set of unlabeled pool points for labeling by an external oracle (e.g. a human expert or a biological experiment). The resulting labeled points are then added to the training set and can be leveraged to improve the model and potentially query additional pool points (Settles, 2011). Until recently, few AL approaches had been formulated for deep neural networks such as CNNs, owing to the lack of efficient methods for computing predictive uncertainty in these models. Most acquisition functions used in AL require reliable estimates of model uncertainty in order to make informed decisions about which data labels to request. However, recent developments have made computationally tractable predictive uncertainty estimation in deep neural networks possible.
In particular, a framework for deep learning models has been developed that views dropout (Srivastava et al., 2014) as an approximation to Bayesian variational inference, enabling efficient estimation of predictive uncertainty (Gal & Ghahramani, 2016). Our approach, which we call "Dropout-based Expected IMprOvementS" (DEIMOS), builds upon prior work aiming to make statistically optimal AL queries by selecting the points that minimize expected test error (Cohn et al., 1996; Gorodetsky & Marzouk, 2016; Binois et al., 2019; Roy & McCallum, 2001). We extend such approaches to CNNs through a flexible and computationally efficient algorithm that is primarily motivated by the regression setting, for which relatively few AL methods have been proposed, and extends to classification. Many AL approaches query the single pool point that optimizes a given acquisition function. However, querying points one at a time necessitates model retraining after every acquisition, which can be computationally expensive and can lead to time-consuming data collection (Chen & Krause, 2013). Simply selecting, greedily, a certain number of points with the best individual acquisition function values typically reduces performance because similar points are queried (Sener & Savarese, 2018). Here we leverage the uncertainty estimates provided by dropout in CNNs to create a dynamic representation of predictive uncertainty across a large, representative sample of points. Importantly, we consider the full joint covariance rather than just point-wise variances. DEIMOS acquires the point that maximizes the expected reduction in predictive uncertainty across all points, which we show is equivalent to maximizing the expected improvement (EI). DEIMOS extends to batch-mode AL, where batches are assembled sequentially by dynamically updating a representation of predictive uncertainty so that each queried point is expected to yield a significant, non-redundant reduction in predictive uncertainty.
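The covariance-based batch acquisition described above can be sketched in a few lines. The following is a minimal illustration, not the exact DEIMOS algorithm: it assumes a predictive covariance matrix (in practice estimated from Monte Carlo dropout passes of the CNN) and Gaussian conditioning, and the observation-noise variance `sigma2` is an illustrative assumption.

```python
import numpy as np

def greedy_ei_batch(cov, pool_idx, batch_size, sigma2=1e-2):
    """Greedily assemble a query batch maximizing the expected reduction
    in total predictive variance over a representative sample.

    cov      : (n, n) predictive covariance over the representative sample
               (e.g. estimated from stochastic dropout forward passes).
    pool_idx : indices of candidate pool points within the sample.
    sigma2   : assumed observation-noise variance (illustrative).
    """
    cov = cov.copy()
    batch = []
    candidates = list(pool_idx)
    for _ in range(batch_size):
        # Expected total-variance reduction from querying candidate i,
        # via Gaussian conditioning: sum_j cov[j, i]^2 / (cov[i, i] + sigma2)
        scores = [(cov[:, i] ** 2).sum() / (cov[i, i] + sigma2)
                  for i in candidates]
        best = candidates[int(np.argmax(scores))]
        batch.append(best)
        candidates.remove(best)
        # Rank-one update: condition the covariance on the queried point,
        # so later picks avoid redundant (highly correlated) candidates.
        v = cov[:, best] / np.sqrt(cov[best, best] + sigma2)
        cov -= np.outer(v, v)
    return batch

# Tiny example: points 0 and 1 are strongly correlated, point 2 is not.
cov = np.array([[1.0, 0.9, 0.0],
                [0.9, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
batch = greedy_ei_batch(cov, pool_idx=[0, 1, 2], batch_size=2)  # -> [0, 2]
```

Note how the dynamic covariance update makes the second pick point 2 rather than point 1: after querying point 0, the residual uncertainty about its near-duplicate is small, which is exactly the diversity effect batch-mode DEIMOS relies on.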
We evaluate DEIMOS and find strong performance compared to existing benchmarks in several AL experiments spanning handwritten digit recognition, alternative splicing prediction, and face age prediction.

2. RELATED WORK

AL is often formulated using information theory (MacKay, 1992b). Such approaches include querying the maximally informative batch of points as measured by Fisher information in logistic regression (Hoi et al., 2006), and Bayesian AL by Disagreement (BALD) (Houlsby et al., 2011), which acquires the point that maximizes the mutual information between the unknown output and the model parameters. Many AL algorithms have been developed based on uncertainty sampling, where the model queries the points about which it is most uncertain (Lewis & Catlett, 1994; Juszczak & Duin, 2003). AL via uncertainty sampling has been applied to SVMs using margin-based uncertainty measures (Joshi et al., 2009). AL has also been cast as an uncertainty sampling problem with explicit diversity maximization (Yang et al., 2015) to avoid querying correlated points. EI has been used as an acquisition function in Bayesian optimization for hyperparameter tuning (Eggensperger et al., 2013). Other AL objectives similar to ours make statistically optimal queries that minimize expected prediction error (Cohn et al., 1996; Gorodetsky & Marzouk, 2016; Binois et al., 2019; Roy & McCallum, 2001), which often reduces to querying the point that is expected to minimize the learner's variance integrated over possible inputs. DEIMOS extends EI and integrated-variance approaches, which have traditionally been applied to Gaussian processes, mixtures of Gaussians, and locally weighted regression, to deep neural networks. Until recently, few AL approaches have proven effective in deep learning models such as CNNs, largely due to difficulties in uncertainty estimation. Although Bayesian frameworks for neural networks (MacKay, 1992a; Neal, 1995) have been widely studied, these methods have not seen widespread adoption due to their increased computational cost.
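For concreteness, the BALD acquisition mentioned above can be approximated from stochastic forward passes: the mutual information between the predicted label and the model parameters equals the entropy of the mean prediction minus the mean per-pass entropy. The sketch below assumes hypothetical arrays of per-pass class probabilities; it is an illustration of the objective, not the cited implementation.

```python
import numpy as np

def bald_scores(probs, eps=1e-12):
    """Monte Carlo estimate of the BALD acquisition score.

    probs : (T, n, C) class probabilities from T stochastic (e.g. dropout)
            forward passes over n pool points with C classes.
    Returns I[y; omega] per point: H[mean_t p_t] - mean_t H[p_t].
    """
    mean_p = probs.mean(axis=0)                                      # (n, C)
    predictive_entropy = -(mean_p * np.log(mean_p + eps)).sum(-1)    # H of mean
    expected_entropy = -(probs * np.log(probs + eps)).sum(-1).mean(0)
    return predictive_entropy - expected_entropy                     # (n,)

# A point where passes disagree confidently scores high; a point where
# every pass is uniformly uncertain scores near zero (aleatoric noise).
disagree = np.array([[[0.99, 0.01]], [[0.01, 0.99]]])  # (T=2, n=1, C=2)
uniform  = np.array([[[0.5, 0.5]],   [[0.5, 0.5]]])
```

The contrast between the two toy inputs captures why BALD is preferred over plain predictive entropy: both points have maximally uncertain mean predictions, but only the first reflects disagreement among model parameter samples.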
However, theoretical advances have shown that dropout, a common regularization technique, can be viewed as performing approximate variational inference, enabling estimation of model and predictive uncertainty (Gal & Ghahramani, 2016). Simple dropout-based AL objectives in CNNs have shown promising results in computer vision classification applications (Gal et al., 2017). Several new algorithms show promising results for batch-mode AL on complex datasets. BatchBALD (Kirsch et al., 2019) extends BALD to batch-mode while avoiding redundancy by greedily constructing a query batch that maximizes the mutual information between the joint distribution over the unknown outputs and the model parameters. Batch-mode AL in CNN classification has also been formulated as a core-set selection problem (Sener & Savarese, 2018), with data points represented (embedded) using the activations of the model's penultimate fully-connected layer. The queried batch of points then corresponds to the centers optimizing a robust (i.e. outlier-tolerant) k-Center objective for these embeddings.
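The core-set selection idea can be illustrated with the standard greedy 2-approximation to the k-Center objective (the cited work additionally solves a robust, outlier-tolerant variant via mixed-integer programming). The embeddings below are hypothetical stand-ins for penultimate-layer activations.

```python
import numpy as np

def greedy_k_center(embeddings, labeled_idx, k):
    """Greedy k-Center selection in embedding space.

    embeddings  : (n, d) array, e.g. penultimate-layer activations.
    labeled_idx : indices of already-labeled points (initial centers).
    k           : number of pool points to query.
    Repeatedly picks the point farthest from its nearest center.
    """
    centers = list(labeled_idx)
    # Distance from each point to its nearest current center.
    dists = np.min(
        np.linalg.norm(
            embeddings[:, None, :] - embeddings[None, centers, :], axis=-1
        ),
        axis=1,
    )
    selected = []
    for _ in range(k):
        far = int(np.argmax(dists))   # farthest point becomes a new center
        selected.append(far)
        new_d = np.linalg.norm(embeddings - embeddings[far], axis=1)
        dists = np.minimum(dists, new_d)
    return selected

# Two clusters; with point 0 labeled, the distant cluster is covered first,
# then the remaining farthest point within it.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.1]])
picked = greedy_k_center(emb, labeled_idx=[0], k=2)  # -> [3, 2]
```

Like batch-mode DEIMOS and BatchBALD, this construction queries a diverse batch: each pick explicitly accounts for the points already covered, rather than ranking pool points independently.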

