IALE: IMITATING ACTIVE LEARNER ENSEMBLES

Abstract

Active learning (AL) prioritizes the labeling of the most informative data samples. However, the performance of AL heuristics depends on the structure of the underlying classifier model and the data. We propose an imitation learning scheme that imitates the selection of the best expert heuristic at each stage of the AL cycle in a batch-mode pool-based setting. We use DAGGER to train the policy on a dataset and later apply it to datasets from similar domains. With multiple AL heuristics as experts, the policy is able to reflect the choices of the best AL heuristic given the current state of the AL process. Our experiments on well-known datasets show that we outperform both state-of-the-art imitation learners and heuristics.

1. INTRODUCTION

The high performance of deep learning on various tasks from computer vision (Voulodimos et al., 2018) to natural language processing (NLP) (Barrault et al., 2019) also comes with disadvantages. One of its main drawbacks is the large amount of labeled training data it requires. Obtaining such data is expensive and time-consuming and often requires domain expertise. Active Learning (AL) is an iterative process in which, during every iteration, an oracle (e.g., a human) is asked to label the most informative unlabeled data sample(s). In pool-based AL all data samples are available, while most of them are unlabeled. In batch-mode pool-based AL, we select unlabeled data samples from the pool in acquisition batches of size greater than one. Batch-mode AL decreases the number of AL iterations required and makes it easier for an oracle to label the data samples (Settles, 2009). As a selection criterion we usually need to quantify how informative a label for a particular sample is. Well-known criteria include heuristics such as model uncertainty (Gal et al., 2017; Roth & Small, 2006; Wang & Shang, 2014; Ash et al., 2020), data diversity (Sener & Savarese, 2018), query-by-committee (Beluch et al., 2018), and expected model change (Settles et al., 2008). Since we ideally label the most informative data samples at each iteration, a machine learning model trained on a labeled subset selected by an AL strategy performs better than a model trained on a randomly sampled subset of the same size.

Besides the approaches mentioned above, several other data-driven AL approaches have emerged recently. Some model the data distribution (Mahapatra et al., 2018; Sinha et al., 2019; Tonnaer, 2017; Hossain et al., 2018) as a pre-processing step, or similarly use metric-based meta-learning (Ravi & Larochelle, 2018; Contardo et al., 2017) as a clustering algorithm.
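For illustration, the batch-mode pool-based AL cycle described above can be sketched as follows; this is a minimal sketch with an entropy-based uncertainty heuristic, where the `model`, `oracle`, and dataset interfaces are hypothetical placeholders rather than any specific implementation:

```python
import numpy as np

def entropy_uncertainty(probs):
    # probs: (n_samples, n_classes) predicted class probabilities
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

def al_cycle(model, X_labeled, y_labeled, X_pool, oracle,
             acq_size=32, iterations=10):
    """One possible batch-mode pool-based AL loop (illustrative only)."""
    for _ in range(iterations):
        model.fit(X_labeled, y_labeled)
        # score the pool and pick the acq_size most uncertain samples
        scores = entropy_uncertainty(model.predict_proba(X_pool))
        idx = np.argsort(scores)[-acq_size:]
        # query the oracle for labels and move the batch into the labeled set
        y_new = oracle(X_pool[idx])
        X_labeled = np.vstack([X_labeled, X_pool[idx]])
        y_labeled = np.concatenate([y_labeled, y_new])
        X_pool = np.delete(X_pool, idx, axis=0)
    return model, X_labeled, y_labeled
```

Swapping `entropy_uncertainty` for a diversity or query-by-committee score changes the heuristic without altering the surrounding loop.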
Other approaches focus on the heuristics and predict the best-suited one using a multi-armed bandit approach (Hsu & Lin, 2015). Recent approaches that use reinforcement learning (RL) learn strategies directly from data (Woodward & Finn, 2018; Bachman et al., 2017; Fang et al., 2017). Instead of pre-processing data or dealing with the selection of a suitable heuristic, they aim to learn an optimal selection sequence for a given task. However, these pure RL approaches not only require a huge number of samples, they also do not draw on existing knowledge such as readily available AL heuristics. Moreover, training the RL agents is usually very time-intensive as they are trained from scratch. Imitation learning (IL) helps in settings where only few labeled training samples but a potent algorithmic expert are available: IL trains, i.e., clones, a policy that transfers the expert's behavior to the related few-data problem. While IL mitigates some of the previously mentioned issues of RL, current approaches (including that of Liu et al. (2018)) are still limited with respect to their algorithmic expert and their acquisition size, i.e., some only pick one sample per iteration, and have so far only been evaluated on NLP tasks. We propose a batch-mode AL approach that enables larger acquisition sizes and makes use of a more diverse set of experts from different heuristic families, i.e., uncertainty, diversity, expected model change, and query-by-committee. Our policy extends previous work (see Section 2) by learning which of the available strategies performs best at which stage of the AL cycle. We use Dataset Aggregation (DAGGER) to train a robust policy and apply it to other problems from similar domains (see Section 3). We show that we can (1) train a policy on image datasets such as MNIST, Fashion-MNIST, Kuzushiji-MNIST, and CIFAR-10, (2) transfer the policy between them, and (3) transfer the policy between different classifier architectures (see Section 4).
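A rough sketch of DAGGER-style imitation of the best expert at each step is given below. The `experts`, `policy`, and environment interfaces are illustrative assumptions, not the actual method: the key DAGGER ingredients are rolling out a decaying mixture of expert and policy, relabeling every visited state with the best expert's action, and retraining on the aggregated dataset.

```python
import numpy as np

def dagger_train(policy, experts, env_reset, env_step,
                 epochs=5, beta0=1.0):
    """DAgger-style training sketch: aggregate (state, expert action)
    pairs across epochs and retrain the policy on all of them."""
    D_states, D_actions = [], []
    for epoch in range(epochs):
        beta = beta0 * (0.5 ** epoch)  # decaying expert influence
        state = env_reset()
        done = False
        while not done:
            # "best expert" = the heuristic with the highest estimated
            # value in the current state (hypothetical interface)
            expert_action = max(experts, key=lambda e: e.value(state)).act(state)
            D_states.append(state)
            D_actions.append(expert_action)
            # execute a mixture of expert and learned policy
            act = expert_action if np.random.rand() < beta else policy.act(state)
            state, done = env_step(act)
        # retrain on the aggregated dataset of all visited states
        policy.fit(np.array(D_states), np.array(D_actions))
    return policy
```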

2. RELATED WORK

Besides AL approaches for traditional ML models (Settles, 2009), approaches applicable to deep learning have been proposed (Gal et al., 2017; Sener & Savarese, 2018; Beluch et al., 2018; Settles et al., 2008; Ash et al., 2020). Below we discuss AL strategies that are trained on data.

Generative Models. Explicitly modeled data distributions capture the informativeness that can be used to select samples based on diversity. Sinha et al. (2019) propose pool-based semi-supervised AL where a discriminator distinguishes between labeled and unlabeled samples using the latent representations of a variational autoencoder. The representations are used to pick data points that are most diverse and representative (Tonnaer, 2017). Mirza & Osindero (2014) use a conditional generative adversarial network to generate samples with different characteristics, from which the most informative are selected using the uncertainty measured by a Bayesian neural network (Kendall & Gal, 2017; Mahapatra et al., 2018). Such approaches are similar to ours in that they capture dataset properties, but we instead model the dataset implicitly and infer a selection heuristic via imitation. In contrast to them, we consider batch-mode AL with acquisition sizes ≥ 1 and work in a pool- instead of a stream-setting. While Bachman et al. (2017) propose a strategy to extend the RL-based approaches to a pool setting, they still do not work on batches. Instead, we allow batches of arbitrary acquisition sizes. Fan et al. (2018) propose a meta-learning approach that trains a student-teacher pair via RL. The teacher optimizes data teaching by selecting those labeled samples from a minibatch that let the student learn fastest. In contrast, our method learns to select samples from an unlabeled pool, i.e., in a missing-target scenario. The teacher-student analogy is related; however, the objective, the method, and the (meta-)data available for learning a good teacher (policy) differ.
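For illustration, the discriminator-based selection used by such generative approaches can be sketched as follows; `encode` and `discriminate` are hypothetical stand-ins for a pretrained VAE encoder and a labeled-vs-unlabeled discriminator, not the original implementation:

```python
import numpy as np

def discriminator_query(encode, discriminate, X_pool, acq_size=8):
    """Sketch of discriminator-guided selection: pick the pool samples
    the discriminator is least convinced belong to the labeled set."""
    z = encode(X_pool)           # latent codes from a (pretrained) VAE encoder
    p_labeled = discriminate(z)  # estimated P(sample stems from the labeled set)
    # low p_labeled = most 'novel' w.r.t. what is already labeled
    return np.argsort(p_labeled)[:acq_size]
```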

Metric Learning. Metric learners such as Ravi & Larochelle (2018) use a set of statistics calculated from the clusters of un-/labeled samples in a Prototypical Network's (Snell et al., 2017) embedding space, or learn to rank large batches (Li et al., 2020). Such statistics use distances (e.g., the Euclidean distance) or are otherwise converted into class probabilities. Two MLPs predict either a quality or a diversity query selection using backpropagation and the REINFORCE gradient (Mnih & Rezende, 2016). However, while they rely on statistics over the classifier's embedding and explicitly learn two strategies (quality and diversity), we use a richer state and are not constrained to specific strategies.

Multi-armed Bandit (MAB). Baram et al. (2004) treat the online selection of AL heuristics from an ensemble as the choice in a multi-armed bandit problem. Their COMB algorithm uses the well-known EXP4 algorithm to solve it and ranks AL heuristics according to a semi-supervised maximum-entropy criterion (Classification Entropy Maximization) over the samples in the pool. Building on this, Hsu & Lin (2015) learn to select an AL strategy for an SVM classifier and use an importance-weighted accuracy extension to EXP4, an unbiased estimator of the test accuracy, to better estimate each AL heuristic's performance improvement. Furthermore, they reformulate the MAB setting so that the heuristics are the bandits and the algorithm selects the one with the largest performance improvement, in contrast to COMB's formulation where the unlabeled samples are the bandits. Chu & Lin (2016) extend Hsu & Lin (2015) to a setting where AL heuristics are selected through a linear weighting that aggregates experience over multiple datasets. They adapt the semi-supervised reward scheme from Hsu & Lin (2015) to work with their deterministic queries. In contrast, we learn a unified AL policy instead of selecting from a set of available heuristics. This allows our policy to interpolate between batches of samples proposed by single heuristics and, furthermore, to exploit the classifier's internal state, which makes it especially suited for deep learning models.

Reinforcement Learning (RL). The AL cycle can be modeled as a sequential decision-making problem. Woodward & Finn (2018) propose a stream-based AL agent based on memory-augmented neural networks, where an LSTM-based agent learns to decide whether to predict a class label or to query the oracle. Extensions of Matching Networks (Bachman et al., 2017) allow for pool-based AL. Fang et al. (2017) use Deep Q-Learning in a stream-based AL scenario for sentence segmentation.
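The importance-weighted bandit update underlying the MAB approaches above can be sketched roughly as follows. This is a simplified EXP3/EXP4-flavored update over heuristics-as-arms, not the exact algorithm of Hsu & Lin (2015); the reward signal and learning rate `eta` are illustrative:

```python
import numpy as np

def select_heuristic(weights, rng):
    """Draw one AL heuristic with probability proportional to its weight."""
    probs = weights / weights.sum()
    return rng.choice(len(weights), p=probs), probs

def bandit_update(weights, chosen, reward, probs, eta=0.1):
    """Importance-weighted exponential update: only the chosen heuristic's
    reward is observed, so it is divided by its selection probability to
    keep the reward estimate unbiased."""
    est = np.zeros_like(weights)
    est[chosen] = reward / probs[chosen]
    return weights * np.exp(eta * est)
```

Over many AL iterations, heuristics that repeatedly yield high rewards (e.g., test-accuracy improvements) accumulate weight and are selected more often.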

