CONTINUAL ACTIVE LEARNING

Abstract

While active learning (AL) improves the labeling efficiency of machine learning (by allowing models to query the labels of data samples), a major problem is that compute efficiency is decreased, since models are typically retrained from scratch at each query round. In this work, we develop a new framework that circumvents this problem by biasing further training towards the recently labeled sets, thereby complementing existing work on AL acceleration. We employ existing and novel replay-based Continual Learning (CL) algorithms that are effective at quickly learning new samples without forgetting previously learned information, especially when data comes from a shifting or evolving distribution. We call this compute-efficient active learning paradigm "Continual Active Learning" (CAL). We demonstrate that standard AL with warm starting fails to accelerate training, and that naive fine-tuning suffers from catastrophic forgetting due to distribution shifts over query rounds. We then show that CAL achieves significant speedups using a variety of replay schemes that use model distillation and that select diverse and uncertain points from the history, all while maintaining performance on par with standard AL. We conduct experiments across many data domains, including natural language, vision, medical imaging, and computational biology, each with a very different neural architecture (Transformers/CNNs/MLPs). CAL consistently provides a 2-6x reduction in training time, demonstrating its applicability across differing modalities.

1. INTRODUCTION

While neural networks have been immensely successful in a variety of supervised settings, most deep learning approaches are data-hungry and require significant amounts of computational resources. From a large pool of unlabeled data, active learning (AL) approaches select subsets of points to label by imparting the learner with the ability to query a human annotator. Such methods incrementally add points to the pool of labeled samples by 1) training a model from scratch on the current labeled pool and 2) using some measure of model uncertainty and/or diversity to select a set of points to query the annotator (Settles, 2009; 2011; Wei et al., 2015; Ash et al., 2020; Killamsetty et al., 2021). AL has been shown to reduce the amount of data required for training, but can still be computationally expensive to employ, since it requires retraining the model, typically from scratch, when new points are labeled at each round. A simple way to tackle this problem is to warm start the model parameters between rounds to reduce the convergence time. However, the observed speedups tend to still be limited, since the model must make several passes through an ever-increasing pool of data. Moreover, warm starting alone can in some cases hurt generalization, as discussed in Ash & Adams (2020) and Beck et al. (2021). Another alternative is to train solely on the newly labeled batch of examples, avoiding re-initialization altogether. However, as we show in Section 3.3, naive fine-tuning fails to retain accuracy on previously seen examples, since the distribution of the query pool may drastically change with each round. This problem of catastrophic forgetting while incrementally learning from a series of new tasks with shifting distributions is a central question in another paradigm called Continual Learning (CL) (French, 1999; McCloskey & Cohen, 1989; McClelland et al., 1995; Kirkpatrick et al., 2017c).
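The standard batch AL loop described above (retrain from scratch on the labeled pool, then query the annotator for the most informative points) can be sketched as follows. The toy logistic-regression learner, the margin-based uncertainty score, and all hyperparameters here are illustrative assumptions, not the protocol of any particular AL method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary-classification pool: two Gaussian blobs.
X = np.vstack([rng.normal(-1.0, 1.0, (200, 2)), rng.normal(1.0, 1.0, (200, 2))])
y = np.concatenate([np.zeros(200), np.ones(200)])

def train_from_scratch(X_lab, y_lab, epochs=200, lr=0.1):
    """Step 1: retrain a logistic-regression model from a fresh initialization."""
    Xb = np.hstack([X_lab, np.ones((len(X_lab), 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y_lab) / len(y_lab)      # full-batch gradient step
    return w

def uncertainty(w, X_pool):
    """Step 2: margin-style score -- points nearest the boundary score highest."""
    Xb = np.hstack([X_pool, np.ones((len(X_pool), 1))])
    p = 1.0 / (1.0 + np.exp(-Xb @ w))
    return -np.abs(p - 0.5)

labeled = list(rng.choice(len(X), size=10, replace=False))  # small seed set
for _ in range(5):  # five query rounds
    w = train_from_scratch(X[labeled], y[labeled])           # retrain from scratch
    pool = np.setdiff1d(np.arange(len(X)), labeled)          # still-unlabeled points
    query = pool[np.argsort(uncertainty(w, X[pool]))[-10:]]  # top-b uncertain points
    labeled.extend(query.tolist())                           # annotator supplies labels
```

Note that the `train_from_scratch` call inside the loop is exactly the cost that grows with every round: each retraining pass sweeps the entire (ever-larger) labeled pool from a fresh initialization.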
CL has recently gained popularity, and many algorithms have been introduced to allow models to quickly adapt to new tasks without forgetting (Riemer et al., 2018; Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2019; Aljundi et al., 2019b; Chaudhry et al., 2020; Kirkpatrick et al., 2017b). In this work, we propose Continual Active Learning (CAL), which applies continual learning strategies to accelerate batch active learning. CAL uses replay-based CL methods to let the model learn the newly labeled points without forgetting previously labeled points, while reusing past samples efficiently. As a result, we observe that CAL methods attain significant speedups over standard AL in terms of training time. Such speedups are beneficial for the following reasons:
• As neural networks grow in size (Shoeybi et al., 2019), the environmental and financial costs of training these models increase as well (Bender et al., 2021; Dhar, 2020; Schwartz et al., 2020). Reducing the number of gradient updates required for AL will help mitigate such costs, especially with large-scale models.
• Reducing the compute required for AL makes AL-based tools more accessible for deployment on edge computing platforms, IoT, and other low-resource devices (Senzaki & Hamelain, 2021).
• Developing new AL algorithms/acquisition functions, or searching for architectures (as done with NAS/AutoML) that are well suited specifically for AL, can require hundreds or even thousands of runs. Since CAL's speedups are agnostic to the AL algorithm and the neural architecture, such experiments can be significantly sped up.
The importance of speeding up the training process in machine learning is well recognized, as evidenced by the extensive literature on optimized training in the computing-systems community (Jia & Aiken; Zhang et al., 2017; Zheng et al., 2022). In addition, CAL demonstrates a practical application for CL methods.
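A minimal sketch of the replay idea underlying CAL: instead of retraining from scratch, the warm-started model is fine-tuned on each newly labeled batch while previously labeled samples are replayed alongside it in every minibatch. The function name, the toy logistic-regression learner, and the 50/50 mixing ratio are assumptions for illustration, not the exact recipe of any CAL variant:

```python
import numpy as np

rng = np.random.default_rng(1)

def replay_finetune(w, X_new, y_new, X_old, y_old, steps=100, lr=0.1, batch=16):
    """Fine-tune a warm-started linear model on the newly labeled batch while
    replaying an equal number of samples drawn from the previously labeled
    pool in every minibatch, so earlier rounds are not forgotten."""
    for _ in range(steps):
        i_new = rng.integers(0, len(X_new), batch)   # freshly labeled points
        i_old = rng.integers(0, len(X_old), batch)   # replayed history
        Xb = np.vstack([X_new[i_new], X_old[i_old]])
        yb = np.concatenate([y_new[i_new], y_old[i_old]])
        Xa = np.hstack([Xb, np.ones((len(Xb), 1))])  # append a bias column
        p = 1.0 / (1.0 + np.exp(-Xa @ w))
        w = w - lr * Xa.T @ (p - yb) / len(yb)       # logistic-loss SGD step
    return w
```

Standard AL would retrain from scratch at this point; the replay scheme instead warm-starts `w` from the previous round and makes a fixed, small number of gradient updates per round, which is where the training-time savings come from.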
Many of the settings used to benchmark CL methods in recent works are somewhat contrived and unrealistic. Most CL works consider the class- or domain-incremental setting, where only samples belonging to a subset of the classes/domains of the original dataset are available to the model at any given time. This setting rarely occurs in practice; it represents a worst-case scenario and therefore should not be the only benchmark on which CL methods are evaluated. We posit that the validity of future CL algorithms may be determined by their performance in the CAL setting in addition to their performance on existing benchmarks. To the best of our knowledge, this application of CL algorithms to batch AL has not been explored before. Our contributions can be summarized as follows: (1) we demonstrate that active learning can be viewed as a continual learning problem and propose the CAL framework; (2) we benchmark several existing CL methods (CAL-ER, CAL-DER, CAL-MIR) as well as novel methods (CAL-SD, CAL-SDS2) and evaluate them on several datasets based on the accuracy and speedup they attain over standard AL.
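As one concrete replay scheme with distillation (in the spirit of Dark Experience Replay, on which CAL-DER builds), a buffer can store the model's logits at the time examples were learned, and later training can match the current logits to those stored values alongside the usual classification loss. The sketch below uses a toy linear model; the function names, the alpha weighting, and the exact gradient form are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def logits(w, Xf):
    """Linear-model logits with a bias column appended."""
    return np.hstack([Xf, np.ones((len(Xf), 1))]) @ w

def der_step(w, X_new, y_new, X_buf, z_buf, lr=0.1, alpha=0.5):
    """One DER-style update: logistic loss on the newly labeled batch, plus an
    MSE distillation term that keeps the current logits on buffered examples
    close to the logits z_buf stored when those examples were first learned."""
    Xn = np.hstack([X_new, np.ones((len(X_new), 1))])
    p = 1.0 / (1.0 + np.exp(-Xn @ w))
    grad_ce = Xn.T @ (p - y_new) / len(y_new)        # classification gradient
    Xb = np.hstack([X_buf, np.ones((len(X_buf), 1))])
    grad_mse = Xb.T @ (Xb @ w - z_buf) / len(z_buf)  # grad of mean 0.5*(z - z_buf)^2
    return w - lr * (grad_ce + alpha * grad_mse)
```

The distillation term anchors the model's outputs on buffered history even as the classification loss adapts it to the newly labeled batch, which is how such schemes trade off plasticity against forgetting.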

2. RELATED WORK

Active learning has demonstrated label efficiency over passive learning (Wei et al., 2015; Killamsetty et al., 2021; Ash et al., 2020). In addition to these empirical advances, there has been extensive work on theoretical aspects over the past decade (Hanneke, 2007; 2009; Balcan et al., 2010), where Hanneke (2012) shows sample-complexity advantages over passive learning in noise-free classifier learning for VC classes. However, there has recently been interest in speeding up active learning, since most deep learning involves networks with a huge number of parameters.

Kirsch et al. (2019); Pinsler et al. (2019); Sener & Savarese (2018) aim to reduce the number of query iterations by using large query batch sizes. However, they do not exploit the models learned in previous rounds for subsequent ones and are therefore complementary to CAL. Works such as Coleman et al. (2020a); Ertekin et al. (2007); Mayer & Timofte (2020); Zhu & Bento (2017) speed up the selection of the new query set by appropriately restricting the search space or by using generative methods. These works can easily be integrated into our framework, since CAL operates on the training side of active learning rather than on query selection. On the other hand, Lewis & Catlett (1994); Coleman et al. (2020b); Yoo & Kweon (2019) use a smaller proxy model to reduce computational overhead; however, they still follow the standard active learning protocol and can therefore be accelerated when integrated with CAL.

