CONTINUAL ACTIVE LEARNING

Abstract

While active learning (AL) improves the labeling efficiency of machine learning (by allowing models to query the labels of data samples), a major problem is that compute efficiency is decreased since models are typically retrained from scratch at each query round. In this work, we develop a new framework that circumvents this problem by biasing further training towards the recently labeled sets, thereby complementing existing work on AL acceleration. We employ existing and novel replay-based Continual Learning (CL) algorithms that are effective at quickly learning new samples without forgetting previously learned information, especially when data comes from a shifting or evolving distribution. We call this compute-efficient active learning paradigm "Continual Active Learning" (CAL). We demonstrate both that standard AL with warm starting fails to accelerate training and that naive fine-tuning suffers from catastrophic forgetting due to distribution shifts over query rounds. We then show that CAL achieves significant speedups using a plethora of replay schemes that use model distillation and that select diverse/uncertain points from the history, all while maintaining performance on par with standard AL. We conduct experiments across many data domains, including natural language, vision, medical imaging, and computational biology, each with very different neural architectures (Transformers/CNNs/MLPs). CAL consistently provides a 2-6x reduction in training time, thus showing its applicability across differing modalities.

1. INTRODUCTION

While neural networks have been immensely successful in a variety of different supervised settings, most deep learning approaches are data-hungry and require significant amounts of computational resources. From a large pool of unlabeled data, active learning (AL) approaches select subsets of points to label by imparting the learner with the ability to query a human annotator. Such methods incrementally add points to the pool of labelled samples by 1) training a model from scratch on the current labelled pool and 2) using some measure of model uncertainty and/or diversity to select a set of points to query the annotator (Settles, 2009; 2011; Wei et al., 2015; Ash et al., 2020; Killamsetty et al., 2021) . AL has been shown to reduce the amount of data required for training, but can still be computationally expensive to employ since it requires retraining the model, typically from scratch, when new points are labelled at each round. A simple way to tackle this problem is to warm start the model parameters between rounds to reduce the convergence time. However, the observed speedups tend to still be limited since the model must make several passes through an ever-increasing pool of data. Moreover, warm starting alone in some cases can hurt generalization, as discussed in Ash & Adams (2020) and Beck et al. (2021) . Another extension to this is to solely train on the newly labeled batch of examples to avoid re-initialization. However, as we show in Section 3.3, naive fine-tuning fails to retain accuracy on previously seen examples since the distribution of the query pool may drastically change with each round. This problem of catastrophic forgetting while incrementally learning from a series of new tasks with shifting distribution is a central question in another paradigm called Continual Learning (CL) (French, 1999; McCloskey & Cohen, 1989; McClelland et al., 1995; Kirkpatrick et al., 2017c) . 
CL has recently gained popularity, and many algorithms have been introduced to allow models to quickly adapt to new tasks without forgetting (Riemer et al., 2018; Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2019; Aljundi et al., 2019b; Chaudhry et al., 2020; Kirkpatrick et al., 2017b). In this work, we propose Continual Active Learning (CAL), which applies continual learning strategies to accelerate batch active learning. In CAL, we propose applying CL to enable the model to learn the newly labeled points without forgetting previously labeled points while using past samples efficiently via replay-based methods. As such, we observe that CAL methods attain significant speedups over standard AL in terms of training time. Such speedups are beneficial for the following reasons:

• As neural networks grow in size (Shoeybi et al., 2019), the environmental and financial costs to train these models increase as well (Bender et al., 2021; Dhar, 2020; Schwartz et al., 2020). Reducing the number of gradient updates required for AL will help mitigate such costs, especially with large-scale models.

• Reducing the compute required for AL makes AL-based tools more accessible for deployment on edge computing platforms, IoT, and other low-resource devices (Senzaki & Hamelain, 2021).

• Developing new AL algorithms/acquisition functions, or searching for architectures as done with NAS/AutoML, that are well-suited specifically for AL can require hundreds or even thousands of runs. Since CAL's speedups are agnostic to the AL algorithm and the neural architecture, such experiments can be significantly sped up.

The importance of speeding up the training process in machine learning is well recognized and is evidenced by the plethora of optimized machine learning training literature seen in the computing systems community (Zhihao Jia & Aiken; Zhang et al., 2017; Zheng et al., 2022). In addition, CAL demonstrates a practical application for CL methods.
Many of the settings used to benchmark CL methods in recent works are somewhat contrived and unrealistic. Most CL works consider the class/domain-incremental setting, where only samples belonging to a subset of the classes/domains of the original dataset are available to the model at any given time. This setting rarely occurs in practice and represents a worst-case scenario, so it should not be the only benchmark upon which CL methods are evaluated. We posit that the validity of future CL algorithms may be determined based on their performance in the CAL setting in addition to their performance on existing benchmarks. To the best of our knowledge, this application of CL algorithms to batch AL has never been explored. Our contributions can be summarized as follows: (1) we demonstrate that active learning can be viewed as a continual learning problem and propose the CAL framework; (2) we benchmark several existing CL methods (CAL-ER, CAL-DER, CAL-MIR) as well as novel methods (CAL-SD, CAL-SDS2) and evaluate them on several datasets based on the accuracy/speedup they attain over standard AL.

2. RELATED WORK

Active learning has demonstrated label efficiency over passive learning (Wei et al., 2015; Killamsetty et al., 2021; Ash et al., 2020). In addition to these empirical advances, there has been extensive work on theoretical aspects over the past decade (Hanneke, 2009; 2007; Balcan et al., 2010), where Hanneke (2012) shows sample-complexity advantages over passive learning in noise-free classifier learning for VC classes. Recently, however, there has been interest in speeding up active learning, because most deep learning involves networks with a huge number of parameters. Kirsch et al. (2019); Pinsler et al. (2019); Sener & Savarese (2018) aim to reduce the number of query iterations by using large query batch sizes. However, they do not exploit the learned models from previous rounds in subsequent ones and are therefore complementary to CAL. Works such as Coleman et al. (2020a); Ertekin et al. (2007); Mayer & Timofte (2020); Zhu & Bento (2017) speed up the selection of the new query set by appropriately restricting the search space or by using generative methods. These works can be easily integrated into our framework because CAL operates on the training side of active learning, not on query selection. On the other hand, Lewis & Catlett (1994); Coleman et al. (2020b); Yoo & Kweon (2019) use a smaller proxy model to reduce computational overhead; however, they still follow the standard active learning protocol and can therefore be accelerated when integrated with CAL. Lastly, a few prior works explore continual/transfer learning and active learning in the same context. Perkonigg et al. (2021) propose an approach that allows active learning algorithms to be applied to data streams in the context of medical imaging by introducing a module that detects domain shifts. This differs from our work, which uses algorithms that prevent catastrophic forgetting in order to accelerate active learning. Zhou et al. (2021) consider a setting in which standard active learning is used to finetune a pre-trained model, and use transfer learning to do so. This work does not consider continual learning and active learning in the same setting and is therefore orthogonal to ours. On preventing catastrophic forgetting, we mostly focus on replay-based algorithms, which are currently state-of-the-art in continual learning. However, since Section 3.3 demonstrates how active learning rounds can be seen as continual learning tasks, one can also apply other methods such as EWC (Kirkpatrick et al., 2017a), structural regularization (Li et al., 2021), or functional-regularization-based methods (Titsias et al., 2020).

3.1. BATCH ACTIVE LEARNING

Define $[n] = \{1, \ldots, n\}$, and let $\mathcal{X}$ and $\mathcal{Y}$ denote the input and output domains respectively. AL typically starts with an unlabelled dataset $U = \{x_i\}_{i \in [n]}$, where each $x_i \in \mathcal{X}$. The AL setting allows the model $f$, with parameters $\theta$, to query a user for labels for any $x \in U$, but the total number of labels is limited to a budget $b$, where $b < n$. Throughout this work, we consider classification tasks, so the output of $f(x; \theta)$ is a probability distribution over classes. The goal of AL is to ensure that $f$ attains low error when trained only on the set of $b$ labelled points. Algorithm 1 details the general AL procedure. Lines 3-6 construct the seed set $D_1$ by randomly sampling a subset of points from $U$ and labelling them. Lines 7-13 iteratively expand the labelled set for $T$ rounds by training the model from a random initialization on $D_t$ until convergence and selecting $b_t$ points (where $\sum_{t \in [T]} b_t = b$) from $U$ based on some selection criterion that depends on $\theta_t$. The selection criterion generally chooses samples based on model uncertainty and/or diversity (Lewis & Gale, 1994; Dagan & Engelson, 1995; Settles; Killamsetty et al., 2021; Wei et al., 2015; Ash et al., 2020; Sener & Savarese, 2017). In this work, we primarily consider uncertainty sampling (Lewis & Gale, 1994; Dagan & Engelson, 1995; Settles), though we also test other selection criteria in Section A of the Appendix.
Algorithm 1
1: procedure ACTIVELEARNING(f, U, b_{1:T}, T)
2:   t ← 1, L ← ∅  ▷ Initialize
3:   U_t ∼ U  ▷ Draw b_1 samples from U
4:   D_t ← {(x_i, y_i) | x_i ∈ U_t}  ▷ Provide labels
5:   U ← U \ U_t
6:   L ← L ∪ D_t
7:   while t ≤ T do
8:     Randomly initialize θ_init
9:     θ_t ← Train(f, θ_init, L)
10:    U_t ← Select(f, θ_t, U, b_t)  ▷ Select b_t points from U based on θ_t
11:    D_t ← {(x_i, y_i) | x_i ∈ U_t}
12:    U ← U \ U_t; L ← L ∪ D_t; t ← t + 1
13:  return L

Uncertainty Sampling is a widely used practical AL method that selects $U_t = \{x_1, \ldots, x_{b_t}\}$ to label from $U$ by choosing the samples that maximize a notion of model uncertainty. We consider entropy (Dagan & Engelson, 1995) as the uncertainty metric: if $h(x) \triangleq -\sum_{i \in [k]} f(x;\theta)_i \log f(x;\theta)_i$, then $U_t \in \arg\max_{A : |A| = b_t} \sum_{x \in A} h(x)$.
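As a concrete illustration of entropy-based uncertainty sampling, here is a minimal numpy sketch (the function names are our own, not from the paper's codebase):

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax over the class axis
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy_select(logits, b):
    # h(x) = -sum_i f(x)_i log f(x)_i; return indices of the b highest-entropy points
    p = softmax(logits)
    h = -(p * np.log(p + 1e-12)).sum(axis=1)
    return np.argsort(-h)[:b]

# toy usage: three unlabelled points, three classes; pick the two most uncertain
logits = np.array([[10.0, 0.0, 0.0],   # confident -> low entropy
                   [ 0.0, 0.0, 0.0],   # uniform   -> maximum entropy
                   [ 2.0, 1.0, 0.0]])  # in between
picked = entropy_select(logits, 2)     # the uniform-prediction point ranks first
```

In a real AL round, `logits` would come from a forward pass of f over the unlabelled pool U.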

3.2. CONTINUAL LEARNING

We define $D_{1:n} = \bigcup_{i \in [n]} D_i$. In CL, the dataset consists of $T$ tasks $\{D_1, \ldots, D_T\}$ that are presented to the model sequentially, where $D_t = \{(x_i, y_i)\}_{i \in [n_t]}$ and $n_t$ is the cardinality of $D_t$. At time $t \in [T]$, the data/label pairs are sampled from the current task, $(x, y) \sim D_t$, and the model generally has only limited access to the history $D_{1:t-1}$. The CL objective is to efficiently adapt the model to $D_t$ while ensuring that performance on previously learnt tasks $D_{1:t-1}$ does not degrade appreciably. Ideally, given a loss function $\ell : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, initial parameters $\theta_{t-1}$, and a model $f$, $\theta_t$ can be obtained by solving the CL optimization problem (Aljundi et al., 2019b; Chaudhry et al., 2019; Lopez-Paz & Ranzato, 2017):

$\arg\min_\theta \mathbb{E}_{(x,y)\sim D_t}\, \ell(y, f(x;\theta)) \;\; \text{s.t.} \;\; \mathbb{E}_{(x',y')\sim D_{1:t-1}}\, \ell(y', f(x';\theta)) \le \mathbb{E}_{(x',y')\sim D_{1:t-1}}\, \ell(y', f(x';\theta_{t-1}))$

In this work, we focus on replay-based CL techniques, which attempt to approximately solve the CL optimization problem by using samples from $D_{1:t-1}$ to regularize the model while adapting to $D_t$. Algorithm 2 outlines the general replay-based CL algorithm, in which the objective is to adapt $f$, parametrized by $\theta_0$, to $D$ while using samples from the history $M$. Inside the training loop, $B_{current}$ consists of $m$ points randomly sampled from $D$, and $B_{replay}$ consists of $m'$ points chosen from $M$ by some customizable selection criterion. In line 6, $\theta_t$ is computed based on some update rule that utilizes both $B_{replay}$ and $B_{current}$. Note that many CL works also consider the problem of selecting which samples should be retained in $M$, which is relevant when $D_{1:T}$ is too large to store in memory or when $T$ is unknown (Aljundi et al., 2019b). However, this constraint does not apply to the CAL setting, so in the subsequent sections we consider $M = D_{1:t-1}$.
Algorithm 2
1: procedure CONTINUALTRAIN(f, θ_0, D, M, m, m′)
2:   t ← 1
3:   while not converged do
4:     B_current ← {(x_i, y_i)}_{i=1}^m ∼ D  ▷ Sample m points from current task
5:     B_replay ← Select(f, θ_{t-1}, M, m′)  ▷ Sample m′ replay points from history
6:     θ_t ← Update(f, θ_{t-1}, B_current, B_replay)
7:     t ← t + 1
8:   return θ_t
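To make Algorithm 2 concrete, here is a toy instantiation with a multinomial logistic-regression model and uniform replay selection (essentially the experience-replay update discussed later); the model, data, and hyperparameters are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def grad_step(W, X, Y, lr=0.1):
    # one gradient step of multinomial logistic regression on minibatch (X, Y)
    P = softmax(X @ W)
    G = X.T @ (P - Y) / len(X)
    return W - lr * G

def continual_train(W, D, M, m=8, m_prime=8, steps=50):
    # Algorithm 2 with uniform replay: adapt W to task D while replaying history M
    (Xd, Yd), (Xm, Ym) = D, M
    for _ in range(steps):
        cur = rng.choice(len(Xd), size=min(m, len(Xd)), replace=False)
        rep = rng.choice(len(Xm), size=min(m_prime, len(Xm)), replace=False)
        X = np.vstack([Xd[cur], Xm[rep]])   # interleave current and replay batches
        Y = np.vstack([Yd[cur], Ym[rep]])
        W = grad_step(W, X, Y)
    return W

def blob(center, n=30):
    return np.asarray(center) + 0.3 * rng.normal(size=(n, 2))

# history M: class 0 near (-2,-2), class 1 near (2,2); current task D: new regions
Xm = np.vstack([blob([-2, -2]), blob([2, 2])])
Xd = np.vstack([blob([-2, 2]), blob([2, -2])])
Ym = Yd = np.repeat(np.eye(2), 30, axis=0)  # one-hot labels, 30 per class

W = continual_train(np.zeros((2, 2)), (Xd, Yd), (Xm, Ym))
acc_old = (softmax(Xm @ W).argmax(1) == Ym.argmax(1)).mean()  # history retained
acc_new = (softmax(Xd @ W).argmax(1) == Yd.argmax(1)).mean()  # new task learned
```

Because history samples keep receiving gradient updates through the interleaved minibatch, accuracy on the old task is preserved while the new task is learned.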

3.3. ACTIVE LEARNING AS CONTINUAL LEARNING

A clear inefficiency of standard AL stems from the fact that the model f must be retrained from scratch on the labelled pool at every round. In this work, we employ CL-inspired techniques to adapt to the newly labelled points while significantly reducing the number of updates needed on samples labelled in previous rounds. We first demonstrate that catastrophic forgetting indeed occurs in AL when a model is fine-tuned only on the newly labelled points at every round. In Figure 1, task t denotes the set of points from the training dataset that were selected at round t of querying based on entropy sampling. On the y-axis, we report the accuracy on each set immediately after the model has been fine-tuned on the points that were just labelled at a particular round. The precipitous drops in performance on task t-1 as soon as the model is adapted to the new task t make it evident that the model forgets old information when points are added to the labelled set. Note that task 1, after the initial drop, tends to increase in performance over subsequent AL rounds, since the points belonging to the initial round are chosen uniformly at random (as in Algorithm 1) and thus constitute an unbiased sample of the full dataset. This trend is generally absent in the later tasks, which are sampled from distributions conditioned on the model parameters θ_t. It is also interesting that the model performs considerably worse on all of the tasks (aside from task 1) than it does on the test set, despite having been trained on the labelled pool. This experiment suggests that 1) the distribution of each task t > 1 is distinct from the true data distribution, and 2) techniques designed to combat catastrophic forgetting are necessary to effectively incorporate new information between successive AL rounds.
Algorithm 3
1: procedure CAL(f, U, b, T, m, m′)
2:   t ← 1, L ← ∅  ▷ Initialize
3:   U_t ∼ U  ▷ Draw b_1 samples from U
4:   D_t ← {(x_i, y_i) | x_i ∈ U_t}  ▷ Provide labels
5:   U ← U \ U_t
6:   L ← L ∪ D_t
7:   while t ≤ T do
8:     θ_t ← ContinualTrain(f, θ_{t-1}, D_t, D_{1:t-1}, m, m′)
9:     U_{t+1} ← Select(f, θ_t, U, b_t)  ▷ Select b_t points from U based on θ_t
10:    D_{t+1} ← {(x_i, y_i) | x_i ∈ U_{t+1}}
11:    U ← U \ U_{t+1}; L ← L ∪ D_{t+1}; t ← t + 1
12:  return L

To ameliorate the problem of catastrophic forgetting, we use CL techniques. The continual active learning (CAL) approach is shown in Algorithm 3. The key difference between CAL and standard AL (Algorithm 1) is in line 8: instead of standard training, replay-based CL is used to adapt f to D_t while retaining performance on D_{1:t-1}. The speedup comes from two sources: 1) the number of gradient updates computed on samples from D_{1:t-1} is smaller than on samples in D_t for reasonable choices of m′, and 2) the model tends to converge faster since its parameters are warm-started. We compare several CAL methods and assess each based on its test-set performance and the speedup it attains over standard AL. In the rest of this section, let $L_c \triangleq \mathbb{E}_{(x,y)\sim B_{current}}[\ell(y, f(x;\theta))]$.

Experience Replay (CAL-ER) is the simplest and oldest replay-based method (Ratcliff, 1990; Robins, 1995). In this approach, B_current and B_replay are interleaved to create a minibatch B of size m + m′, and B_replay is chosen uniformly at random from D_{1:t-1}. The parameters θ of model f are updated based on the gradient computed on B.

Maximally Interfered Retrieval (CAL-MIR) addresses the problem of selecting samples from D_{1:t-1} by choosing the m′ points that are most likely to be forgotten (Aljundi et al., 2019a). Given a batch of m labelled samples B_current sampled from D_t and model parameters θ, θ_v is computed by taking a "virtual" gradient step, i.e. θ_v = θ - η∇L_c, where η is the learning rate.
Then, for every example x in the history, the score $s_{MIR}(x) = \ell(f(x;\theta), y) - \ell(f(x;\theta_v), y)$, i.e. the change in loss after taking a single gradient step, is computed. The m′ samples with the highest $s_{MIR}$ score are selected to form B_replay. B_current and B_replay are concatenated to form the minibatch (as in CAL-ER), upon which the gradient update is computed. In practice, selection is done on a random subset of D_{1:t-1} for speed.

Dark Experience Replay (CAL-DER) uses a distillation-based approach to regularize updates (Buzzega et al., 2020). Suppose g(x; θ) denotes the pre-softmax logits of classifier f(x; θ), i.e. f(x; θ) = softmax(g(x; θ)). In DER, every x′ ∈ D_{1:t-1} has an associated z′, which corresponds to the logits produced by the model at the end of the task in which x′ was first observed. In other words, if x′ ∈ D_{t′}, then $z' \triangleq g(x'; \theta^*_{t'})$, where t′ ∈ [t-1] and θ*_{t′} are the parameters obtained after round t′. DER minimizes $L_{DER}$ as expressed below:

$L_{DER} \triangleq L_c + \mathbb{E}_{(x',y',z')\sim B_{replay}}\left[\alpha \|g(x';\theta) - z'\|_2^2 + \beta\, \ell(y', f(x';\theta))\right]$,

where B_current is a batch sampled from D_t, B_replay is a batch sampled from D_{1:t-1}, and α and β are tunable hyperparameters. The first term ensures that samples from the current task are classified correctly. The second term consists of a classification loss and a mean squared error (MSE) based distillation loss applied to samples from the history.

Scaled Distillation (CAL-SD) is a new CL approach we propose in this work, tailored specifically to the CAL setting. SD addresses the stability-plasticity dilemma that is commonly found in both biological and artificial neural networks (Abraham & Robins, 2005; Mermillod et al., 2013). A network is stable if it can effectively retain past information but cannot adapt to new tasks efficiently, whereas a plastic network can quickly learn new tasks but is prone to forgetting.
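Returning to CAL-DER: given stored logits z′, the loss is straightforward to evaluate. Below is a minimal numpy sketch of the forward computation only (helper names are ours; a real implementation would backpropagate through these quantities):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, y):
    # mean negative log-likelihood of integer labels y
    p = softmax(logits)
    return -np.log(p[np.arange(len(y)), y] + 1e-12).mean()

def der_loss(cur_logits, cur_y, rep_logits, rep_y, rep_z, alpha=0.1, beta=1.0):
    # L_DER = L_c + alpha * ||g(x') - z'||^2 + beta * CE(y', f(x')) on the replay batch
    l_c = cross_entropy(cur_logits, cur_y)
    distill = np.mean(np.sum((rep_logits - rep_z) ** 2, axis=1))
    return l_c + alpha * distill + beta * cross_entropy(rep_logits, rep_y)

# sanity check: if the current logits equal the stored logits z', the MSE
# distillation term vanishes; drifting from z' by 1 per class adds alpha * 2
cur_logits = np.array([[2.0, 0.0], [0.0, 2.0]])
cur_y = np.array([0, 1])
rep_logits = np.array([[1.0, -1.0]])
rep_y = np.array([0])
loss_match = der_loss(cur_logits, cur_y, rep_logits, rep_y, rep_z=rep_logits.copy())
loss_drift = der_loss(cur_logits, cur_y, rep_logits, rep_y, rep_z=rep_logits + 1.0)
```

The distillation term anchors the model's logits on history samples to their values at the round when those samples were learned, at a rate controlled by α.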
The trade-off between stability and plasticity is a well-known constraint in CL (Mermillod et al., 2013). In the context of CAL, we would like the model to be plastic during the early rounds and stable during the later rounds. We apply this intuition to develop SD, which minimizes $L_{SD}$ at round t:

$L_{replay} \triangleq \mathbb{E}_{(x',y',z')\sim B_{replay}}\left[\alpha\, D_{KL}(\mathrm{softmax}(z') \,\|\, f(x';\theta)) + (1-\alpha)\, \ell(y', f(x';\theta))\right]$,

$L_{SD} \triangleq \lambda_t L_c + (1-\lambda_t) L_{replay}$, where $\lambda_t \triangleq \frac{1}{1 + |D_{1:t-1}|/|D_t|}$.

Similar to CAL-DER, L_replay is a sum of two losses: a distillation loss and a classification loss. The distillation loss in L_replay minimizes the KL divergence between the posterior probabilities produced by f and softmax(z′), where z′ is defined as in DER. We use a KL divergence term instead of an MSE loss on the logits so that the distillation and classification losses are on the same scale. α ∈ [0, 1] is a tunable hyperparameter. L_SD is a convex combination of the classification loss on the current task and L_replay. The weight of each term is determined adaptively by the stability/plasticity trade-off term λ_t. Higher values of λ_t indicate higher model plasticity, since minimizing the classification error on samples from the current task is prioritized. As |D_{1:t-1}| increases with t, λ_t decreases and the model becomes more stable in the later rounds of training.

Scaled Distillation w/ Submodular Sampling (CAL-SDS2) is another new CL approach we introduce in this work. CAL-SDS2 uses CAL-SD to regularize the model and utilizes submodular sampling to select a diverse set of points from the history to replay. Submodular functions are well suited to capture notions of diversity and representativeness (Lin & Bilmes, 2011; Wei et al., 2015; Bilmes, 2022), and the greedy algorithm can approximately maximize a monotone submodular function with a 1 - e^{-1} factor guarantee (Fisher et al., 1978; Minoux, 1978; Mirzasoleiman et al., 2015).
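Before turning to the submodular objective, the CAL-SD quantities above can be checked numerically. A small sketch (helper names are ours; the equal-round-size schedule is an illustrative assumption):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q):
    # mean KL divergence between corresponding rows of p and q
    return np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1).mean()

def lam_t(history_size, current_size):
    # lambda_t = 1 / (1 + |D_{1:t-1}| / |D_t|): plastic early, stable late
    return 1.0 / (1.0 + history_size / current_size)

def sd_loss(l_c, rep_probs, rep_teacher_logits, rep_ce, t_lambda, alpha=0.25):
    # L_SD = lambda_t * L_c + (1 - lambda_t) * [alpha * KL + (1 - alpha) * CE]
    l_replay = alpha * kl(softmax(rep_teacher_logits), rep_probs) + (1 - alpha) * rep_ce
    return t_lambda * l_c + (1 - t_lambda) * l_replay

# with equal-sized rounds (b_t = b for all t), the schedule reduces to lambda_t = 1/t
lambdas = [lam_t(history_size=(t - 1) * 100, current_size=100) for t in range(1, 6)]

# if the model's replay probabilities match the stored teacher softmax(z'),
# the KL term vanishes and only the scaled classification losses remain
teacher = np.array([[1.0, 0.0]])
sd = sd_loss(l_c=1.0, rep_probs=softmax(teacher), rep_teacher_logits=teacher,
             rep_ce=2.0, t_lambda=0.5)
```

Under equal round sizes, λ_t = 1/t, so the current-task term dominates in early rounds and the replay term dominates later, matching the intended plasticity-to-stability transition.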
We define a submodular function G below:

$G(S) \triangleq \sum_{x_i \in A} \max_{x_j \in S} w_{ij} + \lambda \log\left(1 + \sum_{x_i \in S} h(x_i)\right)$,

The first term of G is the facility location function, where $w_{ij}$ is a similarity score between samples x_i and x_j. In our experiments, $w_{ij} = \exp(-\|z_i - z_j\|^2 / 2\sigma^2)$, where z_i is the penultimate-layer representation of model f for x_i and σ is a hyperparameter. The second term is a concave-over-modular function (Liu et al., 2013), where h(x_i) is some measure of model uncertainty. To speed up SDS2, we randomly subsample from the history before performing submodular maximization, so S ⊆ A ⊆ D_{1:t-1}. The objective of CAL-SDS2 is to ensure that the replayed samples are both difficult and diverse, similar to the motivation of the heuristic employed in Wei et al. (2015).
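A naive greedy maximizer for G is short to write down. The sketch below is our own illustrative implementation (O(m·|A|²), without the lazy/accelerated-greedy optimizations one would use in practice); it selects m replay points from a candidate set A given features and uncertainty scores:

```python
import numpy as np

def greedy_select(A_feats, A_unc, m, sigma=0.1, lam=1.0):
    # Greedily pick m indices approximately maximizing
    #   G(S) = sum_{i in A} max_{j in S} w_ij + lam * log(1 + sum_{i in S} h_i)
    n = len(A_feats)
    d2 = ((A_feats[:, None, :] - A_feats[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))      # similarity matrix w_ij
    S, best, unc_sum, G = [], np.zeros(n), 0.0, 0.0
    for _ in range(m):
        gains = np.full(n, -np.inf)
        for j in range(n):
            if j in S:
                continue
            fl = np.maximum(best, W[:, j]).sum()          # facility location term
            cm = lam * np.log(1 + unc_sum + A_unc[j])     # concave-over-modular term
            gains[j] = fl + cm - G
        j_star = int(np.argmax(gains))
        S.append(j_star)
        best = np.maximum(best, W[:, j_star])             # best[i] = max_{j in S} w_ij
        unc_sum += A_unc[j_star]
        G = best.sum() + lam * np.log(1 + unc_sum)
    return S

# toy usage: two tight clusters with equal uncertainty; the greedy picks one
# representative from each cluster rather than two near-duplicates
A_feats = np.array([[0.0, 0.0], [0.01, 0.0], [5.0, 0.0], [5.01, 0.0]])
S = greedy_select(A_feats, A_unc=np.ones(4), m=2)
```

The facility location term drives the one-per-cluster behaviour: once a cluster is covered, a second point from it adds almost no marginal gain.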

4. RESULTS

In this section, we evaluate the validation performance of the model when we train on different fractions (b/n) of the full dataset. We compute the factor speedup attained by a CAL method by dividing the runtime of standard AL by the runtime of the CAL method. We test the CAL methods on a variety of datasets spanning multiple modalities. The two baselines that do not utilize CAL are AL (standard active learning) and AL w/ WS (active learning with warm starting). We plot speedup vs. mean test accuracy (computed over three random seeds) at different labelling budgets (b/n) for each of the five datasets considered in this work. Qualitatively, methods plotted towards the top-right corner are preferable. The results are also available in tabular form in Appendix A. We adapt the AL framework proposed in Beck et al. (2021) for all experiments presented in this section. In the main paper, we show results for an uncertainty-sampling-based acquisition function, and provide results for other acquisition functions in Appendix B. Our objective is to demonstrate that 1) for every budget and dataset, at least one CAL method can match or outperform standard active learning while achieving a significant speedup, and 2) models trained using a CAL method behave no differently from standard models.

4.1. EXPERIMENTAL SETUP

FMNIST The FMNIST dataset consists of 70,000 28×28 grayscale images of fashion items belonging to 10 classes (Xiao et al., 2017). A ResNet-18 architecture (He et al., 2016) and SGD are used. We apply data augmentations, as in Beck et al. (2021), consisting of random horizontal flips and random crops. On this dataset, we find that a CAL method matches or outperforms standard AL in every setting we test (Figure 2). CIFAR-10 CIFAR-10 consists of 60,000 32×32 color images in 10 different categories (Krizhevsky, 2009). We use a ResNet-18 and the SGD optimizer for all CIFAR-10 experiments. We apply data augmentations consisting of random horizontal flips and random crops. From the results shown in Figure 3, there is at least one CAL method that outperforms standard AL for every budget we examine. Amazon Polarity Similar to Coleman et al. (2020b), we use the Amazon Polarity Review dataset (Zhang et al., 2015), an NLP dataset consisting of Amazon reviews and their corresponding star ratings (5 classes). We consider a total unlabelled pool of 2M sentences and use the VDCNN-9 architecture (Schwenk et al., 2017), trained with the Adam optimizer. As observed in Figure 5, CAL methods achieve speedups while remaining competitive with the standard AL procedure.

MedMNIST

COLA (Warstadt et al., 2018) is another commonly used NLP dataset, which was recently considered in the active learning setting (Ein-Dor et al., 2020). The task is to judge the linguistic acceptability of a sentence, i.e. binary classification. We use a BERT backbone (Devlin et al., 2019) trained with the Adam optimizer. We consider an unlabelled pool of size 7000 and treat the remainder as the test set; similar to Ein-Dor et al. (2020), we use entropy sampling as the acquisition function and report accuracy. Figure 6 reports the performance and speedup of CAL methods with increasing budget, showing their competitive performance with the standard AL procedure. Single-Cell Cell Type Identity Classification Recent single-cell RNA sequencing (scRNA-seq) technologies have enabled large-scale characterization of hundreds of thousands to millions of cells in complex tissues, and accurate cell-type annotation is a crucial step in the study of such datasets. To this end, several deep learning models have been proposed to automatically label new scRNA-seq datasets (Xie et al., 2021). The HCL dataset is a highly class-imbalanced dataset consisting of scRNA-seq data for 562,977 cells across 63 cell types represented in 56 human tissues (Han et al., 2020). The data is divided into training, validation, and test sets via an 80/10/10 split while ensuring similar class proportions across splits. We use the ACTINN model (Ma & Pellegrini, 2019), a four-layer multi-layer perceptron that predicts the cell type for each cell given its expression of 28,832 genes, and use the SGD optimizer for all experiments. From the results shown in Figure 7, the majority of CAL methods outperform standard AL for every subset size we examine.

4.2. SCORE CORRELATION BETWEEN STANDARD AND CAL MODELS

We test whether CAL models behave the same way as models trained using standard AL. Specifically, we assess the degree to which the uncertainty scores of CAL models are correlated with those of standard models. In Figure 8, we show the pairwise correlations between the entropy scores of all the models used in the FMNIST and CIFAR-10 experiments at the end of training (after training on 50% of the data). From the results, it is evident that all the entropy scores are positively correlated, providing an explanation as to why CAL models are able to perform on par with standard models. Figure 8: The correlation of entropy scores on the test set between models trained using AL/CAL at the end of the FMNIST and CIFAR-10 experiments.
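The correlation analysis itself is simple to reproduce. A toy numpy sketch with two hypothetical models whose logits differ by small noise (stand-ins for an AL-trained and a CAL-trained model):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy_scores(logits):
    # per-sample predictive entropy of the softmax distribution
    p = softmax(logits)
    return -(p * np.log(p + 1e-12)).sum(axis=1)

# two hypothetical models whose logits on a shared test set differ by small noise
rng = np.random.default_rng(0)
logits_a = rng.normal(size=(200, 10))
logits_b = logits_a + 0.1 * rng.normal(size=(200, 10))

h_a = entropy_scores(logits_a)
h_b = entropy_scores(logits_b)
r = np.corrcoef(h_a, h_b)[0, 1]   # Pearson correlation of the two entropy vectors
```

In the paper's setting, `logits_a` and `logits_b` would come from the trained AL and CAL models evaluated on the shared test set, and the full analysis computes this correlation for every model pair.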

5. CONCLUSION

In this work, we proposed the framework of CAL and demonstrated its efficacy in speeding up AL across multiple datasets by applying techniques adapted from CL. Across vision, natural language, medical imaging, and biological datasets, we observe that there is always a CAL method that either matches or outperforms standard AL while achieving considerable speedups. Since CAL is independent of model architecture and AL strategy, this framework is applicable to a broad range of settings. Furthermore, CAL provides a novel application for CL so future CL algorithms can be assessed based on their performance on CAL as well as other existing CL benchmarks.

A.1 RESULTS IN TABULAR FORM

In this section, we report all results presented in Section 4 in tabular form. All methods highlighted in blue are methods that use CAL. Accuracy and speedup are reported at five increasing labelling budgets (left to right):

Method    | Accuracy                                                          | Speedup
CAL-ER    | 92.6 ± 0.1  93.9 ± 0.2  94.5 ± 0.1  94.9 ± 0.2  94.9 ± 0.2  | 1.5×  1.4×  2.0×  2.4×  2.8×
CAL-MIR   | 92.6 ± 0.3  93.9 ± 0.2  94.5 ± 0.0  94.9 ± 0.1  94.9 ± 0.0  | 0.9×  1.2×  1.3×  1.5×  1.7×
CAL-DER   | 92.7 ± 0.1  93.9 ± 0.1  94.5 ± 0.1  94.8 ± 0.2  94.9 ± 0.1  | 1.4×  2.0×  2.4×  2.7×  3.1×
CAL-SD    | 92.6 ± 0.1  94.0 ± 0.2  94.5 ± 0.1  94.8 ± 0.2  94.9 ± 0.1  | 1.4×  2.0×  2.4×  2.7×  3.1×
CAL-SDS2  | 92.6 ± 0.1  94.0 ± 0.2  94.6 ± 0.2  94.9 ± 0.1  94.9 ± 0.1  | 1.1×  1.5×  1.7×  1.9×  2.1×
AL w/ WS  | 92.7 ± 0.3  93.8 ± 0.2  94.4 ± 0.1  94.6 ± 0.1  94.4 ± 0.2  | 1.1×  1.4×  1.5×  1.5×  1.5×
AL        | 92.6 ± 0.3  93.8 ± 0.0  94.4 ± 0.1  94.9 ± 0.2  94.9 ± 0.1  | 1.0×  1.0×  1.0×  1.0×  1.0×

C is used to subsample the history before finding the m′ samples to replay, but this parameter is not tuned for any of the presented results. We list the specific set of hyperparameters used for all the main experimental results in this section.

A.2.1 FMNIST

All experiments for FMNIST used a ResNet-18 with an SGD optimizer, with a learning rate of 0.01 and a batch size of 64. For all CAL methods, we fix m′ = 128. An NVIDIA GeForce RTX 1080 GPU was used to run all the reported experiments. We use the Adam optimizer with standard parameters, a learning rate of 0.001, and a batch size of 128. For all CAL methods, we fix m′ = 128. All reported models were trained on an NVIDIA GeForce 1080 Ti.

CAL-MIR

C = 256. $G(S) = \sum_{x_i \in A} \max_{x_j \in S} w_{ij}$, where S ⊆ A and $w_{ij}$ is a similarity score between samples x_i and x_j. In our experiments, $w_{ij} = \exp(-\|z_i - z_j\|^2 / 2\sigma^2)$, where z_i is the penultimate-layer representation of model f for x_i and σ is a hyperparameter. LL_V is the log-likelihood on the validation set V, and LL_T is the log-likelihood on the subset S.

CAL-ER    92.6 ± 0.1  93.9 ± 0.2  94.6 ± 0.2  95.0 ± 0.1  94.9 ± 0.0
CAL-MIR   92.5 ± 0.1  93.8 ± 0.3  94.6 ± 0.1  94.8 ± 0.1  94.9 ± 0.2
CAL-DER   92.7 ± 0.1  93.8 ± 0.1  94.5 ± 0.1  94.7 ± 0.1  95.0 ± 0.2
CAL-SD    92.8 ± 0.1  93.9 ± 0.1  94.7 ± 0.1  94.8 ± 0.3  94.9 ± 0.1
CAL-SDS2  92.8 ± 0.0  93.8 ± 0.2  94.5 ± 0.1  94.8 ± 0.2  94.9 ± 0.1
AL w/ WS  92.5 ± 0.1  93.8 ± 0.3  94.0 ± 0.2  94.3 ± 0.2  94.3 ± 0.0
AL        92.7 ± 0.4  93.9 ± 0.1  94.5 ± 0.1  94.7 ± 0.3  94.8 ± 0.1



Figure 1: This figure shows the performance of a ResNet-18 on CIFAR-10, in the active learning setting where the model is only trained on newly labelled points. At each round, 5% of the full dataset is added to the labelled pool.

Figure 2: FMNIST Results

Figure 5: Amazon Polarity Results

CAL-MIR: C = 256
CAL-DER: α = .1, β = 1
CAL-SD: α = .25
CAL-SDS2: C = 256, α = .25, σ = 0.1, λ = 1

A.2.2 CIFAR-10

All experiments for CIFAR-10 used a ResNet-18 with an SGD optimizer, with a learning rate of 0.02 and a batch size of 20. For all CAL methods, we fix m′ = 40. Training is done on an NVIDIA GeForce RTX 2080.

CAL-MIR: C = 100
CAL-DER: α = .1, β = 1
CAL-SD: α = .25
CAL-SDS2: C = 100, α = .25, σ = 0.1, λ = 0.1

A.2.3 MEDMNIST

All experiments for MedMNIST used a ResNet-18 with an Adam optimizer, with a learning rate of 0.001 and a batch size of 128. For all CAL methods, we fix m′ = 128. All reported models were trained on an NVIDIA GeForce RTX 2080.

CAL-MIR: C = 270, m′ = 128
CAL-DER: m′ = 128, α = .1, β = 1
CAL-SD: m′ = 128, α = .5
CAL-SDS2: C = 270, m′ = 128, α = .5, σ = 0.1, λ = 10

A.2.4 AMAZON POLARITY REVIEW

Throughout our experiments, we sample 2M sentences and use them as the total training set instead.

FMNIST Results

CIFAR-10 Results

MedMNIST Results

Amazon Polarity Results

COLA Results.

Single-Cell Cell-Type Identity Classification Results

SDS2), β ∈ {0.75, 1} (used in CAL-DER), σ ∈ {0.1, 1} (used in CAL-SDS2), and λ ∈ {0.1, 1, 10} (used in CAL-SDS2). C is the hyperparameter used in CAL-MIR and CAL-SDS2.

where z i is the penultimate layer representation of model f for x i and σ is a hyperparameter.
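As a concrete illustration, the similarity kernel and facility location objective defined above can be computed as follows. This is a minimal NumPy sketch under our own naming (the paper's actual implementation may differ):

```python
import numpy as np

def gaussian_similarity(Z, sigma=0.1):
    """Pairwise w_ij = exp(-||z_i - z_j||^2 / (2 sigma^2)),
    where rows of Z are penultimate-layer representations."""
    sq_dists = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def facility_location(W, S):
    """G(S) = sum over the ground set of max_{j in S} w_ij."""
    if len(S) == 0:
        return 0.0
    return W[:, list(S)].max(axis=1).sum()
```

Note that G is monotone submodular, which is what makes the greedy selection in FASS effective.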

CIFAR-10 with Margin Score Sampling

Method      Round 1       Round 2       Round 3       Round 4       Round 5
CAL-ER      — ± 0.1       94.1 ± 0.1    94.8 ± 0.1    95.1 ± 0.3    95.2 ± 0.2
CAL-MIR     92.6 ± 0.2    94.1 ± 0.4    94.9 ± 0.2    95.0 ± 0.2    95.2 ± 0.2
CAL-DER     91.8 ± 0.5    93.1 ± 0.1    94.3 ± 0.3    94.6 ± 0.1    94.8 ± 0.2
CAL-SD      92.5 ± 0.1    93.8 ± 0.1    94.8 ± 0.0    95.1 ± 0.2    95.2 ± 0.0
CAL-SDS2    87.8 ± 1.1    93.4 ± 0.1    94.6 ± 0.1    95.0 ± 0.2    95.2 ± 0.1
AL w/ WS    92.8 ± 0.0    94.0 ± 0.3    94.6 ± 0.1    94.8 ± 0.1    95.0 ± 0.2
AL          92.7 ± 0.1    94.1 ± 0.3    94.9 ± 0.1    95.0 ± 0.2    95.2 ± 0.1

FMNIST with Margin Score Sampling

Method      Round 1       Round 2       Round 3       Round 4       Round 5
CAL-ER      — ± 0.1       89.3 ± 0.1    92.2 ± 0.2    93.4 ± 0.1    93.8 ± 0.0
CAL-MIR     81.9 ± 0.1    89.6 ± 0.2    92.2 ± 0.4    93.6 ± 0.0    94.0 ± 0.2
CAL-DER     83.0 ± 0.2    89.5 ± 0.2    92.2 ± 0.2    93.2 ± 0.2    93.6 ± 0.0
CAL-SD      82.6 ± 0.4    89.9 ± 0.4    92.4 ± 0.2    93.5 ± 0.1    93.8 ± 0.2
CAL-SDS2    82.5 ± 0.2    90.2 ± 0.2    92.5 ± 0.2    93.8 ± 0.2    94.1 ± 0.1
AL w/ WS    83.1 ± 0.1    90.3 ± 0.3    93.0 ± 0.2    93.5 ± 0.3    93.6 ± 0.2
AL          75.1 ± 1.2    87.1 ± 1.0    90.2 ± 0.5    92.0 ± 0.0    92.8 ± 0.5

FMNIST with FASS

Method      Round 1       Round 2       Round 3       Round 4       Round 5
CAL-ER      — ± 0.2       89.8 ± 0.2    92.5 ± 0.2    93.4 ± 0.4    93.7 ± 0.2
CAL-MIR     82.2 ± 0.3    89.4 ± 0.2    92.3 ± 0.1    93.4 ± 0.0    93.5 ± 0.1
CAL-DER     83.1 ± 0.3    89.7 ± 0.2    91.9 ± 0.1    93.1 ± 0.2    93.5 ± 0.1
CAL-SD      83.0 ± 0.3    90.0 ± 0.3    92.5 ± 0.1    93.5 ± 0.1    94.0 ± 0.1
CAL-SDS2    83.0 ± 0.1    90.1 ± 0.1    92.7 ± 0.2    93.5 ± 0.2    94.0 ± 0.0
AL w/ WS    82.8 ± 0.4    90.3 ± 0.1    92.8 ± 0.2    93.6 ± 0.1    93.7 ± 0.3
AL          72.5 ± 2.0    86.6 ± 0.4    90.1 ± 0.4    91.7 ± 0.2    92.9 ± 0.2

CIFAR-10 with FASS

Method      Round 1       Round 2       Round 3       Round 4       Round 5
CAL-ER      — ± 0.0       93.9 ± 0.2    94.3 ± 0.1    94.7 ± 0.1    94.7 ± 0.2
CAL-MIR     92.5 ± 0.0    93.9 ± 0.4    94.3 ± 0.2    94.4 ± 0.2    94.6 ± 0.1
CAL-DER     92.7 ± 0.1    93.9 ± 0.2    94.3 ± 0.3    94.7 ± 0.2    94.9 ± 0.3
CAL-SD      92.6 ± 0.1    93.8 ± 0.1    94.4 ± 0.3    94.6 ± 0.1    94.7 ± 0.1
CAL-SDS2    92.6 ± 0.1    93.9 ± 0.2    94.4 ± 0.2    94.6 ± 0.3    94.7 ± 0.2
AL w/ WS    92.5 ± 0.1    93.6 ± 0.1    93.9 ± 0.1    94.1 ± 0.1    94.3 ± 0.1
AL          92.5 ± 0.2    93.8 ± 0.1    94.2 ± 0.1    94.6 ± 0.2    94.7 ± 0.2

FMNIST with GLISTER

Method      Round 1       Round 2       Round 3       Round 4       Round 5
CAL-ER      — ± 0.3       89.2 ± 0.2    91.9 ± 0.2    93.0 ± 0.1    93.3 ± 0.1
CAL-MIR     81.6 ± 0.3    89.3 ± 0.4    91.7 ± 0.2    92.9 ± 0.1    93.5 ± 0.2
CAL-DER     82.8 ± 0.4    89.5 ± 0.4    91.7 ± 0.4    92.8 ± 0.6    93.1 ± 0.2
CAL-SD      82.5 ± 0.3    89.6 ± 0.2    92.1 ± 0.2    93.1 ± 0.2    93.8 ± 0.1
CAL-SDS2    81.4 ± 0.4    89.1 ± 0.2    92.1 ± 0.2    93.2 ± 0.3    93.9 ± 0.1
AL w/ WS    81.7 ± 0.4    89.3 ± 0.4    92.1 ± 0.3    93.0 ± 0.1    93.3 ± 0.4
AL          81.0 ± 0.6    88.5 ± 0.5    91.5 ± 0.3    93.0 ± 0.2    93.4 ± 0.3




CAL-DER    α = 0.25, β = 0.75
CAL-SD     α = 0.5
CAL-SDS2   C = 256, α = 0.75, σ = 1, λ = 1

A.2.5 COLA

For all of our experiments, we use Huggingface's transformers library Wolf et al. (2020) with a maximum sentence length of 100. We use the Adam optimizer with a learning rate of 5 × 10⁻⁵, a batch size of 25, and m′ = 25. Models were trained on a single GeForce 1080 Ti.

A.2.6 SINGLE-CELL CELL-TYPE IDENTITY CLASSIFICATION

All experiments use the SGD optimizer with standard parameters, a learning rate of 0.001, and a batch size of 128. For all CAL methods, we fix m′ = 128. Training is done on an NVIDIA A100-PCIE-40GB.

B RESULTS FOR ADDITIONAL ACTIVE LEARNING STRATEGIES

In this section, we demonstrate that CAL methods are able to accelerate AL strategies other than entropy sampling without incurring any significant performance drops. We test multiple AL strategies on FMNIST Xiao et al. (2017) and CIFAR-10 Krizhevsky (2009). Note that the speedups are approximately the same as the ones reported in Section A, since the training time is generally independent of the chosen AL strategy.

B.1 OVERVIEW OF STRATEGIES

Margin Score Sampling This strategy is another form of uncertainty sampling Settles (2009), as described in the main paper. Instead of the entropy of f(x; θ), the margin score is used as the uncertainty score, i.e. h(x) ≜ 1 − (f(x; θ)_i − f(x; θ)_j), where i and j are the indices corresponding to the highest and second-highest values of f(x; θ), respectively.

FASS FASS Wei et al. (2015) is a two-staged selection method that uses both uncertainty sampling and submodular maximization. First, a set of samples A of cardinality c · b_t is chosen from U using uncertainty sampling, where c > 1 is a tuneable hyperparameter. Next, U_t is constructed by greedily selecting samples that maximize a submodular set function G : 2^A → R_+ defined on ground set A. Entropy is once again used as the uncertainty metric for the initial stage. For the second stage, G is defined to be the facility location function Wei et al. (2015) expressed below:

G(S) = Σ_{x_i ∈ A} max_{x_j ∈ S} w_ij,

where S ⊆ A and w_ij is a similarity score between samples x_i and x_j.
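The two strategies above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; function names, the Gaussian similarity over feature vectors Z, and the greedy loop are our own choices:

```python
import numpy as np

def margin_scores(probs):
    """Margin score h(x) = 1 - (top1 - top2): higher means more uncertain."""
    part = np.sort(probs, axis=1)
    return 1.0 - (part[:, -1] - part[:, -2])

def fass_select(probs, Z, b, c=2, sigma=0.1):
    """FASS-style two-stage selection: entropy filter, then greedy
    facility location maximization over the candidate set A."""
    # Stage 1: keep the c*b most uncertain candidates by entropy.
    ent = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    A = np.argsort(-ent)[: c * b]
    # Pairwise Gaussian similarities w_ij over the candidates.
    D = ((Z[A][:, None, :] - Z[A][None, :, :]) ** 2).sum(-1)
    W = np.exp(-D / (2 * sigma ** 2))
    # Stage 2: greedily add the point with the largest marginal gain in G.
    selected, cover = [], np.zeros(len(A))
    for _ in range(b):
        # Marginal gain of adding column j: new coverage minus current coverage.
        gains = np.maximum(W, cover[:, None]).sum(axis=0) - cover.sum()
        gains[selected] = -np.inf
        j = int(np.argmax(gains))
        selected.append(j)
        cover = np.maximum(cover, W[:, j])
    return A[np.array(selected)]
```

Because G is monotone submodular, this greedy loop enjoys the standard (1 − 1/e) approximation guarantee for the second stage.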

C ADDITIONAL DETAILS ON SINGLE-CELL CELL-TYPE IDENTITY CLASSIFICATION DATASET

The human cell landscape (HCL) dataset consists of scRNA-seq data for 562,977 cells across 63 cell types represented in 56 human tissues. Each cell type may be present in multiple tissues. The cell-type classes are highly imbalanced, with the rarest cell type, human embryonic stem cell, accounting for 0.00006% of the total dataset and the most common, fibroblast, accounting for 0.06%. The raw data is first library-size normalized and scaled to 10,000 total reads per cell, followed by log-transformation. We visualize the dataset using UMAP in Figure 9.
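The preprocessing described above (library-size normalization to 10,000 total reads per cell, followed by log-transformation) can be sketched as follows. This is a minimal NumPy version for illustration; in practice, scanpy's `normalize_total` and `log1p` preprocessing routines perform the equivalent steps:

```python
import numpy as np

def preprocess_counts(X, target_sum=10_000):
    """Normalize each cell (row) of a counts matrix to `target_sum`
    total reads, then apply a log(1 + x) transformation."""
    lib = X.sum(axis=1, keepdims=True)  # library size (total reads) per cell
    lib[lib == 0] = 1                   # guard against empty cells
    X_norm = X / lib * target_sum       # library-size normalization
    return np.log1p(X_norm)             # log-transformation
```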

