A SIMPLE YET POWERFUL DEEP ACTIVE LEARNING METHOD WITH SNAPSHOT ENSEMBLES

Abstract

Given an unlabeled pool of data and experts who can label them, active learning aims to build an agent that effectively selects the data to query the experts with, maximizing the performance gain once the model is trained on them. While there are several principles for active learning, a prevailing approach is to estimate the uncertainty of predictions for unlabeled samples and use it to define acquisition functions. Active learning under the uncertainty principle works well for deep learning, especially for large-scale image classification with deep neural networks. Still, it is often overlooked how the uncertainty of predictions is estimated, despite the common findings on the difficulty of accurately estimating the uncertainty of deep neural networks. In this paper, we highlight the effectiveness of snapshot ensembles for deep active learning. Compared to previous approaches based on Monte-Carlo dropout or deep ensembles, we show that a simple acquisition strategy based on uncertainties estimated from parameter snapshots gathered along a single optimization path significantly improves the quality of the acquired samples. Based on this observation, we further propose an efficient active learning algorithm that maintains a single learning trajectory throughout all active learning episodes, unlike existing algorithms that train models from scratch in every episode. Through extensive empirical comparisons, we demonstrate the effectiveness of snapshot ensembles for deep active learning.

1. INTRODUCTION

The progress of deep learning is largely driven by data, and we often work with well-curated and labeled benchmark data for model development. In practice, however, such nicely labeled data are rarely available. Much of the data accessible to practitioners is unlabeled, and more importantly, labeling such data incurs costs due to the human effort involved in the labeling process. Active Learning (AL) may reduce the gap between the ideal and real-world scenarios by selecting informative samples from the unlabeled pool of data, so that after being labeled and trained on, they maximally improve the model's performance. The main ingredient of an AL algorithm is the acquisition function, which ranks the samples in an unlabeled pool with respect to their utility for improvement. While there are several possible design principles (Ren et al., 2021), in this paper we mainly focus on acquisition functions based on the uncertainty of predictions. Intuitively, given a model trained with the data acquired so far, an unlabeled example exhibiting high predictive uncertainty with respect to the model is a "confusing" sample that would substantially improve the model if trained with a label acquired from experts. A popular approach in this line is Bayesian Active Learning by Disagreement (BALD) (Houlsby et al., 2011), where a committee of multiple models predicts an unlabeled sample and the degree of disagreement is used as a ranking factor. Here, the multiple models are usually constructed in a Bayesian fashion, and their disagreement reflects the model uncertainty about the prediction. BALD has been demonstrated to scale well to modern deep neural networks on high-dimensional and large-scale data (Gal et al., 2017). Similar to BALD, many uncertainty-based AL algorithms employ a committee of models to estimate the uncertainty of predictions.
The problem is that, for deep neural networks trained on high-dimensional data, it is often frustratingly difficult to accurately estimate the uncertainty. To address this, Gal et al. (2017) proposed to use Monte-Carlo DropOut (MCDO) (Gal and Ghahramani, 2017), an instance of variational approximation to the posterior and predictive uncertainty, while Rakesh and Jain (2021) suggested using more generic spike-and-slab variational posteriors (Louizos et al., 2017). Nevertheless, variational approximations tend to underestimate posterior variances (Blei et al., 2017; Le Folgoc et al., 2021), so uncertainty-based acquisition functions computed from them may be suboptimal. Alternatively, one can employ Deep Ensembles (DE) (Lakshminarayanan et al., 2017), where a single model is trained multiple times on the same data but with different random seeds for initialization and mini-batching. Despite being simple to implement, DE works surprisingly well, surpassing most Bayesian Neural Network (BNN) alternatives in terms of accuracy and predictive uncertainty (Fort et al., 2021; Ovadia et al., 2019). In this vein, Beluch et al. (2018) highlighted the effectiveness of DE as a way to estimate uncertainty for acquisition functions and demonstrated excellent performance. A drawback of DE is that it is computationally expensive, as multiple models must be trained and maintained for inference. As an alternative, Snapshot Ensembles (SE) (Huang et al., 2017; Garipov et al., 2018) collect multiple model snapshots (checkpoints) within a single learning trajectory, rather than at the end of multiple learning trajectories as in DE. Compared to DE, SE constructs a decent set of models without going through multiple training runs while not losing too much accuracy. Inspired by this advantage, we study the use of SE in the context of AL.
Specifically, we estimate uncertainties from SE and use them to evaluate uncertainty-based acquisition functions. Through extensive empirical comparisons, we demonstrate that AL based on SE significantly outperforms existing approaches and is even comparable to or better than AL with DE. This result is somewhat surprising, since it is often reported that SE is less accurate than DE (Ashukha et al., 2020). Moreover, based on this observation, we propose a novel AL algorithm that substantially reduces the number of training steps required until the final acquisition. Typically, an AL algorithm alternates between acquiring labels based on a model and re-training the model with the newly acquired labels. Here, for every re-training step, the old models are discarded and a new model is trained from scratch. Instead, we suggest maintaining a model on a single learning trajectory throughout the entire AL procedure and gathering snapshots from that trajectory to compute acquisition functions. We show that this can significantly reduce the number of training steps without sacrificing too much accuracy. In summary, our contributions are as follows:
• We propose to use SE for uncertainty-based acquisition functions in AL and demonstrate its effectiveness through various empirical evaluations.
• We propose a novel AL algorithm in which a single learning trajectory is maintained and used to compute acquisition functions throughout the entire AL procedure. We demonstrate that our algorithm achieves decent accuracy with far fewer training steps.

2.1. SETTINGS AND BASIC ACTIVE LEARNING ALGORITHM

In this paper, we mainly discuss the $K$-way classification problem, where the goal is to learn a classifier $f(\cdot;\theta)$, parameterized by $\theta$, which takes an input $x \in \mathbb{R}^d$ and produces a $K$-dimensional probability vector, that is, $f(x;\theta) \in [0,1]^K$ with $\sum_{k=1}^K f_k(x;\theta) = 1$. To learn $\theta$, we need a labeled dataset consisting of pairs of an input $x$ and a corresponding label $y \in \{1,\dots,K\}$, but in AL we are given only an unlabeled dataset $\mathcal{U} = \{x_i\}_{i=1}^n$ without labels. An AL algorithm is defined by a classifier $f(\cdot;\theta)$ and an acquisition function $a : \mathbb{R}^d \to \mathbb{R}$ measuring how useful an unlabeled example $x$ is to the classifier. Given $f$ and $a$, an AL algorithm alternates between acquiring labels for chosen unlabeled samples and training the classifier with the labeled samples. A single iteration of acquiring samples and training the classifier is called an episode. In the first episode, $m$ samples are randomly chosen from $\mathcal{U}$, and labels are acquired for them to constitute an initial training set $\mathcal{D}_{\text{train}}$. The classifier is then trained with $\mathcal{D}_{\text{train}}$ to obtain $\theta_1$, or in the case of ensemble-based AL, a set of parameters $\{\theta^{(s)}_1\}_{s=1}^S$. The labeled samples are removed from $\mathcal{U}$. For all subsequent episodes $t \geq 2$, with the parameters $\{\theta^{(s)}_{t-1}\}_{s=1}^S$ from the previous episode, the samples remaining in $\mathcal{U}$ are ranked by the acquisition function, and the top $m$ are selected to be labeled. The newly labeled $m$ samples are appended to $\mathcal{D}_{\text{train}}$, and the classifier is trained from scratch on the extended $\mathcal{D}_{\text{train}}$ to obtain $\{\theta^{(s)}_t\}_{s=1}^S$. The algorithm terminates when it reaches a predetermined number of episodes $T$, and the goal of AL is to maximize the accuracy of the classifier after the final episode.
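The episode loop above can be sketched in a few lines of Python; `train`, `acquisition`, and `oracle_label` are hypothetical stand-ins for the classifier training routine, the acquisition function $a$, and the expert query, respectively (none of them come from the paper's code).

```python
import random

def active_learning(pool, train, acquisition, oracle_label, m, T):
    """Generic pool-based AL loop. `pool` is a list of unlabeled inputs;
    `train` maps a labeled set to (a set of) parameters; `acquisition`
    scores an unlabeled input given parameters; `oracle_label` queries
    an expert for a label."""
    pool = list(pool)
    # Episode 1: seed the training set with m randomly chosen samples.
    random.shuffle(pool)
    queried, pool = pool[:m], pool[m:]
    d_train = [(x, oracle_label(x)) for x in queried]
    params = train(d_train)              # {theta^(s)}_{s=1..S} or a single theta
    for t in range(2, T + 1):
        # Rank the remaining pool by the acquisition function, take the top m.
        pool.sort(key=lambda x: acquisition(x, params), reverse=True)
        queried, pool = pool[:m], pool[m:]
        d_train += [(x, oracle_label(x)) for x in queried]
        params = train(d_train)          # vanilla AL re-trains from scratch
    return params, d_train
```

After $T$ episodes, $m \cdot T$ samples have been labeled and the final classifier is the one trained on the full acquired set.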

2.2. ACQUISITION FUNCTIONS

Here, we review some popular choices of acquisition functions, especially those based on predictive uncertainties. We consider acquisition functions that estimate the uncertainty of predictions via a set of parameters $\{\theta^{(s)}\}_{s=1}^S$, which defines a committee of models $\{f(\cdot;\theta^{(s)})\}_{s=1}^S$.
Maximum Entropy (ME). ME measures the predictive entropy of a given example $x$, which can be approximated as
$$\mathrm{H}[y|x,\mathcal{D}_{\text{train}}] = -\sum_{k=1}^K p(y=k|x,\mathcal{D}_{\text{train}}) \log p(y=k|x,\mathcal{D}_{\text{train}}) \approx -\sum_{k=1}^K \Big(\frac{1}{S}\sum_{s=1}^S f_k\big(x;\theta^{(s)}\big)\Big) \log \Big(\frac{1}{S}\sum_{s=1}^S f_k\big(x;\theta^{(s)}\big)\Big).$$
Larger entropy means that the model is more uncertain about the prediction.
Variation Ratio (VR). VR (Freeman, 1965) measures how certain the model is about its prediction for $x$, or how many of the committee members agree with the modal prediction, and is calculated as $1 - f_m/S$, where $f_m$ is the frequency of the modal prediction over the $S$ committee members. Similarly, Least Confident (LC) sampling chooses the least confident sample via $1 - \max_k p(y=k|x,\mathcal{D}_{\text{train}})$, and Margin (MAR) sampling chooses examples with the smallest difference between the largest and second-largest probabilities.
Bayesian Active Learning by Disagreement (BALD). BALD (Houlsby et al., 2011) measures the mutual information between the label and the parameters given an input $x$ and the training data $\mathcal{D}_{\text{train}}$. It can also be interpreted as measuring disagreement among the predictions of the committee members:
$$\mathrm{I}[y,\theta|x,\mathcal{D}_{\text{train}}] = \mathrm{H}[y|x,\mathcal{D}_{\text{train}}] - \mathbb{E}_{\theta|\mathcal{D}_{\text{train}}}\big[\mathrm{H}[y|x,\theta]\big] \approx -\sum_{k=1}^K \Big(\frac{1}{S}\sum_{s=1}^S f_k\big(x;\theta^{(s)}\big)\Big) \log \Big(\frac{1}{S}\sum_{s=1}^S f_k\big(x;\theta^{(s)}\big)\Big) + \frac{1}{S}\sum_{s=1}^S \sum_{k=1}^K f_k\big(x;\theta^{(s)}\big) \log f_k\big(x;\theta^{(s)}\big).$$
BALD is maximized when each committee member is certain about its own prediction (small $\mathrm{H}[y|x,\theta^{(s)}]$) but the predictions disagree with each other, so that the averaged prediction becomes uncertain (high $\mathrm{H}[y|x,\mathcal{D}_{\text{train}}]$).
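As an illustration, the acquisition scores above can be sketched in NumPy; `probs` denotes an $S \times K$ array of committee predictions for a single example, and the function names are our own, not from any library.

```python
import numpy as np

def max_entropy(probs):
    """ME: entropy of the committee-averaged prediction (S x K -> scalar)."""
    p = probs.mean(axis=0)
    return -np.sum(p * np.log(p + 1e-12))

def variation_ratio(probs):
    """VR: 1 - f_m / S, where f_m is the frequency of the modal class
    among the S committee members' hard predictions."""
    votes = probs.argmax(axis=1)
    f_m = np.bincount(votes).max()
    return 1.0 - f_m / len(votes)

def bald(probs):
    """BALD: entropy of the mean prediction minus mean per-member entropy."""
    mean_member_entropy = -np.mean(np.sum(probs * np.log(probs + 1e-12), axis=1))
    return max_entropy(probs) - mean_member_entropy

def margin(probs):
    """MAR: negative gap between the two largest averaged probabilities,
    so that a larger score again means a more uncertain example."""
    p = np.sort(probs.mean(axis=0))
    return -(p[-1] - p[-2])
```

With these conventions, every score is "larger = more uncertain", so ranking the pool by any of them and taking the top $m$ matches the acquisition step in Section 2.1.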

2.3. ESTIMATING UNCERTAINTY

In Bayesian AL, the model parameter $\theta$ is treated as a random variable with prior $p(\theta)$, and the posterior $p(\theta|\mathcal{D}_{\text{train}})$ is approximated via a set of samples $\{\theta^{(s)}\}_{s=1}^S$. There are several ways to approximate it.
Variational approximations. For variational approximation, an easy-to-handle variational distribution $q(\theta)$ is introduced, learned to minimize $D_{\mathrm{KL}}[q(\theta)\,\|\,p(\theta|\mathcal{D}_{\text{train}})]$, and used as a proxy for $p(\theta|\mathcal{D}_{\text{train}})$. That is, once the approximate distribution $q(\theta)$ is obtained, we draw $\theta^{(1)},\dots,\theta^{(S)} \overset{\text{i.i.d.}}{\sim} q(\theta)$. A popular choice for $q(\theta)$ is a mean-field Gaussian distribution (Blundell et al., 2015). In the AL literature, MCDO has been widely used following Gal et al. (2017), where dropout (Srivastava et al., 2014) is applied to the model $f$ and the randomness due to dropout is interpreted as an approximate posterior $q(\theta)$. While relatively simple to implement, variational approximations are known to underestimate posterior variances.
Deep ensembles. DE (Lakshminarayanan et al., 2017) trains $f(\cdot;\theta)$ multiple times on the same $\mathcal{D}_{\text{train}}$ but with different random initializations to obtain $\{\theta^{(s)}\}_{s=1}^S$. DE is simple to implement, and yet its performance is remarkable, achieving state-of-the-art results across various applications. The power of DE comes mainly from its ability to pick parameters from multiple modes (Fort et al., 2021), so the committee constructed from them yields a diverse set of predictions. Even though it does not explicitly assume a prior $p(\theta)$, DE can roughly be interpreted as an approximate Bayesian inference method (Wilson and Izmailov, 2021; D'Angelo and Fortuin, 2021), so the parameters $\{\theta^{(s)}\}_{s=1}^S$ can be interpreted as posterior samples capturing the uncertainty of the models.
Snapshot ensembles. DE is expensive for both training and inference, since it has to keep multiple models with different parameters.
SE (Huang et al., 2017; Garipov et al., 2018) reduces the training cost of DE by gathering the multiple parameters $\{\theta^{(s)}\}_{s=1}^S$ within a single training run rather than from multiple training runs. To obtain diverse parameters, the learning rate of the training run is carefully chosen to encourage the optimization path to explore a wide area of the loss surface, and parameter "snapshots" are periodically captured during the run. SE usually underperforms DE given the same number of gathered parameters, but it collects them much faster since it requires only a single training run.

3.1. ACTIVE LEARNING WITH SNAPSHOT ENSEMBLES

We first present an AL algorithm based on SE that is simple and efficient. In each episode, we store parameter snapshots at regular intervals during the classifier training stage, which are then used to compute the acquisition function at the end of the episode. This approach incurs no additional training cost, unlike AL based on DE. In the final episode, we have several options for training the classifier on the acquired data $\mathcal{D}_{\text{train}}$: following a single learning trajectory and taking the parameters at the last step as a point estimate, or applying DE to obtain an ensembled model. Algorithm 1 summarizes our SE-based AL algorithm, where the final classifier is obtained with vanilla Stochastic Gradient Descent (SGD), but DE can be applied instead. In Section 5, we demonstrate that this simple modification with SE, despite no increase in training and inference costs, significantly improves the performance, even outperforming AL with DE.
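As a minimal sketch of the capture rule used in Algorithm 1 (a burn-in of N_thres steps, then one snapshot every ⌊(N − N_thres)/S⌋ steps), assuming snapshots are taken strictly after the burn-in; the exact off-by-one convention is our assumption:

```python
def snapshot_epochs(n_total, n_thres, s):
    """Epochs at which parameter snapshots are captured: after a burn-in
    of n_thres epochs, one snapshot every floor((n_total - n_thres) / s)
    epochs, ending at the last epoch."""
    interval = (n_total - n_thres) // s
    return [j for j in range(1, n_total + 1)
            if j > n_thres and (j - n_thres) % interval == 0]
```

With the settings reported in the appendix (N = 200, N_thres = 150, S = 5), this yields snapshots at 10-epoch intervals over the last quarter of training.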

3.2. ACTIVE LEARNING WITH SNAPSHOT ENSEMBLES AND FINE-TUNING

Algorithm 1 trains a classifier in each intermediate episode and discards it after computing the acquisition function. However, considering that the acquisition function computed from a single learning trajectory works well, we can improve efficiency by keeping a single trajectory throughout all episodes, using the resulting parameters of the previous episode as the initialization for the next episode. We call this strategy SE + FT. There are two things to note here. First, since we continuously fine-tune a single model, we need fewer training steps per episode than vanilla AL. Second, although the intermediate classifiers may be less accurate than those trained from scratch, as expected, this matters little, since what really matters is the accuracy after the final episode. We argue that the intermediate episodes exist to acquire samples quickly for the final training, so the accuracy of the classifier during that process is not important. Our AL procedure with fine-tuning is also reminiscent of continual learning, in the sense that a single model is continually trained over multiple episodes with different data. To this end, we employ two commonly used tricks from the continual learning literature.

Algorithm 1 (AL with SE):
  Randomly draw m samples from U, remove them from U, and set them as D_train.
  for t = 1, ..., T do
    Randomly initialize θ_0. Set Θ ← ∅.
    for j = 1, ..., N do
      Draw a mini-batch B from D_train.
      θ_j ← θ_{j-1} − η(j) ∇_θ J(B, θ_{j-1}).
      if j ≥ N_thres ∧ mod(j − N_thres, ⌊(N − N_thres)/S⌋) = 0 then Θ ← Θ ∪ {θ_j}.
    end
    if t < T then compute a(x, Θ) for all x ∈ U; pick the top m samples, remove them from U, and append them to D_train.
    else set θ* ← θ_N.
  end

Algorithm 2 (AL with SE + FT):
  Randomly draw m samples from U, remove them from U, and set them as D_train.
  for t = 1, ..., T do
    if t = 1 ∨ t = T then randomly initialize θ_0.  // otherwise θ_0 is carried over from the previous episode
    Set Θ ← ∅.
    for j = 1, ..., N do
      Draw a mini-batch B from D_train.
      θ_j ← θ_{j-1} − η(j) ∇_θ ( J(B, θ_{j-1}) + λ 1{t>1} ∥θ_{j-1} − θ_0∥² ).
      if j ≥ N_thres ∧ mod(j − N_thres, ⌊(N − N_thres)/S⌋) = 0 then Θ ← Θ ∪ {θ_j}.
    end
    if t < T then
      Compute a(x, Θ) for all x ∈ U; pick the top m samples, remove them from U, and append them to D_train.
      θ_0 ← θ_N.  // reuse in the next episode
    else set θ* ← θ_N.
  end
Replay buffer. In principle, we could fine-tune using only the data newly acquired in the previous episode. However, this would cause catastrophic forgetting (McCloskey and Cohen, 1989), so the acquisition function based on it may be biased towards recently acquired data. To prevent this, we adopt the idea of a replay buffer (Jung et al., 2016; Rolnick et al., 2019; Aljundi et al., 2019), where we draw some portion of the data from the newly acquired samples and the remaining portion from past data. We empirically find that this significantly improves the stability of the acquisition functions.
Regularization. Similarly, we regularize the fine-tuning procedure to avoid deviating too much from the previous parameters (Kirkpatrick et al., 2017). That is, we optimize the parameters $\theta$ with the $\ell_2$-regularizer $\|\theta - \theta_0\|^2$, where $\theta_0$ is the starting point of the fine-tuning (the parameters passed from the previous episode). We also find that this regularization improves the quality of the acquisition functions, leading to better classification accuracy in the final episode. Algorithm 2 summarizes AL with fine-tuning; the parts that differ from AL without fine-tuning are marked in blue.
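The two tricks can be sketched in NumPy as follows; `make_finetune_batch` and `regularized_grad` are hypothetical helper names of our choosing, and the gradient form assumes the $\ell_2$ penalty $\lambda\|\theta - \theta_0\|^2$ added to the objective during fine-tuning.

```python
import numpy as np

def make_finetune_batch(rng, new_data, old_data, n_new, n_old):
    """Replay buffer: mix newly acquired samples with a random draw from
    past episodes, so the fine-tuned model (and hence the acquisition
    function) is not biased toward recently acquired data."""
    new_idx = rng.choice(len(new_data), size=min(n_new, len(new_data)), replace=False)
    old_idx = rng.choice(len(old_data), size=min(n_old, len(old_data)), replace=False)
    return [new_data[i] for i in new_idx] + [old_data[i] for i in old_idx]

def regularized_grad(grad_loss, theta, theta0, lam):
    """Gradient of J(theta) + lam * ||theta - theta0||^2: the l2 term keeps
    the fine-tuned parameters close to the previous episode's solution."""
    return grad_loss + 2.0 * lam * (theta - theta0)
```

A fine-tuning step would then draw a mixed batch, compute the loss gradient, and apply `regularized_grad` before the SGD update.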

4. RELATED WORKS

Active Learning. Based on how an unlabeled example is fed to an AL agent, AL can broadly be categorized into membership query synthesis, where the agent even generates examples in the sample space; stream-based selective sampling, where the agent decides online whether a given input would be helpful if labeled; and pool-based AL, where the agent can access a large unlabeled pool (Settles, 2009). Pool-based AL can be further divided into uncertainty-based, diversity-based, and hybrid approaches (Ren et al., 2021). Geifman and El-Yaniv (2019) first introduced neural architecture search into AL, claiming that an over-parameterized model can lead to overfitting, especially in earlier episodes, and therefore to inaccurate uncertainty estimates. Similarly, Munjal et al. (2022) argued that the optimal hyperparameters may vary with the number of labeled examples and used Bayesian hyperparameter optimization (AutoML) in every episode.
Ensemble. Ensembles of neural networks show competitive performance improvements and are widely used in machine learning and deep learning. In addition, they improve estimates of predictive uncertainty. DE (Lakshminarayanan et al., 2017) is one of the best-behaved methods for estimating the uncertainty of deep neural networks. It works with several classifiers trained with the same dataset and architecture but with different seeds for the random number generator. However, the biggest limitation of ensembles is their computational cost, because multiple models must be trained (Izmailov et al., 2018).
Active Learning with Ensemble. The uncertainty of an example for neural networks can be estimated using an ensemble of neural networks in the context of AL. Gal

5. EXPERIMENTS

In this section, through an extensive empirical comparison on three image classification benchmarks (CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), and Tiny ImageNet (Le and Yang, 2015)), we demonstrate the following:
• Measuring uncertainty via SE for AL is effective: comparable to, or even better than, DE across various choices of uncertainty-based acquisition functions.
• AL with SE + FT can significantly reduce the training cost without sacrificing too much accuracy.
• The reason SE is effective for AL is that it builds a committee of models yielding diverse predictions. Interestingly, for the purpose of acquiring higher-quality samples, the diversity of the models' predictions is more important than their accuracy.
We compare AL algorithms along three axes: acquisition functions (VR, ME, BALD, MAR), algorithms to measure uncertainty (DE, SE, MCDO), and how the classifier is trained in the final episode (a single model via vanilla SGD, or DE). We report results with ResNet-18 (He et al., 2016). Please refer to Appendix B for more details, such as experimental protocols and hyperparameter settings. The test accuracies on CIFAR-10, CIFAR-100, and Tiny ImageNet are summarized in Table 1, Table 2, and Table 3, respectively, as a function of the proportion of labeled examples. Due to limited resources, we report results with four and three acquisition functions for CIFAR-10 and CIFAR-100, respectively, and with only VR for Tiny ImageNet.

5.1. ANALYSIS OF THE MAIN RESULTS

Effectiveness of SE for AL. 

5.2. ANALYSIS OF THE UNCERTAINTY ESTIMATION

In this section, we analyze SE in more detail to see why it is more effective than DE for AL. We conjecture that SE builds a committee of models whose predictions are more diverse than those from other methods, even within a single trajectory, and that this is key to its success in AL. To verify this, we measure the average KL-divergence and pairwise disagreement (Melville and Mooney, 2005) of the predictions computed from SE, DE, and MCDO. The disagreement between two class probabilities $f(\cdot;\theta^{(i)})$ and $f(\cdot;\theta^{(j)})$ on an example $x$ is calculated as $d_{i,j}(x) = \mathbb{1}\{\arg\max_k f_k(x;\theta^{(i)}) \neq \arg\max_k f_k(x;\theta^{(j)})\}$. As summarized in Fig. 2 (left), SE generally exhibits much higher KL-divergence and disagreement values among its predictions than DE and MCDO. As reported in the literature (Fort et al., 2021), DE shows higher disagreement than MCDO but still much less than SE. We believe this is mainly due to the nature of AL, where we usually work with relatively little data and fewer training steps than in typical supervised learning settings. DE parameters are collected at the end of each training run, so the models are likely to have reached a local optimum. On the other hand, SE collects parameter snapshots during a single training run, so some of them may not have converged to the local optimum. This degrades classification accuracy, but as we point out in Section 5.1, for the purpose of acquisition, the diversity of predictions within a single trajectory is more important than the accuracy of the individual models. In Fig. 2 (right), the number of correctly classified examples is visualized as light gray areas; this can be interpreted as an unnormalized reliability diagram (Murphy and Winkler, 1977). For instance, for the bin with VR = 0.8 (all disagree), SE exhibits a test error of 72.1%, while DE shows only 46.7%. However, for the bin with VR = 0.0 (consensus), SE and DE display error rates of 6.1% and 15.5%, respectively.
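The two diversity measures can be sketched in NumPy as follows; `preds` holds the committee's hard labels ($S \times n$) and `probs` their class probabilities ($S \times n \times K$), with function names of our choosing.

```python
import numpy as np

def pairwise_disagreement(preds):
    """preds: S x n array of predicted labels. Returns the mean, over member
    pairs (i < j), of the fraction of examples where they predict differently,
    i.e. the average of d_{i,j}(x) over pairs and examples."""
    s = len(preds)
    pairs = [(i, j) for i in range(s) for j in range(i + 1, s)]
    return float(np.mean([np.mean(preds[i] != preds[j]) for i, j in pairs]))

def mean_pairwise_kl(probs):
    """probs: S x n x K committee probabilities. Average KL(p_i || p_j) over
    ordered member pairs, averaged over examples."""
    s = len(probs)
    kls = [np.mean(np.sum(probs[i] * (np.log(probs[i] + 1e-12)
                                      - np.log(probs[j] + 1e-12)), axis=-1))
           for i in range(s) for j in range(s) if i != j]
    return float(np.mean(kls))
```

Higher values of either measure indicate a more diverse committee; a committee of identical members scores zero on both.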
As examples with high VR scores are selected first to be labeled, DE is more likely to query examples that the model is already familiar with but has misclassified as confusing. To provide a qualitative comparison, Fig. 3 shows the map of class prediction probabilities in parameter space for the committee constructed from SE. The committee members tend to agree on samples with low VR values, whereas they disagree on those with high VR values, as expected. This observation underscores the ability of SE to capture parameter snapshots within the same minimum while generating diverse predictions on challenging examples with slight changes in parameter space. Such a phenomenon is closely related to recent advancements in mode connectivity (Garipov et al., 2018).
Number of snapshots S. One may increase the number of snapshots collected to better acquire examples to be labeled, since SE incurs no additional training cost. However, collecting more snapshots linearly increases the inference time for AL due to multiple forward passes. Overall, there is no significant difference in the performance gains from increasing the number of snapshots. Increasing the number of snapshots results in lower performance than random acquisition at the beginning, but performance tends to increase as episodes continue. Based on these findings, one might adopt a strategy that uses a small number of snapshots at first and then increases the number of snapshots in later episodes. All results used SE with the VR acquisition function and were averaged over two trials, including those with random acquisition. We used the same hyperparameter settings as described in Appendix B. Here, the total number of epochs N and the SE starting epoch N_thres are fixed to 200 and 150, respectively; the interval between snapshots differs accordingly.

SE learning rate.

In Algorithm 1 and Algorithm 2, the learning rate is adjusted by the scheduler η(·). We used a high constant learning rate while collecting snapshots, except for CIFAR-100, where, due to training instability, we maintained the same learning rate as during training. The high constant learning rate used when collecting snapshots contributes to diverse predictions and, consequently, to the success of SE in the AL context. The learning rate during training was fixed to 0.001. When the SE learning rate was too small (η = 0.0001), performance was inferior. However, a learning rate that is too high may cause the model to diverge and move to different modes or meaningless areas of the weight space, which surely degrades acquisition quality. We also include an earlier result on the CIFAR-100 dataset in Table 9, where we used η = 0.005 during SE. In our preliminary experiments, the VR acquisition function remained robust despite the decrease in the accuracy of the model's predictions, because it uses the count of agreeing members rather than predicted probabilities, whereas the other acquisition functions did not perform well. Here, the number of snapshots S and the SE starting epoch N_thres are fixed to 5 and 150, respectively, and the total number of epochs N to 200.
The starting point of SE, N_thres. We tried four different starting points of SE, or burn-in times, N_thres. This experiment shows how important it is to collect snapshots after the model has sufficiently converged, and why previous methods have failed. For example, Beluch et al. (2018) collected snapshots from the beginning of training (e.g., at epochs 40, 80, 120, 160, and 200). We showed that when snapshots were obtained before the model sufficiently converged (N_thres = 100, 125), AL performed even worse than random acquisition. Similarly, when the interval between snapshots was too small (N_thres = 175, and therefore an interval of 5), performance dropped in later episodes.
Here, the number of snapshots S and the SE learning rate were fixed to 5 and 0.01, respectively, and the total number of epochs N to 200.

A.3 SE WITH FINE-TUNING HYPERPARAMETERS

SE + FT algorithms are governed mainly by two hyperparameters: the regularization coefficient λ and the replay buffer size. Here, the replay buffer size was fixed at 2,500. Although regularization improves acquisition quality, no λ value stood out substantially across the episodes; therefore, in the experiments in Section 5, λ was fixed at 0.01.
Replay buffer size. For the data used in the fine-tuning process, some of the data labeled in previous episodes were added to the newly acquired data. For CIFAR-10, we used a budget size of 500 and added 2,000 examples randomly sampled from the previously labeled data. For both CIFAR-100 and Tiny ImageNet, we added 1,000 examples randomly sampled from the labeled data to the 1,000 newly acquired examples. This significantly reduced the training cost.
Similar to Fig. 2 (right), Fig. 5 shows the correlation between VR values and the predicted probabilities of the ground-truth class, along with a histogram of VR values. In the histogram, the percentages denote the test error for each bin, and light gray areas depict the number of examples that the committee predicted incorrectly. Fig. 5 clearly illustrates that DE tends to be overconfident. Even as the number of acquired data increases, the degree of overconfidence of DE remains severe. In contrast, VR scores calculated with SE correspond much better to the actual error rate. Here, we trained the model on the first {1000, 2000, 4000} CIFAR-10 examples in the train set.

B EXPERIMENTAL DETAILS B.1 BASELINE DESIGN

When conducting our experiments, we placed great importance on robustness, reproducibility, and generalizability. Our experiments on the CIFAR-10 and CIFAR-100 datasets yielded average test accuracies of 90.15% and 68.02%, respectively, over five trials. We achieved these results using the VR acquisition function and acquiring 30% of the labels (equivalent to 15,000 examples) under the settings outlined in Appendix B.3. In contrast, a recent survey by Munjal et al. (2022) reported baseline results of 90.87% and 59.36% test accuracy with 40% random samples. To facilitate reproducibility, we provide not only the code and configurations but also the indices of the queried examples. Reported runtimes are based on an Ubuntu 20.04 server with an AMD Ryzen 9 5900X CPU and 64GB RAM, and an NVIDIA RTX-3090 GPU with 24GB VRAM. For faster training on the Tiny ImageNet dataset, we additionally employed FP16 (Automatic Mixed Precision).

B.2 ACQUISITION QUALITY

To ensure a fair and objective comparison of acquisition quality, we devised an experiment in which the data selected by three different methods, namely SE, DE, and MCDO, were used to re-train both a single model and an ensembled model. By comparing the effectiveness of the selected data for learning, we aimed to provide a comprehensive evaluation of each method. In cases where multiple experiments were conducted, we carefully examined the results to ensure that they exhibited a similar trend across experiments. Due to limited resources, we randomly selected one experiment and reported the re-trained results obtained from it. We used a standard SGD optimizer with the following hyperparameters for both the CIFAR-10 and CIFAR-100 datasets: a base learning rate of 0.001, momentum of 0.9, and weight decay of 0.01. The mini-batch size was set to 64 for CIFAR-10 and 128 for CIFAR-100. During SE, we raised the learning rate to 0.01 for CIFAR-10 and dropped it to 0.0001 for CIFAR-100. In our preliminary experiments, we also tried increasing the learning rate to 0.005 during SE for CIFAR-100, but the results (shown in Table 9) were not as good as those obtained with the final settings: SE with VR outperformed DE and MCDO, but the other acquisition functions did not perform well with SE. Using a learning rate lower than the base learning rate can help collect snapshots yielding decent predictions when training itself is unstable, whereas a learning rate higher than the base learning rate worked well in the other hyperparameter settings and datasets. To speed up convergence and reduce the effort of finding optimal hyperparameters, we used the One Cycle learning rate scheduler (ONECYCLE) proposed by Smith and Topin (2019), setting max_lr to 0.01, for both datasets. For augmentation, we normalized images with the mean and variance of all images in the train set and applied random horizontal flips to both datasets.
Additionally, random cropping was applied to CIFAR-100. We trained the models for a total of 200 epochs, and for SE we collected five snapshots over an additional 50 epochs (at 10-epoch intervals). We also evaluated the effect of ONECYCLE on the performance of various acquisition functions for CIFAR-10 with ResNet-18 in Table 10. Without ONECYCLE, the absolute performance of all acquisition functions decreased, while their gains over random acquisition ranged from 2.2%p to 2.9%p. With ONECYCLE, the performance of all acquisition functions improved, but the gains over random acquisition shrank to a range of 1.1%p to 1.9%p. Although the differences from random sampling were lower with ONECYCLE than without, we chose to report the results with ONECYCLE, as they are more relevant for practical applications with limited labeled examples. The reported accuracies are averaged over five trials.

B.3 HYPERPARAMETERS

For the Tiny-ImageNet dataset with ResNet-18, we also used an SGD optimizer with momentum of 0.9 and weight decay of 0.0001. The learning rate was set to 0.1 for the first half of the training epochs, 0.01 until 75% of the training epochs, and 0.001 for the rest. During SE, the learning rate was increased to 0.05. For augmentation, we used random crops and random horizontal flips, the de-facto standard augmentation strategies in the literature. For the transfer learning experiments, we again used an SGD optimizer with momentum of 0.9 and weight decay of 0.0001. For ResNet-50 we used ONECYCLE (the same as above), and for ViT-base we used a constant learning rate without a scheduler. We also set Q to 10,000. Please see Table 8 for a summary of hyperparameters.
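The step schedule above (0.1 for the first half of training, 0.01 until 75%, and 0.001 afterwards) can be written as a small helper; the function name is illustrative.

```python
def step_lr(epoch, total_epochs, base_lr=0.1):
    """Step schedule used for Tiny-ImageNet: base_lr for the first half
    of training, base_lr/10 until 75% of the epochs, base_lr/100 after.
    """
    if epoch < 0.5 * total_epochs:
        return base_lr
    if epoch < 0.75 * total_epochs:
        return base_lr / 10
    return base_lr / 100
```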

B.4 MODEL STRUCTURES

For all experiments above, the structure of the ResNet-18 model is slightly modified to fit 32 × 32 and 64 × 64 images, following the standard protocol:
• The kernel size of the first convolution layer (conv1) is changed to 3, and the stride is changed to 1.
• The max pooling layer is disabled.
• For MCDO, a dropout layer with dropout rate p = 0.5 is added in front of the final linear classifier layer, since the original implementation has no dropout layers.
Similarly, the structure of the VGG-16 model is slightly modified as follows:
• A batch normalization layer is added after every convolution layer.
• The average pooling layer is disabled.
• No additional dropout layer is attached, since its classifier already has two dropout layers with dropout rate p = 0.5. We enabled these dropout layers when querying with MCDO.
These additional changes, including the MCDO layers, have no effect on any of the figures reported in Section 5, since we report acquisition quality by re-training models of identical structure on the examples queried by each method.



Algorithm 1: AL with SE
Input: Unlabeled dataset U; number of episodes T; number of acquisitions m, snapshots S, and SGD steps N per episode; acquisition function a; snapshot threshold steps Nthres; objective function J; learning rate schedule η.
Output: A classifier f(·; θ*).
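A minimal Python sketch of the AL-with-SE episode loop follows. Here `train_fn`, `predict_fn`, and `acquisition` are placeholders for the actual SGD training, snapshot forward passes, and disagreement scoring; all names are hypothetical, and real snapshots are parameter checkpoints taken along one trajectory rather than the stubs used here.

```python
def al_with_se(pool, T, m, S, predict_fn, acquisition, train_fn):
    """Skeleton of one AL-with-SE run.

    train_fn(labeled)       -- trains further on the labeled set and
                               returns a parameter snapshot (stub here).
    predict_fn(snapshot, x) -- hard prediction of a snapshot on example x.
    acquisition(preds)      -- disagreement score over the S predictions.
    """
    labeled = []
    for t in range(T):
        # One SGD trajectory per episode: keep training and record S snapshots.
        snapshots = [train_fn(labeled) for _ in range(S)]
        # Score every unlabeled example by its disagreement across snapshots.
        scores = {x: acquisition([predict_fn(s, x) for s in snapshots])
                  for x in pool}
        # Acquire the m highest-scoring examples and move them to the labeled set.
        for x in sorted(pool, key=lambda x: scores[x], reverse=True)[:m]:
            pool.remove(x)
            labeled.append(x)
    return labeled
```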

Figure 1: Results of vanilla AL (gray) and fine-tuning (FT) starting at specific episodes (the remaining colors). CIFAR-10, CIFAR-100, and Tiny ImageNet all use the ResNet-18 architecture. Note that the intermediate models show low accuracies before the final episode.

Fig. 2 (right) depicts the correlation between the VR values and the predicted probabilities of the ground-truth class, along with the distributions of VR scores. The test error rate for each bin is also shown.



Figure 5: Additional boxplots and histograms for DE (top) and SE (bottom), plotted in the same manner as Fig. 2 (right), showing the tendency as the number of labeled examples increases.

Fig. 4 shows the test accuracy map in the parameter space; red crosses indicate the parameters collected for the ensemble. Note that the two contours have different color scales. With a high learning rate, SE picks snapshots that are individually weak but collectively strong around the wider optimum, whereas each member of DE falls into its own narrow optimum.

To reduce the computational overhead of ensemble-based acquisition caused by multiple forward passes over a large unlabeled pool U, we first randomly draw Q unlabeled examples from the pool and measure their scores, following Beluch et al. (2018) and Yoo and Kweon (2019); we set Q to 10,000. The total number of training epochs is 100, and for SE we collected five snapshots over an additional 50 epochs (at 10-epoch intervals). For the transfer learning experiments on Tiny ImageNet, we used pretrained weights for ResNet-50 from Torchvision (Paszke et al., 2019) and for ViT-base-224 from PyTorch Image Models (Wightman, 2019), and replaced the final linear classification head. Instead of the original image size, images were scaled up to 224 × 224 resolution to match the models trained on the ImageNet dataset. The total number of training epochs is 50, and for SE we collected five snapshots over an additional 25 epochs (at 5-epoch intervals).
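The subsample-then-score step can be sketched as follows, using the variation ratio (one minus the frequency of the modal class across snapshots) as the acquisition score. The function name and defaults are illustrative; in practice `member_preds` would come from forward passes of the S snapshot models.

```python
import numpy as np

def variation_ratio_acquire(member_preds, m, Q=None, rng=None):
    """Select m pool indices by variation ratio over snapshot predictions.

    member_preds: (S, N) integer array of hard class predictions from S
    snapshots for N unlabeled examples. If Q is given, only a random
    subset of Q examples is scored, cutting the scoring cost on a large
    pool as in Beluch et al. (2018).
    """
    S, N = member_preds.shape
    rng = rng or np.random.default_rng(0)
    cand = rng.choice(N, size=min(Q, N) if Q else N, replace=False)
    preds = member_preds[:, cand]
    # Variation ratio: 1 - (count of the modal class) / S; a high value
    # means the snapshots disagree about the example.
    mode_counts = np.array([np.bincount(preds[:, j]).max()
                            for j in range(preds.shape[1])])
    vr = 1.0 - mode_counts / S
    return cand[np.argsort(-vr)[:m]]
```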

Algorithm 2: AL with SE + Fine-tuning (FT)
Input: Input for Algorithm 1 + regularization parameter λ.
Output: A classifier f(·; θ*).
Randomly draw m samples from U, remove them from U, and set them as Dtrain.
Randomly initialize θ0.
for t = 1, . . . , T do // Only at the final episode!

Gal et al. (2017) used BALD (Houlsby et al., 2011) on MCDO (Gal and Ghahramani, 2017) and later extended it to the batch setting, which considers overlaps among the data points to be acquired (Kirsch et al., 2019). However, due to the expensive cost of training multiple models, most research on ensemble-based AL had been restricted to traditional ML algorithms (Melville and Mooney, 2004; Körner and Wrobel, 2006). Beluch et al. (2018) compared the performance of various acquisition functions and uncertainty estimation methods on large-scale image classification tasks, showing that DE consistently outperforms other uncertainty-based methods such as MCDO (Gal et al., 2017) or a single model. In the same vein, Bayesian neural networks show greater robustness and reliability compared to DE and MCDO in the context of AL with continual learning (Rakesh and Jain, 2021).

Table 1 confirms that measuring uncertainty with SE acquires the samples leading to the best classification accuracies, regardless of the choice of acquisition function or how we train the final classifier. This is somewhat remarkable, considering that the runtime of SE is much shorter than that of DE.

Table 1: Test accuracy on CIFAR-10 according to the ratio of labeled examples.

Table 2: Test accuracy on CIFAR-100 according to the ratio of labeled examples. The reported means and standard deviations were averaged over five trials for re-training a single model and over one trial for DE.

Effectiveness of fine-tuning. Tables 1 to 3 compare AL with SE + FT (Algorithm 2) to the baselines. Somewhat surprisingly, SE + FT achieved comparable or even better test accuracies with much shorter runtimes. Fig. 1 shows the progression of test accuracies over the episodes of the SE + FT procedure. The intermediate models, which are only used for acquisition, exhibit lower accuracies, as expected, since they are fine-tuned and trained with fewer examples for fewer epochs. However, at the end of the final episode, where a new classifier is trained from scratch, the test accuracy catches up with vanilla AL, indicating that even if the accuracies of the intermediate classifiers are inferior, the acquired samples are good enough to obtain a decent classifier at the final episode. We set S = 5 for SE + FT, and the reported means and standard deviations were averaged over five trials for re-training a single model and over one trial for DE.

Table 3: Test accuracy on Tiny ImageNet according to the ratio of labeled examples.

al., 2018) and fast ensembling methods (Huang et al., 2017; Izmailov et al., 2018; Maddox et al., 2019), which reveal that wandering around wider optima leads to diverse yet more reliable predictions. Based on this empirical evidence, we conclude that SE is effective at discovering samples with high uncertainties that are predicted to be difficult for the classifier.

6 CONCLUSION

In this paper, we demonstrated that estimating the uncertainty of predictions using SE works efficiently for uncertainty-based AL. Through extensive experiments on real-world image classification benchmarks, we empirically confirmed that AL with SE outperforms AL with DE or MCDO for various choices of acquisition functions. We further presented a novel AL algorithm based on fine-tuning, where we keep a single model and continuously fine-tune it instead of re-initializing the models at the beginning of every episode. The resulting algorithm achieves classification accuracies comparable to the baselines given the same number of acquired samples, with far fewer training steps. We provided a detailed analysis of the effectiveness of SE for AL and showed that SE builds model committees yielding diverse predictions that are useful for acquiring informative samples.

REPRODUCIBILITY STATEMENT

We used the PyTorch (Paszke et al., 2019) library in our experiments and algorithms, which are described in Algorithm 1 and Algorithm 2. In addition, all experimental details and hyperparameter configurations are recorded in Appendix B. We will provide an open-source implementation of the AL environments and our code for the SE and SE + FT algorithms.

Comparison between SE and FT with VGG-16 on CIFAR-10. We also used the VGG-16 architecture for the fine-tuning experiments in Table 4; somewhat surprisingly, SE + FT catches up with SE.

Table 5: Test accuracy with different hyperparameter settings for SE.

Table 5 summarizes various SE hyperparameter settings according to the proportion of labeled examples. Here, we only compare the test accuracies of a single model trained with vanilla SGD on the CIFAR-10 dataset with the VR acquisition function. We additionally applied Stochastic Weight Averaging (SWA) (Izmailov et al., 2018) when training from scratch at the end, to be less sensitive to hyperparameter settings and to compare the quality of the queried examples more effectively.
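At its core, SWA simply averages parameters collected along the tail of training. A toy sketch over flat weight lists follows; a real implementation averages each parameter tensor of the network and then recomputes batch-normalization statistics on the train set (e.g., via `torch.optim.swa_utils` in PyTorch).

```python
def swa_average(weight_snapshots):
    """Average parameter snapshots, as in SWA (Izmailov et al., 2018).

    Weights are flat lists of floats here purely for illustration; each
    position is averaged across the collected snapshots.
    """
    n = len(weight_snapshots)
    return [sum(w) / n for w in zip(*weight_snapshots)]
```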

Table 6: Test accuracy with different regularization hyperparameters λ for FT.

Regularization hyperparameter λ. The regularization hyperparameter λ controls the balance between maintaining a single trajectory from the previous episode and adapting the model to newly acquired samples. It is crucial to find appropriate λ values for the ℓ2-regularizer ∥θ − θ0∥². Table 6 shows test accuracies at episodes 10, 15, 20, 25, and 30 (with 5K, 7.5K, 10K, 12.5K, and 15K labeled examples, respectively) for different λ values on the CIFAR-10 dataset.
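The fine-tuning objective J(θ) + λ∥θ − θ0∥² and its gradient can be sketched as follows, with the data term left as a placeholder; the function name and the `data_grad_fn` callback are illustrative.

```python
import numpy as np

def ft_loss_and_grad(theta, theta_prev, data_grad_fn, lam):
    """Regularized fine-tuning objective J(theta) + lam * ||theta - theta_prev||^2.

    data_grad_fn(theta) returns (loss, grad) of the task loss J on the
    newly acquired samples; it is a stub here. theta_prev is the model
    from the previous episode, so lam balances staying on the previous
    trajectory against adapting to the new samples.
    """
    loss, grad = data_grad_fn(theta)
    diff = theta - theta_prev
    # Add the squared-distance penalty and its gradient 2*lam*(theta - theta_prev).
    return loss + lam * diff @ diff, grad + 2.0 * lam * diff
```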

Table 7: Test accuracy at episode 30 according to the replay buffer size on CIFAR-10.

Table 7 shows the test accuracies of the final model (trained from scratch) with 15,000 labeled examples according to the size of the replay buffer in each episode.

Figure 4: Loss surface on the test set of SE (left) and DE with the same initialization (right), visualized in the parameter space, when trained with the first 2,000 examples of the CIFAR-10 dataset. Contours represent test accuracy, and red points denote the weights gathered for AL.

Table 8: Summary of hyperparameters.

Table 9: Previous results on CIFAR-100.

Table 10: Performance gain with the ONECYCLE scheduler with 10,000 labeled examples on CIFAR-10. The differences from random acquisition are shown in parentheses.

ACKNOWLEDGEMENT

This work was partly supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2022-0-00184, Development and Study of AI Technologies to Inexpensively Conform to Evolving Policy on Ethics, and No.2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)), and National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2021M3E5D9025030).

AVAILABILITY

https://github.com/nannullna/snapshot

