SAAL: SHARPNESS-AWARE ACTIVE LEARNING

Abstract

While modern deep neural networks play significant roles in many research areas, they are prone to overfitting when only a limited number of data instances is available. This overfitting, or generalization issue, is particularly problematic in active learning, which selects only a few data instances for labeling over time. To address generalization, this paper introduces the first active learning method that incorporates the sharpness of the loss landscape into the design of the acquisition function, inspired by sharpness-aware minimization (SAM). SAM minimizes the loss under the worst-case perturbation of the model parameters, so that the optimization is led to flat minima, which are known to generalize better. Specifically, our method, Sharpness-Aware Active Learning (SAAL), constructs its acquisition function by selecting the unlabeled instances whose perturbed loss is maximal. In adapting SAM into SAAL, we design a pseudo-labeling mechanism to anticipate the perturbed loss w.r.t. the ground-truth label. Furthermore, we present a theoretical analysis that relates SAAL to recent active learning methods, so that the recent works can be reduced to SAAL under a specific condition. We conduct experiments on various benchmark datasets for vision-based tasks in image classification and object detection. The experimental results confirm that SAAL outperforms the baselines by selecting instances that have the potentially maximal perturbation on the loss.

1. INTRODUCTION

Deep learning is now widely used in many research areas, such as computer vision, natural language processing, and recommender systems, but its success depends heavily on large-scale labeled datasets for training deep neural networks. The importance of the dataset is related to the generalization issue in deep learning: a model learned from the training dataset suffers a degradation in performance when it encounters an unseen test dataset at deployment. This degradation arises because neural networks are prone to overfitting when the training dataset is insufficient (Keskar et al., 2016; Neyshabur et al., 2017; Kawaguchi et al., 2017). The dependency on the dataset also motivates adaptive data selection by acquisition functions, or active learning, which aims at the efficient use of a limited annotation budget from an oracle (Cohn et al., 1996; Tong, 2001; Settles, 2009). Various methods for active learning have been proposed recently, but a model trained with a small number of adaptively selected data is often difficult to generalize (Dasgupta & Hsu, 2008). Some prior works deal with the generalization issue in active learning, but they solve the problem either by proposing a new risk function (Farquhar et al., 2020) or by adopting a new classifier network (Wan et al., 2021), rather than by designing a new acquisition function that considers generalization.

In this paper, we propose a new active learning algorithm, named Sharpness-Aware Active Learning (SAAL), that connects active learning and generalization ability in the construction of the acquisition function. Specifically, we are inspired by Sharpness-Aware Minimization, or SAM (Foret et al., 2020), which minimizes the maximally perturbed loss of the training dataset and thereby minimizes the loss sharpness as well as the task loss itself. Such optimization leads to flat minima of the loss landscape, which are shown to correlate strongly with generalization performance (Jiang et al., 2019). Hence, SAAL adopts the maximally perturbed loss as the acquisition score.

When calculating the acquisition score for SAAL, we cannot observe the labels of the unlabeled instances, so the perturbed loss cannot be computed directly. To overcome this challenge, we utilize pseudo labels predicted by the current model, and we theoretically show that our proposed pseudo labeling conservatively estimates the maximally perturbed loss w.r.t. the ground-truth label. We also theoretically derive an upper bound of the acquisition score of SAAL, which includes the loss, the norm of the gradient, and the first eigenvalue of the loss Hessian. Among the three terms of the upper bound, the loss and gradient terms are widely used metrics for active learning that capture the model change caused by acquiring the instance (Yoo & Kweon, 2019; Ash et al., 2020; Settles et al., 2007). Meanwhile, the first eigenvalue, which is newly considered by SAAL, is connected to the loss sharpness (Keskar et al., 2017). Therefore, the instances selected by SAAL might contribute to the generalization of the model.

We summarize our contributions in three points. First, we propose Sharpness-Aware Active Learning (SAAL), which considers the loss sharpness when constructing the acquisition function. The loss sharpness is related to the generalization of the model, so selecting instances with a high value of loss sharpness might lead to a model with better generalization performance. Second, we theoretically derive an upper bound of the acquisition score of SAAL and show its connection with recent active learning methods. Specifically, we find that the upper bound also contains the first eigenvalue of the loss Hessian, which is related to the generalization ability. Third, we empirically show that SAAL outperforms the baselines in various vision-based tasks on benchmark datasets.

2. PRELIMINARIES

2.1. NOTATIONS

Throughout this paper, we assume a classification problem, and we represent our current deep learning model parameterized by $\theta$ as $f_\theta : \mathbb{R}^d \to \mathbb{R}^{|Y|}$, where $d$ is the dimension of a data instance, $x$, and $Y$ is the set of candidate classes that $x$ can have. There are two datasets: a labeled dataset, $X_L$, and an unlabeled dataset, $X_U$. We denote the acquisition function of active learning as $f_{acq} : \mathbb{R}^d \to \mathbb{R}$, where $f_{acq}$ receives a data instance as input and calculates the informativeness, or the acquisition score, of the instance as output. The loss of a data instance, $x$, w.r.t. the given label $y$ is represented as $l(x, y; \theta) := l_{CE}(\sigma(f_\theta(x)), y)$, where $\sigma(\cdot)$ is the softmax function. The total loss of a dataset, $S$, is represented as $L_S(\theta) = \frac{1}{N} \sum_{i=1}^{N} l(x_i, y_i; \theta)$, where $S = \{(x_i, y_i) \mid i = 1, ..., N\}$. Lastly, we define the pseudo label, $\hat{y} = \arg\max_{j \in Y} \sigma(f_\theta(x))_j$, and we denote the ground-truth label as $\bar{y}$.

2.2. ACTIVE LEARNING

There are several active learning scenarios that differ in the setting of data accessibility, including membership-query synthesis (Angluin, 1988; 2004), stream-based active learning (Atlas et al., 1989; Cohn et al., 1994), and pool-based active learning (Lewis & Gale, 1994). In this paper, we focus on pool-based active learning, where the unlabeled data instances are provided as a large data pool, and the active learner sequentially selects the informative instances by a certain criterion. Pool-based active learning is categorized by the definition of informativeness into uncertainty-based, diversity-based, and hybrid methods.

Uncertainty-based active learning adopts the acquisition function, $f_{acq}$, to calculate the uncertainty of each unlabeled instance with regard to the current deep learning model, and an oracle provides the ground-truth labels of the selected unlabeled instances with the highest uncertainty. Since the acquisition score is usually calculated for an unlabeled instance, $x_u \in X_U$, w.r.t. the current model, $f_\theta$, it is written as $f_{acq}(x_u; f_\theta)$, resulting in the selection rule below.

$$X_S = \arg\max_{X'_S \subset X_U} \sum_{x_u \in X'_S} f_{acq}(x_u; f_\theta) \qquad (1)$$

Entropy, which is denoted as $f^{Ent}_{acq}(x_u; f_\theta) = H[f_\theta(x_u)] = -\sum_j \sigma(f_\theta(x_u))_j \log_2 \sigma(f_\theta(x_u))_j$, and variation ratio, which is denoted as $f^{Var}_{acq} = 1 - \max_j \sigma(f_\theta(x_u))_j$, are the most widely used methods for calculating uncertainty (Shannon, 1948; Freeman, 1965). More recently, auxiliary networks are used to approximate the uncertainty of each instance. Learning Loss for Active Learning (LL4AL) (Yoo & Kweon, 2019) trains a loss prediction module, $f_{LPM}$, which takes the hidden feature maps as input and predicts the expected loss as output. Then, LL4AL constructs the acquisition function $f^{LL4AL}_{acq}(x_u) = f_{LPM}(f^k_\theta(x_u)|_{k=1,...,K})$, where $f^k_\theta$ is the $k$-th hidden feature map. Variational Adversarial Active Learning (VAAL) (Sinha et al., 2019) trains a discriminator, $f_{dis}$, which takes a data instance as input and discriminates whether the instance belongs to the labeled dataset or the unlabeled dataset. Then, VAAL uses the probability of $x_u$ belonging to the unlabeled dataset, $X_U$, as the acquisition score, i.e., $f^{VAAL}_{acq}(x_u) = f_{dis}(x_u)$.

Diversity-based active learning, such as the Coreset approach (Sener & Savarese, 2018), selects instances that represent the whole distribution of unlabeled instances by solving a mixed integer program. To make use of both uncertainty and diversity, hybrid active learning has been proposed to select uncertain instances in a diverse way. In BADGE (Ash et al., 2020), the acquisition function is calculated as the gradient embedding of $x_u$ w.r.t. the parameters of the last fully connected layer, $\theta_{out}$, that is, $f^{BADGE}_{acq}(x_u) = \frac{\partial}{\partial \theta_{out}} l(x_u, \hat{y}_u; \theta)$, where $\hat{y}_u$ is the pseudo label of $x_u$. Then, this embedding becomes the input to the k-means++ seeding algorithm (Arthur & Vassilvitskii, 2006).

Recently, pool-based active learning has been extended to deal with certain problematic scenarios. Such extensions include using random round-robin sampling to efficiently apply active learning in the large-batch setting (Citovsky et al., 2021), actively selecting test samples to query (Kossen et al., 2021), and improving conventional active learning algorithms for high-dimensional observational data (Jesson et al., 2021). Compared to those studies, our proposed SAAL deals with the conventional scenario and focuses on how the proposed acquisition function selects the informative instances.
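The entropy and variation-ratio scores described in this section, together with the selection rule of Eq. 1, can be sketched in a few lines. This is only an illustration: the helper names and the toy logits are hypothetical, and the explicit budget `b` (the number of instances selected per round) is an assumption that Eq. 1 does not spell out.

```python
import numpy as np

def softmax(logits):
    # Row-wise softmax with the usual max-shift for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy_score(probs):
    # Shannon entropy in bits; clipping avoids log2(0).
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log2(p)).sum(axis=1)

def variation_ratio_score(probs):
    # One minus the probability of the most likely class.
    return 1.0 - probs.max(axis=1)

def select_top_b(scores, b):
    # With additive scores, the argmax over subsets of size b is the top-b set.
    return np.argsort(-scores)[:b]

# Toy logits for three unlabeled instances and three classes (made-up values).
logits = np.array([[4.0, 0.0, 0.0],
                   [1.0, 1.0, 1.0],
                   [2.0, 1.9, 0.0]])
probs = softmax(logits)
picked = select_top_b(entropy_score(probs), b=1)
```

The second instance has a uniform prediction, so both scores rank it as the most uncertain one.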

2.3. SHARPNESS-AWARE MINIMIZATION (SAM)

As a research direction independent from active learning, there is increasing investigation into the flatness or sharpness of loss surfaces and the corresponding optimization. Flat minima of the loss landscape are analyzed in various research areas and are confirmed to have a deep connection to the generalization of neural networks. A recent broad study on various measures of generalization has confirmed that sharpness-based measures have the strongest correlation with generalization (Jiang et al., 2019). Hence, flat minima are utilized in various research areas where generalization is important, such as domain generalization (Cha et al., 2021), adversarial robustness (Stutz et al., 2021), and domain adversarial training (Rangwani et al., 2022).

In this context, Sharpness-Aware Minimization (SAM) is an optimizer for training deep neural networks (Foret et al., 2020) that favors flat minima. Denoting the loss on the dataset $S$ w.r.t. the current parameter $\theta$ as $L_S(\theta)$, the optimization objective of SAM is to minimize the maximally perturbed loss with regularization on the parameter, as below.

$$\min_\theta \max_{\|\epsilon\| \le \rho} L_S(\theta + \epsilon) + \gamma \|\theta\|_2^2 \qquad (2)$$

Here, $\gamma$ is a hyperparameter that controls the strength of the regularization, $\epsilon$ is the perturbation to the parameter, and $\rho$ defines the size of the perturbation. The maximally perturbed loss can be decomposed as $\max_{\|\epsilon\| \le \rho} L_S(\theta + \epsilon) = (\max_{\|\epsilon\| \le \rho} L_S(\theta + \epsilon) - L_S(\theta)) + L_S(\theta)$, which is interpreted as the sharpness term (first term of the RHS) and the classification loss term (second term of the RHS). Hence, SAM minimizes the sharpness of the loss as well as the classification loss value itself; that is, SAM seeks flat minima among the local minima. This optimization is a min-max problem, which first solves $\max_{\|\epsilon\| \le \rho} L_S(\theta + \epsilon)$ and then solves $\min_\theta \max_{\|\epsilon\| \le \rho} L_S(\theta + \epsilon)$. The inner maximization problem is solved by finding $\epsilon^* = \arg\max_{\|\epsilon\| \le \rho} L_S(\theta + \epsilon)$.
By taking the first-order Taylor expansion of $L_S(\theta + \epsilon)$ w.r.t. $\epsilon$ around 0 and by introducing a dual norm problem, $\epsilon^*$ is approximated as follows, with $\frac{1}{p} + \frac{1}{q} = 1$.

$$\epsilon^* \approx \rho \cdot \mathrm{sign}(\nabla_\theta L_S(\theta)) \frac{|\nabla_\theta L_S(\theta)|^{q-1}}{(\|\nabla_\theta L_S(\theta)\|_q^q)^{1/p}} \qquad (3)$$

After solving the inner maximization using $\epsilon^*$, the minimization problem is solved by obtaining the gradient while excluding the Hessian term, as below.

$$\nabla_\theta \max_{\|\epsilon\| \le \rho} L_S(\theta + \epsilon) \approx \nabla_\theta L_S(\theta)|_{\theta + \epsilon^*} \qquad (4)$$

3. METHOD

3.1. MOTIVATION

According to SAM (Foret et al., 2020), the loss of the population dataset, $D$, is upper bounded by the maximally perturbed loss of the training dataset, $X$. From the perspective of active learning, the training dataset is decomposed into the labeled dataset, $X_L$, and the unlabeled dataset, $X_U$, i.e., $X = X_L \cup X_U$. Hence, the upper bound can be decomposed as below, with $\pi_L = \frac{|X_L|}{|X|}$ and $\pi_U = \frac{|X_U|}{|X|}$.

$$L_D(\theta) \le \max_{\|\epsilon\| \le \rho} L_X(\theta + \epsilon) \le \pi_L \cdot \max_{\|\epsilon\| \le \rho} L_{X_L}(\theta + \epsilon) + \pi_U \cdot \max_{\|\epsilon\| \le \rho} L_{X_U}(\theta + \epsilon) =: L^{SAAL}_X \qquad (5)$$

Since the population loss, $L_D(\theta)$, is never accessible, we instead work with the rightmost upper bound in Eq. 5, denoted as $L^{SAAL}_X$, and train our network to minimize this upper bound. Of the two terms of $L^{SAAL}_X$, the first term, $\pi_L \cdot \max_{\|\epsilon\| \le \rho} L_{X_L}(\theta + \epsilon)$, is minimized if we use the SAM optimizer. The remaining second term, $\pi_U \cdot \max_{\|\epsilon\| \le \rho} L_{X_U}(\theta + \epsilon)$, then becomes the key component for our optimization in the sharpness-aware active learning scenario. During the active learning iterations, suppose that we select unlabeled instances, $x_u \in X_U$, with high values of the maximally perturbed loss. In other words, we query the labels of such unlabeled instances, so that the remaining $X_U$ consists of unlabeled instances whose maximally perturbed loss is small. This leads to minimizing the two terms of $L^{SAAL}_X$ simultaneously, which contributes to minimizing the generalization error, $L_D(\theta)$.

Comparison to Semi-Supervised Learning. Our proposed active learning algorithm is not the only way to decrease the loss of the unlabeled dataset, $X_U$, because traditional semi-supervised learning (SSL) also utilizes $L_{X_U}(\theta)$ while training the model. However, it should be noted that SSL does not guarantee the minimization of the rightmost upper bound, $L^{SAAL}_X$. SSL minimizes the average loss of the unlabeled dataset instead of the maximally perturbed loss. Hence, it is hard to guarantee that SSL will contribute to minimizing the generalization error without prior knowledge of the label distribution (Ben-David et al., 2008). We can categorize SSL approaches into three groups (Berthelot et al., 2019; Zhu, 2005): consistency regularization (Laine & Aila, 2016; Sajjadi et al., 2016), entropy minimization (Cireşan et al., 2010; Lee et al., 2013), and traditional regularization, such as weight decay (Zhang et al., 2018a; b). First, consistency regularization and entropy minimization depend completely on the pseudo labels, and an incorrect pseudo label might increase the generalization error. Second, the worst-case or hardest instances might have incorrect pseudo labels. In other words, SSL, which trains the model with incorrect pseudo labels, might fail to model the maximally perturbed loss. Third, the minimization of the maximally perturbed loss is independent of the previous semi-supervised learning methods, such as traditional regularization as well as consistency regularization and entropy minimization. These methods are therefore potentially compatible with our active learning algorithm.
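Returning to Eq. 5, its second inequality can be checked in two steps. The sketch below assumes that $L_X$ is the instance-averaged loss defined in Section 2.1 and that $X_L$ and $X_U$ are disjoint; it uses only the fact that the maximum of a sum is at most the sum of the maxima.

```latex
\begin{aligned}
L_X(\theta+\epsilon)
  &= \pi_L \, L_{X_L}(\theta+\epsilon) + \pi_U \, L_{X_U}(\theta+\epsilon), \\
\max_{\|\epsilon\|\le\rho} L_X(\theta+\epsilon)
  &\le \pi_L \max_{\|\epsilon\|\le\rho} L_{X_L}(\theta+\epsilon)
     + \pi_U \max_{\|\epsilon\|\le\rho} L_{X_U}(\theta+\epsilon).
\end{aligned}
```

The inequality is tight only when a single perturbation maximizes both terms at once, which is one way to see why lowering the worst-case loss on $X_U$ also lowers the bound on the whole dataset.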

3.2. SHARPNESS-AWARE ACTIVE LEARNING

Motivated by Sharpness-Aware Minimization (SAM), our active learning algorithm selects instances with a high perturbed loss under some perturbation of the model parameters, $\theta$. Hence, our acquisition function is as follows:

$$f^{SAAL}_{acq}(x_u; f_\theta) = \max_{\|\epsilon\| \le \rho} l(x_u, \hat{y}_u; \theta + \epsilon), \qquad (6)$$

where $l$ is the cross-entropy loss function for the model, and $\theta$ is the current parameter of the model. Algorithm 1 describes the overall process of our Sharpness-Aware Active Learning.

Algorithm 1 Sharpness-Aware Active Learning
Input: Labeled dataset $X^0_L$, unlabeled dataset $X^0_U$, classifier $f_\theta$
1: Initially train $f_\theta$ by the cross-entropy loss of $X^0_L$
2: for $j = 0, 1, 2, ...$ do    ▷ active learning
3:     Randomly sample $X^{pool}_U \subset X^j_U$
4:     for $x_u \in X^{pool}_U$ do
5:         Calculate $f^{SAAL}_{acq}(x_u; f_\theta)$ as in Eq. 6
6:     end for
7:     Select $X_S = \arg\max_{X'_S \subset X^{pool}_U} \sum_{x_u \in X'_S} f^{SAAL}_{acq}(x_u; f_\theta)$
8:     Query the labels of $X_S$ from the oracle
9:     Update the labeled dataset, $X^{j+1}_L = X^j_L \cup X_S$
10:    Update the unlabeled dataset, $X^{j+1}_U = X^j_U \setminus X_S$
11:    Train $f_\theta$ by the cross-entropy loss of $X^{j+1}_L$
12: end for

Since our acquisition function is calculated for unlabeled instances, a problem arises when calculating the maximally perturbed loss, which requires a label. Hence, we use a pseudo label, $\hat{y}_u$, for the loss calculation. To justify the use of the pseudo label, we first provide Theorem 3.1, which explains the relation between the maximally perturbed losses calculated with the pseudo label and with the ground-truth label, respectively.

Theorem 3.1. (Proof in Appendix A.2.1) For a data instance $x$, let $\hat{y}$ be the pseudo label predicted by the network $f_\theta$ and $\bar{y}$ be the ground-truth label. Then, the maximally perturbed loss calculated with $(x, \hat{y})$ is a lower bound of the maximally perturbed loss calculated with $(x, \bar{y})$, up to a non-negative margin, $\delta_x$, as below:

$$\max_{\|\epsilon\| \le \rho} l(x, \hat{y}; \theta + \epsilon) \le \max_{\|\epsilon\| \le \rho} l(x, \bar{y}; \theta + \epsilon) + \delta_x. \qquad (7)$$

Next, Proposition 3.2 shows that the inequality in Eq. 7 has zero margin under a mild condition.

Proposition 3.2. (Proof in Appendix A.2.2) For a data instance $x$ and the corresponding pseudo label $\hat{y}$, let $\varepsilon$ be the maximal perturbation over the parameters w.r.t. the loss $l(x, \hat{y}; \theta + \epsilon)$. If the perturbed network, $f_{\theta+\varepsilon}$, predicts the same label as the original network, $f_\theta$, then the maximally perturbed loss calculated with $(x, \hat{y})$ is a lower bound of the maximally perturbed loss calculated with $(x, \bar{y})$, as below:

$$\max_{\|\epsilon\| \le \rho} l(x, \hat{y}; \theta + \epsilon) \le \max_{\|\epsilon\| \le \rho} l(x, \bar{y}; \theta + \epsilon). \qquad (8)$$

Theorem 3.1 and Proposition 3.2 imply that the instances selected by the acquisition score of SAAL with pseudo labels would also have high scores w.r.t. the ground-truth labels. This indicates that we conservatively estimate the maximally perturbed loss for the acquisition score.
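A minimal sketch of how the acquisition score of Eq. 6 might be evaluated. It rests on several assumptions that are not from the paper: a linear softmax classifier stands in for the deep network, the inner maximization is approximated by a single first-order ascent step in the $\ell_2$ case ($p = q = 2$) of Eq. 3, and the helper names `saal_score` and `select_top_k` are hypothetical rather than taken from the authors' code.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(W, x, y):
    # Loss of a linear softmax classifier f(x) = W @ x for label y.
    return -np.log(softmax(W @ x)[y])

def saal_score(W, x, rho=0.05):
    # Approximates max_{||eps||_2 <= rho} l(x, y_hat; W + eps) with one ascent step.
    p = softmax(W @ x)
    y_hat = int(np.argmax(p))              # pseudo label from the current model
    onehot = np.zeros_like(p)
    onehot[y_hat] = 1.0
    grad = np.outer(p - onehot, x)         # gradient of the loss w.r.t. W
    eps = rho * grad / (np.linalg.norm(grad) + 1e-12)  # Eq. 3 with p = q = 2
    return cross_entropy(W + eps, x, y_hat)

def select_top_k(W, pool, k, rho=0.05):
    # Score each pooled instance separately and keep the k largest scores.
    scores = np.array([saal_score(W, x, rho) for x in pool])
    return np.argsort(-scores)[:k], scores

# Toy setup: 3 classes, 5 features, 20 pooled instances (all values made up).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))
pool = rng.normal(size=(20, 5))
chosen, scores = select_top_k(W, pool, k=4)
```

Because the perturbation is computed per instance rather than per batch, the cost grows with the pool size, which is in line with the timing discussion in the experiments.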

3.3. CONNECTION TO RECENT ACTIVE LEARNING ALGORITHMS

Here, we theoretically derive an upper bound of the acquisition score of SAAL and show its connection to recent active learning algorithms as well as to the generalization ability. To begin with, we provide Theorem 3.3 below.

Theorem 3.3. (Proof in Appendix A.2.3) The acquisition function, $f^{SAAL}_{acq}$, of Eq. 6 is upper bounded by $l(\theta) + \rho \|\nabla_\theta l(\theta)\|_2 + \frac{1}{2} \rho^2 \lambda_1 + \max_{\|v\| \le 1} O(\rho^2 v^3)$, where $l(\theta)$ abbreviates the loss of a data pair, $(x, y)$, and $\lambda_1$ is the first eigenvalue of the loss Hessian.

Theorem 3.3 gives an upper bound of the acquisition score of SAAL, which consists of the task loss, the gradient norm, and the first eigenvalue of the loss Hessian. Since we select instances that have a high value of $f^{SAAL}_{acq}$, we are also selecting instances that have high values of the loss, $l(\theta)$, and of the magnitude of the gradient embedding, $\|\nabla_\theta l(\theta)\|_2$, which are connected to LL4AL (Yoo & Kweon, 2019) and BADGE (Ash et al., 2020), respectively. Furthermore, SAAL considers the first eigenvalue of the loss Hessian w.r.t. the current model, denoted as $\lambda_1$. The importance of the first eigenvalue for generalization is widely studied; that is, the first eigenvalue is used as an indicator of the sharpness of the loss surface (Keskar et al., 2017; Zhuang et al., 2022; Kaur et al., 2022). Hence, the instances selected by SAAL might contribute to the generalization of the model.

Figure 1a shows that there exists a positive correlation between our acquisition score, $f^{SAAL}_{acq}$, and the three terms of the upper bound. It should be noted that the upper bound is not our optimization objective, because we use the acquisition score only for ranking the instances and selecting those with the top scores. Hence, what is of interest is the correlation between the acquisition score and the upper bound. Accordingly, by selecting the instances with a high acquisition score of SAAL, $f^{SAAL}_{acq}$, we are selecting instances that have high values of the loss, the gradient norm, and the first eigenvalue.

Figure 1b shows the values of the three terms of the upper bound. Interestingly, as the acquisition iterations proceed, not only the loss and the gradient norm but also the first eigenvalue becomes smaller. The change in the first eigenvalue is more noticeable in Figure 1c, which plots the value of $\lambda_1$ without the scaling term of $\frac{1}{2} \rho^2$. This indicates that SAAL leads the model to flat minima, which results in better generalization performance.
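The three leading terms of Theorem 3.3 can be checked numerically on a toy problem. The sketch below assumes a hypothetical quadratic loss with a positive semi-definite Hessian, for which the higher-order remainder vanishes, and it takes the first eigenvalue $\lambda_1$ to mean the largest eigenvalue; all numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, rho = 6, 0.05

# Hypothetical quadratic loss: l(theta + eps) = l0 + g.eps + 0.5 * eps^T H eps.
l0 = 1.3
g = rng.normal(size=dim)
A = rng.normal(size=(dim, dim))
H = A @ A.T                           # symmetric positive semi-definite Hessian
lam1 = np.linalg.eigvalsh(H)[-1]      # eigvalsh sorts ascending, so this is the largest

def perturbed_loss(eps):
    return l0 + g @ eps + 0.5 * eps @ H @ eps

# Bound of Theorem 3.3 without the higher-order term.
bound = l0 + rho * np.linalg.norm(g) + 0.5 * rho**2 * lam1

# The loss is convex here, so the maximum over the ball lies on the sphere ||eps||_2 = rho.
dirs = rng.normal(size=(10000, dim))
eps_samples = rho * dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
worst_sampled = max(perturbed_loss(e) for e in eps_samples)
```

Every sampled perturbed loss stays below `bound`, since the linear term is at most $\rho \|g\|_2$ by the Cauchy-Schwarz inequality and the quadratic term is at most $\frac{1}{2} \rho^2 \lambda_1$.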

4. EXPERIMENTS AND RESULTS

We examine the performance of SAAL in two vision-based tasks: image classification and object detection. Specifically, for image classification, we quantitatively compare the test accuracies of various active learning algorithms and qualitatively analyze the performance of SAAL by examining the upper bound of the population loss. We also analyze the effect of ρ, which defines the size of the perturbation, ϵ.

4.1. IMAGE CLASSIFICATION

Experiment Setting. We conduct our experiments on the Fashion-MNIST (Fashion) (Xiao et al., 2017), SVHN (Netzer et al., 2011), CIFAR-10, and CIFAR-100 (Krizhevsky et al., 2009) datasets. We adopt the ResNet-18 (He et al., 2016) network as our classifier, and we train the network for 50 epochs after each acquisition step, using either the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 1e-3, or the SAM optimizer (Foret et al., 2020) with a learning rate of 1e-3 for Fashion, SVHN, and CIFAR-10 and 1e-1 for CIFAR-100. To simulate a scenario with poor generalization, we follow the settings of prior work (Kim et al., 2021), which assumes a very low labeling budget. For Fashion, SVHN, and CIFAR-10, we construct the initial labeled dataset with 20 instances, which are random but class-balanced, and at each iteration we select the 10 instances with the highest acquisition scores among 2,000 randomly selected unlabeled instances. For CIFAR-100, the initial labeled dataset consists of 1,000 instances, and we select 100 additional instances at each iteration. In all cases, we repeat the acquisition for 100 iterations, and we report the results over three repeated trials. SAAL introduces the size, ρ, of the perturbation, ϵ, in Eq. 6, and we set ρ to 0.05 for all the datasets.

Baselines. We compare the performance of SAAL with Random, Entropy (Shannon, 1948), Coreset (Sener & Savarese, 2018), Learning Loss for Active Learning (LL4AL) (Yoo & Kweon, 2019), Variational Adversarial Active Learning (VAAL) (Sinha et al., 2019), and BADGE (Ash et al., 2020). Since the algorithm most relevant to ours, BADGE, adopts the k-means++ seeding algorithm to introduce diversity into the acquisition, we also provide experimental results with diversity, following the same practice as BADGE. Specifically, after calculating our acquisition function using Eq. 6, we run the k-means++ seeding algorithm with the acquisition score as input.
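As an illustration of the diversity step mentioned for the baselines, the sketch below runs the standard k-means++ seeding rule (Arthur & Vassilvitskii, 2006) on a vector of acquisition scores. Treating each scalar score as a one-dimensional point is an assumption made here for illustration, the function name `kmeanspp_seed` is hypothetical, and the scores are random stand-ins rather than outputs of Eq. 6.

```python
import numpy as np

def kmeanspp_seed(points, k, rng):
    # points: (n,) scalar scores or (n, d) embeddings; returns k selected indices.
    points = np.asarray(points, dtype=float)
    if points.ndim == 1:
        points = points[:, None]
    n = len(points)
    chosen = [int(rng.integers(n))]                     # first seed chosen uniformly at random
    d2 = np.sum((points - points[chosen[0]]) ** 2, axis=1)
    while len(chosen) < k:
        probs = d2 / d2.sum()                           # sample proportionally to squared distance
        idx = int(rng.choice(n, p=probs))
        chosen.append(idx)
        # Keep, for every point, the squared distance to its nearest seed.
        d2 = np.minimum(d2, np.sum((points - points[idx]) ** 2, axis=1))
    return chosen

rng = np.random.default_rng(0)
scores = rng.random(200)          # stand-in for acquisition scores of a 200-instance pool
seeds = kmeanspp_seed(scores, k=10, rng=rng)
```

Already selected points have zero distance to their nearest seed and therefore zero sampling probability, so the returned indices are distinct as long as the pool contains at least k distinct values.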
Quantitative Analysis. Table 1 indicates that SAAL outperforms the baselines in seven out of eight combinations of experiments. The advantage of SAAL is most evident when we use the Adam optimizer rather than the SAM optimizer. We conjecture that this gain for the Adam optimizer originates from Eq. 5, which motivates SAAL to model the expected flat local minima after acquisitions. Recall that our inaccessible goal, $L_D(\theta)$, is upper bounded by $\pi_L \cdot \max_{\|\epsilon\| \le \rho} L_{X_L}(\theta + \epsilon) + \pi_U \cdot \max_{\|\epsilon\| \le \rho} L_{X_U}(\theta + \epsilon)$, as discussed in Section 3.1. With the Adam optimizer, the first term, $\max_{\|\epsilon\| \le \rho} L_{X_L}(\theta + \epsilon)$, is only weakly optimized compared to the SAM optimizer, because the SAM optimizer directly minimizes $\max_{\|\epsilon\| \le \rho} L_{X_L}(\theta + \epsilon)$; the qualitative analysis below supports this point. Hence, the second term in the upper bound, $\max_{\|\epsilon\| \le \rho} L_{X_U}(\theta + \epsilon)$, becomes more significant for the Adam optimizer. We also provide the test accuracy along the acquisition iterations in Figure 6 of Appendix A.1. The figures show that SAAL achieves higher accuracy more quickly than the baselines in most cases; see Figures 6a, 6d, and 6g.

Table 1: Comparison of test accuracy (%) using the Adam optimizer and the SAM optimizer. The best performance is indicated in boldface. Each experiment is repeated three times.

Method    Adam optimizer                                      SAM optimizer
          Fashion     SVHN        CIFAR-10    CIFAR-100       Fashion     SVHN        CIFAR-10    CIFAR-100
Random    81.2 ± 0.5  72.4 ± 0.9  50.7 ± 1.5  43.3 ± 0.3      83.7 ± 0.3  78.1 ± 1.1  52.6 ± 2.8  44.0 ± 0.7
Entropy   81.5 ± 1.4  73.1 ± 1.0  51.9 ± 1.8  44.4 ± 0.7      84.1 ± 0.2  77.5 ± 3.2  54.6 ± 0.4  44.1 ± 1.0
Coreset   83.8 ± 0.7  75.3 ± 5.8  51.7 ± 1.0  44.4 ± 0.5      84.4 ± 0.6  78.9 ± 1.3  53.9 ± 1.3  47.6 ± 1.4
LL4AL     83.5 ± 1.8  75.1 ± 1.7  51.7 ± 0.4  43.9 ± 0.3      83.2 ± 1.4  72.2 ± 0.2  50.2 ± 1.1  35.7 ± .01
VAAL      83.4 ± 0.1  73.4 ± 1.3  52.0 ± 0.9  44.8 ± 0.3      84.1 ± 0.6  77.1 ± 0.8  53.1 ± 0.9  45.5 ± 0.4
BADGE     85.4 ± 0.6  74.9 ± 1.1  52.3 ± 2.2  45.7 ± 0.6      86.2 ± 0.2  78.8 ± 0.9  56.8 ± 1.9  47.4 ± 0.7
SAAL      85.8 ± 0.8  76.8 ± 0.7  54.4 ± 0.9  47.6 ± 0.9      86.3 ± 0.5  78.8 ± 1.0  57.0 ± 1.1  48.4 ± 0.9

Figure 2 shows the time in seconds on a log scale. The results of Random acquisition show that the SAM optimizer takes about twice as long as the Adam optimizer, because it requires two gradient computations per step. However, the gap between Adam and SAM becomes marginal when using the other active learning algorithms, indicating that the time for calculating the acquisition score is the largest bottleneck. SAAL calculates the perturbation, ϵ, for every single unlabeled instance instead of batch-wise, so it takes longer than most of the other baselines. The time complexity of SAAL can be reduced if we adopt the improved SAM variants (Du et al., 2021; 2022) that have been proposed for efficient calculation.

Qualitative Analysis. Figure 3 supports the conjecture that the advantage of SAAL comes from anticipating the flat local minima in the acquisition process. Figure 3 measures the maximally perturbed loss for the labeled dataset, $X_L$; the unlabeled dataset, $X_U$; and the total dataset, $X_L \cup X_U$. We compare the results between the models trained with either the SAM optimizer or the Adam optimizer.
Since it is computationally expensive to calculate the corresponding perturbation for every single unlabeled instance, $x_u \in X_U$, we uniformly sample 2,000 unlabeled instances from $X_U$ at each iteration, and we report the results averaged over three independently repeated trials. Figure 3a shows the maximally perturbed loss of $X_U$ when using the SAM optimizer. Compared with the baselines, SAAL shows the lowest value of the maximally perturbed loss, because SAAL selects the instances with high values of perturbed loss and removes them from the unlabeled dataset by passing them to the labeled dataset. Figure 3b shows the maximally perturbed loss of $X_L$ when using the SAM optimizer. This loss also indicates the flatness of the model: a lower value of the maximally perturbed loss of $X_L$ indicates that the model does not change its output even if the parameters are changed within a small range, which characterizes a flat model (Keskar et al., 2016; Neyshabur et al., 2017; Kawaguchi et al., 2017). Hence, SAAL results in a flatter network than the baselines. We conjecture that the flat model attained by SAAL is explained by the look-ahead concept (Roy & McCallum, 2001; Konyushkova et al., 2017; Kim et al., 2021). If we plan to minimize $\max_{\|\epsilon\| \le \rho} L_X(\theta + \epsilon)$ with the SAM optimizer, SAAL looks ahead to the high values of $\max_{\|\epsilon\| \le \rho} L_X(\theta + \epsilon)$ among the unlabeled instances and actively selects such instances to flatten the future loss surface. Finally, Figure 3c shows the maximally perturbed loss of the total dataset, which is equivalent to the upper bound in Eq. 5. As confirmed in the figure, SAAL achieves the lowest upper bound, which indicates that the model trained with SAAL is more likely to achieve a lower population loss, which is our ultimate minimization objective.
When comparing the results of SAM (Figures 3a to 3c) with the results of Adam (Figures 3d to 3f), the gap between SAAL and the other baselines becomes clearer with the Adam optimizer.

Sensitivity Analysis on ρ. SAAL introduces a hyperparameter, $\rho$, which represents the size of the perturbation, $\epsilon$. Hence, we conduct a sensitivity analysis on $\rho$ with the CIFAR-10 dataset, and we set the candidate values for $\rho$ as 0.01, 0.05, and 0.10. First, we examine the condition of Proposition 3.2 by investigating whether the network with the maximally perturbed parameters predicts the same label as the original network. Figure 4a shows the proportion of unlabeled data instances whose predicted labels remain the same under the perturbed network during the active learning iterations, that is,

$$\psi(\rho) := \frac{1}{|X_U|} \sum_{x \in X_U} \mathbb{1}_{\arg\max_j f_{\theta+\varepsilon}(x)_j = \hat{y}},$$

where $\mathbb{1}_A$ is the indicator function that returns 1 if the condition $A$ is satisfied. When the size, $\rho$, of the perturbation, $\epsilon$, is zero, that is, if we do not perturb the network, the inequality of Eq. 8 is satisfied for all instances by the definition of the pseudo label, $\hat{y}$. As we increase the value of $\rho$, some instances fail to keep the predicted label equal to $\hat{y}$, because the parameters of the model change so drastically that the model loses the predictive ability it has learned so far.

We also examine Theorem 3.1 by investigating the value of the margin, $\delta_x$, for the unlabeled data instances in Figure 4b. It should be noted that $\delta_x$ is not a hyperparameter, but a dependent variable that changes with $\rho$. We investigate $\delta_x$ only to reveal the characteristics of $\rho$, not for hyperparameter optimization. To show how the value of the margin, $\delta_x$, affects the inequality, we measure the relative value of the margin, $\delta_x$, compared to the maximally perturbed loss, $\max_{\|\epsilon\| \le \rho} l(x, \bar{y}; \theta + \epsilon)$, that is,

$$r(\rho, \delta_x) := \frac{1}{|X_U|} \sum_{x \in X_U} \frac{\delta_x}{\max_{\|\epsilon\| \le \rho} l(x, \bar{y}; \theta + \epsilon)}.$$

From the analyses of Figures 4a and 4b, we adopt $\rho = 0.05$, because this value 1) keeps the predicted label of a data instance from the original network with high probability and 2) keeps the value of the margin relatively small compared to the maximally perturbed loss w.r.t. the ground-truth label, while $\rho = 0.05$ is confirmed to perturb the parameters of the network effectively (Foret et al., 2020).

The proper selection of $\rho$ also affects the test accuracy, as shown in Figure 4c. If we select too small a value, that is, $\rho = 0.01$, the parameters of the model are not perturbed enough to measure the sharpness, so SAAL cannot capture the informative instances. If we select too large a value, that is, $\rho = 0.10$, the maximally perturbed loss 1) does not satisfy the condition of Proposition 3.2, as confirmed in Figure 4a, and 2) has too large a margin, as confirmed in Figure 4b. Meanwhile, a proper value of $\rho = 0.05$ for the perturbation, $\epsilon$, shows the best performance.

To show the effectiveness of SAAL in a more complex task, we conduct an object detection experiment. Object detection returns the locations of semantic objects and the corresponding labels for a given input image, $x$. Hence, the loss for training a detection model consists of the bounding box regression loss and the classification loss. We experiment with the PASCAL VOC 2007 and 2012 datasets (Everingham et al., 2010), which contain 5,011 images and 4,952 images with 20 object classes, respectively. We adopt the Single Shot Multibox Detector (SSD) (Liu et al., 2016) as the detection model. To apply SAAL to object detection, we perturb the parameters to maximize the classification loss, and we use the sum of the perturbed losses over all detection boxes in the image, $x$, as the acquisition score for $x$. Afterward, we select the images with the highest scores.
We construct the initial labeled dataset with 1,000 randomly selected images, and we select an additional 1,000 instances at every acquisition iteration, so that we attain 10,000 final instances after nine repeated acquisitions. We train the model for 300 epochs with a batch size of 32. Figure 5 reports the mean average precision (mAP) over three repeated trials of SAAL and the baselines. As shown in the figure, SAAL achieves high performance at the earlier iterations and shows the highest mAP of 0.7541 at the last iteration, while BADGE, Entropy, and Random show 0.7493, 0.7518, and 0.7403, respectively.

5. CONCLUSION AND FUTURE WORKS

We propose a new active learning method named Sharpness-Aware Active Learning, or SAAL. SAAL considers the loss sharpness of data instances, which is strongly related to the generalization performance of deep learning. Furthermore, we derive an upper bound of the SAAL acquisition score and find its connection to recent active learning methods, as well as to the first eigenvalue of the loss Hessian, which is widely used as an indicator of loss sharpness. In various experiments with benchmark datasets, SAAL shows better performance than the baselines. As future work, we will improve the time complexity of SAAL by adopting recent SAM models.

$\delta_x := \left[\max_j \{ f_{\theta+\varepsilon}(x)_j - f_{\theta+\varepsilon}(x)_{\hat{y}} \}\right]_+$, where $[\cdot]_+ = \max\{\cdot, 0\}$. Then, the following holds.

Proposition A.2. For a data instance x and the corresponding pseudo label ŷ, let ε be the maximal perturbation over the parameters w.r.t. the loss l(x, ŷ; θ + ϵ). If the perturbed network, $f_{\theta+\varepsilon}$, keeps the same predicted label as the original network, $f_\theta$, then the maximally perturbed loss calculated with (x, ŷ) is a lower bound of the maximally perturbed loss calculated with (x, ȳ), as below:

$$\max_{\|\epsilon\| \le \rho} l(x, \hat{y}; \theta + \epsilon) \le \max_{\|\epsilon\| \le \rho} l(x, \bar{y}; \theta + \epsilon).$$

Proof. Since the perturbed network, $f_{\theta+\varepsilon}$, keeps the same predicted label as the original network, $f_\theta$, it holds that $\operatorname{argmax}_j f_{\theta+\varepsilon}(x)_j = \operatorname{argmax}_j f_\theta(x)_j = \hat{y}$ and accordingly $f_{\theta+\varepsilon}(x)_j \le f_{\theta+\varepsilon}(x)_{\hat{y}}$ for all j. Hence, $\max_j \{ f_{\theta+\varepsilon}(x)_j - f_{\theta+\varepsilon}(x)_{\hat{y}} \} \le 0$. Thus, by the definition of the margin in Theorem 3.1, $\delta_x$ becomes zero, and the bound of Theorem 3.1 reduces to the stated inequality.

A.2.3 PROOF OF THEOREM 3.3

Theorem A.3. The acquisition function, $f^{\mathrm{SAAL}}_{\mathrm{acq}}$, of Eq. 6 is upper bounded by

$$l(\theta) + \rho \|\nabla_\theta l(\theta)\|_2 + \frac{1}{2} \rho^2 \lambda_1 + \max_{\|v\| \le 1} O(\rho^2 v^3),$$

where l(θ) abbreviates the loss of a data pair, (x, y), and $\lambda_1$ is the first eigenvalue of the loss Hessian.

Proof. Recall that our acquisition function is $f^{\mathrm{SAAL}}_{\mathrm{acq}} = \max_{\|\epsilon\| \le \rho} l(x_u, \hat{y}_u; \theta + \epsilon)$.
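Since we limit the size of the perturbation to ∥ϵ∥ ≤ ρ, we can write ϵ = ρv with ∥v∥ ≤ 1, and

$$\max_{\|\epsilon\| \le \rho} l(x_u, \hat{y}_u; \theta + \epsilon) = \max_{\|\rho v\| \le \rho} l(x_u, \hat{y}_u; \theta + \rho v) = \max_{\|v\| \le 1} l(x_u, \hat{y}_u; \theta + \rho v).$$

Then, by Taylor expansion of $l(x_u, \hat{y}_u; \theta + \rho v)$ w.r.t. θ, the following holds, where we abbreviate $l(x_u, \hat{y}_u; \theta)$ as l(θ):

$$l(\theta + \rho v) = l(\theta) + \rho\, v^\top \nabla_\theta l(\theta) + \frac{1}{2} \rho^2\, v^\top \nabla_\theta^2 l(\theta)\, v + O((\rho v)^3).$$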
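The bound of Theorem A.3 can be sanity-checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions: it uses a hypothetical two-dimensional quadratic loss (so the third-order remainder vanishes), a positive definite Hessian, and takes $\lambda_1$ to be the largest Hessian eigenvalue. The maximally perturbed loss is approximated by sampling points in the ρ-ball, and all function names are invented for this example.

```python
import math

# Hypothetical quadratic loss l(theta) = 0.5 * theta^T H theta + b^T theta.
H = [[2.0, 0.5], [0.5, 1.0]]  # symmetric, positive definite Hessian (constant)
b = [1.0, -1.0]


def loss(theta):
    quad = sum(theta[i] * H[i][j] * theta[j] for i in range(2) for j in range(2))
    return 0.5 * quad + sum(b[i] * theta[i] for i in range(2))


def grad(theta):
    return [sum(H[i][j] * theta[j] for j in range(2)) + b[i] for i in range(2)]


def top_eigenvalue_2x2(m):
    """Largest eigenvalue of a symmetric 2x2 matrix, in closed form."""
    mean = 0.5 * (m[0][0] + m[1][1])
    return mean + math.sqrt((0.5 * (m[0][0] - m[1][1])) ** 2 + m[0][1] ** 2)


def upper_bound(theta, rho):
    """l(theta) + rho * ||grad|| + 0.5 * rho^2 * lambda_1 (no remainder term)."""
    g = grad(theta)
    return loss(theta) + rho * math.hypot(g[0], g[1]) + 0.5 * rho ** 2 * top_eigenvalue_2x2(H)


def sampled_max_perturbed_loss(theta, rho, n_angles=720, n_radii=20):
    """Approximate the max of l(theta + eps) over ||eps|| <= rho by sampling."""
    best = loss(theta)
    for r_idx in range(1, n_radii + 1):
        r = rho * r_idx / n_radii
        for a_idx in range(n_angles):
            a = 2.0 * math.pi * a_idx / n_angles
            candidate = loss([theta[0] + r * math.cos(a), theta[1] + r * math.sin(a)])
            best = max(best, candidate)
    return best


theta0 = [0.3, -0.2]
rho = 0.05
max_loss = sampled_max_perturbed_loss(theta0, rho)
bound = upper_bound(theta0, rho)
```

In this toy setting the sampled maximum stays below the bound, with slack whenever the gradient direction and the top eigenvector do not coincide. For a real network the remainder term would not vanish, so the check is only indicative.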



For LL4AL, we failed to reproduce the reported performance when using the SAM optimizer.



(a) Correlation between $f^{\mathrm{SAAL}}_{\mathrm{acq}}$ and the upper bound terms (b) Magnitude of the upper bound terms (c) Detailed view of the first eigenvalue, $\lambda_1$

Figure 1: Correlation and magnitude of upper bound terms

Figure 2: Comparison of time complexity, in log-scale

Next, we compare the time complexity of SAAL and the baselines. We used the CIFAR-10 dataset and measured the time for a single iteration of acquisition and training. Figure 2 shows the time in seconds on a log scale. The results of Random acquisition show that the SAM optimizer takes twice as long as the Adam optimizer, because it requires two steps of gradient calculation. However, the gap between Adam and SAM becomes marginal when using the other active learning algorithms, indicating that the time for calculating the acquisition score is the largest bottleneck. SAAL calculates the perturbation, ϵ, for every single unlabeled instance instead of batch-wise, so it takes longer than most of the other baselines. The time complexity of SAAL can be reduced if we adopt the improved SAM models (Du et al., 2021; 2022) that have been proposed for efficient calculation.

Figure 3: Maximally perturbed loss of the labeled dataset, unlabeled dataset, and total dataset during the active learning iterations. (a) to (c) are the results of the model trained with the SAM optimizer. (d) to (f) are the results of the model trained with the Adam optimizer.

Figure 4: Sensitivity analysis on ρ
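The two diagnostics plotted in Figure 4a and Figure 4b, the label-retention ratio ψ(ρ) and the relative margin r(ρ, δ_x), can be sketched as below. This is a hedged illustration rather than the authors' code: it assumes the logits of the original and perturbed networks, the margins, and the maximally perturbed losses w.r.t. the ground-truth labels are already available, and the function names and toy numbers are hypothetical.

```python
from typing import List, Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest entry (first one on ties)."""
    return max(range(len(values)), key=lambda j: values[j])


def label_retention(original_logits: List[Sequence[float]],
                    perturbed_logits: List[Sequence[float]]) -> float:
    """psi(rho): fraction of instances whose predicted label survives the perturbation."""
    kept = sum(argmax(p) == argmax(o) for o, p in zip(original_logits, perturbed_logits))
    return kept / len(original_logits)


def relative_margin(margins: Sequence[float], max_perturbed_losses: Sequence[float]) -> float:
    """r(rho, delta_x): mean of delta_x divided by the max perturbed loss w.r.t. the true label."""
    ratios = [d / l for d, l in zip(margins, max_perturbed_losses)]
    return sum(ratios) / len(ratios)


# Toy values for four unlabeled instances (made up for illustration).
toy_original = [[2.0, 1.0], [0.1, 0.9], [1.5, 1.4], [3.0, 0.0]]
toy_perturbed = [[1.8, 1.1], [0.2, 0.8], [1.3, 1.6], [2.5, 0.4]]  # third prediction flips
toy_margins = [0.0, 0.0, 0.3, 0.0]
toy_max_losses = [0.5, 0.6, 1.2, 0.2]

psi = label_retention(toy_original, toy_perturbed)
r = relative_margin(toy_margins, toy_max_losses)
```

Note that the relative margin needs ground-truth labels, so it serves as an analysis tool on a benchmark rather than a quantity available during acquisition.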

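$$
\begin{aligned}
\max_{\|\epsilon\| \le \rho} l(x, \hat{y}; \theta + \epsilon)
&= \ln \sum_j \exp(f_{\theta+\varepsilon}(x)_j) - f_{\theta+\varepsilon}(x)_{\hat{y}} \\
&\le \ln \sum_j \exp(f_{\theta+\varepsilon}(x)_j) - f_{\theta+\varepsilon}(x)_{\bar{y}} + \delta_x \\
&\le \max_{\|\epsilon\| \le \rho} \Big[ \ln \sum_j \exp(f_{\theta+\epsilon}(x)_j) - f_{\theta+\epsilon}(x)_{\bar{y}} \Big] + \delta_x \\
&= \max_{\|\epsilon\| \le \rho} l(x, \bar{y}; \theta + \epsilon) + \delta_x
\end{aligned}
$$

A.2.2 PROOF OF PROPOSITION 3.2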


A APPENDIX

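A.1 TEST ACCURACY

We provide the learning curves of SAAL and the baselines along the acquisition iterations.

Theorem A.1. For a data instance x, let ŷ be the pseudo label predicted by the network $f_\theta$ and ȳ be the ground-truth label. Then, the maximally perturbed loss calculated with (x, ŷ) is a lower bound of the maximally perturbed loss calculated with (x, ȳ), up to a non-negative margin, $\delta_x$, as below:

$$\max_{\|\epsilon\| \le \rho} l(x, \hat{y}; \theta + \epsilon) \le \max_{\|\epsilon\| \le \rho} l(x, \bar{y}; \theta + \epsilon) + \delta_x.$$

Proof. The cross-entropy loss, l(x, y; θ), is represented with the logit vector $f_\theta(x) \in \mathbb{R}^{|Y|}$ as below:

$$l(x, y; \theta) = \ln \sum_{j} \exp(f_\theta(x)_j) - f_\theta(x)_y.$$

Then, the maximally perturbed loss of a data pair (x, y) is represented as below:

$$\max_{\|\epsilon\| \le \rho} l(x, y; \theta + \epsilon) = \max_{\|\epsilon\| \le \rho} \Big[ \ln \sum_{j} \exp(f_{\theta+\epsilon}(x)_j) - f_{\theta+\epsilon}(x)_y \Big].$$

Since the pseudo label, ŷ, satisfies $\hat{y} = \operatorname{argmax}_{j \in Y} f_\theta(x)_j$ by definition, it holds that $f_\theta(x)_{\hat{y}} \ge f_\theta(x)_j$ for all j ∈ Y. Let $\varepsilon = \operatorname{argmax}_{\|\epsilon\| \le \rho} l(x, \hat{y}; \theta + \epsilon)$. Define the margin, $\delta_x$, as the positive part of the largest gap between any logit of the perturbed network, $f_{\theta+\varepsilon}$, and its logit for ŷ.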
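The role of the margin can be illustrated with a small numeric check. The sketch below is not the authors' code and makes two assumptions: it works directly on hand-picked logit vectors (all values hypothetical) instead of a trained network, and it only verifies the pointwise step of the argument at one fixed perturbation, namely that the cross-entropy w.r.t. ŷ is at most the cross-entropy w.r.t. ȳ plus the margin.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """ln(sum_j exp(logit_j)) - logit_label, with a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def argmax(values: Sequence[float]) -> int:
    return max(range(len(values)), key=lambda j: values[j])


def margin(perturbed_logits: Sequence[float], pseudo_label: int) -> float:
    """Positive part of the largest logit gap to the pseudo label's logit."""
    gap = max(z - perturbed_logits[pseudo_label] for z in perturbed_logits)
    return max(gap, 0.0)


# Hypothetical logits: the perturbation flips the prediction from class 0 to class 1.
original_logits = [2.0, 1.0, 0.5]
perturbed_logits = [1.2, 1.6, 0.4]
true_label = 2

pseudo_label = argmax(original_logits)            # pseudo label from the original network
delta = margin(perturbed_logits, pseudo_label)    # positive, since the label flipped
lhs = cross_entropy(perturbed_logits, pseudo_label)
rhs = cross_entropy(perturbed_logits, true_label) + delta

# If the perturbed network keeps the predicted label, the margin is zero.
kept_logits = [1.8, 1.1, 0.6]
delta_kept = margin(kept_logits, pseudo_label)
```

The second case mirrors the condition of Proposition 3.2: when the perturbed prediction still equals ŷ, the margin vanishes and no correction term is needed.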

