ACTIVE LEARNING WITH CONTROLLABLE AUGMENTATION-INDUCED ACQUISITION

Abstract

The mission of active learning is to iteratively identify the most informative data samples to annotate, and thereby attain decent performance with far fewer samples. Despite the promise, the acquisition of informative unlabeled samples can be unreliable, particularly during early cycles, owing to limited data samples and sparse supervision. To tackle this, data augmentation techniques appear straightforward yet promising for easily extending the exploration of the input space. In this work, we thoroughly study the coupling of data augmentation and active learning, whereby we propose the Controllable Augmentation ManiPulator for Active Learning (CAMPAL). In contrast to the few prior works that touched on this line, CAMPAL emphasizes a tighter and better-controlled integration of data augmentation into the active learning framework, in three folds: (i) carefully designed data augmentation policies applied separately to the labeled and unlabeled data pools in every cycle; (ii) controlled and quantifiably optimizable augmentation strengths; (iii) full yet flexible coverage for most (if not all) active learning schemes. Through extensive empirical experiments, we bring the performance of active learning methods to a new level: an absolute performance boost of 16.99% on CIFAR-10 and 12.25% on SVHN with 1,000 annotated samples. Complementary to the empirical results, we further provide theoretical analysis and justification of CAMPAL.

1. INTRODUCTION

The acquisition of labeled data serves as a foundation for the remarkable successes of deep supervised learning over the last decade, yet it incurs great monetary and time costs. Active learning (AL) is a pivotal learning paradigm that puts the data acquisition process into the loop of learning, locating the most informative and valuable data samples for annotation (Settles, 2009; Zhang et al., 2020; Kim et al., 2021a; Wu et al., 2021). With much lower sample complexity but comparable performance relative to its fully supervised counterpart, active learning is widely used in real-world applications and ML production (Bhattacharjee et al., 2017; Feng et al., 2019; Hussein et al., 2016). In spite of its meritorious practicality, active learning often suffers from unreliable data acquisition, especially in the early stages. Notably, the models obtained around the early stages are generally raw and undeveloped, owing to the insufficient curated data and the sparse supervision signal consumed so far, and each subsequent data-query cycle builds on the model produced by the current one. While this problem can probably be mitigated after adequate cycles are conducted, we argue that the problems at the early stages of AL cannot be overlooked. Indeed, a few works have resorted to data augmentation techniques to generate additional data examples for data distribution enrichment, e.g., GAN-based (Tran et al., 2019) and STN-based (Kim et al., 2021b) methods. In this work, we attempt to take a further step in investigating the role of data augmentation (DA) for AL. To begin with, we provide a straightforward quantitative observation in Figure 1. The setup of these results is rather simple: we directly apply vanilla DA operations, such as flipping and rotation, to data samples and linearly increase the augmentation strengths. We may draw the following conclusions from these scores.
First, simple augmentation (loosely) integrated into AL has led to surprisingly enhanced results, without the complicated designs of prior methods. Secondly, and perhaps more importantly, we have a counterfactual observation: the same augmentation policy applied to different data pools manifests notably different impacts. As shown in Figure 1, when gradually stacking the augmentation operations, the labeled and unlabeled data pools achieve their best performance at different levels of augmentation strength.

Figure 1: A visualization of data augmentations and the corresponding performance change as we stack augmentations over images when integrating them into active learning cycles. We test 3 cases where augmentations are applied to 1) unlabeled samples only; 2) labeled samples only; 3) both. Details of the experimental setups can be found in Appendix B.2.

In hindsight, we hereby posit our reasoning. To fully and tightly incorporate DA into AL schemes, augmentation ought to serve different objectives on the labeled and unlabeled pools. In particular, the labeled pool favors label-preserving augmentation in order to obtain a strong and reliable classifier. By contrast, the unlabeled pool may require relatively more aggressive augmentation so as to maximally gauge the unexplored distribution. The phenomenon in Figure 1 preliminarily validates this reasoning. Note that this counterfactual observation has not been studied or investigated by prior works (Tran et al., 2019; Gao et al., 2020; Kim et al., 2021b). Motivated by it, we propose the Controllable Augmentation ManiPulator for Active Learning (CAMPAL). Core to our method is a purposely designed, better-controlled, and tighter integration of data augmentation into active learning. By proposing CAMPAL, we aim to fill this integration gap and unlock the full potential of data augmentation methods integrated into active learning schemes.
In particular, CAMPAL integrates several mechanisms into the whole AL framework:
• CAMPAL constructs separate augmentation flows on the labeled and unlabeled data pools, each toward its own objective;
• CAMPAL composes a strength optimization procedure for the applied augmentation policies;
• CAMPAL complies with most common active learning schemes, with carefully designed acquisition functions for both score- and representation-based methods.
Besides the theoretical justification of CAMPAL offered in Section 4, we conduct extensive experiments and analyses on our approach. The empirical results of CAMPAL are stunning: a 16.99% absolute improvement at a 1,000-sample cycle and a 13.34% lead with 2,000 samples on CIFAR-10, compared with previously best methods. Arguably, we may postulate that these significantly enhanced results have the chance to greatly extend the boundary of active learning research.

2. METHODOLOGY

In this section, we describe CAMPAL in detail. CAMPAL chiefly consists of two components. On the one hand, CAMPAL formulates a decoupled optimization workflow to locate feasible augmentations applied to the labeled/unlabeled data pools with distinct optimization objectives; this difference in objectives is eventually manifested in their augmentation strengths (Section 2.2). On the other hand, CAMPAL aggregates the information provided by properly-controlled augmentations with modified acquisition functions (Section 2.3), so as to be adaptable to most (if not all) active learning schemes. Hence we may posit that CAMPAL forms a much tighter integration of DA and AL, owing not only to its controllable mechanism on both data pools but also to its full adaptability to common active learning schemes. The framework of CAMPAL is summarized in Figure 2.

2.1. SETUP AND DEFINITIONS

Active Learning. The problem of active learning (AL) is defined with the following setup. Consider D ⊂ R d as the underlying dataset consisting of a labeled data pool D L and an unlabeled data pool D U , with |D U | ≫ |D L |. Based on a fully-trained classifier f θ that assigns a label to each data point, a data acquisition function h acq (x, f θ ) : D U → R calculates the score for each data instance. We also use P(y|x; f θ ) to denote the probabilistic label distribution of x given by f θ . Then AL selects the most informative sample batch and updates the labeled set accordingly. In the remainder of this paper, we omit parameter f θ in h acq when the reliance on acquisitions over classifiers is clear.
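To make the setup concrete, the pool-based AL loop described above can be sketched as follows. This is an illustrative sketch, not the paper's algorithm: `train`, `label` (the oracle), and the acquisition function `h_acq` are placeholders for whatever instantiation is used.

```python
import random

def active_learning_loop(pool, h_acq, train, label, cycles=5, batch_size=10):
    """Generic pool-based active-learning loop (illustrative sketch).

    `pool` holds the unlabeled data; `h_acq(x, model)` scores a candidate;
    `train(labeled)` fits a classifier on (x, y) pairs; `label(x)`
    simulates the annotation oracle.
    """
    unlabeled = list(pool)
    random.shuffle(unlabeled)
    # Seed D_L with a small random initial batch.
    labeled = [(x, label(x)) for x in unlabeled[:batch_size]]
    unlabeled = unlabeled[batch_size:]
    model = train(labeled)
    for _ in range(cycles):
        # Query the samples the acquisition function scores highest.
        unlabeled.sort(key=lambda x: h_acq(x, model), reverse=True)
        batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        labeled.extend((x, label(x)) for x in batch)
        model = train(labeled)  # retrain on the enlarged labeled pool
    return model, labeled
```

CAMPAL keeps this outer loop intact and modifies only the training step (augmented labeled data) and the acquisition step (augmented unlabeled data).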


Figure 2: An active learning cycle in the CAMPAL framework. We optimize strengths s_u, s_l for the unlabeled/labeled pool separately, then generate the corresponding strength-guided augmented views. We train an enhanced classifier over the augmented labeled samples and induce an enhanced acquisition with the augmented unlabeled samples. Finally, the acquisition selects informative samples to be labeled by an oracle.

Data Augmentation. We denote a single data augmentation (DA) operator by T(x), which transforms a data point x into another view via translation, rotation, or other augmentation operations. We denote the set consisting of these augmentations by T. In practice, several studies also provide extended augmentation operators composed of multiple operators (Hendrycks et al., 2019; Xu et al., 2022). The number of operators s in the composition is named the strength of T, and T^(s) denotes the augmentation set with strength s. Intuitively, s also quantifies how far an augmentation drifts images away from their original counterparts. Given a data point x, we also use T(x) to denote the set of all its augmented views. With the notations given above, we describe our controllable augmentation-induced acquisition in the following sections.
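As a minimal illustration of the strength notion above, T^(s) can be enumerated as all length-s compositions of base operators. The toy operators here act on numbers rather than images, purely for demonstration:

```python
import itertools

def strength_s_augmentations(ops, s):
    """Enumerate T^(s): all compositions of s base operators from `ops`.

    Each element of `ops` is a function x -> augmented x; composing s of
    them (applied left to right) yields one member of the strength-s set.
    """
    def compose(fs):
        def t(x):
            for f in fs:
                x = f(x)
            return x
        return t
    return [compose(seq) for seq in itertools.product(ops, repeat=s)]
```

For instance, with two base operators there are 2^s members in T^(s); the size of the set, and how far its members drift the input, both grow with s.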

2.2. CONTROLLABLE AUGMENTATIONS FOR ACTIVE LEARNING

As shown in Figure 1, data augmentation plays different roles on different data pools. On labeled data, it aims to improve the model's prediction performance, which requires the generated virtual examples to be label-invariant. In contrast, on unlabeled examples, it enlarges the exposed data distribution in pursuit of a better acquisition distribution. In this section, we propose a principled framework that searches for feasible DA configurations for the two data pools according to their own natural objectives. It is worth noting that we adopt dynamic control of the augmentation strength across cycles, making the augmentations adaptable to changes as AL proceeds. It can be empirically verified that such dynamic control is better than a fixed augmentation strategy (Section 3.3). Through appropriate strength control, we expect to increase the quality of augmentations for AL.

Strength for Unlabeled Data. The primary goal of augmenting unlabeled data is to offer a precise informativeness evaluation with the enriched distribution induced accordingly, thus inducing a more reliable acquisition. A problem here is the invalid information introduced by potentially undesirable augmentations: weak augmentations contain trivial augmentations that contribute little to distribution enrichment, while drastic augmentations introduce excessive distribution drifts that mislead the acquisition. We resolve this problem by proposing a proper strength that maximizes the overall informativeness of the augmented unlabeled pool. Specifically, we maximize the least information the augmentations can provide, ensuring that an optimized strength offers reliable informativeness, which can be formulated as:

$$s_u = \arg\max_s \sum_{x_U \in D_U} \min\bigl\{ H(\tilde{x}_U) \mid \tilde{x}_U \in T^{(s)}(x_U),\ f_\theta(\tilde{x}_U) = f_\theta(x_U) \bigr\}, \qquad (1)$$

where H can be an arbitrary informativeness metric; we adopt entropy here, as it is sufficient to derive a proper s_u.
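A brute-force sketch of the max-min strength selection above, assuming a small candidate set of strengths and toy one-dimensional "images"; the `predict` model, the augmentation sets, and the operator shapes are all stand-ins for illustration:

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete distribution (in nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_unlabeled_strength(pool, predict, aug_sets):
    """Pick s_u maximizing the summed worst-case entropy over augmented
    views, keeping only label-preserving views.

    `predict(x)` returns class probabilities; `aug_sets[s]` is the list
    of strength-s augmentation operators T^(s).
    """
    def argmax_label(x):
        p = predict(x)
        return max(range(len(p)), key=lambda i: p[i])

    best_s, best_score = None, -float("inf")
    for s, ops in aug_sets.items():
        score = 0.0
        for x in pool:
            # Discard views on which the model's predicted label flips.
            kept = [t(x) for t in ops if argmax_label(t(x)) == argmax_label(x)]
            if kept:
                # Inner min: the least information the views can provide.
                score += min(entropy(predict(v)) for v in kept)
        if score > best_score:  # outer arg max over strengths
            best_s, best_score = s, score
    return best_s
```

In this toy setting, a strength whose views move samples closer to the decision boundary (higher entropy) without flipping their predicted labels wins, mirroring the objective in Eq. (1).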
By adopting a max-min optimization procedure, we eliminate the potential negative impact of corruption from aggressively augmented samples with the inner min over H, and maximize the overall informativeness of the augmented unlabeled pool with the outer arg max.

Strength for Labeled Data. For labeled data, we select the strength that minimizes the augmented training loss:

$$s_l = \arg\min_s \frac{1}{|D_L|} \sum_{x_L \in D_L} L_f(x_L, s), \qquad (2)$$

where

$$L_f(x, s) = L(x) + \lambda_1\, \mathrm{JS}\bigl\{\mathbb{P}(y \mid \tilde{x}; f_\theta) \mid \tilde{x} \in T^{(s)}_{\mathrm{single}}(x)\bigr\} + \frac{\lambda_2}{|T^{(s)}_{\mathrm{mix}}(x)|} \sum_{\dot{x} \in T^{(s)}_{\mathrm{mix}}(x)} L(\dot{x}),$$

L(x) and L_f(x, s) denote the normal loss term and the augmented loss respectively, and λ_1, λ_2 are fixed weights. For single-image augmentations T^(s)_single, we integrate the augmented information into the model by encouraging the augmented views to produce similar outputs, where the dissimilarity is quantified with a Jensen-Shannon (JS) divergence term. For image mixing T^(s)_mix, we simply follow the setup of Mixup. With the strengths s_u, s_l given above, we locate augmentations T_u that effectively enlarge the distribution, and augmentations T_l that help induce dependable classifiers. The combination of the two enables us to enhance acquisitions by making the classifiers and informativeness evaluations in the AL framework work collaboratively and efficiently. We show how augmentations for unlabeled samples (UA) and labeled samples (LA) each contribute to the acquisition in Section 3.3.
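The augmented labeled loss L_f can be sketched as follows. This is a simplified stand-in: `predict`, `loss`, and the view lists are placeholders; the JS term is the generalized Jensen-Shannon divergence over predicted distributions (here including the original view); and the mixed-view loss reuses the original label for simplicity, whereas the paper follows Mixup's label mixing:

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(dists):
    """Generalized Jensen-Shannon divergence over several predicted
    label distributions (the consistency term for single-image views)."""
    k = len(dists[0])
    m = [sum(d[i] for d in dists) / len(dists) for i in range(k)]
    return sum(kl(d, m) for d in dists) / len(dists)

def augmented_loss(x, y, predict, loss, single_views, mix_views,
                   lam1=1.0, lam2=1.0):
    """Sketch of L_f(x, s): supervised loss on x, plus a JS consistency
    term over the single-image views, plus the average loss on mixed views."""
    base = loss(predict(x), y)
    js = js_divergence([predict(v) for v in [x] + list(single_views)])
    mix = sum(loss(predict(v), y) for v in mix_views) / max(len(mix_views), 1)
    return base + lam1 * js + lam2 * mix
```

When the model's outputs on all views agree, the JS term vanishes and only the supervised terms remain, which is exactly the label-invariance the labeled pool is optimized for.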

2.3. CONTROLLABLE AUGMENTATION-INDUCED ACQUISITION FOR ACTIVE LEARNING

With the properly-controlled augmentations from Section 2.2, we proceed by providing a fast and efficient approach to integrating the functionalities of acquisitions and augmentations, i.e., controllable augmentation-induced acquisition. A key challenge in inducing the augmented acquisition h_acq arises from the complicated forms of h_base, which denotes the basic acquisition and varies across studies. In this section, we highlight two types of acquisitions, i.e., score-based acquisition and representation-based acquisition. We treat these two types of h_base differently and describe the corresponding augmented acquisition forms. Notably, CAMPAL can adopt and enhance various kinds of acquisitions; see Section 3. Since training a classifier f_θ with augmentations is straightforward, we focus on formulating the augmented acquisition with augmented unlabeled data.

Integrating Augmentations into Score-based Acquisitions. A score-based acquisition calculates an informativeness score for each data point and selects the samples with the highest scores, as in Max Entropy (Settles, 2009). We enhance such methods by aggregating the information provided by augmentations, which comes in the form of real-valued scores. Specifically, for methods that calculate an acquisition score h_base(x) for each sample x, we calculate a score h_base(x̃) for every augmented counterpart x̃ ∈ T_u(x) and aggregate these scores into one. We propose several variants of h_acq, including:
1. h_acq(x) = min_{x̃∈T_u(x)} h_base(x̃), which reduces potentially redundant information by taking the minimum acquisition score within the augmented batch;
2. h_acq(x) = Σ_{x̃∈T_u(x)} h_base(x̃), which sums up all the informativeness provided by the augmentations;
3. h_acq(x) = Σ_{x̃∈T_u(x)} sim(x, x̃) h_base(x̃), which weights the informativeness of each x̃ by its similarity to its non-augmented counterpart, thus introducing inter-sample information.

Integrating Augmentations into Representation-based Acquisitions.
For representation-based acquisitions, h_base provides a feature vector embedded in a representation space and performs sampling according to this space, as in Core-set (Sener et al., 2018). Noticing that representation-based methods rely on a distance function to measure the correlation between instances, we generalize the distance functions between individual samples to point-set distance functions between augmented sample batches. By adopting set distance functions, we enhance the acquisition process by taking the correlation across augmentations of different samples into consideration. To this end, we focus on well-defined set distance functions and propose the corresponding variants as follows:
1. Standard distance: d(x, z) = min_{x̃∈T_u(x), z̃∈T_u(z)} ∥x̃ − z̃∥²₂;
2. Chamfer distance: d(x, z) = Σ_{x̃∈T_u(x)} min_{z̃∈T_u(z)} ∥x̃ − z̃∥²₂ + Σ_{z̃∈T_u(z)} min_{x̃∈T_u(x)} ∥x̃ − z̃∥²₂, which considers pairwise similarities between the augmented views of the two samples;
3. Pompeiu-Hausdorff distance: d(x, z) = max{max_{x̃∈T_u(x)} d(x̃, T_u(z)), max_{z̃∈T_u(z)} d(T_u(x), z̃)}, which highlights the maximal potential difference between two samples.

Controllable DA-Driven Active Learning Cycles. With these augmentation-induced acquisitions, we complete the active learning cycle within CAMPAL. First, we generate the labeled augmentations T_l with properly controlled strength s_l, then produce an augmented classifier f_θ trained over them. This makes up for the insufficient labeled information and yields a more reliable model. Second, we generate the unlabeled augmentations with an optimized strength s_u and induce the enhanced acquisition h_acq with T_u and f_θ. Notably, CAMPAL offers dynamic strength control over augmentations across cycles, which also leads to a controllable acquisition that adapts itself to the changing data pools.
This augmentation-induced acquisition step provides precise information evaluation and guarantees the positive impact of augmentations, which finally helps produce better querying results. As a result, these two steps jointly ensure the quality of data to label at the end of the active learning cycle, largely boosting the performance. Our experiments in Section 3 show their separate effects as well as the combined impacts in detail. The pseudo-code of our algorithm is provided in Appendix C.
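The score-aggregation variants and the set distances between augmented batches described above admit direct implementations. The following sketch uses plain Python lists and user-supplied base score / base distance functions (all names here are illustrative, not the paper's API):

```python
def aggregate_scores(x, views, h_base, mode="min", sim=None):
    """Aggregate base acquisition scores over the augmented views of x
    (the MIN / SUM / DENSITY variants for score-based acquisitions)."""
    scores = [h_base(v) for v in views]
    if mode == "min":
        return min(scores)
    if mode == "sum":
        return sum(scores)
    if mode == "density":
        # Weight each view's score by its similarity to the original x.
        return sum(sim(x, v) * h_base(v) for v in views)
    raise ValueError(f"unknown mode: {mode}")

def _pairwise(A, B, dist):
    return [[dist(a, b) for b in B] for a in A]

def standard_distance(A, B, dist):
    """Minimum distance over all cross pairs of augmented views."""
    return min(min(row) for row in _pairwise(A, B, dist))

def chamfer_distance(A, B, dist):
    """Sum of each view's distance to its nearest neighbour in the other set."""
    d = _pairwise(A, B, dist)
    return sum(min(row) for row in d) + sum(min(col) for col in zip(*d))

def hausdorff_distance(A, B, dist):
    """Largest nearest-neighbour distance in either direction."""
    d = _pairwise(A, B, dist)
    return max(max(min(row) for row in d), max(min(col) for col in zip(*d)))
```

Plugging `aggregate_scores` into an entropy-style h_base, or one of the set distances into a Core-set-style selector, is all that is needed to lift an existing acquisition to its augmentation-induced counterpart.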

3. EXPERIMENTS

3.1. BASELINES AND DATASETS

We instantiate our proposed CAMPAL with several existing strategies, including 1) Entropy, 2) Least Confidence (LC), 3) Margin, 4) Core-set (Sener et al., 2018), and 5) BADGE (Ash et al., 2020). We also implement several augmentation-aggregation modes that integrate augmentations into an enhanced acquisition, including 1) MIN, 2) SUM, 3) DENSITY for Entropy, LC, and Margin, and 1) STANDARD, 2) CHAMFER, 3) HAUSDORFF for Core-set and BADGE, as described in Section 2.3 and Table 2. In this section, we specify each instantiated augmentation-acquisition with the basic strategy h_base as its subscript and the augmentation-aggregation mode as its superscript, e.g., CAMPAL^{MIN}_{Entropy}. We denote the best-performing version of CAMPAL, i.e., CAMPAL^{CHAMFER}_{BADGE}, as CAMPAL*, as shown in Table 2. We repeat every experiment 5 times. In this work, we compare our method to 1) Random, 2) Core-set, 3) BADGE, 4) Max Entropy, 5) Least Confidence, and 6) Margin. We also compare our method with other active learning strategies that use data augmentation, including 1) BGADL (Tran et al., 2019), 2) CAL (Gao et al., 2020), and 3) LADA (Kim et al., 2021b) in Table 1. For a fair comparison, CAL does not use its original semi-supervised setting but a supervised procedure. Since LADA has multiple versions, we choose the one with the best performance for comparison in Table 1. We further demonstrate the efficacy of CAMPAL by comparing its performance with the corresponding baseline versions in Table 3.

3.2. MAIN EMPIRICAL RESULTS

CAMPAL achieves SOTA results. As shown in Table 1 and Figure 3, CAMPAL significantly outperforms its rivals on many datasets and data scales. Specifically, on the CIFAR-10 dataset, we improve upon the best baseline by 8.08%, 16.99%, 15.11%, and 13.34% where the labeled set has 500, 1,000, 1,500, and 2,000 instances respectively. Moreover, CAMPAL exhibits the most significant performance boost with a moderately small N_L, approximately 1,000 for CIFAR-10 and SVHN. Besides, different versions of CAMPAL consistently achieve superior results on CIFAR-10, as shown in Table 2. As shown in Table 3, in all combinations of baselines and datasets, the CAMPAL variations exhibit the best performance. Notably, CAMPAL also brings a consistent performance boost with strong scalability. In addition, it is worth noting that previous works (Tran et al., 2019; Kim et al., 2021a) are typically evaluated with a large number of labeled samples (e.g., 10% ∼ 40% of labeled samples for CIFAR-10). We challenge this by querying fewer samples over the benchmark datasets, as shown in Table 1. When N_L = 500 or 2,000 on CIFAR-10, recent augmentation-based AL strategies fail to outperform simple baselines like BADGE. Notably, BGADL performs the worst, because of inadequate training with insufficient instances in the current active learning setting. Since CAL is originally designed for a semi-supervised setting, it also fails to outperform simple baselines like BADGE in this setting.

The learnt augmentation strength differs for unlabeled/labeled data. In Figure 4, we visualize the dynamics of the learnt strengths s*_u, s*_l across active learning cycles. In particular, we run the experiment 5 times on CIFAR-10 with CAMPAL^{MIN}_{Entropy} and CAMPAL^{STANDARD}_{BADGE} and report the average optimal strength values. We observe that s*_u is generally larger than s*_l across the AL cycles.
This verifies our postulation that labeled data requires moderate augmentation to preserve labels. In contrast, unlabeled data prefers relatively stronger augmentations that enrich the data distribution, so that a wider range of informative regions can be explored. We conclude that AL is better enhanced by DA with a combinatorial scheme of weak and strong augmentations applied to labeled and unlabeled data respectively in our framework, which also corroborates our theoretical findings in Section 4.

3.3. EMPIRICAL ANALYSIS

In this section, we present ablation results to show the effectiveness of our framework. We exemplify the superiority of CAMPAL using score-based acquisition with MIN as the aggregation mode; see Appendix B for more ablation experiments.

Impact of Unlabeled/Labeled Augmentations. Here, we compare the performance boost of augmentation-induced acquisitions based on different AL strategies; the results are reported in Table 4. We can see that even without augmented labeled information, the enhanced acquisition provides a consistent performance boost over several strategies, with the maximal boost presented by Margin (∆5.10%). The enhanced training process also plays an important role, promoting the performance of the existing strategies by 8.35% ∼ 11.86%. The combination of these two components shows the consistently best performance compared to the other ablation versions, indicating that they work well with each other and unleash different types of information. We conclude that both the augmented unlabeled information and the augmented labeled information help resolve the problem of unreliable judgment in AL strategies.

Comparison with fixed augmentation strengths. Since we emphasize the importance of strength control over s_l, s_u in Section 2.2, we provide more details here. In brief, augmentations with various strengths contribute to the performance but can be inefficient when the strengths are not chosen appropriately. To further examine the impact of augmentation sets with different strengths, we fix the values of s_l, s_u and observe how they determine the final performance. Specifically, we test different combinations of s_l and s_u in the range [0, 4], with other settings following the main empirical studies. The relative performance boost compared to the non-augmented counterparts is shown in Figure 5. Without proper strength control, the performance boost can diminish.
For instance, CAMPAL^{MIN}_{Margin} with s_l = 3, s_u = 1 leads to a 4.32% performance drop compared to the optimal configuration, while the worst case for CAMPAL^{MIN}_{Entropy} similarly causes a 3.28% drop. In addition, we can observe a trend similar to Section 3.2: the classifier f_θ prefers weak augmentations on labeled data, while stronger augmentations on unlabeled data induce stronger acquisitions, even without dynamic strength control.

4. THEORETICAL ANALYSIS

In this section, we theoretically analyze why weak and strong augmentations strategically applied to labeled and unlabeled data respectively exhibit the best performance when combining AL with DA. Following the previous sections, we use f_θ to denote the model fully trained over augmented labeled samples. When an unlabeled sample lies within the augmented region of a particular labeled sample, we can propagate the labeled information to that unlabeled sample. Formally, with a feature map f^emb_θ derived from f_θ, we define a covering relation between augmented labeled batches and unlabeled samples as follows:

Definition 1. Given a collection of augmentations T, we say that an image x is covered by x_i with respect to the augmentation set T if the feature embedding of x lies within the convex hull of the embeddings of the augmented views of x_i: f^emb_θ(x) ∈ conv(f^emb_θ(T(x_i))). We denote the covering relation by x ◁ x_i.

Without loss of generality, assume there are L labeled samples x_1, ..., x_L; each of them, together with the unlabeled samples covered by its augmentations, constitutes a component, yielding L components. For each component C_i (i = 1, ..., L), let P_i be the probability that a data point sampled from the underlying data distribution is covered by C_i. To make the analysis tractable, we assume properly controlled augmentations for labeled samples, eliminating potential overlaps across different components:

Assumption 2. With moderately weak augmentations for labeled samples, the C_i's do not overlap with each other, i.e., ∀i ≠ j, P(C_i ∩ C_j ≠ ∅) = 0.

With Assumption 2, the error of f_θ can be estimated by how these components cover the data space. To further illustrate this, we provide a comparison between different augmentations in Figure 6. The following proposition characterizes the relationship between the error and the components.

Proposition 3. Let E denote the probability that f_θ cannot infer the correct label of a test example.
Then E is upper bounded by

$$E \le \sum_{i=1}^{L} P_i (1 - P_i)^m + 1 - \sum_{i=1}^{L} P_i, \qquad (3)$$

where m denotes the number of samples that lie within the labeled components. In Eq. (3), the first term denotes the risk brought by ill-defined augmentations, while the second term denotes the sub-sample empirical risk. With Eq. (3), we aim to reduce the error as much as possible by acquiring informative samples. By adding a newly queried sample x_{L+1}, the error reduction is estimated as

$$\Delta E(\Delta m, P_{L+1}) \approx \sum_{i=1}^{L} P_i (1-P_i)^m \bigl[1 - (1-P_i)^{\Delta m}\bigr] - P_{L+1}\bigl[1 - (1-P_{L+1})^{m+\Delta m}\bigr], \qquad (4)$$

where ∆m is the number of samples newly covered after labeling x_{L+1}. We take a step further by examining the two terms of Eq. (4). The first term denotes the performance boost brought by better coverage with newly-annotated samples. Specifically, samples that drift farthest from the existing components better cover the under-explored data space, indicating a larger ∆m and, in turn, a larger performance boost. This is also consistent with the max-min optimization objective for unlabeled samples described in Eq. (1), with the intuition provided in Figure 6(c),(d). The second term characterizes the potential error induced by augmentations on unlabeled samples: too strong an augmentation excessively increases the value of P_{L+1}, leading to an increase of this term. Therefore, it is important to locate moderately strong augmentations for unlabeled data in AL.

Theorem 4. With properly selected augmentation sets and sufficiently large L, the maximal error reduction ∆E(∆m, P_{L+1}) achievable with newly-annotated samples can be estimated as

$$\Delta E(\Delta m, P_{L+1}) \lessapprox E\bigl(1 - K e^{-m/L}\bigr), \qquad (5)$$

where U denotes the number of unlabeled samples and $K = \frac{m + L(\log(L+U) - \log L - 1)}{L+U}$.

From the theorem, we can see that properly selected samples and augmentations yield a significant error reduction. Specifically, m/L denotes the average number of samples covered by each component, and a larger value indicates better coverage induced by properly controlled components. Revisiting our theoretical proof, we further explain that DA indeed serves different goals in AL. On the one hand, augmentations on labeled data guarantee that Assumption 2 holds, indicating that we need a dependable model and weak augmentations. On the other hand, the theorem emphasizes the importance of acquiring newly-labeled samples guided by moderately strong augmentations, ensuring better coverage while also avoiding potentially misleading information. With all the discussions above, the augmentation-acquisition integration effectively relies on the quality of augmentations: better augmentations result in more dependable classifiers for AL and larger error reduction across AL cycles. This echoes our discussion of the benefit of appropriately controlled data augmentations for AL. A more detailed analysis is given in Appendix A.
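To make Definition 1 concrete, the covering relation can be checked directly in a toy two-dimensional embedding space: a point lies in the convex hull of the view embeddings iff it lies in some triangle formed by them. This is a brute-force illustrative sketch, not the procedure used in the paper:

```python
from itertools import combinations

def _cross(o, u, v):
    """2-D cross product of (u - o) and (v - o)."""
    return (u[0] - o[0]) * (v[1] - o[1]) - (u[1] - o[1]) * (v[0] - o[0])

def _in_triangle(p, a, b, c):
    d1, d2, d3 = _cross(a, b, p), _cross(b, c, p), _cross(c, a, p)
    # p is inside (or on the boundary) iff the edge signs do not disagree.
    return not (min(d1, d2, d3) < 0 and max(d1, d2, d3) > 0)

def covered(x_emb, view_embs):
    """x ◁ x_i in 2-D: x's embedding lies in conv(f_emb(T(x_i))),
    i.e., inside some triangle formed by the view embeddings."""
    if len(view_embs) < 3:
        return x_emb in view_embs
    return any(_in_triangle(x_emb, a, b, c)
               for a, b, c in combinations(view_embs, 3))
```

Stronger augmentations spread the view embeddings further apart, enlarging the hull and hence the set of unlabeled points each labeled sample covers, which is exactly the coverage quantity the analysis above trades off against overlap.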

5. RELATED WORKS

Data Augmentation is a technique that improves the generalization ability of models by increasing the number of images and their variants in a dataset (Xu et al., 2022). The most commonly used augmentation techniques include geometric transformations (Shorten & Khoshgoftaar, 2019), random erasing (Devries & Taylor, 2017; Zhong et al., 2020), and generative adversarial networks (Zhu et al., 2018; Bowles et al., 2018). Another type of augmentation is image mixing (Zhang et al., 2018; Yun et al., 2019), which blends multiple images and their corresponding labels. Instead of designing new types of augmentations, recent studies also collect a group of augmentations and optimize their magnitude (a.k.a. strength) (Cubuk et al., 2020), which quantifies how far an augmented image drifts from its original counterpart. By optimizing this strength, several studies attain state-of-the-art performance over several benchmarks (Zheng et al., 2022; Yang et al., 2022).

Active Learning is a machine learning paradigm in which a learning algorithm actively selects the data it wants to learn from among unlabeled data sources (Settles, 2009; Ren et al., 2021). The crucial part of active learning in most existing strategies is the data acquisition process, which targets selecting the most informative examples. Current studies mostly focus on specific subsets of samples and can be roughly categorized as follows: (a) uncertainty-based methods that prefer the hardest samples (Choi et al., 2021; Mai et al., 2022) or the ones the current fully-trained model is uncertain about (Kirsch et al., 2019; Wang et al., 2022)

A THEORETICAL ANALYSIS

This section provides a complete derivation of the analysis given in Section 4. An intuition is given in Figure 6. Before the actual acquisition process, we must ensure convergence for the underlying classifier. Specifically, with proper augmentations over labeled data and the augmented loss term L_f, we can deduce an upper bound on the error probability and guarantee the convergence of training, as follows:

Theorem 1. Under the setting of CAMPAL, let E denote the probability that the classifier cannot infer the label of a newly given sample drawn from the underlying data space, with L labeled samples given in D_L and augmentation set T. Then E is upper bounded by Ê as follows:

$$E \le \hat{E}(D_L, f_\theta, T) = \sum_{i=1}^{L} P_i (1 - P_i)^m + 1 - \sum_{i=1}^{L} P_i.$$

With a properly selected augmentation set T and sufficiently large L, Ê can be estimated by O(ε) when O(L/ε) samples are covered by the labeled components, i.e., m = O(L/ε) ⇒ Ê ⪅ O(ε) ⇒ E ≤ O(ε).

Proof. With proper control over augmentations, we assume that each component overlaps with at most one other component (this can be enforced with appropriate augmentations, and the argument generalizes to multiple components). Let x′ be a sampled example; then the probability that x′ is not covered by exactly one of the C_i's is

$$\hat{E} = P(\exists i \ne j,\ x' \in C_i \cap C_j) + P(x' \text{ is uncovered}) = \sum_{i=1}^{L} P_i (1 - P_i)^m + 1 - \sum_{i=1}^{L} P_i.$$

With sufficiently large L, we can also obtain a component set that covers the entire dataset, leading to Σ_i P_i = 1. It then remains to find the maximum value of Σ_{i=1}^L P_i (1 − P_i)^m to bound the error term, via the following optimization objective:

$$\min_{\{P_i\}}\ -\sum_i P_i (1 - P_i)^m, \qquad \text{s.t.}\ \sum_i P_i = 1.$$

By the KKT conditions, the maximum is attained when every P_i is set to 1/L, i.e., Ê ⪅ (1 − 1/L)^m. With m = O(L/ε) and sufficiently large L, we have

$$\hat{E} \lessapprox \exp\Bigl(-\frac{m}{L}\Bigr) = \exp\Bigl(-O\Bigl(\frac{1}{\varepsilon}\Bigr)\Bigr) \le O(\varepsilon).$$
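Two steps of the proof above, that the uniform allocation P_i = 1/L maximizes the coverage term, and that (1 − 1/L)^m ≈ e^{−m/L}, can be checked numerically. A sanity-check sketch with illustrative values L = 20, m = 10 (chosen so that all typical P_i fall in the concave region of p(1 − p)^m):

```python
import math
import random

def coverage_term(P, m):
    """First term of the bound: sum_i P_i * (1 - P_i)^m."""
    return sum(p * (1 - p) ** m for p in P)

def random_simplex(L, rng):
    """A random probability vector of length L (normalized weights)."""
    w = [rng.random() for _ in range(L)]
    s = sum(w)
    return [x / s for x in w]

L, m = 20, 10
# Under sum P_i = 1, the uniform allocation collapses the term to (1 - 1/L)^m.
uniform_val = coverage_term([1 / L] * L, m)
```

Random points on the simplex never exceed the uniform value, consistent with the KKT argument, and the uniform value itself is close to e^{−m/L}, the estimate used in Theorems 1 and 2.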
With the conditions in Theorem 1, it remains to consider the approximate boost provided by the reduction of the upper bound Ê:

$$\Delta\hat{E}(\Delta m, P_{L+1}) = \hat{E}(D_L, f_\theta, T) - \hat{E}(D_L \cup \{x_{L+1}\}, f_\theta, T) = \sum_{i=1}^{L} P_i (1-P_i)^m \bigl[1-(1-P_i)^{\Delta m}\bigr] - P_{L+1}\bigl[1-(1-P_{L+1})^{m+\Delta m}\bigr],$$

where ∆m is the number of samples newly covered after labeling x_{L+1}.

Theorem 2. With the conditions given in Theorem 1, the maximal value of the error-bound reduction ∆Ê(∆m, P_{L+1}) with newly-annotated samples can be estimated as

$$\Delta\hat{E}(\Delta m, P_{L+1}) \lessapprox \hat{E}\bigl(1 - K e^{-m/L}\bigr), \qquad K = \frac{m + L\bigl(\log(L+U) - \log L - 1\bigr)}{L+U},$$

where U denotes the number of unlabeled samples.

Proof. Under this setting, P_{L+1} is approximately proportional to ∆m when no unnecessary overlap appears across components (guaranteed by Theorem 1). Therefore, we can estimate P_{L+1} ≈ ∆m/(L+U). With these conditions, we estimate the relative error reduction as

$$\frac{\Delta\hat{E}}{\hat{E}} = \frac{\sum_{i=1}^{L} P_i (1-P_i)^m \bigl[1-(1-P_i)^{\Delta m}\bigr] - P_{L+1}\bigl[1-(1-P_{L+1})^{m+\Delta m}\bigr]}{\sum_{i=1}^{L} P_i (1-P_i)^m + 1 - \sum_{i=1}^{L} P_i} \lessapprox 1 - \Bigl(1-\tfrac{1}{L}\Bigr)^{\Delta m} - e^{-m/L}\,\frac{\Delta m}{L+U}\Bigl[1 - \Bigl(1-\tfrac{\Delta m}{L+U}\Bigr)^{m+\Delta m}\Bigr].$$

Since m is large with sufficient labeled samples, we can further estimate this term as

$$\frac{\Delta\hat{E}}{\hat{E}} \lessapprox 1 - \Bigl(1-\tfrac{1}{L}\Bigr)^{\Delta m} - e^{-m/L}\,\frac{\Delta m}{L+U}.$$

The maximum value is attained when ∆m reaches

$$\Delta m^* = \frac{1}{\log\bigl(1-\tfrac{1}{L}\bigr)}\Bigl[-\tfrac{m}{L} - \log(L+U) - \log\bigl(-\log\bigl(1-\tfrac{1}{L}\bigr)\bigr)\Bigr] \approx L\Bigl[\tfrac{m}{L} + \log(L+U) - \log L\Bigr] = m + L\bigl(\log(L+U) - \log L\bigr).$$

Then

$$\frac{\Delta\hat{E}}{\hat{E}} \approx 1 + \frac{1}{L+U}\Bigl[\frac{1}{\log\bigl(1-\tfrac{1}{L}\bigr)} - \Delta m^*\Bigr] e^{-m/L} \approx 1 - \frac{m + L\bigl(\log(L+U) - \log L - 1\bigr)}{L+U}\, e^{-m/L}.$$

B.1 IMPLEMENTATION DETAILS

We conduct experiments on four benchmark datasets: FashionMNIST, SVHN, CIFAR-10, and CIFAR-100. We construct a random initial dataset with 100 instances for FashionMNIST, SVHN, and CIFAR-10, and 1,000 instances for CIFAR-100.

Since CAMPAL locates feasible augmentations guided by their strength, we also compare CAMPAL with RandAugment (Cubuk et al., 2020). To show the effectiveness of the separate control of unlabeled/labeled data in CAMPAL, we trained RandAugment on the labeled data within each AL cycle, then applied the optimized augmentation to both the labeled pool and the unlabeled pool. As shown in Table 6, CAMPAL outperforms RandAugment, indicating the superiority of the separate control. It should be noted that RandAugment is originally designed for training over fully labeled data, but under the AL setting it must be tuned on a labeled pool with limited samples. Directly adopting RandAugment for AL is therefore ill-suited: it can be heavily biased towards the limited labeled data, contributing little to distribution enrichment on the unlabeled data.

B.5 ABLATION STUDIES OVER TYPES OF AUGMENTATIONS

The impact of each single-image augmentation operator on CAMPAL. To further examine the contribution of each augmentation, we also provide results when each augmentation is separately applied to CAMPAL with different strengths, shown in Table 7 on CIFAR-10 with CAMPAL DENSITY Entropy. The impact of different types of single-image augmentations varies. An interesting observation is that different augmentation operators do not contribute equally at different AL cycles. For example, Sharpness performs better than Rotate when N L = 500, but underperforms Rotate when N L = 2000. This reveals a sophisticated mechanism behind the benefit of these augmentation operators for AL. However, the theory of why data augmentation works has not been fully revealed to date, making it difficult to pick the optimal augmentation type in a principled way. Hence, we adopt a simple strategy that uniformly selects and stacks these operators to enjoy their mixed benefits for AL.

Effect of single-image augmentations and mix-up. To verify the efficacy of including both single-image augmentations and image-mixing in one query batch, we further explore the effect of these two kinds of augmentations separately. We conduct experiments over two variants of CAMPAL that each use only one type of augmentation, i.e., single-image augmentations or MixUp. The tests are performed with a ResNet-18 model on 4% (2,000 samples) of CIFAR-10. For fairness, when only one kind of augmentation is used, we generate 15 augmented samples of that type. In Table 8, we see a consistent performance boost when using both kinds of augmentations over Entropy (∆ 1.03), LC (∆ 0.78), Margin (∆ 0.94), Coreset (∆ 1.58), and BADGE (∆ 1.93). In conclusion, integrating both single-image augmentations and image-mixing better unleashes the potential information of each sample than either does separately.

Effect of λ 1 , λ 2 in the virtual loss term.
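The "uniformly select and stack" strategy amounts to sampling operators with replacement and composing them, with the stack depth playing the role of strength. The sketch below uses simplified numeric stand-ins for the operators, not the PIL-based ones from Table 5:

```python
import random

# Simplified stand-ins for single-image operators on a normalized pixel list;
# CAMPAL uses the full image-level operator set listed in Table 5.
OPS = [
    lambda x: list(reversed(x)),                # flip (1-D stand-in)
    lambda x: [min(1.0, v * 1.18) for v in x],  # brightness enhancement
    lambda x: [v ** 0.5 for v in x],            # sharpness/gamma stand-in
]

def stacked_augment(x, strength, rng):
    """Uniformly pick `strength` operators (with replacement) and apply
    them in sequence -- deeper stacks give stronger augmentations."""
    for op in rng.choices(OPS, k=strength):
        x = op(x)
    return x

img = [i / 9 for i in range(10)]                # toy normalized "image"
aug = stacked_augment(img, strength=3, rng=random.Random(0))
assert len(aug) == len(img)
assert all(0.0 <= v <= 1.0 for v in aug)        # ops preserve the value range
```

Each operator here maps [0, 1] values back into [0, 1], so arbitrary stacking stays well-defined.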
To optimize m l , i.e., the strength of augmentations performed over labeled samples, we use λ 1 , λ 2 to trade off the impact of single-image augmentations and image-mixing. We dive deeper into this scheme by applying different combinations of λ 1 and λ 2 , shown in Table 9.

Comparison with auxiliary unsupervised architectures. Recall that several studies involve unlabeled samples in training auxiliary networks to assist querying (Sinha et al., 2019; Zhang et al., 2020; Kim et al., 2021a; Caramalau et al., 2021), which inevitably brings high computational costs. We claim that data augmentations are sufficient to strengthen the acquisition process without the extra cost of unsupervised training. To verify this, we compare the running time and performance of augmentation-based strategies against those utilizing extra unsupervised architectures, shown in Table 10. Augmentation-based methods consistently outperform the other strategies while being more computationally efficient. Since active learning usually faces heavy computational costs in acquisition, data augmentation may serve as an effective tool for boosting both speed and performance at once. More importantly, this approach restricts the training process to labeled data only, reducing the need for numerous unlabeled data in AL and making AL paradigms more applicable. We also adopt augmentations on labeled samples for the methods with unsupervised representations. In detail, the difference between m u and m l is not clear in the early stages of active learning, since the model lacks the ability to uncover information from images.

C PSEUDO CODE

We summarize the pseudo-code of our CAMPAL within one active learning cycle in Algorithm 1. Its acquisition steps are:
- Generate an augmentation set T (m u ) with strength m u ;
- Deduce the enhanced acquisition h acq with T (m u ) and f θ , as shown in Section 2.3;
- Select the optimal sample batch Q according to h acq ;
- D U ← D U \ Q; D L ← D L ∪ Q.
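The cycle can be sketched in runnable form as below. The callables `train`, `virtual_loss`, `entropy`, `acquire`, and `make_aug_set` are hypothetical stand-ins for the components defined in Sections 2.2-2.3, and the prediction-preservation constraint on the m u objective is omitted for brevity:

```python
def campal_cycle(D_L, D_U, model, train, virtual_loss, entropy, acquire,
                 make_aug_set, strengths=(1, 2, 3, 4), batch_size=2):
    """One CAMPAL cycle (sketch); all callables are hypothetical stand-ins."""
    model = train(model, D_L)                       # warm up on the labeled pool
    # m_l: strength minimizing the virtual loss L_f over labeled samples
    m_l = min(strengths,
              key=lambda m: sum(virtual_loss(model, x, m) for x, _ in D_L))
    aug = [(t(x), y) for x, y in D_L for t in make_aug_set(m_l)]
    model = train(model, D_L + aug)                 # retrain with weak augmentations
    # m_u: strength maximizing the worst-case entropy over augmented views
    m_u = max(strengths,
              key=lambda m: sum(min(entropy(model, t(x))
                                    for t in make_aug_set(m)) for x in D_U))
    Q = acquire(model, D_U, make_aug_set(m_u), batch_size)  # enhanced h_acq
    return [x for x in D_U if x not in Q], D_L + [(x, None) for x in Q]
```

Newly acquired samples enter D L with a `None` placeholder label, standing in for the annotation step.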



CONCLUSIONS

In this work, we propose CAMPAL, a novel active learning framework. Based on the observation that augmentations applied to the two disparate data pools serve different goals and thus have different impacts, CAMPAL exerts appropriate control over the data augmentations integrated into active learning. We empirically find that CAMPAL attains state-of-the-art performance with a significant boost, especially with fewer labeled samples. Our theoretical analysis further justifies this difference and characterizes the reliance of AL on the quality of the introduced augmentations. In the future, we hope to generalize CAMPAL to more tasks and to investigate the impact of DA on AL in more detail.



Figure 3: Test accuracy versus the number of labeled samples over different datasets.

Figure 5: A heatmap visualization of performance boost brought by augmentations of different strengths, when attached to the labeled pool and unlabeled pool. The experiments are performed over CIFAR-10 with 2,000 labeled samples and are conducted over Entropy, LC, and Margin.

Figure 4: The learned strength on CIFAR-10, with CAMPAL instantiated with Entropy/BADGE.

Figure 6: The coverage on the data space presented by augmentations, where colored circles are labeled samples, white circles are unlabeled samples and the colored shade denotes the region covered by corresponding augmentations. A double circle denotes the unlabeled sample to be annotated. The figures above show: a) Proper augmentation for labeled samples; b) Drastic augmentation for labeled samples; c) Sub-optimal unlabeled sample with the corresponding augmentation; d) A proper unlabeled sample with the corresponding augmentation.

Algorithm 1 inputs and training steps: labeled data pool D L , unlabeled data pool D U , model f θ .
- $\theta \leftarrow \arg\min_\theta \frac{1}{|D_L|} \sum_{x \in D_L} \mathcal{L}(f_\theta(x), y)$;
- $m_l \leftarrow \arg\min_m \frac{1}{|D_L|} \sum_{x_L \in D_L} \mathcal{L}_f(x_L, m)$, where $\mathcal{L}_f$ is given in equation 2;
- Generate an augmentation set $T^{(m_l)}$ with strength $m_l$;
- $\theta \leftarrow \arg\min_\theta \frac{1}{|T^{(m_l)}(D_L)|} \sum_{x \in T^{(m_l)}(D_L)} \mathcal{L}(f_\theta(x), y)$;
- $m_u \leftarrow \arg\max_m \sum_{x_U \in D_U} \min\{H(\tilde{x}_U) \mid \tilde{x}_U \in T^{(m)}(x_U),\; f_\theta(\tilde{x}_U) = f_\theta(x_U)\}$.

Strength for Labeled Data. By involving augmentations in model training, we aim at obtaining a dependable model from limited labeled data and thereby further enhancing the acquisition process. Unlike augmentations for unlabeled data, which maximize overall informativeness, augmentations for labeled data target training stability and convergence. To exert proper control over labeled augmentations while avoiding extra training costs, we introduce a virtual loss term L f and search for the proper strength m l for labeled samples by minimizing it (equation 2).
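As a hedged illustration (equation 2 itself is not reproduced here), the trade-off between the two augmentation families via λ 1 , λ 2 might be sketched as follows, with `sample_loss` a hypothetical per-sample loss under the current model:

```python
import random

def virtual_loss(sample_loss, x, pool, T_single, T_mix,
                 lam1=1.0, lam2=1.0, rng=random):
    """Sketch of the virtual loss L_f: lam1/lam2 trade off the contribution
    of single-image views against image-mixing views of a labeled sample x.
    `sample_loss`, `pool`, and the operator sets are hypothetical inputs."""
    single = sum(sample_loss(t(x)) for t in T_single) / len(T_single)
    mixed = sum(sample_loss(g(x, rng.choice(pool))) for g in T_mix) / len(T_mix)
    return lam1 * single + lam2 * mixed

# Toy check: one additive "augmentation" of each kind, absolute value as loss
assert virtual_loss(abs, 2, [0], [lambda x: x + 1], [lambda a, b: a + b]) == 5.0
```

The strength m l would then be chosen as the m whose operator sets minimize this quantity averaged over D L, as in Algorithm 1.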

Comparison of the averaged test accuracy on benchmark datasets and different AL strategies, for N L = 500, 1,000, 1,500, and 2,000. Since CAMPAL has multiple versions, we choose the one with the best performance and denote it CAMPAL*. The best performance in each category is indicated in boldface. N L denotes the number of labeled samples.

Performance of CAMPAL with different h base and aggregation modes. The experiment is conducted over CIFAR-10 with 2,000 labeled samples.

Comparison of CAMPAL with its non-augmented counterpart under different AL strategies. ∆ indicates the performance difference.

Test accuracy of CAMPAL when UA or LA are individually applied over CIFAR-10 with 2,000 labeled samples. The results are produced over 5 different AL strategies.

; (b) representation-based methods, which search for the samples most representative of the underlying data distribution (Sener et al., 2018; Ash et al., 2020; Kim & Shin, 2022). To date, unreliable informativeness evaluation with very few samples remains a critical issue for active learning.

The list of all augmentations used in the experiments. The letter x or x* denotes a given image; U(a, b) denotes a continuous uniform distribution on the interval [a, b], and B(a, b) denotes a beta distribution with parameters a and b.

Comparison of the averaged test accuracy when each type of augmentation is separately integrated into CAMPAL, for N L = 500, 1,000, 1,500, and 2,000. Each experiment is run on CIFAR-10 with 2,000 samples annotated at the last cycle and repeated 5 times. N L denotes the number of labeled samples.

Test accuracy of CAMPAL when integrated with different combinations of single-image augmentations and the MixUp.


Augmentation operators (fragment of Table 5):
- AutoContrast(x): maximizes the (normalized) image contrast;
- Brightness(x, v), v ∼ U(1, 1.18) an enhancing factor: enhances the brightness of a given image;
- Color(x, v), v ∼ U(1, 1.18) an adjustment factor: adjusts the color balance of a given image;
- Contrast(x, v), v ∼ U(1, 1.18): enhances the contrast of a given image;
- CutOut(x, v), v ∼ U(0.…): …;
- MixUp(x, x*, λ), λ the mixing ratio: mixes the two given images.

The initial labeled sets contain 100 instances for FashionMNIST, SVHN, and CIFAR-10, and 1,000 instances for CIFAR-100. We then acquire 100 instances for FashionMNIST, SVHN, and CIFAR-10, and 500 instances for CIFAR-100 at each cycle, and repeat the cycle 20 times. We generate 10 single-image augmentations and 5 mix-up augmentations for each sample. We normalize the images with the channel mean and standard deviation for all the datasets. For CIFAR-10 and CIFAR-100, we apply a standard augmentation after conducting the augmentations in the pipeline. We adopt ResNet-18 as the architecture and train the model for 300 epochs with an SGD optimizer with learning rate 0.01, momentum 0.9, and weight decay 5e-4. For the virtual loss term in equation 2, we set λ 1 = λ 2 = 1.

B.2 IMPLEMENTATION DETAILS FOR THE SIMPLE APPLICATION OF DA FOR AL IN FIGURE 1

We integrate DA into AL with a fixed augmentation set T as follows. The experiment is conducted on CIFAR-10 with a ResNet-18 architecture, and the base acquisition is Max Entropy. First, we augment the labeled pool with T and train the classifier f θ accordingly. Then we augment the unlabeled pool with T and perform acquisitions directly on the augmented unlabeled pool. Other settings are the same as in the main empirical experiments.
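The Max Entropy acquisition at the core of this baseline can be sketched as follows; the predictive distributions below are toy values, not model outputs:

```python
import math

def max_entropy_acquire(probs_by_sample, k):
    """Pick the k samples whose predictive distribution has the highest
    Shannon entropy -- the Max Entropy acquisition used in this baseline."""
    def H(p):
        return -sum(q * math.log(q) for q in p if q > 0)
    ranked = sorted(probs_by_sample, key=lambda item: H(item[1]), reverse=True)
    return [idx for idx, _ in ranked[:k]]

# Toy predictive distributions over 3 classes for 4 unlabeled samples
probs = [(0, [0.98, 0.01, 0.01]),   # confident -> low entropy
         (1, [0.34, 0.33, 0.33]),   # near-uniform -> highest entropy
         (2, [0.70, 0.20, 0.10]),
         (3, [0.50, 0.50, 0.00])]
assert max_entropy_acquire(probs, 2) == [1, 2]
```

In the augmented variant, each entry would be the model's distribution on an augmented view of the sample rather than on the raw image.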

B.3 AUGMENTATIONS INCLUDED

The details of the 19 augmentations used in CAMPAL, with their parameters, are shown in Table 5. In brief, the augmentations we use can be categorized into single-image augmentations and image-mixing. Formally, we provide an augmentation functional set that covers (i) singular input augmentations, such as rotation, for low-level image processing. The corresponding functional set is denoted by T single = {ω(x; λ)}, where ω points to an instantiated augmentation function. The sample x is taken as an input to ω together with varying augmentation hyper-parameters λ, such as the angle in the image rotation function. Similarly, we also construct (ii) a combinatorial augmentation functional set, T mix = {γ(x, x′; λ)}, where the augmentation function γ takes two input samples x and x′ together with a hyper-parameter. With a slight abuse of notation, we uniformly use λ to refer to augmentation-related hyper-parameters. In the implementation of CAMPAL, we simply adopt MixUp for combinatorial augmentation. Upon fixed input, both singular and combinatorial augmentation functional sets can be arbitrarily expanded by varying λ in a continuous scalar space.

We observe phenomena similar to those described in Section 3.3, shown in Figure 7. Specifically, augmentations over unlabeled samples tend to produce better results as strength increases and peak mostly at strength 4, indicating that drastic augmentations help induce a stronger acquisition. In contrast, augmentations over labeled samples produce the best results mostly at strength 2, fitting the conclusion in Section 3.2 that weak augmentations for labeled samples are better at boosting the classifier. These phenomena show the difference between the impacts of augmentations on training and on acquisition in active learning, and further underline the importance of a combinatorial scheme in which weak and strong augmentations are strategically applied to labeled and unlabeled data, respectively.
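A minimal sketch of the two functional sets, using a quarter-turn rotation as an instance of ω and MixUp as γ; both are simplified stand-ins for the Table 5 operators:

```python
def omega_rotate(x, lam):
    """Singular augmentation ω(x; λ): rotate a 2-D grid by λ quarter-turns
    (a discrete stand-in for continuous-angle rotation)."""
    for _ in range(int(lam) % 4):
        x = [list(row) for row in zip(*x[::-1])]
    return x

def gamma_mixup(x, x2, lam):
    """Combinatorial augmentation γ(x, x'; λ): MixUp with mixing ratio λ."""
    return [[lam * a + (1 - lam) * b for a, b in zip(r1, r2)]
            for r1, r2 in zip(x, x2)]

# Expanding the functional sets by varying λ, as described above
T_single = [lambda x, l=l: omega_rotate(x, l) for l in (1, 2, 3)]
T_mix = [lambda x, y, l=l: gamma_mixup(x, y, l) for l in (0.25, 0.5, 0.75)]

a = [[0.0, 1.0], [0.25, 0.75]]
b = [[1.0, 0.0], [0.75, 0.25]]
assert T_mix[1](a, b) == [[0.5, 0.5], [0.5, 0.5]]     # λ = 0.5 blend
assert T_single[1](a) == [[0.75, 0.25], [1.0, 0.0]]   # 180° rotation
```

Varying λ over a continuous range (angle, mixing ratio) is what makes both sets arbitrarily expandable for a fixed input.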

