ACTIVE LEARNING WITH CONTROLLABLE AUGMENTATION INDUCED ACQUISITION

Abstract

The mission of active learning is to iteratively identify the most informative data samples to annotate, and thereby to attain decent performance with far fewer labeled samples. Despite this promise, the acquisition of informative unlabeled samples can be unreliable, particularly during early cycles, owing to limited data samples and sparse supervision. To tackle this, data augmentation techniques appear straightforward yet promising for easily extending the exploration of the input space. In this work, we thoroughly study the coupling of data augmentation and active learning, whereby we propose the Controllable Augmentation ManiPulator for Active Learning (CAMPAL). In contrast to the few prior works that touched on this line, CAMPAL emphasizes a tighter and better-controlled integration of data augmentation into the active learning framework, in three folds: (i) carefully designed data augmentation policies applied separately to the labeled and unlabeled data pools in every cycle; (ii) controlled and quantifiably optimizable augmentation strengths; (iii) full yet flexible coverage of most (if not all) active learning schemes. Through extensive empirical experiments, we bring the performance of active learning methods to a new level: an absolute performance boost of 16.99% on CIFAR-10 and 12.25% on SVHN with 1,000 annotated samples. Complementary to the empirical results, we further provide theoretical analysis and justification of CAMPAL.

1. INTRODUCTION

The acquisition of labeled data serves as a foundation for the remarkable successes of deep supervised learning over the last decade, yet it also incurs great monetary and time costs. Active learning (AL) is a pivotal learning paradigm that puts the data acquisition process into the loop of learning, locating the most informative and valuable data samples for annotation (Settles, 2009; Zhang et al., 2020; Kim et al., 2021a; Wu et al., 2021). With much-lowered sample complexity but comparable performance relative to its fully supervised counterpart, active learning is widely used in real-world applications and ML production systems (Bhattacharjee et al., 2017; Feng et al., 2019; Hussein et al., 2016).

In spite of its practicality, active learning often suffers from unreliable data acquisition, especially in the early stages. Notably, the models obtained around the early stages are generally raw and undeveloped, owing to the insufficient data curated and the sparse supervision signal consumed, and each subsequent query cycle is based on the model produced by the current cycle. While this problem can be mitigated after adequate cycles are conducted, we argue that the problems at the early stages of AL cannot be overlooked. Indeed, a few works have resorted to data augmentation techniques to generate additional data examples for data distribution enrichment, e.g., GAN-based (Tran et al., 2019) and STN-based (Kim et al., 2021b) methods.

In this work, we take a further step in investigating the role of data augmentation (DA) for AL. To begin with, we provide a straightforward quantitative observation in Figure 1. The setup behind these results is rather simple: we directly apply vanilla DA operations, such as flipping and rotation, to data samples and linearly increase the augmentation strength. We may conclude from these scores as follows.
First, even simple augmentation loosely integrated into AL leads to surprisingly enhanced results, despite the complicated designs of prior integrations. Second, and perhaps more important, we make a counterfactual observation: the same augmentation policy applied to different data pools manifests notably different impacts. As shown in Figure 1, when gradually stacking augmentation operations, the labeled and unlabeled data pools achieve their best performance at different levels of augmentation strength. In hindsight, we offer the following reasoning. To fully and tightly incorporate DA into AL schemes, augmentation ought to serve different objectives on the labeled and unlabeled pools. In particular, the labeled pool favors label-preserving augmentation in order to obtain a strong and reliable classifier. By contrast, the unlabeled pool may require relatively more aggressive augmentation so as to maximally gauge the unexplored distribution. The phenomenon in Figure 1 preliminarily validates this reasoning. Notably, this counterfactual observation has not been studied or investigated by prior works (Tran et al., 2019; Gao et al., 2020; Kim et al., 2021b). Motivated by it, we propose the Controllable Augmentation ManiPulator for Active Learning (CAMPAL). Core to our method is a purposely designed, better-controlled, and tighter integration of data augmentation into active learning. With CAMPAL, we aim to fill this integration gap and unlock the full potential of data augmentation within active learning schemes.
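The "stacking" protocol behind Figure 1 can be sketched as follows. This is a minimal illustration, not the paper's exact policy search space: the operation pool (flip, rotation, shift) and the convention that strength equals the number of stacked operations are our assumptions for demonstration.

```python
import random
import numpy as np

# Hypothetical pool of vanilla augmentation operations; illustrative only.
def hflip(img):
    return img[:, ::-1]

def rot90(img):
    return np.rot90(img)

def shift(img):
    return np.roll(img, 1, axis=0)

OPS = [hflip, rot90, shift]

def stack_augment(img, strength, rng=None):
    """Apply `strength` randomly chosen operations in sequence; stacking
    more operations corresponds to a linearly stronger augmentation."""
    rng = rng or random.Random(0)
    out = img
    for _ in range(strength):
        out = rng.choice(OPS)(out)
    return out

img = np.arange(16.0).reshape(4, 4)
weak = stack_augment(img, strength=1)    # mild: one operation
strong = stack_augment(img, strength=3)  # aggressive: three stacked ops
```

Under this convention, sweeping `strength` from 0 upward reproduces the linear increase probed in Figure 1, and the sweep can be run separately on the labeled and unlabeled pools.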
In particular, CAMPAL integrates several mechanisms into the whole AL framework:
• CAMPAL constructs separate augmentation flows on the labeled and unlabeled data pools, each toward its own objective;
• CAMPAL composes a strength optimization procedure for the applied augmentation policies;
• CAMPAL complies with most common active learning schemes, with carefully designed acquisition functions for both score- and representation-based methods.
Besides the theoretical justification of CAMPAL offered in Section 4, we conduct extensive experiments and analyses on our approach. The empirical results of CAMPAL are stunning: a 16.99% absolute improvement at the 1,000-sample cycle and a 13.34% lead with 2,000 samples on CIFAR-10, compared with the previously best methods. We postulate that these significantly enhanced results may greatly extend the boundary of active learning research.

2. METHODOLOGY

In this section, we describe CAMPAL in detail. CAMPAL is chiefly composed of two components. On one hand, CAMPAL formulates a decoupled optimization workflow to locate feasible augmentations for the labeled and unlabeled data pools under distinct optimization objectives; this difference is ultimately manifested in their augmentation strengths (Section 2.2). On the other hand, CAMPAL aggregates the information provided by properly controlled augmentations through modified acquisition functions (Section 2.3), so as to be adaptable to most (if not all) active learning schemes. Hence, we may posit that CAMPAL forms a much tighter integration of DA and AL, owing not only to its controllable mechanism on both data pools but also to its full adaptability across common active learning schemes. The framework of CAMPAL is summarized in Figure 2. Based on a fully trained classifier f_θ that assigns a label to each data point, a data acquisition function h_acq(x, f_θ) : D_U → R calculates a score for each data instance. We also use P(y|x; f_θ) to denote the probabilistic label distribution of x given by f_θ. AL then selects the most informative sample batch and updates the labeled set accordingly. In the remainder of this paper, we omit the parameter f_θ in h_acq when the reliance of acquisitions on the classifier is clear.
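As a concrete instance of a score-based acquisition function h_acq(x, f_θ) : D_U → R, the following sketch uses predictive entropy over P(y|x; f_θ) and selects the top-b scoring unlabeled samples. The linear classifier and toy data are stand-ins of our own, not part of CAMPAL.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def h_acq_entropy(probs):
    """Score-based acquisition: higher predictive entropy means a more
    informative sample. Margin or least-confidence scores slot in here
    just as well."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

def select_batch(W, X_unlabeled, b):
    """Score every point in D_U with h_acq and return the indices of the
    b highest-scoring samples to send to the annotator."""
    probs = softmax(X_unlabeled @ W)  # stand-in for P(y|x; f_theta)
    scores = h_acq_entropy(probs)
    return np.argsort(-scores)[:b]

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3))        # toy linear classifier as f_theta
X_u = rng.normal(size=(100, 8))    # toy unlabeled pool D_U
query_idx = select_batch(W, X_u, b=10)
```

The selected indices would then be labeled by the oracle and moved from D_U to D_L before the next training cycle.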



Figure 1: A visualization of data augmentation and the corresponding performance change as we stack augmentations over images when integrating them into active learning cycles. We test three cases where augmentations are applied to 1) unlabeled samples only; 2) labeled samples only; 3) both. Details of the experimental setup can be found in Appendix B.2.

2.1 SETUP AND DEFINITIONS

Active Learning. The problem of active learning (AL) is defined with the following setup. Consider D ⊂ R^d as the underlying dataset, consisting of a labeled data pool D_L and an unlabeled data pool D_U with |D_U| ≫ |D_L|.
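The pool-based setup above can be summarized as a loop skeleton: train on D_L, score D_U, move the top-b samples into D_L, and repeat. The `train` and `acquire` callables below are hypothetical placeholders of our own, not interfaces defined by the paper.

```python
import numpy as np

def active_learning_loop(X, init_labeled, n_cycles, b, train, acquire):
    """Skeleton of the cycle over D = D_L u D_U (with |D_U| >> |D_L|):
    `train` fits f_theta on D_L; `acquire` returns the indices of the
    b highest-scoring unlabeled points under h_acq."""
    labeled = list(init_labeled)
    for _ in range(n_cycles):
        unlabeled = [i for i in range(len(X)) if i not in set(labeled)]
        model = train(X[labeled])                 # fit f_theta on D_L
        picked = acquire(model, X, unlabeled, b)  # top-b of h_acq over D_U
        labeled.extend(picked)                    # oracle annotates; D_L grows
    return labeled

# Toy stand-ins so the skeleton runs end to end.
train = lambda X_l: None
acquire = lambda model, X, unl, b: unl[:b]

X = np.zeros((50, 4))
final = active_learning_loop(X, init_labeled=[0, 1], n_cycles=3,
                             b=5, train=train, acquire=acquire)
```

After three cycles with b = 5, the labeled pool grows from 2 to 17 samples; CAMPAL's contributions in Sections 2.2 and 2.3 operate inside the `train` and `acquire` steps of this loop.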

