BETTER OPTIMIZATION CAN REDUCE SAMPLE COMPLEXITY: ACTIVE SEMI-SUPERVISED LEARNING VIA CONVERGENCE RATE CONTROL

Abstract

Reducing the sample complexity of deep learning (DL) has remained one of the most important problems, in both theory and practice, since its advent. Semi-supervised learning (SSL) tackles this task by leveraging unlabeled instances, which are usually more accessible than their labeled counterparts. Active learning (AL) directly seeks to reduce the sample complexity by training a classification network and querying unlabeled instances to be annotated by a human-in-the-loop. Under relatively strict settings, both SSL and AL have been shown to theoretically achieve the same performance as fully-supervised learning (SL) using far fewer labeled samples. While empirical works have shown that SSL attains this benefit in practice, DL-based AL algorithms have yet to demonstrate success to the extent achieved by SSL. Given the accessible pool of unlabeled instances in pool-based AL, we argue that the annotation efficiency of AL algorithms that seek diversity among labeled samples can be improved upon when SSL is used as the training scheme. Equipped with a few theoretical insights, we design an AL algorithm that instead focuses on controlling the convergence rate of a classification network by actively querying instances whose inclusion in the labeled set improves the rate of convergence. We name this AL scheme convergence rate control (CRC), and our experiments show that a deep neural network trained using a combination of CRC and a recently proposed SSL algorithm can quickly achieve high performance using far fewer labeled samples than SL. In contrast to the few works combining independently developed AL and SSL algorithms (ASSL), our method is a natural fit for ASSL, and we hope our work catalyzes research that combines AL and SSL rather than excluding either.

1. INTRODUCTION

The data-hungry nature of supervised deep learning (DL) algorithms has spurred interest in active learning (AL), where a model can interact with a dedicated annotator and request that unlabeled instances be labeled. In the pool-based AL setting, a model initially has access to a set of unlabeled samples and can query instances that need to be labeled for training. Under certain conditions on the task, AL can provably achieve up to exponential improvements in sample complexity and thus has great potential for reducing the number of labeled instances required to achieve high accuracy. This is especially important when the annotation task is extremely costly, for example, in medical imaging, where only highly specialized experts can diagnose a subject's condition. Active learning algorithms have been extensively explored, with formulations including uncertainty-based sampling (Wang & Shang, 2014), aligning the labeled and unlabeled distributions (Gissin & Shalev-Shwartz, 2019) with connections to domain adaptation (Ben-David et al., 2010), and coreset approaches (Sener & Savarese, 2018). Furthermore, there is no standard method of modeling a deep neural network's (DNN) uncertainty, and uncertainty-based AL has its own variants, ranging from utilizing Bayesian networks (Kirsch et al., 2019) to using a model's predictive confidence (Wang & Shang, 2014). This ambiguous characterization of how much information a sample's label carries has also motivated AL algorithms based on maximizing the expected change of a classification model (Huang et al., 2016; Ash et al., 2020).
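To make the uncertainty-based family concrete, the sketch below selects the unlabeled samples whose softmax predictions have the highest entropy, in the spirit of predictive-confidence sampling (Wang & Shang, 2014). The function name and interface are our own illustrative choices, not taken from any cited work:

```python
import numpy as np

def entropy_query(probs: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k unlabeled samples with highest
    predictive entropy.

    probs: array of shape (n_unlabeled, n_classes) holding the current
    model's softmax outputs. Illustrative sketch of uncertainty-based
    sampling only.
    """
    eps = 1e-12  # guard against log(0)
    ent = -np.sum(probs * np.log(probs + eps), axis=1)
    return np.argsort(-ent)[:k]  # k most uncertain samples first

# A near-uniform prediction (high entropy) is queried before confident ones.
probs = np.array([[0.98, 0.01, 0.01],
                  [0.34, 0.33, 0.33],
                  [0.80, 0.10, 0.10]])
queried = entropy_query(probs, k=1)
```

Here `queried` contains the index of the near-uniform row, since a flat predictive distribution carries the least confidence under this criterion.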

Algorithm 1 Active Semi-Supervised Learning

for each query iteration i = 1, ..., N do
    Train a classifier f_θ using some SSL algorithm on (X_L, X_U) until convergence.
    Retrieve unlabeled samples X*_u ⊂ X_U using Alg. 2 and obtain their labels.
    Update the labeled and unlabeled pools: X_L ← X_L ∪ X*_u, X_U ← X_U \ X*_u.
end for

While AL comes with optimistic potential, most algorithms outperform random sampling (passive learning) by only a small margin, with follow-up works (Gissin & Shalev-Shwartz, 2019; Sener & Savarese, 2018; Ducoffe & Precioso, 2018) reporting that certain AL algorithms perform worse than random sampling due to their dependency on specific model architectures or dataset characteristics. Furthermore, the performance of AL algorithms is usually reported by training a model with supervised learning (SL) on the queried labeled data, despite the availability of unlabeled data in pool-based AL. Semi-supervised learning (SSL) has recently shown impressive performance with a small number of labeled instances, and its earliest variant, known as pseudo-labeling (Lee, 2013), has been combined with AL algorithms (Wang et al., 2017). One recent work (Song et al., 2019) uses a more modern SSL algorithm (Berthelot et al., 2019) and shows the strength of combining AL with SSL, which we name ASSL. However, their AL algorithm is not designed specifically with the ASSL setting in mind. In this work, we propose a novel query strategy that blends naturally with SSL, and show that the resulting AL algorithm can rapidly approach the high performance of fully-supervised algorithms using fewer labeled samples. Our algorithm is inspired by recent developments in DNN theory, namely the neural tangent kernel (NTK) (Jacot et al., 2018). Experimental comparisons with diversity-seeking strategies and an algorithm with an objective similar to ours demonstrate how labeling instances based on our objective helps SSL attain high performance in a sample-efficient manner.
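The loop in Alg. 1 can be sketched as a minimal runnable driver. The components `train_ssl` (the SSL trainer), `query` (standing in for Alg. 2), and `oracle` (the human annotator) are placeholders we introduce for illustration; any concrete ASSL instantiation would supply its own:

```python
def active_ssl_loop(X_L, X_U, n_iters, batch, train_ssl, query, oracle):
    """Toy driver for Alg. 1 (active semi-supervised learning).

    X_L holds (instance, label) pairs; X_U holds unlabeled instances.
    train_ssl, query, and oracle are hypothetical stand-ins for the SSL
    trainer, Alg. 2, and the human-in-the-loop annotator, respectively.
    """
    model = None
    for _ in range(n_iters):
        model = train_ssl(X_L, X_U)            # SSL training until convergence
        X_star = query(model, X_U, batch)      # Alg. 2: pick instances to label
        X_L = X_L | {(x, oracle(x)) for x in X_star}  # annotate, grow labeled pool
        X_U = X_U - X_star                     # shrink unlabeled pool
    return model, X_L, X_U

# Minimal instantiation with stub components, purely to exercise the loop.
train_ssl = lambda L, U: "model"              # stub SSL trainer
query = lambda m, U, b: set(sorted(U)[:b])    # stub for Alg. 2
oracle = lambda x: x % 2                      # stub annotator
model, L, U = active_ssl_loop(set(), set(range(10)), n_iters=3, batch=2,
                              train_ssl=train_ssl, query=query, oracle=oracle)
```

After three query iterations with a batch size of two, six instances have moved from the unlabeled pool to the labeled pool, mirroring the pool updates in Alg. 1.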

2.1. SAMPLE EFFICIENCIES OF ACTIVE LEARNING AND SEMI-SUPERVISED LEARNING

Here we informally describe some theoretical results on the superiority of AL over passive learning and of SSL over SL in terms of labeled sample complexity, that is, the number of labeled instances needed to attain ε-classification error. Because pool-based AL presupposes a pool of unlabeled instances, it makes much more sense to perform SSL on the readily available unlabeled instances when training a classification network between query iterations (Alg. 1). This section describes how AL and SSL algorithms can consider different objectives to attain higher accuracy in the ASSL setting. It is well known that SL can find an ε-optimal classifier from a sufficiently rich class of hypotheses (e.g., classifiers realizable by deep neural networks) using Θ̃(1/ε) i.i.d. samples in the separable case and Θ̃(1/ε²) in the agnostic case (Massart & Nédélec, 2006; Vapnik & Izmailov, 2015)¹. In contrast, actively adding instances to the (labeled) training set can sometimes improve the sample complexity exponentially, from O(1/ε) to O(poly log(1/ε)) (Balcan et al., 2010). When unlabeled instances are drawn i.i.d., Balcan & Urner (2016) showed how labeling samples selected via binary search over Õ(1/ε) unlabeled instances achieves an exponential improvement over passive learning. Göpfert et al. (2019) constructed examples showing how an SSL algorithm can use unlabeled instances to significantly improve the labeled sample complexity for a rather restricted class of data distributions. A corollary of one such example is that O(log(1/ε)) labeled samples and O(1/ε²) unlabeled samples suffice to obtain ε-error.
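The labeled sample complexities cited in this section can be collected in one place (ε is the target classification error; tilde notation hides logarithmic factors, and per-result assumptions are as stated above):

```latex
\begin{align*}
\text{SL:}\quad
  & \tilde{\Theta}\!\left(1/\epsilon\right)\ \text{(separable)},\quad
    \tilde{\Theta}\!\left(1/\epsilon^{2}\right)\ \text{(agnostic)} \\
\text{AL:}\quad
  & O\!\left(\operatorname{poly}\log(1/\epsilon)\right)
    \quad \text{(Balcan et al., 2010)} \\
\text{SSL:}\quad
  & O\!\left(\log(1/\epsilon)\right)\ \text{labeled},\
    O\!\left(1/\epsilon^{2}\right)\ \text{unlabeled}
    \quad \text{(G\"{o}pfert et al., 2019)}
\end{align*}
```

This side-by-side view motivates the ASSL perspective: the unlabeled pool can serve the SSL complexity bound while the query strategy pursues the AL bound.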
Considering a target error ε ≈ 6% achieved by fully-supervised learning on CIFAR10 using O(1/ε) ≈ 50,000 labeled samples, a training set with only O(poly log(1/ε)) ≈ 250 labeled samples, leaving the remaining O(1/ε) ≈ 49,750 samples unlabeled, is analogous to the aforementioned sample complexities achievable in the respective AL and SSL settings, although their assumptions may not be satisfied. Instead of focusing on the same objective for both AL and SSL to attain the potential exponential sample-complexity improvements above, we suggest using an AL algorithm that queries instances to



¹ Here we slightly abuse the Θ notation to denote upper and lower bounds matching up to logarithmic factors.

