A SIMPLE YET POWERFUL DEEP ACTIVE LEARNING WITH SNAPSHOT ENSEMBLES

Abstract

Given an unlabeled pool of data and experts who can label them, active learning aims to build an agent that effectively selects which data to query the experts about, maximizing the performance gain when the model is trained with them. While there are several principles for active learning, a prevailing approach is to estimate the uncertainties of predictions for unlabeled samples and use them to define acquisition functions. Active learning under the uncertainty principle works well for deep learning, especially for large-scale image classification with deep neural networks. Still, it is often overlooked how the uncertainty of predictions is estimated, despite common findings on the difficulty of accurately estimating the uncertainties of deep neural networks. In this paper, we highlight the effectiveness of snapshot ensembles for deep active learning. Compared to previous approaches based on Monte-Carlo dropout or deep ensembles, we show that a simple acquisition strategy based on uncertainties estimated from parameter snapshots gathered along a single optimization path significantly improves the quality of the acquired samples. Based on this observation, we further propose an efficient active learning algorithm that maintains a single learning trajectory throughout the entire sequence of active learning episodes, unlike existing algorithms that train models from scratch for every episode. Through extensive empirical comparisons, we demonstrate the effectiveness of snapshot ensembles for deep active learning.

1. INTRODUCTION

The progress of deep learning is largely driven by data, and we often work with well-curated, labeled benchmark data for model development. In practice, however, such nicely labeled data are rarely available. Much of the data accessible to practitioners is unlabeled, and more importantly, labeling it incurs costs due to the human effort involved. Active Learning (AL) can narrow the gap between the ideal and real-world scenarios by selecting informative samples from the unlabeled pool of data so that, after labeling and training with them, a model can maximally improve its performance. The main ingredient of an AL algorithm is the acquisition function, which ranks the samples in an unlabeled pool with respect to their utility for improvement. While there are several possible design principles (Ren et al., 2021), in this paper, we mainly focus on acquisition functions based on the uncertainty of predictions. Intuitively, given a model trained with the data acquired so far, an unlabeled example exhibiting high predictive uncertainty under the model is a "confusing" sample that would substantially improve the model if its label were acquired from experts and used for training. A popular approach in this line is Bayesian Active Learning by Disagreement (BALD) (Houlsby et al., 2011), where a committee of multiple models makes predictions for an unlabeled sample and the degree of disagreement is used as a ranking factor. Here, the multiple models are usually constructed in a Bayesian fashion, and their disagreement reflects the model uncertainty about the prediction. BALD has been demonstrated to scale well to modern deep neural networks on high-dimensional, large-scale data (Gal et al., 2017). Similar to BALD, many uncertainty-based AL algorithms employ a committee of models to estimate the uncertainty of predictions.
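As a concrete illustration of the committee idea, the BALD score is the mutual information between the predicted label and the committee member, i.e., the entropy of the averaged prediction minus the average entropy of each member's prediction. The following NumPy sketch (array shapes and names are our own, not taken from the paper) shows how the score would be computed from a stack of committee predictions:

```python
import numpy as np

def bald_scores(probs):
    """BALD acquisition: mutual information between the predicted label
    and the committee member, estimated from M committee models.

    probs: array of shape (M, N, K) -- class probabilities from M
    committee members for N unlabeled samples over K classes.
    Returns N scores; higher means more disagreement among members.
    """
    eps = 1e-12
    mean_p = probs.mean(axis=0)                                   # (N, K)
    # Entropy of the averaged prediction (total uncertainty).
    h_mean = -np.sum(mean_p * np.log(mean_p + eps), axis=-1)      # (N,)
    # Average entropy of each member's prediction (data uncertainty).
    mean_h = -np.sum(probs * np.log(probs + eps), axis=-1).mean(axis=0)
    return h_mean - mean_h                                        # model uncertainty

# Two members, two samples, two classes: the members disagree strongly
# on sample 0 but agree (and are individually unsure) on sample 1.
probs = np.array([[[0.9, 0.1], [0.5, 0.5]],
                  [[0.1, 0.9], [0.5, 0.5]]])
scores = bald_scores(probs)   # sample 0 scores high, sample 1 near zero
```

Note that BALD deliberately assigns a near-zero score to sample 1: every member is uncertain, but they all agree, so querying its label teaches the committee little about the model parameters.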
The problem is, for deep neural networks trained with high-dimensional data, it is often frustratingly difficult to estimate the uncertainty accurately. To address this, Gal et al. (2017) proposed to use Monte-Carlo DropOut (MCDO) (Gal and Ghahramani, 2016), an instance of variational approximation to the posterior and predictive uncertainty, while Rakesh and Jain (2021) suggested using more generic spike-and-slab variational posteriors (Louizos et al., 2017). Nevertheless, variational approximations tend to underestimate posterior variances (Blei et al., 2017; Le Folgoc et al., 2021), so the uncertainty-based acquisition functions computed from them may be suboptimal. Alternatively, one can employ Deep Ensembles (DE) (Lakshminarayanan et al., 2017), where a single model is trained multiple times on the same data but with different random seeds for initialization and mini-batching. Despite being simple to implement, DE works surprisingly well, surpassing most Bayesian Neural Network (BNN) alternatives in terms of accuracy and predictive uncertainty (Fort et al., 2021; Ovadia et al., 2019). In this vein, Beluch et al. (2018) highlighted the effectiveness of DE as a way to estimate uncertainty for acquisition functions and demonstrated excellent performance. A drawback of DE is that it is computationally expensive, as multiple models must be trained and maintained for inference. As an alternative, Snapshot Ensembles (SE) (Huang et al., 2017; Garipov et al., 2018) collect multiple model snapshots (checkpoints) within a single learning trajectory, rather than at the end of multiple learning trajectories as in DE. Compared to DE, SE constructs a decent set of models without going through multiple training runs, while not losing too much accuracy. Inspired by this advantage, we study the use of SE in the context of AL.
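For reference, SE (Huang et al., 2017) obtains diverse snapshots by training with a cyclic cosine learning-rate schedule: within each cycle the rate is annealed toward zero so the model settles near a local minimum, a snapshot is saved, and the rate is then restarted to escape that minimum. A minimal sketch of the schedule (the function name and arguments are illustrative):

```python
import math

def cyclic_cosine_lr(step, total_steps, num_cycles, base_lr):
    """Cyclic cosine schedule in the style of Huang et al. (2017).

    The learning rate decays from base_lr to ~0 within each cycle of
    length ceil(total_steps / num_cycles) and then restarts, so a
    snapshot taken at each cycle's end sits near a local minimum.
    """
    cycle_len = math.ceil(total_steps / num_cycles)
    t = step % cycle_len                      # position within the cycle
    return base_lr / 2 * (math.cos(math.pi * t / cycle_len) + 1)

# Example: 100 steps, 5 cycles -> snapshots would be saved at the end of
# each 20-step cycle, i.e., after steps 19, 39, 59, 79, and 99.
lrs = [cyclic_cosine_lr(t, total_steps=100, num_cycles=5, base_lr=0.1)
       for t in range(100)]
```

The ensemble prediction then averages the class probabilities of the saved snapshots, just as DE averages over independently trained models.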
Specifically, we estimate uncertainties from SE and use them to evaluate uncertainty-based acquisition functions. Through extensive empirical comparisons, we demonstrate that AL based on SE significantly outperforms existing approaches and is even comparable to or better than AL with DE. This result is somewhat surprising since SE is often reported to be less accurate than DE (Ashukha et al., 2020). Moreover, based on this observation, we propose a novel AL algorithm that substantially reduces the number of training steps required until the final acquisition. Typically, an AL algorithm alternates between acquiring labels based on a model and re-training the model with the newly acquired labels; at every re-training step, the old models are discarded and a new model is trained from scratch. Instead, we suggest maintaining a model on a single learning trajectory throughout the entire AL procedure and gathering snapshots from that trajectory to compute acquisition functions. We show that this significantly reduces the number of training steps without sacrificing too much accuracy. In summary, our contributions are as follows:

• We propose to use SE for uncertainty-based acquisition functions in AL and demonstrate its effectiveness through various empirical evaluations.

• We propose a novel AL algorithm in which a single learning trajectory is maintained and used to compute acquisition functions throughout the entire AL procedure. We demonstrate that our algorithm achieves decent accuracy with far fewer training steps.
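To make the single-trajectory idea concrete, the following toy sketch continues optimizing one model across AL episodes and uses the most recent checkpoints as the uncertainty committee. Everything here is a stand-in of our own choosing, not the paper's implementation: a one-parameter quadratic replaces the neural network, and prediction variance across snapshots replaces the actual acquisition function.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)

def train_steps(theta, data, lr, n_steps):
    """A few SGD steps on a toy squared loss -- the key point is that
    theta keeps evolving instead of being re-initialized each episode."""
    for _ in range(n_steps):
        x = data[rng.integers(len(data))]
        theta = theta - lr * 2 * (theta - x)   # gradient of (theta - x)^2
    return theta

def disagreement(snapshots, pool):
    """Acquisition stand-in: variance of the snapshots' predictions for
    each pool sample (a proxy for BALD on real class probabilities)."""
    preds = np.array([[-(theta - x) ** 2 for x in pool]
                      for theta in snapshots])
    return preds.var(axis=0)

theta = 0.0
pool = list(rng.normal(0, 3, size=20))
labeled = [rng.normal()]
snapshots = deque(maxlen=5)          # keep only the most recent checkpoints

for episode in range(4):             # one continuous trajectory, no restarts
    theta = train_steps(theta, labeled, lr=0.1, n_steps=50)
    snapshots.append(theta)          # checkpoint, not a freshly trained model
    scores = disagreement(snapshots, pool)
    query = int(np.argmax(scores))   # acquire the most "confusing" sample
    labeled.append(pool.pop(query))
```

The saving comes from the inner loop: each episode runs only a few additional steps on the current parameters, rather than a full from-scratch training run.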

2. BACKGROUND

2.1 SETTINGS AND BASIC ACTIVE LEARNING ALGORITHM

In this paper, we mainly discuss the K-way classification problem, where the goal is to learn a classifier f(·; θ), parameterized by θ, that takes an input x ∈ ℝ^d and produces a K-dimensional probability vector, that is, f(x; θ) ∈ [0, 1]^K such that Σ_{k=1}^K f_k(x; θ) = 1. To learn θ, we need a labeled dataset consisting of pairs of an input x and the corresponding label y ∈ {1, . . . , K}, but in AL, we are given only an unlabeled dataset U = {x_i}_{i=1}^n. An AL algorithm is defined by a classifier model f(·; θ) and an acquisition function a : ℝ^d → ℝ measuring how useful an unlabeled example x is to the classifier f(·; θ). Given f and a, an AL algorithm alternates between acquiring labels for the chosen unlabeled samples and training the model with the newly acquired labels.
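One round of this alternation can be sketched as follows; the `train`, `acquire`, and `oracle` callables are placeholders for the model fit, the acquisition function a(x), and the labeling expert, respectively, and are not the paper's API:

```python
import numpy as np

def active_learning_round(pool_x, labeled_x, labeled_y,
                          train, acquire, oracle, batch_size):
    """One round of pool-based AL: train on the labeled set, score the
    pool with the acquisition function, then query the top-b samples."""
    model = train(labeled_x, labeled_y)
    scores = acquire(model, pool_x)                 # one utility score per pool sample
    query_idx = np.argsort(scores)[-batch_size:]    # highest-utility samples
    new_y = oracle(pool_x[query_idx])               # ask the experts for labels
    labeled_x = np.concatenate([labeled_x, pool_x[query_idx]])
    labeled_y = np.concatenate([labeled_y, new_y])
    pool_x = np.delete(pool_x, query_idx, axis=0)   # remove queried samples
    return pool_x, labeled_x, labeled_y

# Toy demo with dummy components (illustrative only).
train = lambda X, y: None                           # no-op "model"
acquire = lambda model, X: X[:, 0]                  # score = first feature
oracle = lambda X: np.zeros(len(X), dtype=int)      # dummy expert labels
pool_x = np.arange(10.0).reshape(10, 1)
labeled_x, labeled_y = np.zeros((3, 1)), np.zeros(3, dtype=int)
pool_x, labeled_x, labeled_y = active_learning_round(
    pool_x, labeled_x, labeled_y, train, acquire, oracle, batch_size=2)
```

The loop repeats this round until the labeling budget is exhausted; in the standard setup, `train` discards the previous model and fits a new one from scratch each round.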

Code availability: https://github.com/nannullna/snapshot

