LEVERAGING IMPORTANCE WEIGHTS IN SUBSET SELECTION

Abstract

We present a subset selection algorithm designed to work with arbitrary model families in a practical batch setting. In such a setting, an algorithm can sample examples one at a time but, in order to limit overhead costs, is only able to update its state (i.e. further train model weights) once a large enough batch of examples is selected. Our algorithm, IWeS, selects examples by importance sampling, where the sampling probability assigned to each example is based on the entropy of models trained on previously selected batches. IWeS yields significant performance improvements over other subset selection algorithms on seven publicly available datasets. It is also competitive in an active learning setting, where label information is not available at selection time. Finally, we provide an initial theoretical analysis to support our importance weighting approach, proving generalization and sampling rate bounds.

1. INTRODUCTION

Deep neural networks have shown remarkable success in several domains, such as computer vision and natural language processing. In many tasks, this success is achieved by relying heavily on extremely large labeled datasets. In addition to the storage costs and potential security and privacy concerns that come with large datasets, training modern deep neural networks on them also incurs high computational costs. With the growing size of datasets in various domains, algorithm scalability is a real and imminent challenge that needs to be addressed. One promising way to address this challenge is data subset selection, where the learner aims to find the most informative subset of a large number of training samples in order to approximate (or even improve upon) training with the entire training set. Such ideas have been extensively studied in k-means and k-median clustering (Har-Peled & Mazumdar, 2004), subspace approximation (Feldman et al., 2010), computational geometry (Agarwal et al., 2005), and density estimation (Turner et al., 2021), to name a few. One particular approach to data subsampling involves the computation of coresets, which are weighted subsets of a dataset that act as a proxy for the whole dataset when solving some optimization task. Coreset algorithms are primarily motivated by theoretical guarantees that bound the difference between the training loss (or other such objective) over the coreset and that over the full dataset, under different assumptions on the losses and hypothesis classes (Mai et al., 2021; Munteanu et al., 2018; Curtin et al., 2019; Karnin & Liberty, 2019). In practice, however, most competitive subset selection algorithms that are designed for general loss functions and arbitrary function classes focus only on selecting informative subsets of the data and typically do not assign weights to the selected examples.
These methods are, for example, based on some notion of model uncertainty (Scheffer et al., 2001), information gain (Argamon-Engelson & Dagan, 1999), loss gradients (Paul et al., 2021; Ash et al., 2019), or diversity (Sener & Savarese, 2018). Counter to this trend, we show that weighting the selected samples can be highly beneficial. In this work, we present a subset selection algorithm, called IWeS, that is designed for general loss functions and hypothesis classes and that selects examples by importance sampling, a theoretically motivated and unbiased sampling technique. Importance sampling is conducted according to a specially crafted probability distribution and, importantly, each sampled example is weighted inversely proportionally to its sampling probability when computing the training loss. We develop two types of sampling probabilities for different practical requirements (e.g. computational constraints and label availability); in both cases, the sampling probability is based on the example's entropy-based score computed using a previously trained model. We note that the IWeS algorithm is similar to the IWAL active learning algorithm of Beygelzimer et al. (2009), as both are based on importance sampling. In contrast to IWAL, however, IWeS uses a different definition of the sampling probability, with a focus on providing a practical method that is amenable to large deep networks and complex hypothesis classes. Through extensive experiments, we find that IWeS is competitive for deep neural networks over several datasets. We compare our algorithm against four types of baselines whose sampling strategies leverage, respectively, the model's uncertainty over examples, the diversity of selected examples, gradient information, and random sampling.
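To make the sampling-and-weighting mechanism concrete, the following sketch illustrates entropy-based importance sampling with inverse-probability weights. It is a simplified illustration, not the paper's exact procedure: the function names, the normalization of entropy by log of the number of classes, and the probability floor `p_min` are all assumptions introduced here for clarity.

```python
import numpy as np

def entropy_score(probs):
    """Entropy of each predicted class distribution (natural log)."""
    probs = np.clip(probs, 1e-12, 1.0)
    return -np.sum(probs * np.log(probs), axis=-1)

def iwes_sample(probs, p_min=0.1, rng=None):
    """Sample examples with probability proportional to normalized entropy
    (floored at p_min); return selected indices and importance weights 1/p."""
    rng = np.random.default_rng(0) if rng is None else rng
    scores = entropy_score(probs)
    max_entropy = np.log(probs.shape[1])          # entropy of the uniform distribution
    p = np.maximum(scores / max_entropy, p_min)   # sampling probabilities in [p_min, 1]
    keep = rng.random(len(p)) < p                 # independent Bernoulli draws
    weights = 1.0 / p[keep]                       # inverse-probability weights keep the loss unbiased
    return np.flatnonzero(keep), weights
```

Weighting each selected example by 1/p ensures that the weighted training loss over the sampled subset is an unbiased estimate of the loss over the full pool, which is the key property importance sampling provides here.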
Finally, we analyze a closely related, albeit less practical, algorithm that inspired the design of IWeS, called IWeS-V, proving that it admits generalization and sampling rate guarantees that hold for general loss functions and hypothesis classes. The contributions of this work can be summarized as follows:

1. We present the Importance Weighted Subset Selection (IWeS) algorithm, which selects examples by importance sampling with a sampling probability based on a model's entropy, and which is applicable to (and practical for) arbitrary model families, including modern deep networks. Beyond the subset selection framework, IWeS also works in the active learning setting, where examples are unlabeled at selection time.

2. We demonstrate that IWeS achieves significant improvements over several baselines (Random, Margin, Least-Confident, Entropy, Coreset, BADGE) using a VGG16 model on six common multi-class datasets (CIFAR10, CIFAR10-corrupted, CIFAR100, SVHN, Eurosat, Fashion MNIST), and using a ResNet101 model on the large-scale multi-label OpenImages dataset.

3. We provide a theoretical analysis of a closely related algorithm, IWeS-V, in Section 4. We prove an O(1/√T) generalization bound, which depends on the full training dataset size T. We further give a new definition of the disagreement coefficient and prove a sampling rate bound that leverages label information, which is tighter than the label complexity bound of Beygelzimer et al. (2009), which does not use label information.

1.1. RELATED WORK

Uncertainty. Uncertainty sampling, which selects examples that the model is least confident about, is favored by practitioners (Mussmann & Liang, 2018) and remains competitive with many recent algorithms (Yang & Loog, 2018). Uncertainty can be measured through entropy (Argamon-Engelson & Dagan, 1999), least confidence (Culotta & McCallum, 2005), and, most popularly, the margin between the most likely and second most likely labels (Scheffer et al., 2001). More recent works measure model uncertainty indirectly, for example by selecting examples based on an estimated loss (Yoo & Kweon, 2019), or by leveraging variational autoencoders and adversarial networks to find points not well represented by the current labeled data (Sinha et al., 2019; Kim et al., 2021). Beygelzimer et al. (2009) make use of a disagreement-based notion of uncertainty and construct an importance weighted predictor with theoretical guarantees, called IWAL, which is further enhanced by Cortes et al. (2019). However, IWAL is not directly suitable for use with complex hypothesis spaces, such as deep networks, since computing its sampling probabilities requires solving a non-trivial optimization over a subset of the hypothesis class, the so-called version space. We discuss these difficulties further in Section 4.
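The three classical uncertainty measures mentioned above can be written down in a few lines. The sketch below is a generic illustration over softmax outputs, assuming each row of `probs` is a predicted class distribution; the function names are ours, not from any of the cited works.

```python
import numpy as np

def least_confidence(probs):
    """One minus the top predicted probability (Culotta & McCallum, 2005)."""
    return 1.0 - probs.max(axis=-1)

def margin(probs):
    """Gap between the most likely and second most likely labels
    (Scheffer et al., 2001); a smaller gap means higher uncertainty."""
    top2 = np.sort(probs, axis=-1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def entropy(probs):
    """Predictive entropy (Argamon-Engelson & Dagan, 1999), natural log."""
    p = np.clip(probs, 1e-12, 1.0)
    return -np.sum(p * np.log(p), axis=-1)
```

Note the sign conventions differ: high entropy and high least-confidence indicate uncertainty, whereas for margin it is a small value that marks an uncertain example.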

Diversity. In another line of research, subsets are selected by enforcing diversity, as in the FASS (Wei et al., 2015) and Coreset (Sener & Savarese, 2018) algorithms. Wei et al. (2015) introduce a submodular sampling objective that trades off uncertainty and diversity by finding a diverse set of samples from among those that the currently trained model is most uncertain about. This was further explored by Kaushal et al. (2019), who design a unified framework for data subset selection with facility location and dispersion-based diversity functions. Sener & Savarese (2018) show that the task of identifying a coreset in an active learning setting can be mapped to solving the k-center problem. Further recent works related to the coreset idea are Mirzasoleiman et al. (2020); Killamsetty
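The k-center reduction used by Sener & Savarese (2018) is typically solved with the classic greedy 2-approximation: repeatedly add the point farthest from the current centers. The sketch below is a generic version of that greedy rule over an embedding matrix `X`; the function name and random initial center are our own illustrative choices, not the authors' exact implementation.

```python
import numpy as np

def k_center_greedy(X, k, rng=None):
    """Greedy 2-approximation for k-center: start from a random point,
    then repeatedly select the point farthest from the chosen centers."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = len(X)
    centers = [int(rng.integers(n))]
    # Distance from every point to its nearest chosen center so far.
    dists = np.linalg.norm(X - X[centers[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))      # farthest point becomes the next center
        centers.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(X - X[nxt], axis=1))
    return centers
```

Because each new center is the point worst covered so far, the selected subset spreads out over the data, which is exactly the diversity objective this line of work optimizes.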

