LEVERAGING IMPORTANCE WEIGHTS IN SUBSET SELECTION

Abstract

We present a subset selection algorithm designed to work with arbitrary model families in a practical batch setting. In such a setting, an algorithm can sample examples one at a time but, in order to limit overhead costs, is only able to update its state (i.e., further train model weights) once a large enough batch of examples has been selected. Our algorithm, IWeS, selects examples by importance sampling, where the sampling probability assigned to each example is based on the entropy of models trained on previously selected batches. IWeS yields significant performance improvements over other subset selection algorithms on seven publicly available datasets. It is also competitive in an active learning setting, where label information is not available at selection time. Finally, we provide an initial theoretical analysis to support our importance weighting approach, proving generalization and sampling rate bounds.
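To make the sampling mechanism described above concrete, the following is a minimal sketch of entropy-based importance sampling with inverse-probability weights. The function names and the exact sampling distribution (probabilities proportional to prediction entropy) are illustrative assumptions, not the paper's precise IWeS definition.

```python
import numpy as np

def entropy(probs):
    # Shannon entropy of each row of predicted class probabilities.
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def entropy_importance_sample(probs, batch_size, rng=None):
    """Illustrative sketch: sample a batch with probability proportional
    to model entropy, returning inverse-probability importance weights
    so that the reweighted loss remains an unbiased estimate."""
    rng = rng or np.random.default_rng(0)
    scores = entropy(probs)
    q = scores / scores.sum()            # sampling distribution over the pool
    idx = rng.choice(len(q), size=batch_size, replace=False, p=q)
    weights = 1.0 / (len(q) * q[idx])    # importance weights for selected examples
    return idx, weights
```

Under a uniform model (all predictions equally uncertain), every example is sampled with equal probability and all importance weights reduce to 1, recovering plain uniform subsampling.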

1. INTRODUCTION

Deep neural networks have shown remarkable success in several domains such as computer vision and natural language processing. In many tasks, this success relies heavily on extremely large labeled datasets. In addition to the storage costs and potential security/privacy concerns that come along with large datasets, training modern deep neural networks on them also incurs high computational costs. With the growing size of datasets in various domains, algorithm scalability is a real and imminent challenge that needs to be addressed. One promising way to address this problem is data subset selection, where the learner aims to find the most informative subset of a large number of training samples in order to approximate (or even improve upon) training with the entire training set. Such ideas have been extensively studied in k-means and k-median clustering (Har-Peled & Mazumdar, 2004), subspace approximation (Feldman et al., 2010), computational geometry (Agarwal et al., 2005), and density estimation (Turner et al., 2021), to name a few.

One particular approach to data subsampling involves the computation of coresets, which are weighted subsets of a dataset that can act as a proxy for the whole dataset in solving some optimization task. Coreset algorithms are primarily motivated by theoretical guarantees that bound the difference between the training loss (or other such objective) over the coreset and that over the full dataset, under different assumptions on the losses and hypothesis classes (Mai et al., 2021; Munteanu et al., 2018; Curtin et al., 2019; Karnin & Liberty, 2019). In practice, however, most competitive subset selection algorithms designed for general loss functions and arbitrary function classes focus only on selecting informative subsets of the data and typically do not assign weights to the selected examples.
These methods are based, for example, on some notion of model uncertainty (Scheffer et al., 2001), information gain (Argamon-Engelson & Dagan, 1999), loss gradients (Paul et al., 2021; Ash et al., 2019), or diversity (Sener & Savarese, 2018). Counter to this trend, we show that weighting the selected samples can be very beneficial. In this work, we present a subset selection algorithm called IWeS that is designed for general loss functions and hypothesis classes and that selects examples by importance sampling, a theoretically

