DATA SUBSET SELECTION VIA MACHINE TEACHING

Abstract

We study the problem of data subset selection: given a fully labeled dataset and a training procedure, select a subset such that training on that subset yields approximately the same test performance as training on the full dataset. We propose an algorithm, inspired by recent work in machine teaching, that has theoretical guarantees and compelling empirical performance, and that is model-agnostic: its only information comes from the predictions of models trained on subsets. Furthermore, we prove lower bounds showing that our algorithm finds a subset of near-optimal size (under computational hardness assumptions) while training on a number of subsets that is optimal up to logarithmic factors. We then empirically compare our algorithm, machine teaching algorithms, and coreset techniques on six common image datasets with convolutional neural networks. We find that our machine teaching algorithm can find a subset of CIFAR10 of size less than 16k that yields the same performance (5-6% error) as training on the full dataset of size 50k.

1. INTRODUCTION

Machine learning has made tremendous progress in recent years with large models such as GPT-3 and CLIP (Brown et al., 2020; Radford et al., 2021). A key ingredient in these milestones is increasing the amount of training data to Internet scale. However, large datasets come with a cost, and not all, or perhaps not even most, training data is useful. Data subset selection addresses this practical issue with the goal of finding a subset of a dataset such that training on the subset yields approximately the same test performance as training on the full dataset (Wei et al., 2015). Applications of data subset selection include continual learning (Aljundi et al., 2019; Borsos et al., 2020; Yoon et al., 2021), experience replay (Schaul et al., 2016; Hu et al., 2021), curriculum learning (Bengio et al., 2009; Wang et al., 2021), and data-efficient learning (Killamsetty et al., 2021a). Additionally, data subset selection is closely related to active learning (Settles, 2009; Sener & Savarese, 2018) and to fundamental questions about the role of data in learning (Toneva et al., 2018).

Data subset selection has been studied in two quite different contexts. First, perhaps the more popular approaches are found in the coreset literature (Sener & Savarese, 2018; Coleman et al., 2020; Mirzasoleiman et al., 2020; Paul et al., 2021), an empirically driven (Guo et al., 2022) research area that includes a variety of techniques for selecting important and diverse points, often with an eye toward minimizing the computational cost of subset selection. Second, machine teaching (Goldman & Kearns, 1995; Shinohara & Miyano, 1991) focuses on minimizing the number of examples a teacher must present to a learner and can be reframed as data subset selection in some settings. Since machine teaching has mostly been studied from a conceptual or theoretical viewpoint, black-box models such as modern neural networks present a practical challenge for machine teaching.
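To make the model-agnostic setting concrete, the sketch below shows one simple greedy scheme in this spirit: repeatedly train a black-box learner on the current subset, query only its predictions on the full dataset, and add a point it misclassifies. This is an illustrative toy, not the paper's algorithm; the nearest-centroid learner, function names, and hyperparameters are our own stand-ins.

```python
import numpy as np

def train(X, y):
    # Toy "black-box" learner: a nearest-centroid classifier standing in for
    # any training procedure; the selector only ever sees its predictions.
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return lambda Xq: classes[np.argmin(
        ((Xq[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)]

def select_subset(X, y, init=4, rounds=40, seed=0):
    """Greedy model-agnostic selection (illustrative only): grow the subset
    with points the subset-trained model currently gets wrong."""
    rng = np.random.default_rng(seed)
    idx = set(rng.choice(len(X), size=init, replace=False).tolist())
    for _ in range(rounds):
        model = train(X[sorted(idx)], y[sorted(idx)])
        wrong = [i for i in np.flatnonzero(model(X) != y) if i not in idx]
        if not wrong:
            break                        # subset already reproduces full-data labels
        idx.add(int(rng.choice(wrong)))  # add one currently misclassified point
    return sorted(idx)

# Synthetic two-class data: two well-separated Gaussian clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (200, 2)), rng.normal(2, 1, (200, 2))])
y = np.array([0] * 200 + [1] * 200)
S = select_subset(X, y)
model = train(X[S], y[S])
print(len(S), (model(X) == y).mean())
```

On this toy problem the loop typically terminates with a subset far smaller than the full dataset while matching its training-set accuracy; the key property shared with the model-agnostic setting above is that the selector interacts with the learner only through predictions of models trained on subsets.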
Recently, however, a theoretical breakthrough in the machine teaching literature (Dasgupta et al., 2019; Cicalese et al., 2020) formalizes and analyzes algorithms for teaching black-box learners. Although these works are mainly theoretical, they also include some limited empirical evaluation, though with implementations that rely on unjustified heuristics. Furthermore, Dasgupta et al. (2019) includes no baselines, and Cicalese et al. (2020) includes only random sampling as a baseline. As of yet, the two algorithms have not been compared to each other in the literature, much less to other coreset methods. In this work, we bring together these two lines of research through both theoretical analysis and empirical evaluation. We make a clear connection between a particular machine teaching setting and data subset selection, use this insight to introduce an algorithm with state-of-the-art and near-optimal asymptotics, prove correctness of implementation heuristics from previous machine teaching work,

