DATA SUBSET SELECTION VIA MACHINE TEACHING

Abstract

We study the problem of data subset selection: given a fully labeled dataset and a training procedure, select a subset such that training on that subset yields approximately the same test performance as training on the full dataset. We propose an algorithm, inspired by recent work in machine teaching, that has theoretical guarantees and compelling empirical performance, and is model-agnostic, meaning the algorithm's only information comes from the predictions of models trained on subsets. Furthermore, we prove lower bounds showing that our algorithm achieves a subset of near-optimal size (under computational hardness assumptions) while training on a number of subsets that is optimal up to logarithmic factors. We then empirically compare our algorithm, existing machine teaching algorithms, and coreset techniques on six common image datasets with convolutional neural networks. We find that our machine teaching algorithm can find a subset of CIFAR10 of size less than 16k that yields the same performance (5-6% error) as training on the full dataset of size 50k.

1. INTRODUCTION

Machine learning has made tremendous progress in the past two years with large models such as GPT-3 and CLIP (Brown et al., 2020; Radford et al., 2021). A key ingredient in these milestones is increasing the amount of training data to Internet scale. However, large datasets come with a cost, and not all, or perhaps not even most, training data is useful. Data subset selection addresses this practical issue with the goal of finding a subset of a dataset such that training on the subset yields approximately the same test performance as training on the full dataset (Wei et al., 2015). Applications of data subset selection include continual learning (Aljundi et al., 2019; Borsos et al., 2020; Yoon et al., 2021), experience replay (Schaul et al., 2016; Hu et al., 2021), curriculum learning (Bengio et al., 2009; Wang et al., 2021), and data-efficient learning (Killamsetty et al., 2021a). Additionally, data subset selection is closely related to active learning (Settles, 2009; Sener & Savarese, 2018) and to fundamental questions about the role of data in learning (Toneva et al., 2018).

Data subset selection has been studied in two quite different contexts. First, perhaps the more popular approaches are found in the coreset literature (Sener & Savarese, 2018; Coleman et al., 2020; Mirzasoleiman et al., 2020; Paul et al., 2021), an empirically driven (Guo et al., 2022) research area that includes a variety of techniques for selecting important and diverse points, often with an eye towards minimal computational cost of subset selection. Second, machine teaching (Goldman & Kearns, 1995; Shinohara & Miyano, 1991) focuses on minimizing the number of examples a teacher must present to a learner and can be reframed as data subset selection in some settings. As machine teaching has mostly been studied from a conceptual or theoretical viewpoint, black-box models such as modern neural networks present a practical challenge for it.
Recently, however, a theoretical breakthrough in the machine teaching literature (Dasgupta et al., 2019; Cicalese et al., 2020) formalized the problem of teaching black-box learners and provided algorithms with analysis. Although these works are mainly theoretical, they also include some limited empirical evaluation, though with implementations that rely on unjustified heuristics. Furthermore, Dasgupta et al. (2019) include no baselines, and Cicalese et al. (2020) include only random sampling as a baseline. As of yet, the two algorithms have not been compared to each other in the literature, much less to other coreset methods.

In this work, we bring together these two lines of research through both theoretical analysis and empirical evaluation. We make a clear connection between a particular machine teaching setting and data subset selection, use this insight to introduce an algorithm with state-of-the-art and near-optimal asymptotics, prove the correctness of implementation heuristics from previous machine teaching work, and empirically evaluate methods from both machine teaching and coreset selection on a standard set of benchmarks for the first time. The subset size returned by our machine teaching algorithm shaves off a factor logarithmic in the dataset size compared to existing work. Furthermore, through novel lower bounds, we show that the subset size of our algorithm is near-optimal (under standard computational hardness assumptions for NP-complete problems) and that the expected number of times we must query the learner (i.e., train a network) is optimal up to logarithmic factors. Existing machine teaching algorithms from Dasgupta et al. (2019) and Cicalese et al. (2020) perform well when the learner fits labels perfectly (zero training error) but fail catastrophically if the learner makes even a few training errors.
To address this issue, prior work (Cicalese et al., 2020) removes the training errors from the predictions provided to the teacher, a technique we refer to as error squashing. We provide a rigorous theoretical framework that explains why error squashing is justified and effective, and furthermore, we use it in our algorithm. Finally, and perhaps most importantly, we compare machine teaching algorithms, including ours, to state-of-the-art coreset selection techniques and random sampling. We perform experiments with three convolutional neural network architectures on six image datasets (CIFAR10, CIFAR100, CINIC10, MNIST, Fashion MNIST, SVHN). We find that the machine teaching algorithms all perform roughly the same and consistently match or outperform the coreset selection techniques. In summary, our main contributions are:

• Proposing a machine teaching subset selection algorithm with analysis and lower bounds, showing that the algorithm achieves optimal asymptotic performance up to logarithmic factors.

• Providing the first analysis justifying error squashing.

• Experimentally comparing machine teaching algorithms and coreset techniques using three convolutional neural network architectures on six image datasets.

In Sections 2 and 3, we introduce the classification setting and present our algorithm with its guarantee. Next, we present our main theoretical results in Section 4 and our experimental results in Section 5. Finally, we discuss our work within the context of related work in Sections 6 and 7.

2. SETTING

We work in a classification setting with an input space X and a finite output space Y. Given a distribution D over X × Y, we wish to find a classifier h : X → Y with low (test) error: err(h) = Pr_{(x,y)∼D}[h(x) ≠ y]. In this work, we assume we have a hypothesis class H and a learner L, where each h ∈ H is a function h : X → Y and the learner is a function L : (X × Y)* → H. We assume we have a pool of m labeled data points, {(x_i, y_i)}_{i=1}^m ⊂ (X × Y)^m. The objective of data subset selection is to select a subset S ⊂ [m] of small size |S| such that err(L({(x_i, y_i) : i ∈ S})) is low; perhaps as low as the error of training on all the data, err(L({(x_i, y_i) : i ∈ [m]})).
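To make the objective concrete, the following is a minimal sketch of the setting above. The 1-nearest-neighbor learner and the toy one-dimensional pool are our own hypothetical stand-ins for the black-box learner L and the labeled pool; any training procedure could take their place.

```python
def pool_error(h, pool):
    """Empirical error of classifier h on a pool of (x, y) pairs."""
    return sum(h(x) != y for x, y in pool) / len(pool)

def nearest_neighbor_learner(subset):
    """A toy interpolating learner L: predict the label of the closest training point."""
    def h(x):
        return min(subset, key=lambda p: abs(p[0] - x))[1]
    return h

# A pool of m = 6 labeled points (1-D inputs, binary labels).
pool = [(0.0, 0), (0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1), (1.0, 1)]

# Training on the full pool yields zero pool error...
h_full = nearest_neighbor_learner(pool)
assert pool_error(h_full, pool) == 0.0

# ...but a subset S of size 2 already matches that performance,
# which is exactly what data subset selection seeks.
S = [pool[0], pool[5]]
h_sub = nearest_neighbor_learner(S)
assert pool_error(h_sub, pool) == 0.0
```

The sketch illustrates the objective only; it says nothing about how to find such a subset efficiently, which is the subject of the algorithm in Section 3.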

3. METHOD

Although our ultimate objective is low test error, without a comprehensive understanding of generalization for models such as neural networks, we instrumentally focus on achieving low error on the pool of m data points. Zhang et al. (2021) show that modern neural networks can exactly fit arbitrary (even random) ground truth labels. Following previous machine teaching work, we initially assume that the learner makes no errors on the training subset, and refer to such learners as interpolating learners. Note that an interpolating learner always achieves zero pool error when trained on the entire pool; however, it may be possible to select a smaller subset that also yields zero pool error. Later, we relax this assumption, as it holds only approximately in practice.
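When the interpolation assumption is relaxed, the error squashing technique mentioned in the introduction overrides the learner's predictions on its own training subset with the true labels before they are shown to the teacher. The wrapper below is our own illustrative sketch, not the paper's implementation; it assumes inputs are hashable and compared by exact match.

```python
def squash_errors(h, subset):
    """Wrap classifier h so its predictions on the training subset
    are replaced by the true labels (error squashing)."""
    lookup = {x: y for x, y in subset}  # assumes hashable inputs
    def h_squashed(x):
        return lookup[x] if x in lookup else h(x)
    return h_squashed

# A deliberately bad learner that predicts 0 everywhere,
# trained on a subset containing a point labeled 1.
h = lambda x: 0
subset = [(1, 1), (2, 0)]

hs = squash_errors(h, subset)
assert hs(1) == 1  # training error on the subset is squashed
assert hs(3) == 0  # predictions off the subset are unchanged
```

This makes the wrapped learner look interpolating from the teacher's perspective, which is the property the theoretical analysis in the introduction says justifies the heuristic.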

