DATA SUBSET SELECTION VIA MACHINE TEACHING

Abstract

We study the problem of data subset selection: given a fully labeled dataset and a training procedure, select a subset such that training on that subset yields approximately the same test performance as training on the full dataset. We propose an algorithm, inspired by recent work in machine teaching, that has theoretical guarantees, compelling empirical performance, and is model-agnostic, meaning the algorithm's only information comes from the predictions of models trained on subsets. Furthermore, we prove lower bounds showing that our algorithm achieves a subset of near-optimal size (under computational hardness assumptions) while training on a number of subsets that is optimal up to extraneous log factors. We then empirically compare our algorithm, machine teaching algorithms, and coreset techniques on six common image datasets with convolutional neural networks. We find that our machine teaching algorithm can find a subset of CIFAR10 of size less than 16k that yields the same performance (5-6% error) as training on the full dataset of size 50k.

We work in a classification setting with an input space X and a finite output space Y. Given a distribution D over X × Y, we wish to find a classifier h : X → Y with low (test) error: err(h) = Pr_{(x,y)∼D}[h(x) ≠ y]. In this work, we assume we have a hypothesis class H and a learner L, where each h ∈ H is a function h : X → Y and the learner is a function L : (X × Y)* → H. We assume we have a pool of m labeled data points {(x_i, y_i)}_{i=1}^m, and we wish to select a subset S ⊂ [m] such that the error of training on the subset, err(L({(x_i, y_i) : i ∈ S})), is low; perhaps as low as the error of training on all data, err(L({(x_i, y_i)}_{i=1}^m)). Although our ultimate objective is low test error, without a comprehensive understanding of generalization for models such as neural networks, we instrumentally focus on achieving low error on the pool of m datapoints. Zhang et al. (2021) show that modern neural networks can exactly fit arbitrary (even random) ground truth labels.
Following previous machine teaching work, we initially assume that the learner makes no errors on the training subset, and refer to these learners as interpolating learners. Note that for interpolating learners, we can always achieve zero pool error if we train with the entire pool. However, it may be possible to select a smaller subset that yields zero pool error. Later, we relax this assumption as it is only approximately true in practice.
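The setting above can be made concrete with a small sketch. The 1-nearest-neighbor learner here is a hypothetical stand-in for L (it is interpolating: it makes no errors on its own training subset); the pool, the function names, and the data are illustrative, not from the paper.

```python
# Minimal sketch of the subset selection setting with an interpolating learner.

def nn_learner(train):
    """Return a 1-nearest-neighbor classifier fit to `train`, a list of (x, y) pairs."""
    def h(x):
        # predict the label of the nearest training point
        return min(train, key=lambda p: abs(p[0] - x))[1]
    return h

def pool_error(h, pool):
    """Fraction of pool points the hypothesis misclassifies."""
    return sum(h(x) != y for x, y in pool) / len(pool)

pool = [(0.0, 0), (0.1, 0), (0.2, 0), (1.0, 1), (1.1, 1), (1.2, 1)]
subset = [pool[0], pool[3]]              # a 2-point subset of the 6-point pool
h = nn_learner(subset)
assert all(h(x) == y for x, y in subset)  # interpolating on its subset
assert pool_error(h, pool) == 0.0         # and zero error on the whole pool
```

Here a subset much smaller than the pool already yields zero pool error, which is exactly the situation the algorithms in this paper try to exploit.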

1. INTRODUCTION

Machine learning has made tremendous progress in the past two years with large models such as GPT-3 and CLIP (Brown et al., 2020; Radford et al., 2021). A key ingredient in these milestones is increasing the amount of training data to Internet scale. However, large datasets come with a cost, and not all - or maybe not even most - training data is useful. Data subset selection addresses this practical issue with the goal of finding a subset of a dataset so that training on the subset yields approximately the same test performance as training on the full dataset (Wei et al., 2015). Applications of data subset selection include continual learning (Aljundi et al., 2019; Borsos et al., 2020; Yoon et al., 2021), experience replay (Schaul et al., 2016; Hu et al., 2021), curriculum learning (Bengio et al., 2009; Wang et al., 2021), and data-efficient learning (Killamsetty et al., 2021a). Additionally, data subset selection is closely related to active learning (Settles, 2009; Sener & Savarese, 2018) and to fundamental questions about the role of data in learning (Toneva et al., 2018). Data subset selection has been studied in two quite different contexts. First, perhaps the more popular approaches are found in the coreset literature (Sener & Savarese, 2018; Coleman et al., 2020; Mirzasoleiman et al., 2020; Paul et al., 2021), an empirically driven (Guo et al., 2022) research area that includes a variety of techniques to select important and diverse points, often with an eye towards minimal computational cost of subset selection. Second, machine teaching (Goldman & Kearns, 1995; Shinohara & Miyano, 1991) focuses on minimizing the number of examples a teacher must present to a learner and can be reframed as data subset selection in some settings. As machine teaching has mostly been studied from a conceptual or theoretical viewpoint, black-box models such as modern neural networks present a practical challenge for machine teaching.
However, a recent theoretical breakthrough in the machine teaching literature (Dasgupta et al., 2019; Cicalese et al., 2020) formalizes and provides algorithms with analysis for teaching black-box learners. Although these works are mainly theoretical, they also include some limited empirical evaluation, though with implementations that include unjustified heuristics. Furthermore, Dasgupta et al. (2019) includes no baselines and Cicalese et al. (2020) only includes random sampling as a baseline. To date, the two algorithms have not been compared in the literature, much less to other coreset methods. In this work, we bring together these two lines of research through both theoretical analysis and empirical evaluation. We make a clear connection between a particular machine teaching setting and data subset selection, use this insight to introduce an algorithm with state-of-the-art and near-optimal asymptotics, prove correctness of implementation heuristics from previous machine teaching work, and empirically evaluate methods from both machine teaching and coreset selection on a standard set of benchmarks for the first time. The subset size returned by our machine teaching algorithm shaves off a factor logarithmic in the dataset size compared to existing work. Furthermore, through novel lower bounds, we show that the subset size of our algorithm is near-optimal (under standard computational hardness assumptions concerning NP-complete problems) and that the expected number of times we must query the learner (train a network) is optimal up to extraneous log factors. Existing machine teaching algorithms from Dasgupta et al. (2019); Cicalese et al. (2020) perform well when our learner fits labels perfectly (zero training error) but fail catastrophically if the learner makes a few training errors.
To address this issue, prior work (Cicalese et al., 2020) removes the training errors from the predictions provided to the teacher, a technique we refer to as error squashing. We provide a rigorous theoretical framework that explains why squashing errors is justified and effective, and furthermore, use it in our algorithm. Finally, and perhaps most importantly, we compare machine teaching algorithms, including ours, to state-of-the-art coreset selection techniques and random sampling. We perform experiments with three convolutional neural network architectures on six image datasets (CIFAR10, CIFAR100, CINIC10, MNIST, Fashion MNIST, SVHN). We find that the machine teaching algorithms all perform roughly the same and consistently match or outperform the coreset selection techniques. In summary, our main contributions are:
• Proposing a machine teaching subset selection algorithm with analysis and lower bounds, showing that the algorithm achieves optimal asymptotic performance up to extraneous log factors.
• Providing the first analysis for the justification of error squashing.
• Experimentally comparing machine teaching algorithms and coreset techniques using three convolutional neural network architectures on six image datasets.
In Sections 2 and 3, we introduce the classification setting and present our algorithm with its guarantee. Next, we show our main theoretical results in Section 4 and experimental results in Section 5. Finally, we discuss our work within the context of related work in Sections 6 and 7.

3.1. MAIN ALGORITHM

In this section, we present our algorithm (see Algorithm 1 for pseudo-code). Our algorithm is iterative, where each iteration is composed of three steps: sampling a subset of the pool, training on that subset, and using the results of training to update the subset sampling distribution. The intuitive principle behind our algorithm is that if we train on a subset S ⊂ [m] and receive a hypothesis h ∈ H, we want to emphasize the points (i.e., include some of the points in future subsets) where the hypothesis makes errors with respect to the ground truth labels. Then, in future iterations, that same hypothesis will not be returned by an interpolating learner, since the hypothesis would make an error on the subset. At each iteration, the algorithm samples a subset S_t and trains on that subset to yield a hypothesis h_t. For sampling a subset at the t-th iteration, the algorithm independently samples Bernoulli random variables with probability p_{t,i} and, upon success, includes the i-th point. Thus we sample a set S_t such that Pr(S_t = S) = ∏_{i∈S} p_{t,i} · ∏_{i∈[m]∖S} (1 − p_{t,i}). In order to emphasize errors, the algorithm sets the next iteration's Bernoulli variables to have higher success probability (p_{t+1,i} > p_{t,i}). While there are a variety of ways to increase the probabilities, our algorithm calculates the errors E_t := E_{h_t} (where E_h = {i ∈ [m] : h(x_i) ≠ y_i}) made by the hypothesis h_t on the entire pool, and sequentially doubles the probabilities corresponding to the errors until Σ_{i∈E_t} p_{t,i} is sufficiently large (larger than a hyperparameter ξ). Since a probability might exceed 1 after this step, we subsequently clip the probabilities to 1. Because the algorithm only updates probabilities via doubling, a natural choice is to initialize p_{1,i} as a (negative) power of 2: 2^{−k_0} for some integer k_0.
Then, p_{t,i} is always a power of two, since doubled powers of 2 are powers of 2. The final aspect of the algorithm is that we periodically halve all probabilities. This step is key to improving the guarantee on the size of the subset from previous work (Dasgupta et al., 2019; Cicalese et al., 2020) to the asymptotically near-optimal subset size in this work. Note that without this halving step, the probabilities only increase with the number of iterations. Our algorithm has two hyperparameters, d and N, which are used to set the doubling limit ξ, the initial power k_0, and the number of iterations between halvings of the probabilities. The algorithm succeeds with high probability if the two hyperparameters upper bound, respectively, d*, the size of the smallest subset that yields zero pool error, and N, the size of the induced hypothesis class, both of which we define in the next section. Conveniently, in the cases where the algorithm fails, it reports whether d or N was too small. We can set d and N arbitrarily large to yield valid subsets, but the size of the resulting subsets will be very large.

3.2. ALGORITHMIC ANALYSIS

For the analysis, we begin by assuming that our learner is interpolating, meaning that it makes no errors on the subset it is trained on. More precisely, if h_S = L({(x_i, y_i) : i ∈ S}) where S ⊂ [m], then ∀i ∈ S : h_S(x_i) = y_i. Note that this implies that if we train on the entire dataset, a hypothesis that makes zero pool errors is returned. Define H as the induced hypothesis class, that is, all hypotheses that can be returned by training on a subset: H = {L({(x_i, y_i) : i ∈ S}) : S ⊂ [m]}. Note that the size of H could be as large as 2^m (the number of subsets) but can be much smaller if the data has structure that can be leveraged by the model. Our framework technically requires a deterministic learner, so we fix the random seed for training; this is not a practical issue, as it is extremely unlikely that we will train a model on the exact same subset at two different iterations.

We now define d* and N. Let E_h be the errors of a hypothesis h: E_h = {i ∈ [m] : h(x_i) ≠ y_i}. Let E be the set of possible error sets: E = {E_h : h ∈ H} − {∅}. We say a set S "fully intersects" E if for all E_h ∈ E, |S ∩ E_h| ≥ 1. Note that an interpolating learner trained on a fully intersecting subset S will yield zero pool error: if the returned hypothesis made a pool error, its error set E_h would belong to E and hence intersect S, so the hypothesis would have non-zero training error on S, contradicting the interpolating assumption. Let d* be the size of the smallest fully intersecting set, that is, d* := min_{S⊂[m]} |S| subject to |S ∩ E_h| ≥ 1 ∀E_h ∈ E.

Algorithm 1
Input: examples {(x_i, y_i)}_{i=1}^m ⊂ (X × Y)^m, failure probability δ, d ∈ Z+, N ∈ Z+
  k_0 = ⌊log_2(m/d)⌋
  ξ = ln(4 N² log_2(m)/δ)
  ∀i : k_{1,i} = k_0
  for t = 1, 2, …, 4dk_0 do
    Sample S_t such that i ∈ S_t independently with probability p_{t,i} = 2^{−k_{t,i}}
    Train h_t = L({(x_i, y_i)}_{i∈S_t})
    Calculate the errors of h_t on the pool: E_t := E_{h_t} = {i ∈ [m] : h_t(x_i) ≠ y_i}
    if |E_t| = 0 then
      return SUCCESS: Ŝ = S_t
    else if Σ_{i∈E_t} p_{t,i} ≥ ξ then
      return FAILURE TYPE 1: N is too small
    else
      Δ_t = ⌈log_2(ξ / Σ_{i∈E_t} p_{t,i})⌉
      k_{t+1,i} = max(0, k_{t,i} − Δ_t · 1[i ∈ E_t] + 1[t ≡ 0 (mod 2d)])
    end if
  end for
  return FAILURE TYPE 2: d is too small

With these definitions, we are ready to give our algorithmic analysis theorem. The proof of Theorem 1 is in Appendix D. Note that the log log m term is negligible; for example, it can be ignored (subsumed in the constants) if m ≤ 2^N, which is true in reasonable cases. In Section 4, we provide lower bounds showing that the number of queries to the learner is asymptotically optimal up to extraneous log factors and that the size of the returned subset is asymptotically optimal (under a computational hardness assumption), if we assume log log m is a low-order term compared to log N and that δ is a small constant. To choose the hyperparameters d and ξ = Θ(log N), we can use a doubling approach to find constant-factor approximations to d* and log N.
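With the error sets E known explicitly, finding a small fully intersecting set is the classic hitting set (equivalently, set cover) problem. The paper's algorithm only observes elements of E via learner queries; the sketch below instead assumes E is given, and shows the standard greedy heuristic, which attains the ln(N)-type approximation discussed in Section 4.

```python
def greedy_hitting_set(error_sets):
    """Greedily pick points that intersect every error set.

    Each step adds the point appearing in the most not-yet-intersected
    error sets; this is the textbook greedy set cover approximation.
    """
    uncovered = [frozenset(E) for E in error_sets if E]
    S = set()
    while uncovered:
        candidates = {i for E in uncovered for i in E}
        best = max(candidates, key=lambda i: sum(i in E for E in uncovered))
        S.add(best)
        uncovered = [E for E in uncovered if best not in E]
    return S

# toy error sets over pool indices {0, 1, 2, 3}
E = [{0, 1}, {1, 2}, {3}]
S = greedy_hitting_set(E)
assert all(S & set(Eh) for Eh in E)   # S fully intersects E
```

On this toy instance the greedy solution has size 2 (e.g., {1, 3}), which matches d*.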

3.3. NON-INTERPOLATING LEARNERS

For non-interpolating learners, Algorithm 1 might never return successfully, even with a single training error. To remedy this, we can "squash" the training errors, that is, remove from E_t the points that also appear in S_t. More precisely, when we calculate errors on the pool, instead of defining E_t = {i ∈ [m] : h_t(x_i) ≠ y_i}, we define E_t = {i ∈ [m] : h_t(x_i) ≠ y_i ∧ i ∉ S_t}. For the experiments, we run Algorithm 1 with this edit. This error squashing technique is used beyond our algorithm: it is explicitly used for the experiments in Cicalese et al. (2020), and a similar approach is likely used by Dasgupta et al. (2019), as we found experimentally that their algorithm fails dramatically, as written, without error squashing. Similar to this paper, the algorithmic analysis of Cicalese et al. (2020) assumes interpolating learners, but the practical algorithm includes error squashing.
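The squashed error computation is a one-line change; a minimal sketch (the pool, the constant classifier, and the function name are illustrative):

```python
def squashed_errors(h, pool, subset_indices):
    """Pool errors of h with training errors removed (error squashing)."""
    return {i for i, (x, y) in enumerate(pool)
            if h(x) != y and i not in subset_indices}

# toy example: a constant classifier that always predicts 0
pool = [(0.0, 0), (1.0, 1), (2.0, 1)]
h = lambda x: 0
# without squashing, h errs on indices 1 and 2; squashing drops index 1,
# which lies inside the training subset
assert squashed_errors(h, pool, subset_indices={1}) == {2}
```

With squashing, a hypothesis whose only mistakes are training errors produces an empty error set, so Algorithm 1 can terminate successfully.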

4. THEORY

In this section, we first cover notation definitions, a particular class of learners, and the equivalence between machine teaching and the classic set cover problem. Next, we cover our lower bounds and results for the error squashing technique.

4.1. NOTATION DEFINITIONS

Let Z denote the set of integers and Z+ the set of positive integers. Let [k] = {1, 2, …, k}. For a set S, let P(S) denote the power set of S and S^k the repeated Cartesian product; for example, S³ = S × S × S. Let Õ(·) denote O(·) with extraneous log factors ignored; in particular, O(f(n) log^k(f(n))) = Õ(f(n)).

4.2. RANKED MINIMAL ERROR LEARNERS

Here, we define a type of learner that appears in the constructions for our lower bounds and serves as an example in the error squashing framework. Intuitively, a ranked minimal error learner is a learner with a hypothesis class H and a ranking σ such that the learner returns the lowest-ranked hypothesis with minimal error.

Definition 1. Let X and Y be finite. We say a learner L is a ranked minimal error learner if there exists a (finite) hypothesis class H and a bijection σ : H → [|H|] such that for any D ∈ P(X × Y), with k = min_{h∈H} |{(x, y) ∈ D : h(x) ≠ y}|, L(D) = arg min_{h : |{(x,y)∈D : h(x)≠y}| = k} σ(h).

Note that since we focus on fixed datasets and classification in this work, finite X and Y are not a restriction for our purposes. If the ground truth labels are generated by a member of the hypothesis class, ∃h* ∈ H : ∀(x, y) ∈ D : y = h*(x), then the learner is an interpolating learner.
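Definition 1 can be written directly in code. A minimal sketch with a hypothetical threshold hypothesis class on scalar inputs (the class and data are illustrative, not from the paper):

```python
def ranked_minimal_error_learner(H, rank):
    """Build a learner from a finite hypothesis list H and a ranking function.

    Given data D, it returns the lowest-ranked hypothesis among those
    attaining the minimal number of errors on D (Definition 1).
    """
    def L(D):
        def n_errors(h):
            return sum(h(x) != y for x, y in D)
        k = min(n_errors(h) for h in H)
        return min((h for h in H if n_errors(h) == k), key=rank)
    return L

# hypothetical threshold classifiers h_t(x) = 1[x >= t]
H = [lambda x, t=t: int(x >= t) for t in (0.5, 1.5, 2.5)]
sigma = {id(h): r for r, h in enumerate(H)}       # ranking by list position
L = ranked_minimal_error_learner(H, lambda h: sigma[id(h)])

D = [(0.0, 0), (1.0, 1), (2.0, 1)]
h = L(D)
assert all(h(x) == y for x, y in D)   # an interpolating fit exists in H
```

Because some hypothesis in H fits D perfectly, the learner interpolates, matching the remark after Definition 1.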

4.3. LOWER BOUNDS

We present two query-based lower bounds proved using similar techniques. In particular, we construct a ranked minimal error learner with a special structure that depends on a random "key" (e.g., a random vector). We show that even if the teacher knew all the structure of the learner except the random key, no teaching algorithm can return a subset of size λd* (a λ-approximation) with few queries and probability greater than 1/2.

Theorem 2. Fix any λ ≥ 1 and m ≥ 2λ. There exists a distribution over interpolating ranked minimal error learners and an m-sized dataset with optimal subset size d*, such that any teaching algorithm requires 2^{Ω(m/λ)} queries to achieve at most λd* subset size with probability at least 1/2.

The proof is in Appendix E.1. This would appear to be very bad news: we require an exponential number of queries even to achieve a valid subset that is a factor λ larger than the optimal subset. Fortunately, as shown later, we cannot in general approximate the optimal subset to a factor of o(log N) (asymptotically strictly better than log N). Furthermore, N is very large in the construction for Theorem 2; so large that log N = Θ(m). Thus, if we are content with an approximation guarantee of O(log N), then far fewer queries to the learner are required, as the existence of Algorithm 1 shows.

Theorem 3. Fix any k ≥ 1, ℓ ≥ 1, and λ ≥ 1. There exists a distribution over interpolating ranked minimal error learners and a dataset of size m = k(⌈eλ⌉)^ℓ with optimal subset size d* = k and N = O(k ln(m/k)/ln(λ)), such that any teaching algorithm requires Ω(d* ln(m/d*)/ln²(λ)) queries to achieve at most λd* subset size with probability at least 1/2.

The proof is in Appendix E.2. Note that this implies that achieving an approximation ratio of O(log N) or O(log m log N) (as found in Cicalese et al. (2020); Dasgupta et al. (2019)) requires Ω(d* ln(m/d*)) queries.
We now present a computational hardness lower bound on finding a subset with a small o(log N) approximation ratio to optimal, even with unlimited learner queries so that H could be explicitly constructed. This result is a straightforward application of the following theorem to the set cover equivalence of machine teaching.

Theorem 4 (Theorem 4.4 from Feige (1998)). If there is some ϵ > 0 such that a polynomial time algorithm can approximate set cover within (1 − ϵ) ln n, then NP ⊂ TIME(n^{O(log log n)}).

Let us interpret this result. For set cover instances, the "size" n of a problem is measured in terms of the number of elements, our N. Feige (1998) defines TIME(f(n)) to be the set of problems solvable in f(n) (deterministic) time. The conclusion NP ⊂ TIME(n^{O(log log n)}) is slightly weaker than NP ⊂ TIME(n^{O(1)}) = P, which would imply the famous P = NP. All the same, showing NP ⊂ TIME(n^{O(log log n)}) would revolutionize complexity theory, and thus it is doubtful that such a statement can be proved without significant obstacles. Then, a direct consequence of the set cover equivalence (Goldman & Kearns, 1995) and Theorem 4 (Feige, 1998) is that approximating machine teaching to within (1/2) ln N would imply NP ⊂ TIME(n^{O(log log n)}).

4.4. ERROR SQUASHING FRAMEWORK

In a nutshell, our framework consists of defining a general condition for a learner, defining a teaching set definition for non-interpolating learners, then showing that achieving such a teaching set reduces, by way of error squashing, to a standard set cover problem (if the learner meets the condition). This means that the error squashing technique effortlessly converts any machine teaching algorithm for interpolating learners to a machine teaching algorithm for non-interpolating learners, which inherits the same theoretical guarantees, albeit with a larger hypothesis class.

4.4.1. LEARNERS INVARIANT TO CONSISTENT ADDITIONS

In this section, we first define a general condition on a learner: if a learner trained on a dataset correctly predicts a point, then additionally including that point in the dataset does not change the learner's predictions.

Definition 2 (Invariant to consistent additions). We say a learner is invariant to consistent additions if ∀D ∈ P(X × Y), x ∈ X, y ∈ Y: L(D)(x) = y ⟹ L(D) = L(D ∪ {(x, y)}).

Intuitively, for models that minimize error, adding a point with zero loss will not change the model. In particular, we have the following proposition:

Proposition 1. A ranked minimal error learner is invariant to consistent additions.

The proof of Proposition 1 can be found in Appendix F. Invariance to consistent additions is more general than ranked minimal error learners: see an example in Appendix F.7.1.
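A small self-contained check of Proposition 1's claim on a toy instance. The two constant classifiers and the data are hypothetical; the learner is a ranked minimal error learner with rank given by list position.

```python
def rme_learner(H):
    """Ranked minimal error learner: hypotheses ranked by position in H."""
    def L(D):
        errs = lambda h: sum(h(x) != y for x, y in D)
        k = min(errs(h) for h in H)
        return next(h for h in H if errs(h) == k)   # lowest rank wins ties
    return L

H = [lambda x: 0, lambda x: 1]          # two constant classifiers
L = rme_learner(H)
D = [(0.0, 0), (1.0, 0), (2.0, 1)]
h = L(D)                                # predicts 0, erring only on (2.0, 1)

x_new = 3.0
assert h(x_new) == 0
# adding a point the current hypothesis already labels correctly
# leaves the learner's output unchanged (invariance to consistent additions)
assert L(D + [(x_new, 0)]) is h
```

Adding (3.0, 0) costs the current minimizer nothing while it can only worsen the alternatives, so the same hypothesis is returned, which is exactly the intuition stated above.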

4.4.2. ERROR INCLUSIVE TEACHING SETS

Recall that error squashing in machine teaching means that when the learner is passed a subset S_t and returns a hypothesis h_t, the teacher acts as if the hypothesis makes no errors on the subset S_t. Machine teaching algorithms only terminate when the returned hypothesis makes no errors on the pool, which means that all the errors of h_t are squashed with respect to S_t. For a fixed dataset and learner, we say a set S is "error inclusive" if all errors of L(S) lie inside S. Formally, for a fixed dataset {(x_i, y_i)}_{i=1}^m, define L^(err) : P([m]) → P([m]) as the set of indices of points where errors are made by a learner trained on a subset S: L^(err)(S) = {i ∈ [m] : L({(x_{i′}, y_{i′}) : i′ ∈ S})(x_i) ≠ y_i}.

Definition 3. For a fixed dataset, we say a set S is "error inclusive" if L^(err)(S) ⊂ S.

Error inclusivity means that S contains the points useful for generalization outside S, as well as the (presumably) nuisance points that are hard or impossible to classify correctly, such as incorrectly labeled ground truth. An error squashing machine teaching algorithm, if it returns successfully, must return an error inclusive set. An important consequence of error inclusivity is that, for a learner invariant to consistent additions, the learner trained on the set will yield the same result as training on the entire dataset.

Proposition 2. Suppose a learner L is invariant to consistent additions. If a set S is error inclusive, then L^(err)(S) = L^(err)([m]).

The proof of Proposition 2 is in Appendix F.6. Interpreted further, error inclusivity implies that the number of errors from training on S is the same as the number of errors from training on the entire dataset, |L^(err)(S)| = |L^(err)([m])|. Furthermore, error inclusivity implies L^(err)([m]) ⊂ S, meaning that S contains the errors of training on the full dataset. Error inclusivity is the condition on a set that we will use as the teaching set condition for non-interpolating learners.
Note that Cicalese et al. (2020) introduce another teaching set condition for non-interpolating learners, k-extended teaching sets. In Appendix F, we compare k-extended teaching sets and error inclusive teaching sets for ranked minimal error learners. We find they are remarkably similar, though neither is stronger than the other: there exist examples where the smallest k-extended teaching set has size Θ(m) while an error inclusive set of size Θ(1) exists, and vice versa. However, we show error inclusive sets are much easier to design algorithms for.

4.4.3. EXTENDED ERROR SETS

Recall that in the context of machine teaching with interpolating learners, we define the errors of a hypothesis as E_h = {i ∈ [m] : h(x_i) ≠ y_i} and let E = {E_h : h ∈ H} − {∅} be the error sets. Then a set S ⊂ [m] is a valid teaching set if S fully intersects E (intersects each element of E). Similarly, black-box machine teaching is like a game where elements of E are revealed by learner queries and a solution S ⊂ [m] must intersect every element. In fact, there is an equivalence between these two views, observed in Goldman & Kearns (1995) as the relationship between set cover and machine teaching. For a more complete exposition, see Appendix F. Define the extended error sets as E+ = {L^(err)(S) ∖ S : S ⊂ [m]} − {∅}. For interpolating learners, E+ = E. If we find a set S ⊂ [m] that fully intersects E+, we are guaranteed that when the learner is trained on S, there will be no errors outside of S: S is error inclusive. Moreover, we show that fully intersecting E+ is in fact equivalent to error inclusivity.

Proposition 3. Fix a dataset. If a learner is invariant to consistent additions, then S fully intersects E+ if and only if S is error inclusive.

Thus, for learners invariant to consistent additions, the error squashing technique converts machine teaching with non-interpolating learners to set cover, though over an expanded set system. Black-box machine teaching algorithms (which are online set cover algorithms in disguise) can then be run with the same theoretical guarantees, though with a larger N. Note that the same is not true of k-extended teaching sets (Cicalese et al., 2020), which require solving a "generalized set cover" problem that is harder to design algorithms for (e.g., Cicalese et al. (2020) do not implement one).
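Proposition 3 can be checked exhaustively on a toy instance. The sketch below builds E+ by enumerating all subsets for a ranked minimal error learner over two constant hypotheses (a learner invariant to consistent additions by Proposition 1); the pool and hypotheses are illustrative.

```python
from itertools import combinations

# toy pool: indices 0, 1 have label 0; index 2 has label 1
pool = [(0, 0), (1, 0), (2, 1)]
H = [lambda x: 0, lambda x: 1]          # ranked by list position

def L_err(S):
    """Indices of pool errors of the learner trained on subset indices S."""
    D = [pool[i] for i in S]
    errs = lambda h: sum(h(x) != y for x, y in D)
    k = min(errs(h) for h in H)
    h = next(h for h in H if errs(h) == k)
    return {i for i, (x, y) in enumerate(pool) if h(x) != y}

subsets = [set(c) for r in range(len(pool) + 1)
           for c in combinations(range(len(pool)), r)]

# extended error sets E+ = { L_err(S) \ S : S subset of [m] } - { empty set }
E_plus = {frozenset(L_err(S) - S) for S in subsets} - {frozenset()}

for S in subsets:
    inclusive = L_err(S) <= S                      # error inclusivity
    intersects = all(S & E for E in E_plus)        # fully intersects E+
    assert inclusive == intersects                 # Proposition 3 holds here
```

On this instance E+ = {{2}, {0, 1}}, and the error inclusive sets are exactly those hitting both, such as {0, 2}.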

5. EXPERIMENTS

We evaluate 7 subset selection methods: random sampling, entropy sampling (Lewis & Gale, 1994), forgetting events (Toneva et al., 2018), GraNd (Paul et al., 2021), DHPZ (Dasgupta et al., 2019), CFLM (Cicalese et al., 2020), and our introduced algorithm. Although there are many coreset selection techniques, we focus on the methods that achieve the highest accuracy regardless of computation cost. We compare to Entropy, Forgetting, and GraNd as the coreset selection methods because they perform best in the evaluation on CIFAR10 by Guo et al. (2022) in our subset size regime (20% to 60% of CIFAR10). More details can be found in Appendix B. We compare results on 6 common image datasets (all but CINIC10 retrieved using torchvision) and use the predefined train/test splits. The six datasets are: CIFAR10 (Krizhevsky, 2009), CIFAR100 (Krizhevsky, 2009), CINIC10 (Darlow et al., 2018), FashionMNIST (FMNIST) (Xiao et al., 2017), MNIST (LeCun, 1998), and SVHN (Netzer et al., 2011). We evaluate using three architectures: Myrtle, VGG, and ResNet10. Myrtle was created by Page (2018) by stripping away parts of ResNet and modifying the training procedure while balancing training speed and accuracy on CIFAR10; the resulting network achieves 96% accuracy on CIFAR10 in three minutes of training time. VGG (Simonyan & Zisserman, 2015) and ResNet10 (He et al., 2016) are created according to a common PyTorch library for the CIFAR datasets. More details, including training hyperparameters, can be found in Appendix B.

5.1. RESULTS

We first show results (see Figure 1) for the seven methods on all six datasets with one of the architectures, ResNet10. We note that the performance of the coreset methods varies, though the forgetting events technique consistently performs the best among them (lower test error). The machine teaching approaches behave similarly to each other and outperform all coreset methods. Note that in a few cases (SVHN and FMNIST), achieving near-zero pool error is insufficient to achieve minimal test error with the data. This is surprising because the models trained on the machine teaching subsets get less than 0.01% error on the pool outside of the subset, meaning that every point is either seen or predicted correctly by the model. Next, we study how varying the model architecture impacts performance. We show results for the seven methods with the three architectures on CIFAR10 in Figure 2. We can draw approximately the same conclusions, though surprisingly, Entropy performs worse than random on non-ResNet architectures. We note that while Entropy performs better than random in Guo et al. (2022) and Coleman et al. (2020), both of these papers use ResNet architectures. Full results for all combinations of datasets and architectures can be found in Appendix C. Although we do not measure or report wall clock time, a proxy is the number of times the model is trained: for Entropy and Forgetting, the model is trained once; for GraNd, ten times; and the number of iterations for the machine teaching methods is shown in Table 1. Finally, we perform cross-architecture experiments where the model architecture used during subset selection differs from the model architecture used for evaluation. Interestingly, the behavior of the machine teaching methods differs dramatically, from transferring only as well as random to transferring as well as the full dataset. See Figure 3 for an example.
All 54 combinations of datasets, subset selection architectures, and evaluation architectures can be found in Appendix C. 

6. RELATED WORK

We note that a number of works, sometimes known as coreset methods, focus on reducing the time to train a model once by training with less data, perhaps with a drop in performance (Coleman et al., 2020; Mirzasoleiman et al., 2020; Paul et al., 2021; Killamsetty et al., 2021b; a). A key challenge in that formulation is creating a data selection algorithm that is computationally fast enough to avoid negating the computational gain of training on less data. In this work, we focus on data selection when we will train multiple times on the data subset (curriculum learning, continual learning, etc.), so computational cost is less of an issue. Data subset selection without knowing the labels of the datapoints before a point is selected is known as active learning (Settles, 2009), a strictly harder problem. Note that the performance of an optimal data subset collected with full knowledge of the labels is an upper bound on how well an active learning algorithm can perform. As mentioned throughout this work, machine teaching (Goldman & Kearns, 1995; Zhu et al., 2018) is closely related to subset selection; especially Dasgupta et al. (2019) and Cicalese et al. (2020). In those works, the specification for machine teaching with black-box models is a bit different from our formulation. In particular, they require the teaching set to be built up iteratively so that the queried subsets form a nested sequence. While natural from a machine teaching perspective, this requirement is unnecessary for subset selection, as we can train on arbitrary subsets before returning an arbitrary subset. This less restrictive specification is likely why the log m factor can be trimmed from our algorithm's subset size; see Alon et al. (2009) for more details on a related lower bound. An adjacent problem setting is that of "Dataset Distillation" (Wang et al., 2018; Nguyen et al., 2021) or "Dataset Condensation" (Zhao et al., 2020; Zhao & Bilen, 2021), where a very small (e.g.
100 images) dataset of synthetic images yields decent, but markedly sub-par, performance. Additionally, the optimized synthetic images can be very far from the natural data distribution.

7. DISCUSSION

In this work, we bring together the neural network coreset literature and the recent machine teaching literature on data subset selection. At the same time, we streamline the theory for machine teaching by proving the correctness of an effective heuristic used in machine teaching implementations and by closing theoretical gaps in the asymptotic number of learner queries and the size of the returned subset. More broadly, we hope the insights produced in this work will spur further research on selecting informative datasets and the role of data in learning.

A APPENDIX OVERVIEW

This appendix is split into 5 sections:
• Additional Experimental Results

The machine teaching techniques, including ours, have three implementation adaptations. First, all methods are run with error squashing. Second, instead of terminating when no pool errors (outside of the training errors) are made, the algorithms return when the number of pool errors is less than 10^{-4} m. Third, to avoid training on very small subsets where training instabilities occur, when a machine teaching algorithm attempts to query a subset with fewer than 10^4 points, we randomly sample enough points so that the total number of points is exactly 10^4. Because MNIST requires significantly fewer points, we randomly sample to 10^3 points for MNIST.

• Experimental

All machine teaching algorithms include a hyperparameter that roughly corresponds to $\log N$. For each method and dataset, we tuned over the set $\{2^k \cdot 10^3 : k \in \mathbb{Z}\}$ to find the minimal hyperparameter value for which the algorithm returned successfully. See Table 2. Our proposed algorithm additionally has the hyperparameter $\hat d$, which was set to $10^3$ for all experiments except MNIST, where it was set to $10^2$.
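The tuning procedure above can be sketched as a doubling sweep. This is an illustrative sketch under the assumption that a failed run is signaled by a `None` return; the helper name and the search range are hypothetical.

```python
def tune_log_n_parameter(run_algorithm, k_min=-3, k_max=10):
    """Search the grid {2**k * 1000 : k in Z} for the smallest value at
    which run_algorithm(value) returns successfully (non-None subset)."""
    for k in range(k_min, k_max + 1):
        value = (2 ** k) * 1000
        if run_algorithm(value) is not None:
            return value
    return None  # no value in the searched range succeeded
```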

B.2 CORESET IMPLEMENTATION DETAILS

Forgetting scores were calculated while training the subset selection model once. Ten different subset selection models were trained for the ten different runs. Entropy was calculated using the softmax probability distribution after training until completion. Ten different subset selection models were trained for the ten different runs. GraNd was calculated as the L2 norm of the gradient of cross entropy loss with respect to the last layer's weights and biases, averaged over ten runs. All ten replications used the same GraNd scores (averaged from ten runs).
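As an illustration of the entropy baseline described above, the following sketch computes per-example entropy scores from softmax outputs and selects the top-scoring examples. The ranking direction (higher entropy selected first) and function names are assumptions, not the paper's code.

```python
import math

def entropy_scores(prob_matrix):
    """Entropy of each example's softmax distribution; higher entropy is
    treated as more informative when ranking examples."""
    scores = []
    for probs in prob_matrix:
        scores.append(-sum(p * math.log(p) for p in probs if p > 0))
    return scores

def top_k_indices(scores, k):
    """Indices of the k highest-scoring examples."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
```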

B.3 MYRTLE DETAILS

Details for the Myrtle architecture can be found at https://github.com/davidcpage/cifar10-fast. Specifically, we use the updated version, which uses a preprocessing whitening block and a weighted KL and cross-entropy loss, as well as tricks such as Ghost BatchNorm, two sets of SGD optimizers, and exponential moving averages. Note that we use neither CutOut augmentation nor test-time augmentation; we stick to training augmentation with crops and flips. We keep most hyperparameters at their defaults, except we train for 50 × 50000 steps and set the base learning rate to 0.4.

B.4 VGG AND RESNET10 DETAILS

The code for the VGG and ResNet10 can be found at https://github.com/kuangliu/pytorch-cifar. VGG was created with the command VGG('VGG13') and ResNet10 was created with the command ResNet(BasicBlock, [1, 1, 1, 1]). Both architectures were trained using SGD with momentum 0.9 and weight decay 0.0005, and a triangular learning rate schedule with a maximum learning rate of 0.1. Cross-entropy loss and a gradient scaler were also used. The number of gradient steps was set to 50 × m (i.e., 50 epochs for full training; more epochs for subsets).
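A triangular schedule like the one above can be sketched as a piecewise-linear function. The peak location (`peak_frac`) is an assumption; the paper only specifies the maximum learning rate of 0.1.

```python
def triangular_lr(step, total_steps, max_lr=0.1, peak_frac=0.5):
    """Triangular learning-rate schedule: linear warmup to max_lr at
    peak_frac * total_steps, then linear decay back to zero."""
    peak = peak_frac * total_steps
    if step <= peak:
        return max_lr * step / peak
    return max_lr * (total_steps - step) / (total_steps - peak)
```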

C ADDITIONAL EXPERIMENTS

We first present the number of iterations for the machine teaching methods with Myrtle and VGG in Table 3 and Table 4 . This appendix includes the full experimental plots. In the first figure (Figure 4 ), all three architectures are run with all six datasets. The following three figures are cross-architecture experiments and are grouped by the evaluation architecture. See Figure 5 for Myrtle evaluation, Figure 6 for VGG evaluation, and Figure 7 for ResNet10 evaluation. "CIFAR10 with Myrtle" means Myrtle was used for both subset selection and evaluation, while "CIFAR10 from VGG to ResNet10" means VGG was used for subset selection and ResNet10 was used for evaluation. 

D UPPER BOUND PROOFS

In this section, we prove the following theorem through a sequence of lemmas. Throughout this section, we implicitly assume $\hat d \geq d^*$ and $\hat N \geq N$.

D.1 HIGH PROBABILITY EVENT

Set $T = 2\hat N \lfloor \log_2 m \rfloor$ and recall that $\xi = \ln(4\hat N^2 \log_2(m)/\delta)$.

$G_{h,t} = \left(|S_t \cap E_h| \geq 1\right) \vee \left(\sum_{i \in E_h} p_{t,i} < \xi\right)$ (1)

$G = \bigcap_{t=1}^{T} \bigcap_{h \in H} G_{h,t}$

Lemma 1. Event $G$ holds with probability at least $1 - \frac{\delta}{2}$.

Proof. Fix any $h \in H$ and $t \leq T$. Note

$\neg G_{h,t} = \left(|S_t \cap E_h| = 0\right) \wedge \left(\sum_{i \in E_h} p_{t,i} \geq \xi\right)$

Examine the random variable $|S_t \cap E_h| = \sum_{i \in E_h} \mathrm{Bernoulli}(p_{t,i})$. Set $P_{h,t} = \sum_{i \in E_h} p_{t,i}$. By a Chernoff bound and taking the limit $\delta \to 1$,

$\Pr(|S_t \cap E_h| < (1-\delta)P_{h,t}) \leq \left(\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\right)^{P_{h,t}}$

$\Pr(|S_t \cap E_h| = 0) \leq \exp(-P_{h,t})$

Therefore, $\Pr(\neg G_{h,t}) \leq \exp(-\xi)$. Then, by a union bound,

$\Pr(\neg G) = \Pr\left(\bigcup_{t=1}^{T} \bigcup_{h \in H} \neg G_{h,t}\right)$ (6)

$\leq \sum_{t=1}^{T} \sum_{h \in H} \Pr(\neg G_{h,t}) \leq T N \exp(-\xi)$ (8)

$\leq 2\hat N \log_2(m) \, N \, \frac{\delta}{4\hat N^2 \log_2(m)}$ (9)

$\leq \frac{\delta}{2}$ (10)

Lemma 2. If $G$ holds, then for $t \leq T$, $\sum_{i \in E_t} p_{t,i} < \xi$.

Proof. By definition of the interpolating model, $|S_t \cap E_t| = 0$; therefore, since $G$ holds, $\sum_{i \in E_t} p_{t,i} < \xi$.

Note this implies that if $G$ holds and the algorithm terminates before $T$, we will not return FAILURE TYPE 1.
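The key inequality in Lemma 1, $\Pr(|S_t \cap E_h| = 0) \leq \exp(-P_{h,t})$ for a sum of independent Bernoulli variables, can be checked numerically. This is an illustrative Monte Carlo sketch; the function name and the test probabilities are hypothetical.

```python
import math, random

def prob_empty_intersection(probs, trials=20_000, seed=0):
    """Monte Carlo estimate of Pr(no index is sampled) when index i is
    included independently with probability p_i."""
    rng = random.Random(seed)
    empty = 0
    for _ in range(trials):
        if not any(rng.random() < p for p in probs):
            empty += 1
    return empty / trials
```

The exact probability is $\prod_i (1 - p_i)$, which the bound $\exp(-\sum_i p_i)$ dominates since $1 - p \leq e^{-p}$.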

D.2 BOUND ON NUMBER OF ITERATIONS

Lemma 3. If G holds and N < 2d * , then the algorithm returns successfully within N ≤ 2d * k 0 iterations. Proof. First note that T = 2 N log 2 (m) ≥ N . Note that, under G and if the probabilities are not halved, a hypothesis cannot be returned twice by the learner. This is because when a learner returns h t , for the next iterations, the weight will be at least ξ and under G, by Lemma 2, h t cannot be returned again. Thus, if N ≤ N < 2 d, the learner will return all hypotheses before halving. Lemma 4. If G holds and N ≥ 2d * , Algorithm 1 returns successfully within 2d * k 0 + 1 iterations. Proof. Fix a subset solution S * that has the optimal size d * . Let T = 2d * k 0 + 1. Note that T ≤ 2 d⌊log 2 (m)⌋ ≤ N ⌊log 2 (m)⌋ ≤ T . Suppose we have not returned by iteration T ; then, by assumption ∀t ≤ T : |E t | ≥ 1. Note that E t ∈ E, so |S * ∩ E t | ≥ 1. i∈S * 1[i ∈ E t ] ≥ 1 (11) i∈S * T t=1 1[i ∈ E t ] ≥ T Note that at every iteration, k t,i for i in E t decrease by at least one. By G holding, E t ≤ ξ. Thus, ∆ t ≥ 1. And further, if k t,i = 0, then p t,i = 1, i ∈ S t , and i ̸ ∈ E t . k T +1,i ≤ k 0 - T t=1 1[i ∈ E t ] + T t=1 1[t (mod) 2 d = 0] (13) But since k T +1,i ≥ 0, k 0 - T t=1 1[i ∈ E t ] + T 2 d ≥ 0 (14) d * k 0 - i∈S * T t=1 1[i ∈ E t ] + d * T 2 d ≥ 0 (15) d * k 0 -T + d * T 2 d ≥ 0 (16) d * k 0 ≥ 1 - d * 2 d T (17) d * k 0 ≥ 1 2 T (18) 2d * k 0 ≥ T which is a contradiction. So the algorithm must have returned before T . Thus, in either case ( N < 2 d or N ≥ 2 d), the algorithm terminates before T iterations and within 2d * k 0 + 1 iterations.

D.2.1 SIZE OF RETURNED SUBSET

We now analyze the size of the returned set $|\hat S|$. First, note that at every iteration, $E[|S_t|] = \sum_i p_{t,i}$. We can think of bounding $|\hat S|$ as bounding each $|S_t|$, which involves bounding the mean ($\sum_i p_{t,i}$) and the deviation from the mean. First we bound the deviation.

Lemma 5. If $\sum_{i=1}^m p_{t,i} \leq L$ for all $t \leq T$, then $|S_t| \leq \max(e^2 L, \ln(2T/\delta))$ for each $t \leq T$ with probability at least $1 - \frac{\delta}{2}$.

Proof. Note that $|S_t| = \sum_{i=1}^m \mathrm{Bernoulli}(p_{t,i})$. Define $P_t = \sum_{i=1}^m p_{t,i}$ and note $0 \leq P_t \leq L$. Fix $t$. By a Chernoff bound, for any $\kappa > 0$,

$\Pr[|S_t| \geq (1+\kappa)P_t] \leq \left(\frac{e^{\kappa}}{(1+\kappa)^{(1+\kappa)}}\right)^{P_t}$ (20)

Define $c = \max(e^2, \ln(2T/\delta)/L)$. Let $(1+\kappa)P_t = cL$, so $\kappa = \frac{cL}{P_t} - 1$. Then,

$\Pr[|S_t| \geq cL] \leq \frac{e^{cL - P_t}}{(cL/P_t)^{cL}} \leq \frac{e^{cL}}{c^{cL}}$ (23)

$= \left(\frac{e}{c}\right)^{cL} \leq e^{-cL}$ (25)

$\Pr[|S_t| \geq \max(e^2 L, \ln(2T/\delta))] \leq \frac{\delta}{2T}$

Finally, a union bound over all iterations $t \leq T$ yields the result.

Lemma 6. If event $G$ holds, then for $t \leq T$, $\sum_{i=1}^m p_{t,i} \leq 8\hat d\xi$.

Proof. We show something slightly stronger: for $t \leq T$,

$\sum_{i=1}^m p_{t,i} \leq \left(2\hat d + ((t-1) \bmod 2\hat d)\right)(2\xi)$

We prove this by induction. As a base case, for $t = 1$,

$\sum_{i=1}^m p_{1,i} = m 2^{-k_0}$ (29)

$\leq m 2^{-\log_2(m/\hat d) + 1}$ (30)

$= 2\hat d$ (31)

$\leq (2\hat d + 0)(2\xi)$

For the inductive case, we examine two cases: $t \bmod 2\hat d \neq 0$ and $t \bmod 2\hat d = 0$. In both cases, we rely on Lemma 2 and event $G$ holding. For the first case,

$\sum_{i=1}^m p_{t+1,i} = \sum_{i=1}^m 2^{-k_{t+1,i}} \leq \sum_{i=1}^m 2^{-k_{t,i} + \Delta_t 1[i \in E_t]}$ (34)

$\leq \sum_{i=1}^m 2^{-k_{t,i}} + \sum_{i \in E_t} 2^{-k_{t,i} + \Delta_t}$

$= \sum_{i=1}^m p_{t,i} + 2^{\Delta_t} \sum_{i \in E_t} p_{t,i}$

$\leq \sum_{i=1}^m p_{t,i} + \frac{2\xi}{\sum_{i \in E_t} p_{t,i}} \sum_{i \in E_t} p_{t,i} = \sum_{i=1}^m p_{t,i} + 2\xi$

$\leq \left(2\hat d + ((t-1) \bmod 2\hat d)\right)(2\xi) + 2\xi$

$\leq \left(2\hat d + (((t+1)-1) \bmod 2\hat d)\right)(2\xi)$

For the second case (a halving step),

$\sum_{i=1}^m p_{t+1,i} = \sum_{i=1}^m 2^{-k_{t+1,i}} \leq \sum_{i=1}^m 2^{-k_{t,i} + \Delta_t 1[i \in E_t] - 1}$ (42)

$\leq \frac{1}{2}\left(\sum_{i=1}^m 2^{-k_{t,i}} + \sum_{i \in E_t} 2^{-k_{t,i} + \Delta_t}\right)$

$\leq \frac{1}{2}\left(\sum_{i=1}^m p_{t,i} + 2\xi\right)$ (44)

$\leq \frac{1}{2}\left(2\hat d + ((2\hat d - 1) + 1)\right)(2\xi)$ (45)

$= 4\hat d\xi$ (46)

$= \left(2\hat d + (((t+1)-1) \bmod 2\hat d)\right)(2\xi)$

Putting together the bounds on the mean and its deviation, we obtain a bound on $|S_t|$ and thus $|\hat S|$.

Proposition 4. With probability $1 - \delta$,

$|\hat S| \leq \max(e^2 \cdot 8\hat d\xi, \ln(2T/\delta)) \leq e^2 \cdot 8\hat d\xi$ (49)

$= O(\hat d(\log(\hat N) + \log\log m + \log(1/\delta)))$

E LOWER BOUNDS

In this section, we prove three lower bounds.

where $//$ signifies integer division and $\%$ signifies modulo. Let $R(h_i) = i$ and $R(h_\infty) = k\ell$, so $h_0$ is preferred, and we must rule out all but the last hypothesis to achieve zero error. We assume the teacher knows everything about the structure of the hypotheses except for the randomness in defining $\vec c$ (the random "key"). Note that we can cover all non-optimal hypotheses with $k$ points, so $d^* = k$.

We now describe the dynamics of making queries. At every iteration $t$, the teacher can keep track of an uncertainty set for each element of $\vec c$. In particular, let $V_{t,i} \in \mathcal P([C])$ be an uncertainty set such that we know $\vec c_i \in V_{t,i}$ at the $t$th iteration. Each uncertainty set begins as $[C]$. If the $i$th hypothesis is returned by the learner, the teacher knows $V_{t,i} = \{\vec c_i\}$. If we know that we covered $h_i$ with a query set $X_t$ (by knowing the ranked order of hypotheses), then we know that $\vec c_i \in \bigcup_{x \in X_t : x_1 = i//\ell} \{x_{2, i\%\ell}\}$.

Given the above, we can refine the problem to the following problem with states $Z_t \in [C]^{k\ell}$, actions $A_t \in [C]^{k\ell}$, and randomness $R_t \in [k\ell] \cup \{\infty\}$. Let $Z_{t,i} = |V_{t,i}|$ be the size of the uncertainty set. Let

$A_{t,i} = \left|\left(\bigcup_{x \in X_t : x_1 = i//\ell} \{x_{2, i\%\ell}\}\right) \cap V_{t,i}\right|$

be the number of elements within the uncertainty set that we cover with $X_t$. Finally, let $R_t$ be the hypothesis returned from the learner to the teacher at the $t$th iteration. Given a state $Z_t$, the teacher chooses an $X_t$, which can be converted to an $A_t$ such that $0 \leq A_{t,i} \leq Z_{t,i}$ and $|X_t| = \sum_{i_1 \in [k]} \max_{i_2 \in [\ell]} A_{t, i_1\ell + i_2}$. The learner gives the teacher a hypothesis (and thus $R_t$) that is random with respect to the randomness in $\vec c$, such that

$\Pr(R_t = i) = \left(\prod_{j=0}^{i-1} \frac{A_{t,j}}{Z_{t,j}}\right)\left(1 - \frac{A_{t,i}}{Z_{t,i}}\right)$

$\Pr(R_t = \infty) = \prod_{j=0}^{k\ell-1} \frac{A_{t,j}}{Z_{t,j}}$

In other words, $h_i$ is returned by the learner if the query $X_t$ covers hypotheses $h_0, \ldots, h_{i-1}$ but fails to cover $h_i$. Finally, we return $h^* = h_\infty$ if the rest of the hypotheses are covered. Then, on a state $Z_t$, action $A_t$, and learner's hypothesis $R_t$, we can generate the next state as

$Z_{t+1,i} = \begin{cases} A_{t,i} & i < R_t \\ 1 & i = R_t \\ Z_{t,i} & i > R_t \end{cases}$ (59)

E.2.1 ANALYSIS FOR LOWER BOUND

The analysis hinges on examining the potential function

$P(Z_t) = \sum_{i=0}^{k\ell-1} \ln\frac{C}{Z_{t,i}}$

First, note that the potential function is monotone increasing (not strictly) in the number of iterations. Further note that $P(Z_1) = 0$ since $Z_{1,i} = C$. At any iteration $t$ and state $Z_t$, the teacher can cover all possible (with respect to the uncertainty sets) hypotheses with exactly

$Q(Z_t) = \sum_{i_1 \in [k]} \max_{i_2 \in [\ell]} Z_{t, i_1\ell + i_2}$

points; furthermore, all uncertainty sets cannot be covered with fewer points. For an algorithm to achieve a final subset of size at most $\lambda k$, it must be the case that $Q(Z_t) \leq \lambda k$.

Lemma 7. For an algorithm to achieve a final subset at iteration $t$ of size at most $\lambda k$, it must be the case that $P(Z_t) \geq k\ell \ln\frac{C}{\lambda}$.

Proof. This follows from solving the relaxed optimization problem where $Z_{t,i} \in \mathbb R$ and $1 \leq Z_{t,i} \leq C$:

$\min P(Z_t) \quad\text{such that}\quad Q(Z_t) \leq \lambda k$

which has a global solution where all elements of $Z_t$ are equal: $Z_{t,i} = \lambda$.

Lemma 8. $E[P(Z_{t+1}) - P(Z_t)] \leq 1 + \ln(C)$.

Proof.

$E[P(Z_{t+1}) - P(Z_t)] = E\left[\sum_i \ln\frac{Z_{t,i}}{Z_{t+1,i}}\right]$

$= \sum_{r=0}^{\infty} \Pr(R_t = r)\left(\sum_{i=0}^{r-1} \ln\frac{Z_{t,i}}{A_{t,i}} + \ln\frac{Z_{t,r}}{1}\right)$ (66)

$\leq \ln(C) + \sum_{r=0}^{\infty} \Pr(R_t = r) \sum_{i=0}^{r-1} \ln\frac{Z_{t,i}}{A_{t,i}}$

$= \ln(C) + \sum_{i=0}^{k\ell-1} \sum_{r=i+1}^{\infty} \Pr(R_t = r) \ln\frac{Z_{t,i}}{A_{t,i}}$

$= \ln(C) + \sum_{i=0}^{k\ell-1} \Pr(R_t > i) \ln\frac{Z_{t,i}}{A_{t,i}}$

$= \ln(C) + \sum_{i=0}^{k\ell-1} \left(\prod_{j=0}^{i} \frac{A_{t,j}}{Z_{t,j}}\right) \ln\frac{Z_{t,i}}{A_{t,i}}$

Define $\pi \in \mathbb R^{k\ell}$ where $\pi_i = \frac{A_{t,i}}{Z_{t,i}}$, so $0 \leq \pi_i \leq 1$. We can solve the relaxed problem of maximizing over $\pi$ to upper bound the expression:

$E[P(Z_{t+1}) - P(Z_t)] \leq \ln(C) + \sum_{i=0}^{k\ell-1} \left(\prod_{j=0}^{i} \pi_j\right) \ln\frac{1}{\pi_i}$ (71)

Define $f_\pi(\sigma) = \pi\left(\ln\frac{1}{\pi} + \sigma\right)$. Then,

$\sum_{i=0}^{k\ell-1} \left(\prod_{j=0}^{i} \pi_j\right) \ln\frac{1}{\pi_i} = \pi_0\ln\frac{1}{\pi_0} + \pi_0\pi_1\ln\frac{1}{\pi_1} + \cdots + \pi_0\pi_1\cdots\pi_{k\ell-1}\ln\frac{1}{\pi_{k\ell-1}}$ (72)

$= f_{\pi_0}(f_{\pi_1}(\ldots f_{\pi_{k\ell-2}}(f_{\pi_{k\ell-1}}(0))\ldots))$

Note that if $\sigma \leq 1$, then $f_\pi(\sigma) \leq 1$ for any $0 \leq \pi \leq 1$. So, by induction,

$\sum_{i=0}^{k\ell-1} \left(\prod_{j=0}^{i} \pi_j\right) \ln\frac{1}{\pi_i} \leq 1$ (74)

and we get the result.

Intuitively, if an algorithm can only increase $P(Z_t)$ (in expectation) by $1 + \ln(C)$ per iteration, and the algorithm is not done until $P(Z_t) \geq k\ell\ln\frac{C}{\lambda}$, we have a lower bound.

Lemma 9. For any algorithm that returns a subset of size at most $\lambda k$ at random iteration $\tau$,

$\Pr\left(\tau \leq \frac{k\ell\ln(C/\lambda)}{2(1+\ln(C))}\right) \leq \frac{1}{2}$

Proof. Noting that $P(Z_1) = 0$, for any $T$,

$E[P(Z_{T+1})] = \sum_{t=1}^{T} E[P(Z_{t+1}) - P(Z_t)] \leq T(1 + \ln(C))$

By Markov's inequality,

$\Pr(\tau \leq T) \leq \Pr\left(P(Z_{T+1}) \geq k\ell\ln\frac{C}{\lambda}\right)$ (78)

$\leq \frac{E[P(Z_{T+1})]}{k\ell\ln\frac{C}{\lambda}}$ (79)

$\leq \frac{T(1+\ln(C))}{k\ell\ln\frac{C}{\lambda}}$ (80)

If we set $T = \left\lfloor\frac{k\ell\ln(C/\lambda)}{2(1+\ln(C))}\right\rfloor$, then $\Pr(\tau \leq T) \leq \frac{1}{2}$, or equivalently,

$\Pr\left(\tau \geq \frac{k\ell\ln(C/\lambda)}{2(1+\ln(C))}\right) \geq \frac{1}{2}$

Finally, we convert this bound into something more usable. Note that $d^* = k$, $m = kC^\ell$, and $N = k\ell + 1$; thus $\ell = \frac{\ln(m/k)}{\ln(C)}$. Set $C = \lceil e\lambda\rceil$. Then, with probability at least $1/2$, $\tau$ is at least

$\frac{k\ell\ln(C/\lambda)}{2(1+\ln(C))} \geq \frac{k\ln(m/k)\ln(e)}{2(1+\ln(e\lambda+1))\ln(e\lambda+1)}$ (82)

$\geq \frac{k\ln(m/k)}{2(2+\ln(\lambda)+1/e)(1+\ln(\lambda)+1/e)}$

$\geq \frac{k\ln(m/k)}{2(3+\ln(\lambda))^2}$ (84)

$= \frac{d^*\ln(m/d^*)}{2(3+\ln(\lambda))^2}$ (85)

If $\lambda \leq \lambda'\frac{\log(m)}{\log(N)} \leq \lambda'\frac{\log(m)}{\log(k\ln(m/k))}$ (for a constant $\lambda'$), then the expected number of iterations is $\Omega(d^*\log(m/d^*))$.

F ERROR SQUASHING

F.1 BACKGROUND

In this subsection, we review related theoretical work in more detail as background to our framework and algorithms.

F.1.1 SETCOVER AND INTERACTIVE VARIANTS

Set cover is a classic computer science problem (Karp, 1972). While there are many similar and equivalent formulations, for this work we formulate the problem in terms of $m$ binary decision variables and $N$ elements, where each element is represented by the decision variables that would "cover" it: $Z_i \subset [m]$ for each $i \in [N]$. A solution is a subset $S \subset [m]$ such that $\forall i \in [N] : \exists s \in S : s \in Z_i$. In other words, we choose a subset $S \subset [m]$ of the decision variables so that every element's set $Z_i$ intersects $S$. We wish to find a solution of smallest size $|S|$. In the standard non-interactive version of set cover, all elements and sets are known, and there exist algorithms that return a solution of size $O(\log(N) \cdot C_{\mathrm{OPT}})$, where $C_{\mathrm{OPT}}$ is the size of the optimal solution.

In an interactive variant known as online set cover (Alon et al., 2009), the elements are not known at the beginning, only the number of decision variables $m$. The algorithm is initialized with $S_0 = \emptyset$. Then, for each iteration $t = 1, 2, \ldots$, an adversary chooses an element $i_t \in [N]$ that is not intersected by the current solution (i.e., $S_{t-1} \cap Z_{i_t} = \emptyset$) and reveals the decision variables that would cover $i_t$ (i.e., $Z_{i_t}$). The algorithm then chooses a variable $s_t \in Z_{i_t}$, which is permanently added to the solution: $S_t = S_{t-1} \cup \{s_t\}$. This process continues until all elements are covered and the adversary has no possible elements to choose. Note that online set cover is harder than standard set cover: if we were given all elements, we could simulate an adversary. Furthermore, the number of iterations $t$ is exactly the size of the final set. Alon et al. (2009) provide an algorithm that returns a solution of size $O(\log(N)\log(m) \cdot C_{\mathrm{OPT}})$, where $C_{\mathrm{OPT}}$ is the optimal (offline) solution size.

In this work, we examine another interactive set cover variant that is harder than standard set cover but is a relaxed version of online set cover, which we call exploratory set cover. Like online set cover, the elements are not known at the beginning; the algorithm is only given the number of decision variables $m$. The algorithm has access to an adversarial oracle which takes a potential solution $S_t \subset [m]$ and either reveals that $S_t$ is a valid solution, or chooses an element $i_t$ such that $Z_{i_t}$ has an empty intersection with $S_t$ and reveals $Z_{i_t}$. However, unlike online set cover, the queried sets $S_t$ are arbitrary; for example, they need not be nested. At any iteration, the algorithm can return a final solution $\hat S$ that must intersect all $Z_i$. An algorithm is evaluated on two criteria: the number of exploratory iterations and the size of the final solution $|\hat S|$.

The proof of Proposition 1 can be found in Appendix F.6. Invariance to consistent additions is more general than ranked minimal error learners: see an example in Appendix F.7.1.
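The $O(\log N)$ factor for the offline problem is achieved by the classic greedy algorithm, sketched below. This is a standard illustration, not the paper's algorithm; it assumes every element is coverable by at least one decision variable.

```python
def greedy_set_cover(m, elements):
    """Classic greedy set cover: `elements` maps each element name to the
    set Z_i ⊆ [m] of decision variables covering it. Repeatedly pick the
    variable covering the most uncovered elements; this achieves an
    O(log N) approximation factor."""
    uncovered = dict(elements)
    solution = set()
    while uncovered:
        # pick the decision variable covering the most uncovered elements
        best = max(range(m), key=lambda s: sum(s in z for z in uncovered.values()))
        solution.add(best)
        uncovered = {n: z for n, z in uncovered.items() if best not in z}
    return solution
```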

F.3 INDUCED HYPOTHESIS CLASSES

Fix a size-$m$ dataset $\{(x_i, y_i)\}_{i=1}^m \subset (X \times Y)^m$. Note that the learner might have a very large hypothesis class, maybe even large enough to fit arbitrary labels for an $m$-sized dataset ($|H| = |Y|^m$). However, we can define the induced hypothesis class as

$H = \{L(\{(x_i, y_i) : i \in S\}) : S \subset [m]\}$

which could be significantly smaller, depending on the structure of the model and data. Then, we can define the error sets as

$E = \{\{i : h(x_i) \neq y_i\} : h \in H\} - \{\emptyset\}$ (92)

where $E_h \in E$ is the set of datapoints that eliminate a hypothesis $h$. For notational convenience, define $L^{(\mathrm{err})} : \mathcal P([m]) \to \mathcal P([m])$ as the set of errors when trained on a subset $S$:

$L^{(\mathrm{err})}(S) = \{i : L(\{(x_{i'}, y_{i'}) : i' \in S\})(x_i) \neq y_i\}$ (94)

Then $E$ is simply

$E = \{L^{(\mathrm{err})}(S) : S \subset [m]\} - \{\emptyset\}$
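For a small pool, the sets $E$ and $E^+$ (defined in Appendix F.4.2) can be enumerated by brute force over all subsets. The following sketch assumes the learner is given as an oracle `learner_err` mapping a training subset to its pool errors; the function names are hypothetical.

```python
from itertools import combinations

def error_sets(m, learner_err):
    """Enumerate E = {L_err(S) : S ⊆ [m]} - {∅} and the extended sets
    E+ = {L_err(S) \ S : S ⊆ [m]} - {∅}; `learner_err` maps a frozenset S
    to the set of pool indices mispredicted after training on S."""
    E, E_plus = set(), set()
    for r in range(m + 1):
        for S in combinations(range(m), r):
            S = frozenset(S)
            errs = frozenset(learner_err(S))
            if errs:
                E.add(errs)
            if errs - S:
                E_plus.add(errs - S)
    return E, E_plus
```

On the learner of the example in Appendix F.7.2 (0-indexed), this recovers $E = \{\{1\}, \{2,3\}\}$ and the strictly larger $E^+$.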

F.4 ERROR INCLUSIVITY

We wish to find a set $S$ that makes as few errors as if we trained on the entire dataset: $|L^{(\mathrm{err})}(S)| = |L^{(\mathrm{err})}([m])|$. However, this condition is somewhat weak: we might happen to have $|L^{(\mathrm{err})}(\emptyset)| = |L^{(\mathrm{err})}([m])|$, in which case $S = \emptyset$ satisfies it but is perhaps unlikely to generalize to test points. Instead, we might wish to find a set $S$ that, when trained on, yields no errors outside of $S$.

Definition 5. For a fixed dataset, we say a set $S$ is "error inclusive" if $L^{(\mathrm{err})}(S) \subset S$.

Note that $[m]$ is always error inclusive. Furthermore, if a learner is invariant to consistent additions, error inclusivity implies $L^{(\mathrm{err})}(S) = L^{(\mathrm{err})}([m])$.

Proposition 5. Suppose a learner $L$ is invariant to consistent additions. If a set $S$ is error inclusive, then $L^{(\mathrm{err})}(S) = L^{(\mathrm{err})}([m])$.

The proof of Proposition 5 is in Appendix F.

F.4.1 INTERPOLATING LEARNERS

We say a learner is "interpolating" if it makes no training errors: $L^{(\mathrm{err})}(S) \cap S = \emptyset$ for all $S$. In this case, training on the entire dataset achieves zero errors. Furthermore, note that for interpolating learners, error inclusivity of a set $S$ implies that training on $S$ results in zero error on the entire dataset.
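The two conditions above translate directly into checks against a learner oracle. A minimal sketch, with hypothetical function names, assuming `learner_err` maps a training subset to its pool errors:

```python
def is_error_inclusive(S, learner_err):
    """S is error inclusive if training on S yields no errors outside S:
    L_err(S) ⊆ S."""
    return set(learner_err(frozenset(S))) <= set(S)

def zero_pool_error(S, learner_err):
    """Check that training on S makes zero errors on the entire pool; for
    an interpolating learner this is implied by error inclusivity of S
    (L_err(S) ⊆ S and L_err(S) ∩ S = ∅ force L_err(S) = ∅)."""
    return not learner_err(frozenset(S))
```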

F.4.2 RESULTS

Define the extended error sets as

$E^+ = \{L^{(\mathrm{err})}(S) \setminus S : S \subset [m]\} - \{\emptyset\}$

Proposition 3. Fix a dataset. If a learner is invariant to consistent additions,

$S \text{ fully intersects } E^+ \iff S \text{ is error inclusive}$ (99)

The proof can be found in Appendix F.6. Define

$d^* = \min_{S \text{ fully intersects } E^+} |S|$ (100)

Note that in the case of interpolating learners, $E^+ = E$, and $d^*$ is the size of the optimal teaching set. Thus, as in the machine teaching case, we have reduced the problem of finding a good subset to finding a minimal set cover, though of an expanded set of elements.

Note that we can convert a subset selection algorithm for interpolating learners (an online or exploratory set cover algorithm) into a subset selection algorithm that does not assume an interpolating learner, simply by acting as if training errors do not exist. In particular, if we train on a subset $S_t$ and receive a hypothesis $h_t$, rather than intersecting $\{i : h_t(x_i) \neq y_i\}$, we intersect $\{i : h_t(x_i) \neq y_i \wedge i \notin S_t\}$. Put another way, when using a realizable algorithm with a non-interpolating learner, we can "squash" the training errors and pretend they do not exist. In fact, this technique is used as a heuristic in Cicalese et al. (2020) to make the algorithm work in practice. Thus, we provide a framework justifying a technique previously used only as a heuristic.

F.5 COMPARISON TO K-EXTENDED TEACHING DIMENSION

Cicalese et al. (2020) introduce the concept of a $k$-extended teaching set for a hypothesis class $H$ and a dataset $\{(x_i, y_i)\}_{i=1}^m \subset (X \times Y)^m$. In particular, let $k = \min_{h \in H} |\{i \in [m] : h(x_i) \neq y_i\}|$ be the minimal number of errors of a hypothesis in $H$. A set $S$ is a $k$-extended teaching set if, for any $h \in H$ such that $|\{i \in [m] : h(x_i) \neq y_i\}| \geq k+1$, we have $|\{i \in S : h(x_i) \neq y_i\}| \geq k+1$. In other words, any hypothesis that makes $k+1$ errors (more than the optimal number) on the entire dataset makes $k+1$ errors on the selected subset $S$.
Here, for ranked minimal error learners, we compare $k$-extended teaching sets to error inclusive sets (and thus sets that fully intersect $E^+$). Fix a dataset, hypotheses $H$, and ranking $\sigma$. Let $k$ be the minimal number of errors of a hypothesis $h \in H$ on the dataset. For a ranked minimal error learner, $|L^{(\mathrm{err})}([m])| = k$, so an error inclusive set must make $k$ errors. Therefore, any hypothesis $h$ with $\sigma(h) < \sigma(L(\{(x_i, y_i) : i \in [m]\})) = \sigma^*$ must incur at least $k+1$ errors on $S$, and any hypothesis $h$ with $\sigma(h) > \sigma^*$ must incur at least $k$ errors on $S$. A comparison of the two concepts is shown in Table 5. Note that error inclusivity requires one less error on Type II hypotheses and $k$ more errors on Type III (error optimal) hypotheses. We can construct cases where the smallest error inclusive subset is of size $m$ but the smallest $k$-extended teaching set is of size 0 (see Appendix F.7.3). In the other direction, there are cases where the smallest error inclusive subset is of size 2 while the smallest $k$-extended teaching set is of size $m-1$ (see Appendix F.7.4). Thus, the two notions of optimality are similar, but neither is a stronger condition than the other.

Table 5: A comparison between $k$-extended teaching sets and error inclusive sets. The three columns correspond to three types of hypotheses, separated by two attributes: whether a hypothesis is ranked higher or lower than $\sigma^*$, and whether the total number of errors for the hypothesis is optimal ($k$) or sub-optimal ($\geq k+1$). Note that $L(\{(x_i, y_i) : i \in [m]\})$ is the highest ranked hypothesis with optimal errors $k$, so there are no hypotheses with $k$ errors and rank lower than $\sigma^*$.

By Lemma 11, there are two cases:


Case 1: $|(L^{(\mathrm{err})}(S') \setminus S') \cap S| \geq 1$. We are done.

We examine a situation with three datapoints and binary labels. The dataset is $D = \{(x_1, 0), (x_2, 0), (x_3, 0)\}$ and the learner $L$ has two hypotheses $h_1$ and $h_2$, where

$h_1(x_1) = 1, \quad h_1(x_2) = 0, \quad h_1(x_3) = 0$

$h_2(x_1) = 0, \quad h_2(x_2) = 1, \quad h_2(x_3) = 1$

and the learner has the following mapping:

$L(\emptyset) = h_2$, $L(\{(x_1, 0)\}) = h_2$, $L(\{(x_2, 0)\}) = h_1$, $L(\{(x_3, 0)\}) = h_1$, $L(\{(x_1, 0), (x_2, 0)\}) = h_2$, $L(\{(x_1, 0), (x_3, 0)\}) = h_2$, $L(\{(x_2, 0), (x_3, 0)\}) = h_1$, $L(\{(x_1, 0), (x_2, 0), (x_3, 0)\}) = h_2$

Note that $h_2$ has two errors on $D$, while $h_1$ has only one error. However, $L(D) = h_2$, so the learner does not return the hypothesis with minimal error. Nevertheless, a simple examination yields invariance to consistent additions. For example, $L(\{(x_2, 0)\})(x_3) = y_3$ and $L(\{(x_2, 0)\}) = L(\{(x_2, 0), (x_3, 0)\})$.
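The "simple examination" above can be automated. The following sketch encodes the example's learner table (0-indexed) and checks invariance to consistent additions exhaustively; the variable names are illustrative.

```python
# The example: three points x1, x2, x3, all labeled 0 (0-indexed here).
H = {"h1": (1, 0, 0), "h2": (0, 1, 1)}  # each hypothesis's predictions
LEARNER = {  # the example's mapping from training subsets to hypotheses
    frozenset(): "h2", frozenset({0}): "h2", frozenset({1}): "h1",
    frozenset({2}): "h1", frozenset({0, 1}): "h2", frozenset({0, 2}): "h2",
    frozenset({1, 2}): "h1", frozenset({0, 1, 2}): "h2",
}

def invariant_to_consistent_additions():
    """Check L(D)(x_i) = 0 implies L(D ∪ {(x_i, 0)}) = L(D) for all D, i."""
    for D, h in LEARNER.items():
        for i in range(3):
            if H[h][i] == 0 and LEARNER[D | {i}] != h:
                return False
    return True
```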

F.7.2 COMPARISON BETWEEN INDUCED HYPOTHESIS CLASS AND EXTENDED ERROR SETS

Here we show that $E^+$ can be larger than $E$. We examine a situation with three datapoints and binary labels. The dataset is $D = \{(x_1, 0), (x_2, 0), (x_3, 0)\}$ and the learner $L$ has two hypotheses $h_1$ and $h_2$, where

$h_1(x_1) = 1, \quad h_1(x_2) = 0, \quad h_1(x_3) = 0$

$h_2(x_1) = 0, \quad h_2(x_2) = 1, \quad h_2(x_3) = 1$

and

$L(\emptyset) = h_2$, $L(\{(x_1, 0)\}) = h_2$, $L(\{(x_2, 0)\}) = h_1$, $L(\{(x_3, 0)\}) = h_1$, $L(\{(x_1, 0), (x_2, 0)\}) = h_2$, $L(\{(x_1, 0), (x_3, 0)\}) = h_2$, $L(\{(x_2, 0), (x_3, 0)\}) = h_1$, $L(\{(x_1, 0), (x_2, 0), (x_3, 0)\}) = h_1$

Then $E^+ = \{\{1\}, \{2, 3\}, \{2\}, \{3\}\}$ while $E = \{\{1\}, \{2, 3\}\}$.

F.7.3 K-EXTENDED TEACHING SET SUPERIORITY

In the following example, $H = \{h_i\}_i$ and $\sigma(h_i) = i$.

$y_\bullet = (0, 0, 0, \ldots, 0, 0)$ (153)

$h_1(x_\bullet) = (1, 0, 0, 0, 0, \ldots, 0, 0)$ (154)

$h_2(x_\bullet) = (0, 1, 0, 0, 0, \ldots, 0, 0)$ (155)

$h_3(x_\bullet) = (0, 0, 1, 0, 0, \ldots, 0, 0)$ (156)

$h_4(x_\bullet) = (0, 0, 0, 1, 0, \ldots, 0, 0)$ (157)

$\vdots$

$h_m(x_\bullet) = (0, 0, 0, 0, 0, \ldots, 0, 1)$

In this case, the smallest 1-extended teaching set is $\emptyset$, while the smallest error inclusive set is $\{1, 2, \ldots, m\}$.

F.7.4 ERROR INCLUSIVITY SUPERIORITY

In the following example, $H = \{h_i\}_i$ and $\sigma(h_i) = i$.

$y_\bullet = (0, 0, 0, 0, 0, \ldots, 0, 0, 0)$ (160)

$h_1(x_\bullet) = (1, 0, 0, 0, 0, \ldots, 0, 0, 0)$ (161)

$h_2(x_\bullet) = (0, 1, 0, 0, 0, \ldots, 0, 0, 1)$ (162)

$h_3(x_\bullet) = (0, 0, 1, 0, 0, \ldots, 0, 0, 1)$ (163)

$h_4(x_\bullet) = (0, 0, 0, 1, 0, \ldots, 0, 0, 1)$ (164)

$\vdots$

$h_{m-1}(x_\bullet) = (0, 0, 0, 0, 0, \ldots, 0, 1, 1)$

In this case, the smallest 1-extended teaching set is $\{2, 3, \ldots, m\}$, while the smallest error inclusive set is $\{1, m\}$.
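Both claims can be verified by brute force on a small instance of this example. The sketch below uses $m = 5$ with 0-indexed points and assumes the learner is the ranked minimal error learner over $H$ with $\sigma(h_i) = i$; the helper names are hypothetical.

```python
from itertools import combinations

M = 5  # small instance of the example above (pool indices 0..4, labels 0)
# h_1 errs only on point 0; h_i (i >= 2) errs on points i-1 and M-1
ERRORS = {1: {0}}
ERRORS.update({i: {i - 1, M - 1} for i in range(2, M)})
SIGMA = {i: i for i in ERRORS}  # sigma(h_i) = i

def learner_errors(S):
    """Ranked minimal error learner: among hypotheses with fewest errors on
    S, pick the smallest sigma; report that hypothesis's pool errors."""
    S = set(S)
    best = min(len(err & S) for err in ERRORS.values())
    h = min((i for i in ERRORS if len(ERRORS[i] & S) == best), key=SIGMA.get)
    return ERRORS[h]

def smallest(pred):
    """Brute-force the smallest subset of [M] satisfying pred."""
    for r in range(M + 1):
        for S in combinations(range(M), r):
            if pred(set(S)):
                return set(S)

inclusive = smallest(lambda S: learner_errors(S) <= S)
k_extended = smallest(lambda S: all(len(e & S) >= 2
                                    for e in ERRORS.values() if len(e) >= 2))
```

Here `inclusive` comes out to the 0-indexed version of $\{1, m\}$ and `k_extended` to the 0-indexed version of $\{2, \ldots, m\}$.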



CINIC10 was downloaded from https://datashare.ed.ac.uk/handle/10283/3192.

https://github.com/kuangliu/pytorch-cifar



Suppose $\hat d \geq d^*$ and $\hat N \geq N := |H|$. Then, with probability at least $1-\delta$, Algorithm 1 returns successfully within $O(d^* \log(m/d^*))$ queries to the learner, and the size of the returned set is $|\hat S| = O(\hat d(\log \hat N + \log\log m + \log(1/\delta)))$.

Figure 2: Plots for CIFAR10 across three model architectures. Error bars are from training with 10 replications on a subset.

Figure 3: Example cross-architecture plots where the subset selection architecture differs from the evaluation architecture. Error bars are from training on a subset with 10 replications.

Let $\mathbb Z$ denote the set of integers and $\mathbb Z^+$ denote the set of positive integers. Let $[k] = \{1, 2, \ldots, k\}$. For a set $S$, let $\mathcal P(S)$ denote the power set of $S$, and let $S^k$ be the repeated Cartesian product; for example, $S^3 = S \times S \times S$. Let $\tilde O(\cdot)$ denote $O(\cdot)$ where extraneous log factors are ignored; for example, if an algorithm has an $O(n \log n)$ runtime, it also has a $\tilde O(n)$ runtime. Note that we cannot ignore all log terms (otherwise we could drop everything: $O(n) = O(\log(\exp(n)))$); we only ignore log factors where the non-log version of the expression also appears.

B EXPERIMENTAL DETAILS

B.1 MACHINE TEACHING IMPLEMENTATION DETAILS

Figure 7: Cross-architecture evaluation on ResNet10.


(note $|H| \leq |Y|^{|X|}$) and a bijective ranking $\sigma : H \to [|H|]$. A ranked minimal error learner returns the highest-ranked hypothesis among the hypotheses in $H$ achieving minimal error on the dataset:

$L(D) = \operatorname*{arg\,min}_{h \,:\, |\{(x,y)\in D \,:\, h(x)\neq y\}| \,=\, \min_{h'\in H}|\{(x,y)\in D \,:\, h'(x)\neq y\}|} \sigma(h)$ (90)

Proposition 1. A ranked minimal error learner is invariant to consistent additions.
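Definition (90) translates directly into code, and Proposition 1 can be checked exhaustively on a small finite instance. A minimal sketch assuming hypotheses are given as prediction functions and `sigma` as a rank dictionary (both names hypothetical):

```python
import itertools

def ranked_minimal_error_learner(hypotheses, sigma):
    """Per definition (90): among hypotheses with minimal error on D,
    return the one with the smallest rank sigma (the highest ranked)."""
    def learn(D):  # D: list of (x, y) pairs
        def n_errors(name):
            return sum(hypotheses[name](x) != y for x, y in D)
        best = min(n_errors(name) for name in hypotheses)
        return min((n for n in hypotheses if n_errors(n) == best), key=sigma.get)
    return learn
```

Enumerating all datasets over a three-point domain and checking that consistent additions never change the learned hypothesis gives an empirical confirmation of Proposition 1.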

6. Interpreted further, $|L^{(\mathrm{err})}(S)| = |L^{(\mathrm{err})}([m])|$ and $L^{(\mathrm{err})}([m]) \subset S$, meaning that $S$ contains the errors of training on the full dataset.

Hypothesis type: (I) $\sigma(h) < \sigma^*$; (II) $\sigma(h) > \sigma^*$, sub-optimal errors; (III) $\sigma(h) > \sigma^*$, optimal errors.

Total errors: (I) $\geq k+1$; (II) $\geq k+1$; (III) $= k$.
Errors on $S$ for $k$-extended: (I) $\geq k+1$; (II) $\geq k+1$; (III) $\geq 0$.
Errors on $S$ for error inclusive: (I) $\geq k+1$; (II) $\geq k$; (III) $\geq k$.

F.6 FRAMEWORK PROOFS

Proposition 1. A ranked minimal error learner is invariant to consistent additions.

Proof. Suppose $L$ is a ranked minimal error learner with hypothesis class $H$ and ranking $\sigma$. Fix a dataset $D \in \mathcal P(X \times Y)$, an input $x \in X$, and an output $y \in Y$. It suffices to show that $L(D)(x) = y$ implies $L(D) = L(D \cup \{(x, y)\})$. Define $D^+ = D \cup \{(x, y)\}$. Define $E : \mathcal P(X \times Y) \times H \to \mathbb Z^+$ as $E(D, h) = |\{(x, y) \in D : h(x) \neq y\}|$, and let

$\bar E = \min_{h \in H} E(D, h), \qquad \bar E^+ = \min_{h \in H} E(D^+, h)$

$M = \{h : E(D, h) = \bar E\}$ (103)

$M^+ = \{h : E(D^+, h) = \bar E^+\}$

Since $L(D) \in M$, $E(D, L(D)) = \bar E$. Because $L(D)(x) = y$ and $E(D, L(D)) = \bar E$, we have $E(D^+, L(D)) = \bar E$, so $\bar E^+ \leq \bar E$. Because $D \subset D^+$, $\bar E^+ \geq \bar E$. Therefore, $\bar E^+ = \bar E$. Because $E(D, h) \leq E(D^+, h)$ and $\bar E^+ = \bar E$, we have $M^+ \subset M$. Furthermore, $E(D^+, L(D)) = \bar E = \bar E^+$, so $L(D) \in M^+$. Thus,

$\sigma(L(D)) = \min_{h \in M} \sigma(h) \leq \min_{h \in M^+} \sigma(h) \leq \sigma(L(D))$ (109)

Thus, $L^{(\mathrm{err})}(S) = L^{(\mathrm{err})}(S \cup S^*) = L^{(\mathrm{err})}(S^*)$.

Proposition 3. If a learner $L$ is invariant to consistent additions, then

$S \text{ fully intersects } E^+ \iff S \text{ is error inclusive}$ (117)

Proof. $\rightarrow$: Suppose $S$ fully intersects $E^+$. Then, for any set $S'$, either $L^{(\mathrm{err})}(S') \setminus S' = \emptyset$ or $|(L^{(\mathrm{err})}(S') \setminus S') \cap S| \geq 1$. Set $S' = S$. Then, because $(L^{(\mathrm{err})}(S) \setminus S) \cap S = \emptyset$, it must be the case that $L^{(\mathrm{err})}(S) \setminus S = \emptyset$, and thus $S$ is error inclusive.

$\leftarrow$: Suppose a set $S$ is error inclusive. Pick any set $S' \subset [m]$. It suffices to show that either $L^{(\mathrm{err})}(S') \setminus S' = \emptyset$ or $|(L^{(\mathrm{err})}(S') \setminus S') \cap S| \geq 1$.

Case 2: $L^{(\mathrm{err})}(S) = L^{(\mathrm{err})}(S')$. Suppose $(L^{(\mathrm{err})}(S') \setminus S') \cap S = \emptyset$; then,

$(L^{(\mathrm{err})}(S') \setminus S') \cap S = \emptyset$ (118)

$(L^{(\mathrm{err})}(S') \cap S) \cap \neg S' = \emptyset$ (119)

$(L^{(\mathrm{err})}(S) \cap S) \cap \neg S' = \emptyset$ (120)

$L^{(\mathrm{err})}(S) \cap \neg S' = \emptyset$ (121)

$L^{(\mathrm{err})}(S') \cap \neg S' = \emptyset$ (122)

$L^{(\mathrm{err})}(S') \setminus S' = \emptyset$ (123)

where the third-to-last line follows from the error inclusivity of $S$.

F.7 EXAMPLES

F.7.1 EXAMPLE OF A LEARNER THAT IS NOT A MINIMAL ERROR LEARNER, BUT IS INVARIANT TO CONSISTENT ADDITIONS

This table shows the number of queries to the learner for the machine teaching methods with the ResNet10 architecture. Results for the other architectures can be found in Appendix C.

This table shows the values of the parameters that correspond to a $\log N$ estimate: $\lambda$ for DHPZ, $\xi$ for ours, and the number of sampling repetitions for CFLM. All parameters are the minimal element of the set $\{2^k \cdot 10^3 : k \in \mathbb Z\}$ such that the algorithm returns successfully.

This table shows the number of queries to the learner for the machine teaching methods with the Myrtle architecture.

This table shows the number of queries to the learner for the machine teaching methods with the VGG architecture.

E.1 PROOF OF QUERY LOWER BOUND FOR LARGE N

Theorem 2. Fix any $\lambda \geq 1$ and $m \geq 2\lambda$. There exists a distribution over interpolating ranked minimal error learners and an $m$-sized dataset with optimal subset size $d^*$, such that any teaching algorithm requires $2^{\Omega(m/\lambda)}$ queries to achieve a subset of size at most $\lambda d^*$ with probability at least $1/2$.

Proof. Let the dataset consist of $m$ points, all labeled 0. Let $K \subset [m]$ be randomly chosen with $|K| = \lfloor m/(2\lambda)\rfloor \geq 1$; $K$ is the random "key". We show that the learner can be written as a ranked minimal error learner; define the hypotheses accordingly (i.e., the learner prefers $h_S$ with smaller $S$). Note that this learner is an interpolating learner because $L(K)$ has zero errors.

Furthermore, note that any set of size $|S| = m/2 + 1$ achieves no errors (is error inclusive), but its size is more than $d^*\lambda$. Furthermore, to achieve a subset of size $d^*\lambda$, a necessary condition for any algorithm is to query a subset where $|S| \leq m/2 \wedge K \subset S$; otherwise, there will be many errors ($\geq m/2$) and no information will be gained about $K$. This condition is not sufficient, but it is necessary. Furthermore, for any subset with $|S| \leq m/2$, the probability that $K \subset S$ is at most $1/2^{|K|}$. Thus, by a union bound, in $2^{|K|-1}$ iterations there is a $1/2$ chance that the algorithm has not met the necessary condition and thus is not done. Thus, the expected number of iterations is at least $2^{\Omega(m/\lambda)}$.

E.2 QUERY LOWER BOUND FOR SMALL N

Just for this section, define $[n] = \{0, 1, \ldots, n-1\}$; in other words, we start at zero rather than one. Here we describe a ranked minimal error learner that requires $\Omega(d^* \log(m/d^*))$ iterations (queries) to achieve even a rough approximation of the optimal subset. Fix any $C, \ell, k \in \mathbb Z^+$. Let there be a randomly chosen vector $\vec c \in [C]^{k\ell}$. For simplicity, let the ground truth labels all be 0. Let there be $k\ell + 1$ hypotheses, where one hypothesis, $h^* = h_\infty$, makes no mistakes. Index the other $k\ell$ hypotheses by $[k\ell]$.

Set cover is one of Karp's 21 NP-hard problems (Karp, 1972).
Additionally, approximating set cover within a factor of $o(\log n)$ is computationally hard. Specifically, unless NP problems can be solved in $n^{O(\log\log n)}$ time (a slightly weaker assumption than P $\neq$ NP), there is no algorithm for set cover that always returns a set cover of size at most $(1-\epsilon)\ln(N) \cdot C_{\mathrm{OPT}}$, where $C_{\mathrm{OPT}}$ is the optimal set cover size (Feige, 1998). Alon et al. (2009) show that approximating online set cover requires a larger approximation factor: any algorithm must return solutions of size at least $\Omega(\ln(N)\ln(m) \cdot C_{\mathrm{OPT}})$, where $C_{\mathrm{OPT}}$ is the optimal (offline) solution size. Note that the exploratory set cover problem is harder than standard set cover but easier than online set cover.

F.1.3 MACHINE TEACHING

Machine teaching (Shinohara & Miyano, 1991; Goldman & Kearns, 1995) is a machine learning task in which a teacher provides data points to the learner. In the most classic setting, the learner has a finite hypothesis class H, where each hypothesis is a mapping from X to Y; the teacher knows the learner's hypothesis class and which hypothesis is correct, and provides a teaching set S ⊂ X that uniquely determines the correct hypothesis. In particular, for a hypothesis class H and a correct hypothesis h*, a set S is a teaching set if h* is the only hypothesis in H consistent with {(x, h*(x)) : x ∈ S}. Intuitively, if the teacher provides {(x, h*(x)) : x ∈ S} to the learner, then the only remaining consistent hypothesis is h*.

Machine teaching in this setting, where the teacher knows the entire hypothesis class of the learner, has a straightforward reduction to set cover: identify each hypothesis h ∈ H \ {h*} as an element to be covered, and each datapoint x as a decision variable (a datapoint in the subset) that covers exactly those h with h(x) ≠ h*(x). An important quantity is the size of the smallest teaching set, d*, which is a function of both the hypothesis class H and the true hypothesis h* ∈ H. Note that the "teaching dimension" (Goldman & Kearns, 1995) is the maximum of d* over all h* ∈ H. Cicalese et al. (2020) study the general non-realizable setting.

Throughout this work, we make the following assumption about the learner.

Definition 4 (Invariant to consistent additions). We say a learner is invariant to consistent additions if for all D ⊂ X × Y, x ∈ X, y ∈ Y: if L(D)(x) = y, then L(D ∪ {(x, y)}) = L(D).

A simple example of a learner that meets this condition is a "ranked minimal error learner". To avoid handling infinite hypothesis classes, assume X is finite for this example. Fix a hypothesis class H and a bijective ranking σ : H → [|H|]; given a dataset D, let M be the set of hypotheses in H with the fewest errors on D, and define L(D) = arg min_{h∈M} σ(h). Because σ is bijective, this argmin is unique. To see that this learner is invariant to consistent additions, let D+ = D ∪ {(x, L(D)(x))} and let M+ be the set of minimal-error hypotheses on D+; since L(D) incurs no additional error on the added point, L(D) ∈ M+ ⊆ M, so L(D+) = arg min_{h∈M+} σ(h) = L(D).

Lemma 10. Suppose a learner L is invariant to consistent additions. Then, for a fixed dataset {(x_i, y_i) : i ∈ [m]} and any subsets S, S′ ⊂ [m], if L^(err)(S) ∩ S′ = ∅, then L^(err)(S ∪ S′) = L^(err)(S).

Proof.
We prove by induction on the size of S′. As the base case, if |S′| = 0, the statement trivially holds. For the inductive step, suppose the lemma holds for any S′ with |S′| = k, and consider S′ with |S′| = k + 1. Choose s and S′′ with |S′′| = k such that S′ = S′′ ∪ {s}. Then L^(err)(S) ∩ S′ = ∅ implies L^(err)(S) ∩ S′′ = ∅, and thus, by the inductive hypothesis, L^(err)(S ∪ S′′) = L^(err)(S). Furthermore, L^(err)(S) ∩ S′ = ∅ implies s ∉ L^(err)(S) = L^(err)(S ∪ S′′), so the learner trained on S ∪ S′′ labels (x_s, y_s) correctly. Therefore, since the learner is invariant to consistent additions, L^(err)(S ∪ S′′ ∪ {s}) = L^(err)(S ∪ S′′), i.e., L^(err)(S ∪ S′) = L^(err)(S).

Proposition 5. Suppose a learner L is invariant to consistent additions and a set S is error inclusive.

Proof. Suppose the first conclusion is not satisfied, i.e., (L^(err)(S) \ S) ∩ S* = ∅; then L^(err)(S) ∩ (S* \ S) = ∅. Since the learner is invariant to consistent additions, Lemma 10 gives L^(err)(S) = L^(err)(S ∪ (S* \ S)) = L^(err)(S ∪ S*).
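To make the ranked minimal error learner and Definition 4 concrete, the following toy sketch (with illustrative names; X is a finite set of indices, hypotheses are label tuples, and the ranking σ is taken to be a hypothesis's index in a fixed list) checks the invariance property on a small example:

```python
# Toy ranked minimal error learner over a finite X = {0, ..., n-1}.
# The ranking sigma is the hypothesis's position in a fixed list (a
# bijective ranking); the learner returns the lowest-ranked hypothesis
# among those with the fewest errors on the training data.

def ranked_minimal_error_learner(hypotheses, data):
    """hypotheses: list of label tuples; data: list of (x, y) pairs."""
    def errors(h):
        return sum(h[x] != y for x, y in data)
    fewest = min(errors(h) for h in hypotheses)
    # Ties are broken by rank (list index), so the argmin is unique.
    return next(h for h in hypotheses if errors(h) == fewest)

H = [(0, 0, 1), (0, 1, 0), (1, 1, 1)]
D = [(0, 0), (1, 1)]
h = ranked_minimal_error_learner(H, D)
# Invariance to consistent additions (Definition 4): appending a point
# that the learner already labels correctly does not change its output.
x, y = 2, h[2]
h_plus = ranked_minimal_error_learner(H, D + [(x, y)])
assert h_plus == h
```

The check mirrors the argument above: adding a point the current output labels correctly cannot increase that hypothesis's error count relative to the others, so the minimal-error set only shrinks and the lowest-ranked survivor is unchanged.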


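The reduction from teaching sets to set cover described in Section F.1.3 can likewise be sketched in a few lines. In this hypothetical helper (names are illustrative), hypotheses are label tuples over a finite pool, each datapoint x "covers" a hypothesis h when h(x) ≠ h*(x), and a greedy cover of H \ {h*} yields a teaching set:

```python
# Sketch of the teaching-set-to-set-cover reduction: hypotheses other
# than h_star are the elements to cover, and a datapoint x covers
# hypothesis h iff h disagrees with h_star at x.

def greedy_teaching_set(hypotheses, h_star):
    """Return point indices whose h_star-labels eliminate all h != h_star."""
    to_cover = [h for h in hypotheses if h != h_star]  # elements to cover
    n = len(h_star)
    teaching_set = []
    while to_cover:
        # Greedily pick the point eliminating the most surviving hypotheses.
        x = max(range(n), key=lambda i: sum(h[i] != h_star[i] for h in to_cover))
        if all(h[x] == h_star[x] for h in to_cover):
            raise ValueError("some hypothesis is indistinguishable from h_star")
        teaching_set.append(x)
        to_cover = [h for h in to_cover if h[x] == h_star[x]]
    return teaching_set

# Pool of three points; hypothesis class of four label vectors.
H = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]
h_star = (0, 0, 0)
S = greedy_teaching_set(H, h_star)
# Revealing h_star's labels on S leaves h_star as the only consistent hypothesis.
survivors = [h for h in H if all(h[x] == h_star[x] for x in S)]
assert survivors == [h_star]
```

By the greedy set cover guarantee, the returned teaching set has size at most (ln |H| + 1) · d*, where d* is the smallest teaching set for (H, h*).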