COVERAGE-CENTRIC CORESET SELECTION FOR HIGH PRUNING RATES

Abstract

One-shot coreset selection aims to select a representative subset of the training data, given a pruning rate, that can later be used to train future models while retaining high accuracy. State-of-the-art (SOTA) coreset selection methods pick the highest-importance examples based on an importance metric and are found to perform well at low pruning rates. However, at high pruning rates, they suffer a catastrophic accuracy drop, performing worse than even random sampling. This paper explores the reasons behind this accuracy drop both theoretically and empirically. We first propose a novel metric to measure the coverage of a dataset on a specific distribution by extending the classical geometric set cover problem to a distribution cover problem. This metric helps explain why coresets selected by SOTA methods at high pruning rates perform poorly compared to random sampling: they have worse data coverage. We then propose a novel one-shot coreset selection method, Coverage-centric Coreset Selection (CCS), that jointly considers overall data coverage of a distribution as well as the importance of each example. We evaluate CCS on five datasets and show that, at high pruning rates (e.g., 90%), it achieves significantly better accuracy than previous SOTA methods (e.g., at least 19.56% higher on CIFAR10) as well as random selection (e.g., 7.04% higher on CIFAR10), and comparable accuracy at low pruning rates. We make our code publicly available on GitHub.

1. INTRODUCTION

One-shot coreset selection aims to select a small subset of the training data that can later be used to train future models while retaining high accuracy (Coleman et al., 2019; Toneva et al., 2018). One-shot coreset selection is important because full datasets can be massive in many applications and training on them can be computationally expensive. A favored way to select coresets is to assign an importance score to each example and select the more important examples to form the coreset (Paul et al., 2021; Sorscher et al., 2022). Unfortunately, current SOTA methods for one-shot coreset selection suffer a catastrophic accuracy drop at high pruning rates (Guo et al., 2022; Paul et al., 2021). For example, on CIFAR-10, a SOTA method (the forgetting score (Toneva et al., 2018)) achieves 95.36% accuracy at a 30% pruning rate, but that accuracy drops to only 34.03% at a 90% pruning rate (which is significantly worse than random coreset selection). This accuracy drop is currently unexplained and limits the extent to which coresets can be practically reduced in size. In this paper, we provide both theoretical and empirical insights into the reasons for the catastrophic accuracy drop and propose a novel coreset selection algorithm that overcomes this issue. We first extend the classical geometric set cover problem to a density-based distribution cover problem and provide theoretical bounds on model loss as a function of the coverage properties of a coreset on a distribution. Furthermore, based on this theoretical analysis, we propose a novel metric, AUC_pr, which allows us to quantify how a dataset covers a specific distribution (Section 3.1). With the proposed metric, we show that coresets selected by SOTA methods at high pruning rates have much worse data coverage than random pruning, suggesting a linkage between the poor data coverage of SOTA methods and their poor accuracy at high pruning rates.
We note that data coverage has also been studied in the active learning setting (Ash et al., 2019; Citovsky et al., 2021), but techniques from active learning do not trivially extend to one-shot coreset selection. We discuss the similarities and differences in Section 5. We then propose a novel algorithm, Coverage-centric Coreset Selection (CCS), that addresses the catastrophic accuracy drop by improving data coverage. Different from SOTA methods that prune unimportant (easy) examples first, CCS is inspired by stratified sampling and guarantees a sampling budget across importance scores to achieve better coverage at high pruning rates (Section 3.3). We find that CCS overcomes the catastrophic accuracy drop at high pruning rates, outperforming SOTA methods by a significant margin, based on evaluation on five datasets (CIFAR10, CIFAR100 (Krizhevsky et al., 2009), SVHN (Netzer et al., 2011), CINIC10 (Darlow et al., 2018), and ImageNet (Deng et al., 2009)). For example, at a 90% pruning rate on the CIFAR10 dataset, CCS achieves 85.7% accuracy versus 34.03% for a SOTA coreset selection method based on forgetting scores (Toneva et al., 2018). Furthermore, CCS also outperforms random selection at high pruning rates. For example, at a 90% pruning rate, CCS achieves 7.04% and 5.02% better accuracy than random sampling on CIFAR10 and ImageNet, respectively (Section 4). Our method also outperforms a concurrent work called Moderate (Xia et al., 2023) on CIFAR10 by 5.04% and on ImageNet by 5.20% at a 90% pruning rate. At low pruning rates, CCS still achieves performance comparable to baselines, outperforming random selection. To summarize, our contributions are as follows: 1. We extend the geometric set cover problem to a density-based distribution cover problem and provide a theoretical bound on model loss as a function of properties of a coreset (Section 3.1, Theorem 1). 2. We propose a novel metric AUC_pr to quantify the data coverage of a coreset (Section 3.1).
As far as we know, AUC_pr is the first metric to measure how a dataset covers a distribution. 3. Based on this metric, we show that SOTA coreset selection methods tend to have poor data coverage at high pruning rates (worse than random selection), thus suggesting a linkage between coverage and the observed catastrophic accuracy drop (Section 3.2). 4. To improve coverage in coreset selection, we propose a novel one-shot coreset selection method, CCS, that uses a variation of stratified sampling across importance scores to improve coverage (Section 3.3). 5. We evaluate CCS on five different datasets and compare it with six baselines; we find that CCS significantly outperforms the baselines as well as random coreset selection at high pruning rates and has comparable performance at low pruning rates (Section 4). Based on our results, we consider CCS to provide a strong new baseline for future one-shot coreset selection methods, even at higher pruning rates.

2. PRELIMINARIES

2.1. ONE-SHOT CORESET SELECTION

Let S = {(x_i, y_i)}_{i=1}^N denote the training dataset drawn from an underlying distribution P, where x_i is the data and y_i is the ground-truth label. The goal of one-shot coreset selection is to select a training subset S' given a pruning rate α before training so as to maximize the accuracy of models trained on this subset, which can be formulated as the following optimization problem (Sener & Savarese, 2017):

min_{S' ⊂ S: |S'|/|S| ≤ 1-α} E_{x,y∼P}[l(x, y; h_{S'})],    (1)

where l is the loss function and h_{S'} is the model trained with the labelled dataset S'. SOTA methods typically assign an importance score (also called a difficulty score or importance metric) to each example and preferentially select more important (difficult) examples to form the coreset. One proposed importance score is the forgetting score (Toneva et al., 2018), which is defined as the number of times an example is incorrectly classified after having been correctly classified earlier during model training. Another is the area under the margin (AUM) (Pleiss et al., 2020), a data difficulty metric that identifies mislabeled data. Another importance metric, the EL2N score (Paul et al., 2021), estimates data difficulty by the L2 norm of error vectors. SOTA algorithms based on these metrics all prune examples with low importance first (Sorscher et al., 2022). As discussed in Sorscher et al. (2022), harder examples usually contain more information, are treated as more important, and are thus more desirable for inclusion in a coreset. Although SOTA coreset methods (Toneva et al., 2018; Coleman et al., 2019; Paul et al., 2021) report encouraging performance at low pruning rates, a catastrophic accuracy drop is observed as the pruning rate increases (Guo et al., 2022; Paul et al., 2021).
For instance, as shown in Figure 1, SOTA methods based on the forgetting score and AUM outperform random sampling when the pruning rate is less than 70%, but considerably underperform random sampling at a 90% pruning rate, achieving only 34.03% and 28.06% accuracy, respectively, much worse than the 79.04% of random sampling.
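As an illustration of how forgetting events can be tracked during training, the sketch below updates per-example forgetting counts from per-epoch correctness masks. The function and array names are our own, not from the original implementation.

```python
import numpy as np

def update_forgetting_stats(prev_correct, correct_now, forget_counts):
    """Update per-example forgetting counts after one epoch.

    An example is 'forgotten' when it was classified correctly at the
    previous check but is misclassified now (Toneva et al., 2018).
    All arrays are aligned by example index.
    """
    forgotten = prev_correct & ~correct_now
    forget_counts[forgotten] += 1
    return correct_now.copy(), forget_counts

# Toy run: 4 examples over 3 epochs of per-example correctness.
prev = np.zeros(4, dtype=bool)
counts = np.zeros(4, dtype=int)
for epoch_correct in [np.array([1, 1, 0, 1], dtype=bool),
                      np.array([1, 0, 1, 1], dtype=bool),
                      np.array([1, 1, 0, 0], dtype=bool)]:
    prev, counts = update_forgetting_stats(prev, epoch_correct, counts)

# counts is now [0, 1, 1, 1]: examples 1, 2, and 3 were each
# forgotten once, while example 0 was never forgotten.
```

Examples with higher final counts are the "harder" ones that importance-based methods keep first.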

2.2. CATASTROPHIC ACCURACY DROP

We hypothesize that this accuracy drop is a result of bad data coverage caused by the biased pruning of importance-based methods. In this paper, we conduct an in-depth study on this catastrophic accuracy drop and propose a novel coreset selection method that overcomes the catastrophic accuracy drop at high pruning rates.

3. THE COVERAGE OF CORESETS

We start by studying the data coverage provided by different coresets. We first extend the classical geometric set cover to a density-based distribution cover problem and propose a novel metric to compare the data coverage of different coresets (Section 3.1). We then show that biased pruning of easy data leads to poor data coverage of coresets at high pruning rates (Section 3.2). In Section 3.3, inspired by stratified sampling, we propose a novel coverage-centric coreset selection method that jointly considers both the coverage and the importance of each example based on a given importance metric.

3.1. DENSITY-BASED PARTIAL COVERAGE

To compare the data coverage of different coreset methods, we need to quantitatively measure how well a dataset S covers a distribution P. In the classical geometric set cover setting (Fowler et al., 1981; Sener & Savarese, 2017), a set S' is an r-cover of another set S when the set of r-radius balls centered at the elements of S' covers the entire S. The radius r can be used as a metric to measure the coverage of S' on S. However, how to measure the coverage of a dataset on a distribution is not well studied. To measure how well a set covers a distribution, we extend the classical geometric set cover to a density-based distribution cover. Instead of covering a set, we study covering a distribution P_µ and consider the probability density in different areas of the input space. Instead of requiring a complete cover, we introduce the cover percentage p to describe how a set covers different percentages of a distribution, to better understand the trade-off between cover radius r and cover percentage p:

Definition 1 (p-partial r-cover). Given a metric space (X, d), a set S ⊂ X is a p-partial r-cover for a distribution P_µ on the space X if:

∫_X 1_{∪_{x'∈S} B_d(x',r)}(x) dµ(x) = p,

where B_d(x', r) = {x ∈ X : d(x', x) ≤ r} is the r-radius ball centered at x', p measures coverage (which we will usually express as a percentage and refer to as percentage coverage), and µ is the probability measure corresponding to the probability density function P_µ.

The probability measure µ represents the density information in the input space. For a model trained on an r-cover coreset, Sener & Savarese (2017) proved a bounded risk on the entire training dataset. In Theorem 1 below, we extend their result to the distribution coverage scenario. The proof of Theorem 1 relies on the same assumptions as in Sener & Savarese (2017), in particular a Lipschitz-continuous loss function and zero training error on the coreset.
The zero training error assumption may not always be realistic, but Sener & Savarese (2017) found that the resulting bounds still hold, and we rely on the same assumption. The proof of Theorem 1 can be found in Appendix C.

Theorem 1. Given n i.i.d. samples drawn from P_µ as S = {x_i, y_i}_{i∈[n]}, where y_i ∈ [C] is the class label for example x_i, a coreset S' which is a p-partial r-cover for P_µ on the input space X, and an ϵ > 1 - p, if the loss function l(·, y, w) is λ_l-Lipschitz continuous for all y, w and bounded by L, the class-specific regression function η_c(x) = p(y = c|x) is λ_η-Lipschitz for all c, and l(x, y; h_{S'}) = 0 for all (x, y) ∈ S', then with probability at least 1 - ϵ:

(1/n) Σ_{(x,y)∈S} l(x, y; h_{S'}) ≤ r(λ_l + λ_η LC) + L √(log(p/(p+ϵ-1)) / (2n)).    (2)

Theorem 1 shows that the model trained on a p-partial r-cover coreset S' has a bounded risk on the entire training dataset S. There are several implications of Theorem 1: 1) A model trained on a p-partial r-cover coreset has a training risk bounded by the covering radius r plus an additional term. This term goes to zero as n increases to infinity. Also, we want a small ϵ, a large p, and a small radius r to get a high-probability, small bound on the loss. 2) The probability that Equation 2 holds is bounded by the cover percentage p (since 1 - ϵ < p). This indicates that the bound in Theorem 1 becomes meaningless (low probability) when the cover percentage p is too small. 3) Theorem 1 is a more general form of Theorem 1 in Sener & Savarese (2017); if we set p = 1 (i.e., a complete cover), the two theorems are the same.

Definition 1 shows that, for a given set S and a distribution P_µ, the cover percentage p increases with the cover radius r.
Conversely, given a cover percentage p and a set S, we can obtain the minimum cover radius r for S to achieve p percent coverage of the distribution P_µ. For a given set S and distribution P_µ, we can plot a p-r curve to represent this relationship. In practice, the distribution P_µ is unknown, so we estimate the cover percentage p for a given subset S of the training dataset and a radius r by measuring the fraction of the test dataset that is covered. To assess coverage for a radius r, we use the L2 distance between activations of the final convolutional layer of a model trained with the entire dataset as the distance metric d, as suggested in Babenko & Lempitsky (2015); Sharif Razavian et al. (2014); Hsieh et al. (2018). Figure 2 shows the p-r curves for several subsets of the CIFAR10 training dataset: (1) the entire dataset (no pruning); (2) a random coreset at a 90% pruning rate (i.e., a random 10% of the training data); and (3) a coreset selected with the forgetting score as the importance metric, as proposed by Toneva et al. (2018). As expected, for all coresets, the required radius r increases with the desired cover percentage p. Also, the coreset with no pruning has the lowest curve, as expected. In general, lower curves are better, and we expect models trained on the corresponding datasets to have lower loss. However, there is a crossover between the forgetting curve and the random curve around p = 80%, making a comparison less obvious. For a clearer quantitative comparison, we propose to use the area under the p-r curve, AUC_pr, as a proxy metric to assess the quality of a coreset selection strategy. Lower values of AUC_pr suggest better coverage by the coreset. We note that AUC_pr is the expected minimum distance from examples following the underlying distribution P_µ to those in the coreset, as stated in the following proposition (proved in Appendix C): Proposition 1.
Given a metric space (X, d), a distribution P_µ, a cover percentage p, a set S, and f_r(p): [0, 1] → [0, +∞] representing the mapping between p and r, if f_r is Riemann-integrable, then the AUC of the p-r curve is

AUC_pr(S) = ∫_0^1 f_r(p) dp = E_{x∼P_µ}[d(S, x)],

where d(S, x) = min_{x'∈S} d(x', x). In practice, we can estimate AUC_pr with the test set D_test: AUC_pr(S) = E_{x∈D_test}[min_{x'∈S} d(x', x)]. We also note that this expected minimum distance (and thus the AUC) is related to r in Theorem 1; lower values of AUC_pr are preferred for lower loss and thus higher model accuracy.

Figure 3: The relationship between AUC_pr and accuracy. Different colors stand for different coreset selection methods. A larger circle size indicates a higher pruning rate on the dataset. A smaller AUC_pr often corresponds to better accuracy. At high pruning rates (larger circles), forgetting and AUM have a higher AUC_pr and worse accuracy.
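Proposition 1 makes AUC_pr straightforward to estimate empirically: it is the mean, over test examples, of the minimum feature-space distance to the coreset. The sketch below computes this estimate and the empirical p-r curve from precomputed feature matrices; function and variable names are illustrative, not from the paper's code.

```python
import numpy as np

def _pairwise_l2(test_feats, coreset_feats):
    """Pairwise L2 distances via ||a-b||^2 = ||a||^2 - 2ab + ||b||^2."""
    a2 = (test_feats ** 2).sum(axis=1, keepdims=True)      # (n, 1)
    b2 = (coreset_feats ** 2).sum(axis=1)[None, :]         # (1, m)
    d2 = a2 - 2.0 * test_feats @ coreset_feats.T + b2      # (n, m)
    return np.sqrt(np.maximum(d2, 0.0))                    # clamp tiny negatives

def auc_pr(coreset_feats, test_feats):
    """Estimate AUC_pr as the mean minimum L2 distance from each test
    feature to the coreset features (Proposition 1)."""
    return _pairwise_l2(test_feats, coreset_feats).min(axis=1).mean()

def pr_curve(coreset_feats, test_feats, num_points=100):
    """Empirical p-r curve: for each cover percentage p, the minimum
    radius r such that a p fraction of the test set lies within r of
    some coreset example."""
    min_d = np.sort(_pairwise_l2(test_feats, coreset_feats).min(axis=1))
    ps = np.linspace(0.0, 1.0, num_points)
    rs = np.quantile(min_d, ps)
    return ps, rs
```

A coreset identical to the test set yields AUC_pr = 0, and moving test points away from the coreset raises it, matching the intuition that lower AUC_pr means better coverage.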

3.2. COVERAGE ANALYSIS ON CORESETS

Table 1 shows the AUC_pr for different CIFAR10 coresets at different pruning rates. At a 30% or 50% pruning rate, forgetting and AUM have AUC_pr similar to random sampling. However, at pruning rates larger than 70%, forgetting and AUM have larger AUC_pr than random pruning, suggesting that forgetting and AUM can be expected to have worse accuracy than random pruning. The reason behind the higher AUC_pr for forgetting and AUM is that they prune easy examples, which tend to be located in high-density areas of the distribution, since common, well-separated examples tend to be easier for classifiers. Coresets at low pruning rates still contain plenty of easy examples for all methods; thus coverage is not a significant factor at low pruning rates. However, at high pruning rates, coresets can completely exclude examples from the high-density areas, reducing coverage and raising AUC_pr, and thus potentially raising loss and reducing accuracy. For the data points in Table 1, Figure 3 plots model accuracy against AUC_pr. The size of the ball at a point represents the pruning rate for that point: a larger ball indicates a larger pruning rate. The plot generally confirms our intuition from Section 3.1 that higher AUC_pr values tend to lead to lower model accuracy.

3.3. METHODOLOGY: COVERAGE-CENTRIC CORESET SELECTION

Based on our density-based distribution setting in Section 3.1, when the sampling budget is limited, easy examples in high-density areas provide more coverage than hard examples in low-density areas. CCS first divides the dataset into non-overlapping strata based on importance scores. Each stratum has a fixed-length score range but may include different numbers of examples. We fix an initial budget on the number of examples to be chosen from each stratum based on the desired pruning rate, but, if a stratum has fewer examples than its budget, the remaining budget is evenly reassigned to the other strata. Thus, stratified sampling differs from random sampling in that examples from a "dense" stratum are less likely to be sampled than examples from a "sparse" stratum. Also, at a low pruning rate, stratified sampling tends to first discard data from low-importance strata, similar to SOTA methods. At high pruning rates, stratified sampling diverges from SOTA methods in that low-importance strata still receive budget to guarantee data coverage. (In Appendix D.5, we discuss another coreset selection method based on importance sampling.) Before CCS assigns budgets to strata, it prunes the hardest β percent of examples based on importance scores (Line 2 in Algorithm 1). This is based on two insights: (1) Mislabeled examples often have higher importance scores (Swayamdipta et al., 2020) but do not benefit accuracy. As the pruning rate increases, the percentage of mislabeled data in the coreset also rises, which hurts model accuracy. (2) When data is really scarce, we need more data from higher-density areas to get better coverage. Pruning hard examples helps allocate more budget to high-density strata. In practice, we use grid search to find β as a hyperparameter. We discuss how β impacts training in Section 4.2 and further details on selecting β in Appendix B.
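A minimal sketch of the stratified budget allocation described above, under our own simplifications (the paper's Algorithm 1 may differ in details such as tie handling and the order of budget reassignment):

```python
import numpy as np

def ccs_select(scores, budget, k=50, beta=0.1, rng=None):
    """Sketch of Coverage-centric Coreset Selection.

    scores: importance score per example (higher = harder, as with the
    forgetting score). First prune the beta fraction of hardest examples,
    then split the remaining score range into k equal-width strata and
    spread the budget across them; leftover budget from under-full
    strata flows to the other strata.
    """
    rng = np.random.default_rng(rng)
    n = len(scores)
    order = np.argsort(scores)                    # easy -> hard
    keep = order[: n - int(beta * n)]             # drop beta hardest
    s = scores[keep]
    edges = np.linspace(s.min(), s.max() + 1e-9, k + 1)
    strata = [keep[(s >= lo) & (s < hi)]
              for lo, hi in zip(edges[:-1], edges[1:])]
    strata = [st for st in strata if len(st) > 0]
    selected, remaining = [], budget
    # Fill the smallest strata first so unused budget flows to larger ones.
    for i, st in enumerate(sorted(strata, key=len)):
        m = min(len(st), remaining // (len(strata) - i))
        selected.append(rng.choice(st, size=m, replace=False))
        remaining -= m
    return np.concatenate(selected)
```

With k = 1 and beta = 0 this reduces to uniform random sampling over the whole dataset, matching the observation in Section 4.2 that random sampling is a special case of CCS.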
In Figure 4, we show an example on CIFAR10 comparing the fraction of data selected by a SOTA method (AUM) and by CCS (also based on AUM) at a 70% pruning rate. The SOTA method (Pleiss et al., 2020) selects the data shown with red shading, pruning all the easy examples (high AUM values). In contrast, CCS selects the data in the green region. Notice that CCS selects slightly fewer examples in the curved area (around 140) because those strata had insufficient examples for the budget; the excess budget is evenly reassigned to the other strata.

4. EVALUATION

We evaluate CCS against six baselines (random, entropy, forgetting, EL2N, AUM, and Moderate) on five datasets (CIFAR10, CIFAR100, SVHN, CINIC10, and ImageNet). In total, we trained over 800 models for the performance comparison. We find that CCS outperforms the other baselines at high pruning rates while achieving comparable accuracy at low pruning rates. We conduct an ablation study (Section 4.2) to better understand how hyperparameters and different components influence accuracy. Due to space limits, we include the experimental settings in Appendix B. All training is repeated 5 times with different random seeds to calculate mean accuracy with standard deviation. Baselines. We compare CCS with six baselines, the latter five being SOTA methods: 1) Random: Select examples with uniform random sampling to form a coreset.

4.1. CORESET PERFORMANCE COMPARISON

As shown in Figure 5, which compares CCS with the six baselines on three datasets (CIFAR10, CIFAR100, and ImageNet), our proposed method significantly outperforms SOTA schemes as well as random selection at high pruning rates. We observe that, apart from Moderate, the other four SOTA coreset methods tend to have similar or higher accuracy than random at pruning rates of 50% or lower, but they perform worse than random at high pruning rates (e.g., 80% and 90%). CCS achieves better accuracy than all other baseline selection methods at high pruning rates and comparable performance at low pruning rates. For example, at a 90% pruning rate on CIFAR10, CCS achieves 7.04% better accuracy than random sampling, and on ImageNet at a 90% pruning rate, CCS achieves 5.08% better accuracy than random sampling. We have similar findings on SVHN and CINIC10 (detailed results in Appendix D). CCS outperforms all baselines at high pruning rates on all these datasets.

4.2. ABLATION STUDY & ANALYSIS

We first observe that, when k = 1 and β = 0, CCS is equivalent to random sampling, i.e., random sampling is a special case of CCS. This guarantees that CCS, with an appropriate choice of hyperparameters, will not underperform random sampling. In practice, larger values of k tend to be more useful for obtaining the stratified sampling effect and outperforming random sampling. Also, as shown in Figure 6(a), for large k values, we found that the number of strata is not a sensitive hyperparameter. We then study the hard cutoff rate β. From Figure 6(b), we observe that the optimal hard cutoff rate β goes up as the pruning rate increases. This is expected since a larger hard cutoff rate helps shift more budget to high-density strata (see Section 3.3). When the total data budget is limited, better coverage of high-density strata helps improve model performance. We also observe that the optimal β is 0 when the pruning rate is 30% or 50%. A possible reason is that the percentage of mislabeled data is low in CIFAR10 (Pleiss et al., 2020), and deep neural networks are robust to a small portion of noisy data (Rolnick et al., 2017). Grid search for β can introduce extra costs in selecting effective coresets.

Figure 5: Performance comparison between our proposed method (CCS) and other baselines on CIFAR10, CIFAR100, and ImageNet-1k. The pruning rate is the fraction of examples removed from the original datasets. The evaluation results show that our method achieves better performance than all other baselines at high pruning rates (e.g., 70%, 80%, 90%) and comparable performance at low pruning rates (e.g., 30%, 50%). Detailed numbers are presented in Appendix D.

Figure 6: Ablation study on the number of strata (k) and the hard cutoff rate (β). We apply CCS with the forgetting score on CIFAR10. Each curve corresponds to a pruning rate. Different values of k lead to similar accuracy; the optimal β goes up as the pruning rate increases.
However, we note that our paper shows the possibility of achieving better accuracy than random sampling at a high pruning rate. We leave how to efficiently tune β for future work. Comparing stratified sampling alone with the vanilla forgetting method (pruning easy examples first), we observe that applying stratified sampling improves accuracy over the vanilla forgetting method, but it is still inferior to random sampling when we prune an 80% or 90% fraction of examples. Sorscher et al. (2022) discussed that pruning hard data is a better strategy than pruning easy data when the original dataset is scarce. In our evaluation, we observe that pruning hard data outperforms pruning easy data at high pruning rates but still has worse accuracy than CCS. We also considered a proportional importance sampling strategy to improve coverage, which samples examples with probability proportional to importance after mapping each importance score to a probability, on one dataset. Unfortunately, this strategy underperformed random sampling at high pruning rates (results in Appendix D.5). This evaluation suggests that it is non-trivial to beat random sampling at high pruning rates, even for a strategy that improves coverage.

5. RELATED WORK

Besides the coreset methods based on entropy (Coleman et al., 2019), the forgetting score (Toneva et al., 2018), AUM (Pleiss et al., 2020), and EL2N scores (Paul et al., 2021), data selection is also studied in active learning, and some methods there recognize the importance of diversity or coverage in data selection (Ash et al., 2019; Citovsky et al., 2021; Beluch et al., 2018; Hsu & Lin, 2015; Chang et al., 2017). While our scheme shares the idea of improving coverage with these methods, one-shot coreset selection philosophically differs from data selection in active learning: one-shot coreset selection aims to select a model-agnostic coreset for training new models from scratch, while active learning aims to select a subset that, given the current state of the model, will improve the model in the next training round. For example, BADGE (Ash et al., 2019), a diversity-based selection method in active learning, selects a subset by performing k-means++ on a gradient embedding based on the latest model and then repeats this for subsequent rounds during training. Another significant difference is that, in active learning, the total number of examples queried across all rounds of training can be much larger than the size of a one-shot coreset. For instance, in the experimental results of Ash et al. (2019), BADGE queried a different set of 10K (20%) examples in every round for the CIFAR-10 dataset, so the total number of distinct examples queried across all rounds is even larger. In contrast, CCS finds a fixed small subset (e.g., 10% of the data) that can be used to train new models from scratch while achieving high accuracy. Another popular coreset selection area is coreset selection for classical clustering problems such as k-means and k-median (Feldman et al., 2020; Feldman & Langberg, 2011; Cohen-Addad et al., 2021; Huang et al., 2019). However, these coreset methods are usually designed around the favorable mathematical properties of classical clustering objectives. For example, importance sampling is a common practice in coresets for clustering (Huang & Vishnoi, 2020), but it does not generalize well to the deep learning classification setting.

6. CONCLUSION

We conduct an in-depth study to understand the catastrophic accuracy drop in one-shot coreset selection. By extending the classical geometric set cover to a density-based partial cover, we propose a novel data coverage metric and show that existing importance-based coreset selection methods lead to poor data coverage, which in turn degrades model performance. Based on this observation, we design a novel Coverage-centric Coreset Selection (CCS) method to improve data coverage, which empirically outperforms the other baseline methods on five datasets. Our paper shows that it is possible to achieve much better accuracy than random sampling, even at high pruning rates, and provides a strong baseline for future research.

REPRODUCIBILITY STATEMENT

We include detailed experimental settings in the appendix, and the code is available on GitHub.

C THEORETICAL ANALYSIS

We include full proofs of Theorem 1 and Proposition 1 in this section.

C.1 PROOF OF THEOREM 1

Our proof follows the proof of Theorem 1 in Sener & Savarese (2017).

Lemma 1 (Berlind & Urner (2015)). Fix some p, p' ∈ [0, 1] and y' ∈ {0, 1}. Then,

p_{y∼p}(y = y') ≤ p_{y∼p'}(y = y') + |p - p'|.

Theorem 1. Given n i.i.d. samples drawn from P_µ as S = {x_i, y_i}_{i∈[n]}, where y_i ∈ [C] is the class label for example x_i, a coreset S' which is a p-partial r-cover for P_µ on the input space X, and an ϵ > 1 - p, if the loss function l(·, y, w) is λ_l-Lipschitz continuous for all y, w and bounded by L, the class-specific regression function η_c(x) = p(y = c|x) is λ_η-Lipschitz for all c, and l(x, y; h_{S'}) = 0 for all (x, y) ∈ S', then with probability at least 1 - ϵ:

(1/n) Σ_{(x,y)∈S} l(x, y; h_{S'}) ≤ r(λ_l + λ_η LC) + L √(log(p/(p+ϵ-1)) / (2n)).

Proof. We first bound E_{y_i∼η(x_i)}[l(x_i, y_i; h_{S'})], where η(x) is the output of the model, and we write y_i ∼ η_k(x_i) for {y_i = k} ∼ η_k(x_i). Since S' is a p-partial r-cover for the distribution, for x_i ∼ µ, with probability p there is an x_j ∈ S' such that |x_i - x_j| ≤ r and x_j has zero loss. The following inequalities therefore hold with probability p:

E_{y_i∼η(x_i)}[l(x_i, y_i; h_{S'})] = Σ_{k∈[C]} p_{y_i∼η_k(x_i)}(y_i = k) l(x_i, k; h_{S'}).

By applying Lemma 1 and the zero training loss condition on x_j, we have:

≤ Σ_{k∈[C]} p_{y_i∼η_k(x_j)}(y_i = k) l(x_i, k; h_{S'}) + Σ_{k∈[C]} |η_k(x_i) - η_k(x_j)| l(x_i, k; h_{S'})
= Σ_{k∈[C]} p_{y_i∼η_k(x_j)}(y_i = k) [l(x_i, k; h_{S'}) - l(x_j, k; h_{S'})] + Σ_{k∈[C]} |η_k(x_i) - η_k(x_j)| l(x_i, k; h_{S'}).
By applying the Lipschitz property of the regression function and the loss function, we have:

≤ Σ_{k∈[C]} p_{y_i∼η_k(x_j)}(y_i = k) λ_l |x_i - x_j| + Σ_{k∈[C]} l(x_i, k; h_{S'}) λ_η |x_i - x_j|
≤ λ_l |x_i - x_j| + Σ_{k∈[C]} l(x_i, k; h_{S'}) λ_η |x_i - x_j|.

By applying the loss bound and the p-partial r-cover constraint, we have:

≤ λ_l |x_i - x_j| + LC λ_η |x_i - x_j| = (λ_l + λ_η LC) |x_i - x_j| ≤ r(λ_l + λ_η LC).

Thus, the following inequality holds with probability p:

E_{y_i∼η(x_i)}[l(x_i, y_i; h_{S'})] ≤ r(λ_l + λ_η LC).    (4)

Also, by Hoeffding's bound (Hoeffding, 1994), with probability at least 1 - γ, we have:

(1/n) Σ_{(x,y)∈S} l(x, y; h_{S'}) - E_{y_i∼η(x_i)}[l(x_i, y_i; h_{S'})] ≤ L √(log(1/γ) / (2n)).    (5)

By combining Equation 4 and Equation 5, we have that with probability at least 1 - ϵ:

(1/n) Σ_{(x,y)∈S} l(x, y; h_{S'}) ≤ r(λ_l + λ_η LC) + L √(log(p/(p+ϵ-1)) / (2n)).

C.2 PROOF OF PROPOSITION 1

Proof. Since f_r is Riemann-integrable, we can compute ∫_0^1 f_r(p) dp, which is AUC_pr, as a Riemann sum. Specifically, by evenly dividing [0, 1] into 0 = p_0 < p_1 < p_2 < ... < p_n = 1, we have:

∫_0^1 f_r(p) dp = lim_{n→∞} Σ_{i=1}^n f_r(p_i)(p_i - p_{i-1}).

Since p_i = P_µ(d(S, x) ≤ f_r(p_i)), we have:

∫_0^1 f_r(p) dp = lim_{n→∞} Σ_{i=1}^n f_r(p_i) P_µ(f_r(p_{i-1}) < d(S, x) ≤ f_r(p_i)).

We know f_r is an increasing function with f_r(0) = 0. Letting ∆_i = f_r(p_i) - f_r(p_{i-1}), we have:

= lim_{n→∞} Σ_{i=1}^n (Σ_{j=1}^i ∆_j) P_µ(f_r(p_{i-1}) < d(S, x) ≤ f_r(p_{i-1}) + ∆_i).

As n → ∞, we have ∆_i → 0, so:

= lim_{∆→0} ∫_0^∞ r P_µ(r ≤ d(S, x) < r + ∆) dr,

which is the expectation of the minimum distance to the data in S. So we conclude that AUC_pr(S) is the expectation of the minimum distance to the data in S.

D.1 EVALUATION RESULTS ON SVHN AND CINIC10

In this section, we present our comparison experiments on SVHN and CINIC10 in Table 5 and Table 6, respectively. Since SVHN classification is simpler than the other classification tasks, we also study the performance of coresets at a 95% pruning rate. We find comparison results similar to those discussed in Section 4.1. For both datasets, we observe that CCS achieves comparable accuracy at low pruning rates and higher accuracy than random sampling and the other baseline coreset selection methods at high pruning rates.

D.2 CORESET COVERAGE ANALYSIS

Figure 7 and Table 7 compare the AUC_pr of CCS with the forgetting score against the vanilla forgetting method (pruning easy examples first) on the CIFAR10 dataset. We observe that CCS has a lower AUC_pr and better accuracy at the same pruning rate, which suggests that the proposed coverage-centric method does improve the data coverage of the coresets.

For the results in Table 10, we used the popular softmax function. More specifically, since a smaller AUM score indicates larger importance, we calculated the sampling probability p_i for the i-th example as:

p_i = e^{(s_max - s_i)/s_max} / Σ_{n=1}^N e^{(s_max - s_n)/s_max},

where s_i is the AUM score of each example, s_max is the maximum possible AUM score, and N is the number of examples. We do not rule out that a better-designed importance sampling technique exists, say, by rethinking the relationship between importance score and probability of selection. CCS uses stratified sampling, which we found to work well, and thus provides a good baseline for any further work in this area.
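The sampling-probability formula above can be implemented directly; the sketch below assumes AUM scores bounded by s_max and uses a numerically stable softmax (the name aum_sampling_probs is ours).

```python
import numpy as np

def aum_sampling_probs(scores, s_max):
    """Softmax-style sampling probabilities from AUM scores.

    Smaller AUM means higher importance, so each score is flipped to
    (s_max - s_i) and scaled by s_max before the softmax.
    """
    logits = (s_max - np.asarray(scores, dtype=float)) / s_max
    w = np.exp(logits - logits.max())   # subtract max for numerical stability
    return w / w.sum()

probs = aum_sampling_probs([0.9, 0.1, -0.5], s_max=1.0)
# probabilities sum to 1 and increase as the AUM score decreases
```

Sampling the coreset then amounts to drawing without replacement with these probabilities, e.g. via numpy.random.Generator.choice.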



https://github.com/haizhongzheng/Coverage-centric-coreset-selection

By the Radon-Nikodym theorem, there is a probability density P_µ such that P_µ(x)dx = dµ(x). We noticed that some misclassified examples have a much larger minimum cover radius than other data. To eliminate the influence of these outliers, when plotting p-r curves we ignore the data misclassified by the model trained on the entire CIFAR10 dataset (95.25% accuracy). Further evaluations confirm that coresets selected by CCS have a lower AUC_pr (i.e., better coverage) than the SOTA method, which prunes easy examples first; see Appendix D.2.



Figure 1: Existing coreset selection methods achieve better accuracy than random sampling at low pruning rates, but perform worse at high pruning rates.

Figure 2: The p-r curves of different subsets of the CIFAR10 training dataset. The dataset without any pruning has the lowest curve (blue). The forgetting curve (green) and the random curve (orange) cross over around p = 80%.

Figure 4: CIFAR10 data distribution with AUM as the importance metric (lower AUM values are more important). At a 70% pruning rate, the SOTA method selects data from the red region, since it prunes low-importance (easy) examples first. Our method, CCS, selects data from the green region using stratified sampling across the importance metric, while also pruning the hardest examples on the left. The coreset selected by the SOTA method lacks "easy" examples from the high-density region on the right.

2) Entropy (Coleman et al., 2019): select the highest-entropy examples, since entropy reflects the uncertainty of a training example and thus its importance to training. 3) Forgetting score (Toneva et al., 2018): select examples with the highest forgetting scores, i.e., the number of times an example is incorrectly classified after having been correctly classified earlier during training. 4) EL2N (Paul et al., 2021): select examples with the highest EL2N scores, which estimate data difficulty or importance by the L2 norm of the error vector. 5) Area under the margin (AUM) (Pleiss et al., 2020): select examples with the lowest AUM scores; AUM measures the probability gap between the target class and the next largest class across all training epochs, and a smaller AUM suggests higher difficulty and importance. 6) Moderate (Xia et al., 2023) is concurrent work that proposes a distance-based score for one-shot coreset selection; it treats examples close to the median of the feature space as more important. For all baselines, we prune unimportant examples first when selecting coresets, based on the chosen importance metric. In Section 4.1, we combine CCS with the AUM score; we discuss combining CCS with other importance metrics in Appendix D.
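To make one of these metrics concrete, the forgetting score can be accumulated with a simple per-epoch update: a forgetting event is counted whenever an example flips from correctly classified in the previous epoch to misclassified in the current one. The sketch below is a hypothetical illustration (function name and toy data are ours), not the implementation of Toneva et al. (2018).

```python
import numpy as np

def update_forgetting(prev_correct, now_correct, forget_counts):
    """One epoch's update of the forgetting score: count a forgetting
    event for each example that was classified correctly in the
    previous epoch but is misclassified now."""
    forgotten = prev_correct & ~now_correct
    forget_counts += forgotten.astype(int)
    return now_correct.copy(), forget_counts

# Toy run over 3 "epochs" for 4 examples (True = correctly classified).
correct_per_epoch = np.array([[1, 1, 0, 1],
                              [0, 1, 1, 1],
                              [1, 0, 1, 1]], dtype=bool)
prev = np.zeros(4, dtype=bool)
counts = np.zeros(4, dtype=int)
for epoch in correct_per_epoch:
    prev, counts = update_forgetting(prev, epoch, counts)
assert counts.tolist() == [1, 1, 0, 0]   # examples 0 and 1 were each forgotten once
```

Examples with high counts (frequently forgotten) are treated as hard and important; examples never forgotten are treated as easy.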

Relationship between β and accuracy.

Given a metric space $(X, d)$, a distribution $P_\mu$, a cover percentage $p$, a set $S$, and $f_r(p): [0,1] \to [0,+\infty]$ representing the mapping between $p$ and $r$: if $f_r$ is Riemann-integrable, the AUC of the p-r curve is

$$AUC_{pr}(S) = \int_0^1 f_r(p)\,dp = \mathbb{E}_{x\sim P_\mu}[d(S,x)], \quad \text{where } d(S,x)=\min_{x'\in S} d(x',x).$$

Figure 7: The relationship between AUC_pr and accuracy. A larger circle indicates a higher pruning rate. Compared to the vanilla forgetting method, CCS with the forgetting score achieves a lower AUC_pr and better accuracy.

AUC_pr of different methods at different pruning rates. At low pruning rates (30%, 50%), forgetting and AUM have AUC_pr similar to random. At high pruning rates (80%, 90%), forgetting and AUM have larger AUC_pr than random, which means worse data coverage.

Algorithm 1: Coverage-centric Coreset Selection (CCS)
Inputs: S = {(x_i, y_i, s_i)}_{i=1}^n: dataset with an importance score for each example; α: dataset pruning rate; β: hard cutoff rate (β ≤ 1 - α); k: the number of strata.
S′ ← S \ {⌊n·β⌋ hardest examples}  /* Prune hard examples first */
m ← ⌊n·(1 - α)⌋  /* Coreset budget */
R_1, R_2, ..., R_k ← split the scores in S′ into k ranges of even width;
B ← {B_i : B_i consists of the examples whose scores fall in R_i, i = 1...k};
while B ≠ ∅ do
    B_min ← the stratum in B with the fewest examples;  /* Select the stratum with fewest examples */
    m_B ← min{|B_min|, ⌊m/|B|⌋};  /* Compute the budget for the selected stratum */
    S_B ← randomly sample m_B examples from B_min;
    add S_B to the coreset; B ← B \ {B_min}; m ← m - m_B;
end
However, unlike SOTA methods that simply select the most important examples first, CCS still guarantees data coverage of high-density areas at high pruning rates, which improves coverage and lowers AUC_pr.
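A minimal NumPy sketch of Algorithm 1 follows. The function name, the equal-width binning via `np.digitize`, and the tie-breaking details are our own choices; we assume a larger score means a harder example.

```python
import numpy as np

def ccs_select(scores, budget, beta, k=50, seed=0):
    """Sketch of Coverage-centric Coreset Selection (Algorithm 1):
    prune the hardest beta fraction, stratify the remaining scores
    into k equal-width ranges, then fill the budget starting from
    the smallest stratum, sharing the budget evenly."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    # 1) Prune the floor(n * beta) hardest examples first.
    keep = np.argsort(scores)[: n - int(n * beta)]
    # 2) Split the remaining score range into k equal-width strata.
    edges = np.linspace(scores[keep].min(), scores[keep].max(), k + 1)
    bins = np.digitize(scores[keep], edges[1:-1])
    strata = sorted([keep[bins == i] for i in range(k) if (bins == i).any()],
                    key=len)
    # 3) Fill the budget, smallest stratum first: m_B = min(|B_min|, m / |B|).
    selected, m = [], budget
    while strata:
        b = strata.pop(0)                        # stratum with fewest examples
        m_b = min(len(b), m // (len(strata) + 1))
        selected += rng.choice(b, size=m_b, replace=False).tolist()
        m -= m_b
    return np.array(selected)

scores = np.random.default_rng(1).random(1000)   # toy importance scores
coreset = ccs_select(scores, budget=100, beta=0.1, k=10)
assert len(coreset) == 100                       # budget met
assert len(set(coreset.tolist())) == 100         # no duplicates
```

Giving small strata their full membership first and re-sharing the leftover budget is what keeps sparse score ranges represented, in contrast to pruning by importance alone.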

Ablation study on the components of the coverage-centric coreset selection method with the forgetting score as the importance metric. The best accuracy in each column is in bold. Next, we study the impact of stratified sampling and of pruning hard examples on model accuracy. We measure the forgetting score on CIFAR10 and then apply only stratified sampling or only hard-example pruning to select the coreset, respectively, in Table

As discussed in Section 2.1, Sorscher et al. (2022) use k-means to estimate importance scores, and Sener & Savarese (2017) apply greedy k-center to choose a coreset with good data coverage. However, the k-means and k-center algorithms need to calculate the distance matrix between examples, which leads to O(n^2) time complexity and is expensive to scale to large datasets.
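For context, greedy k-center selection can be sketched as below. This is our own NumPy realization (not the code of Sener & Savarese): each round scans all n points for the one farthest from the current centers, so selecting k centers costs O(nk) distance evaluations, and materializing the full pairwise distance matrix costs O(n^2) memory and time.

```python
import numpy as np

def greedy_k_center(X, k, seed=0):
    """Greedy k-center selection sketch: repeatedly add the point
    farthest from the current set of centers."""
    rng = np.random.default_rng(seed)
    centers = [int(rng.integers(len(X)))]
    # Distance of every point to its nearest chosen center so far.
    d = np.linalg.norm(X - X[centers[0]], axis=1)
    while len(centers) < k:
        nxt = int(np.argmax(d))                  # farthest point joins the coreset
        centers.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(centers)

X = np.random.default_rng(2).normal(size=(500, 8))   # toy feature vectors
sel = greedy_k_center(X, k=50)
assert len(sel) == 50
assert len(set(sel.tolist())) == 50                  # all centers distinct
```

Even this incremental variant touches every point once per selected center, which is the scalability bottleneck the text refers to.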

We compare the testing accuracy of CCS against the baseline methods and random selection on the CIFAR10 dataset at different pruning rates. When we prune 30% or 50% of the data, CCS achieves comparable accuracy. At a 70%, 80%, or 90% pruning rate, CCS outperforms the baselines by a large margin. The model trained with the entire dataset has 95.23% accuracy.

Accuracy results on CIFAR100 comparing CCS against other baselines. At low pruning rates, CCS achieves accuracy comparable to the baselines. At higher pruning rates, CCS outperforms both the baselines and random selection. The model trained with the entire dataset has 78.21% accuracy. Similar to Coleman et al. (2019), we also conduct a coreset transferability study. We train a ResNet18 model on the entire CIFAR10 dataset and use the forgetting score to select coresets at different pruning rates. We then train a ResNet50 model on these coresets. The evaluation results are shown in Table 8. We find that coresets transfer well across these two architectures.

Accuracy results on SVHN comparing CCS against other baseline methods. The pruning rate is the fraction of examples removed from the original dataset; we report the testing accuracy of the model trained on the pruned dataset. The model trained with the entire dataset has 96.12% accuracy. When we prune 30% or 50% of the data, CCS achieves comparable accuracy. At a 70%, 90%, or 95% pruning rate, CCS outperforms the baseline methods by a large margin.

Accuracy results on CINIC10 comparing CCS against other baseline methods. The pruning rate is the fraction of examples removed from the original dataset; we report the testing accuracy of the model trained on the pruned dataset. The model trained with the entire dataset has 89.97% accuracy. When we prune 30% or 50% of the data, CCS achieves comparable accuracy. At a 70%, 80%, or 90% pruning rate, CCS outperforms the baseline methods by a large margin.

AUC pr comparison between the vanilla forgetting method and CCS. Coresets obtained by CCS show a lower AUC pr , which suggests better data coverage.

We train ResNet50 with coresets of CIFAR10 selected by the forgetting score computed from training ResNet18. The evaluation results show that the coresets selected by CCS transfer well across models.

Pruning rate      30%    50%    70%    80%    90%
Random            94.53  92.87  89.03  88.48  78.94
Forgetting        95.22  94.67  91.95  71.10  35.21
Forgetting (CCS)  95.08  94.99  92.84  89.59  85.65

Accuracy results on CIFAR10 comparing importance sampling against other baseline methods. The pruning rate is the fraction of examples removed from the original dataset; we report the testing accuracy of the model trained on the pruned dataset. Importance sampling performs better than AUM at high pruning rates, but worse than CCS and random sampling. The highest accuracy for each pruning rate is in bold.

ACKNOWLEDGEMENTS

We thank Yuwei Bao, Tony Zhang, and Yiwen Zhang for helpful discussions. We thank the anonymous reviewers and area chairs for their time and effort during the reviewing process; their feedback helped us improve our paper. This work is partially supported by DARPA under agreement number 885000. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of our research sponsors.

A OVERVIEW

This appendix provides details on our experimental setup, theoretical analysis, and additional evaluation results. In Section B, we describe our experimental setup in detail to improve the reproducibility of our paper. In Section C, we provide proofs of Theorem 1 and Proposition 1. Finally, in Section D, we present evaluation results on SVHN and CINIC10 and study the transferability of coresets selected by CCS.

B DETAILED EXPERIMENTAL SETTING

CIFAR10 and CIFAR100 (Krizhevsky et al., 2009). We use ResNet18 (He et al., 2016) as the network architecture for CIFAR10/CIFAR100. For all coresets at all pruning rates, we train models for 40,000 iterations with a batch size of 256 (about 200 epochs over the entire dataset). We use the SGD optimizer (0.9 momentum, 0.0002 weight decay) with a 0.1 initial learning rate and a cosine annealing learning rate scheduler (Loshchilov & Hutter, 2017) with a 0.0001 minimum learning rate. We use a 4-pixel padding crop and a random horizontal flip as data augmentation.

SVHN (Netzer et al., 2011). We use ResNet18 (He et al., 2016) as the network architecture. For all coresets at all pruning rates, we train models for 30,000 iterations with a batch size of 256 (about 100 epochs over the entire dataset). We use the SGD optimizer (0.9 momentum, 0.0002 weight decay) with a 0.1 initial learning rate and a cosine annealing learning rate scheduler (Loshchilov & Hutter, 2017) with a 0.0001 minimum learning rate.

CINIC10 (Darlow et al., 2018). The CINIC10 dataset contains 270,000 images in total, evenly split into three subsets: training, validation, and test. Guided by Darlow et al. (2018), we combine the training and validation sets to form a large training dataset of 180,000 images and measure test accuracy on the test set (90,000 examples). We use ResNet18 (He et al., 2016) as the network architecture. For all coresets at all pruning rates, we train models for 70,000 iterations with a batch size of 256 (about 100 epochs over the entire dataset), with the same optimizer, learning rate schedule, and data augmentation as for CIFAR10.

ImageNet (Deng et al., 2009). We use ResNet34 (He et al., 2016) as the network architecture. For all coresets at all pruning rates, we train models for 300,000 iterations with a batch size of 256 (about 60 epochs over the entire dataset). We use the SGD optimizer (0.9 momentum, 0.0001 weight decay) with a 0.1 initial learning rate and a cosine annealing learning rate scheduler (Loshchilov & Hutter, 2017).

Difficulty score calculation. For CIFAR10 and CIFAR100, we train a model on the entire dataset for 200 epochs to calculate the forgetting score and AUM, and use the last checkpoint to calculate the entropy of each training example. We use the first 10 epochs' training information to calculate the EL2N score. For SVHN and CINIC10, we train a model on the entire dataset for 100 epochs to calculate the forgetting score and AUM, and use the last checkpoint to calculate entropy. For ImageNet, we train on the entire dataset for 90 epochs to calculate the forgetting and AUM scores, and use the first 10 epochs' training information to calculate the EL2N score.

Coverage-centric method settings. We set the number of strata k = 50 for all datasets and pruning rates. We use grid search with a 0.1 step size to find the optimal hard cutoff rate β for each dataset and pruning rate. For each dataset, we list the optimal β for every α as (α, β) tuples:

For CIFAR10, the optimal setting is (30%, 0), (50%, 0), (70%, 10%), (80%, 10%), (90%, 30%). For CIFAR100, the optimal setting is (30%, 10%), (50%, 20%), (70%, 20%), (80%, 40%), (90%, 50%). For SVHN, the optimal setting is (30%, 0), (50%, 0), (70%, 10%), (90%, 10%), (95%, 20%). For CINIC10, the optimal setting is (30%, 0), (50%, 0), (70%, 10%), (80%, 10%), (90%, 20%). For ImageNet, the optimal setting is (30%, 0), (50%, 10%), (70%, 20%), (80%, 20%), (90%, 30%). Each model is trained on an NVIDIA 2080 Ti GPU. We also include the implementation in the supplementary material.
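The cosine annealing schedule used throughout can be written down explicitly. The helper name `cosine_lr` is ours; the values (initial learning rate 0.1, minimum 0.0001, 40,000 iterations for CIFAR10/100) follow the setup above.

```python
import math

def cosine_lr(step, total_steps, lr_max=0.1, lr_min=1e-4):
    """Cosine annealing from lr_max down to lr_min over total_steps,
    matching the stated schedule (initial lr 0.1, minimum lr 1e-4)."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / total_steps))

total = 40_000  # CIFAR10/100 iteration budget
assert math.isclose(cosine_lr(0, total), 0.1)        # starts at the initial lr
assert math.isclose(cosine_lr(total, total), 1e-4)   # ends at the minimum lr
```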

D.4 ADDITIONAL EVALUATION ON IMAGENET

In this section, we present our comparison experiments on ImageNet, a larger-scale dataset than the other four we evaluate, in Table 9. Due to the large computational cost of ImageNet training, we only apply CCS to the AUM importance score and, similar to the prior work of Sorscher et al. (2022), we train each model once. From Table 9, we observe that CCS outperforms all baseline methods at high pruning rates and achieves comparable performance at low pruning rates. EL2N and AUM still experience a catastrophic accuracy drop at high pruning rates. Specifically, compared to random sampling, CCS achieves a 5.02% improvement at a 90% pruning rate. These results indicate that CCS also scales to large datasets.

Forgetting performs very similarly to random at high pruning rates. We believe the reason is that the training dynamics for ImageNet are only collected over 90 epochs, so an example can be forgotten at most 45 times. Given the huge size of the ImageNet dataset, many examples can therefore share the same forgetting count. Since coreset selection randomly samples among examples with tied scores, the forgetting method on ImageNet adds more randomness to the selection.

Another potential way to resolve the coverage issue is an importance-sampling-based method: sample each example in the dataset with a probability based on its importance score, assigning smaller sampling probabilities to less important examples and larger probabilities to more important ones. We present preliminary experiments on CIFAR10 with AUM as the underlying importance metric, using a softmax function to map importance scores to probabilities. Unfortunately, the experimental results in Table 10 show that this scheme fails to beat random selection at high pruning rates, although it does beat the SOTA method there. Not beating random matters, since random selection is the simplest possible strategy for constructing a coreset. This result also suggests that it is not easy to design a coreset selection scheme that beats random selection at high pruning rates. Our proposed scheme, CCS, does so on all five datasets we evaluated, and it is the first coreset selection scheme to achieve this, as far as we are aware. We note that the relationship between a data importance score and a sampling probability with good properties is not obvious. Unlike many statistical machine learning methods such as KNN or k-means, which have well-understood mathematical properties, data importance scores in deep learning classification are usually computed heuristically (as discussed at the beginning of Section 4), which makes it non-trivial to derive effective sampling probabilities from heuristic importance scores.

