COVERAGE-CENTRIC CORESET SELECTION FOR HIGH PRUNING RATES

Abstract

One-shot coreset selection aims to select, given a pruning rate, a representative subset of the training data that can later be used to train future models while retaining high accuracy. State-of-the-art (SOTA) coreset selection methods pick the highest-importance examples according to an importance metric and are found to perform well at low pruning rates. However, at high pruning rates, they suffer a catastrophic accuracy drop, performing worse than even random sampling. This paper explores the reasons behind this accuracy drop both theoretically and empirically. We first propose a novel metric to measure the coverage of a dataset on a specific distribution by extending the classical geometric set cover problem to a distribution cover problem. This metric helps explain why coresets selected by SOTA methods at high pruning rates perform worse than random sampling: they have worse data coverage. We then propose a novel one-shot coreset selection method, Coverage-centric Coreset Selection (CCS), that jointly considers overall data coverage on a distribution as well as the importance of each example. We evaluate CCS on five datasets and show that, at high pruning rates (e.g., 90%), it achieves significantly better accuracy than previous SOTA methods (e.g., at least 19.56% higher on CIFAR10) as well as random selection (e.g., 7.04% higher on CIFAR10), and comparable accuracy at low pruning rates. We make our code publicly available on GitHub.¹

1. INTRODUCTION

One-shot coreset selection aims to select a small subset of the training data that can later be used to train future models while retaining high accuracy (Coleman et al., 2019; Toneva et al., 2018). One-shot coreset selection is important because full datasets can be massive in many applications, and training on them can be computationally expensive. A favored way to select coresets is to assign an importance score to each example and select the more important examples to form the coreset (Paul et al., 2021; Sorscher et al., 2022). Unfortunately, current SOTA methods for one-shot coreset selection suffer a catastrophic accuracy drop at high pruning rates (Guo et al., 2022; Paul et al., 2021). For example, on CIFAR-10, a SOTA method (forgetting score (Toneva et al., 2018)) achieves 95.36% accuracy at a 30% pruning rate, but its accuracy drops to only 34.03% at a 90% pruning rate, which is significantly worse than random coreset selection. This accuracy drop is currently unexplained and limits how far coresets can practically be reduced in size. In this paper, we provide both theoretical and empirical insights into the reasons for the catastrophic accuracy drop and propose a novel coreset selection algorithm that overcomes this issue. We first extend the classical geometric set cover problem to a density-based distribution cover problem and provide theoretical bounds on model loss as a function of the coverage a coreset provides on a distribution. Based on this theoretical analysis, we propose a novel metric, AUC_pr, which quantifies how well a dataset covers a specific distribution (Section 3.1). With the proposed metric, we show that coresets selected by SOTA methods at high pruning rates have much worse data coverage than random pruning, suggesting a link between the poor data coverage of SOTA methods and their poor accuracy at high pruning rates.
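The AUC_pr metric itself is defined in Section 3.1. To illustrate the underlying intuition only, that coverage can be measured by how much of the data lies near some coreset point as a covering radius grows, here is a hedged sketch; the Euclidean metric, the radius grid, and the function names are our illustrative assumptions, not the paper's definition:

```python
import numpy as np

def coverage_curve(coreset, data, radii):
    """Fraction of `data` points lying within Euclidean distance r of at
    least one coreset point, for each radius r in `radii`."""
    # Pairwise distances, shape (n_data, n_coreset).
    dists = np.linalg.norm(data[:, None, :] - coreset[None, :, :], axis=-1)
    nearest = dists.min(axis=1)  # distance from each point to its closest coreset point
    return np.array([(nearest <= r).mean() for r in radii])

def coverage_auc(coreset, data, radii):
    """Area under the coverage-vs-radius curve; larger means better coverage."""
    curve = coverage_curve(coreset, data, radii)
    # Trapezoidal rule over the radius grid.
    return float(np.sum((curve[1:] + curve[:-1]) / 2 * np.diff(radii)))
```

Under this toy notion, a subset that perfectly covers the data (e.g., the data itself) attains the maximal area, `radii[-1] - radii[0]`, while a subset concentrated in one region of the distribution attains a strictly smaller value.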
We note that data coverage has also been studied in the active learning setting (Ash et al., 2019; Citovsky et al., 2021), but techniques from active learning do not trivially extend to one-shot coreset selection. We discuss the similarities and differences in Section 5. We then propose a novel algorithm, Coverage-centric Coreset Selection (CCS), that addresses the catastrophic accuracy drop by improving data coverage. Unlike SOTA methods, which prune unimportant (easy) examples first, CCS is inspired by stratified sampling and spreads the sampling budget across importance scores to achieve better coverage at high pruning rates (Section 3.3). We find that CCS overcomes the catastrophic accuracy drop at high pruning rates, outperforming SOTA methods by a significant margin in our evaluation on five datasets (CIFAR10, CIFAR100 (Krizhevsky et al., 2009), SVHN (Netzer et al., 2011), CINIC10 (Darlow et al., 2018), and ImageNet (Deng et al., 2009)). For example, at a 90% pruning rate on CIFAR10, CCS achieves 85.7% accuracy versus 34.03% for a SOTA coreset selection method based on forgetting scores (Toneva et al., 2018). Furthermore, CCS also outperforms random selection at high pruning rates. For example, at a 90% pruning rate, CCS achieves 7.04% and 5.02% better accuracy than random sampling on CIFAR10 and ImageNet, respectively (Section 4). Our method also outperforms concurrent work, Moderate (Xia et al., 2023), by 5.04% on CIFAR10 and 5.20% on ImageNet at a 90% pruning rate. At low pruning rates, CCS still achieves performance comparable to the baselines while outperforming random selection. To summarize, our contributions are as follows:

1. We extend the geometric set cover problem to a density-based distribution cover problem and provide a theoretical bound on model loss as a function of properties of a coreset (Section 3.1, Theorem 1).
2. We propose a novel metric, AUC_pr, to quantify the data coverage of a coreset (Section 3.1). As far as we know, AUC_pr is the first metric to measure how a dataset covers a distribution.
3. Using this metric, we show that SOTA coreset selection methods tend to have poor data coverage at high pruning rates (worse even than random selection), suggesting a link between coverage and the observed catastrophic accuracy drop (Section 3.2).
4. To improve coverage in coreset selection, we propose a novel one-shot coreset selection method, CCS, that uses a variation of stratified sampling across importance scores (Section 3.3).
5. We evaluate CCS on five datasets against six baselines and find that it significantly outperforms the baselines, including random coreset selection, at high pruning rates, while achieving comparable performance at low pruning rates (Section 4). Based on our results, we consider CCS a new strong baseline for future one-shot coreset selection methods, even at high pruning rates.
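As a concrete illustration of the stratified-sampling idea, one can split the importance-score range into equal-width strata, fill the smallest strata first, and pass any leftover budget on to the remaining strata. This is a sketch of the intuition, not the exact CCS algorithm of Section 3.3; the function name, the number of strata, and the equal-width binning are our own illustrative choices:

```python
import numpy as np

def stratified_coreset(scores, budget, n_strata=50, seed=0):
    """Select `budget` example indices, spreading the budget across
    equal-width importance-score strata instead of keeping only the
    highest-scoring examples."""
    rng = np.random.default_rng(seed)
    edges = np.linspace(scores.min(), scores.max(), n_strata + 1)
    # Assign each example to a stratum by its score (top edge inclusive).
    bins = np.minimum(np.digitize(scores, edges[1:]), n_strata - 1)
    strata = [np.flatnonzero(bins == b) for b in range(n_strata)]
    strata = [s for s in strata if len(s) > 0]
    strata.sort(key=len)  # visit small strata first; leftovers flow to larger ones
    selected = []
    for i, stratum in enumerate(strata):
        quota = (budget - len(selected)) // (len(strata) - i)
        take = min(len(stratum), quota)
        selected.extend(rng.choice(stratum, size=take, replace=False))
    return np.asarray(selected)
```

Contrast this with score-based pruning, which keeps only the `budget` highest-scoring examples and therefore leaves whole regions of the score range (and hence of the data distribution) uncovered at high pruning rates.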

2. PRELIMINARIES

ONE-SHOT CORESET SELECTION

Consider a classification task with a training dataset containing N examples drawn i.i.d. from an underlying distribution P. We denote the training dataset as S = {(x_i, y_i)}_{i=1}^{N}, where x_i is the data and y_i is the ground-truth label. The goal of one-shot coreset selection is to select, before training, a subset S' given a pruning rate α so as to maximize the accuracy of models trained on this subset. This can be formulated as the following optimization problem (Sener & Savarese, 2017):

min_{S'⊂S : |S'|/|S| ≤ 1-α}  E_{(x,y)∼P} [l(x, y; h_{S'})],    (1)

where l is the loss function and h_{S'} is the model trained on the labelled subset S'. SOTA methods typically assign an importance score (also called a difficulty score or importance metric) to each example and preferentially select more important (difficult) examples to form the coreset. One proposed importance score is the forgetting score (Toneva et al., 2018), defined as the number of times during training that an example transitions from being correctly classified to misclassified.

¹ https://github.com/haizhongzheng/Coverage-centric-coreset-selection
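The forgetting score can be computed from a record of per-epoch correctness. This sketch (the array name and shape convention are ours) counts correct-to-incorrect transitions for each example:

```python
import numpy as np

def forgetting_scores(correct):
    """`correct` is a boolean array of shape (n_epochs, n_examples):
    correct[t, i] is True iff example i is classified correctly after
    epoch t. The forgetting score of example i is the number of
    transitions correct -> incorrect across consecutive epochs."""
    correct = np.asarray(correct, dtype=bool)
    forgets = correct[:-1] & ~correct[1:]  # True where a forgetting event occurs
    return forgets.sum(axis=0)
```

Note that this sketch assigns a score of 0 to examples that are never correctly classified; some implementations instead treat such never-learned examples as having the maximal score so that they rank as hardest.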

