MODERATE CORESET: A UNIVERSAL METHOD OF DATA SELECTION FOR REAL-WORLD DATA-EFFICIENT DEEP LEARNING

Abstract

Deep learning methods nowadays rely on massive data, resulting in substantial costs of data storage and model training. Data selection is a useful tool to alleviate such costs, where a coreset of massive data is extracted to practically perform on par with full data. Based on carefully-designed score criteria, existing methods first compute the score of each data point and then select the data points whose scores lie in a certain range to construct a coreset. These methods work well in their respective preconceived scenarios but are not robust to the change of scenarios, since the optimal range of scores varies as the scenario changes. This issue limits the application of these methods, because realistic scenarios often mismatch preconceived ones, and it is inconvenient or infeasible to tune the criteria and methods accordingly. In this paper, to address the issue, the concept of the moderate coreset is discussed. Specifically, given any score criterion of data selection, different scenarios prefer data points with scores in different intervals. As the score median is a statistical proxy of the score distribution, the data points with scores close to the score median can be seen as a proxy of full data that generalizes across different scenarios, and are therefore used to construct the moderate coreset. As a proof-of-concept, a universal method that builds on the moderate coreset and uses the distance of a data point to its class center as the score criterion is proposed to meet complex realistic scenarios. Extensive experiments confirm the advantage of our method over prior state-of-the-art methods, leading to a strong baseline for future research. The implementation is available at https://github.com/tmllab/Moderate-DS.

1. INTRODUCTION

Large-scale datasets, comprising millions of examples, are becoming the de-facto standard for achieving state-of-the-art deep learning models (Zhao et al., 2021; Zhao & Bilen, 2021; Yang et al., 2022). Unfortunately, at such scales, both the storage costs of the data and the computation costs of deep learning model training are tremendous and usually unaffordable for startups or non-profit organizations (Wang et al., 2018a; Coleman et al., 2020; Sorscher et al., 2022; Pooladzandi et al., 2022), which confines the success of deep learning models to specialized equipment and infrastructure (Yang et al., 2023). For instance, the storage needs of the ImageNet-22k (Deng et al., 2009) and BDD100K (Yu et al., 2020a) datasets are 1TB and 1.8TB respectively. Training PaLM (Chowdhery et al., 2022) once requires a training dataset containing 780 billion high-quality tokens and then takes almost 1,920 TPU years. Additionally, hyper-parameter tuning or network architecture search could further increase the computation costs, which is worrisome (Strubell et al., 2019; Schwartz et al., 2020). Data selection came into being to deal with large data and mitigate the above issues for data-efficient deep learning. More specifically, data selection aims to find the most essential data points and build a coreset of the large data. Training on the coreset is expected to preserve the model performance achieved by training on the large data (Huang et al., 2021b; Chen et al., 2021). Based on carefully-designed score criteria, recent works have presented various data selection algorithms, e.g., in terms of loss values (Han et al., 2018; Jiang et al., 2018), forgetting (Toneva et al., 2019; Sorscher et al., 2022), and gradient matching (Paul et al., 2021; Pooladzandi et al., 2022). In respect of procedures, these works first sort the scores achieved by all data points and then simply select the data points with either smaller or larger scores, according to different scenarios.
For instance, for a loss-based score criterion, if data is presumed a priori to be perfectly labeled, larger-loss data points are more important and are selected (Lei et al., 2022). Conversely, if data is corrupted by outliers, smaller-loss data points are more critical because of concerns about model robustness (Lyu & Tsang, 2019). State-of-the-art data selection methods can achieve promising performance as reported. However, they are specially designed for preconceived scenarios. This deliberate design makes them work well under certain situations and demands, but not stable, or even extremely sensitive, to changes of situations or demands, even when the change is slight (the concerns about complex realistic scenarios are detailed in Section 2.2). The issue severely restricts the practical applications of these methods, since realistic scenarios cannot always match preconceived ones well, and realistic demands change frequently over time (Hendrycks & Gimpel, 2017; Wu et al., 2021; Arjovsky et al., 2019; Piratla et al., 2020; Creager et al., 2021; Shen et al., 2021; Li et al., 2022b; Wei et al., 2023; Huang et al., 2023). It is inconvenient, troublesome, and often unachievable to tweak the methods accordingly (Lu et al., 2018). In this paper, to address the issue, we discuss a new concept of data selection, i.e., the moderate coreset, which is generic across multiple realistic tasks without any task-specific prior knowledge or adjustments. For the construction of the moderate coreset, given any score criterion of data selection, we characterize the score statistics as a distribution with respect to different scenarios. Namely, different scenarios correspond to and require data points with scores in different ranges. The distribution can be generally depicted by the median of the scores (James et al., 2013).
Accordingly, data points with scores close to the score median can be seen as a proxy of all data points, and are used to build a moderate coreset that generalizes across different scenarios. As a proof-of-concept, we present a universal method of data selection in complex realistic scenarios. Specifically, working with representations extracted by deep models, we use the distance of a data point to its class center as the score criterion. Data points with scores close to the score median are selected as a coreset for subsequent tasks. Compared with the complicated and time-consuming data selection procedures of many works, e.g., those relying on Hessian calculation (Yang et al., 2023), our method is simple and does not need to access model architectures or retrain models. We show that existing state-of-the-art methods are not robust to slight changes of their presumed scenarios, while the proposed method is superior to them in diverse data selection scenarios.

Contributions. Before delving into details, we clearly emphasize our contributions as follows:

• Different from prior works targeting preconceived scenarios, we focus on data selection in the real world, where encountered scenarios often mismatch preconceived ones. The concept of the moderate coreset is proposed to generalize across different tasks without any task-specific prior knowledge and fine-tuning.

• As a proof-of-concept, we propose a universal method operating on deep representations of data points for data selection in various realistic scenarios. The method successfully combines the advantages of simplicity and effectiveness.

• Comprehensive experiments comparing our method with the state of the art are provided. Results demonstrate that our method leads in multiple realistic cases, achieving lower time cost in data selection and better performance on follow-up tasks. This creates a strong baseline of data selection for future research.
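The selection rule described above (score each point by its distance to its class center, then keep points whose scores are nearest the score median) can be sketched in a few lines. This is a minimal illustration assuming class-wise deep representations are already available as a feature matrix; the function name, the use of Euclidean distance, and the default selection ratio are our assumptions for exposition, not the authors' released implementation.

```python
import numpy as np

def moderate_coreset(features, labels, ratio=0.5):
    """Sketch of median-proximity data selection (illustrative, not the
    authors' code): score each point by the Euclidean distance of its
    feature to its class center, then keep the `ratio` fraction of points
    whose scores are closest to the overall score median."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    scores = np.empty(len(features))
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        center = features[idx].mean(axis=0)  # class center in feature space
        scores[idx] = np.linalg.norm(features[idx] - center, axis=1)
    # distance of each score to the score median; smaller = more "moderate"
    gap = np.abs(scores - np.median(scores))
    k = int(len(features) * ratio)
    return np.argsort(gap)[:k]  # indices of the selected coreset
```

Note that, consistent with the simplicity claim above, this procedure needs only a single pass over fixed representations: no access to model architectures, no retraining, and no second-order quantities such as Hessians.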

2. BACKGROUND

2.1 DATA SELECTION RECAP

Data selection vs. data distillation/condensation. Data selection is a powerful tool as discussed. In data-efficient deep learning, other approaches are also widely studied nowadays, such as data distillation (Cazenavette et al., 2022; Bohdal et al., 2020; Wang et al., 2018a; Such et al., 2020; Nguyen et al., 2021; Sucholutsky & Schonlau, 2021) and data condensation (Wang et al.,

† Corresponding author (tongliang.liu@sydney.edu.au).

