MODERATE CORESET: A UNIVERSAL METHOD OF DATA SELECTION FOR REAL-WORLD DATA-EFFICIENT DEEP LEARNING

Abstract

Deep learning methods nowadays rely on massive data, resulting in substantial costs of data storage and model training. Data selection is a useful tool to alleviate such costs, where a coreset of the massive data is extracted so that training on it performs practically on par with training on the full data. Based on carefully-designed score criteria, existing methods first compute the score of each data point and then select the data points whose scores lie in a certain range to construct a coreset. These methods work well in their respective preconceived scenarios but are not robust to changes in scenarios, since the optimal range of scores varies as the scenario changes. This issue limits the application of these methods, because realistic scenarios often mismatch preconceived ones, and it is inconvenient or infeasible to tune the criteria and methods accordingly. In this paper, to address this issue, the concept of a moderate coreset is discussed. Specifically, given any score criterion of data selection, different scenarios prefer data points with scores in different intervals. As the score median is a proxy for the score distribution in statistics, the data points with scores close to the score median can be seen as a proxy for the full data that generalizes across different scenarios; these points are used to construct the moderate coreset. As a proof-of-concept, a universal method that instantiates the moderate coreset, using the distance of a data point to its class center as the score criterion, is proposed to meet complex realistic scenarios. Extensive experiments confirm the advantage of our method over prior state-of-the-art methods, yielding a strong baseline for future research. The implementation is available at https://github.com/tmllab/Moderate-DS.

1. INTRODUCTION

Large-scale datasets, comprising millions of examples, are becoming the de-facto standard for achieving state-of-the-art deep learning models (Zhao et al., 2021; Zhao & Bilen, 2021; Yang et al., 2022). Unfortunately, at such scales, both the storage costs of the data and the computation costs of deep learning model training are tremendous and usually unaffordable for startups or non-profit organizations (Wang et al., 2018a; Coleman et al., 2020; Sorscher et al., 2022; Pooladzandi et al., 2022), which restricts the success of deep learning models to those with specialized equipment and infrastructure (Yang et al., 2023). For instance, the storage needs of the ImageNet-22k (Deng et al., 2009) and BDD100K (Yu et al., 2020a) datasets are 1TB and 1.8TB respectively. Training PaLM (Chowdhery et al., 2022) once requires a training dataset containing 780 billion high-quality tokens and takes almost 1,920 TPU years. Additionally, hyper-parameter tuning or network architecture search can further increase the computation costs, exacerbating the problem (Strubell et al., 2019; Schwartz et al., 2020).

Data selection emerged to handle large data and mitigate the above issues for data-efficient deep learning. More specifically, data selection aims to find the most essential data points and build a coreset of the large data. Training on the coreset is expected to match the model performance achieved by training on the large data (Huang et al., 2021b; Chen et al., 2021). Based on carefully-designed



† Corresponding author (tongliang.liu@sydney.edu.au).

