MODERATE CORESET: A UNIVERSAL METHOD OF DATA SELECTION FOR REAL-WORLD DATA-EFFICIENT DEEP LEARNING

Abstract

Deep learning methods nowadays rely on massive data, resulting in substantial costs of data storage and model training. Data selection is a useful tool to alleviate such costs, where a coreset of massive data is extracted to practically perform on par with full data. Based on carefully designed score criteria, existing methods first compute a score for each data point and then select the data points whose scores lie in a certain range to construct a coreset. These methods work well in their respective preconceived scenarios but are not robust to changes of scenarios, since the optimal range of scores varies as the scenario changes. This issue limits the application of these methods, because realistic scenarios often mismatch preconceived ones, and it is inconvenient or infeasible to tune the criteria and methods accordingly. In this paper, to address the issue, the concept of the moderate coreset is discussed. Specifically, given any score criterion of data selection, different scenarios prefer data points with scores in different intervals. As the score median is a proxy of the score distribution in statistics, the data points with scores close to the score median can be seen as a proxy of the full data that generalizes across different scenarios; these points are used to construct the moderate coreset. As a proof-of-concept, a universal method that instantiates the moderate coreset with the distance of a data point to its class center as the score criterion is proposed to meet complex realistic scenarios. Extensive experiments confirm the advantage of our method over prior state-of-the-art methods, leading to a strong baseline for future research. The implementation is available at https://github.

1. INTRODUCTION

Large-scale datasets, comprising millions of examples, are becoming the de-facto standard to achieve state-of-the-art deep learning models (Zhao et al., 2021; Zhao & Bilen, 2021; Yang et al., 2022). Unfortunately, at such scales, both the storage costs of the data and the computation costs of deep learning model training are tremendous and usually unaffordable by startups or non-profit organizations (Wang et al., 2018a; Coleman et al., 2020; Sorscher et al., 2022; Pooladzandi et al., 2022), which limits the success of deep learning models to specialized equipment and infrastructure (Yang et al., 2023). For instance, the storage needs of the ImageNet-22k (Deng et al., 2009) and BDD100K (Yu et al., 2020a) datasets are 1TB and 1.8TB respectively. Training PaLM (Chowdhery et al., 2022) once requires a training dataset containing 780 billion high-quality tokens and then takes almost 1,920 TPU years. Additionally, hyper-parameter tuning or network architecture search can further increase the computation costs, making the situation even more pessimistic (Strubell et al., 2019; Schwartz et al., 2020). Data selection has emerged to deal with large data and mitigate the above issues for data-efficient deep learning. More specifically, data selection aims to find the most essential data points and build a coreset of the large data. Training on the coreset is expected to match the model performance achieved by training on the large data (Huang et al., 2021b; Chen et al., 2021). Based on carefully designed score criteria, recent works have presented various data selection algorithms, e.g., in terms of loss values (Han et al., 2018; Jiang et al., 2018), forgetfulness (Toneva et al., 2019; Sorscher et al., 2022), and gradient matching (Paul et al., 2021; Pooladzandi et al., 2022). In terms of procedure, these works first sort the scores achieved by all data points and then simply select the data points with either smaller or larger scores, according to different scenarios.
For instance, for a loss-based score criterion, if data is presumed a priori to be perfectly labeled, larger-loss data points are more important and are selected (Lei et al., 2022). Conversely, if data is corrupted by outliers, smaller-loss data points are more critical because of the concern for model robustness (Lyu & Tsang, 2019). State-of-the-art data selection methods can achieve promising performance, as they show. However, they are specially designed for preconceived scenarios. This deliberate characteristic makes them work well under certain situations and demands, but not stable, or even extremely sensitive, to changes of situations or demands, even when the change is slight (the concerns about complex realistic scenarios are detailed in Section 2.2). The issue severely restricts the practical applications of these methods, since realistic scenarios cannot always match preconceived ones well, and realistic demands frequently change over time (Hendrycks & Gimpel, 2017; Wu et al., 2021; Arjovsky et al., 2019; Piratla et al., 2020; Creager et al., 2021; Shen et al., 2021; Li et al., 2022b; Wei et al., 2023; Huang et al., 2023). It is inconvenient, troublesome, and often unachievable to tweak existing methods accordingly (Lu et al., 2018). In this paper, to address the issue, we discuss a new concept for data selection, i.e., the moderate coreset, which is generic across multiple realistic tasks without any task-specific prior knowledge or adjustments. For the construction of the moderate coreset, given any score criterion of data selection, we characterize the score statistics as a distribution with respect to different scenarios. Namely, different scenarios correspond to and require data points with scores in different ranges. The distribution can be generally depicted by the median of scores (James et al., 2013).
Accordingly, data points with scores close to the score median can be seen as a proxy of all data points, which is used to build a moderate coreset and generalize across different scenarios. As a proof-of-concept, we present a universal method of data selection in complex realistic scenarios. Specifically, working with representations extracted by deep models, we implement the distance of a data point to its class center as the score criterion. Data points with scores close to the score median are selected as a coreset for follow-up tasks. Compared with the complicated and time-consuming data selection procedures of many works, e.g., those using Hessian calculation (Yang et al., 2023), our method is simple and does not need to access model architectures or retrain models. We show that existing state-of-the-art methods are not robust to slight changes of presumed scenarios, while the proposed method is superior to them in diverse data selection scenarios.

Contributions. Before delving into details, we clearly emphasize our contributions as follows:
• Different from prior works targeting preconceived scenarios, we focus on data selection in the real world, where encountered scenarios always mismatch preconceived ones. The concept of the moderate coreset is proposed to generalize across different tasks without any task-specific prior knowledge or fine-tuning.
• As a proof-of-concept, we propose a universal method operating on deep representations of data points for data selection in various realistic scenarios. The method successfully combines the advantages of simplicity and effectiveness.
• Comprehensive experiments comparing our method with the state-of-the-arts are provided. Results demonstrate that our method leads in multiple realistic cases, achieving lower time costs in data selection and better performance on follow-up tasks. This creates a strong baseline of data selection for future research.

2.1. DATA SELECTION RECAP

Data selection vs. data distillation/condensation. Data selection is a powerful tool as discussed. In data-efficient deep learning, other approaches are also widely studied nowadays, such as data distillation (Cazenavette et al., 2022; Bohdal et al., 2020; Wang et al., 2018a; Such et al., 2020; Nguyen et al., 2021; Sucholutsky & Schonlau, 2021) and data condensation (Wang et al., 2022; Zhao et al., 2021; Zhao & Bilen, 2021). This series of works focuses on synthesizing a small but informative dataset as an alternative to the original large-scale dataset. However, the works on data distillation and condensation are criticized for only synthesizing a small number of examples (e.g., 50 images per class) due to computational power limitations (Yang et al., 2023). In addition, their performance is far from satisfactory. Therefore, the performances of data distillation/condensation and data selection are not directly comparable.

Score criteria in data selection. Data selection has recently gained lots of interest. In terms of procedure, generally speaking, existing methods first design a score criterion based on a preconceived scenario (e.g., using prior knowledge or other scenario-judgment techniques). Afterwards, all data points of a dataset are sorted with the score criterion, and the data points with scores in a certain range are selected for follow-up tasks. For instance, in the scenario where data is corrupted with outliers, the data points with smaller losses are favored. Popular score criteria include, but are not limited to, prediction confidence (Pleiss et al., 2020), loss values (Jiang et al., 2018; Wei et al., 2020), margin separation (Har-Peled et al., 2007), gradient norms (Paul et al., 2021), and influence function scores (Koh & Liang, 2017; Yang et al., 2023; Pooladzandi et al., 2022).

2.2. INSTABILITY OF PRIOR WORKS TO CHANGED SCENARIOS

With designed score criteria, prior state-of-the-art methods have achieved promising performance in their respective scenarios. Unfortunately, they are not robust, or rather unstable, to changes of scenarios, which limits their realistic applications. We detail the issue from three practical respects.

Distribution shift scenarios. Methods intended for ideal data cannot be applied directly when data contains outliers, because outliers will be identified as contributing more to model learning (Zhang & Sabuncu, 2018) and hence selected. Conversely, methods designed for data involving outliers will select data points far from the decision boundary for model robustness. However, the uncontaminated data points close to the decision boundary are more important for model performance (Huang et al., 2010), and they are entangled with contaminated ones and cannot be identified exactly. A covariate-shift problem (Sugiyama & Kawanabe, 2012) arises to degenerate model performance if we only use the data points far from the decision boundary for model training.

Changing demands of coreset sizes. Different scenarios have different requirements on coreset sizes. Specifically, if a coreset is allowed to have a relatively large size, a coreset built from the data points that contribute more to model generalization (e.g., those close to the decision boundary) can work well for the next tasks. However, if a small size is required, selecting only these data points will make the convergence of deep models difficult, hurting model performance (Bengio et al., 2009; Zhou & Wu, 2021; Sorscher et al., 2022). Namely, such a data selection strategy is unstable to changes in coreset size.

Varied access to deep models. Lots of state-of-the-art methods rely on accessing the architectures of deep models and the permission of model retraining for data selection (Toneva et al., 2019; Yang et al., 2023; Sorscher et al., 2022; Paul et al., 2021; Feldman & Zhang, 2020).
In some scenarios, the requirements of these methods can be met to make them work well. However, if the requirements are not satisfied, e.g., due to secrecy concerns, these methods become invalid. Obviously, the issue of instability to changed scenarios restricts the applications of existing methods, since realistic scenarios cannot always match preconceived ones. Moreover, in many cases, it is expensive or impossible to tune methods frequently according to the changes, which demonstrates the urgency of developing new technologies.

3. METHODOLOGY

3.1. PROBLEM DEFINITION

We define the problem of data selection in data-efficient deep learning. Formally, we are given a large-scale dataset S = {s_1, ..., s_n} with a sample size n, where s_i = (x_i, y_i), x_i ∈ ℝ^d, and y_i ∈ [k]. The aim of data selection is to find a subset of S for follow-up tasks, which reduces both storage and training consumption. This subset is called the coreset and is expected to practically perform on par with the full data S. We denote the coreset as S* = {ŝ_1, ..., ŝ_m} with a sample size m and S* ⊂ S. The data selection ratio in building the coreset is then m/n.

3.2. PROCEDURE DESCRIPTION

Representation extraction. Given a well-trained deep model f(·) = g(h(·)), h(·) denotes the part of the model mapping input data to hidden representations at the penultimate layer, and g(·) is the part mapping such hidden representations to the output f(·) for classification. Namely, for a data point s = (x, y), its hidden representation is h(x). Therefore, with the trained deep model f(·) and the full training data S = {s_1, ..., s_n}, the hidden representations of all data points are acquired as {z_1 = h(x_1), ..., z_n = h(x_n)}. At the representational level, the center of class j is

z̄_j = ( ∑_{i=1}^{n} 1[y_i = j] z_i ) / ( ∑_{i=1}^{n} 1[y_i = j] ),  j = 1, ..., k,

where the mean of the representations from one class is calculated dimension-wise.
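As a concrete illustration of the class-center computation, the following pure-Python sketch averages representations dimension-wise per class. The function name and list-based data layout are illustrative assumptions; a real implementation would operate on framework tensors.

```python
from collections import defaultdict

def class_centers(reps, labels):
    """Compute the representational class centers z_bar_j: the dimension-wise
    mean of the hidden representations belonging to each class."""
    sums = defaultdict(list)   # class j -> running sum vector
    counts = defaultdict(int)  # class j -> number of points in class j
    for z, y in zip(reps, labels):
        if not sums[y]:
            sums[y] = [0.0] * len(z)
        sums[y] = [s + v for s, v in zip(sums[y], z)]
        counts[y] += 1
    # Divide each per-class sum by the class count to obtain the center.
    return {j: [s / counts[j] for s in sums[j]] for j in sums}
```

For instance, two class-0 points [0, 0] and [2, 2] yield the center [1.0, 1.0].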

Distance-based score for data selection. With hidden representations {z_1, ..., z_n} and representational class centers {z̄_1, ..., z̄_k}, the Euclidean distance from each representation to its corresponding class center can be simply calculated as d(s) = ∥z − z̄_j∥_2. The set consisting of the distances is {d(s_1), ..., d(s_n)}, with median M(d(s)). All data points are then sorted in ascending order of distance. Afterwards, the data points with distances closest to the median M(d(s)) are selected as the coreset S*, i.e.,

S* = {s_{a+1}, ..., s_{n−a}},

where a = (n − m)/2 relates to the data selection ratio. It should be noted that the distance-based score is similar to the confidence-based score: for a data point, a closer distance to its class center means that the deep model often gives a higher class posterior probability for this data point. However, the proposed method is easier to apply in the real world, because representations are more accessible than class posterior probabilities, e.g., by directly using public pretrained models of multiple tasks as representation extractors. Clearly, it is convenient to implement our method without accessing the internal structures of deep models or retraining them. This superiority leads to low time costs of data selection. The lower time costs are consistent with the aim of data-efficient deep learning and allow our method to be applied to large-scale datasets, e.g., ImageNet (Deng et al., 2009).
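The selection step above can be sketched in a few lines of pure Python (an illustration under assumed list-based representations; the function name and the integer rounding of m are implementation choices not specified by the paper, and `math.dist` requires Python 3.8+):

```python
import math

def moderate_coreset_indices(reps, labels, centers, ratio):
    """Moderate-DS sketch: compute d(s_i) = ||z_i - z_bar_{y_i}||_2, sort the
    points by distance ascending, and keep the middle m = ratio * n points,
    i.e. drop a = (n - m) / 2 points from each end of the sorted order."""
    n = len(reps)
    dists = [
        (math.dist(z, centers[y]), i)
        for i, (z, y) in enumerate(zip(reps, labels))
    ]
    dists.sort()                 # ascending distance to class center
    m = int(ratio * n)
    a = (n - m) // 2
    return sorted(i for _, i in dists[a:a + m])
```

With one-dimensional representations [0], [1], [2], [3], [4] (all class 0, center 2) and a 60% ratio, the points nearest the distance median are kept while the center-most and farthest points are dropped.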

3.3. MORE JUSTIFICATIONS FOR THE PROPOSED METHOD

We provide more justifications to discuss why our method can work well in realistic scenarios. The justifications come from two perspectives that are popularly used to analyze the robustness and generalization of deep models.

Representation structures. Prior works have claimed that model performance (e.g., accuracy and robustness) is highly related to representation structures (Yu et al., 2020b; Chan et al., 2022). Good representations should satisfy the following three properties and obtain a trade-off among them. (1) Between-class discriminative: representations of data from different classes are highly uncorrelated. (2) Within-class compressible: representations of data from the same class should be relatively correlated. (3) Maximally diverse representations: the variance of representations for each class should be as large as possible as long as they stay uncorrelated with the other classes. Referring to Figure 1, we can see that the proposed method meets the three properties, where the trade-off is achieved. However, data selection in the other ways cannot balance these three properties simultaneously. This perspective supports the effectiveness of our method in realistic scenarios.

Representation information measurement. Information bottleneck theory (Tishby et al., 2000; Achille & Soatto, 2018) has claimed that optimal representations should be both sufficient and minimal for various practical tasks. As in (Achille & Soatto, 2018), we give the following definition. We use a mutual information estimator (Belghazi et al., 2018) to estimate I(z; y), I(x; y), and I(x; z). The technical details of the estimator are provided in Appendix E. Experiments are conducted on CIFAR-100 (Krizhevsky, 2009). Each estimation is repeated 20 times and the mean is reported. As shown in Table 1, compared with the other two data selection ways, our method achieves a trade-off between sufficiency and minimality, which justifies its effectiveness.

4.1. EXPERIMENTAL SETUP

Datasets and network structures. We evaluate the effectiveness of our method on three popularly used datasets, i.e., CIFAR-100 (Krizhevsky, 2009), Tiny-ImageNet (Le & Yang, 2015), and ImageNet-1k (Deng et al., 2009). We first study the effectiveness of data selection methods with a preconfigured network structure; that is, the coreset and the full data are utilized for the same network structure. ResNet-50 (He et al., 2016) is exploited here. In Appendix D, we provide experiments with multiple network architectures.

Baselines. Multiple data selection methods act as baselines for comparison. Specifically, we use (1) Random; (2) Herding (Welling, 2009); (3) Forgetting (Toneva et al., 2019); (4) GraNd-score (Paul et al., 2021); (5) EL2N-score (Paul et al., 2021); (6) Optimization-based (Yang et al., 2023); (7) Self-sup.-selection (Sorscher et al., 2022). Due to the page limit, we provide the technical details of these baselines in Appendix A. Notice that the methods Forgetting, GraNd-score, and EL2N-score rely on model retraining for data selection. Besides, the method Optimization-based has heavy computational costs, due to the calculation of the Hessian in the influence function 1 and its iterative data selection process. The method Self-sup.-selection is troubled by the same issue, where both self-supervised pre-training and clustering are time-consuming (Na et al., 2010; Jaiswal et al., 2020). By contrast, Herding and our method only need a distance calculation and a sort operation once a representation extractor is available, leading to low time consumption in data selection. We name the proposed method Moderate-DS.

Implementation details. All experiments are conducted on NVIDIA GTX3090 GPUs with PyTorch (Paszke et al., 2019). For experiments on CIFAR-100, we adopt a batch size of 128, an SGD optimizer with a momentum of 0.9, a weight decay of 5e-4, and an initial learning rate of 0.
For experiments on ImageNet-1k, following (Sorscher et al., 2022), the VISSL library (Goyal et al., 2021) is exploited. We adopt a base learning rate of 0.01, a batch size of 256, an SGD optimizer with a momentum of 0.9, and a weight decay of 1e-3. We train for 100 epochs in total. Note that, because of the huge computational costs, the experiment in each case is performed once. All hyperparameters and experimental settings of training before and after data selection are kept the same.

4.2. COMPARISON WITH THE STATE-OF-THE-ARTS

We use the test accuracy achieved by training on coresets to verify the effectiveness of the proposed method. For experiments on CIFAR-100 and Tiny-ImageNet, as shown in Figure 2, the proposed method is highly competitive with state-of-the-art methods. Particularly, when the data selection ratio is low, e.g., 20%, 30%, and 40%, our method always achieves the best performance. Besides, for experiments on the more challenging ImageNet-1k, our method obtains the best results when data selection ratios are 60%, 70%, and 90%, and the result in the 80% case is very close to the best one. Therefore, combining the above discussions on data selection time consumption, we can claim that our method achieves promising performance on follow-up tasks with lower time costs in data selection.

4.3. MORE ANALYSES

Unseen network structure generalization. Prior works (Yang et al., 2023) show that, although data selection is performed with a predefined network structure, the selected data, i.e., the coreset, can generalize to unknown network architectures that are inaccessible during data selection. Following this line, we train ResNet-50 on CIFAR-100 and Tiny-ImageNet for data selection and further use the selected data to train different network architectures, i.e., SENet (Hu et al., 2018) and EfficientNet-B0 (Tan & Le, 2019). The experimental results in Table 2 show that the data selected by our method generalizes well to unseen network architectures. This superiority indicates that the proposed method can be employed in a wide range of applications regardless of specific network architectures.

Ablation study. We compare the proposed method with three other data selection ways that share the same distance-based score criterion. Specifically, we perform data selection with (1) data points closest to class centers; (2) data points farthest from class centers; and (3) data points at two ends, i.e., both data points close to class centers and far from class centers (named Two-ends). Experiments are conducted on CIFAR-100 and Tiny-ImageNet. Note that data points far from class centers make model training difficult, hence degenerating model performance, especially when the coreset is limited to a small size. The results are shown in Figure 3, which demonstrate that our method is more promising than the other data selection ways.

Boosting baselines with moderate coresets. As discussed before, we propose the concept of moderate coresets, which can be applied to other score-based data selection methods. To support this claim, we apply moderate coresets to boost the baselines GraNd-score and EL2N-score. In other words, based on the score criteria of GraNd-score and EL2N-score, we select the data points with scores close to the score median. We name the boosted methods GraNd-score+ and EL2N-score+ respectively.
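The median-window recipe used to build the boosted variants is independent of any particular score criterion, and can be sketched generically (the function name is hypothetical, and the integer rounding of m is an assumed implementation detail):

```python
def moderate_by_score(scores, ratio):
    """Given any per-point scores (e.g., GraNd or EL2N values), keep the
    points whose scores lie closest to the score median: sort indices by
    score and take the middle m = ratio * n of them."""
    n = len(scores)
    m = int(ratio * n)
    a = (n - m) // 2                       # points dropped from each end
    order = sorted(range(n), key=lambda i: scores[i])
    return sorted(order[a:a + m])
```

For scores [10, 1, 5, 7, 3] and a 60% ratio, the smallest and largest scores are dropped and the middle three points remain.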
Experiments are conducted on CIFAR-100 and Tiny-ImageNet. We provide the evidence in Figure 4, which demonstrates that the use of moderate coresets brings clear performance improvements.

5.1. ROBUSTNESS TO CORRUPTED IMAGES

In realistic scenes, training data is always polluted by corrupted images (Wang et al., 2018b; Hendrycks & Dietterich, 2019; Li et al., 2021; Xia et al., 2021b). Here, we demonstrate that our method is more robust than several state-of-the-art methods when corrupted images occur. Specifically, to corrupt images, we employ the following five types of realistic noises, i.e., Gaussian noise, random occlusion, resolution, fog, and motion blur, as shown in Figure 6 of Appendix C. The noises are then injected into a fraction of the images of CIFAR-100 and Tiny-ImageNet. The other settings are kept unchanged. The corruption rate is set to 5%, 10%, and 20% respectively. Experimental results are presented in Figure 5. As can be seen, for CIFAR-100 and Tiny-ImageNet with corrupted images, across selection ratios, our method outperforms baseline methods in most cases. More specifically, for a method that prefers easier data to reduce the side-effect of corrupted images, e.g., Herding, although corrupted images are selected less often, a covariate-shift problem (Sugiyama & Kawanabe, 2012) arises to degenerate model performance if we only use such data. In addition, for a method that prefers harder data, e.g., Forgetting, corrupted images are incorrectly selected, leading to unsatisfactory performance. Compared with the baselines, our method selects moderate data that avoids corruption while ensuring generalization, hence achieving better performance.
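As one example of the corruptions, random occlusion can be sketched as zeroing a square patch at a random position. The patch size and the zero fill value here are illustrative assumptions, not the exact settings used in the experiments:

```python
import random

def random_occlusion(img, patch, rng=random):
    """Corrupt an image (a 2-D list of pixel values) by zeroing a random
    patch x patch square; the input image is left untouched."""
    h, w = len(img), len(img[0])
    top = rng.randrange(h - patch + 1)
    left = rng.randrange(w - patch + 1)
    out = [row[:] for row in img]          # copy before corrupting
    for r in range(top, top + patch):
        for c in range(left, left + patch):
            out[r][c] = 0
    return out
```

For an 8x8 image of ones and a 3x3 patch, exactly nine pixels are zeroed, wherever the patch lands.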

5.2. ROBUSTNESS TO LABEL NOISE

Real-world datasets inevitably involve label noise, where partial clean labels are flipped to incorrect ones, resulting in mislabeled data (Northcutt et al., 2021; Xia et al., 2019; 2020). Label-noise robustness is of significance for a method's application (Song et al., 2022; Wei & Liu, 2021; Zhu et al., 2022). To discuss the robustness of different data selection methods against label noise, we conduct synthetic experiments with CIFAR-100 and Tiny-ImageNet. Symmetric label noise (Patrini et al., 2017; Xia et al., 2021a; Li et al., 2022a) is injected to generate mislabeled data. The noise rate is set to 20%. Although mislabeled data makes class centers somewhat biased, Herding and our method empirically work well. Experimental results are provided in Table 3. The results demonstrate that, compared with prior state-of-the-arts, our method is very effective for data selection under label noise.
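The symmetric label noise used here can be sketched as follows (a minimal illustration; the helper name and seed handling are assumptions). Each label is flipped to a uniformly random other class with probability equal to the noise rate:

```python
import random

def symmetric_label_noise(labels, num_classes, noise_rate, seed=0):
    """Generate mislabeled data under symmetric noise: with probability
    noise_rate, replace a label by a uniformly random *different* class."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            y = rng.choice([c for c in range(num_classes) if c != y])
        noisy.append(y)
    return noisy
```

With a 20% rate, roughly one fifth of the labels change, and every noisy label remains a valid class index.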

5.3. DEFENSE AGAINST ADVERSARIAL ATTACKS

It has been shown that deep networks are vulnerable to adversarial examples that are crafted by adding imperceptible but adversarial noise to natural examples (Szegedy et al., 2014; Ma et al., 2018b; Eykholt et al., 2018; Huang et al., 2021a; Zhou et al., 2021). In the real world, adversarial robustness is greatly important for the application of a method (Tramèr et al., 2018; Dong et al., 2022; Zhou et al., 2022). We utilize two popularly used adversarial attacks, i.e., PGD attacks (Madry et al., 2017) and GS attacks (Goodfellow et al., 2014), with a default perturbation budget ϵ = 8/255 for both CIFAR-100 and Tiny-ImageNet. We use the adversarial attacks and the models trained on CIFAR-100 and Tiny-ImageNet to craft adversarial examples. Afterwards, different methods are applied to the adversarial examples, and models are retrained on the selected data. As can be seen, the proposed method is more robust than several state-of-the-art methods, leading to the first average rank. The results confirm that our method is more competitive to be applied in practice.
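For intuition, the PGD update of Madry et al. (2017) can be sketched on a scalar toy problem. This is a hedged illustration: real PGD perturbs image tensors using model gradients, whereas here `grad_fn` is an assumed loss-gradient oracle:

```python
def pgd_attack(x0, grad_fn, eps, alpha, steps):
    """L_inf PGD on a scalar input: repeatedly step by alpha in the sign of
    the loss gradient, then project back into the eps-ball around the clean
    input x0, so the perturbation never exceeds the budget."""
    x = x0
    for _ in range(steps):
        x = x + alpha * (1.0 if grad_fn(x) > 0 else -1.0)
        x = max(x0 - eps, min(x0 + eps, x))  # projection onto the budget
    return x
```

For the toy loss L(x) = x^2 (gradient 2x) with ϵ = 8/255 and step size 2/255, the attack walks to the boundary of the budget and stays there.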

6. CONCLUSION

In this paper, we focus on data selection to boost data-efficient deep learning. Different from existing works that are usually limited to preconceived scenarios, we propose the concept of the moderate coreset to generalize across different scenarios. As a proof-of-concept, a universal method operating on data representations is presented for data selection in various circumstances. Extensive experiments confirm the effectiveness of the proposed method. For future work, we are interested in adapting our method to other domains such as natural language processing. Furthermore, we are also interested in applying the moderate coreset concept to multiple advanced data selection methods, both theoretically and empirically.

A TECHNICAL DETAILS OF BASELINE METHODS

Here we introduce the technical details of the baselines.
• "Random" randomly selects partial data from the full data.
• "Herding" (Welling, 2009) selects data points that are closer to class centers.
• "Forgetting" (Toneva et al., 2019) selects data points that are easily forgotten during optimization.
• "GraNd-score" (Paul et al., 2021) includes data points with larger loss-gradient norms.
• "EL2N-score" (Paul et al., 2021) focuses on data points with larger norms of the error vector, i.e., the predicted class probabilities minus the one-hot label encoding.
• "Optimization-based" (Yang et al., 2023) employs the influence function (Koh & Liang, 2017) and picks data points that yield a strictly constrained generalization gap.
• "Self-sup.-selection" (Sorscher et al., 2022) selects data points based on their difficulty, measured by the distance to the nearest cluster centroid after self-supervised pre-training and clustering. To avoid tuning the cluster number, we set it to the class number. Data points with larger distances are selected here.

B EXACT NUMERICAL EXPERIMENTAL RESULTS

In the main paper, we have provided illustrations comparing the proposed method (Moderate-DS) with several state-of-the-arts. Here, the exact numerical experimental results are presented in Tables 5, 6, and 7 for convenient checks and reference.

C SUPPLEMENTARY EXPERIMENTAL SETTINGS

In the main paper, we employ five types of realistic noises, i.e., Gaussian noise, random occlusion, resolution, fog, and motion blur, to corrupt images. The injected noise is shown in Figure 6.

D.1 EXPERIMENTS WITH DIFFERENT CNNS

To verify the effectiveness of our method with different convolutional neural networks (CNNs), we employ VGG-16 (Simonyan & Zisserman, 2014) and ShuffleNet V2 (Ma et al., 2018a). Experiments are conducted on Tiny-ImageNet. Results are provided in Table 8. As can be seen, across different network architectures, our Moderate-DS still works very well.

D.2 EXPERIMENTS WITH TRANSFORMER

To further demonstrate the superiority of our Moderate-DS, we employ a Transformer. The implementation is based on the public GitHub repository 2, where ViT-Small is used. We conduct experiments on simulated CIFAR-100. Following the main paper, we consider the simulated settings where the selection ratio is 20%. Empirical results in Table 9 verify the effectiveness of our method with the Transformer.

D.3 SUPPLEMENTARY EXPERIMENTS WITH MISLABELED DATA

In the main paper, we demonstrated the effectiveness of our method when the noise rate is 20%. Here, we increase the noise rate to 35% to demonstrate the superiority of our method Moderate-DS. We provide results in Table 10. Besides, we previously set the selection ratio to 20% and 30% when mislabeled data occur. We further increase the selection ratio to show the advantage of our method. Experiments are conducted on Tiny-ImageNet, with results provided in Table 11 and Figure 7. As we can see, although the baseline Forgetting achieves competitive performance in some cases, our method works well in most cases. When the selection ratio is small (20%, 30%, and 40%), the data selection task is more challenging, and our method achieves the best performance. Besides, the average rank achieved by our method is the best, which confirms the superiority of Moderate-DS.

D.4 SUPPLEMENTARY EXPERIMENTS WITH CORRUPTED IMAGES

We increase the ratio of corrupted images to 30% to show the effectiveness of the proposed method. Experimental results are provided in 

E TECHNICAL DETAILS OF THE MUTUAL INFORMATION ESTIMATOR

In the main paper, we employ the mutual information neural estimator (MINE) (Belghazi et al., 2018) to justify our claims. Here, we discuss the technical details of the estimator. Given two random variables X and Z, and n i.i.d. examples drawn from the joint distribution ℙ_XZ, MINE is defined as

Î(X; Z)_n = sup_{θ∈Θ} E_{ℙ^(n)_XZ}[T_θ] − log( E_{ℙ^(n)_X ⊗ ℙ̂^(n)_Z}[e^{T_θ}] ),

where T_θ is parameterized by a deep neural network with parameters θ, and ℙ^(n) denotes the empirical distribution associated with the n i.i.d. examples. Following (Belghazi et al., 2018), the procedure of MINE for the estimation of mutual information is given in Algorithm 1.

Algorithm 1 MINE

θ ← initialize network parameters
repeat
    Draw b minibatch examples from the joint distribution: (x^(1), z^(1)), ..., (x^(b), z^(b)) ∼ ℙ_XZ
    Draw b examples from the Z marginal distribution: z̄^(1), ..., z̄^(b) ∼ ℙ_Z
    Evaluate the lower bound: V(θ) ← (1/b) ∑_{i=1}^{b} T_θ(x^(i), z^(i)) − log( (1/b) ∑_{i=1}^{b} e^{T_θ(x^(i), z̄^(i))} )
    Evaluate bias-corrected gradients (e.g., with a moving average): Ĝ(θ) ← ∇_θ V(θ)
    Update the statistics network parameters: θ ← θ + Ĝ(θ)
until convergence
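MINE trains a critic network by gradient ascent. As a self-contained toy, not the estimator used in our experiments, the same Donsker-Varadhan bound can be maximized over a one-parameter critic T_θ(x, z) = θxz by grid search; even this crude stand-in separates dependent from independent Gaussian pairs. All names and constants below are illustrative assumptions:

```python
import math
import random

def dv_bound(xs, zs, theta):
    """Donsker-Varadhan lower bound E_joint[T] - log E_product[e^T] with the
    toy critic T(x, z) = theta * x * z; a one-step rotation of zs stands in
    for samples from the product of the marginals."""
    n = len(xs)
    joint = sum(theta * x * z for x, z in zip(xs, zs)) / n
    zs_shifted = zs[1:] + zs[:1]  # crude independent pairing
    product = sum(math.exp(theta * x * z) for x, z in zip(xs, zs_shifted)) / n
    return joint - math.log(product)

def mi_lower_bound(xs, zs):
    """Maximize the bound over a coarse grid of theta; this grid search
    stands in for the gradient ascent of Algorithm 1."""
    return max(dv_bound(xs, zs, t / 10) for t in range(11))

rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(4000)]
zs_dep = [0.9 * x + 0.44 * rng.gauss(0, 1) for x in xs]  # strongly dependent
zs_ind = [rng.gauss(0, 1) for _ in xs]                   # independent of xs
```

The bound for the dependent pair comes out clearly positive, while for the independent pair it stays near zero, as the true mutual information dictates.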



1 We utilize the PyTorch implementation in https://github.com/nimarb/pytorch_influence_functions.
2 https://github.com/kentaroy47/vision-transformers-cifar10



In the sequel, vectors, matrices, and tuples are denoted by bold-faced letters. We use ∥·∥_p as the ℓ_p norm of vectors or matrices. Let [n] = {1, ..., n}. Let 1[B] be the indicator of the event B.

Figure 1: Illustrations of representation structures achieved by different data selection ways. Different shapes, i.e., the circle and the cross, correspond to different classes. Pentagrams correspond to class centers. The shaded data points are selected for coresets. (a) Selecting data points closer to class centers. (b) Selecting data points far from class centers. (c) Selecting data points with distances close to the distance median.


Figure 2: Illustrations of comparing our proposed method with several data selection baselines on CIFAR-100 (a), Tiny-ImageNet (b), and ImageNet-1k (c). Note that the method Optimization-based is not compared on ImageNet-1k due to its huge time costs of data selection. Exact numerical results can be found in Appendix B for convenient checks.

Figure 3: Illustrations of results of ablation study on different data selection ways. (a) Evaluations on CIFAR-100. (b) Evaluations on Tiny-ImageNet.

Figure 5: Illustrations of comparing our proposed method with several data selection baselines on synthetic CIFAR-100 and Tiny-ImageNet, where corruption is injected. Exact numerical results (mean±standard deviation) can be found in Appendix B for convenient checks.

Figure 6: Examples of noise injected to CIFAR-100 images.


Distance-based score for data selection. With hidden representations {z_1, . . . , z_n} and representational class centers {z̃_1, . . . , z̃_k}, the Euclidean distance from each representation to its class center can be simply calculated as d(s) = ∥z − z̃_j∥_2, where j is the label of the data point s. The set consisting of all distances is {d(s_1), . . . , d(s_n)}, with median M(d(s)). All data points are then sorted in ascending order of distance. Afterwards, the data points with distances closest to the distance median M(d(s)) are selected as the coreset S*, i.e., for a selection ratio r, S* keeps the ⌊rn⌋ data points minimizing |d(s_i) − M(d(s))|.
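The selection rule above can be sketched in a few lines of NumPy. This is our own illustration, not the released code: the function name, the toy 2-D features, and the 50% ratio are assumptions.

```python
import numpy as np

def moderate_coreset(features, labels, ratio):
    """Select the points whose distance to their class center
    is closest to the median distance."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    # Class centers: the mean hidden representation per class.
    centers = {c: features[labels == c].mean(axis=0) for c in np.unique(labels)}
    # Euclidean distance of each point to its own class center.
    dists = np.array([np.linalg.norm(f - centers[c])
                      for f, c in zip(features, labels)])
    median = np.median(dists)
    # Keep the points whose distances are closest to the median.
    keep = int(round(ratio * len(features)))
    order = np.argsort(np.abs(dists - median))
    return np.sort(order[:keep])

# Usage: select a 50% coreset from toy 2-D features of two classes.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])
labs = np.array([0] * 10 + [1] * 10)
idx = moderate_coreset(feats, labs, 0.5)  # indices of the selected points
```

Note that only a single sort of the distance gaps is needed, so the selection cost is O(n log n) on top of one forward pass to extract features.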



Mean and standard deviation of results (%) on CIFAR-100 and Tiny-ImageNet with transferred models. The average rank is calculated with the ranks of a method in 20% and 30% cases of "ResNet-50→SENet" and "ResNet-50→EfficientNet-B0". The best result in each case is in bold.

Mean and standard deviation of experimental results (%) on CIFAR-100 and Tiny-ImageNet with mislabeled data. The average rank is calculated with the ranks of a method in 20% and 30% cases in two synthetic datasets. The best result in each case is in bold.

Mean and standard deviation of experimental results (%) on CIFAR-100 and Tiny-ImageNet with adversarial examples. The average rank is calculated with the ranks of a method in 20% and 30% cases of two types of attacks. The best result in each case is in bold.

We use adversarial attacks and the models trained on CIFAR-100 and Tiny-ImageNet to generate adversarial examples. Afterwards, the different selection methods are applied to the adversarial examples, and models are retrained on the selected data. Experimental results are provided in the table.
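For illustration, a one-step FGSM attack, one common way to generate such adversarial examples, can be sketched as follows. The toy linear model and the budget eps are assumptions for the sketch; the exact attacks used in the experiments may differ.

```python
import torch
import torch.nn as nn

def fgsm(model, x, y, eps=8 / 255):
    """One-step fast gradient sign attack on inputs in [0, 1]."""
    x = x.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    # Perturb each pixel by eps along the sign of the loss gradient.
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0, 1).detach()

# Usage on a toy linear classifier over flattened 32x32 RGB images.
torch.manual_seed(0)
model = nn.Linear(3 * 32 * 32, 100)
x = torch.rand(4, 3 * 32 * 32)
y = torch.randint(0, 100, (4,))
x_adv = fgsm(model, x, y)
```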

Mean and standard deviation of experimental results (%) on different versions of CIFAR-100. The average rank is calculated with the ranks of a method in 20%, 30%, 40%, 60%, and 80% cases. The best result in each case is in bold.

Under this experimental setting, Herding is a strong baseline. Compared with it, our Moderate method is more powerful and achieves the best average rank.

Mean and standard deviation of experimental results (%) on Tiny-ImageNet. The average rank is calculated with the ranks of a method in 20%, 30%, 40%, 60%, and 80% cases. The best result in each case is in bold.

Top-5 test accuracy (%) on ImageNet-1k. The average rank is calculated with the ranks of a method in 60%, 70%, 80%, and 90% cases. The best result in each case is in bold.

Mean and standard deviation of experimental results (%) on Tiny-ImageNet. VGG-16 (V) and ShuffleNet (S) are exploited. The best result in each case is in bold.

Mean and standard deviation of experimental results (%) on simulated CIFAR-100 with transformer ViT-small. The selection ratio is 20% consistently. The best result in each case is in bold.

Mean and standard deviation of experimental results (%) on CIFAR-100 (C) and Tiny-ImageNet (T) with 35% mislabeled data. The best result in each case is in bold.

Illustrations of experimental results on Tiny-ImageNet with 20% mislabeled data.

Mean and standard deviation of experimental results (%) on Tiny-ImageNet with 20% mislabeled data. ResNet-50 and ShuffleNet V2 are exploited. The average rank is calculated with the ranks of a method in 20%, 30%, 40%, 60%, and 80% cases. The best result in each case is in bold.

Mean and standard deviation of experimental results (%) on Tiny-ImageNet with simultaneously corrupted, mislabeled, and adversarial examples. The best result in each case is in bold.


ACKNOWLEDGEMENTS Xiaobo Xia was supported by Australian Research Council Project DE-190101473 and Google PhD Fellowship. Jun Yu was sponsored by Natural Science Foundation of China (62276242), CAAI-Huawei MindSpore Open Fund (CAAIXSJLJJ-2021-016B, CAAIXSJLJJ-2022-001A), Anhui Province Key Research and Development Program (202104a05020007), and USTC-IAT Application Sci. & Tech. Achievement Cultivation Program (JL06521001Y). Xu Shen was (partially) supported by the National Key R&D Program of China under Grant 2020AAA0103902. Bo Han was supported by NSFC Young Scientists Fund No. 62006202, Guangdong Basic and Applied Basic Research Foundation No. 2022A1515011652, and RGC Early Career Scheme No. 22200720. Tongliang Liu was partially supported by Australian Research Council Projects IC-190100031, LP-220100527, DP-220102121, and FT-220100318. The authors give special thanks to Wenhao Yang from Nanjing University for helpful discussions.

