DATASET PRUNING: REDUCING TRAINING DATA BY EXAMINING GENERALIZATION INFLUENCE

Abstract

The great success of deep learning heavily relies on increasingly larger training data, which comes at the price of huge computational and infrastructural costs. This poses crucial questions: do all training data contribute to the model's performance? How much does each individual training sample or sub-training-set affect the model's generalization, and how can we construct the smallest subset of the entire training data as a proxy training set without significantly sacrificing the model's performance? To answer these, we propose dataset pruning, an optimization-based sample selection method that can (1) examine the influence of removing a particular set of training samples on the model's generalization ability with a theoretical guarantee, and (2) construct the smallest subset of training data that yields a strictly constrained generalization gap. The empirically observed generalization gap of dataset pruning is substantially consistent with our theoretical expectations. Furthermore, the proposed method prunes 40% of the training examples on the CIFAR-10 dataset and halves the convergence time with only a 1.3% test accuracy decrease, which is superior to previous score-based sample selection methods.

1. INTRODUCTION

The great advances in deep learning over the past decades have been powered by ever-bigger models crunching ever-bigger amounts of data. However, this success comes at the price of huge computational and infrastructural costs for network training, network inference, hyper-parameter tuning, and model architecture search. While lots of research effort seeks to reduce the network inference cost by pruning redundant parameters Blalock et al. (2020); Liu et al. (2018); Molchanov et al. (2019), scant attention has been paid to the data redundancy problem, which is crucial for network training and parameter tuning efficiency. Investigating data redundancy not only helps to improve training efficiency, but also helps us understand the representation ability of small data and how many training samples are required and sufficient for a learning system. Previous literature on subset selection tries to sort and select a fraction of the training data according to a scalar score computed from some criterion, such as the distance to the class center Welling (2009); Rebuffi et al. (2017); Castro et al. (2018); Belouadah & Popescu (2020), the distance to other selected examples Wolf (2011); Sener & Savarese (2018b), the forgetting score Toneva et al. (2019a), and the gradient norm Paul et al. (2021). However, these methods are (a) heuristic and lack theoretical guarantees on the generalization of the selected data, and (b) ignore the joint effect, e.g., the norm of the averaged gradient vector of two high-gradient-norm samples can be zero if the directions of the two samples' gradients are opposite. To go beyond these limitations, a natural question is: how much does a combination of particular training examples contribute to the model's generalization ability? However, simply evaluating the test performance drop caused by removing each possible subset is intractable, because it requires re-training the model 2^n times for a dataset of size n.
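The joint-effect failure mode of score-based selection described above can be seen with two synthetic gradient vectors; the numbers below are made up purely for illustration and come from no real model.

```python
import numpy as np

# Two synthetic per-sample loss gradients with identical, large norms
# but exactly opposite directions:
g1 = np.array([3.0, -4.0])            # ||g1|| = 5
g2 = np.array([-3.0, 4.0])            # ||g2|| = 5

# A score-based method ranking by per-sample gradient norm rates both as
# highly influential, yet their joint effect on the averaged gradient is nil:
joint = np.linalg.norm((g1 + g2) / 2)
print(np.linalg.norm(g1), np.linalg.norm(g2), joint)   # 5.0 5.0 0.0
```

Removing either sample alone changes the average gradient noticeably; removing both changes it not at all, which is exactly what a per-sample score cannot express.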
Therefore, the key challenges of dataset pruning are: (1) how to efficiently estimate the generalization influence of all possible subsets without iteratively re-training the model, and (2) how to identify the smallest subset of the original training data with a strictly constrained generalization gap. In this work, we present an optimization-based dataset pruning method that can identify the largest redundant subset of the entire training dataset with (a) a theoretically guaranteed generalization gap and (b) consideration of the joint influence of all collected data. Specifically, we define the parameter influence of a training example as the change in the model's parameters caused by omitting the example from training. The parameter influence can be linearly approximated without re-training the model via the Influence Function Koh & Liang (2017a). We then formulate subset selection as a constrained discrete optimization problem with the objective of maximizing the number of collected samples, under the constraint that the network parameter change caused by removing the collected subset stays within a given threshold ϵ. We further conduct extensive theoretical and empirical analysis to show that the generalization gap of the proposed dataset pruning is upper-bounded by the pre-defined threshold ϵ. Superior to previous score-based sample selection methods, our method prunes 40% of the training examples of CIFAR-10 Krizhevsky (2009), halves the convergence time, and incurs only a 1.3% test accuracy drop. Before delving into details, we summarize our contributions below:

• This paper proposes the dataset pruning problem, which extends the sample selection problem from a cardinality constraint to a generalization constraint. Previous sample selection methods find a data subset within a fixed size budget and try to maximize its performance, while dataset pruning identifies the smallest subset that satisfies an expected generalization ability.

• This paper proposes to leverage the influence function to approximate the network parameter change caused by omitting each individual training example, and an optimization-based method to identify the largest subset that satisfies the expected parameter change.

• We prove that the generalization gap caused by removing the identified subset can be upper-bounded by the network parameter change, which is strictly constrained during the dataset pruning procedure. The observed empirical results are substantially consistent with our theoretical expectation.

• Experimental results on dataset pruning and neural architecture search (NAS) demonstrate that the proposed dataset pruning method is extremely effective at improving network training and architecture search efficiency.

The rest of this paper is organized as follows. In Section 2, we briefly review existing sample selection and dataset condensation research and discuss the major differences between our proposed dataset pruning and previous methods. In Section 3, we present the formal definition of the dataset pruning problem. In Sections 4 and 5, our optimization-based dataset pruning method and its generalization bound are introduced. In Section 6, we conduct extensive experiments to verify the validity of the theoretical results and the effectiveness of dataset pruning. Finally, in Section 7, we conclude the paper.

A related line of work is dataset condensation Wang et al. (2018); Zhao et al. (2021); Zhao & Bilen (2021a;b); Cazenavette et al. (2022); Jin et al. (2021; 2022). This series of works focuses on synthesizing a small but informative dataset as an alternative to the original large dataset. For example, Dataset Distillation Wang et al. (2018) learns the synthetic dataset by directly minimizing, on the real training data, the classification loss of neural networks trained on the synthetic data. Cazenavette et al. (2022) proposed to learn the synthetic images by matching training trajectories, and Zhao et al. (2021); Zhao & Bilen (2021a;b) proposed to learn them by matching gradients and features. However, due to computational limitations, these methods usually synthesize only an extremely small number of examples (e.g., 50 images per class), and the performance is far from satisfactory. Therefore, the performances of dataset distillation and dataset pruning are not directly comparable.

2. RELATED WORKS

E_{z∼P(D)} L(z, θ̂) ≃ E_{z∼P(D)} L(z, θ̂_{-D̂}),  (1)

where P(D) is the data distribution, L is the loss function, and θ̂ and θ̂_{-D̂} are the empirical risk minimizers on the training set D before and after pruning D̂, respectively, i.e., θ̂ = arg min_{θ∈Θ} (1/n) Σ_{z_i∈D} L(z_i, θ) and θ̂_{-D̂} = arg min_{θ∈Θ} (1/(n-m)) Σ_{z_i∈D\D̂} L(z_i, θ). Considering that a neural network is a locally smooth function Rifai et al. (2012); Goodfellow et al. (2014); Zhao et al. (2021), similar weights (θ ≈ θ̂) imply similar mappings in a local neighborhood and thus similar generalization performance. Therefore, we can achieve Eq. 1 by obtaining a θ̂_{-D̂} that is very close to θ̂, i.e., with ∥θ̂_{-D̂} - θ̂∥_2 smaller than a given small value ϵ. To this end, we first define the dataset pruning problem from the perspective of model parameter change; we will later provide theoretical evidence in Section 5 that the generalization gap in Eq. 1 can be upper-bounded by the parameter change.

Definition 1 (ϵ-redundant subset). Given a dataset D = {z_1, . . . , z_n} containing n training points with z_i = (x_i, y_i) ∈ X × Y, and a subset D̂ = {ẑ_1, . . . , ẑ_m}, D̂ ⊂ D, we say D̂ is an ϵ-redundant subset of D if ∥θ̂_{-D̂} - θ̂∥_2 ≤ ϵ.

The research on Influence Functions Cook (1977); Cook & Weisberg (1980; 1982); Cook (1986); Koh & Liang (2017a) provides an accurate and fast estimate of the parameter change caused by weighting an example z for training. If a training example z is weighted by a small δ during training, the empirical risk minimizer becomes θ̂_{δ,z} = arg min_{θ∈Θ} (1/n) Σ_{z_i∈D} L(z_i, θ) + δ L(z, θ). Assigning -1/n to δ is equivalent to removing the training example z. The influence of weighting z on the parameters is then given by

I_param(z) = dθ̂_{δ,z}/dδ |_{δ=0} = -H_{θ̂}^{-1} ∇_θ L(z, θ̂),  (2)

where H_{θ̂} = (1/n) Σ_{z_i∈D} ∇^2_θ L(z_i, θ̂) is the Hessian, positive definite by assumption, and I_param(z) ∈ R^N, with N the number of network parameters.
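Inverting H_θ̂ explicitly is infeasible when N is large, so inverse-Hessian-vector products like H^{-1}∇L are typically computed approximately, e.g. with the stochastic second-order method of Agarwal et al. (2016) used later in Section 6.1. A minimal sketch of the underlying idea, a truncated Neumann-series recursion v_k = g + (I - H/scale) v_{k-1}; the 5×5 "Hessian" is synthetic, and the exact solve appears only as a check:

```python
import numpy as np

def inverse_hvp_neumann(H, g, steps, scale):
    # v_k = g + (I - H/scale) v_{k-1} converges to (H/scale)^{-1} g,
    # i.e. scale * H^{-1} g, provided the spectral radius of I - H/scale < 1.
    A = H / scale
    v = g.copy()
    for _ in range(steps):
        v = g + v - A @ v
    return v / scale                      # undo the scaling: H^{-1} g

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
H = M @ M.T + np.eye(5)                   # small synthetic SPD "Hessian"
g = rng.normal(size=5)

scale = 1.1 * np.linalg.norm(H, 2)        # a bit above the largest eigenvalue
approx = inverse_hvp_neumann(H, g, steps=2000, scale=scale)
exact = np.linalg.solve(H, g)
print(np.max(np.abs(approx - exact)))     # agrees to high precision
```

The stochastic variant replaces the exact matrix-vector product A @ v with a mini-batch Hessian-vector product, which is what makes the trick practical for networks.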
The proof of Eq. 2 can be found in Koh & Liang (2017a). We can then linearly approximate the parameter change caused by removing z, without re-training the model, as θ̂_{-z} - θ̂ ≈ -(1/n) I_param(z) = (1/n) H_{θ̂}^{-1} ∇_θ L(z, θ̂). Similarly, the parameter change caused by removing a subset D̂ can be approximated by accumulating the change of removing each example: θ̂_{-D̂} - θ̂ ≈ Σ_{z_i∈D̂} -(1/n) I_param(z_i) = Σ_{z_i∈D̂} (1/n) H_{θ̂}^{-1} ∇_θ L(z_i, θ̂). As proved in Koh et al. (2019), the accumulated individual influences accurately reflect the influence of removing a group of data.
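The linear approximation θ̂_{-z} - θ̂ ≈ (1/n) H^{-1} ∇_θ L(z, θ̂) can be sanity-checked on ridge regression, where both the empirical risk minimizer and the leave-one-out minimizer have closed forms; the data below are synthetic, and the per-sample loss includes the regularizer so the Hessian matches the formulation above:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 200, 5, 0.1
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def ridge_fit(X, y):
    # minimizer of (1/len(y)) * sum_i [0.5*(x_i.th - y_i)^2 + 0.5*lam*||th||^2]
    return np.linalg.solve(X.T @ X / len(y) + lam * np.eye(d), X.T @ y / len(y))

theta = ridge_fit(X, y)
H = X.T @ X / n + lam * np.eye(d)            # Hessian of the empirical risk

j = 0                                        # the example to remove
grad_j = X[j] * (X[j] @ theta - y[j]) + lam * theta
approx = np.linalg.solve(H, grad_j) / n      # (1/n) H^{-1} grad L(z_j, theta)

mask = np.arange(n) != j
exact = ridge_fit(X[mask], y[mask]) - theta  # true leave-one-out parameter change
print(np.linalg.norm(approx - exact), np.linalg.norm(exact))
```

With n = 200 the approximation error is a small fraction of the actual change, consistent with the O(1/n^2) error of the influence expansion.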

4.2. DATASET PRUNING AS DISCRETE OPTIMIZATION

Combining Definition 1 and the parameter influence function (Eq. 2), it is easy to derive that if the parameter influence of a subset D̂ satisfies ∥Σ_{z_i∈D̂} -(1/n) I_param(z_i)∥_2 ≤ ϵ, then D̂ is an ϵ-redundant subset of D. Denoting S_i = -(1/n) I_param(z_i) = (1/n) H_{θ̂}^{-1} ∇_θ L(z_i, θ̂) and S = (S_1, . . . , S_n), we formulate:

(a) generalization-guaranteed pruning:

maximize_W Σ_{i=1}^n W_i   subject to ∥W^⊤ S∥_2 ≤ ϵ, W ∈ {0, 1}^n,  (3)

(b) cardinality-guaranteed pruning:

minimize_W ∥W^⊤ S∥_2   subject to Σ_{i=1}^n W_i = m, W ∈ {0, 1}^n,  (4)

where W is the discrete variable to be optimized. For each dimension i, W_i = 1 indicates that the training sample z_i is selected into D̂max_ϵ (i.e., pruned), while W_i = 0 indicates that z_i is kept. After solving for W in Eq. 3, the largest ϵ-redundant subset can be constructed as D̂max_ϵ = {z_i | ∀z_i ∈ D, W_i = 1}. For scenarios where we need to specify the size of the removed subset and minimize its influence on the parameter change, we provide the cardinality-guaranteed pruning in Eq. 4.
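A sketch of how the cardinality-guaranteed problem (Eq. 4) might be attacked with simulated annealing, the solver the paper adopts in Section 6.1; the influence vectors S, the subset size m, and the cooling schedule below are all illustrative stand-ins, not the paper's exact configuration:

```python
import numpy as np

def prune_cardinality(S, m, iters=20000, t0=1.0, seed=0):
    """Approximately solve Eq. 4: min ||W^T S||_2 s.t. sum(W) = m, W in {0,1}^n,
    via simulated annealing over swaps that keep the cardinality fixed."""
    rng = np.random.default_rng(seed)
    n = len(S)
    W = np.zeros(n, dtype=bool)
    W[rng.permutation(n)[:m]] = True          # random feasible start
    cost = np.linalg.norm(S[W].sum(axis=0))
    for k in range(iters):
        t = t0 * (1 - k / iters) + 1e-9       # linear cooling schedule
        i = rng.choice(np.flatnonzero(W))     # swap one selected sample out...
        j = rng.choice(np.flatnonzero(~W))    # ...and one unselected sample in
        W[i], W[j] = False, True
        new = np.linalg.norm(S[W].sum(axis=0))
        if new < cost or rng.random() < np.exp((cost - new) / t):
            cost = new                        # accept (always, if the move improves)
        else:
            W[i], W[j] = True, False          # reject: revert the swap
    return W, cost

rng = np.random.default_rng(1)
S = rng.normal(size=(100, 8)) / 100           # stand-ins for -(1/n) I_param(z_i)
W, cost = prune_cardinality(S, m=40)
print(int(W.sum()), cost)
```

Because individual influence vectors can cancel each other, the optimized subset achieves a far smaller joint influence norm than an arbitrary subset of the same size, which is precisely the group effect that per-sample scores miss.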

5. GENERALIZATION ANALYSIS

In this section, we theoretically formulate the generalization guarantee of the minimizer given by the pruned dataset. Our theoretical analysis suggests that the proposed dataset pruning method has a relatively tight upper bound on the expected test loss, given a small enough ϵ.

For simplicity, we first assume that we prune only one training sample z (namely, δ is one-dimensional). Following classical results, we may write the influence of z on the test loss as

I_loss(z_test, z) = dL(z_test, θ̂_{δ,z})/dδ |_{δ=0} = ∇_θ L(z_test, θ̂)^⊤ dθ̂_{δ,z}/dδ |_{δ=0} = ∇_θ L(z_test, θ̂)^⊤ I_param(z),  (5)

which is the first-order derivative of the test loss L(z_test, θ̂_{δ,z}) with respect to δ. The analysis easily generalizes to pruning m training samples by letting δ be the m-dimensional vector (-1/n, . . . , -1/n) over ẑ ∈ D̂; the multi-dimensional form of the influence function of the test loss is given in Eq. 6, and the resulting bound, Theorem 1, states sup |L(θ̂_{-D̂}) - L(θ̂)| = O(ϵ/n + m/n^2).

Proof. We express the test loss at θ̂_{-D̂} using the first-order Taylor approximation

L(z_test, θ̂_{-D̂}) = L(z_test, θ̂) + I_loss(z_test, D̂) δ + O(∥δ∥^2),  (8)

where the last term O(∥δ∥^2) is usually ignorable in practice, because ∥δ∥^2 = m/n^2 is very small for popular benchmark datasets; the same approximation is used in related papers applying influence functions to generalization analysis Singh et al. (2021). According to Eq. 8, we have

L(θ̂_{-D̂}) - L(θ̂) ≈ E_{z_test∼P(D)}[I_loss(z_test, D̂)] δ  (9)
= Σ_{ẑ_i∈D̂} -(1/n) E_{z_test∼P(D)}[I_loss(z_test, ẑ_i)]  (10)
= -(1/n) E_{z_test∼P(D)}[-∇_θ L(z_test, θ̂)]^⊤ ( Σ_{ẑ_i∈D̂} H_{θ̂}^{-1} ∇_θ L(ẑ_i, θ̂) )  (11)
≤ (1/n) ∥∇_θ L(θ̂)∥_2 ∥Σ_{ẑ_i∈D̂} I_param(ẑ_i)∥_2  (12)
≤ (ϵ/n) ∥∇_θ L(θ̂)∥_2,

where the first inequality is the Cauchy-Schwarz inequality, and the second inequality follows from the algorithmic guarantee ∥Σ_{ẑ_i∈D̂} I_param(ẑ_i)∥_2 ≤ ϵ of Eq. 3, with ϵ a hyperparameter. Accounting for the second-order Taylor term, we finally obtain the upper bound of the expected loss gap as

sup |L(θ̂_{-D̂}) - L(θ̂)| = O(ϵ/n + m/n^2).  (13)

The proof is complete.

Theorem 1 demonstrates that, to effectively decrease the upper bound of the generalization gap caused by dataset pruning, we should focus on constraining, or even directly minimizing, ϵ. The proposed optimization-based dataset pruning method does exactly this via the discrete optimization in Eq. 3 and Eq. 4. Moreover, our empirical results in the following section verify the estimated generalization gap: for the proposed method it is on the order of 10^{-3} on CIFAR-10 and CIFAR-100, more than one order of magnitude smaller than the estimated generalization gap of random dataset pruning.

6. EXPERIMENT

In this section, we conduct experiments to verify the validity of the theoretical results and the effectiveness of the proposed dataset pruning method. In Section 6.1, we introduce the experimental details. In Section 6.2, we empirically verify the validity of Theorem 1. In Sections 6.3 and 6.4, we compare our method with several baseline methods on dataset pruning performance and cross-architecture generalization. Finally, in Section 6.5, we show that the proposed dataset pruning method is extremely effective at improving training efficiency.

Figure 1: We compare our proposed optimization-based dataset pruning method with several sample-selection baselines. Our optimization-based pruning method considers the 'group effect' of the pruned examples and exhibits superior performance, especially when the pruning ratio is high.

6.1. EXPERIMENT SETUP AND IMPLEMENTATION DETAILS

We evaluate dataset pruning methods on the CIFAR10, CIFAR100 Krizhevsky (2009), and TinyImageNet Le & Yang (2015) datasets. All hyper-parameters and experimental settings of training before and after dataset pruning were controlled to be the same. Specifically, in all experiments, we train the model for 200 epochs with a batch size of 128, a learning rate of 0.01 with a cosine annealing decay schedule, an SGD optimizer with momentum 0.9 and weight decay 5e-4, and data augmentation of random crop and random horizontal flip. In Eq. 2, we compute the Hessian inverse using the second-order optimization trick of Agarwal et al. (2016), which significantly reduces the Hessian-inverse estimation time. To further improve efficiency, we only calculate the parameter influence in Eq. 2 for the last linear layer. In Eq. 3 and Eq. 4, we solve the discrete optimization problem using simulated annealing Van Laarhoven & Aarts (1987), a popular heuristic algorithm for solving complex discrete optimization problems within a given time budget.

6.2. THEORETICAL ANALYSIS VERIFICATION

Our proposed optimization-based dataset pruning method collects the smallest subset by constraining the parameter influence ϵ. Theorem 1 shows that the generalization gap of dataset pruning is upper-bounded by O(ϵ/n + m/n^2), where the m/n^2 term can be ignored since its magnitude is usually much smaller than that of ϵ/n. To verify the validity of the generalization guarantee, we compare the empirically observed test loss gap before and after dataset pruning, L(θ̂_{-D̂}) - L(θ̂), with our theoretical expectation ϵ/n in Fig. 2. The actual generalization gap is highly consistent with our theoretical prediction. We also observe a strong correlation between the pruned-dataset generalization and ϵ; therefore, Eq. 3 effectively guarantees the generalization of dataset pruning by constraining ϵ.
Compared with random pruning, our proposed optimization-based dataset pruning exhibits much smaller ϵ and better generalization.

6.3. DATASET PRUNING

In the previous section, we motivated the optimization-based dataset pruning method by constraining, or directly minimizing, the network parameter influence of the selected data subset. The theoretical result and the empirical evidence show that constraining the parameter influence effectively bounds the generalization gap of the pruned dataset. In this section, we evaluate the proposed optimization-based dataset pruning method empirically. We show that the test accuracy of a network trained on the pruned dataset is comparable to that of the network trained on the whole dataset, and is competitive with other baseline methods.

Figure 3: To evaluate the unseen-architecture generalization of the pruned dataset, we prune the CIFAR10 dataset using a relatively small network and then train a larger network on the pruned dataset. We consider three networks from two families with different parameter complexity: SENet (1.25M parameters), ResNet18 (11.69M parameters), and ResNet50 (25.56M parameters). The results indicate that a dataset pruned by a small network can generalize well to large networks.

We prune CIFAR10, CIFAR100 Krizhevsky (2009), and TinyImageNet Le & Yang (2015) using a randomly initialized ResNet50 network He et al. (2016). We compare our proposed method with six baselines (random, herding, forgetting, GraNd, EL2N, and influence-score). To make our method comparable to the cardinality-based pruning baselines, we prune datasets using the cardinality-guaranteed dataset pruning of Eq. 4. After pruning and selecting a training subset, we obtain the test accuracy by retraining a newly randomly initialized ResNet50 on the pruned dataset only. The retraining settings of all methods are controlled to be the same. The experimental results are shown in Fig. 1, where our method consistently surpasses all baseline methods.
Among all baseline methods, forgetting and EL2N achieve performance very close to ours when the pruning ratio is small, but the performance gap widens as the pruning ratio increases. The influence-score, a direct baseline of our method, also performs poorly compared with our optimization-based method. This is because all these baselines are score-based: they prune training examples by removing the lowest-score examples without considering the influence interactions between high-score and low-score examples. The joint influence of a combination of high-score examples may be minor, and vice versa. Our method overcomes this issue through a designed optimization process that considers the influence of the subset as a whole rather than of each individual example. This consideration of the group effect makes our method outperform the baseline methods, especially when the pruning ratio is high.

6.4. UNSEEN ARCHITECTURE GENERALIZATION

We conduct experiments to verify that the pruned dataset generalizes well to unknown or larger network architectures that are inaccessible during dataset pruning. To this end, we use a smaller network to prune the dataset and further use the pruned dataset to train larger network architectures. As shown in Fig. 3, we evaluate the CIFAR10 dataset pruned by SENet Hu et al. (2018) on two unknown and larger architectures, ResNet18 He et al. (2016) and ResNet50 He et al. (2016). SENet contains 1.25M parameters, ResNet18 contains 11.69M parameters, and ResNet50 contains 25.56M parameters. The experimental results show that the pruned dataset generalizes well to network architectures that are unknown during dataset pruning, which indicates that it can be used in a wide range of applications regardless of the specific network architecture. The results also indicate that we can prune the dataset using a small network and the pruned dataset generalizes well to larger network architectures. These two conclusions make dataset pruning a suitable technique for reducing the time consumption of neural architecture search (NAS).

6.5. DATASET PRUNING IMPROVES THE TRAINING EFFICIENCY

The pruned dataset can significantly improve the training efficiency while maintaining the performance, as shown in Fig. 4. The proposed dataset pruning is therefore beneficial when one needs to run many training trials on the same dataset. One such application is neural architecture search (NAS) Zoph et al. (2018), which aims at searching for the network architecture that achieves the best performance on a specific dataset. A potentially powerful way to accelerate NAS is to search architectures on the smaller pruned dataset, provided the pruned dataset has the same ability as the original dataset to identify the best network. We construct a search space of 720 ConvNets with different depths, widths, pooling, activation, and normalization layers. We train all 720 models on the whole CIFAR10 training set and on four smaller proxy datasets constructed by random selection, herding Welling (2009), forgetting Toneva et al. (2019b), and our proposed dataset pruning method. All four proxy datasets contain 100 images per class (1,000 images in total, i.e., 2% of the storage cost of the whole dataset), and we train all models for 100 epochs. Table 1 reports (a) the average test performance of the best selected architectures trained on the whole dataset, (b) the Spearman's rank correlation coefficient between the validation accuracies obtained by training the selected top-10 models on the proxy dataset and on the whole dataset, (c) the time to train 720 architectures on a Tesla V100 GPU, and (d) the memory cost. Table 1 shows that searching 720 architectures on the whole dataset incurs a huge time cost. Randomly selecting 2% of the dataset for the architecture search decreases the search time from 3029 minutes to 113 minutes, but the searched architecture achieves much lower performance when trained on the whole dataset, far from the best architecture.
Compared with the baselines, our proposed dataset pruning method achieves the best performance (the closest to that of the whole dataset) while significantly reducing the search time, to about 3% of the time on the whole dataset.
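Spearman's rank correlation, used in Table 1 to measure how consistently a proxy set ranks architectures, is simple to compute as the Pearson correlation of ranks; the two accuracy lists below are hypothetical illustrations, not values from Table 1:

```python
import numpy as np

def spearman(a, b):
    """Spearman's rho as the Pearson correlation of ranks (assumes no ties,
    which exact accuracy values rarely have)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float(ra @ rb / np.sqrt((ra @ ra) * (rb @ rb)))

# Hypothetical top-10 architecture accuracies: full dataset vs. a proxy set
full  = [94.1, 93.8, 93.5, 93.2, 92.9, 92.7, 92.1, 91.8, 91.0, 90.5]
proxy = [89.0, 88.7, 88.9, 88.1, 87.8, 88.0, 87.1, 86.9, 86.0, 85.2]
print(round(spearman(full, proxy), 3))   # high, though not a perfect 1.0
```

A correlation near 1 means the proxy dataset ranks architectures almost exactly as the whole dataset does, which is the property that makes proxy-based NAS trustworthy.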



Our method is inspired by the Influence Function Koh & Liang (2017a) in statistical machine learning. If a training example can be removed from the training dataset without damaging generalization, the example has a small influence on the expected test loss. Earlier works focused on studying the influence of removing training points from linear models, and later works extended this to more general models Hampel et al. (2011); Cook (1986); Thomas & Cook (1990); Chatterjee & Hadi (1986); Wei et al. (1998). Liu et al. (2014) used influence functions to study model robustness and to do fast cross-validation in kernel methods. Kabra et al. (2015) defined a different notion of influence that is specialized to finite hypothesis classes. Koh & Liang (2017a) studied the influence of weighting an example on the model's parameters. Borsos et al. (2020) proposed to construct a coreset by learning a per-sample weight score; the proposed selection rule eventually converges to the formulation of the influence function Koh & Liang (2017b). But different from our work, Borsos et al. (2020) did not directly leverage the influence function to select examples and cannot guarantee the generalization of the selected examples.

3. PROBLEM DEFINITION

Given a large-scale dataset D = {z_1, . . . , z_n} containing n training points, where z_i = (x_i, y_i) ∈ X × Y, X is the input space and Y is the label space, the goal of dataset pruning is to identify as many redundant training samples of D as possible and remove them to reduce the training cost. The identified redundant subset, D̂ = {ẑ_1, . . . , ẑ_m} with D̂ ⊂ D, should have a minimal impact on the learned model, i.e., the test performances of the models learned on the training sets before and after pruning should be very close:

E_{z∼P(D)} L(z, θ̂) ≃ E_{z∼P(D)} L(z, θ̂_{-D̂}),  (1)

where θ̂_{-D̂} = arg min_{θ∈Θ} (1/(n-m)) Σ_{z_i∈D\D̂} L(z_i, θ) and θ̂ = arg min_{θ∈Θ} (1/n) Σ_{z_i∈D} L(z_i, θ). When D̂ is an ϵ-redundant subset of D (Definition 1), we write it as D̂_ϵ.

Dataset Pruning: Given a dataset D, dataset pruning aims at finding its largest ϵ-redundant subset D̂max_ϵ, i.e., ∀ D̂_ϵ ⊂ D, |D̂max_ϵ| ≥ |D̂_ϵ|, so that the pruned dataset can be constructed as D̃ = D \ D̂max_ϵ.

4. METHOD

To achieve the goal of dataset pruning, we need to evaluate the model's parameter change ∥θ̂_{-D̂} - θ̂∥_2 caused by removing each possible subset of D. However, it is impossible to re-train the model 2^n times to obtain D̂max_ϵ for a given dataset D of size n. In this section, we propose to efficiently approximate D̂max_ϵ without re-training the model.

4.1. PARAMETER INFLUENCE ESTIMATION

We start by studying the model parameter change caused by removing a single training sample z from the training set D. The change can be formally written as θ̂_{-z} - θ̂, where θ̂_{-z} = arg min_{θ∈Θ} (1/(n-1)) Σ_{z_i∈D, z_i≠z} L(z_i, θ). Estimating the parameter change for each training example by re-training the model n times is also unacceptably time-consuming, because n is usually on the order of tens or even hundreds of thousands.

To guarantee the generalization of the pruned dataset and maximize |D̂_ϵ| simultaneously, we formulate the generalization-guaranteed dataset pruning as the discrete optimization problem below (a):

Generalization-guaranteed dataset pruning.
Require: Dataset D = {z_1, . . . , z_n};
Require: Randomly initialized network θ;
Require: Expected generalization drop ϵ;
1: θ̂ = arg min_{θ∈Θ} (1/n) Σ_{z_i∈D} L(z_i, θ); // compute the ERM on D
2: Initialize S = ∅;
3: for i = 1, 2, . . . , n do
4:   S_i = -(1/n) I_param(z_i) = (1/n) H_{θ̂}^{-1} ∇_θ L(z_i, θ̂); // store the parameter influence of each example
5: end for
6: Initialize W ∈ {0, 1}^n;
7: Solve maximize_W Σ_{i=1}^n W_i subject to ∥W^⊤ S∥_2 ≤ ϵ, W ∈ {0, 1}^n; // guarantee the generalization drop
8: Construct the largest ϵ-redundant subset D̂max_ϵ = {z_i | ∀z_i ∈ D, W_i = 1};
9: return the pruned dataset D̃ = D \ D̂max_ϵ;
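The procedure above can be sketched end-to-end on a small logistic-regression "last layer", with a greedy pass standing in for the simulated-annealing solver of line 7; the data, the value of ϵ, and the greedy heuristic are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lam, eps = 400, 5, 1e-2, 0.05
X = rng.normal(size=(n, d))
y = (X @ rng.normal(size=d) + 0.5 * rng.normal(size=n) > 0).astype(float)

# Line 1: ERM via gradient descent on the L2-regularized logistic loss
theta = np.zeros(d)
for _ in range(3000):
    p = 1 / (1 + np.exp(-X @ theta))
    theta -= 0.5 * (X.T @ (p - y) / n + lam * theta)

# Lines 3-5: S_i = -(1/n) I_param(z_i) = (1/n) H^{-1} grad L(z_i, theta)
p = 1 / (1 + np.exp(-X @ theta))
H = (X * (p * (1 - p))[:, None]).T @ X / n + lam * np.eye(d)
grads = X * (p - y)[:, None] + lam * theta    # per-sample gradients
S = np.linalg.solve(H, grads.T).T / n

# Lines 6-8: greedy stand-in for the annealing solver: keep adding samples
# while the accumulated influence norm stays within eps, easiest samples first
acc = np.zeros(d)
selected = []
for i in np.argsort(np.linalg.norm(S, axis=1)):
    if np.linalg.norm(acc + S[i]) <= eps:
        acc += S[i]
        selected.append(i)

# Line 9: the pruned training set drops the selected (redundant) examples
keep = np.setdiff1d(np.arange(n), selected)
print(len(selected), np.linalg.norm(acc))
```

The greedy pass already removes a sizable redundant subset while keeping the joint influence norm under ϵ; the annealing solver simply searches the same feasible set more thoroughly.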

I_loss(z_test, D̂) = ∇_θ L(z_test, θ̂)^⊤ dθ̂_{δ,ẑ}/dδ |_{δ=(0,...,0)} = (I_loss(z_test, ẑ_1), . . . , I_loss(z_test, ẑ_m)),  (6)

where dθ̂_{δ,ẑ}/dδ is an N × m matrix and N indicates the number of parameters. We define the expected test loss over the data distribution P(D) as L(θ) = E_{z_test∼P(D)} L(z_test, θ) and define the generalization gap due to dataset pruning as |L(θ̂_{-D̂}) - L(θ̂)|. By using the influence function of the test loss Singh et al. (2021), we obtain Theorem 1, which formulates the upper bound of the generalization gap.

Theorem 1 (Generalization Gap of Dataset Pruning). Suppose the original dataset is D and the pruned subset is D̂ = {ẑ_i}_{i=1}^m. If ∥Σ_{ẑ_i∈D̂} I_param(ẑ_i)∥_2 ≤ ϵ, then the generalization gap is bounded as

sup |L(θ̂_{-D̂}) - L(θ̂)| = O(ϵ/n + m/n^2).  (7)

Figure 2: The comparison of the empirically observed generalization gap and our theoretical expectation in Theorem 1. We ignore the m/n^2 term since its magnitude is much smaller than that of ϵ/n.

(a) Random pruning, which selects an expected number of training examples at random.
(b) Herding Welling (2009), which selects an expected number of training examples that are closest to the cluster center of each class.
(c) Forgetting Toneva et al. (2019b), which selects training examples that are easily forgotten during training.
(d) GraNd Paul et al. (2021), which selects training examples with larger loss-gradient norms.
(e) EL2N Paul et al. (2021), which selects training examples with larger norms of the error vector, i.e., the predicted class probabilities minus the one-hot label encoding.
(f) Influence-score Koh & Liang (2017a), which selects training examples with high norms of the influence function in Eq. 2, without considering the 'group effect'.
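The EL2N score of baseline (e) is just the L2 norm of the prediction-error vector; a toy sketch with hypothetical 3-class logits:

```python
import numpy as np

def el2n_scores(logits, labels, num_classes):
    """EL2N score: ||softmax(logits) - onehot(label)||_2 per example."""
    z = logits - logits.max(axis=1, keepdims=True)     # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return np.linalg.norm(p - np.eye(num_classes)[labels], axis=1)

# Hypothetical logits for two examples of class 0: one easy, one hard
logits = np.array([[6.0, 0.0, 0.0],     # confidently correct -> small error norm
                   [1.0, 1.1, 0.9]])    # near-uniform probabilities -> large norm
labels = np.array([0, 0])
scores = el2n_scores(logits, labels, 3)
print(scores)   # the hard example receives the larger score
```

EL2N ranks each example in isolation, which is exactly why two high-score examples whose errors cancel jointly are still both pruned or both kept by this baseline.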

Figure 4: Dataset pruning significantly improves the training efficiency with a minor performance sacrifice. When pruning 40% of the training examples, the convergence time is nearly halved with only a 1.3% test accuracy drop. The pruned dataset can be used to tune hyper-parameters and network architectures to reduce the search time.

CIFAR10 and CIFAR100 contain 50,000 training examples from 10 and 100 categories, respectively. TinyImageNet is a larger dataset containing 100,000 training examples from 200 categories. We verify the cross-architecture generalization of the pruned dataset on three popular deep neural networks with different complexity: SENet Hu et al. (2018) (1.25M parameters), ResNet18 He et al. (2016) (11.69M parameters), and ResNet50 He et al. (2016) (25.56M parameters).

Table 1: Neural architecture search on proxy sets and the whole dataset. The search space contains 720 ConvNets. We experiment on CIFAR10 with 100-images-per-class proxy datasets selected by random, herding, forgetting, and our proposed optimization-based dataset pruning. The network architecture selected by our pruned dataset achieves performance very close to the upper bound.

7. CONCLUSION

This paper proposes the problem of dataset pruning, which aims at removing redundant training examples with minimal impact on the model's performance. By theoretically examining the influence of removing a particular subset of training examples on the network's parameters, this paper proposes to model the sample selection procedure as a constrained discrete optimization problem. During the sample selection, we constrain the network parameter change while maximizing the number of collected samples. The collected training examples can then be removed from the training set. Extensive theoretical and empirical studies demonstrate that the proposed optimization-based dataset pruning method is extremely effective at improving training efficiency while maintaining the model's generalization ability.

REFERENCES

Bo Zhao, Konda Reddy Mopuri, and Hakan Bilen. Dataset condensation with gradient matching. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=mSAKhLYLSsl.

Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697-8710, 2018.

