DYNAMS: DYNAMIC MARGIN SELECTION FOR EFFICIENT DEEP LEARNING

Abstract

The great success of deep learning is largely driven by training over-parameterized models on massive datasets. To avoid excessive computation, extracting and training only on the most informative subset is drawing increasing attention. Nevertheless, it is still an open question how to select such a subset on which the model trained generalizes on par with the full data. In this paper, we propose dynamic margin selection (DynaMS). DynaMS leverages the distance from candidate samples to the classification boundary to construct the subset, and the subset is dynamically updated during model training. We show that DynaMS converges with large probability, and for the first time show both in theory and practice that dynamically updating the subset can result in better generalization. To reduce the additional computation incurred by the selection, a light parameter sharing proxy (PSP) is designed. PSP is able to faithfully evaluate instances following the underlying model, which is necessary for dynamic selection. Extensive analysis and experiments demonstrate the superiority of the proposed approach in data selection against many state-of-the-art counterparts on benchmark datasets.

1. INTRODUCTION

Deep learning has achieved great success owing in part to the availability of huge amounts of data. Learning with such massive data, however, requires clusters of GPUs, special accelerators, and excessive training time. Recent works suggest that eliminating non-essential data presents promising opportunities for efficiency. It has been found that a small portion of training samples contributes a majority of the loss (Katharopoulos & Fleuret, 2018; Jiang et al., 2019), so redundant samples can be left out without sacrificing much performance. Besides, the power-law nature (Hestness et al., 2017; Kaplan et al., 2020) of model performance with respect to data volume indicates that the loss incurred by data selection can be tiny when the dataset is sufficiently large. In this sense, selecting only the most informative samples can result in a better trade-off between efficiency and accuracy.

The first and foremost question for data selection concerns the selection strategy, that is, how to efficiently pick the training instances that benefit model training most. Various principles have been proposed, including picking samples that incur a larger loss or gradient norm (Paul et al., 2021; Coleman et al., 2020), selecting those most likely to be forgotten during training, and utilizing subsets that best approximate the full loss (Feldman, 2020) or gradient (Mirzasoleiman et al., 2020; Killamsetty et al., 2021). Aside from selection strategies, existing approaches vary in their training schemes, which can be divided roughly into two categories: static and dynamic (or adaptive). Static methods (Paul et al., 2021; Coleman et al., 2020; Toneva et al., 2019) decouple subset selection from model training: the subset is constructed ahead of time and the model is trained on this fixed subset. Dynamic methods (Mindermann et al., 2022; Killamsetty et al., 2021), in contrast, update the subset in conjunction with the training process.
Although these methods effectively eliminate large numbers of samples, it is still not well understood how the different training schemes influence the final model.

In this paper, we propose dynamic margin selection (DynaMS). For the selection strategy, we use the classification margin, namely, the distance to the decision boundary. Intuitively, samples close to the decision boundary have more influence and are thus selected. The classification margin explicitly exploits the observation that the decision boundary is mainly determined by a subset of the data. For the training scheme, we show that the subset that benefits training most varies as the model evolves during training; a static selection paradigm may therefore be sub-optimal, and dynamic selection is a better choice. By synergistically integrating classification margin selection and dynamic training, DynaMS is able to converge to the optimal solution with large probability. Moreover, DynaMS admits a theoretical generalization analysis. Through the lens of this analysis, we show that by capturing the training dynamics and progressively improving the selected subset, DynaMS enjoys better generalization than its static counterpart.

Though training on subsets greatly reduces the training computation, the overhead introduced by data evaluation undermines its significance. Previous works resort to a lighter proxy model. Utilizing a separate proxy (Coleman et al., 2020), however, is insufficient for dynamic selection, where the proxy must adapt agilely to model changes. We thus propose the parameter sharing proxy (PSP), where the proxy is constructed by multiplexing part of the underlying model parameters. As parameters are shared throughout training, the proxy can closely keep up with the underlying model. To train the shared network, we utilize slimmable training (Yu et al., 2019), with which a well-performing PSP and the underlying model can be obtained in a single training run.
PSP is especially desirable for extremely large-scale, hard problems. For massive training data, screening the informative subset with a light proxy can be much more efficient. For hard problems where the model evolves rapidly, PSP updates the informative subset in a timely manner, maximally retaining model utility.

Extensive experiments are conducted on the benchmarks CIFAR-10 and ImageNet. The results show that our proposed DynaMS effectively picks informative subsets, outperforming a number of competitive baselines. Note that though primarily designed for supervised learning tasks, DynaMS is widely applicable, as classifiers have become an integral part of many applications including foundation model training (Devlin et al., 2019; Brown et al., 2020; Dosovitskiy et al., 2021; Chen et al., 2020), where hundreds of millions of data points are consumed.

In summary, the contributions of this paper are threefold:
• We establish dynamic margin selection (DynaMS), which selects the informative subset dynamically according to the classification margin to accelerate the training process. DynaMS converges to its optimal solution with large probability and enjoys better generalization.
• We explore constructing a proxy by multiplexing the underlying model parameters. The resulting efficient PSP is able to agilely keep up with the model throughout training, thus fulfilling the requirement of dynamic selection.
• Extensive experiments and ablation studies demonstrate the effectiveness of DynaMS and its superiority over a set of competitive data selection methods.

2. METHODOLOGY

To accelerate training, we propose dynamic margin selection (DynaMS), whose framework is presented in Figure 1. Instances closest to the classification decision boundary are selected for training, and the resulting strategy is named margin selection (MS). We show that the most informative subset changes as learning proceeds, so that a dynamic selection scheme that progressively improves the subset can result in better generalization. Considering the computational overhead incurred by selection, we then explore the parameter sharing proxy (PSP), which utilizes a much lighter proxy model to evaluate samples. PSP is able to faithfully keep up with the underlying model in the dynamic selection scheme. The notations used in this paper are summarized in Appendix H.

2.1. SELECTION

Intuitively, the samples closest to the decision boundary should be the most influential to the model decision. Following (Mickisch et al., 2020; Emam et al., 2021), the decision boundary between two classes $c_1$ and $c_2 \in \{1, \dots, C\}$ is $B := \{x \mid f_{c_1}(x) = f_{c_2}(x)\}$, where $f_c(x)$ is the $c$-th entry of the model output, indicating the probability of $x$ belonging to class $c$. The classification margin is then:

$$M(x, c_1, c_2) = \min_{\delta} \|\delta\|_2 \quad \text{s.t.} \quad x + \delta \in B \tag{1}$$

which is the minimal perturbation required to move $x$ from $c_1$ to $c_2$. Directly computing the margin is infeasible for deep neural networks, so scoring is conducted in the feature space instead, as in (Emam et al., 2021). Neural networks typically apply a linear classifier on top of the features (Goodfellow et al., 2016), so the classification margin $M(x, c_1, c_2)$ can be easily obtained as:

$$M(x, c_1, c_2) = (W_{c_1} - W_{c_2})^\top h(x) \,/\, \|W_{c_1} - W_{c_2}\|_2,$$

where $W \in \mathbb{R}^{d \times C}$ is the weight of the linear classifier and $h(x)$ is the feature of $x$.

Given the selected subset $S$, the model is subsequently trained on $S$. The conventional static training scheme assumes that the optimal subset converges and is not related to the model training dynamics (Paul et al., 2021; Coleman et al., 2020). Though it effectively eliminates instances, the "converged optimal subset" assumption may be too strong. To investigate whether the most informative samples vary during training, we plot the overlap ratio of samples selected in two consecutive selections during the training of ResNet models, shown in Figure 2(a). We train for 200 epochs and 120 epochs on CIFAR-10 and ImageNet respectively, and conduct selection every 10 epochs. It can be observed that the overlap ratio is on average 0.83 for CIFAR-10 and 0.73 for ImageNet rather than 1.0, meaning that the samples that most benefit model training vary as the model evolves. A fixed subset may be outdated after parameter updates, thus yielding sub-optimal results.

We thus resort to a dynamic scheme where data selection is performed after every $Q$ epochs of training. By selecting in conjunction with training, the informative subset gets updated according to the current model status. For the $k$-th selection, the informative subset $S_k$ is constructed by picking a portion $\gamma_k$ of the samples, so that $|S_k| = \gamma_k |T|$.
The selection ratio $\gamma_k$ determines the critical margin $\kappa_k$: only samples with classification margin smaller than $\kappa_k$ are kept. $S_k$ is then used for training for $Q$ epochs. In the following, we provide a convergence analysis of DynaMS and show that DynaMS achieves better generalization by constantly improving the selected subset.

Convergence Analysis

We now study the conditions for the convergence of the training loss achieved by DynaMS. We use logistic regression (LR) to demonstrate, and then show that the conditions are well satisfied when LR is used on top of deep feature extractors. We have the following theorem:

Theorem. Consider logistic regression $f(x) = \frac{1}{1 + e^{-w^\top x}}$ with $N$ Gaussian training samples $x \sim \mathcal{N}(0, \Sigma)$, $x \in \mathbb{R}^d$. Assume $\|w\|_2 \le D$ and $\frac{N}{d} < \alpha$. Let $w^*$ be the optimal parameters and $\lambda$ be the largest eigenvalue of the covariance $\Sigma$. For $t \in \{1, \dots, T\}$ and constants $\varepsilon > D\sqrt{\lambda/2} - 1$, $\zeta > 1$, $\mu \gg \alpha$, select the subset with critical margin $\kappa_t = (1 + \varepsilon)\log(\zeta T - t)$ and update parameters with learning rate $\eta = \frac{DN}{E\sqrt{T}}$. Then with probability at least $1 - \frac{\alpha}{\mu}$,

$$\min_t L(w_t) - L(w^*) \le DE\left(\frac{1}{T^{1/4}} + \frac{c_{\varepsilon,\zeta}}{T^{3/4+\varepsilon}} + \frac{c_{\varepsilon,\zeta,\lambda}}{T^{\beta}}\right)$$

where $E = \sqrt{d\lambda}\,(1 + (2\mu)^{1/4})$, $\beta = \frac{(1+\varepsilon)^2}{2D^2\lambda} - \frac{1}{4}$, and $c_{\varepsilon,\zeta}$ and $c_{\varepsilon,\zeta,\lambda}$ are constants depending on $\varepsilon$, $\zeta$ and $\lambda$.

The proof is left to Appendix B. Theorem 2.2 indicates that dynamically selecting data based on the classification margin is able to converge and achieve the optimum $w^*$ with large probability. The Gaussian input assumption is overly strong in general, but when the linear classifier is adopted on top of a wide enough feature extractor, the condition is well satisfied because an infinitely wide neural network resembles a Gaussian process (Lee et al., 2019; Xiao et al., 2018; de G. Matthews et al., 2018).

Generalization Analysis

Recently, (Sorscher et al., 2022) developed an analytic theory for data selection. Assume training data $x_i \sim \mathcal{N}(0, I)$ and that there exists an oracle model $w_o \in \mathbb{R}^d$ which generates the labels such that $y_i = \operatorname{sign}(w_o^\top x_i)$.
Following static selection, when an estimator $w$ is used to pick samples that have a small classification margin, the generalization error takes the form $\mathcal{E}(\alpha, \gamma, \theta)$ in the high-dimensional limit. Here $\alpha = \frac{|T|}{d}$ indicates the abundance of training samples before selection, $\gamma$ determines the selection budget, and $\theta = \arccos\frac{w^\top w_o}{\|w\|_2 \cdot \|w_o\|_2}$ shows the closeness of the estimator to the oracle. The full set of self-consistent equations characterizing $\mathcal{E}(\alpha, \gamma, \theta)$ is given in Appendix C. By solving these equations, the generalization error $\mathcal{E}(\alpha, \gamma, \theta)$ can be obtained.

We then extend this to the dynamic scheme. For the $k$-th selection, we use the model trained on $S_{k-1}$ as the estimator $w_k$, which deviates from the oracle by the angle $\theta_k = \arccos\frac{w_k^\top w_o}{\|w_k\|_2 \cdot \|w_o\|_2}$, to evaluate and select samples. The resulting subset $S_k$ is used for the subsequent training of model $w_{k+1}$, which is later used as the estimator at step $k + 1$ to produce $S_{k+1}$. In this way, the generalization of the dynamic scheme can be obtained by recurrently solving the equations characterizing $\mathcal{E}(\alpha, \gamma_k, \theta_k)$ with updated keeping ratio $\gamma_k$ and estimator deviation $\theta_k$. Note that in each round of selection, samples are picked with replacement, so the abundance of training samples $\alpha$ is kept fixed. The keeping ratio $\gamma_k$, which determines the subset size, can be scheduled freely to meet various requirements.

We compare the generalization of dynamic selection and its static counterpart in Figure 2(b). We show the landscape of $\mathcal{E}(\alpha, \gamma, \theta)$ for different $\gamma$ and $\theta$ by solving the generalization equations numerically. $\alpha = 3.2$ is kept fixed, which means the initial training data is abundant. We use static training with $\theta_s = 40^\circ$ and $\gamma_s = 0.6$ as the control group. To make the comparison fair, we make sure $\frac{1}{K}\sum_{k=1}^{K}\gamma_k = \gamma_s$, so that the average number of samples used in the dynamic scheme equals the subset size used in the static scheme.
From Figure 2(b), we see that in dynamic selection the estimator is constantly improved ($\theta_k$ decreases), so that the subsets get refined and the model achieves better generalization. A discussion on selecting with different $\alpha$, $\gamma$ and $\theta$ is given in Appendix D.

With dynamic selection, the number of updates is reduced. However, the computational overhead incurred by data selection undermines its significance, especially when the model is complex and samples are evaluated frequently. Aside from designing efficient selection strategies, previous works explored using a lighter model as a proxy to evaluate the instances, which ameliorates the problem. Pretraining a separate proxy and evaluating instances prior to model training (Coleman et al., 2020), however, is insufficient for dynamic selection, as a static proxy cannot capture the dynamics of the underlying model. A proxy that fulfills the requirements of dynamic selection is still absent.


We thus propose the parameter sharing proxy (PSP), where part of the model is used as the proxy. Taking a convolutional neural network as an example, for a layer with kernel $W \in \mathbb{R}^{c_i \times c_o \times u \times u}$, where $c_i$, $c_o$ and $u$ are the number of input filters, the number of output filters and the kernel size respectively, the corresponding kernel of the proxy is

$$W_{\text{proxy}} = W_{1:pc_i,\; 1:pc_o,\; :,\; :},$$

where $p \in [0, 1]$ is a slimming factor. As shown in Figure 3, the proxy kernel is constructed with the first $pc_i$ input channels and the first $pc_o$ output channels. A proxy whose width is $p$ times that of the model is obtained by applying $p$ to each layer. With separate batch normalization for the proxy and the model, PSP forms a slimmable network (Yu et al., 2019), where multiple models of different widths are jointly trained and all yield good performance. As the parameters are shared, the proxy can closely keep up with changes in the model and is thus applicable to dynamic selection. We further investigate the gradient alignment of the proxy and the original model through their cosine similarity:

$$\cos(g, g_{\text{proxy}}) = \frac{g^\top g_{\text{proxy}}}{\|g\|_2 \cdot \|g_{\text{proxy}}\|_2}, \quad \text{where } g = \nabla_W L(W),\; g_{\text{proxy}} = \nabla_W L(W_{\text{proxy}}).$$

A positive cosine value indicates that $g_{\text{proxy}}$ lies on the same side as $g$, so updates on the proxy and on the model benefit each other. We compare the gradient alignment of PSP and a stand-alone proxy in Figure 2(c) on ResNet-50. With $p = 0.5$, we see that $\cos(g, g_{\text{proxy}})$ for PSP is much larger than for the stand-alone proxy.
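To make the slicing and the alignment measure concrete, the following is a minimal NumPy sketch, not the authors' implementation. The function names `proxy_kernel`, `pad_to_full` and `cosine` are hypothetical, and zero-padding the proxy gradient to the full kernel shape is an assumption made here so that the two gradients can be compared entry by entry.

```python
import numpy as np


def proxy_kernel(W, p):
    """Return the shared sub-kernel W[:p*c_i, :p*c_o, :, :] as a view of W."""
    c_i, c_o = W.shape[:2]
    return W[: int(p * c_i), : int(p * c_o), :, :]


def pad_to_full(g_sub, full_shape):
    """Embed a proxy gradient into a zero tensor shaped like the full kernel.

    Assumption: entries of W that the proxy does not use receive zero gradient.
    """
    g_full = np.zeros(full_shape, dtype=g_sub.dtype)
    g_full[: g_sub.shape[0], : g_sub.shape[1], :, :] = g_sub
    return g_full


def cosine(g, g_proxy):
    """Cosine similarity g^T g_proxy / (||g||_2 * ||g_proxy||_2) on flattened tensors."""
    g, g_proxy = g.ravel(), g_proxy.ravel()
    return float(g @ g_proxy / (np.linalg.norm(g) * np.linalg.norm(g_proxy)))
```

Because basic slicing returns a view, the proxy kernel shares memory with the full kernel, which mirrors the parameter sharing idea. For a layer of shape (64, 128, 3, 3) and p = 0.5, the proxy keeps one quarter of the kernel entries, in line with the quadratic FLOPs reduction reported for the slimming factors.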

3. RELATED WORK

Accelerating training by eliminating redundant training instances has long been a research focus in academia. This is accomplished by adopting an effective selection strategy and an appropriate training scheme. We summarize the related literature from these two strands of research in the following.

Selection Strategy

Sample selection can be accomplished with various principles. (Loshchilov & Hutter, 2015; Jiang et al., 2019; Paul et al., 2021) tend to pick samples that incur large loss or gradient norm (CE-loss, EL2N, GraNd). (Toneva et al., 2019) inspects the "unforgettable" examples that are rarely misclassified once learned, and believes these samples can be omitted without much performance degradation. Other works adopt uncertainty: samples with the least prediction confidence are preferred (Settles, 2010).

Complexity of the selection strategies (two complexity columns per method; the column headings and the name of the first method are missing):

(name missing)   O((C • d + log |S|) • |T|)   O(|T|)
EL2N             O((C • d + log |S|) • |T|)   O(|T|)
GraNd            O((C • d + log |S|) • |T|)   O(|T|)
Craig            O(C • d • |T| • |S|)         O(|S| • |T|)
GradMatch        O(C • d • |T| • |S|)         O(C • d • |T|)
MS               O(C • (d + log |S|) • |T|)   O(C • |T|)

Training Schemes

Data selection brings more options to training. Under the conventional static training scheme (Paul et al., 2021; Toneva et al., 2019; Coleman et al., 2020), data selection is conducted prior to model update, and the informative subset is kept fixed. In contrast, online batch selection picks a batch of data in each iteration (Loshchilov & Hutter, 2015; Alain et al., 2015; Zhang et al., 2019; Mindermann et al., 2022). Though it sufficiently considers the training dynamics, the overly frequent sample evaluation incurs prohibitive computational overhead. Recently, (Killamsetty et al., 2021) tried selecting after several epochs of training, which is similar to our dynamic scheme. However, the dynamic training scheme is utilized there only as a compromise to avoid overly frequent selection. A formal analysis of its advantage over the static scheme is absent.
By systematically considering the selection strategy, the model training, and the proxy design, our proposed DynaMS forms an effective data selection framework for efficient training.

4. EXPERIMENTS

In this section, we first analyse the effectiveness of each design ingredient in Section 4.2. Then we compare to state-of-the-art algorithms in Section 4.3. Code is available at https://github.com/ylfzr/DynaMS-subset-selection.

4.1. EXPERIMENTAL SETUP

We conduct experiments on CIFAR-10 (Krizhevsky & Hinton, 2009) and ImageNet (Jia et al., 2009), following the standard data pre-processing in He et al. (2016). A brief summary of the experimental setup is given below, while complete hyper-parameter settings and implementation details can be found in Appendix F.

CIFAR-10 Experiments

For CIFAR-10, we train ResNet-18 (He et al., 2016) for 200 epochs. Selection is conducted every 10 epochs, so overall there are 19 selections ($K = 19$). For the subset size, we adopt a simple linear schedule: $\gamma_k = 1 - k \cdot a$ for $k = 1, \dots, K$, where $a$ determines the reduction ratio. We make sure $\gamma_{avg} = \frac{1}{K}\sum_{k=1}^{K}\gamma_k = \gamma_s$. In this way, the average number of samples used in the dynamic scheme ($\gamma_{avg}$) is kept equal to that of static training ($\gamma_s$) for fair comparison. For 0.6× acceleration, $a = 0.042$. We conduct experiments on an NVIDIA Ampere A-100.

ImageNet Experiments

For ImageNet, we choose ResNet-18 and ResNet-50 as base models. Following convention, the total number of training epochs is 120. Selection is also conducted every 10 epochs, so altogether $K = 11$. For the subset size, aside from the linear schedule, we also explore a power schedule where $\gamma_k$ decays following a power law: $\gamma_k = m \cdot k^{-r} + b$ for $k = 1, 2, \dots, K$. For 0.6× acceleration, we set $m = 0.398$, $r = 0.237$ and $b = 0.290$. Please see Appendix F for more details. The power schedule reserves more samples in late training, preventing the performance degradation caused by excessive data pruning. We conduct experiments on four NVIDIA Ampere A-100s.

We use ResNet-50 on ImageNet to illustrate the effect of each ingredient in DynaMS, that is, the classification margin criterion, the dynamic training scheme, and the parameter sharing proxy.
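The two subset-size schedules from the experimental setup can be tabulated with a few lines of NumPy. This is an illustrative sketch; the function names are hypothetical. The helper `epoch_weighted_average` additionally assumes that the first interval before any selection trains on the full data (keep ratio 1), which is how the dynamic selection algorithm in this paper is initialized.

```python
import numpy as np


def linear_schedule(K, a):
    """Keep ratios gamma_k = 1 - k * a for k = 1..K."""
    k = np.arange(1, K + 1, dtype=float)
    return 1.0 - k * a


def power_schedule(K, m, r, b):
    """Keep ratios gamma_k = m * k**(-r) + b for k = 1..K."""
    k = np.arange(1, K + 1, dtype=float)
    return m * k ** (-r) + b


def epoch_weighted_average(gammas):
    """Average keep ratio when one initial full-data interval is counted too.

    Assumption: every interval has the same length, and the interval before
    the first selection uses all samples (keep ratio 1).
    """
    return (1.0 + float(np.sum(gammas))) / (len(gammas) + 1)


cifar = linear_schedule(K=19, a=0.042)                    # CIFAR-10 setting
imagenet = power_schedule(K=11, m=0.398, r=0.237, b=0.290)  # ImageNet setting
```

With these settings the linear schedule runs from 0.958 down to 0.202, and the power schedule starts at 0.688 and flattens out, keeping more samples late in training than a linear decay to the same average would. Counting the initial full-data interval, both averages come out close to the 0.6 budget.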

The effect of classification margin selection

To inspect the effect of classification margin selection (MS), we compare MS against two widely applied selection strategies, CE-loss (Loshchilov & Hutter, 2015; Jiang et al., 2019) and EL2N (Paul et al., 2021). CE-loss selects samples explicitly through the cross-entropy loss they incur, while EL2N picks samples that incur a large L2 error. We compare the three under the conventional static scheme, so that all factors other than the selection strategy are excluded. Samples are evaluated after 20 epochs of pretraining. The model is then reinitialized and trained on the selected subset, which contains 60% of the original samples. As shown in Table 2, MS achieves the best accuracy among the three, validating its effectiveness.

The effect of dynamic training

We then apply dynamic selection to MS, where the average subset size is also kept at 60% of the original dataset. From Table 2 we see that DynaMS outperforms MS by 1.67%, which is significant on a large-scale dataset like ImageNet. The superiority of DynaMS validates that, by constantly improving the model and updating the subset, the dynamic selection scheme can result in better performance. Note that DynaMS can be more practical, since it does not require the 20 epochs of training prior to selection that the static scheme requires.

The effect of parameter sharing proxy

We now study the parameter sharing proxy (PSP). An effective proxy is supposed to be faithful and to adapt agilely to model updates. In Figure 4, we plot the Spearman rank correlation as well as the overlap ratio of samples selected with the proxy and with the model. We see that throughout training the rank correlation is around 0.68, and over 78% of the selected samples are the same, indicating that the proxy and the model are fairly consistent. We then investigate how the complexity of the proxy, measured in floating point operations (FLOPs), affects performance. We enumerate over the slimming factor $p \in \{0.25, 0.5, 0.75, 1.0\}$ to construct proxies of different widths; the corresponding FLOPs are 6.25%, 25.00%, 56.25% and 100% respectively. In Table 3, we see that significant computation reduction can be achieved with moderate performance degradation.
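The two consistency measures used above, overlap ratio and Spearman rank correlation, can be sketched as follows. This is an illustration rather than the evaluation code behind Figure 4: the function names are hypothetical, and the rank computation assumes there are no ties among the scores.

```python
import numpy as np


def smallest_k(scores, k):
    """Indices of the k samples with the smallest score (e.g. margin)."""
    return np.argsort(scores, kind="stable")[:k]


def overlap_ratio(idx_a, idx_b):
    """Fraction of the samples in idx_a that also appear in idx_b."""
    a = set(int(i) for i in idx_a)
    b = set(int(i) for i in idx_b)
    return len(a & b) / len(a)


def spearman(u, v):
    """Spearman rank correlation, computed as the Pearson correlation of ranks.

    Assumption: u and v contain no tied values.
    """
    rank_u = np.argsort(np.argsort(u))
    rank_v = np.argsort(np.argsort(v))
    return float(np.corrcoef(rank_u, rank_v)[0, 1])
```

Given per-sample margins from the model and from the proxy, `overlap_ratio(smallest_k(m_model, k), smallest_k(m_proxy, k))` measures how many selected samples coincide, and `spearman(m_model, m_proxy)` measures how well the two rankings agree. The same `overlap_ratio` applies to subsets from two consecutive selections.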

4.3. COMPARISONS WITH STATE-OF-THE-ARTS

Finally, we compare DynaMS against various state-of-the-art methods. Aside from CE-loss and EL2N, Random picks samples uniformly at random. GraNd (Paul et al., 2021) selects samples that incur a large gradient norm. Forget (Toneva et al., 2019) counts how many times a sample is misclassified (forgotten) after it is learned; samples that are forgotten more frequently are preferred. We evaluate the forget score after 60 epochs of training. To avoid noisy evaluation, many of these static selection approaches ensemble networks before selection. The number of ensembled models is given by the subscript. Auto-assist (Zhang et al., 2019) selects samples that incur a large loss value on a small proxy. Selection is conducted in each iteration, thus forming an online batch selection (OLBS) scheme. DynaCE and DynaRandom apply the corresponding selection strategy, but are trained in a dynamic way. CRAIG and GradMatch propose to reweight and select subsets so that they best cover or approximate the full gradient. In the experiments, we use the per-batch variant of CRAIG and GradMatch proposed in (Killamsetty et al., 2021) with a 10-epoch warm start. The two approaches utilize a dynamic selection scheme, and all the training settings are kept the same as for our DynaMS.

In Table 4, the average accuracy from 5 runs on CIFAR-10 as well as the running time are reported. Due to limited space, the standard deviation is given in Appendix E. We see that DynaMS achieves performance comparable to the strongest baselines (EL2N 10, GraNd 10, Forget 10) while being more efficient. Note that the static methods require pretraining one or several models for 20 epochs before selection. Considering this cost (subscript of the reported running time), the acceleration of these methods is less significant. We also compare two online batch selection methods, OnlineMS and Auto-assist (Zhang et al., 2019). OnlineMS picks samples with MS, but the selection is conducted in each iteration. OnlineMS did not outperform DynaMS, meaning that more frequent selection is not necessary. Rather, selecting at each optimization step incurs prohibitive computational overhead. Auto-assist did not perform well in this experiment. This may result from the overly simple proxy: the logistic regression proxy adopted may not sufficiently evaluate the candidate samples.

For ImageNet, we also report the average accuracy from 5 runs as well as the running time. The standard deviation is given in Appendix E. DynaMS outperforms all the baselines. For instance, it achieves 68.65% and 74.56% top-1 accuracy given on average 60% of the samples for ResNet-18 and ResNet-50 respectively, surpassing the most competitive counterpart, Forget, by 0.81% and 1.06%. Compared to the static methods, which require additional pretraining (60 epochs for Forget and 20 for the others), DynaMS is much more efficient. CRAIG and GradMatch did not perform well on ImageNet.
This might be because we use the per-batch variant in (Killamsetty et al., 2021) and set the batch size to 512 in order to fit the per-sample gradients into memory. The per-batch variant treats each mini-batch as one sample and selects mini-batches during the gradient matching process, so a larger batch size means coarser-grained selection, which may lead to inferior performance. We also compare a variant, DynaRandom. DynaRandom adopts the dynamic selection scheme, but a random subset is constructed at each selection. DynaMS outperforms DynaRandom by 1.06% and 1.93% for ResNet-18 and ResNet-50 respectively, indicating that the superiority of DynaMS over static methods comes from effectively identifying informative samples rather than from witnessing more data.

ResNet-50 is rather complex and the data evaluation time is non-negligible. We thus apply the parameter sharing proxy to reduce the evaluation time. The proxy is 0.5× width, so the evaluation requires around 0.25× the computation of the original model. As the gradients of the proxy and the underlying model are well aligned, we only train DynaMS+PSP for 90 epochs. From Table 5, though utilizing a proxy harms performance compared to DynaMS, it still outperforms all the other baselines. Specifically, SVP also uses a proxy for sample evaluation. The proxy, however, is a static, fully trained ResNet-18. The superiority of DynaMS+PSP over SVP shows the necessity of a dynamic proxy that agilely keeps up with changes in the underlying model. The efficiency advantage of DynaMS+PSP over DynaMS can be significant for extremely large-scale problems where massive data is available while only a small fraction of the data is sufficient for training.

To further demonstrate DynaMS, we draw the accuracy curves of ResNet-50 against different (average) sample budgets from 60% to 100% in Figure 5. It can be seen that DynaMS consistently outperforms all the other data selection strategies across budgets.
Finally, to get a better understanding of what the selected samples look like and how they change over time, we visualize samples picked at different selection steps along the training. See Appendix G for more details.

5. CONCLUSION

In this paper, we propose DynaMS, a general dynamic data selection framework for efficient deep neural network training. DynaMS prefers samples that are close to the classification boundary, and the selected "informative" subset is dynamically updated during model training. DynaMS converges with high probability, and we are the first to show, both in practice and in theory, that dynamic selection improves generalization over previous approaches. Considering the additional computation incurred by selection, we further design a proxy suitable for dynamic selection. Extensive experiments and analysis demonstrate the effectiveness of our strategy.

Model efficiently trained on selected subsets. (Yu et al., 2019).
1: $k = 1$; $\gamma_k = 1$, thus $S_k = T$
2: for epochs $t = 1, \dots, T$ do
3: if $t \,\%\, Q == 0$ then
4: Select subset, $S_k = \mathrm{MS}(W_t, T, \gamma_k)$.
5: k = k +
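The dynamic scheme described in this paper (start from the full data, re-select by margin every Q epochs, continue training on the new subset) can be illustrated with a small, self-contained sketch. It is not the authors' training code. It makes several assumptions: logistic regression stands in for the deep network, the margin is taken as |w^T x| as in the convergence analysis, the subset is chosen by keep ratio rather than by a critical margin threshold, and all function and parameter names are hypothetical.

```python
import numpy as np


def select_by_margin(w, X, keep_ratio):
    """Keep the keep_ratio fraction of samples with the smallest |w^T x|."""
    n_keep = max(1, int(round(keep_ratio * len(X))))
    margin = np.abs(X @ w)
    return np.argsort(margin, kind="stable")[:n_keep]


def dynams_logreg(X, y, ratios, Q, epochs, lr=0.5):
    """Full-batch gradient descent on logistic regression with dynamic selection.

    ratios[k] is the keep ratio of the k-th selection; a selection happens
    whenever the epoch index is a multiple of Q, until ratios is exhausted.
    Returns the weights and the subset size used in every epoch.
    """
    n, d = X.shape
    w = np.zeros(d)
    subset = np.arange(n)  # keep ratio 1 before the first selection
    k, sizes = 0, []
    for t in range(1, epochs + 1):
        if t % Q == 0 and k < len(ratios):
            subset = select_by_margin(w, X, ratios[k])
            k += 1
        sizes.append(len(subset))
        Xs, ys = X[subset], y[subset]
        y_hat = 1.0 / (1.0 + np.exp(-(Xs @ w)))
        # Gradient of the mean cross-entropy loss on the current subset.
        w -= lr * Xs.T @ (y_hat - ys) / len(subset)
    return w, sizes
```

For example, with 200 samples, `ratios=[0.8, 0.6, 0.4]`, `Q=5` and `epochs=20`, the first four epochs use all 200 samples and the subset then shrinks to 160, 120 and 80 samples at epochs 5, 10 and 15.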

B PROOF FOR THEOREM 2.2

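To prove Theorem 2.2, we first inspect the norm of $x$. We get the following lemma.

Lemma 1. For Gaussian data $x \sim \mathcal{N}(0, \Sigma)$, let $\mu > 0$, $T > 1$ be constants, $d$ the dimension of $x$ and $\lambda$ the largest eigenvalue of the covariance $\Sigma$. Then with probability at least $1 - \frac{1}{\mu T d}$,

$$\|x\|_2 < \sqrt{d\lambda}\,\bigl(1 + (2\mu)^{1/4}\bigr)\,T^{1/4}.$$

Proof of Lemma 1. For $x \sim \mathcal{N}(0, \Sigma)$, $\|x\|_2^2$ follows a generalized chi-squared distribution. The mean and variance can be computed explicitly as $\mathbb{E}[\|x\|_2^2] = \operatorname{tr}\Sigma = \sum_j \lambda_j$ and $\operatorname{Var}(\|x\|_2^2) = 2\operatorname{tr}\Sigma^2 = 2\sum_j \lambda_j^2$. By Chebyshev's inequality, we have

$$\Pr\Bigl(\|x\|_2^2 < \sum_j \lambda_j + \sqrt{\mu T d}\,\sqrt{2\sum_j \lambda_j^2}\Bigr) > 1 - \frac{1}{\mu T d}$$

where $\mu > 0$ and $T > 1$ are constants and $d$ is the dimension of $x$. Then, as $\sum_j \lambda_j + \sqrt{\mu T d}\,\sqrt{2\sum_j \lambda_j^2} \le (1 + \sqrt{2\mu T})\,d\lambda$, where $\lambda = \max_j \lambda_j$ is the largest eigenvalue of the covariance $\Sigma$, we have:

$$\Pr\Bigl(\|x\|_2 < \sqrt{d\lambda}\,\bigl(1 + (2\mu)^{1/4}\bigr)\,T^{1/4}\Bigr) > 1 - \frac{1}{\mu T d} \tag{5}$$

Then we can start proving Theorem 2.2.

Proof of Theorem 2.2. Consider logistic regression $f(x) = \frac{1}{1 + e^{-w^\top x}}$ with loss function

$$L = \frac{1}{N}\sum_{i=1}^{N} \ell_i = \frac{1}{N}\sum_{i=1}^{N} \bigl[-y_i \log \hat{y}_i - (1 - y_i)\log(1 - \hat{y}_i)\bigr],$$

where $\hat{y}_i$ is the predicted value. The gradient incurred by training on the selected subset is then:

$$\frac{\partial L_\kappa}{\partial w} = \frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i - y_i)\,x_i \cdot \mathbb{I}\bigl(|w^\top x_i| < \kappa\bigr).$$

For those samples with $|w^\top x_i| \ge \kappa$, or "easy" samples, we have $|\operatorname{sgn}(y_i - \tfrac{1}{2}) \cdot w^\top x_i| \ge \kappa$ and, with probability at least $1 - \frac{1}{\mu T d}$,

$$\Bigl\|\frac{\partial \ell_i}{\partial w}\Bigr\|_2 \le \begin{cases} \dfrac{E \cdot T^{1/4}}{1 + e^{\kappa}} & \text{if } \operatorname{sgn}(y_i - \tfrac{1}{2}) \cdot w^\top x_i \ge \kappa \\[2mm] E \cdot T^{1/4} & \text{if } \operatorname{sgn}(y_i - \tfrac{1}{2}) \cdot w^\top x_i \le -\kappa \end{cases}$$

where $E = \sqrt{d\lambda}\,(1 + (2\mu)^{1/4})$. Note that the condition $\operatorname{sgn}(y_i - \tfrac{1}{2}) \cdot w^\top x_i \le -\kappa$ means that $x_i$ is misclassified by $w$ and that the margin is at least $\kappa$. Denoting the portion of this kind of misclassified sample in the whole training set by $r$, we have the following estimate of the gradient gap:

$$\mathrm{Err}_t = \Bigl\|\frac{\partial L_\kappa}{\partial w} - \frac{\partial L}{\partial w}\Bigr\|_2 = \frac{1}{N}\Bigl\|\sum_{|w^\top x| \ge \kappa} \frac{\partial \ell}{\partial w}(x)\Bigr\|_2 \le \frac{E\,T^{1/4}(1 - \gamma_t)}{1 + e^{\kappa_t}} + E\,T^{1/4}(1 - \gamma_t)\,r_t$$

where $\gamma_t$ is the fraction of data kept by selecting with margin $\kappa_t$.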
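The inequality holds with probability at least $\bigl(1 - \frac{1}{\mu T d}\bigr)^N > 1 - \frac{\alpha}{\mu T}$ because of Equation 8. Note that Lemma 1 also suggests $\bigl\|\frac{\partial \ell}{\partial w}\bigr\|_2 \le E \cdot T^{1/4}$ with large probability, therefore $L$ is highly likely to be Lipschitz continuous with parameter $E\,T^{1/4}$. With $\kappa_t = (1 + \varepsilon)\log(\zeta T - t)$, $\zeta > 1$, we have with probability at least $\bigl(1 - \frac{\alpha}{\mu T}\bigr)^T \ge 1 - \frac{\alpha}{\mu}$:

$$\begin{aligned} \min_t L(w_t) - L(w^*) &\le \frac{DE}{N\,T^{1/4}} + \frac{D}{T}\sum_{t=1}^{T-1}\mathrm{Err}_t \\ &\le \frac{DE}{N\,T^{1/4}} + \frac{DE}{T^{3/4}}\sum_{t=1}^{T-1}\frac{1}{(\zeta T - t)^{1+\varepsilon}} + \frac{DE}{T^{3/4}}\sum_{t=1}^{T-1} r_t \\ &\le \frac{DE}{T^{1/4}}\Bigl(\frac{1}{N} + \frac{c_{\varepsilon,\zeta}}{T^{\varepsilon}\sqrt{T}}\Bigr) + \frac{DE}{T^{3/4}}\sum_{t=1}^{T-1} r_t \end{aligned}$$

The first inequality follows from Theorem 1 in (Killamsetty et al., 2021). The last inequality holds because

$$\sum_{t=1}^{T-1}\frac{1}{(\zeta T - t)^{1+\varepsilon}} \le \int_{(\zeta - 1)T}^{\zeta T}\frac{1}{s^{1+\varepsilon}}\,ds \le \frac{c_{\varepsilon,\zeta}}{T^{\varepsilon}}$$

with $c_{\varepsilon,\zeta} = \frac{1}{\varepsilon(\zeta - 1)^{\varepsilon}}$, for all $\varepsilon > 0$ and $\zeta > 1$.

To bound the sum of classification errors (the last term of Equation 10), we again utilize the prior on the data distribution. Note that the data points that contribute to $r$ are characterized by the following set:

$$\mathcal{E} = \{w_o^\top x > 0 \wedge w^\top x < -\kappa\} \cup \{w_o^\top x < 0 \wedge w^\top x > \kappa\} := \mathcal{E}_1 \cup \mathcal{E}_2$$

where $w_o$ is the oracle classifier such that the true label is generated according to $y = \operatorname{sgn}(w_o^\top x)$. Let $\phi$ denote the Gaussian probability density function. We see that

$$r = \int_{\mathcal{E}} \phi(x \mid \Sigma)\,dx = 2\int_{\mathcal{E}_1} \phi(x \mid \Sigma)\,dx \le 2\int_{\{w^\top x < -\kappa\}} \phi(x \mid \Sigma)\,dx = 2\Phi\Bigl(-\frac{\kappa}{\sqrt{w^\top \Sigma w}}\Bigr) \le 2\Phi\Bigl(-\frac{\kappa}{D\sqrt{\lambda}}\Bigr)$$

where $\lambda$ is the largest eigenvalue of $\Sigma$. Therefore, we have the following estimate:

$$\begin{aligned} \frac{1}{T^{3/4}}\sum_{t=1}^{T-1} r_t &\le \frac{1}{T^{3/4}}\sum_{t=1}^{T-1} 2\Phi\Bigl(-\frac{\kappa_t}{D\sqrt{\lambda}}\Bigr) \\ &\le \frac{2}{T^{3/4}}\sum_{t=1}^{T-1}\frac{\phi\bigl(\kappa_t/(D\sqrt{\lambda})\bigr)}{\kappa_t/(D\sqrt{\lambda})} \quad \text{(Gaussian upper tail bound)} \\ &= \frac{2D\sqrt{\lambda}}{\sqrt{2\pi}\,(1 + \varepsilon)}\,\frac{1}{T^{3/4}}\sum_{t=1}^{T-1}\frac{1}{\log(\zeta T - t)}\,e^{-\frac{(1+\varepsilon)^2}{2D^2\lambda}\log^2(\zeta T - t)} \\ &\le \frac{2D\sqrt{\lambda}\,T^{1/4}}{\sqrt{2\pi}\,(1 + \varepsilon)}\,\frac{1}{\log((\zeta - 1)T + 1)}\,\frac{1}{\bigl((\zeta - 1)T + 1\bigr)^{\frac{(1+\varepsilon)^2}{2D^2\lambda}\log((\zeta - 1)T + 1)}} \\ &\le c_{\varepsilon,\zeta,\lambda}\,T^{-\beta} \end{aligned}$$

[Caption: Constants appear in the convergence bound.]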



Without loss of generality, we omit the bias term for notation clarity.
For the extremely large dataset case, where training can be accomplished within just one or a few epochs, the selection can be performed every Q iterations.
For CIFAR-10, we use the published implementation from https://github.com/decile-team/cords. For ImageNet, we modify the implementation to the distributed setting.



Figure 1: The overall framework of dynamic margin selection with parameter sharing proxy (DynaMS+PSP). The green model indicates the underlying model to be trained and the blue one is the parameter sharing proxy, which efficiently evaluates data. Instances are selected every Q epochs, then the model is trained on the selected subset.

is the weight of the linear classifier and $h(x)$ is the feature of $x$. In this way, the classification margin of a labeled sample $(x, y)$ along class $c$ is $M(x, y, c)$ if $y \neq c$, or $\min_{c \neq y} M(x, y, c)$ if $y = c$. The former is the distance needed to move $(x, y)$ to class $c$, while the latter is the distance needed to move $(x, y)$ to the nearest class other than $y$. To keep the subset balanced, we evenly pick $|S|/C$ samples with the smallest classification margin along each class. The resulting strategy is named margin selection (MS), denoted as $\mathrm{MS}(w, T, |S|)$. The procedure is detailed in Algorithm 1 in Appendix A.
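The selection rule described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: it assumes the pairwise margin takes the form $M(x, y, c) = (W_y - W_c)^\top h(x) / \|W_y - W_c\|_2$, and the helper names `margin_matrix` and `margin_select` are hypothetical.

```python
import numpy as np

def margin_matrix(features, labels, W):
    """Return the (n, C) score matrix of classification margins.

    Assumed form (not taken from this section):
        M(x, y, c) = (W_y - W_c)^T h(x) / ||W_y - W_c||_2   for c != y,
        M(x, y, y) = min_{c != y} M(x, y, c).
    features: (n, d) array of h(x); labels: (n,) ints; W: (C, d) classifier weights.
    """
    n, C = features.shape[0], W.shape[0]
    logits = features @ W.T                                   # (n, C)
    own = logits[np.arange(n), labels][:, None]               # logit of the true class
    diff_norm = np.linalg.norm(W[labels][:, None, :] - W[None, :, :], axis=2)
    is_own = np.arange(C)[None, :] == labels[:, None]
    diff_norm[is_own] = 1.0                                   # placeholder, avoids 0/0
    M = (own - logits) / diff_norm
    M[is_own] = np.inf                                        # exclude c == y from the min
    M[is_own] = M.min(axis=1)                                 # margin to the nearest other class
    return M

def margin_select(M, budget):
    """Pick budget // C samples with the smallest margin along each class,
    removing already selected samples from the candidate pool."""
    n, C = M.shape
    per_class = budget // C
    available = np.ones(n, dtype=bool)
    chosen = []
    for c in range(C):
        idx = np.flatnonzero(available)
        picked = idx[np.argsort(M[idx, c])[:per_class]]
        chosen.append(picked)
        available[picked] = False
    return np.concatenate(chosen)

# Toy usage with random features and weights (purely illustrative).
rng = np.random.default_rng(0)
n, d, C = 200, 16, 4
feats = rng.standard_normal((n, d))
labels = rng.integers(0, C, size=n)
W = rng.standard_normal((C, d))
M = margin_matrix(feats, labels, W)
subset = margin_select(M, budget=int(0.5 * n))
```

Looping over classes and removing picked samples from the pool mirrors the balancing step of the text: each class contributes the same number of samples and no sample is selected twice.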

Figure 2: (a) Overlap ratio of subsets extracted in two consecutive selections. An overlap ratio near 1 means the selection has converged. (b) Generalization of static and dynamic data selection according to classification margin. Selection budget $\gamma_{avg} = \gamma_s = 60\%$. (c) The averaged gradient alignment $\cos(g, g_{proxy})$ of the parameter sharing proxy and a stand-alone proxy along model training.

$c_{\varepsilon,\zeta}$ and $c_{\varepsilon,\zeta,\lambda}$ are constants depending on $\varepsilon$, $\zeta$ and $\lambda$.

Figure 3: Parameter sharing proxy which is constructed with part of model parameters.

Figure 5: Comparison under different (on average) sample budgets.

the procedure of margin selection (MS). In MS, the distances of the current sample $(x, y)$ to each other class $c$ are computed. If $y \neq c$, the classification margin of $(x, y)$ and class $c$ is $M(x, y, c)$, which is the distance of moving $x$ from class $y$ to class $c$. If $y = c$, the classification margin is $\min_{c \neq y} M(x, y, c)$, which corresponds to the distance of moving $(x, y)$ to the other class that is closest to $x$. For the whole candidate set $T$, this generates a $|T| \times C$ score matrix. After the classification margins are obtained, the $|S|/C$ samples with the smallest classification margin along each class are picked. This keeps the samples collected in the subset balanced.

Algorithm 1 Margin selection: MS(w, T, γ)
Input: Candidate set T, keeping ratio γ, number of classes C; network with weights w, including weights of the final classification layer W;
Output: Subset S selected according to the classification margin.
1: Compute the keeping budget |S| = γ · |T|, initialize the subset S = {}
   // Evaluating: compute the classification margin.
2: for (x, y) ∈ T do
3:    for c = 1 : C do
4:       Compute the classification margin of the sample to the (y, c) boundary: M(x, y, c) if y ≠ c, or min_{c≠y} M(x, y, c) if y = c
5:    end for
6: end for
   // Selecting: pick the samples according to classification margin (Equation 4).
7: for c = 1 : C do
8:    Pick the |S|/C samples which have the smallest classification margins M(·): Top_{|S|/C}(c).
9:    S = S ∪ Top_{|S|/C}(c)
10:   Remove the already selected samples from the candidate set: T = T − Top_{|S|/C}(c)
11: end for

Algorithm 2 Dynamic margin selection (DynaMS)
Input: Training data T; base network with weights W, learning rate η; keep ratio of each selection γ_k where k = 1, ..., K, selection interval Q
Output:

full workflow of efficient training with the proposed dynamic margin selection (DynaMS) is shown in Algorithm 2. The model is first trained on the full dataset T for Q epochs to warm up. Subset selection then kicks in every Q epochs: samples are evaluated with the current model, so the informative subset is updated according to the distance of samples to the classification boundary. After selection, the model is trained on the selected subset until the next selection. The workflow incorporating the parameter sharing proxy is shown in Algorithm 3. Different from naive DynaMS, samples are evaluated and selected with the proxy instead of the underlying model. During the Q epochs of training, the proxy and the original model are updated simultaneously with slimmable training
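The alternation between training and selection can be summarised as a short control loop. The sketch below is a schematic under stated assumptions, not Algorithm 2 itself: `train_epoch` and `select` are hypothetical callbacks standing in for one epoch of model training and for margin selection, and candidates are assumed to be drawn from the current subset (selection without replacement).

```python
from typing import Callable, List, Sequence

def dynams_train(
    num_samples: int,
    keep_ratios: Sequence[float],
    Q: int,
    train_epoch: Callable[[List[int]], None],
    select: Callable[[List[int], int], List[int]],
) -> List[int]:
    """Warm up on the full data for Q epochs, then re-select every Q epochs.

    keep_ratios[k] is the keep ratio gamma_k of stage k (gamma_1 = 1.0 for warm-up).
    Returns the subset size used in every epoch, for inspection.
    """
    subset = list(range(num_samples))
    sizes = []
    for k, ratio in enumerate(keep_ratios):
        if k > 0:  # stage 0 is the warm-up on the full dataset
            budget = int(round(ratio * num_samples))
            subset = select(subset, budget)  # evaluate with the current model
        for _ in range(Q):
            train_epoch(subset)
            sizes.append(len(subset))
    return sizes

# Toy usage: dummy callbacks that only record what they were given.
seen = []
sizes = dynams_train(
    num_samples=10,
    keep_ratios=[1.0, 0.8, 0.5],
    Q=2,
    train_epoch=lambda s: seen.append(len(s)),
    select=lambda cand, budget: cand[:budget],
)
```

In a real setup `select` would wrap margin selection with the current model (or the proxy), and `train_epoch` would run one pass of SGD over the given indices.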

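$\frac{1}{1 + e^{-w^\top x}}$ with $N$ Gaussian training samples $x \sim \mathcal{N}(0, \Sigma)$, $x \in \mathbb{R}^d$. Assume $\|w\|_2 \le D$ and $\frac{N}{d} < \alpha$. Let $w^*$ be the optimal parameters and $\lambda$ be the largest eigenvalue of the covariance $\Sigma$. For $t \in \{1, \ldots, T\}$ and constants $\varepsilon > D\sqrt{\frac{\lambda}{2}} - 1$, $\zeta > 1$, $\mu \gg \alpha$, select the subset with critical margin $\kappa_t = (1 + \varepsilon)\log(\zeta T - t)$ and update the parameters with learning rate $\eta = \frac{DN}{E\sqrt{T}}$. Then with probability at least $1 - \frac{\alpha}{\mu}$ the convergence bound holds, where $\beta = \frac{(1+\varepsilon)^2}{2D^2\lambda} - \frac{1}{4}$, and $c_{\varepsilon,\zeta}$ and $c_{\varepsilon,\zeta,\lambda}$ are constants depending on $\varepsilon$, $\zeta$ and $\lambda$.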

By setting a constant learning rate $\eta = \frac{DN}{E\sqrt{T}}$, and critical margin $\kappa$

Figure 9: Images selected at different training stages of the model. As in (Sorscher et al., 2022), we show results on ImageNet class 100 (black swan).

Given the well-aligned gradients, PSP requires fewer training epochs. The overall workflows of DynaMS and DynaMS+PSP are shown in Algorithm 2 and Algorithm 3 of Appendix A. PSP is especially advantageous for large and hard problems. When the data is extremely large, training PSP on a small subset is cheaper than evaluating the entire training set with the original model, making it much more efficient. When the task is hard and the model changes rapidly during training, PSP can update the informative subset in a timely manner, maximally retaining the model utility.

Computational and space complexity of different selection strategies. For GraNd, Craig and Grad-Match, only gradients of the classification layer are considered to avoid overly large complexity.

Our work utilizes the classification margin to identify informative samples, which is efficient and can synergistically adapt to various training schemes. A comparison of these strategies is given in Table 1, where d is the dimension of the data feature. MS is slightly slower than selection via loss (CE-loss and EL2N), but much more efficient than Craig and GradMatch. Here we consider only the complexity of the selection strategy itself; time spent on feature extraction is not included. Classification margin has been previously explored in the active learning literature (Ducoffe & Precioso, 2018; Emam et al., 2021); here we utilize it for training acceleration.

Comparison of different data selection strategies. Except for DynaMS, all the other methods conduct training in the conventional static scheme.

Comparison for ResNet-18 on Cifar-10.

Comparison for ResNet-18 and ResNet-50 on ImageNet.

Dynamic margin selection (DynaMS) with parameter sharing proxy (PSP)
Keep ratio of each selection γ_k where k = 1, ..., K, selection interval Q
Slimming factor of the proxy r, which determines the proxy weights W_proxy.


Published as a conference paper at ICLR 2023

$\tau$ is an auxiliary field introduced by the Hubbard-Stratonovich transformation. $\langle \cdot \rangle_z$ denotes expectation over $p(z)$. By solving these equations, the generalization error can be easily read off.

To better understand the generalization under classification margin selection $E(\alpha, \gamma, \theta)$, we provide more results to individually inspect the effect of the (average) selection ratio $\gamma_{avg}$, the initial data abundance $\alpha$ and the closeness of the estimator to the oracle model $\theta$. As shown in Figure 6(a), we change $\gamma_{avg}$ from 60% to 50%, thus constructing a smaller selection budget case. In Figure 6(b), we use $\alpha = 2.1$ instead of $\alpha = 3.2$ to construct a less abundant data case, where the data before selection is insufficient. In Figure 6(c), we start selecting samples using a better estimator, $\theta = 30°$ instead of $\theta = 40°$. All hyper-parameters other than the inspected one are kept consistent with those used in Figure 2(b), that is, $\gamma_{avg} = 0.6$, $\alpha = 3.2$ and $\theta = 40°$. We see that across the various $\gamma_{avg}$ and $\theta$, DynaMS outperforms its static counterpart. The abundance of initial data, however, has a significant effect. When data is insufficient, data selection, both static and dynamic, causes obvious performance degradation. Figure 7 shows an even more severe case with $\alpha = 1.7$: the generalization landscape is significantly changed and data selection is not recommended in this case.

E COMPARISON WITH STANDARD DEVIATION

We run each method in Table 4 and Table 5 five times. The averaged accuracy and standard deviation are reported below in Table 6 and Table 7.

F IMPLEMENTATION DETAILS AND HYPER-PARAMETERS

Subset size schedule. Dynamic selection admits more freedom in the subset size schedule. In the experiments we consider the linear schedule and the power schedule. For the linear schedule, the keeping ratio $\gamma_k$ decreases linearly with the selection index $k$. Aside from the linear schedule, we also explore a power schedule where $\gamma_k = m \cdot k^{-r} + b$ for $k = 1, 2, \ldots, K$. The power schedule reserves more samples in late training, preventing the performance degradation caused by over-pruning the data. Determining the hyper-parameters $m$, $r$, $b$ is somewhat tricky; we only require $\gamma_1 = 1.0$ to warm start and $\gamma_{avg} = \frac{1}{K}\sum_{k=1}^{K} \gamma_k = \gamma_s$ for fair comparison. $\gamma_K$ should not be overly small; we empirically find that $\gamma_K \approx \gamma - 0.1$ yields good results. For the different budgets $\gamma_s = \{0.6, 0.7, 0.8, 0.9\}$, the hyper-parameters are given in Appendix F, Table 8. Post-processing is carried out to make sure the resulting subset size sequence satisfies the above requirements. Killamsetty et al. (2021) utilize a constant schedule, where in each selection the subset size is kept constant as $\gamma_s \cdot |T|$. This schedule, however, does not admit selection without replacement. Both the linear and power schedules are monotonically decreasing, and are thus natural choices in this respect. Figure 8 plots the three schedules for a $\gamma_s = 0.6$ budget. In this paper we provide only a preliminary exploration of the subset size schedule; an in-depth study of the relationship between the subset size and the model performance, as well as an automatic way of determining the optimal subset size schedule, is left for future work.

Hyper-parameters. Finally, the detailed hyper-parameters for DynaMS on both the CIFAR-10 and ImageNet datasets are shown in Table 8. Note that for DynaMS+PSP, the Max Epochs is set to 90 on ImageNet.
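The two requirements on the power schedule, $\gamma_1 = 1.0$ and $\gamma_{avg} = \gamma_s$, determine $m$ and $b$ once $r$ is fixed, since they give $m + b = 1$ and $m \cdot \overline{k^{-r}} + b = \gamma_s$. The sketch below solves this small linear system; it is an illustration of the constraints only, the values it produces are not the hyper-parameters of Table 8, and `fit_power_schedule` is a hypothetical helper.

```python
import numpy as np

def fit_power_schedule(K, r, gamma_s):
    """Return (gammas, m, b) for gamma_k = m * k**(-r) + b, k = 1..K,
    such that gamma_1 = 1 and mean(gamma_k) = gamma_s.

    Solves  m + b = 1  and  m * mean(k**-r) + b = gamma_s  (requires K >= 2, r > 0).
    """
    k = np.arange(1, K + 1, dtype=float)
    mean_pow = np.mean(k ** (-r))
    m = (1.0 - gamma_s) / (1.0 - mean_pow)
    b = 1.0 - m
    return m * k ** (-r) + b, m, b

# Example with assumed values K = 10 selections, r = 0.5, budget gamma_s = 0.6.
gammas, m, b = fit_power_schedule(K=10, r=0.5, gamma_s=0.6)
print(np.round(gammas, 3), "final keep ratio:", round(float(gammas[-1]), 3))
```

After fitting, the final keep ratio `gammas[-1]` can be inspected and `r` adjusted if it turns out too small, in the spirit of the empirical guideline on $\gamma_K$ stated above.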

