SAMPLE IMPORTANCE IN SGD TRAINING

Abstract

Deep learning requires ever larger models and datasets to improve generalization on unseen data, yet some training samples may be more informative than others. We investigate this assumption in supervised image classification by biasing SGD (Stochastic Gradient Descent) to draw important samples more often during classifier training. In contrast to the state of the art, our approach requires no additional training iterations to estimate sample importance, because it computes the estimates once during training using the training prediction probabilities. In experiments, our learning technique converges on par with or faster than the state of the art in terms of training iterations and can achieve higher test accuracy, especially when datasets are not suitably balanced. The results suggest that sample importance has intrinsic balancing properties and that an importance-weighted class distribution can converge faster than the usual balanced class distribution. Finally, in contrast to recent work, we find that sample importance is model dependent. Consequently, computing sample importance during training, rather than in a pre-processing step, may be the only viable approach.

1. INTRODUCTION

For many gradient-descent-based models, increasing the model and training data sizes boosts performance and the ability to generalize. However, this increase comes with ever higher computational costs: longer training times and greater energy consumption are required to train a model on a given training dataset. One way to reduce these costs is to optimize the model training procedure. The most common training approach relies on randomly shuffling the data samples without replacement during training for a given number of epochs. As a consequence, all samples are seen by the model the same number of times and are therefore implicitly treated as equally important. Recent works on hard example mining (Felzenszwalb et al., 2009; Loshchilov & Hutter, 2015; Simpson, 2015; Alain et al., 2015; Shrivastava et al., 2016; Chang et al., 2017; Katharopoulos & Fleuret, 2018; Arriaga & Valdenegro-Toro, 2020; Pruthi et al., 2020; Lin et al., 2017) and coreset selection (Mirzasoleiman et al., 2020; Killamsetty et al., 2020; 2021; Yoon et al., 2021; Balles et al., 2022) have shown that training samples are not equally important for learning a given task (Katharopoulos & Fleuret, 2018) and that training can be sped up by focusing on hard samples or on subsets of the training dataset that best approximate the full gradient, respectively. These methods estimate sample importance online during training and leverage this information to speed up the learning process. However, they often incur additional computational overhead to compute the sample importance, which makes them less effective in practice. This is particularly true for coreset selection methods, which are based on conservative estimates (Paul et al., 2021).
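As an illustration of this biased-sampling idea, the following sketch (hypothetical, not the method of any cited work) draws minibatch indices with probability proportional to an importance weight derived from each sample's predicted probability of its true class, so that samples the model still gets wrong are revisited more often:

```python
import numpy as np

def importance_weights(p_true, eps=1e-6):
    """Importance of each sample taken as one minus the predicted
    probability of its true class (a hypothetical, illustrative choice)."""
    w = 1.0 - p_true + eps  # hard samples (low p_true) get large weight
    return w / w.sum()      # normalize into a sampling distribution

def sample_minibatch(rng, weights, batch_size):
    """Draw minibatch indices biased toward important samples."""
    return rng.choice(len(weights), size=batch_size, replace=True, p=weights)

rng = np.random.default_rng(0)
# Model's current predicted probabilities for the true class of 4 samples.
p_true = np.array([0.99, 0.95, 0.10, 0.50])
w = importance_weights(p_true)
batch = sample_minibatch(rng, w, batch_size=256)
# Sample 2 (p_true = 0.10, the hardest) dominates the drawn indices.
```

In practice the weights would be refreshed from the training prediction probabilities as the model evolves, so the bias tracks which samples are currently hard.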
Another line of research has shown that the training samples in a dataset can be ranked according to different importance scores (Feldman & Zhang, 2020; Feldman, 2020; Jiang et al., 2020; Toneva et al., 2018; Paul et al., 2021) and that less important samples can be pruned prior to training with little to no loss in test accuracy. Since computing sample scores exactly is computationally infeasible, these methods rely on approximations that usually require fully training at least one model, computing and ranking the sample scores, and then selecting the smallest subset of samples that best approximates the test accuracy achieved by a model trained on the full training dataset.
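The rank-and-prune step in this pipeline can be sketched as follows; the scores here are hypothetical placeholders for whichever importance metric a method computes after fully training a model:

```python
import numpy as np

def prune_by_score(scores, keep_fraction):
    """Keep only the indices of the highest-scoring (most important)
    samples; the rest are pruned before retraining."""
    k = int(np.ceil(keep_fraction * len(scores)))
    order = np.argsort(scores)[::-1]   # rank samples, highest score first
    return np.sort(order[:k])          # indices of the retained subset

# Hypothetical importance scores for a 5-sample dataset.
scores = np.array([0.2, 0.9, 0.1, 0.7, 0.5])
keep = prune_by_score(scores, keep_fraction=0.6)
# retains the 3 highest-scoring samples: indices [1, 3, 4]
```

The expensive part is producing the scores, not the pruning itself, which is why these methods are framed as a pre-processing step rather than an online one.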

