HOW IMPORTANT IS IMPORTANCE SAMPLING FOR DEEP BUDGETED TRAINING?

Abstract

Long iterative training processes for Deep Neural Networks (DNNs) are commonly required to achieve state-of-the-art performance in many computer vision tasks. Core-set selection and importance sampling approaches might play a key role in budgeted training regimes, i.e. when limiting the number of training iterations. The former demonstrates that retaining informative samples is important to avoid large drops in accuracy, and the latter aims at dynamically estimating sample importance to speed up convergence. This work explores this paradigm and how a budget constraint interacts with importance sampling approaches and data augmentation techniques. We show that under budget restrictions, importance sampling approaches do not provide a consistent improvement over uniform sampling. We suggest that, given a specific budget, the best course of action is to disregard the importance and introduce adequate data augmentation. For example, when training on CIFAR-10/100 with 30% of the full training budget, a uniform sampling strategy with certain data augmentation surpasses the performance of 100%-budget models trained with standard data augmentation. We conclude from our work that DNNs under budget restrictions benefit greatly from variety in the samples, and that finding the right samples to train on is not the most effective strategy when balancing high performance with low computational requirements. The code will be released after the review process.

1. INTRODUCTION

The availability of vast amounts of labeled data is crucial in training deep neural networks (DNNs) (Mahajan et al., 2018; Xie et al., 2020). Despite prompting considerable advances in many computer vision tasks (Yao et al., 2018; Sun et al., 2019a), this dependence poses two challenges: the generation of the datasets and the large computation requirements that arise as a result. Research addressing the former has experienced great progress in recent years via novel techniques that reduce the strong supervision required to achieve top results (Tan & Le, 2019; Touvron et al., 2019) by, e.g., improving semi-supervised learning (Berthelot et al., 2019; Arazo et al., 2020), few-shot learning (Zhang et al., 2018b; Sun et al., 2019b), self-supervised learning (He et al., 2020; Misra & Maaten, 2020), or training with noisy web labels (Arazo et al., 2019; Li et al., 2020a). The latter challenge has also experienced many advances from the side of network efficiency via DNN compression (Dai et al., 2018; Lin et al., 2019) or neural architecture search (Tan & Le, 2019; Cai et al., 2019), and of optimization efficiency by better exploiting the embedding space (Khosla et al., 2020; Kim et al., 2020). All these approaches are designed under a common constraint: the large dataset size needed to achieve top results (Xie et al., 2020), which conditions the success of the training process on computational resources. Conversely, a smart reduction of the number of samples used during training can alleviate this constraint (Katharopoulos & Fleuret, 2018; Mirzasoleiman et al., 2020). The selection of samples plays an important role in the optimization of DNN parameters during training, where Stochastic Gradient Descent (SGD) (Dean et al., 2012; Bottou et al., 2018) is often used. SGD guides the parameter updates using the estimation of model error gradients over sets of samples (mini-batches) that are selected uniformly at random in an iterative fashion.
This strategy assumes equal importance across samples, whereas other works suggest that alternative strategies for revisiting samples are more effective in achieving better performance (Chang et al., 2017; Kawaguchi & Lu, 2020) and faster convergence (Katharopoulos & Fleuret, 2018; Jiang et al., 2019). Similarly, the
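To make the contrast concrete, the following minimal sketch (not the paper's implementation; the loss values and batch size are hypothetical) compares standard uniform mini-batch selection with a loss-proportional importance sampling scheme, including the 1/(N·p_i) reweighting commonly used to keep the gradient estimate unbiased (as in Katharopoulos & Fleuret, 2018):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical current per-sample losses for a small training set.
losses = rng.uniform(0.1, 2.0, size=100)
batch_size = 16
n = len(losses)

# Uniform sampling: every sample is equally likely (standard SGD).
uniform_batch = rng.choice(n, size=batch_size, replace=False)

# Importance sampling: selection probability proportional to the loss,
# so harder samples are revisited more often.
p = losses / losses.sum()
importance_batch = rng.choice(n, size=batch_size, replace=False, p=p)

# Reweight each sampled gradient by 1 / (N * p_i) so the expected
# update matches the uniform-sampling gradient (unbiased estimate).
weights = 1.0 / (n * p[importance_batch])
```

Under uniform sampling every p_i equals 1/N, so all weights reduce to 1 and the scheme collapses back to standard SGD.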

