HOW IMPORTANT IS IMPORTANCE SAMPLING FOR DEEP BUDGETED TRAINING?

Abstract

Long iterative training processes for Deep Neural Networks (DNNs) are commonly required to achieve state-of-the-art performance in many computer vision tasks. Core-set selection and importance sampling approaches might play a key role in budgeted training regimes, i.e. when limiting the number of training iterations. The former demonstrates that retaining informative samples is important to avoid large drops in accuracy, while the latter aims at dynamically estimating sample importance to speed up convergence. This work explores this paradigm and how a budget constraint interacts with importance sampling approaches and data augmentation techniques. We show that, under budget restrictions, importance sampling approaches do not provide a consistent improvement over uniform sampling. We suggest that, given a specific budget, the best course of action is to disregard the importance and introduce adequate data augmentation. For example, when training on CIFAR-10/100 with 30% of the full training budget, a uniform sampling strategy combined with certain data augmentation surpasses the performance of 100%-budget models trained with standard data augmentation. We conclude from our work that DNNs under budget restrictions benefit greatly from variety in the samples and that finding the right samples to train on is not the most effective strategy when balancing high performance with low computational requirements. The code will be released after the review process.

1. INTRODUCTION

The availability of vast amounts of labeled data is crucial in training deep neural networks (DNNs) (Mahajan et al., 2018; Xie et al., 2020). Despite prompting considerable advances in many computer vision tasks (Yao et al., 2018; Sun et al., 2019a), this dependence poses two challenges: the generation of the datasets and the large computation requirements that arise as a result. Research addressing the former has experienced great progress in recent years via novel techniques that reduce the strong supervision required to achieve top results (Tan & Le, 2019; Touvron et al., 2019) by, e.g., improving semi-supervised learning (Berthelot et al., 2019; Arazo et al., 2020), few-shot learning (Zhang et al., 2018b; Sun et al., 2019b), self-supervised learning (He et al., 2020; Misra & Maaten, 2020), or training with noisy web labels (Arazo et al., 2019; Li et al., 2020a). The latter challenge has also experienced many advances from the side of network efficiency via DNN compression (Dai et al., 2018; Lin et al., 2019) or neural architecture search (Tan & Le, 2019; Cai et al., 2019), and optimization efficiency by better exploiting the embedding space (Khosla et al., 2020; Kim et al., 2020). All these approaches are designed under a common constraint: the large dataset size needed to achieve top results (Xie et al., 2020), which conditions the success of the training process on computational resources. Conversely, a smart reduction of the number of samples used during training can alleviate this constraint (Katharopoulos & Fleuret, 2018; Mirzasoleiman et al., 2020).

The selection of samples plays an important role in the optimization of DNN parameters during training, where Stochastic Gradient Descent (SGD) (Dean et al., 2012; Bottou et al., 2018) is often used. SGD guides the parameter updates using estimates of the model error gradients over sets of samples (mini-batches) that are selected uniformly at random in an iterative fashion.
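As a toy illustration, the uniform mini-batch selection described above can be sketched as follows (the function name and the quadratic toy objective used in the usage example are our own, not from any cited work):

```python
import random

def sgd_loop(params, grad_fn, data, lr, batch_size, steps):
    """Minimal SGD loop: every iteration draws a mini-batch of indices
    uniformly at random (i.e. equal importance for all samples) and
    updates the parameters with the averaged gradient estimate over
    that mini-batch."""
    for _ in range(steps):
        # Uniform random selection: every sample is equally likely.
        batch = random.sample(range(len(data)), batch_size)
        grads = [grad_fn(params, data[i]) for i in batch]
        # Average the per-sample gradients to estimate the true gradient.
        avg = [sum(g[j] for g in grads) / batch_size
               for j in range(len(params))]
        params = [p - lr * g for p, g in zip(params, avg)]
    return params
```

For example, minimizing the scalar loss (p - x)^2 over a dataset of identical targets x = 3 drives the single parameter towards 3.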
This strategy assumes equal importance across samples, whereas other works suggest that alternative strategies for revisiting samples are more effective in achieving better performance (Chang et al., 2017; Kawaguchi & Lu, 2020) and faster convergence (Katharopoulos & Fleuret, 2018; Jiang et al., 2019). Similarly, the selection of a unique and informative subset of samples (core-set) (Toneva et al., 2018; Coleman et al., 2020) can alleviate the computation requirements during training, while reducing the performance drop with respect to training on all data. However, while removing data samples speeds up the training, a precise sample selection often requires a pre-training stage that hinders the ability to reduce computation (Mirzasoleiman et al., 2020; Sener & Savarese, 2018). A possible solution to this limitation might be to dynamically change the important subset during training, as done by importance sampling methods (Amiri et al., 2017; Zhang et al., 2019b), which select the samples based on a sampling probability distribution that evolves with the model and often changes based on the loss or the network logits (Loshchilov & Hutter, 2015; Johnson & Guestrin, 2018). An up-to-date importance estimation is key for current methods to succeed but is, in practice, infeasible to compute (Katharopoulos & Fleuret, 2018): the real importance of a sample changes after every iteration, and estimates become outdated, yielding considerable drops in performance (Chang et al., 2017; Zhang et al., 2019b). Importance sampling methods, then, focus on selecting samples and achieve a speed-up during training as a side effect. They do not, however, strictly study possible benefits on DNN training when restricting the number of iterations used for training, i.e. the budget. Budgeted training (Nan & Saligrama, 2017; Kachuee et al., 2019; Li et al., 2020b) imposes an additional constraint on the optimization of a DNN: a maximum number of iterations.
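A minimal sketch of such loss-based selection, assuming per-sample loss estimates are kept between iterations (the function name, the uniform smoothing floor, and its default value are illustrative choices of ours, not taken from any of the cited methods):

```python
import random

def loss_proportional_batch(loss_estimates, batch_size, smoothing=0.1):
    """Draw a mini-batch of indices with replacement, with each sample's
    probability proportional to its (possibly stale) loss estimate.
    A uniform floor (`smoothing`) keeps low-loss samples from never
    being revisited, since their estimates may be outdated."""
    n = len(loss_estimates)
    total = sum(loss_estimates)
    # Mix the loss-proportional distribution with a uniform one.
    probs = [(1 - smoothing) * l / total + smoothing / n
             for l in loss_estimates]
    return random.choices(range(n), weights=probs, k=batch_size)
```

With one high-loss sample among many low-loss ones, the high-loss sample dominates the draws, which is precisely the behaviour that becomes harmful when the loss estimates are out of date.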
Defining this budget provides a concise notion of the limited training resources. Li et al. (2020b) propose to address the budget limitation using specific learning rate schedules that better suit this scenario. Despite the standardized scenario that budgeted training poses for evaluating methods that reduce computation requirements, there are few works to date in this direction (Li et al., 2020b; Katharopoulos & Fleuret, 2018). As mentioned, importance sampling methods are closely related, but the avoidance of budget restrictions makes it difficult to understand their utility, given the sensitivity to hyperparameters that they often exhibit (Chang et al., 2017; Loshchilov & Hutter, 2015). In this paper, we overcome the limitations outlined above by analyzing the effectiveness of importance sampling methods when a budget restriction is imposed (Li et al., 2020b). Given a budget restriction, we study the synergies between importance sampling and data augmentation (Takahashi et al., 2018; Cubuk et al., 2020; Zhang et al., 2018a). We find that the improvements of importance sampling approaches over uniform random sampling are not always consistent across budgets and datasets. We argue and experimentally confirm (see Section 4.4) that when using certain data augmentation (Takahashi et al., 2018; Cubuk et al., 2020; Zhang et al., 2018a), existing importance sampling techniques do not provide further benefits, making data augmentation the most effective strategy to exploit a given budget.
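For concreteness, one of the augmentations cited above, mixup (Zhang et al., 2018a), builds synthetic training examples by interpolating pairs of inputs and their one-hot labels. A minimal sketch (the `alpha` default is a common choice in the literature, not a value prescribed by this paper):

```python
import random

def mixup_pair(x1, y1, x2, y2, alpha=0.3):
    """mixup: form a synthetic example as a convex combination of two
    inputs and of their one-hot label vectors. The mixing weight is
    drawn from a Beta(alpha, alpha) distribution."""
    lam = random.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam
```

Because inputs and labels are mixed with the same weight, the resulting targets remain valid probability distributions, which is what lets mixup increase sample variety without extra labeling cost.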

2. RELATED WORK

Few works exploit a budgeted training paradigm (Li et al., 2020b). Instead, many approaches aim to speed up the training convergence to a given performance by computing a better sampling strategy or carefully organizing the samples to allow the CNN to learn faster and generalize better. Other works, however, explore how to improve model performance by labeling the most important samples from an unlabeled set of data (Yoo & Kweon, 2019; Ash et al., 2020; Ren et al., 2020), or how to better train DNNs when a limited number of samples per class is available (Chen et al., 2019; Zhou et al., 2020; Albert et al., 2020). This section reviews relevant works aiming to improve the efficiency of DNN training.

Self-paced learning (SPL) and curriculum learning (CL) aim to optimize the training process and improve model performance by ordering the samples from easy to difficult (Weinshall et al., 2018; Bengio et al., 2009; Hacohen & Weinshall, 2019; Cheng et al., 2019). For instance, CL manages to speed up convergence in the initial training stages by focusing on samples whose gradients are better estimations of the real gradient (Weinshall et al., 2018). The main drawback of these methods is that, in most cases, the order of the samples (the curriculum) has to be defined before training, which is itself a costly task that requires manually assessing the sample difficulty, transferring knowledge from a fully trained model, or pre-training the model on the given dataset. Some approaches remedy this drawback with a simple curriculum (Lin et al., 2017) or by learning the curriculum during training (Jiang et al., 2018); these methods, however, do not aim to speed up

