AUTOSAMPLING: SEARCH FOR EFFECTIVE DATA SAMPLING SCHEDULES

Abstract

Data sampling plays a pivotal role in training deep learning models. However, an effective sampling schedule is difficult to learn because, viewed as a hyper-parameter, it is inherently high-dimensional. In this paper, we propose the AutoSampling method to automatically learn sampling schedules for model training. It consists of a multi-exploitation step that aims for optimal local sampling schedules and an exploration step that searches for the ideal sampling distribution. More specifically, we search for sampling schedules with a shortened exploitation cycle to provide enough supervision. In addition, we periodically estimate the sampling distribution from the learned sampling schedules and perturb it to search in the distribution space. The combination of the two searches allows us to learn a robust sampling schedule. We apply our AutoSampling method to a variety of image classification tasks, illustrating the effectiveness of the proposed method.

1. INTRODUCTION

Data sampling policies can greatly influence the performance of model training in computer vision tasks, and finding robust sampling policies is therefore important. Handcrafted rules, e.g. data resampling, reweighting, and importance sampling, promote better model performance by adjusting the frequency and order of the training data (Estabrooks et al., 2004; Weiss et al., 2007; Bengio et al., 2009; Johnson & Guestrin, 2018; Katharopoulos & Fleuret, 2018; Shrivastava et al., 2016; Jesson et al., 2017). However, handcrafted rules rely heavily on assumptions about the dataset and cannot adapt well to datasets with their own characteristics. To handle this issue, learning-based methods (Li et al., 2019; Jiang et al., 2017; Fan et al., 2017) were designed to automatically reweight or select training data using meta-learning techniques or a policy network. However, existing learning-based sampling methods still rely on human priors as proxies to optimize sampling policies, which may fail in practice. Such priors often include assumptions on policy network design for data selection (Fan et al., 2017), or dataset conditions like noisiness (Li et al., 2019; Loshchilov & Hutter, 2015) or imbalance (Wang et al., 2019). These approaches take image features, losses, importance or their representations as inputs and use a policy network or other learning approaches with a small number of parameters to estimate the sampling probability. However, images with similar visual features, for example, can be redundant in training, yet their losses or the features fed into the policy network are likely to be close, so redundant samples receive nearly the same sampling probability if we rely on the aforementioned priors. Therefore, we propose to directly optimize the sampling schedule itself so that no prior knowledge about the dataset is required. Specifically, the sampling schedule refers to the order in which data are selected over the entire training course.
In this way, we rely only on the data themselves to determine the optimal sampling schedule, without any prior.

Directly optimizing a sampling schedule is challenging due to its inherent high dimension. For example, for the ImageNet classification dataset (Deng et al., 2009) with around one million samples, the number of parameters would be of the same order. While popular approaches such as deep reinforcement learning (Cubuk et al., 2018; Zhang et al., 2020), Bayesian optimization (Snoek et al., 2015), population-based training (Jaderberg et al., 2017) or simple random search (Bergstra & Bengio, 2012) have already been used to tune low-dimensional hyper-parameters like augmentation schedules, their application to directly finding good sampling schedules remains unexplored. For instance, the dimension of a data augmentation policy is generally only in the dozens, yet it takes thousands of training runs (Cubuk et al., 2018) to sample enough rewards to find an optimal augmentation policy, because high-quality rewards require many epochs of training to obtain. As such, optimizing a sampling schedule may require gathering orders of magnitude more rewards, and hence more training runs, than data augmentation, which results in prohibitively slow convergence.

To overcome the aforementioned challenge, we propose a data sampling policy search framework, named AutoSampling, to sufficiently learn an optimal sampling schedule in a population-based training fashion (Jaderberg et al., 2017). Unlike previous methods, which focus on collecting long-term rewards and updating hyper-parameters or agents offline, our AutoSampling method collects rewards online with a shortened collection cycle but without priors. Specifically, AutoSampling collects rewards within several training iterations, a cycle tens or hundreds of times shorter than in existing works (Ho et al., 2019; Cubuk et al., 2018).
In this manner, we provide the search process with much more frequent feedback to ensure sufficient optimization of the sampling schedule. Each time a few training iterations pass, we collect the rewards from those iterations, accumulate them, and later update the sampling distribution using the accumulated rewards. Then we perturb the sampling distribution to search in the distribution space, and use it to generate new mini-batches for later iterations, which are recorded into the output sampling schedule. As illustrated in Sec. 4.1, shortened collection cycles with less interference can also better reflect the training value of each data sample.
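The loop described above can be illustrated with a minimal sketch. This is not the implementation used in this work: it assumes that the sampling distribution is estimated from smoothed selection counts, that exploration uses multiplicative uniform noise, and that a caller-supplied `reward_fn` scores a short run of mini-batches. All names and parameters (`estimate_distribution`, `perturb_distribution`, `search_schedule`, `reward_fn`, `num_workers`, `explore_every`, `smoothing`, `scale`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def estimate_distribution(schedule, num_samples, smoothing=1.0):
    """Estimate sampling probabilities from a recorded schedule.

    Assumption: probabilities are proportional to smoothed selection counts.
    `schedule` is a list of integer index arrays (one per mini-batch).
    """
    counts = np.bincount(np.concatenate(schedule), minlength=num_samples)
    weights = counts.astype(float) + smoothing
    return weights / weights.sum()


def perturb_distribution(probs, rng, scale=0.2):
    """Explore the distribution space with multiplicative noise (assumed form)."""
    noise = rng.uniform(1.0 - scale, 1.0 + scale, size=probs.shape)
    perturbed = probs * noise
    return perturbed / perturbed.sum()


def search_schedule(reward_fn, num_samples, batch_size, num_cycles,
                    batches_per_cycle=4, num_workers=3, explore_every=5, seed=0):
    """Alternate short exploitation cycles with periodic exploration.

    `reward_fn` is a hypothetical callable mapping a list of mini-batches
    to a scalar reward; in practice it would come from model training.
    """
    rng = np.random.default_rng(seed)
    probs = np.full(num_samples, 1.0 / num_samples)  # start from uniform sampling
    schedule = []
    for cycle in range(num_cycles):
        # Exploitation: each worker proposes a few mini-batches (drawn with
        # replacement from the current distribution); keep the best-rewarded run.
        candidates = [
            [rng.choice(num_samples, size=batch_size, p=probs)
             for _ in range(batches_per_cycle)]
            for _ in range(num_workers)
        ]
        rewards = [reward_fn(batches) for batches in candidates]
        schedule.extend(candidates[int(np.argmax(rewards))])
        # Exploration: periodically re-estimate the distribution from the
        # recorded schedule and perturb it.
        if (cycle + 1) % explore_every == 0:
            probs = perturb_distribution(
                estimate_distribution(schedule, num_samples), rng)
    return schedule, probs


def toy_reward(batches):
    """Toy stand-in for a training reward: counts even indices in the batches."""
    return sum(int(np.sum(batch % 2 == 0)) for batch in batches)


schedule, probs = search_schedule(toy_reward, num_samples=20,
                                  batch_size=4, num_cycles=10)
```

Here `schedule` holds the recorded mini-batches in the order they were selected and `probs` is the last perturbed distribution. In a real setting the toy reward would be replaced by a signal measured during training, which this sketch does not model.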

Our contributions are as follows:

• To the best of our knowledge, we are the first to propose directly learning a robust sampling schedule from the data themselves without any human prior or condition on the dataset.
• We propose the AutoSampling method to handle the optimization difficulty caused by the high dimension of sampling schedules, and to efficiently learn a robust sampling schedule through a shortened reward collection cycle and online updates of the sampling schedule. Comprehensive experiments on the CIFAR-10/100 and ImageNet datasets (Krizhevsky, 2009; Deng et al., 2009) with different networks show that AutoSampling can increase the top-1 accuracy by up to 2.85% on CIFAR-10, 2.19% on CIFAR-100, and 2.83% on ImageNet.

2. BACKGROUND

2.1 RELATED WORK

Data sampling is of great significance to deep learning and has been extensively studied. Approaches with human-designed rules use pre-defined heuristics to modify the frequency and order in which training data are presented. In particular, one intuitive method is to resample or reweight data according to their frequencies, difficulties or importance in training (Estabrooks et al., 2004; Weiss et al., 2007; Drummond et al., 2003; Bengio et al., 2009; Lin et al., 2017; Shrivastava et al., 2016; Loshchilov & Hutter, 2015; Wang et al., 2019; Johnson & Guestrin, 2018; Katharopoulos & Fleuret, 2018; Byrd & Lipton, 2018; Jesson et al., 2017). These methods have been widely used in imbalanced training or hard mining problems. However, they are often restricted to the tasks and datasets for which they were proposed, and their ability to generalize to a broader range of tasks with different data distributions may be limited. In other words, these methods often implicitly assume certain conditions on the dataset, such as cleanness or imbalance. In addition, learning-based methods have been proposed for finding suitable sampling schemes automatically. Methods using meta-learning or reinforcement learning have also been used to automatically select or reweight data during training (Li et al., 2019; Jiang et al., 2017; Ren et al., 2018; Fan et al., 2017), but they have only been tested on small-scale or noisy datasets. Whether they can generalize to tasks on other datasets remains untested. In this work, we directly study data sampling without any prior, and we also investigate its generalization ability across different datasets such as CIFAR-10, CIFAR-100 and ImageNet using many typical networks.
As for hyper-parameter tuning, popular approaches such as deep reinforcement learning (Cubuk et al., 2018; Zhang et al., 2020), Bayesian optimization (Snoek et al., 2015) or simple random search (Bergstra & Bengio, 2012) have already been used to tune low-dimensional hyper-parameters and have proven effective. Nevertheless, they have not been adopted to find good sampling schedules due to the inherent high dimension of such schedules. Some recent works tackle the challenge of optimizing high-dimensional hyper-parameters. MacKay et al. (2019) use structured best-response functions, and Jonathan Lorraine (2019) achieves this goal through a combination of the implicit function theorem and efficient inverse Hessian approximations. However, these methods have not been tested on the task of optimizing sampling schedules, which is the major focus of our work in this paper.

