SOFT SAMPLING FOR EFFICIENT TRAINING OF DEEP NEURAL NETWORKS ON MASSIVE DATA

Anonymous

Abstract

We investigate soft sampling, a simple yet effective approach for efficient training of large-scale deep neural network models on massive data. Soft sampling selects a subset uniformly at random with replacement from the full data set in each epoch. First, we derive a theoretical convergence guarantee for soft sampling on non-convex objective functions and give its convergence rate. Next, we analyze the data coverage and occupancy properties of soft sampling from the perspective of the coupon collector's problem. Finally, we evaluate soft sampling on a range of machine learning tasks and network architectures and demonstrate its effectiveness. Compared to existing coreset-based data selection methods, soft sampling offers a better accuracy-efficiency trade-off. In particular, on real-world industrial-scale data sets, it achieves significant speedup and competitive performance at almost no additional computing cost.

1. INTRODUCTION

Deep learning (LeCun et al., 2015) has made great progress in a broad variety of domains in recent years (Silver et al., 2016; Esteva et al., 2017; Saon et al., 2017; Xiong et al., 2017). The high performance of deep neural network models with huge numbers of parameters relies on large amounts of training data (Brown et al., 2020; Parthasarathi et al., 2019; Chowdhery et al., 2022). This comes at the cost of long training times and demands substantial computing and storage resources. High computational complexity can become a barrier to the hyper-parameter tuning and model validation steps that are crucial for real-world deployments. In this situation, data selection is often used to choose a representative subset of the training data to speed up training while maintaining decent model performance. Subset selection has been shown to be an effective approach to alleviating the computational cost of large-scale machine learning (Mirzasoleiman et al., 2020a; Borsos et al., 2021; Kowal, 2022; Guo et al., 2022). It is also used in distributed training to reduce communication cost (Reddi et al., 2015) and in active learning to create compact sets for human labeling (Hakkani-Tur et al., 2002; Tur et al., 2003; Kaushal et al., 2019; Coleman et al., 2020). Usually a subset is selected according to some criterion such that a model trained on the subset performs comparably to one trained on the whole data set, but with much less data and computing effort. A variety of criteria have been introduced in the literature. For instance, a diversity reward is used in (Lin & Bilmes, 2011) for document summarization and in (Kaushal et al., 2019) for computer vision (CV). Text similarity and saturated coverage are used in (Wei et al., 2013) to select acoustic data for automatic speech recognition (ASR). The maximum entropy principle is applied in (Wu et al., 2007; Yu et al., 2009) to select an informative data subset.
Confidence scores from a well-trained model are used in (Hakkani-Tur et al., 2002; Tur et al., 2003) to select the most uncertain samples for labeling in active learning. In (Sivasubramanian et al., 2021) error bounds on the validation set are taken into account when selecting a data subset for ℓ2-regularized regression problems for better model generalization. In (Mirzasoleiman et al., 2020a; Killamsetty et al., 2021a) subsets are selected to closely approximate the full gradient for training machine learning models with incremental gradient methods. Constructing an optimal subset is combinatorial and NP-hard in principle. In (Wei et al., 2015; 2014b;a; Kirchhoff & Bilmes, 2014; Killamsetty et al., 2021b) subsets are selected by leveraging submodular functions with diminishing returns, where subset selection can be formulated as constrained submodular cover optimization (Fujishige, 1991). Subset selection is also viewed as summarizing the full data set with a coreset (e.g., a weighted subset of samples) in (Mirzasoleiman et al., 2020a;b; Reddi et al., 2015; Coleman et al., 2020; Killamsetty et al., 2021a). Most subset construction algorithms are greedy and computationally efficient, and some provide provable approximation guarantees relative to the solution on the full data set. In many existing data selection approaches, the selection is hard: a subset of the full data is selected once, models are trained on this fixed subset, and samples outside the subset are discarded entirely (Wu et al., 2007; Lin & Bilmes, 2011; Wei et al., 2014b). Furthermore, to reduce the cost of data selection itself, an additional lightweight proxy model is introduced for selecting subsets in a family of so-called selection-via-proxy (SVP) methods (Coleman et al., 2020; Sachdeva et al., 2021).
However, even with relatively efficient greedy subset construction or selection via proxy, many existing data selection techniques still suffer from scaling issues when dealing with large amounts of data and high-capacity models, due to demanding processing time and memory requirements (Wei et al., 2014a; Mirzasoleiman et al., 2020a). In this paper we propose soft sampling, a simple but effective approach to training models efficiently with reduced data size. Soft sampling selects a subset uniformly at random with replacement from the full data set for each training epoch, so every data sample is sampled with non-zero probability. The selection is agnostic to loss functions and models. Compared to deterministic loss/cost-function-based data selection methods, soft sampling is significantly faster and requires no additional memory, which makes it well suited for training deep neural networks with incremental gradient techniques such as stochastic gradient descent (SGD) and its variants. Randomized subset selection has appeared in the literature (Pooladzandi et al., 2022; Killamsetty et al., 2021a; Guo et al., 2022), where it is mostly treated as an underperforming baseline: it is either compared with coreset selection methods on small datasets at a very low selection percentage (e.g., 1%) or not investigated at its full strength in comparative studies with other subset selection techniques. In this work we assess random subset selection as a low-cost, highly computationally efficient data selection approach for training deep models with large numbers of parameters on large-scale datasets. We study this approach both theoretically and practically: we show that soft sampling is guaranteed to converge and give its convergence rate, and we analyze its statistical properties in terms of sample coverage and occupancy.
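The per-epoch selection described above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation; the helper name soft_sample_epochs and the specific sizes are our own for exposition.

```python
import numpy as np

def soft_sample_epochs(n_samples, subset_size, n_epochs, seed=0):
    """Yield one index array per epoch, drawn uniformly at random
    WITH replacement from the full data set (soft sampling)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        # Each of the subset_size draws picks any sample with
        # probability 1/n, independently across draws and epochs,
        # so every sample has a non-zero chance of being trained on.
        yield rng.integers(0, n_samples, size=subset_size)

# Example: 30% soft sampling over 5 epochs of a 1000-sample data set.
# A training loop would run one pass of SGD over data[idx] per epoch.
n, s = 1000, 300
epochs = [idx for idx in soft_sample_epochs(n, s, n_epochs=5)]
```

Because the indices are redrawn every epoch, samples left out in one epoch can still be visited in later ones, in contrast to a hard, fixed subset.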
Experiments extensively evaluate its effectiveness on a variety of image classification and speech recognition datasets. We show that soft sampling obtains competitive or superior performance compared with existing high-performance data selection approaches while being much more efficient in speed and memory usage.
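The coverage and occupancy properties mentioned above can be illustrated with a classical occupancy calculation: when s draws are made with replacement from n items, each item is missed with probability (1 - 1/n)^s ≈ e^(-s/n), so the expected fraction of distinct samples seen is 1 - (1 - 1/n)^s. The sketch below checks this numerically; it illustrates the flavor of the analysis only, and the exact quantities derived in the paper may differ.

```python
import numpy as np

def expected_coverage(n, s):
    """Expected fraction of distinct items seen after s uniform
    draws with replacement from n items (occupancy problem)."""
    return 1.0 - (1.0 - 1.0 / n) ** s

rng = np.random.default_rng(0)
n, s = 10_000, 10_000
# Empirical coverage of one soft-sampling epoch at a 100% sampling
# rate: count distinct indices among s draws with replacement.
emp = len(np.unique(rng.integers(0, n, size=s))) / n
# At s = n, theory predicts roughly 1 - 1/e ≈ 0.632 of the data
# is visited in a single epoch; repeated epochs close the gap.
theo = expected_coverage(n, s)
```

This is why soft sampling at a given per-epoch budget still covers nearly all of the data over multiple epochs, echoing the coupon collector's problem referenced in the abstract.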

2. RELATED WORK

Subset selection is cast as submodular optimization in (Lin & Bilmes, 2009; 2011; Wei et al., 2013; 2014b;a; 2015; Kirchhoff & Bilmes, 2014; Mirzasoleiman et al., 2015), where submodular functions are defined on discrete sets and optimized under constraints (e.g., the cardinality of the selected subset). Submodular-optimization-based subset selection is mathematically rigorous: under mild conditions, a simple greedy implementation is theoretically guaranteed to be only a constant fraction away from the optimal solution. However, despite the availability of a rich class of functions, suitable submodular functions still need to be carefully chosen and tailored to the problem at hand given the computational complexity and scale of the data. Furthermore, once the subset is selected, it is usually fixed throughout training regardless of the iteratively updated model. Coreset algorithms have been explored in (Mirzasoleiman et al., 2020a; Killamsetty et al., 2021b;a; Pooladzandi et al., 2022), where weighted subsets are selected to summarize desired properties of the full data for efficient training. GLISTER (Killamsetty et al., 2021b) selects a coreset that maximizes the log-likelihood on a validation set. CRAIG (Mirzasoleiman et al., 2020a) and GRAD-MATCH (Killamsetty et al., 2021a) each find a coreset that closely approximates the full gradient. ADACORE (Pooladzandi et al., 2022) extracts a coreset that dynamically approximates the curvature of the loss function based on the Hessian matrix. CRAIG, GRAD-MATCH and ADACORE are all adaptive methods, which are shown to achieve superior performance over a fixed subset. ADACORE relies on second-order statistics, which are more computationally demanding, while CRAIG and GRAD-MATCH search for first-order coresets, which are computationally more efficient. In this work, we compare the accuracy-efficiency trade-off between soft

