SOFT SAMPLING FOR EFFICIENT TRAINING OF DEEP NEURAL NETWORKS ON MASSIVE DATA

Anonymous

Abstract

We investigate soft sampling, a simple yet effective approach for efficiently training large-scale deep neural network models on massive data. Soft sampling selects a subset uniformly at random with replacement from the full data set in each epoch. First, we derive a theoretical convergence guarantee for soft sampling on non-convex objective functions and give its convergence rate. Next, we analyze the data coverage and occupancy properties of soft sampling from the perspective of the coupon collector's problem. Finally, we evaluate soft sampling on a range of machine learning tasks and network architectures and demonstrate its effectiveness. Compared to existing coreset-based data selection methods, soft sampling offers a better accuracy-efficiency trade-off. In particular, on real-world industrial-scale data sets, soft sampling achieves significant speedup and competitive performance with almost no additional computing cost.

1. INTRODUCTION

Deep learning (LeCun et al., 2015) has made great progress in a broad variety of domains in recent years (Silver et al., 2016; Esteva et al., 2017; Saon et al., 2017; Xiong et al., 2017). The high performance of deep neural network models with huge numbers of parameters relies on large amounts of training data (Brown et al., 2020; Parthasarathi et al., 2019; Chowdhery et al., 2022). This comes at the cost of long training times and demands substantial computing and storage resources. High computational complexity can become a barrier to the hyper-parameter tuning and model validation steps that are crucial for real-world deployments. In this situation, data selection is often used to choose a representative subset of the entire training data to speed up training while maintaining decent model performance. Subset selection has been shown to be an effective approach to alleviating the computational cost of large-scale machine learning (Mirzasoleiman et al., 2020a; Borsos et al., 2021; Kowal, 2022; Guo et al., 2022). It is also used in distributed training to reduce communication cost (Reddi et al., 2015) and in active learning to create compact sets for human labeling (Hakkani-Tur et al., 2002; Tur et al., 2003; Kaushal et al., 2019; Coleman et al., 2020). Usually a subset is selected according to some criterion such that the performance of a model trained on the subset is comparable to one trained on the whole data set, but with much less data and computational effort. A variety of criteria have been introduced for various applications in the literature. For instance, a diversity reward is used in (Lin & Bilmes, 2011) for document summarization and in (Kaushal et al., 2019) for computer vision (CV). Text similarity and saturated coverage are used in (Wei et al., 2013) to select acoustic data for automatic speech recognition (ASR). The maximum entropy principle is applied in (Wu et al., 2007; Yu et al., 2009) to select an informative data subset.
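In contrast to the criterion-based selection methods above, the soft sampling scheme studied here (as defined in the abstract) simply re-draws a subset uniformly at random with replacement at the start of every epoch. A minimal sketch of this loop is shown below; the function names and the `subset_fraction` parameter are illustrative, not from the paper.

```python
import random


def soft_sample(dataset, subset_size, rng=random):
    """Draw a training subset uniformly at random WITH replacement.

    Called once per epoch, so each epoch sees a fresh random multiset of
    examples rather than a fixed precomputed coreset.
    """
    return [dataset[rng.randrange(len(dataset))] for _ in range(subset_size)]


def train(model_step, dataset, epochs, subset_fraction=0.3):
    """Illustrative training loop: one SGD-style update per sampled example."""
    n = len(dataset)
    k = max(1, int(subset_fraction * n))
    for _ in range(epochs):
        subset = soft_sample(dataset, k)  # re-drawn every epoch
        for example in subset:
            model_step(example)
```

Because no per-example scores or gradients are computed to pick the subset, the selection step itself adds almost no overhead, which is the source of the "almost no additional computing cost" claim.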
Confidence scores from a well-trained model are used in (Hakkani-Tur et al., 2002; Tur et al., 2003) to select the most uncertain subset for labeling in active learning. In (Sivasubramanian et al., 2021) error bounds on the validation set are taken into account when selecting a data subset for ℓ2-regularized regression problems for better model generalization. In (Mirzasoleiman et al., 2020a; Killamsetty et al., 2021a) subsets are selected to closely approximate the full gradient for training machine learning models with incremental gradient methods. Constructing an optimal subset is combinatorial and NP-hard in principle. In (Wei et al., 2015; 2014b;a; Kirchhoff & Bilmes, 2014; Killamsetty et al., 2021b) subsets are selected by leveraging submodular functions with diminishing returns, where the subset selection can be formulated as a submodular maximization problem.
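The coupon-collector perspective mentioned in the abstract gives a quick way to reason about how much of the data soft sampling actually touches: after k uniform draws with replacement from n examples, each example is missed with probability (1 - 1/n)^k, and the expected number of draws to see every example at least once is n·H_n, where H_n is the n-th harmonic number. A small sketch of these standard formulas (function names are ours, not the paper's):

```python
import math


def expected_coverage(n, draws):
    """Expected fraction of n examples seen at least once after `draws`
    uniform draws with replacement: 1 - (1 - 1/n)^draws."""
    return 1.0 - (1.0 - 1.0 / n) ** draws


def expected_draws_to_cover_all(n):
    """Classic coupon-collector result: E[draws to see all n examples]
    equals n * H_n, where H_n is the n-th harmonic number."""
    return n * sum(1.0 / i for i in range(1, n + 1))
```

For example, one full-size epoch of sampling with replacement (draws = n) covers only about 1 - 1/e ≈ 63.2% of the data in expectation, so multiple epochs are needed before most examples have been visited.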

