PROGRESSIVE DATA DROPOUT: AN ADAPTIVE TRAINING STRATEGY FOR LARGE-SCALE SUPERVISED LEARNING

Abstract

Common training strategies for deep neural networks are computationally expensive, continuing to redundantly train and evaluate on classes already well understood by the model. A common strategy to diminish this cost is to reduce the data used in training; however, this often comes at the expense of the model's accuracy or adds computational cost to training. We propose Progressive Data Dropout (PDD), an adaptive training strategy which performs class-level data dropout from the training set as the network develops an understanding of each class. Our experiments on large-scale image classification demonstrate that PDD reduces the total number of datapoints needed to train the network by a factor of 10, reducing overall training time without significantly impacting accuracy or modifying the model architecture. We additionally demonstrate improvements via experiments and ablations on computer vision benchmarks, including the MNIST, Fashion-MNIST, SVHN, CIFAR, and ImageNet datasets.

1. INTRODUCTION

Deep neural networks have made a significant impact on a broad range of applications over the last decade. However, these networks are notoriously data-intensive, often requiring significant computational power and large datasets to train to optimal performance. This can become problematic for many real-world applications, as the computational expense of training these networks often prevents them from being adopted. Many optimization techniques have arisen to address this problem in training, from reducing neurons to subsampling data. In this work, we focus on reducing the computational time and cost needed to fully train any deep network, without modifying the model and while utilizing the entire training dataset.

To properly frame the discussion, we define some terms used throughout this paper. First, we define a datapoint as a single data sample that is sent through the network during the training process. For example, if you were to train a network using a dataset of 10 samples for 5 epochs, you would have used 50 datapoints to train that network. Our datapoint calculations thus count the number of samples sent to the network during training, rather than the number of unique data samples provided, because networks often need to see examples of a class multiple times before understanding them. Second, since our proposed method modifies the training process, the term epoch is no longer applicable, as we will not iterate over the entire dataset after dropping data. Instead, we use the term training rounds to indicate how many times we iterated over the remaining training set when training the network.

We now consider one simple research question: can data be dropped during training once it is well understood by a deep learning model?
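The datapoint accounting above can be made concrete with a short sketch. The shrinking schedule in the second example is purely illustrative of adaptive training, not a result from the paper:

```python
# Datapoint accounting: "datapoints" counts every sample presented to the
# network, summed over training rounds, not the number of unique samples.

def datapoints_used(samples_per_round):
    """Total datapoints, given the training-set size in each round."""
    return sum(samples_per_round)

# Plain epoch-based training: 10 samples for 5 epochs -> 50 datapoints.
assert datapoints_used([10] * 5) == 50

# Hypothetical adaptive schedule where data is dropped between rounds:
# the remaining set shrinks, so far fewer datapoints are consumed.
assert datapoints_used([10, 6, 3, 1]) == 20
```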
We propose Progressive Data Dropout (PDD), an adaptive training process which leverages the network's understanding of the data to determine when data should be dropped from the training process. In contrast with existing data dropout techniques, which focus on identifying "important" samples, we instead evaluate the simple case of full and partial class removal. The main contributions of our simple adaptive training strategy, PDD, include:

• Reduces the time and computational resources required for training.
• Model-agnostic implementation which works for any supervised single-label classification task.
• Data-agnostic implementation which does not preprocess or examine the data for sample quality, balance, etc.
• Provides an inherent stopping criterion for training models on sufficiently large datasets.

Further sections motivate the use of PDD for large-scale image classification tasks and describe the design of this simple strategy. We propose PDD as complementary to existing regularization techniques, as well as to recent learning/training strategies such as continual learning and curriculum learning.

2. RELATED WORK

There is a breadth of optimization techniques for training deep neural networks, motivated by the traditionally large data requirements of deep models. Given that we are proposing a generalized strategy for optimizing training, this section discusses several types of solutions: learning techniques, dropout for network regularization, and dropout of data during training.

Continual learning aims to prevent the catastrophic forgetting of the originally learned task (Veniat et al., 2021). More recently, Co2L (Cha et al., 2021) proposed a contrastive continual learning objective which learns and preserves representations through distillation. These curriculum and continual learning strategies are complementary to PDD in that our proposed method can be leveraged toward their respective objectives.

2.2 DROPOUT FOR NETWORK OPTIMIZATION & REGULARIZATION

While our proposed approach is focused on dropping data, the term dropout more often refers to the dropping of neurons during the training process for regularization of a neural network (Srivastava et al., 2014). Network dropout regularization techniques (K C et al., 2021) and pruning approaches (Tanaka et al., 2020) reduce the size of the network during training in response to input stimuli. Subsequent work (et al., 2017) showed that a fixed neuron dropout probability was sub-optimal and instead implemented a time scheduler for updating the dropout probability. An energy-based dropout proposed by EDropout (Salehinejad & Valaee, 2021) used an energy-based loss to find the best pruning to apply to the original neural network. Applied to different network architectures, Dropout-GAN dropped connections between a generator and multiple discriminators in a GAN in order to ensure diversity of generated samples, avoiding mode collapse (Mordido et al., 2018).

While dropout is effective for regularization, these methods often require significant modification of the network itself in order to be adopted. In comparison, our proposed approach is entirely model-agnostic, modifying only the number of samples used in training instead of the network itself.
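As a concrete illustration of class-level data dropout, the following is a minimal Python sketch. It assumes, purely for illustration, that a class counts as "well understood" once its validation accuracy crosses a fixed threshold; the paper's actual dropout criterion may differ, and all names here (`pdd_train`, `eval_class_accuracy`, etc.) are hypothetical:

```python
# Illustrative sketch of Progressive Data Dropout at the class level.
# ASSUMPTION: a class is dropped once its validation accuracy reaches a
# threshold; this criterion and all names here are illustrative only.

def pdd_train(dataset, train_one_round, eval_class_accuracy,
              threshold=0.95, max_rounds=20):
    """dataset is a list of (sample, label) pairs.

    Returns (datapoints_used, rounds_run).
    """
    active = {label for _, label in dataset}      # classes still in training
    remaining = list(dataset)
    datapoints, rounds = 0, 0
    while active and rounds < max_rounds:
        train_one_round(remaining)                # one pass over remaining data
        datapoints += len(remaining)
        rounds += 1
        acc = eval_class_accuracy()               # {label: accuracy in [0, 1]}
        # Drop every class the model now understands well enough.
        active = {c for c in active if acc.get(c, 0.0) < threshold}
        remaining = [(x, y) for x, y in remaining if y in active]
    # Once every class has been dropped, training stops on its own:
    # the inherent stopping criterion.
    return datapoints, rounds

# Toy usage with a fake "model" whose per-class accuracy improves by round.
state = {"round": 0}
data = [(i, i % 2) for i in range(10)]            # 5 samples per class

def fake_train(subset):
    state["round"] += 1                            # stand-in for a real update

def fake_acc():
    return {0: 1.0 if state["round"] >= 1 else 0.0,
            1: 1.0 if state["round"] >= 3 else 0.0}

used, rounds = pdd_train(data, fake_train, fake_acc)
# Class 0 is dropped after round 1, class 1 after round 3:
# 10 + 5 + 5 = 20 datapoints over 3 rounds, versus 30 for plain epochs.
```

The sketch also makes the stopping behavior explicit: training terminates as soon as the set of active classes is empty, rather than after a fixed epoch budget.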

