PROGRESSIVE DATA DROPOUT: AN ADAPTIVE TRAINING STRATEGY FOR LARGE-SCALE SUPERVISED LEARNING

Abstract

Common training strategies for deep neural networks are computationally expensive, continuing to redundantly train and evaluate on classes already well understood by the model. A common strategy to diminish this cost is to reduce the amount of data used in training; however, this often comes at the expense of the model's accuracy or incurs additional computational cost during training. We propose Progressive Data Dropout (PDD), an adaptive training strategy which performs class-level data dropout from the training set as the network develops an understanding of each class. Our experiments on large-scale image classification demonstrate that PDD reduces the total number of datapoints needed to train the network by a factor of 10, reducing the overall training time without significantly impacting accuracy or modifying the model architecture. We additionally demonstrate improvements via experiments and ablations on computer vision benchmarks, including the MNIST, Fashion-MNIST, SVHN, CIFAR, and ImageNet datasets.

1. INTRODUCTION

Deep neural networks have made a significant impact on a broad range of applications over the last decade. However, these networks are notoriously data-intensive, often requiring significant computational power and large datasets in order to train to optimal performance. This can become problematic for many real-world applications, as the computational expense of training these networks often prevents them from being adopted. Many optimization techniques have arisen to address this problem in training, ranging from reducing neurons to subsampling data. In this work, we focus on reducing the computational time and cost needed to fully train any deep network, without modifying the model and while utilizing the entire training dataset.

In order to properly frame the discussion, we need to define some terms that are used throughout this paper. First, we define a datapoint as a single data sample that is sent through the network during the training process. For example, if you were to train a network using a dataset of 10 samples for 5 epochs, you would have used 50 datapoints to train that network. Our datapoint calculations thus count the number of samples sent to the network during training rather than the number of unique data samples provided, because networks often need to see examples of a class multiple times before understanding them. Second, since our proposed method modifies the training process, the term epoch is no longer applicable, as we will not iterate over the entire dataset after dropping data. Instead, we use the term training rounds to indicate how many times we iterated over the remaining training set when training the network. We now consider one simple research question: Can data be dropped during training once it is well understood by a deep learning model?
We propose Progressive Data Dropout (PDD), an adaptive training process which leverages the network's understanding of the data to determine when data should be dropped from the training process. In comparison with existing data dropout techniques, which focus on identifying "important" samples, we instead evaluate the simple case of full and partial class removal. The main contributions of our simple adaptive training strategy, PDD, include:

