STOCHASTIC SUBSET SELECTION FOR EFFICIENT TRAINING AND INFERENCE OF NEURAL NETWORKS Anonymous

Abstract

Current machine learning algorithms are designed to work with huge volumes of high-dimensional data such as images. However, these algorithms are increasingly deployed to resource-constrained systems such as mobile devices and embedded systems. Even where large computing infrastructure is available, the size of each data instance, as well as that of the full dataset, can become a major bottleneck in data transfer across communication channels. Moreover, there are strong incentives, in both energy and monetary terms, to reduce the computational and memory requirements of these algorithms. For non-parametric models that must leverage the stored training data at inference time, the increased cost in memory and computation can be even more problematic. In this work, we aim to reduce the volume of data these algorithms must process through an end-to-end, two-stage neural subset selection model, in which the first stage selects a set of candidate points using a conditionally independent Bernoulli mask, followed by iterative coreset selection via a conditional Categorical distribution. The subset selection model is trained by meta-learning over a distribution of sets. We validate our method on set reconstruction and classification tasks with feature selection, as well as on the selection of representative samples from a given dataset, where it outperforms relevant baselines. Our experiments also show that our method enhances the scalability of non-parametric models such as Neural Processes.

1. INTRODUCTION

The recent success of deep learning algorithms is partly owed to the availability of huge volumes of data (Deng et al., 2009; Krizhevsky et al., 2009; Liu et al., 2015), which enables the training of very large deep neural networks. However, the high dimensionality of each data instance and the large size of datasets make it difficult, especially for resource-limited devices (Chan et al., 2018; Li et al., 2019; Bhatia et al., 2019), to store and transfer the dataset, or to perform on-device learning with the data. This problem is even more pronounced for non-parametric models such as Neural Processes (Hensel, 1973; Kim et al., 2019a), which require the training dataset to be stored for inference. It is therefore appealing to reduce the size of the data, both at the instance level (Dovrat et al., 2019; Li et al., 2018b) and at the dataset level, by selecting only a small number of samples from the dataset, each of which contains only a few selected input features (e.g., pixels). The selected subset can then be used for the reconstruction of the entire set (either each instance or the entire dataset) or for a prediction task such as classification. The simplest way to obtain such a subset is random sampling, but this is highly sub-optimal because it treats all elements of the set equally. In practice, the pixels of an image and the examples in a dataset have varying degrees of importance (Katharopoulos & Fleuret, 2018) to a target task, whether reconstruction or prediction, so random sampling generally incurs a large loss of accuracy on the target task. Some prior work on coreset construction (Huggins et al., 2016; Campbell & Broderick, 2018; 2019) proposed to build a small subset containing the most important samples for Bayesian posterior inference. However, these methods cannot be straightforwardly applied to deep learning with an arbitrary target task.
How can we then sample elements from a given set to construct a subset that suffers minimal accuracy loss on any target task? To this end, we propose to learn a sampler that selects the most important elements for a given task, by training it jointly with the target task; for instance selection in classification, we additionally meta-learn the sampler over a distribution of datasets. Our method applies to two scenarios (Figure 1): (a) selecting a subset of features (pixels) from an image, which reduces the communication cost of data transfer between devices and allows faster training or inference on resource-constrained devices due to the reduced computational cost; and (b) selecting a subset of instances from a dataset, which helps resource-constrained systems train faster and makes non-parametric inference more scalable.

Specifically, we learn the sampling rate for individual elements in two stages. First, we learn a Bernoulli sampling rate for each element to efficiently screen out less important ones. Then, to select the most important elements from this candidate set while accounting for their relative importance, we use a Categorical distribution to model the conditional probability of sampling each element given the set of already-selected elements. After learning the sampling probabilities for both stages, we can perform stochastic selection from a given set with linear time complexity. Our Stochastic Subset Selection (SSS) is a general framework for sampling elements from a set, and it can be applied to both feature sampling and instance sampling. SSS reduces the memory and computation cost required to process data while retaining performance on downstream tasks. Our model admits a wide range of practical applications. For example, when sending an image to an edge device with low computing power, instead of sending the entire image we could send a subset of pixels with their coordinates, reducing both communication and inference cost.
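To make the two-stage procedure concrete, the following is a minimal NumPy sketch. In the actual model, the Bernoulli rates and Categorical logits are produced by learned neural networks and trained end-to-end; here, hand-crafted importance scores and a simplified conditional (which just renormalizes the logits over the remaining candidates) serve as illustrative stand-ins.

```python
import numpy as np

def stage1_candidates(scores, rng):
    """Stage 1: sample a Bernoulli mask from per-element inclusion
    probabilities to screen out less important elements."""
    probs = 1.0 / (1.0 + np.exp(-scores))      # sigmoid -> Bernoulli rates
    mask = rng.random(len(probs)) < probs       # independent Bernoulli draws
    return np.flatnonzero(mask)                 # indices of candidates

def stage2_select(candidates, logits, k, rng):
    """Stage 2: iteratively draw up to k elements without replacement from
    a Categorical over the remaining candidates (a stand-in for the learned
    conditional given the already-selected elements)."""
    remaining = list(candidates)
    selected = []
    for _ in range(min(k, len(remaining))):
        l = logits[remaining]
        p = np.exp(l - l.max())
        p /= p.sum()                            # softmax over remaining
        i = rng.choice(len(remaining), p=p)     # one Categorical draw
        selected.append(remaining.pop(i))
    return selected

# Toy set of 10 elements with random importance scores.
rng = np.random.default_rng(0)
scores = rng.normal(size=10)
cand = stage1_candidates(scores, rng)
subset = stage2_select(cand, scores, k=3, rng=rng)
```

Because each stage makes a single pass over the (remaining) elements, selection runs in time linear in the set size, as claimed above.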
Similarly, edge devices may need to perform real-time inference on large amounts of data that can be represented as a set (e.g., video, point clouds), and our feature selection can be used to speed up this inference. Moreover, our model can help with on-device learning on personal data (e.g., photos), as it can select the examples with which to train the model at a reduced cost. Finally, it can improve the scalability of non-parametric models that require the storage of training examples, such as Neural Processes. We validate our SSS model on multiple datasets for 1D function regression and 2D image reconstruction and classification, for both feature selection and instance selection. The results show that our method selects samples with minimal decrease in target-task accuracy, largely outperforming random sampling and an existing learned sampling method. Our contribution in this work is threefold:
• We propose a novel two-stage stochastic subset selection method that learns to sample a subset from a larger set with linear time complexity and minimal loss of accuracy on the downstream task.
• We propose a framework that trains the subset selection model via meta-learning, such that it can generalize to unseen tasks.
• We validate the efficacy and generality of our model on various datasets for both feature selection from an instance and instance selection from a dataset, on which it significantly outperforms relevant baselines.

2. RELATED WORK

Set encoding - Permutation invariant networks Recently, extensive research effort has gone into set representation learning, with the goal of obtaining order-invariant (or equivariant) and size-invariant representations. Many works propose simple methods that obtain set representations by applying a non-linear transformation to each element, followed by a pooling layer (e.g., average or max pooling) (Ravanbakhsh et al., 2016; Qi et al., 2017b; Zaheer et al., 2017; Sannai et al., 2019). However, these models are known to have limited expressive power and are sometimes incapable of capturing higher-order moments of distributions. In contrast, approaches such as Stochastic Deep Network (De Bie et al., 2018) and Set Transformer (Lee et al., 2018) consider pairwise (or higher-order) interactions among set elements and can hence capture more complex statistics of the distributions. These methods often achieve higher performance on classification and regression tasks; however, they have run-time complexities of O(n^2) or higher.
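The transform-then-pool encoders discussed above can be sketched as follows. The per-element network (often called phi) and the post-pooling network (rho) are reduced here to single random linear-ReLU layers for illustration, not the architectures of the cited works; the point is that mean pooling makes the encoding invariant to the ordering of the set elements.

```python
import numpy as np

def deepsets_encode(X, W1, W2):
    """Permutation-invariant set encoding: apply a shared non-linear
    transform to each element, average-pool over the set, then apply a
    final transform to the pooled representation."""
    h = np.maximum(X @ W1, 0.0)          # phi: shared per-element ReLU layer
    pooled = h.mean(axis=0)              # order-invariant average pooling
    return np.maximum(pooled @ W2, 0.0)  # rho: post-pooling transform

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))              # a set of 5 elements, 4 features each
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 3))
z1 = deepsets_encode(X, W1, W2)
z2 = deepsets_encode(X[::-1], W1, W2)    # same set, reversed element order
```

Since pooling collapses the set in a single pass, this encoder runs in O(n) time, in contrast to the O(n^2) pairwise-interaction models above.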



Figure 1: Concept: Our Stochastic Subset Selection method is generic and can be applied to many types of sets.

