EFFICIENT DATA SUBSET SELECTION TO GENERALIZE TRAINING ACROSS MODELS: TRANSDUCTIVE AND INDUCTIVE NETWORKS

Abstract

Subset selection has recently emerged as a successful approach toward efficient training of models by significantly reducing the amount of data and computational resources required. However, existing methods employ discrete, combinatorial, and model-specific approaches that lack generalizability: for each new model, the algorithm has to be executed from the beginning. Therefore, for data subset selection for an unseen architecture, one cannot reuse the subset chosen for a different model. In this work, we propose SUBSELNET, a non-adaptive subset selection framework, which tackles these problems with two main components. First, we introduce an attention-based neural gadget that leverages the graph structure of architectures and acts as a surrogate to trained deep neural networks for quick model prediction. Then, we use these predictions to build subset samplers. This leads us to develop two variants of SUBSELNET. The first variant, called Transductive-SUBSELNET, is transductive: it computes the subset separately for each model by solving a small optimization problem. This optimization is still very fast, thanks to the replacement of explicit model training by the model approximator. The second variant, called Inductive-SUBSELNET, is inductive: it computes the subset using a trained subset selector, without any optimization. Most state-of-the-art data subset selection approaches are adaptive, in that the subset selection adapts as the training progresses, and as a result, they require access to the entire data at training time. Our approach, in contrast, is non-adaptive and performs the subset selection only once in the beginning, thereby achieving resource and memory efficiency along with compute efficiency at training time.
Our experiments show that both variants of our model outperform several methods in the quality of the chosen subset, and further demonstrate that our method can be used to choose the best architecture from a set of architectures.

1. INTRODUCTION

In the last decade, deep neural networks have dramatically enhanced the performance of state-of-the-art ML models. However, these networks often demand massive data to train, which renders them heavily contingent on the availability of high-performance computing machinery, e.g., GPUs, CPUs, RAM, and storage disks. Such resources entail heavy energy consumption, excessive CO2 emission, and maintenance cost. Driven by this challenge, a recent body of work focuses on suitably selecting a subset of instances, so that the model can be quickly trained using lightweight computing infrastructure (Boutsidis et al., 2013; Kirchhoff & Bilmes, 2014; Wei et al., 2014a; Bairi et al., 2015; Liu et al., 2015; Wei et al., 2015; Lucic et al., 2017; Mirzasoleiman et al., 2020b; Kaushal et al., 2019; Killamsetty et al., 2021a;b;c). However, these existing data subset selection algorithms are discrete combinatorial algorithms, which share three key limitations. (1) Scaling up the combinatorial algorithms is often difficult, which imposes a significant barrier to achieving efficiency gains over training with the entire data. (2) Many of these approaches are adaptive in nature, i.e., the subset changes as the model training progresses. As a result, they require access to the entire training dataset, and while they provide compute efficiency, they do not address the memory and resource efficiency challenges of deep model training. (3) The subset selected by the algorithm is tailored to train only one given specific model and cannot be used to train another model. Therefore, the output of the algorithm cannot be shared across different models. We discuss related work in detail in Appendix A.

1.1. PRESENT WORK

Responding to the above limitations, we develop SUBSELNET, a trainable subset selection framework, which, once trained on a set of model architectures and a dataset, can quickly select a small training subset that can be used to train a new (test) model without a significant drop in accuracy. Our setup is non-adaptive in that it learns to select the subset before training starts for a new architecture, instead of adaptively selecting the subset during the training process. We initiate our investigation by writing down an instance of a combinatorial optimization problem that outputs a subset specifically for one given model architecture. Then, we gradually develop SUBSELNET by building upon this setup. SUBSELNET comprises the following novel components.
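As a concrete illustration, one generic bilevel form that such model-specific objectives often take is the following; the notation here (dataset D, budget b, model m with parameters theta, loss l) is ours, and the paper's exact instance may differ:

```latex
\min_{S \subseteq D,\; |S| \le b} \;\; \sum_{(x, y) \in D} \ell\big(m_{\theta^*(S)}(x),\, y\big),
\qquad \text{where} \quad
\theta^*(S) = \operatorname*{arg\,min}_{\theta} \sum_{(x, y) \in S} \ell\big(m_\theta(x),\, y\big).
```

The inner problem trains the model on the candidate subset, while the outer problem asks that this subset-trained model still fit the full data; the coupling between S and the trained parameters is precisely what makes the problem both combinatorial and model-specific.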

Neural model approximator.

The key blocker in scaling up a model-specific combinatorial subset selector across different architectures is the involvement of the model parameters as optimization variables alongside the candidate data subset. To circumvent this blocker, we design a neural model approximator that aims to approximate the predictions of a trained model for any given architecture. Such a model approximator can thus estimate the per-instance predictions of a new (test) model without explicitly training it. The approximator works in two steps. First, it translates a given model architecture into a set of embedding vectors using graph neural networks (GNNs). Similar to the proposal of Yan et al. (2020), it views a given model architecture as a directed graph between different operations, and then outputs the node embeddings by learning a variational graph autoencoder (VAE) in an unsupervised manner. Because of this unsupervised training, these node embeddings represent only the underlying architecture; they do not capture any signal from the predictions of the trained model. Hence, in the next step, we build a neural model encoder that uses these node embeddings and the given instance to approximate the prediction made by the trained model. The model encoder is a transformer-based neural network that combines the node embeddings using self-attention-induced weights to obtain an intermediate graph representation. This intermediate representation is finally combined with the instance vector x to provide the prediction of the trained architecture.

Subset sampler.

Having computed the prediction of a trained architecture, we aim to choose a subset of instances that minimizes the predicted loss and, at the same time, offers a good representation of the data. Our subset sampler takes the approximate model output and an instance as input and computes a selection score.
Then it builds a logit vector from all these selection scores, uses it to parameterize a multinomial distribution, and samples a subset from it. This naturally leads to two variants of the model.
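The two components just described can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the names (`attention_pool`, `sample_subset`), the single-query attention, and the dot-product selection score are our own simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(node_emb, w_query):
    """Combine GNN node embeddings into one graph-level representation
    using attention-induced weights (a stand-in for the transformer encoder)."""
    attn = softmax(node_emb @ w_query)   # one weight per architecture node
    return attn @ node_emb               # weighted sum of node embeddings

def sample_subset(scores, budget):
    """Turn per-instance selection scores into a multinomial distribution
    over the dataset and draw a subset of the given budget."""
    probs = softmax(scores)
    return rng.choice(len(scores), size=budget, replace=False, p=probs)

# Toy run: a 5-node architecture graph, 100 data points, budget of 10.
node_emb = rng.standard_normal((5, 16))  # node embeddings from the GNN
w_query = rng.standard_normal(16)
g = attention_pool(node_emb, w_query)    # graph-level representation

X = rng.standard_normal((100, 16))
scores = X @ g                           # hypothetical data-architecture affinity score
subset = sample_subset(scores, budget=10)
```

In the full model, the score would also depend on the approximator's predicted output for each instance rather than on a raw dot product; the sketch only shows the flow from node embeddings to a sampled subset.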

Transductive-SUBSELNET:

The first variant is transductive in nature. Here, for each new architecture, we use the predictions from the model approximator to build a continuous surrogate of the original combinatorial problem and solve it to obtain the underlying selection scores. Thus, we still solve a fresh optimization problem for every new architecture, but the direct predictions from the model approximator allow us to skip explicit model training. This makes the strategy extremely fast, both in terms of memory and time. We call this transductive subset selector Transductive-SUBSELNET.

Inductive-SUBSELNET:

In contrast to Transductive-SUBSELNET, the second variant does not require solving any optimization problem and is consequently even faster. Instead, it models the scores using a neural network that is trained across different architectures to minimize the entropy-regularized sum of the prediction loss. We call this variant Inductive-SUBSELNET.

We compare our method against six state-of-the-art methods on three real-world datasets. The results show that Transductive-SUBSELNET (Inductive-SUBSELNET) provides the best (second-best) trade-off between accuracy and inference time, as well as between accuracy and memory usage, among all the methods. This is because (1) our subset selection method does not require any training at any stage of subset selection for a new model; and (2) our approach is non-adaptive and performs the subset selection before training starts. In contrast, most state-of-the-art data subset selection approaches are adaptive, in that the subset selection adapts as the training progresses, and as a result, they require access to the entire data at training time. Finally, we design a hybrid version of the model, where, given a budget, we first select a larger set of instances using Inductive-SUBSELNET and then extract the required number of instances using Transductive-SUBSELNET.
We observe that such a hybrid approach allows us to make a smooth transition between the trade-off curves of Inductive-SUBSELNET and Transductive-SUBSELNET.
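A minimal sketch of how the transductive surrogate and the hybrid scheme might look, assuming the model approximator has already produced a per-instance predicted loss. The plain softmax parameterization, the entropy weight `lam`, and the expansion factor `expand` are our illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def transductive_select(pred_loss, budget, lam=0.1, lr=0.5, steps=200):
    """Continuous surrogate of the combinatorial problem: optimize selection
    logits so that low predicted-loss instances get high probability, with an
    entropy term (weight lam) keeping the distribution spread out."""
    logits = np.zeros_like(pred_loss)
    for _ in range(steps):
        p = softmax(logits)
        # gradient of  sum_i p_i * loss_i - lam * H(p)  w.r.t. p
        g = pred_loss + lam * (np.log(p) + 1.0)
        # chain through the softmax: dF/dlogit_j = p_j * (g_j - <p, g>)
        logits -= lr * p * (g - p @ g)
    p = softmax(logits)
    return np.argsort(-p)[:budget]           # keep the top-budget instances

def hybrid_select(ind_scores, pred_loss, budget, expand=3):
    """Hybrid scheme: keep expand * budget instances by inductive scores,
    then refine down to budget with the transductive optimization."""
    pool = np.argsort(-ind_scores)[: expand * budget]
    keep = transductive_select(pred_loss[pool], budget)
    return pool[keep]

# Toy run with synthetic predicted losses and inductive scores.
rng = np.random.default_rng(1)
pred_loss = rng.random(50)                               # hypothetical predicted losses
ind_scores = -pred_loss + 0.1 * rng.standard_normal(50)  # stand-in inductive scores
S = hybrid_select(ind_scores, pred_loss, budget=5)
```

The two-stage structure mirrors the hybrid design above: the cheap inductive scorer prunes the candidate pool once, and the (more expensive) transductive optimization only ever runs on that small pool.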

