EFFICIENT DATA SUBSET SELECTION TO GENERALIZE TRAINING ACROSS MODELS: TRANSDUCTIVE AND INDUCTIVE NETWORKS

Abstract

Subset selection has recently emerged as a successful approach for efficient training of models, significantly reducing the amount of data and computational resources required. However, existing methods employ discrete, combinatorial, and model-specific approaches that lack generalizability: for each new model, the algorithm has to be executed from the beginning. Therefore, for data subset selection on an unseen architecture, one cannot reuse the subset chosen for a different model. In this work, we propose SUBSELNET, a non-adaptive subset selection framework, which tackles these problems with two main components. First, we introduce an attention-based neural gadget that leverages the graph structure of architectures and acts as a surrogate to trained deep neural networks for quick model prediction. Then, we use these predictions to build subset samplers. This leads us to develop two variants of SUBSELNET. The first variant is transductive (called Transductive-SUBSELNET), which computes the subset separately for each model by solving a small optimization problem. Such an optimization is still extremely fast, thanks to the replacement of explicit model training by the model approximator. The second variant is inductive (called Inductive-SUBSELNET), which computes the subset using a trained subset selector, without any optimization. Most state-of-the-art data subset selection approaches are adaptive, in that the subset selection adapts as training progresses, and as a result, they require access to the entire data at training time. Our approach, in contrast, is non-adaptive and performs subset selection only once in the beginning, thereby achieving resource and memory efficiency along with compute efficiency at training time.
Our experiments show that both variants of our model outperform several methods on the quality of the chosen subset, and further demonstrate that our method can be used to choose the best architecture from a set of architectures.

1. INTRODUCTION

In the last decade, deep neural networks have dramatically enhanced the performance of state-of-the-art ML models. However, these networks often demand massive amounts of data to train, which renders them heavily contingent on the availability of high-performance computing machinery, e.g., GPUs, CPUs, RAM, storage disks, etc. Such resources entail heavy energy consumption, excessive CO2 emission and maintenance cost. Driven by this challenge, a recent body of work focuses on suitably selecting a subset of instances, so that the model can be quickly trained using lightweight computing infrastructure (Boutsidis et al., 2013; Kirchhoff & Bilmes, 2014; Wei et al., 2014a; Bairi et al., 2015; Liu et al., 2015; Wei et al., 2015; Lucic et al., 2017; Mirzasoleiman et al., 2020b; Kaushal et al., 2019; Killamsetty et al., 2021a;b;c). However, these existing data subset selection algorithms are discrete combinatorial algorithms, which share three key limitations. (1) Scaling up the combinatorial algorithms is often difficult, which imposes a significant barrier to achieving efficiency gains over training with the entire data. (2) Many of these approaches are adaptive in nature, i.e., the subset changes as model training progresses. As a result, they require access to the entire training dataset, and while they provide compute efficiency, they do not address the memory and resource efficiency challenges of deep model training. (3) The subset selected by the algorithm is tailored to train only one given model and cannot be used to train another model; therefore, the algorithm cannot be shared across different models. We discuss the related work in detail in Appendix A.

1.1. PRESENT WORK

Responding to the above limitations, we develop SUBSELNET, a trainable subset selection framework which, once trained on a set of model architectures and a dataset, can quickly select a small training subset that can be used to train a new (test) model without a significant drop in accuracy. Our setup is non-adaptive in that it learns to select the subset before training starts for a new architecture, instead of adaptively selecting the subset during the training process. We initiate our investigation by writing down an instance of a combinatorial optimization problem that outputs a subset specifically for one given model architecture. Then, we gradually develop SUBSELNET by building upon this setup. SUBSELNET comprises the following novel components.

Neural model approximator. The key blocker in scaling up a model-specific combinatorial subset selector across different architectures is the involvement of the model parameters as optimization variables along with the candidate data subset. To circumvent this blocker, we design a neural model approximator which aims to approximate the predictions of a trained model for any given architecture. Thus, the model approximator can provide the per-instance predictions of a new (test) model without explicitly training it. The approximator works in two steps. First, it translates a given model architecture into a set of embedding vectors using graph neural networks (GNNs). Similar to the proposal of Yan et al. (2020), it views a given model architecture as a directed graph between different operations and then outputs the node embeddings by learning a variational graph autoencoder (VAE) in an unsupervised manner. Owing to this unsupervised training, the node embeddings represent only the underlying architecture; they do not capture any signal from the predictions of the trained model.
Hence, in the next step, we build a neural model encoder which uses these node embeddings and the given instance to approximate the prediction made by the trained model. The model encoder is a transformer-based neural network which combines the node embeddings using self-attention-induced weights to obtain an intermediate graph representation. This intermediate representation is finally combined with the instance vector x to produce the prediction of the trained architecture.

Subset sampler. Having computed the prediction of a trained architecture, we aim to choose a subset of instances that minimizes the predicted loss and, at the same time, offers a good representation of the data. Our subset sampler takes the approximate model output and an instance as input and computes a selection score. It then builds a logit vector from these selection scores, feeds it into a multinomial distribution, and samples a subset from it. This naturally leads to two variants of the model.

Transductive-SUBSELNET:

The first variant is transductive in nature. Here, for each new architecture, we utilize the predictions from the model approximator to build a continuous surrogate of the original combinatorial problem and solve it to obtain the underlying selection scores. Thus, we still need to solve a fresh optimization problem for every new architecture. However, the direct predictions from the model approximator allow us to skip explicit model training, which makes this strategy extremely fast in terms of both memory and time. We call this transductive subset selector Transductive-SUBSELNET.

Inductive-SUBSELNET:

In contrast to Transductive-SUBSELNET, the second variant does not require solving any optimization problem and is consequently even faster. Instead, it models the scores using a neural network which is trained across different architectures to minimize the entropy-regularized sum of the prediction losses. We call this variant Inductive-SUBSELNET. We compare our method against six state-of-the-art methods on three real-world datasets, which shows that Transductive-SUBSELNET (Inductive-SUBSELNET) provides the best (second best) trade-off between accuracy and inference time, as well as between accuracy and memory usage, among all the methods. This is because (1) our subset selection method does not require any training at any stage of subset selection for a new model; and (2) our approach is non-adaptive and performs subset selection before training starts. In contrast, most state-of-the-art data subset selection approaches are adaptive, in that the subset selection adapts as training progresses, and as a result, they require access to the entire data at training time. Finally, we design a hybrid version of the model where, given a budget, we first select a larger set of instances using Inductive-SUBSELNET and then extract the required number of instances using Transductive-SUBSELNET. We observe that such a hybrid approach allows us to make a smooth transition between the trade-off curves of Inductive-SUBSELNET and Transductive-SUBSELNET.

2. DEVELOPMENT OF PROPOSED MODEL: SUBSELNET

In this section, we set up the notation and write down the combinatorial subset selection problem for efficient training. This leads us to a continuous optimization problem which allows us to generalize the combinatorial setup across different models.

2.1. NOTATIONS

We are given a set of training instances {(x_i, y_i)}_{i∈D}, where we use D to index the data. Here, x_i ∈ R^{d_x} are the features and y_i ∈ Y are the labels. In our experiments, we consider Y to be a set of categorical labels, although our framework can also be used for continuous labels. We use m to denote a neural architecture and m_θ to denote its parameterization; M denotes the set of neural architectures. Given an architecture m ∈ M, G_m = (V_m, E_m) is the graph representation of m, where a node u ∈ V_m represents an operation and an edge e = (u_m, v_m) ∈ E_m indicates that the output of the operation at node u_m is fed as an operand to the operation at node v_m. Finally, we use H(·) to denote the entropy of a probability distribution and ℓ(m_θ(x), y) to denote the cross-entropy loss hereafter.

2.2. COMBINATORIAL SUBSET SELECTION FOR EFFICIENT LEARNING

We are given a dataset {(x_i, y_i)}_{i∈D} and a model architecture m ∈ M with its neural parameterization m_θ. The goal of a subset selection algorithm is to select a small subset of instances S with |S| = b << |D| such that training m_θ on the subset S gives nearly the same accuracy as training on the entire dataset D. Existing works (Killamsetty et al., 2021b; Sivasubramanian et al., 2021; Killamsetty et al., 2021a) adopt different strategies to achieve this goal, but all of them aim to simultaneously optimize for the model parameters θ as well as the candidate subset S. At the outset, we may consider the following optimization problem:

minimize_{θ, S ⊂ D: |S| = b}  ∑_{i∈S} ℓ(m_θ(x_i), y_i) − λ · DIVERSITY(S),    (1)

where b is the budget, DIVERSITY(S) measures the representativeness of S with respect to the whole dataset D, and λ is a regularizing coefficient that trades off between training loss and diversity. One can use submodular functions (Fujishige, 2005; Iyer, 2015) such as facility location, graph cut, or log-determinant to model DIVERSITY(S). Such an optimization problem indeed provides an optimal subset S that results in high accuracy.

Bottlenecks of the combinatorial optimization. Solving (1) couples the model parameters θ with the discrete subset S: for every new architecture, it demands training the model and re-running the combinatorial search from scratch, which defeats the purpose of efficient training.

2.3. FROM COMBINATORIAL TO CONTINUOUS OPTIMIZATION

Model approximator. To sidestep explicit training, we introduce a model approximator F_ϕ which, given an architecture m and an instance x_i, approximates the prediction of the trained model:

F_ϕ(G_m, x_i) ≈ m_{θ*}(x_i) for m ∈ M,    (2)
where F_ϕ(G_m, x_i) = g_β(GNN_α(G_m), x_i).    (3)

Here, ϕ = {α, β}, and θ* is the set of learned parameters of the model m_θ on the dataset D.

Subset sampler. We design a subset sampler using a probabilistic model Pr_π(·). Given a budget |S| = b, it sequentially draws instances S = {s_1, ..., s_b} from a softmax distribution over the logit vector π ∈ R^{|D|}, where π(x_i, y_i) indicates a score for the element (x_i, y_i). Having chosen the first t instances S_t = {s_1, .., s_t} from D, it draws the (t+1)-th element (x, y) from the remaining instances in D with probability proportional to exp(π(x, y)), and repeats this b times.
Thus, the probability of selecting the ordered set of elements S = {s_1, ..., s_b} is given by

Pr_π(S) = ∏_{t=0}^{b−1} [ exp(π(x_{s_{t+1}}, y_{s_{t+1}})) / ∑_{τ∈D\S_t} exp(π(x_τ, y_τ)) ].    (4)

We would like to highlight that S is an ordered set of elements, selected sequentially. However, this order does not affect the trained model, which is inherently invariant to permutations of the training data; it only affects the choice of S.

Training objective. Using Eqs. (2) and (4), we replace the combinatorial optimization problem in Eq. (1) with a continuous optimization problem across different model architectures m ∈ M. To that goal, we define

Λ(S; m, π, F_ϕ) = ∑_{i∈S} ℓ(F_ϕ(G_m, x_i), y_i) − λ H(Pr_π(·)),    (5)

minimize_{π,ϕ}  ∑_{m∈M} E_{S∼Pr_π(·)} [ Λ(S; m, π, F_ϕ) + ∑_{i∈S} γ KL(F_ϕ(G_m, x_i) || m_{θ*}(x_i)) ].    (6)

Here, the entropy H(Pr_π(·)) of the subset sampler models the diversity of samples in the selected subset, and γ penalizes the difference between the output of the model approximator and the prediction made by the trained model, which allows us to generalize the training of different models m ∈ M through F_ϕ(G_m, x_i). We call our neural pipeline, consisting of the model approximator F_ϕ and the subset selector π, SUBSELNET.
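As a concrete illustration, the sequential sampling scheme of Eq. (4) can be sketched in a few lines. This is a minimal sketch: the function name and the use of NumPy are our own, not part of the paper's implementation.

```python
import numpy as np

def sample_subset(pi, b, rng):
    """Sequentially draw b distinct indices; at step t, an index i is chosen
    from the remaining pool with probability proportional to exp(pi[i]),
    mirroring the product form of Eq. (4)."""
    remaining = list(range(len(pi)))
    chosen = []
    for _ in range(b):
        logits = pi[remaining]
        probs = np.exp(logits - logits.max())   # stable softmax over the pool
        probs = probs / probs.sum()
        pick = rng.choice(len(remaining), p=probs)
        chosen.append(remaining.pop(pick))      # sample without replacement
    return chosen
```

Note that the renormalization over the shrinking pool is exactly the denominator ∑_{τ∈D\S_t} exp(π(x_τ, y_τ)) in Eq. (4).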

2.4. TRANSDUCTIVE-SUBSELNET AND INDUCTIVE-SUBSELNET MODELS

The optimization in (6) suggests that once F_ϕ is trained, we can use it to compute the output of the trained model m_{θ*} for an unseen architecture m′ and use it to compute π. This already removes the significant overhead of model training and facilitates fast computation of π. This leads us to develop two types of models based on how we compute π, as follows.

Transductive-SUBSELNET. The first variant of the model is transductive in terms of the computation of π. Here, once we have trained the model approximator F_ϕ, we compute π by explicitly solving the optimization problem with respect to π every time we wish to select a data subset for a new architecture. Given a trained model F_ϕ and a new model architecture m′ ∈ M, we solve

min_π E_{S∼Pr_π(·)} [Λ(S; m′, π, F_ϕ)]

to find the subset sampler Pr_π during inference for the new architecture m′. Such an optimization still consumes time during inference. However, it is significantly faster than the combinatorial methods (Killamsetty et al., 2021b;a; Mirzasoleiman et al., 2020a; Sivasubramanian et al., 2021), thanks to sidestepping explicit model training using the model approximator.

Inductive-SUBSELNET. In contrast to the transductive model, the inductive model does not require explicit optimization of π in the face of a new architecture. To that aim, we approximate π using a neural network π_ψ. This network takes two signals as input, the dataset D and the outputs of the model approximator for different instances {F_ϕ(G_m, x_i) | i ∈ D}, and outputs a score π_ψ(x_i, y_i) for each instance. Under Inductive-SUBSELNET, the optimization (6) becomes:

minimize_{ψ,ϕ}  ∑_{m∈M} E_{S∼Pr_{π_ψ}(·)} [ Λ(S; m, π_ψ, F_ϕ) + ∑_{i∈S} γ KL(F_ϕ(G_m, x_i) || m_{θ*}(x_i)) ].    (7)

Such an inductive model can select an optimal distribution over subsets that can be used to efficiently train any model m_θ, without explicitly training θ or searching for the underlying subset.
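To illustrate the transductive step, the following is a minimal sketch of optimizing π on a relaxed objective. The relaxation is our own simplification: we replace the expectation over sampled subsets with per-instance selection probabilities p = softmax(π), so the objective becomes ⟨p, losses⟩ − λH(p), where the per-instance losses are assumed to come from the model approximator F_ϕ; plain gradient descent stands in for the paper's optimizer.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def optimize_pi(losses, lam=0.1, lr=1.0, steps=500):
    """Relaxed transductive objective: with p = softmax(pi), minimize
    <p, losses> - lam * H(p) by gradient descent on the logits pi."""
    pi = np.zeros_like(losses)
    for _ in range(steps):
        p = softmax(pi)
        # d/dp [ p.losses + lam * p.log(p) ] = losses + lam * (log p + 1)
        v = losses + lam * (np.log(p + 1e-12) + 1.0)
        # chain rule through softmax: grad_pi = p * (v - <p, v>)
        grad = p * (v - p.dot(v))
        pi -= lr * grad
    return pi
```

At the optimum of this relaxation, p_i ∝ exp(−loss_i / λ), so low-loss instances receive exponentially more probability mass.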

3. NEURAL PARAMETERIZATION OF SUBSELNET

In this section, we describe the neural parameterization of SUBSELNET, which consists of two key components, F_ϕ and π_ψ. Transductive-SUBSELNET has only one neural component, F_ϕ, whereas Inductive-SUBSELNET has both F_ϕ and π_ψ.

3.1. NEURAL PARAMETERIZATION OF F ϕ

The approximator F_ϕ consists of two components: (i) a graph neural network GNN_α which maps G_m, the DAG of an architecture, to the node representations H_m = {h_u}_{u∈V_m}, and (ii) a model encoder g_β which takes H_m and the instance x_i as input and approximates m_{θ*}(x_i), i.e., the prediction made by the trained model. Therefore, F_ϕ(G_m, x_i) = g_β(GNN_α(G_m), x_i), with ϕ = {α, β}.

Computation of architecture embedding using GNN_α. Given a model m ∈ M, we compute the representations H_m = {h_u | u ∈ V_m} using a graph neural network GNN_α parameterized by α, following the proposal of Yan et al. (2020). We first compute the feature vector f_u for each node u ∈ V_m using the one-hot encoding of the associated operation (e.g., max, sum, etc.) and then feed it into a neural network to compute an initial node representation:

h_u[0] = INITNODE_α(f_u).

Then, we use a message passing network, which collects signals from the neighborhood of each node and recursively computes the node representations (Yan et al., 2020; Xu et al., 2018b; Gilmer et al., 2017). Given a maximum number of recursive layers K and a node u, we compute the node embeddings H_m = {h_u | u ∈ V_m} by gathering information from the k < K hops using K recursive layers as follows:

h_{(u,v)}[k−1] = EDGEEMBED_α(h_u[k−1], h_v[k−1]),
h′_u[k−1] = SYMMAGGR_α({ h_{(u,v)}[k−1] | v ∈ Nbr(u) }),
h_u[k] = UPDATE_α(h_u[k−1], h′_u[k−1]).

Here, Nbr(u) is the set of neighbors of u. We use SYMMAGGR as a simple sum aggregator, and both UPDATE and EDGEEMBED are injective mappings, as used in (Xu et al., 2018b). Note that the trainable parameters of EDGEEMBED, SYMMAGGR and UPDATE are decoupled; we collectively denote them by α. Finally, we obtain the node representations as h_u = [h_u[0], .., h_u[K−1]].
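A single message-passing recursion of the form above can be sketched as follows. This is a toy sketch: EDGEEMBED and UPDATE are realized here as one-layer tanh networks on concatenated inputs, our own simplification of the injective mappings used in the paper, and SYMMAGGR is the sum aggregator.

```python
import numpy as np

def gnn_layer(H, adj, W_edge, W_upd):
    """One recursion: edge embeddings from endpoint pairs (EdgeEmbed),
    sum-aggregation over neighbours (SymmAggr), then a node update (Update)."""
    n, d = H.shape
    H_new = np.zeros_like(H)
    for u in range(n):
        msgs = np.zeros(d)
        for v in range(n):
            if adj[u, v]:  # v is a neighbour of u
                edge = np.tanh(np.concatenate([H[u], H[v]]) @ W_edge)  # EdgeEmbed
                msgs += edge                                           # sum aggregator
        H_new[u] = np.tanh(np.concatenate([H[u], msgs]) @ W_upd)       # Update
    return H_new
```

Stacking K such layers and concatenating the per-layer states reproduces the final representation h_u = [h_u[0], .., h_u[K−1]].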
Model encoder g_β. Having computed the architecture representation {h_u | u ∈ V_m}, we next design the model encoder, which leverages these embeddings to predict the output of the trained model m_{θ*}(x_i). To this aim, we develop a model encoder g_β, parameterized by β, that takes H_m and x_i as input and attempts to predict m_{θ*}(x_i), i.e., g_β(H_m, x_i) ≈ m_{θ*}(x_i). It consists of three steps. First, we generate a permutation-invariant order on the nodes. Next, we feed the representations {h_u} in this order into a self-attention based transformer layer. Finally, we combine the output of the transformer and the instance x_i using a feedforward network to approximate the model output.

Node ordering using BFS. We first sort the nodes using breadth-first-search (BFS) order ρ. Similar to You et al. (2018), this sorting produces a permutation-invariant sequence of nodes and captures subtleties like skip connections in the network structure G_m.

Attention layer. Given the BFS order ρ, we pass the representations H_m = {h_u | u ∈ V_m} in the sequence ρ through a self-attention based transformer network. Here, the Query, Key and Value functions are realized by matrices W_query, W_key, W_value ∈ R^{dim(h)×k}, where k is a tunable width. Thus, for each node u ∈ V_m, we have:

Query(h_u) = W_query^⊤ h_u,  Key(h_u) = W_key^⊤ h_u,  Value(h_u) = W_value^⊤ h_u.    (11)

Using these quantities, we compute an attention-weighted vector Att_u given by

Att_u = W_c^⊤ ∑_v a_{u,v} Value(h_v), with a_{u,v} = SOFTMAX_v( Query(h_u)^⊤ Key(h_v) / √k ).    (12)

Here, k is the dimension of the latent space, the softmax is taken over the nodes v, and W_c ∈ R^{k×dim(h)}. Subsequently, for each node u, we use a feedforward network, preceded and succeeded by layer normalization, given by the following set of equations.
ζ_{u,1} = LN(Att_u + h_u; γ_1, γ_2),  ζ_{u,2} = W_2^⊤ RELU(W_1^⊤ ζ_{u,1}),  ζ_{u,3} = LN(ζ_{u,1} + ζ_{u,2}; γ_3, γ_4).

Here, LN is the layer normalization operation (Ba et al., 2016). Finally, we feed the vector ζ_{u,3} for the last node in the sequence ρ, i.e., u = ρ(|V_m|), along with the feature vector x_i into a feed-forward network to model the prediction m_{θ*}(x_i). Thus, the final output of the model encoder g_β(H_m, x_i) is given by

o_{m,x_i} = FF_{β_2}(ζ_{ρ(|V_m|),3}, x_i).    (13)

Here, the W_• and γ_• are trainable parameters and collectively form the set of parameters β.
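The attention-plus-feedforward block of Eqs. (11)-(13) can be sketched as follows. This is a minimal NumPy sketch; the learned layer-norm gain and bias (γ_•) are fixed to 1 and 0 for brevity, and the weight names are illustrative.

```python
import numpy as np

def layer_norm(x, g=1.0, b=0.0, eps=1e-5):
    """LN over a single vector: normalize to zero mean, unit variance."""
    return g * (x - x.mean()) / np.sqrt(x.var() + eps) + b

def encoder_block(H, Wq, Wk, Wv, Wc, W1, W2):
    """Self-attention over node embeddings (Eqs. 11-12) followed by the
    LN -> ReLU feed-forward -> LN residual block, one output per node."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv         # queries, keys, values
    k = Q.shape[1]
    scores = Q @ K.T / np.sqrt(k)            # scaled dot-product scores
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)        # softmax over nodes v
    Att = (A @ V) @ Wc                       # attention output, back to dim(h)
    out = np.zeros_like(H)
    for u in range(H.shape[0]):
        z1 = layer_norm(Att[u] + H[u])                 # zeta_{u,1}
        z2 = np.maximum(0.0, z1 @ W1) @ W2             # zeta_{u,2}
        out[u] = layer_norm(z1 + z2)                   # zeta_{u,3}
    return out
```

In the paper's encoder, only the row for the last node in the BFS order ρ is concatenated with x_i and passed to the final feed-forward network.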

3.2. NEURAL ARCHITECTURE OF INDUCTIVE-SUBSELNET

We approximate π with a neural network π_ψ which takes three inputs: the instance (x_j, y_j), the corresponding output of the model approximator o_{m,x_j} = F_ϕ(G_m, x_j), and the node representation matrix H_m; it provides a positive selection score π_ψ(H_m, x_j, y_j, o_{m,x_j}). In practice, π_ψ is a three-layer feed-forward network with Leaky-ReLU activations in the first two layers and a sigmoid activation at the last layer.
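A minimal sketch of such a scorer, with the three inputs flattened and concatenated, is given below. The dimensions and weight names are illustrative, not the paper's.

```python
import numpy as np

def leaky_relu(x, a=0.01):
    return np.where(x > 0, x, a * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def selection_score(x, y_onehot, o, h_graph, W1, W2, W3):
    """Three-layer scorer: Leaky-ReLU on the first two layers, sigmoid at the
    output, so the selection score always lies in (0, 1)."""
    z = np.concatenate([x, y_onehot, o, h_graph])  # flatten the three inputs
    z = leaky_relu(z @ W1)
    z = leaky_relu(z @ W2)
    return float(sigmoid(z @ W3))
```

The scores across all instances form the logit vector that the subset sampler Pr_{π_ψ}(·) consumes.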

4. PARAMETER ESTIMATION AND INFERENCE

Given a dataset {(x_i, y_i) | i ∈ D} and the outputs of the trained models {m_{θ*}(x_i)}_{i∈D}, our goal is to estimate ϕ and π (resp. ψ) for the transductive (resp. inductive) model. We first illustrate the bottlenecks that prevent end-to-end training of these parameters, then introduce a multi-stage training method that overcomes these limitations, and finally present the inference method.

4.1. BOTTLENECK FOR END TO END TRAINING

End-to-end optimization of the above problem is difficult for the following reasons. (i) The architecture representation H_m should represent only the architecture and thus be independent of the architecture's parameters θ and the instances x; end-to-end training can make it sensitive to these quantities. (ii) To make the model approximator F_ϕ accurately fit the output of the trained model m_{θ*}, we need explicit training of ϕ against the target m_{θ*}; adding the corresponding loss as yet another regularizer imposes an additional hyperparameter-tuning burden.

4.2. MULTI-STAGE TRAINING

In our multi-stage training method, we first train the model approximator F_ϕ by minimizing the KL divergence between its predictions and the gold output probabilities, and then train the subset sampler Pr_π (resp. Pr_{π_ψ}) for the transductive (resp. inductive) model while fine-tuning ϕ.

Training the model approximator F_ϕ. We train F_ϕ in two steps. In the first step, we perform unsupervised training of GNN_α using a graph variational autoencoder (GVAE). This ensures that the architecture representations H_m remain insensitive to the model parameters. We build the encoder and decoder of our GVAE following existing work on graph VAEs (Yan et al., 2020) in the context of graph-based modeling of neural architectures. Given a graph G_m, the encoder q(Z_m | G_m) takes the node embeddings {h_u}_{u∈V_m} and maps them into the latent space Z_m = {z_u}_{u∈V_m}. Specifically, we model the encoder as

q(z_u | G_m) = N(μ(h_u), Σ(h_u)),

where both μ and Σ are neural networks. Given a latent representation Z_m = {z_u}_{u∈V_m}, the decoder models a generative distribution of the graph G_m, where the presence of an edge is modeled as a Bernoulli random variable BERNOULLI(σ(z_u^⊤ z_v)). Thus, we model the decoder as

p(G_m | Z) = ∏_{(u,v)∈E_m} σ(z_u^⊤ z_v) · ∏_{(u,v)∉E_m} [1 − σ(z_u^⊤ z_v)],

where σ is a parameterized sigmoid function. Finally, we estimate α, μ, Σ and σ by maximizing the evidence lower bound (ELBO):

max_{α,μ,Σ,σ}  E_{Z∼q(· | G_m)} [log p(G_m | Z)] − KL(q(· | G_m) || prior(·)).

Next, we train the model encoder g_β by minimizing the KL divergence between the approximated prediction g_β(H_m, x_i) and the ground-truth prediction m_{θ*}(x_i), where both quantities are probability distributions over classes. Hence, the training problem is:

minimize_β  ∑_{i∈D, m∈M} KL(m_{θ*}(x_i) || g_β(H_m, x_i)).
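The Bernoulli edge decoder above corresponds to the following reconstruction term of the ELBO, log p(G_m | Z). This is a minimal sketch, treating the latent codes Z as given and using a plain (unparameterized) sigmoid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def edge_log_likelihood(Z, adj):
    """log p(G | Z) under the Bernoulli decoder: each directed edge (u, v)
    is present with probability sigmoid(z_u . z_v), absent otherwise."""
    n = Z.shape[0]
    ll = 0.0
    for u in range(n):
        for v in range(n):
            if u == v:
                continue
            p = sigmoid(Z[u] @ Z[v])
            ll += np.log(p) if adj[u, v] else np.log(1.0 - p)
    return ll
```

Maximizing this term pulls the latent codes of connected nodes together and pushes unconnected pairs apart, which is what makes z_u^⊤ z_v a useful edge score.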
Training of the subset sampler. Finally, we fine-tune g_β and train π by solving (6) for Transductive-SUBSELNET (likewise, we train π_ψ by solving (7) for Inductive-SUBSELNET). The TRAININDUCTIVE routine below further calls the TRAINPI subroutine to train the parameters of the neural subset selector.

Algorithm 1 Training procedure
1: function TRAININDUCTIVE(D, M, {θ*})
2:   α, β, {H_m} ← TRAINAPPROX(D, M, {θ*})
3:   o ← [g_β(H_m, x_i)]_{i,m}
4:   ψ ← TRAINPI(o, {H_m}, {x_i})

1: function TRAINAPPROX(D, M, {θ*})
2:   α ← TRAINGNN(M)
3:   for m ∈ M_train do
4:     H_m ← GNN_α(m)
5:     POS ← BFSORDERING(G_m)
6:   β ← TRAINMODELENC({x_i}, POS, {θ*})

Algorithm 2 Inference procedure
1: function INFERTRANSDUCTIVE(D, α, β, m′)
2:   H_{m′} ← GNN_α(m′)
3:   F_ϕ(G_{m′}, x_i) ← g_β(H_{m′}, x_i) ∀i ∈ D
4:   π* ← argmin_π E_{S∼Pr_π(·)}[Λ(S; m′, π, F_ϕ)]
5:   S* ∼ Pr_{π*}(·)
6:   TRAINNEWMODEL(m′, S*)

1: function INFERINDUCTIVE(D, α, β, m′)
2:   H_{m′} ← GNN_α(m′)
3:   F_ϕ(G_{m′}, x_i) ← g_β(H_{m′}, x_i) ∀i ∈ D
4:   Compute π_ψ(x_i, y_i) ∀i ∈ D
5:   S* ∼ Pr_{π_ψ}(·)
6:   TRAINNEWMODEL(m′, S*)

Inference subroutines. Given an unseen architecture and the parameters of the trained neural networks, the inference phase for both variants of SUBSELNET first generates the model encoder output for all the data points. After this, the INFERTRANSDUCTIVE routine solves the optimization problem on π explicitly for the unseen architecture and selects the subset from the dataset. On the other hand, INFERINDUCTIVE utilizes the trained parameters of the neural subset selector. Finally, both routines call TRAINNEWMODEL to train and evaluate the unseen architecture on the selected subset.

5. EXPERIMENTS

In this section, we provide a comprehensive evaluation of SUBSELNET against several strong baselines on three real-world datasets. Additional results appear in Appendix D.

5.1. EXPERIMENTAL SETUP

Datasets. We use the FMNIST (Xiao et al., 2017), CIFAR10 (Krizhevsky et al., 2014) and CIFAR100 (Krizhevsky et al., 2009) datasets for our experiments. We transform an input image X_i into a vector x_i of dimension 2048 by feeding it to a pre-trained ResNet50 v1.5 (?) model and using the output of the penultimate layer as the image representation.

Model architectures and baselines. We use model architectures from NAS-Bench-101 (Ying et al., 2019) for our experiments. We compare Transductive-SUBSELNET and Inductive-SUBSELNET against two non-adaptive subset selection methods, (i) facility location (Fujishige, 2005; Iyer, 2015), where we maximize FL(S) = ∑_{j∈D} max_{i∈S} x_i^⊤ x_j to find S, and (ii) Pruning (Sorscher et al., 2022), as well as four adaptive subset selection methods, (iii) Glister (Killamsetty et al., 2021b), (iv) Grad-Match (Killamsetty et al., 2021a), (v) EL2N (Paul et al., 2021), and (vi) GraNd (Paul et al., 2021).

Evaluation protocol. Given an architecture m′ ∈ M_test, we select the subset S from D_tr using our subset sampler (Pr_π for Transductive-SUBSELNET or Pr_{π_ψ} for Inductive-SUBSELNET). Similarly, all the non-adaptive subset selectors select S ⊂ D_tr using their own algorithms. Once S is selected, we train the test models m′ ∈ M_test on S. We perform our experiments with different subset sizes |S| = b ∈ (0.005|D|, 0.05|D|) and compare the performance of the different methods using three quantities. (1) Accuracy Pr(y = ŷ), measured as

(1 / |D_test|) ∑_{i∈D_test} ∑_{m′∈M_test} 1(argmax_j m′_{θ*}(x_i)[j] = y_i).

(2) Computational efficiency, i.e., the speedup achieved with respect to training on the full dataset, measured as T_f / T. Here, T_f is the time taken for training with the full dataset, and T is the time taken for the entire inference task, i.e., the average time for selecting subsets across the test models m′ ∈ M_test plus the average training time of these test models on the respective selected subsets.
(3) Resource efficiency, in terms of the amount of memory consumed during the entire inference task described in item (2), measured as ∫_0^T memory(t) dt, where memory(t) is the amount of memory consumed at time t.
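The two efficiency metrics can be computed as follows. This is a small sketch; the trapezoidal approximation of the memory integral is our own choice, assuming memory is sampled at discrete timestamps.

```python
import numpy as np

def speedup(t_full, t_select, t_train):
    """Computational efficiency T_f / T, where T is the subset-selection time
    plus the training time on the selected subset."""
    return t_full / (t_select + t_train)

def memory_cost(timestamps, memory):
    """Resource efficiency: the integral of memory(t) over the inference
    window, approximated by the trapezoidal rule on sampled readings."""
    t = np.asarray(timestamps, dtype=float)
    m = np.asarray(memory, dtype=float)
    return float(np.sum((t[1:] - t[:-1]) * (m[1:] + m[:-1]) / 2.0))
```

For example, a method that selects in 1 minute and trains in 4 minutes, against 100 minutes of full-data training, achieves a 20x speedup.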

5.2. RESULTS

Comparison with baselines. Here, we compare the different methods in terms of the trade-off between accuracy and computational efficiency, as well as accuracy and resource efficiency. In Figure 1, we probe the variation between these quantities by varying the size of the selected subset |S| = b ∈ (0.005|D|, 0.05|D|). We make the following observations. (1) Our methods trade off accuracy vs. computational efficiency, as well as accuracy vs. resource efficiency, more effectively than all the other methods. For FMNIST, both variants of our method strikingly achieve 75% accuracy while being 100 times faster than full selection. Transductive-SUBSELNET performs slightly better than Inductive-SUBSELNET in terms of the overall trade-off between accuracy and efficiency for FMNIST and CIFAR10. However, for CIFAR100, Transductive-SUBSELNET performs significantly better than Inductive-SUBSELNET. The times taken by Transductive-SUBSELNET and Inductive-SUBSELNET appear comparable; this is because the subset selection time for both is significantly less than the final training time on the selected subset. (2) EL2N is the second best method. It provides the best trade-off between accuracy and time, as well as accuracy and GPU memory, among all the baselines. It aims to choose difficult training instances having high prediction error; as a result, once trained on them, the model can predict the labels of easy instances too. However, it chooses instances only after running the initial few epochs. (3) FL adopts a greedy algorithm for subset selection and therefore consumes a large amount of time and memory during subset selection itself. Consequently, its overall efficiency decreases significantly, although the training cost on the selected subset remains the same as for our models in terms of time and memory.
(4) In addition to EL2N, Glister, Grad-Match and GraNd are adaptive subset selection methods that operate with moderately small (> 5%) subset sizes. In the regime where the subset size is extremely small, i.e., 1%-5%, they perform very poorly. Moreover, they maximize a monotone function at each gradient update step, which results in significant time overhead. These methods process the entire training data to refine the choice of the subset and consequently consume a lot of memory. (5) GraNd selects the instances having high uncertainty after running each model for five epochs, and often the model is not well trained by then.

Hybrid-SUBSELNET. From Figure 1, we observe that Transductive-SUBSELNET performs significantly better than Inductive-SUBSELNET. However, since Transductive-SUBSELNET solves a fresh optimization problem for each new architecture, it performs better at the cost of time and GPU memory. This motivates a hybrid design: given the subset budget b, we first choose B > b instances using Inductive-SUBSELNET and then select the final b instances by running the explicit optimization routine of Transductive-SUBSELNET. Figure 3 summarizes the results for B ∈ {25K, 30K, 35K, 45K, 50K}. We observe that the trade-off curves of Hybrid-SUBSELNET lie between those of Inductive-SUBSELNET and Transductive-SUBSELNET. For a low value of B, i.e., B = 25K, the trade-off curve of Hybrid-SUBSELNET remains close to that of Inductive-SUBSELNET. As we increase B, the trade-off curves of accuracy vs. speedup as well as accuracy vs. GPU usage improve, which allows Hybrid-SUBSELNET to smoothly transition from the trade-off curve of Inductive-SUBSELNET to that of Transductive-SUBSELNET. At B = 45K, the trade-off curve almost coincides with Transductive-SUBSELNET. Such properties allow a user to choose an appropriate B that accurately corresponds to a target operating point in the form of (accuracy, speedup) or (accuracy, memory usage).
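The two-stage hybrid selection can be sketched as follows. The function `transductive_refine` is a hypothetical stand-in for the explicit optimization routine of Transductive-SUBSELNET, assumed to return a ranking of the given pool.

```python
import numpy as np

def hybrid_select(inductive_scores, transductive_refine, B, b):
    """Stage 1: keep the B highest-scoring instances under the inductive
    selector. Stage 2: hand that pool to the transductive refinement, which
    re-ranks it, and keep the final b instances."""
    pool = np.argsort(inductive_scores)[::-1][:B]  # coarse inductive stage
    refined = transductive_refine(pool)            # fine transductive stage
    return list(refined)[:b]
```

Varying B moves the operating point between the two extremes: B = b recovers pure inductive selection, while B = |D| recovers pure transductive selection.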

6. CONCLUSION

In this work, we develop SUBSELNET, a subset selection framework which can be trained on a set of model architectures so as to predict a suitable training subset before training a model with an unseen architecture. To do so, we first design a neural model approximator, which predicts the output of a new candidate architecture without explicitly training it. We use that output to design transductive and inductive variants of our model. The transductive model solves a small optimization problem to compute the subset for each new architecture, every single time. In contrast, the inductive model resorts to a neural subset sampler instead of an optimizer. Our work does not incorporate the gradients of the trained model in the model approximator, and it would be interesting to explore their impact on subset selection. Further, we could extend our setup to an adaptive setting, where we incorporate signals from different epochs with a sequence encoder to train a subset selector.

A. RELATED WORK

Architecture encoding. Several recent works learn continuous representations of neural architectures using GNNs (Zhang et al., 2019; Ning et al., 2020; Yan et al., 2020; Lukasik et al., 2021). By employing an asynchronous message passing scheme over the directed acyclic graph (DAG), GNN-based methods model the propagation of input data over the actual network structure. Apart from encodings based solely on the structure of the network, White et al. (2020); Yan et al. (2021) produce computation-aware encodings that map architectures with similar performance to the same region in the latent space. Following the work of Yan et al. (2020), we use a graph isomorphism network as an encoder, but instead of producing a single graph embedding, our method produces a collection of node embeddings ordered by a breadth-first-search (BFS) ordering of the nodes. Our work also differs in that we do not employ network embeddings to perform downstream search strategies.
Instead, architecture embeddings are used to train a novel model approximator that predicts the logits of a particular architecture, given an architecture embedding and a data embedding.

Network architecture search. There is an ever-increasing demand for the automatic search of neural networks for various tasks. The networks discovered by NAS methods often come from an underlying search space, usually designed to constrain the search space size. One such approach uses cell-based search spaces (Luo et al., 2018; Zoph et al., 2017; Liu et al., 2017; Pham et al., 2018; Ying et al., 2019; Dong & Yang, 2020). Although we utilize the NAS-Bench-101 search space for architecture retrieval, our work is fundamentally different from NAS. In contrast to NAS methods, which search for the best possible architecture from the search space using either sampling or gradient-descent-based methods (Baker et al., 2017; Zoph & Le, 2016; Real et al., 2017; 2018; Liu et al., 2018; Tan et al., 2018), our work focuses on efficient data subset selection given a dataset and an architecture sampled from a search space. We apply graph representation learning to the architectures sampled from these search spaces to project an architecture under consideration into a continuous latent space, use the resulting model expression as a proxy for the actual model, and proceed with data subset selection using the generated embedding, model proxy, and given dataset.

Data subset selection. Data subset selection is widely used in the literature for efficient learning, coreset selection, human-centric learning, etc. Several works cast efficient data subset selection as an instance of submodular or approximately submodular optimization (Killamsetty et al., 2021a; Wei et al., 2014a;b;c; Killamsetty et al., 2021b; Sivasubramanian et al., 2021).
Another line of work focuses on selecting coresets: weighted combinations of subsets of the data that approximate some characteristic, e.g., the loss function or the model prediction (Feldman, 2020; Mirzasoleiman et al., 2020b; Har-Peled & Mazumdar, 2004; Boutsidis et al., 2013; Lucic et al., 2017). Our work is closely connected to simultaneous model learning and subset selection (De et al., 2021; 2020; Sivasubramanian et al., 2021). These works jointly optimize the training loss with respect to the subset of instances and the parameters of the underlying model. Among them, De et al. (2021; 2020) focus on distributing decisions between humans and machines, whereas Sivasubramanian et al. (2021) aims at efficient learning. However, these methods adopt a combinatorial approach to selecting subsets and consequently are not generalizable across architectures. In contrast, our work focuses on a differentiable subset selection mechanism, which can generalize across architectures.

• DEEPSET: We consider permutation-invariant networks of the form ρ(Σ_{h∈H} ϕ(h); x_i), where ρ and ϕ are neural networks and H is the sequence under consideration. Here, ρ is a fully-connected network with 4 layers, ReLU activation, and hidden dimension 64, and ϕ is a two-layer fully-connected network with ReLU activation and output dimension 10.

• LSTM: We consider an LSTM-based encoder with hidden dimension 16 and dropout probability 0.2. The output of the last LSTM block is concatenated with x_i and fed to a linear layer with hidden dimension 256, dropout probability 0.3, and ReLU activation.

Since the goal of the model encoder is to produce outputs that mimic the architectures, we measure the KL divergence between the outputs of the gold models and those of the encoder as a measure of closeness of the output distributions.

Performance of subset selectors using different model encoders.
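The DEEPSET baseline's defining property is permutation invariance of the pooled representation ρ(Σ_{h∈H} ϕ(h); x_i). A minimal numpy sketch of the sum-pooling stage is below; the weight shapes are illustrative only and do not match the paper's exact dimensions, and ρ itself is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# phi: a small ReLU MLP applied to each element of the sequence H
W1, W2 = rng.normal(size=(8, 16)), rng.normal(size=(16, 10))
def phi(h):                               # h: (8,) node embedding
    return np.maximum(h @ W1, 0) @ W2     # -> (10,)

def deepset_encode(H, x):                 # H: (n, 8) sequence, x: (4,) instance
    pooled = sum(phi(h) for h in H)       # permutation-invariant sum pool
    return np.concatenate([pooled, x])    # would be fed to rho (omitted)

H = rng.normal(size=(5, 8))
x = rng.normal(size=4)
out = deepset_encode(H, x)
# shuffling the sequence leaves the encoding (numerically) unchanged
out_perm = deepset_encode(H[::-1], x)
```

Sum pooling discards the BFS ordering of the node embeddings, which is one plausible reason this baseline approximates models less closely than the sequential encoders.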
We consider three different design choices of model approximator (ours (Transformer), Feedforward, and LSTM) along with three different subset selection strategies (our subset sampler, top-b instances by uncertainty, and top-b instances by loss), which yields nine combinations of model approximation and subset selection strategies. We measure uncertainty using the entropy of the predicted distribution over the target classes and report the average test accuracy of the models when trained on the pre-selected subset in the table below.

We conducted experiments similar to Section 5.1 for CIFAR10 and FMNIST on larger subset sizes (b) of 0.1|D|, 0.2|D|, 0.4|D|, and 0.7|D|. For each dataset and subset size, we evaluate the decrease in accuracy (ratio of the accuracy on the subset to the accuracy on the full dataset), speed-up (ratio of the time taken to train on the full dataset to the sum of the times taken for subset selection and subset training), and GPU usage in GB-min. We report the variation of these metrics with subset size in the following tables.

Figure 15: Trade-off between accuracy and speed-up (top row) and accuracy and memory consumption (bottom row) for all the methods: Facility Location (Fujishige, 2005; Iyer, 2015), Pruning (Sorscher et al., 2022), Glister (Killamsetty et al., 2021b), Grad-Match (Killamsetty et al., 2021a), EL2N (Paul et al., 2021), GraNd (Paul et al., 2021), and full selection on FMNIST and CIFAR10. In all cases, we vary |S| = b ∈ (0.1|D|, 0.7|D|).
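The uncertainty-based strategy above ranks instances by the entropy of the predicted class distribution and keeps the top b. A short numpy sketch (toy probabilities, not the paper's data):

```python
import numpy as np

def top_b_uncertain(probs, b):
    """Select the b instances whose predicted class distributions have the
    highest entropy, i.e., the uncertainty-based selection described above."""
    p = np.clip(probs, 1e-12, 1.0)
    H = -(p * np.log(p)).sum(axis=1)    # per-instance predictive entropy
    return np.argsort(H)[::-1][:b]      # indices of the b most uncertain

probs = np.array([[0.98, 0.01, 0.01],   # confident prediction
                  [0.34, 0.33, 0.33],   # near-uniform, very uncertain
                  [0.70, 0.20, 0.10]])  # in between
idx = top_b_uncertain(probs, b=2)
```

The loss-based strategy is analogous, ranking by per-instance loss instead of entropy.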

E PROS AND CONS OF USING GNNS

We have used a GNN in our model encoder to encode the architecture representations into an embedding. We chose a GNN for this task for the following reasons:

1. Message passing between the nodes (which may be the input, the output, or any of the operations) allows us to generate embeddings that capture the contextual structural information of each node, i.e., the embedding of a node captures not only its own operation but, to a large extent, also the operations preceding it.
2. It has been shown by Morris et al. (2019) and Xu et al. (2018a) that GNNs are as powerful as the Weisfeiler-Lehman algorithm and thus give a powerful representation of the graph. We therefore obtain smooth embeddings of the nodes/edges that effectively distill information from their neighborhoods without significant compression.
3. GNNs embed a model architecture into a representation that is independent of the underlying dataset and the model parameters, because they operate only on the nodes and edges, i.e., the structure of the architecture, and use neither the parameter values nor the input data.

However, the GNN has the following drawbacks:

1. A GNN uses a symmetric aggregator for message passing over node neighbors, to ensure that the representation of any node is invariant to a permutation of its neighbors. Such a symmetric aggregator renders it a low-pass filter, as shown by NT & Maehara (2019), which attenuates important high-frequency signals.
2. We train one GNN across several architectures, which can make the embedding insensitive to changes in an architecture. If we change the operation of one node (remove it, add to it, or alter its operation), the model's output can significantly change, yet the GNN embedding may become immune to such changes, since the GNN is trained over many architectures.
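The low-pass behavior of a symmetric aggregator (drawback 1) can be seen in a tiny numpy example: one round of mean aggregation over in-neighbors (with self-loops) smooths the node features, shrinking their variance. The 4-node DAG and the single feature channel here are purely illustrative.

```python
import numpy as np

# toy DAG: edges 0->1, 0->2, 1->3, 2->3
A = np.array([[0, 1, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)

# symmetric (mean) aggregation over in-neighbors plus a self-loop,
# the kind of permutation-invariant aggregator used in common GNN layers
A_hat = A.T + np.eye(4)                        # in-edges + self-loop
D_inv = 1.0 / A_hat.sum(axis=1, keepdims=True)
H0 = np.array([[1.0], [0.0], [0.0], [0.0]])    # one node stands out
H1 = D_inv * (A_hat @ H0)                      # one message-passing round
```

After one round the distinctive feature of node 0 has been averaged into its successors; repeated rounds flatten it further, which is exactly the high-frequency attenuation noted above.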

F CHOICE OF SUBMODULAR FUNCTION FOR THE OPTIMIZATION PROBLEM

In (1) we introduced the original combinatorial problem for subset selection, where the optimization variable S (the subset of instances) makes the underlying problem combinatorial. Here, we could use submodular functions such as Graph-Cut, Facility Location, and Log-Determinant as diversity functions, which would allow greedy algorithms to maximize the objective in (1). But, as discussed in Section 4.1, this suffers from two bottlenecks: expensive computation and lack of generalizability. Therefore, we do not follow these approaches and instead resort to our proposed approach, SUBSELNET. In contrast to the problem in (1), which is a combinatorial set optimization problem, the optimization problem in SUBSELNET (6) is a continuous optimization problem whose goal is to estimate Pr_π. In such a problem, where the probability distribution is the key optimization variable, entropy is a more natural measure of diversity than the submodular measures above.
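For concreteness, the greedy maximization of a submodular diversity function alluded to above can be sketched as follows for facility location, f(S) = Σ_i max_{j∈S} sim(i, j). This is a generic illustration of the combinatorial alternative, not the paper's code; the similarity matrix is a toy example with two tight clusters.

```python
import numpy as np

def greedy_facility_location(sim, b):
    """Greedy maximisation of the facility-location objective
    f(S) = sum_i max_{j in S} sim[i, j]."""
    n = len(sim)
    S, best = [], np.zeros(n)            # best[i] = max_{j in S} sim[i, j]
    for _ in range(b):
        gains = [np.maximum(best, sim[:, j]).sum() - best.sum()
                 for j in range(n)]      # marginal gain of each candidate
        j = int(np.argmax(gains))
        S.append(j)
        best = np.maximum(best, sim[:, j])
    return S

# two tight clusters {0, 1} and {2, 3}: greedy picks one representative each
sim = np.array([[1.0, 0.9, 0.1, 0.1],
                [0.9, 1.0, 0.1, 0.1],
                [0.1, 0.1, 1.0, 0.9],
                [0.1, 0.1, 0.9, 1.0]])
S = greedy_facility_location(sim, b=2)
```

Each greedy step costs O(n²) here, which is one concrete face of the computational bottleneck that motivates the continuous formulation in (6).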



1 https://github.com/jmschrei/apricot



INFERENCE. During inference, our goal is to select a subset S with |S| = b for a new model m′, which facilitates efficient training of m′. As discussed in Section 2.4, for Transductive-SUBSELNET we compute π by explicitly solving the optimization problem min_π E_{S∼Pr_π(·)}[Λ(S; m, π, F_ϕ)] and then draw S ∼ Pr_π(·). For Inductive-SUBSELNET, we draw S ∼ Pr_{π_ψ}(·), where ψ is the value of ψ learned during training.

4.4 OVERVIEW OF TRAINING AND INFERENCE ROUTINES

Algorithms 1 and 2 summarize the training and inference procedures. [Algorithm listings not recovered from the extraction: TRAINTRANSDUCTIVE(D, M, {θ*}) and TRAININDUCTIVE(D, M, {θ*}) both begin by computing α, β, H_m ← TRAINAPPROX(D, M, {θ*}).]
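The draw S ∼ Pr_π(·) with |S| = b can be sketched with the Gumbel-top-k trick, one common way to sample b items without replacement in proportion to their scores. This is an illustrative sketch, not the paper's sampler; `pi` is a toy score vector.

```python
import numpy as np

def sample_subset(pi, b, rng):
    """Draw a subset of size b without replacement, with inclusion driven
    by the scores pi, via the Gumbel-top-k trick."""
    g = rng.gumbel(size=len(pi))                  # i.i.d. Gumbel(0, 1) noise
    return np.argsort(np.log(pi) + g)[::-1][:b]   # top-b perturbed scores

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.2, 0.2, 0.05, 0.05])
S = sample_subset(pi, b=2, rng=rng)
```

The transductive variant would first optimize `pi` for the new architecture; the inductive variant would read it off the trained network π_ψ.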

Training subroutines. The training phase for both Transductive- and Inductive-SUBSELNET first utilizes the TRAINAPPROX routine to train the model approximator given the dataset, the trained model parameters, and the set of neural architectures. Internally, the routine calls the TRAINGNN subroutine to train the parameters α of the GNN, the BFSORDERING subroutine to reorder the embeddings into BFS order, and the TRAINMODELENC subroutine to train the parameters β of the attention-based model encoder.

Figure 1: Trade-off between accuracy and speed-up (top row) and accuracy and memory consumption (bottom row) for all the methods: Facility Location (Fujishige, 2005; Iyer, 2015), Pruning (Sorscher et al., 2022), Glister (Killamsetty et al., 2021b), Grad-Match (Killamsetty et al., 2021a), EL2N (Paul et al., 2021), GraNd (Paul et al., 2021), and full selection on all three datasets (FMNIST, CIFAR10, and CIFAR100). In all cases, we vary |S| = b ∈ (0.005|D|, 0.05|D|) and measure accuracy on 20% test architectures and 20% test instances.

None of the baseline methods supports a generalizable learning protocol across different model architectures, and thus none can leverage the training architectures at test time. Given an architecture m′ ∈ M_test, we select the subset S from D_tr using our subset sampler (Pr_π for Transductive-SUBSELNET or Pr_{π_ψ} for Inductive-SUBSELNET). Similarly, all the non-adaptive subset selectors select S ⊂ D_tr using their own algorithms. Once S is selected, we train the test models m′ ∈ M_test on S. We perform our experiments with different |S| = b ∈ (0.005|D|, 0.05|D|) and compare the methods using three quantities: (1) Accuracy Pr(y = ŷ) measured using 1

Figure 3: Hybrid-SUBSELNET.

On the other hand, Inductive-SUBSELNET performs significantly worse, as it relies on a trained neural network to learn the same optimization problem. We therefore design a hybrid version of our model, called Hybrid-SUBSELNET, which shortlists B > b instances with Inductive-SUBSELNET and then runs the explicit optimization routines of Transductive-SUBSELNET to pick the final b instances.

Approximator of the trained model m_θ*. First, we design a neural network F_ϕ that approximates the predictions of the trained model m_θ* for different architectures m ∈ M. Given the dataset {(x_i, y_i)}_{i∈D} and a model architecture m ∈ M, we first feed the underlying DAG G_m into a graph neural network GNN_α with parameters α, which outputs the representations of the nodes of G_m, i.e., H_m = {h_u}_{u∈V_m}. Next, we feed H_m and the instance x_i into an encoder g_β
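At the level of shapes, the two-stage approximator (GNN_α producing node embeddings H_m, then g_β mixing H_m with the instance x_i to emit logits) can be sketched as follows. This is a deliberately simplified numpy stand-in with random weights: one propagation layer replaces the full GNN, mean pooling replaces the attention-based encoder, and all dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def gnn_alpha(adj, feats, W):
    """One propagation layer as a stand-in for GNN_alpha:
    returns node embeddings H_m, one row per node of G_m."""
    A_hat = adj + np.eye(len(adj))           # add self-loops
    return np.maximum(A_hat @ feats @ W, 0.0)

def g_beta(H_m, x_i, W_out):
    """Stand-in for the encoder g_beta: pool the node embeddings,
    concatenate the instance x_i, and emit class logits."""
    z = np.concatenate([H_m.mean(axis=0), x_i])
    return z @ W_out

adj    = np.triu(np.ones((4, 4)), k=1)       # toy DAG over 4 nodes
feats  = rng.normal(size=(4, 5))             # one-hot-like operation features
H_m    = gnn_alpha(adj, feats, rng.normal(size=(5, 8)))
logits = g_beta(H_m, rng.normal(size=3), rng.normal(size=(11, 10)))
```

The real g_β is a transformer over the BFS-ordered sequence of node embeddings; the pooling here only conveys the data flow, not the architecture.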

Finer analysis of the inference time. Next, we separate the subset selection phase from the phase of training the test models on the selected subset during the inference-time analysis. Table 2 summarizes the results (inference time in seconds) for the top three non-adaptive subset selection methods for b = 0.005|D| on CIFAR100. We observe that: (1) the final training times of all three methods are roughly the same; (2) the selection time for Transductive-SUBSELNET is significantly larger than for Inductive-SUBSELNET, although it remains extremely small compared to the final training on the inferred subset; and (3) the selection time of FL is large, as much as 323% of the training time.

Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning, 2016. URL https://arxiv.org/abs/1611.01578.

Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition, 2017. URL https://arxiv.org/abs/1707.07012.

Representation learning for model architectures. Recent work on network representation learning uses GNN-based encoder-decoders to encapsulate the local structural information of a neural network in a fixed-length latent space

Table 8 summarizes the performance of different model encoders. We make the following observations: (1) The Transformer-based model encoder outperforms every other method by a significant margin on both datasets. (2) BFS-sequential modeling of an architecture with transformers yields better representations, enabling closer model approximation than other sequential methods such as the LSTM. (3) Non-sequential model approximators such as Feedforward and DeepSets lead to poor model approximation. The table compares several model encoder architectures g_β on CIFAR-10 and FMNIST, based on the Kullback-Leibler divergence between the gold model outputs and the predicted model outputs.
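The evaluation metric above, KL divergence between the gold model's output distribution and the encoder's prediction, is a one-liner; the toy distributions below are illustrative only.

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    """KL(p || q) between a gold-model output distribution p and an
    approximator output distribution q (eps guards against log 0)."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    return float((p * (np.log(p) - np.log(q))).sum())

gold  = [0.70, 0.20, 0.10]
close = [0.65, 0.25, 0.10]   # a good approximation of the gold output
far   = [0.10, 0.20, 0.70]   # a poor one
```

A lower divergence means the encoder's predicted distribution tracks the trained model more closely, which is exactly what Table 8 reports.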

Test accuracy of the nine combinations of model approximators and selection strategies on the pre-selected CIFAR10 subset of size 5%. If we use simple unsupervised subset selection heuristics, e.g., loss- or uncertainty-based subset selection, then our model approximator performs much worse than Feedforward or

Variation of accuracy with subset size of both the variants of SUBSELNET on training, validation and test set of CIFAR10

Decrease in accuracy (Accuracy on selected subset/Accuracy on full data) for CIFAR10 and FMNIST for b ∈ (0.1|D|, 0.7|D|).

Speed-up (Time for full training/(Time taken for subset selection + Training on the selected subset)) for CIFAR10 and FMNIST for b ∈ (0.1|D|, 0.7|D|).

7. ETHICS STATEMENT

We do not foresee any negative impact of our work from an ethics viewpoint.

8. REPRODUCIBILITY STATEMENT

We uploaded the code in the supplementary material. Details of the implementation are given in Appendix C.

B ILLUSTRATION OF SUBSELNET

[Figure: Illustration of SUBSELNET. Panel (a) gives an overview: the GNN operations InitNode_α, EdgeEmbed_α, SymmAggr_α, and Update_α, together with LayerNorm and the positional encodings pos[u], feed the encoder g_β (with ϕ := {α, β}) and the subset sampler π (or π_ψ). The original figure was not recoverable from the extraction.]

Architectures (M). Although our task is not Neural Architecture Search, we leverage the NAS-Bench-101 search space as an architecture pool. This cell-based search space was designed for benchmarking various NAS methods. It consists of 423,624 unique architectures with the following constraints: (1) the number of nodes in each cell is at most 7; (2) the number of edges in each cell is at most 9; (3) barring the input and output, there are three unique operations, namely 1 × 1 convolution, 3 × 3 convolution, and 3 × 3 max-pool. We use architectures from this search space to generate the sequences of embeddings, and we sample architectures from it for training and testing the encoder and the subset selector.
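The three NAS-Bench-101 constraints quoted above are easy to state as a validity check. The sketch below is illustrative; the operation labels are the names used in the public NAS-Bench-101 release and should be treated as assumptions here.

```python
def valid_cell(num_nodes, edges, ops):
    """Check the NAS-Bench-101 cell constraints quoted above:
    at most 7 nodes, at most 9 edges, and, besides input/output,
    only 1x1 conv, 3x3 conv, and 3x3 max-pool operations."""
    allowed = {"input", "output",
               "conv1x1-bn-relu", "conv3x3-bn-relu", "maxpool3x3"}
    return num_nodes <= 7 and len(edges) <= 9 and set(ops) <= allowed

ops = ["input", "conv3x3-bn-relu", "maxpool3x3", "output"]
ok  = valid_cell(4, [(0, 1), (1, 2), (2, 3)], ops)   # satisfies all three
bad = valid_cell(8, [(0, 1)], ops)                   # too many nodes
```

Enumerating all labeled DAGs under these constraints is what yields the 423,624 unique cells.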

C.2 IMPLEMENTATION DETAILS ABOUT BASELINES

Facility Location (FL). We implemented facility location on all three datasets using the apricot library.¹ The similarity matrix was computed using the Euclidean distance between data points, and the objective function was maximized using the naive greedy algorithm.

Pruning. This method selects a subset of the dataset based on the uncertainty of the data points under partial training. In our setup, we considered ResNet-18 as a master model, trained on each dataset for 5 epochs. After training, the uncertainty measure is computed from the class probabilities, and the points with the highest uncertainty form the subset. We train the master model with a learning rate of 0.025.

Glister and Grad-Match. We implemented GLISTER (Killamsetty et al., 2021b) and Grad-Match (Killamsetty et al., 2021a) using the CORDS library. We trained the models for 50 epochs with a batch size of 20, and selected the subset after every 10 epochs. The loss was minimized using SGD with learning rate 0.01, momentum 0.9, and weight decay with regularization constant 5 × 10^-4. We used cosine annealing to schedule the learning rate with T_max of 50 epochs, and used 10% of the training data as the validation set. Method-specific hyperparameters are as follows. Glister uses a greedy selection approach to minimize a bi-level objective function; in our implementation, we used stochastic greedy optimization with learning rate 0.01, applied to the data points of each mini-batch. Online-Glister approximates the objective function with a Taylor series expansion up to an arbitrary number of terms to speed up the process; we used 15 terms in our experiments. Grad-Match applies the orthogonal matching pursuit (OMP) algorithm to the data points of each mini-batch to match the gradient of a subset to that of the entire training/validation set. Here, the learning rate is set to 0.01, the regularization constant in OMP is 1.0, and the algorithm optimizes the objective function to within an error margin of 10^-4.

GraNd. This is an adaptive subset selection strategy in which the norm of the gradient of the loss function is used as a score to rank a data point. The gradient scores are computed after the model has trained on the full dataset for the first few epochs. For the remaining epochs, the model is trained only on the top-k data points selected using the gradient scores. In our implementation, we let the model train on the full dataset for the first 5 epochs and computed the gradient of the loss only with respect to the last fully-connected layer.

EL2N. When the loss used to compute the GraNd scores is the cross-entropy loss, the norm of the gradient for a data point x can be approximated by E||p(x) - y||_2, where p(x) is the discrete probability distribution over the classes, computed by taking the softmax of the logits, and y is the one-hot encoded true label corresponding to x. As in our implementation of GraNd, we computed the EL2N scores after letting the models train on the full data for the first 5 epochs.
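The EL2N score ||p(x) - y||_2 described above is straightforward to compute from the logits; the sketch below uses toy logits, not the paper's models.

```python
import numpy as np

def el2n(logits, label, num_classes):
    """EL2N score: L2 norm of softmax(logits) minus the one-hot label."""
    z = logits - logits.max()              # stabilised softmax
    p = np.exp(z) / np.exp(z).sum()
    y = np.eye(num_classes)[label]
    return float(np.linalg.norm(p - y))

# a confidently-correct example scores lower than a confused one
easy = el2n(np.array([9.0, 0.0, 0.0]), label=0, num_classes=3)
hard = el2n(np.array([0.1, 0.0, 0.0]), label=0, num_classes=3)
```

Ranking by this score and keeping the top-k reproduces the selection step; the scores are computed once, after the initial full-data epochs.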

C.3 IMPLEMENTATION DETAILS ABOUT OUR MODEL

GNN_α. As we use the NAS-Bench-101 space as the underlying set of neural architectures, each computational node in an architecture carries one of five operations and a corresponding one-hot-encoded feature vector f_u. Since the space is cell-based, there is an injective mapping between a neural architecture and its cell structure. We aim to produce a sequence of embeddings for the cell, which in turn corresponds to the architecture. For each architecture, we use the initial feature vectors; we repeat this layer K times, and each iteration gathers information from k < K hops. After all iterations, we obtain an embedding for each node and, following You et al. (2018), use the BFS-tree-based node-ordering scheme to generate the sequence of embeddings for each network.

The GVAE-based architecture was trained for 10 epochs with the number of recursive layers K set to 5, using the Adam optimizer with learning rate 10^-3. The entire search space was used as the dataset, with a batch size of 32. After training, we collectively call the node embeddings the architecture representation. To train the latent-space embeddings, the parameters α are trained in an encoder-decoder fashion using a variational autoencoder; the mean µ and variance σ are computed from the final node embeddings h_u. The decoder aims to reconstruct the original cell structure (i.e., the nodes and their corresponding operations), which are one-hot encoded; it is modeled with single-layer fully-connected networks followed by a sigmoid layer.

The last item of the output sequence ζ_{u,3} is concatenated with the data embedding x_i and fed to a two-layer fully-connected network with hidden dimension 256 and dropout probability 0.3. The model encoder is trained by minimizing the KL divergence between g_β(H_m, x_i) and m_θ*(x_i). We used an AdamW optimizer with learning rate 10^-3, ϵ = 10^-8, betas = (0.9, 0.999), and weight decay 0.005. We also used cosine annealing to decay the learning rate, and gradient clipping with maximum norm 5. Figure 6 shows the convergence of the outputs of the model encoder g_β(H_m, x_i) to the outputs of the model m_θ*(x_i).

Neural network π_ψ. The inductive model is a three-layer fully-connected neural network with two Leaky ReLU activations and a sigmoid activation after the last layer. The input to π_ψ is the concatenation (H_m; o_{m,i}; x_i; y_i). The hidden dimensions of the two intermediate layers are 64 and 16, and the final layer is a single neuron that outputs the score for a data point x_i. While training π_ψ, we add a regularization term λ′(Σ_{i∈D} π_ψ(H_m, o_{m,i}, x_i, y_i) - |S|) to ensure that roughly |S| samples out of the entire dataset D receive high scores. Both regularization constants λ (in equation 6) and λ′ are set to 0.1. We train the model weights using an Adam optimizer with learning rate 0.001. During training, at each iteration we draw instances using Pr_π and use the log-derivative trick to compute the gradient of the objective; at each computation step, we use one instance of the ranked list to compute an unbiased estimate of the objective in (6).
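The budget regularizer λ′(Σ_{i∈D} π_ψ(·) - |S|) above can be sketched as follows. The squared form used here is one natural reading of the penalty (the text does not specify the exact functional form), so treat it as an assumption; the score vectors are toy values.

```python
import numpy as np

def budget_penalty(scores, b, lam=0.1):
    """Penalty that is zero when the per-instance scores pi_psi(.) in [0, 1]
    sum to the budget b, encouraging roughly b high-scoring instances.
    Squared form assumed; lam plays the role of lambda'."""
    return lam * (scores.sum() - b) ** 2

scores  = np.array([0.9, 0.8, 0.1, 0.1, 0.1])   # sums to 2.0
pen_ok  = budget_penalty(scores, b=2)            # on budget -> no penalty
pen_bad = budget_penalty(np.ones(5), b=2)        # everything scored high
```

Without this term, the sigmoid scorer could trivially assign a high score to every instance; the penalty ties the mass of the scores to the target subset size.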

D ADDITIONAL EXPERIMENTS D.1 ABLATION STUDY

We perform an ablation study of SUBSELNET from three perspectives.

Impact of ablation of the subset sampler. First, we assess the impact of the subset sampler. To that end, we compare the performance of SUBSELNET against two baselines, namely Bottom-b-loss and Bottom-b-loss+gumbel. In Bottom-b-loss, we sort the data instances by their predicted loss ℓ(F_ϕ(G_m, x), y) and keep the b instances with the smallest values. In Bottom-b-loss+gumbel, we add noise sampled from the Gumbel distribution with µ = 0 and β = 0.025 and sort the instances by these noisy loss values, i.e., ℓ(F_ϕ(G_m, x), y) + Gumbel(0, β = 0.025). We observe that Bottom-b-loss and Bottom-b-loss+gumbel do not perform well despite being efficient in terms of time and memory.

Exploring alternative architectures for the model encoder g_β. We consider three alternative architectures to our current model encoder g_β.

• FEEDFORWARD: We consider a two-layer fully-connected network, in which we concatenate the mean of H_m with x_i. We used ReLU activation between the layers, a hidden dimension of 256, and dropout with probability 0.3.

LSTM, whereas this trend is opposite if we use our subset sampler for selecting the subset. This may be due to overfitting of the transformer architecture in the presence of uncertainty- or loss-based selection, which is compensated by our subset sampler.
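The Bottom-b-loss+gumbel baseline above is a few lines of numpy; the loss values below are toy numbers, and the fixed seed is for reproducibility only.

```python
import numpy as np

def bottom_b_loss_gumbel(losses, b, beta=0.025, rng=None):
    """Ablation baseline: perturb predicted losses with Gumbel(0, beta)
    noise and keep the b instances with the smallest noisy values."""
    rng = rng or np.random.default_rng(0)
    noisy = losses + rng.gumbel(loc=0.0, scale=beta, size=len(losses))
    return np.argsort(noisy)[:b]

losses = np.array([0.05, 0.90, 0.10, 0.80, 0.07])
idx = bottom_b_loss_gumbel(losses, b=3)
```

With β = 0.025 the noise only reorders instances whose losses are close, so the selection stays near the plain Bottom-b-loss ranking while adding a little exploration.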

D.2 RECOMMENDING MODEL ARCHITECTURE

When dealing with a pool of architectures designed for the same task, choosing the right architecture can be daunting, since it is impractical to train all the architectures from scratch. In view of this problem, we show that training on small, carefully chosen subsets can serve as a quicker alternative for choosing the correct architecture. We first extract the top 15 best-performing architectures A*, i.e., those with the highest accuracy when trained on the full data, and mark them as "gold". Then, we gather the top 15 architectures A when trained on the subset provided by our models. Finally, we compare A and A* using the Kendall tau rank correlation coefficient (KTau) along with the Jaccard coefficient |A ∩ A*|/|A ∪ A*|. Figure 10 summarizes the results for the top three non-adaptive subset selectors in terms of accuracy, namely Transductive-SUBSELNET, Inductive-SUBSELNET, and FL. We make the following observations: (1) One of our variants outperforms FL in most cases on CIFAR10 and CIFAR100. (2) There is no consistent winner between Transductive-SUBSELNET and Inductive-SUBSELNET, although Inductive-SUBSELNET consistently outperforms both Transductive-SUBSELNET and FL on CIFAR100 in terms of the Jaccard coefficient.
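The two comparison metrics above can be computed directly from the ranked lists. The sketch below uses a plain-Python Kendall tau restricted to the architectures common to both lists (ignoring ties, which cannot occur between distinct ranks); the hypothetical top-5 lists are for illustration only, whereas the paper uses top-15.

```python
def kendall_tau(rank_a, rank_b):
    """Kendall tau over architectures common to both ranked lists:
    (concordant pairs - discordant pairs) / total pairs."""
    common = [a for a in rank_a if a in rank_b]
    n = len(common)
    if n < 2:
        return 0.0
    conc = disc = 0
    for i in range(n):
        for j in range(i + 1, n):
            da = rank_a.index(common[i]) - rank_a.index(common[j])
            db = rank_b.index(common[i]) - rank_b.index(common[j])
            if da * db > 0:
                conc += 1
            else:
                disc += 1
    return (conc - disc) / (n * (n - 1) / 2)

def jaccard(a, b):
    """|A ∩ A*| / |A ∪ A*| between two sets of architectures."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical rankings for illustration (the paper compares top-15 lists)
gold = ["arch1", "arch2", "arch3", "arch4", "arch5"]
pred = ["arch2", "arch1", "arch3", "arch6", "arch5"]
ktau = kendall_tau(gold, pred)  # 4 common archs, 5 concordant / 1 discordant pairs
jac = jaccard(gold, pred)       # 4 shared out of 6 total
```

A KTau of 1 means the subset-trained ranking orders the shared architectures exactly as the full-data ranking does; the Jaccard coefficient measures only overlap, not order.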

D.3 AVOIDING UNDERFITTING AND OVERFITTING

Since the amount of training data is small, there is a possibility of overfitting. However, the coefficient λ of the entropy regularizer λH(Pr_π) can be increased to draw instances from different regions of the feature space, which in turn reduces overfitting. In practice, we tuned λ on the validation set to control such overfitting. We present the accuracies on the (training, validation, test) folds for both Transductive-SUBSELNET and Inductive-SUBSELNET in Table 11. We make the following observations:

1. From training to test, in most cases, the decrease in accuracy is ∼7%.
2. This gap shrinks further from validation to test, where, in most cases, the decrease in accuracy is ∼4%.

We perform early stopping on the validation set, which acts as an additional regularizer; therefore, the amount of overfitting is significantly low.

Note that in the case of CIFAR10, we denote decrease factors of 0.91-0.96 in green and decrease factors of 0.85-0.88 in purple. In the case of FMNIST, we denote decrease factors of 0.94-0.97 in green and decrease factors of 0.90-0.93 in purple. We make the following observations:

1. We show a better trade-off between accuracy and time, and between accuracy and memory, than almost all the baselines.

2. Observations on CIFAR10: When we tuned the subset sizes, we noticed that SUBSELNET, GLISTER, Grad-Match, and EL2N achieve a comparable decrease factor of 0.91-0.93. In terms of speed-up and memory usage, we see that (a) SUBSELNET achieves a 1.3x speed-up over GLISTER and a 1.1x speed-up over Grad-Match and EL2N; (b) GLISTER consumes 3.7x, Grad-Match 3.1x, and EL2N 2.5x as much GPU memory as SUBSELNET. None of the other subset selection strategies achieve a sufficiently high accuracy, and we beat them in terms of speed-up and memory usage. Moreover, when the subset selection methods achieve a decrease factor of 0.85-0.88, we see that (a) SUBSELNET achieves a 2.4x speed-up over FacLoc, 1.8x over Pruning, 1.4x over GLISTER, 1.2x over Grad-Match, and 1.1x over EL2N; (b) FacLoc consumes 4.8x, Pruning 1.7x, GLISTER 4x, Grad-Match 3.4x, and EL2N 2.6x as much GPU memory as SUBSELNET.

3. Observations on FMNIST: When we tuned the subset sizes, we noticed that SUBSELNET, FacLoc, GLISTER, Grad-Match, and EL2N achieve a comparable decrease factor of 0.94-0.97. In terms of speed-up and memory usage, we see that (a) SUBSELNET achieves a 3.8x speed-up over FacLoc, 1.4x over GLISTER and Grad-Match, and 2.2x over EL2N; (b) FacLoc consumes 12.5x, and GLISTER, Grad-Match, and EL2N each 2.9x, as much GPU memory as SUBSELNET. None of the other subset selection strategies achieve a sufficiently high accuracy, and we beat them in terms of speed-up and memory usage. Moreover, when the subset selection methods achieve a decrease factor of 0.90-0.93, we see that (a) SUBSELNET achieves a 7.4x speed-up over FacLoc, 2.1x over GLISTER, 2.9x over Grad-Match, and 2.1x over EL2N; (b) FacLoc consumes 28.5x, GLISTER 4.5x, Grad-Match 6.1x, and EL2N 3.7x as much GPU memory as SUBSELNET.

We present the trade-off between accuracy and speed-up, and between accuracy and memory consumption, in Figure 15.
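To make the quantities compared above concrete, they are all simple ratios against SUBSELNET or against full-data training. The sketch below uses hypothetical accuracy, time, and memory values, not the paper's measurements; the helper names are ours.

```python
def decrease_factor(subset_acc, full_acc):
    """Fraction of full-data accuracy retained when training on the subset."""
    return subset_acc / full_acc

def speed_up(baseline_time, subselnet_time):
    """How many times faster subset training with SUBSELNET is than a baseline."""
    return baseline_time / subselnet_time

def relative_memory(baseline_mem, subselnet_mem):
    """Baseline GPU memory as a multiple of SUBSELNET's."""
    return baseline_mem / subselnet_mem

# Hypothetical numbers for illustration only
df = decrease_factor(subset_acc=0.85, full_acc=0.92)      # ~0.92, a "green" band value
su = speed_up(baseline_time=130.0, subselnet_time=100.0)  # 1.3x, e.g. vs. GLISTER
rm = relative_memory(baseline_mem=3.7, subselnet_mem=1.0) # baseline uses 3.7x memory
```

A decrease factor near 1 means the subset nearly matches full-data accuracy, so a method with a higher speed-up and lower relative memory at the same decrease factor dominates the trade-off.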

