DISTRIBUTION EMBEDDING NETWORK FOR META-LEARNING WITH VARIABLE-LENGTH INPUT

Abstract

We propose Distribution Embedding Network (DEN) for meta-learning, which is designed for applications where both the data distribution and the number of features could vary across tasks. DEN first transforms features using a learned piecewise linear function, then learns an embedding of the transformed data distribution, and finally classifies examples based on the distribution embedding. We show that the parameters of the distribution embedding and the classification modules can be shared across tasks. We propose a novel methodology to mass-simulate binary classification training tasks, and demonstrate that DEN outperforms existing methods in a number of test tasks in numerical studies.

1. INTRODUCTION

Deep learning has made substantial progress in a variety of tasks in image classification (e.g., He et al., 2016), object detection (e.g., Redmon & Farhadi, 2017; He et al., 2017), machine translation (e.g., Vaswani et al., 2017) and natural language understanding (e.g., Devlin et al., 2019). These achievements rely on efficient gradient-based optimization algorithms (e.g., Duchi et al., 2011; Sutskever et al., 2013; Kingma & Ba, 2015) as well as a large number of labeled examples to train highly flexible deep learning models. However, in many applications it is prohibitively expensive or impossible to collect a large amount of labeled training data, calling for techniques that can learn from small labeled datasets. Meta-learning aims to tackle the small-data problem by training a model on labeled data from a number of related tasks, with the goal of performing well on similar but unseen future tasks given only a small amount of labeled training data.

In this work, we propose a meta-learning model for classification using Distribution Embedding Networks (DEN). Unlike many existing meta-learning algorithms that assume a fixed feature set across tasks, DEN is designed for applications where both the distribution of features and the number of features can vary across tasks. For example, we may use DEN to learn the optimal aggregator of an ensemble of models to replace the naive majority vote, where in different aggregation tasks, the distribution of model outputs and the number of models in the ensemble can differ. At a high level, DEN first applies a learned feature transformation that maps the features of each task into a common distribution family. It then uses a neural network to learn an embedding of the transformed data distribution. Finally, given the learned distribution embedding, together with the transformed features, DEN classifies examples using a Deep Sets architecture (Zaheer et al., 2017), enabling it to handle variable-length inputs. To adapt the model to a new task, we only update the feature transformations, which have relatively few parameters.

2. RELATED WORK

There are multiple generic techniques applied to the meta-learning problem in the literature. The first camp of approaches learns similarities between pairs of examples. When presented with a new task with a small set of labeled examples, these methods classify unlabeled data based on their similarities with labeled ones. They include Matching Net (Vinyals et al., 2016) and Prototypical Net (Snell et al., 2017), which learn a distance metric between examples. Siamese Net (Koch et al., 2015) and Relation Net (Sung et al., 2018) use twin towers to learn the relationship between examples. Learn Net (Bertinetto et al., 2016) proposes class-specific weights for the towers. Satorras & Estrach (2018) learn the similarity metric using a graph neural network, and Transductive Propagation Network (Liu et al., 2019) classifies all unlabeled data at once by exploiting the manifold structure of the new class space. The second camp, optimization-based meta-learning, aims to find a good starting-point model, so that when presented with a new task, the meta model can quickly adapt to perform well with a small number of gradient steps. MAML (Finn et al., 2017) designs a learning algorithm such that the expected loss of the learned meta-model on new tasks after one gradient step is minimized. Meta-Learner LSTM (Ravi & Larochelle, 2017) replaces the classical gradient steps with learned gradient update weights, which are trained to minimize the validation loss. More recently, LEO (Rusu et al., 2019) extends MAML by utilizing Relation Net to learn a low-dimensional latent embedding of model parameters and performing optimization-based meta-learning in this space. A third camp of methods uses internal or external memory. It includes MANN (Santoro et al., 2016) and Meta Net (Munkhdalai & Yu, 2017), which store past knowledge in external memory and internal model activations, respectively. New examples are classified by retrieving relevant information from the memory.

Our proposal does not take the above three routes. Rather, DEN is conceptually similar to topic modeling, which learns a latent context variable. For example, Neural Statistician (Edwards & Storkey, 2017) considers a hierarchical generative process and uses a variational autoencoder (Kingma & Welling, 2014) to learn a latent vector that summarizes a dataset in an unsupervised fashion. Similar proposals include the variational homoencoder (Hewitt et al., 2018) and CNP (Garnelo et al., 2018). In comparison, DEN is a supervised procedure that learns an embedding of the data distribution, which we then utilize for classification of unseen examples.

3. NOTATIONS

In this paper, we use bold upper case letters to denote matrices (e.g., $\mathbf{X}$), bold lower case letters to denote vectors (e.g., $\mathbf{x}$), italic lower case letters to denote scalars (e.g., $x$), normal text to denote random variables (e.g., $\mathrm{x}$), and bold normal text to denote random vectors (e.g., $\mathbf{x}$). Let $T_1, \ldots, T_M$ be $M$ training tasks, following some task distribution $\mathcal{P}$. In training task $T_i$, we observe a set of $n_i$ independent feature and label pairs, $(\mathbf{X}_{T_i}, \mathbf{y}_{T_i})$, where $\mathbf{X}_{T_i} = [\mathbf{x}^1_{T_i}, \ldots, \mathbf{x}^{d_i}_{T_i}] \in \mathbb{R}^{n_i \times d_i}$ is the $d_i$-dimensional real-valued feature matrix, $\mathbf{x}^j_{T_i} \in \mathbb{R}^{n_i}$ is the $j$-th feature of task $T_i$, and $\mathbf{y}_{T_i} \in \{0, 1\}^{n_i}$ is the binary label vector. We assume that the label is binary for simplicity of presentation; our proposed model can be trivially extended to multiclass classification problems. We use $P_{T_i}$ to denote the joint distribution of $(\mathbf{x}_{T_i}, \mathrm{y}_{T_i})$.

4. DISTRIBUTION EMBEDDING NETWORK

To motivate our proposal, we first consider the problem of minimizing the risk on a single task $T$:
$$\theta^*_T = \arg\min_{\theta \in \Theta} \mathbb{E}_{(\mathbf{x}_T, \mathrm{y}_T) \sim P_T}\left[L(f(\mathbf{x}_T; \theta), \mathrm{y}_T)\right],$$
where $f(\cdot\,; \theta)$ is a model with parameter $\theta$ and $L$ is the loss function.

Lemma 1. Assume the joint distribution $P_T$ has a probability density (mass) function $q(\cdot\,; \eta_T)$. Then the optimizer $\theta^*_T$ is of the form $\phi^*_{L,f,q}(\eta_T)$, where $\phi^*_{L,f,q}$ is some deterministic function depending on the loss $L$, the model $f$ and the density $q$.

The proof of Lemma 1 can be found in Appendix A. It suggests that, when the joint distribution $P_T$ is in a parametric family, the dependence of $\theta^*_T$ on the task $T$ is through two parts: the functional form of the density $q$ and the distribution parameter $\eta_T$. Now, if the joint distributions of all training tasks $\{T_1, \ldots, T_M\}$ are in the same parametric family, then there exists a common function $\phi^*_{L,f,q}$ such that $\theta^*_{T_i} = \phi^*_{L,f,q}(\eta_{T_i})$ for every $i \in \{1, \ldots, M\}$. Hence, we may reparameterize $f(\mathbf{x}_T; \phi^*_{L,f,q}(\eta_T))$ as $f(\mathbf{x}_T, \eta_T; \gamma)$, and learn the new model parameter $\gamma$ from observations $(\mathbf{X}_{T_i}, \mathbf{y}_{T_i})$ and some estimate $\hat{\eta}_{T_i}$. One advantage of this reparameterization is that the optimal choice of $\gamma$ is task-independent, so, once learned from training tasks, it can be transferred to new, unseen tasks whose data distribution falls in the same parametric family. Moreover, since the data distributions of all tasks are in the same parametric family, we can learn a shared model to estimate $\eta_T$ for all $T$.

In practice, the data distribution can vary greatly across tasks. Hence, for each task $T$, we propose to apply a transformation $c_T$ to the original features and work with $(c_T(\mathbf{x}_T), \mathrm{y}_T)$, so that the transformed data of all tasks (approximately) share a common distribution family. To summarize, our proposal, DEN, consists of three building blocks. First, we apply a transformation layer $c_T$ to the original features. Second, we use a distribution embedding module to obtain a distribution embedding $\mathbf{s}_T$ of the transformed data $(c_T(\mathbf{x}_T), \mathrm{y}_T)$; this corresponds to the distribution parameter $\eta_T$ discussed above. Finally, we use a classification module to output a prediction based on the transformed features $c_T(\mathbf{x}_T)$ and the distribution embedding $\mathbf{s}_T$. The last two modules are shared across tasks.
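As a toy illustration of this reparameterization (our own example, not from the paper): for one-dimensional Gaussian class conditionals with equal variance, the Bayes-optimal posterior is a single fixed function of the example $x$ and the task parameters $\eta = (\mu_0, \mu_1, \sigma, \pi_1)$, so no task-specific weights are needed once $\eta$ is known. A minimal NumPy sketch:

```python
import numpy as np

def bayes_posterior(x, eta):
    """For 1-d Gaussian class conditionals with equal variance sigma, the
    Bayes rule P(y=1 | x) is one fixed function of the example x and the
    task parameters eta = (mu0, mu1, sigma, pi1): this is the
    task-independent form f(x, eta; gamma) discussed in the text."""
    mu0, mu1, sigma, pi1 = eta

    def unnormalized(x, mu, prior):
        # Gaussian likelihood (up to a shared constant) times class prior.
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) * prior

    a = unnormalized(x, mu1, pi1)
    b = unnormalized(x, mu0, 1.0 - pi1)
    return a / (a + b)

# Two tasks with different parameters, handled by one shared function.
p_task1 = bayes_posterior(0.0, (-1.0, 1.0, 1.0, 0.5))  # symmetric point
p_task2 = bayes_posterior(2.0, (0.0, 2.0, 1.0, 0.5))   # x at the class-1 mean
```

Here the "classifier" carries no per-task weights at all; everything task-specific enters through `eta`, mirroring how DEN routes task identity through the distribution embedding.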

4.1. LEARNING THE DISTRIBUTION EMBEDDING

To illustrate the idea of learning a compact distribution embedding, we first consider the special case where $\mathbf{x}_T \mid (\mathrm{y}_T = k) \sim N_d(\boldsymbol{\mu}_{T,k}, \boldsymbol{\Sigma}_{T,k})$ for $k \in \{0, 1\}$. Under the normality assumption, $P_T$ is uniquely characterized by its first moment vectors $\mathbb{E}[\mathbf{x}_T \mid \mathrm{y}_T = k]$, second moment matrices $\mathbb{E}[\mathbf{x}_T \mathbf{x}_T^\top \mid \mathrm{y}_T = k]$, and the positive label rate $\mathbb{P}(\mathrm{y}_T = 1)$. Given an i.i.d. sample $(\mathbf{X}_T, \mathbf{y}_T)$ with $\mathbf{X}_T := [\mathbf{x}^1_T, \ldots, \mathbf{x}^d_T] \in \mathbb{R}^{n \times d}$ and $\mathbf{y}_T \in \{0, 1\}^n$, we may estimate these quantities by sample moments, leading to the compact distribution embedding
$$\mathbf{s}_T = \left[\bar{\mathbf{x}}^1_{T,0}, \ldots, \bar{\mathbf{x}}^d_{T,0},\; \overline{\mathbf{x}^1_{T,0} \odot \mathbf{x}^1_{T,0}}, \ldots, \overline{\mathbf{x}^1_{T,0} \odot \mathbf{x}^d_{T,0}}, \ldots, \overline{\mathbf{x}^d_{T,0} \odot \mathbf{x}^d_{T,0}},\; \bar{\mathbf{x}}^1_{T,1}, \ldots, \bar{\mathbf{x}}^d_{T,1},\; \overline{\mathbf{x}^1_{T,1} \odot \mathbf{x}^1_{T,1}}, \ldots, \overline{\mathbf{x}^1_{T,1} \odot \mathbf{x}^d_{T,1}}, \ldots, \overline{\mathbf{x}^d_{T,1} \odot \mathbf{x}^d_{T,1}},\; \bar{\mathbf{y}}_T\right], \tag{2}$$
where $\mathbf{x}^j_{T,k}$ denotes the $j$-th feature vector of examples with label $k$, $\bar{\mathbf{x}}$ denotes the arithmetic average of the elements of a vector $\mathbf{x}$, and $\odot$ is the element-wise product. There are two notable characteristics of this distribution embedding. Firstly, every element of $\mathbf{s}_T$ is an average over examples within the same class. This motivates us to use a batch average layer in DEN when obtaining the distribution embedding. Secondly, the embedding $\mathbf{s}_T$ only involves quantities of single features and pairs of features. As discussed in Section 4.1.2, this allows us to decompose the whole distribution embedding vector of task-dependent dimension (e.g., $2d^2 + 2d + 1$ for Gaussian data) into sub-vectors of fixed dimension, opening up an opportunity to handle variable-length features using a Deep Sets architecture (Zaheer et al., 2017).
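For the Gaussian case, the embedding in (2) can be computed directly from a labeled batch. Below is a minimal NumPy sketch (the function name and layout are ours; for simplicity we stack all $d^2$ second moments per class rather than only the upper triangle):

```python
import numpy as np

def moment_embedding(X, y):
    """Per-class first and second sample moments plus the positive label
    rate, concatenated into one vector as in the Gaussian embedding (2)."""
    parts = []
    for k in (0, 1):
        Xk = X[y == k]                 # examples with label k
        first = Xk.mean(axis=0)        # d first moments
        # d x d matrix of averaged element-wise products (second moments).
        second = (Xk[:, :, None] * Xk[:, None, :]).mean(axis=0)
        parts.append(np.concatenate([first, second.ravel()]))
    parts.append(np.array([y.mean()])) # positive label rate
    return np.concatenate(parts)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)
s = moment_embedding(X, y)
# Length is 2 * (d + d^2) + 1 = 25 for d = 3.
```

Every entry is a within-class batch average, which is exactly what the batch average layer in DEN computes.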

4.1.1. NON-GAUSSIAN FEATURES

In practice, the normality assumption is usually overly simplistic. To address this issue, one simple strategy is to first apply a transformation $c_T: \mathbb{R}^d \to \mathbb{R}^d$ on each task so that $c_T(\mathbf{x}_T) \mid (\mathrm{y}_T = k)$ follows a multivariate normal distribution. However, such transformations are not guaranteed to exist, and it is challenging to construct them analytically even when they do. Instead of analytically constructing the transformations $c_T$, we propose to learn a piecewise linear function (PLF) $c^j_T: \mathbb{R} \to \mathbb{R}$ for each feature $j$ and task $T$, i.e.,
$$z^j_T := c^j_T(x; \mathbf{k}^j_T, \boldsymbol{\alpha}^j_T) = \sum_{i=1}^{K-1} \left[\alpha^j_{T,i} + \frac{x - k^j_{T,i}}{k^j_{T,i+1} - k^j_{T,i}} \left(\alpha^j_{T,i+1} - \alpha^j_{T,i}\right)\right] \mathbf{1}\left\{k^j_{T,i} \le x \le k^j_{T,i+1}\right\}, \tag{3}$$
where $\mathbf{k}^j_T := [k^j_{T,1}, \ldots, k^j_{T,K}] \in \mathbb{R}^K$ with $k^j_{T,1} < k^j_{T,2} < \cdots < k^j_{T,K}$ is the vector of predetermined keypoints that spans the domain of $\mathbf{x}^j_T$ (for instance, the keypoints can be chosen as quantiles of the distribution of $\mathbf{x}^j_T$), and $\boldsymbol{\alpha}^j_T := [\alpha^j_{T,1}, \ldots, \alpha^j_{T,K}] \in \mathbb{R}^K$ is the parameter vector of the PLF, characterizing its output at each keypoint. The PLF can optionally be constrained to be monotonic, with $\alpha^j_{T,i} \le \alpha^j_{T,i+1}$ for all $i = 1, \ldots, K-1$; this serves as a regularization and can be enforced through parameter projections during training, e.g., via projected SGD. PLFs implement compact one-dimensional non-linearities that can be learned from a small sample. They are universal approximators in this space: with enough keypoints, they can approximate any bounded continuous function.

Although features transformed by PLFs are not guaranteed to satisfy the normality assumption, PLFs make it possible for the transformed features to (approximately) belong to the same parametric family for all tasks. Since this parametric family is not necessarily normal, the sample moments used in (2) are no longer appropriate. Fortunately, the next lemma indicates that, as long as this parametric family is an exponential family, the distribution embedding can still be chosen in an average form.

Lemma 2. Let $Q$ be a probability distribution in an exponential family with density $q(u; \eta) := B(u) \exp[\lambda(\eta) \cdot S(u) - A(\eta)]$. Let $u_1, \ldots, u_n \overset{\mathrm{i.i.d.}}{\sim} Q$. Then $\sum_{i=1}^{n} S(u_i)$ is a sufficient statistic for $\eta$.

According to Lemma 2, if the conditional distribution $P_T(\mathbf{z}_T \mid \mathrm{y}_T = k)$ belongs to the same exponential family for all tasks $T$ and $k \in \{0, 1\}$, then there exists a task-independent function $S$ such that the average $\overline{S(\mathbf{z}_{T,k})}$ is sufficient for $P_T(\mathbf{z}_T \mid \mathrm{y}_T = k)$. Consequently, we may use $\mathbf{s}_T := [\overline{S(\mathbf{z}_{T,0})}, \overline{S(\mathbf{z}_{T,1})}, \bar{\mathbf{y}}_T]$ as a distribution embedding of the joint probability $P_T$, where $S$ can be encoded as a neural network.
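The PLF in (3) is simply linear interpolation between keypoints, so a forward pass can be sketched with `np.interp`. This is a NumPy sketch of the transform only, not its training; choosing keypoints as feature quantiles follows the suggestion above, and clamping out-of-range inputs to the boundary outputs is our own simplification:

```python
import numpy as np

def plf(x, keypoints, alphas):
    """Piecewise linear function c(x): linear interpolation between
    (keypoints[i], alphas[i]). np.interp clamps inputs outside the
    keypoint range to the boundary outputs (our simplification)."""
    return np.interp(x, keypoints, alphas)

rng = np.random.default_rng(1)
feature = rng.exponential(size=1000)          # a skewed, non-Gaussian feature
# Keypoints chosen as quantiles of the feature distribution (K = 10).
k = np.quantile(feature, np.linspace(0.0, 1.0, 10))
alpha = np.linspace(0.0, 1.0, 10)             # monotone: alpha_1 <= ... <= alpha_K
z = plf(feature, k, alpha)
```

With monotone `alpha`, the transform is order-preserving, matching the optional monotonicity constraint; in DEN the `alpha` vector is the per-task learnable parameter.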

4.1.2. VARIABLE-LENGTH FEATURES

To handle variable-length features, instead of learning the whole distribution embedding $\mathbf{s}_T$ directly, we decompose the joint distribution of $(\mathbf{z}_T, \mathrm{y}_T)$ into smaller pieces: the conditional distributions of $(\mathrm{z}^{i_1}_T, \ldots, \mathrm{z}^{i_r}_T) \mid \mathrm{y}_T$ for all $r$-subsets $\{i_1, \ldots, i_r\} \subset \{1, \ldots, d\}$, and the marginal distribution of $\mathrm{y}_T$, where $r$ is a hyperparameter shared across tasks. The optimal choice of $r$ depends on the distribution of $(\mathbf{z}_T, \mathrm{y}_T)$. For example, as discussed around (2), when $\mathbf{z}_T$ follows a multivariate normal distribution conditional on $\mathrm{y}_T$, $r = 2$ suffices to characterize the joint distribution of $(\mathbf{z}_T, \mathrm{y}_T)$. We use $r = 2$ in all numerical studies in Section 5. Following the discussion of Lemma 2 in Section 4.1.1, we use a learnable model $g$, shared across tasks, to derive a distribution embedding vector for each subset $\{i_1, \ldots, i_r\} \subset \{1, \ldots, d\}$:
$$\mathbf{s}^{i_1,\ldots,i_r}_T := \left[\overline{g\left(\mathbf{z}^{i_1}_{T,0}, \ldots, \mathbf{z}^{i_r}_{T,0}\right)},\; \overline{g\left(\mathbf{z}^{i_1}_{T,1}, \ldots, \mathbf{z}^{i_r}_{T,1}\right)},\; \bar{\mathbf{y}}_T\right], \tag{4}$$
where the average is taken with respect to a training batch during training, or the support set during testing; we discuss this in more detail in Section 4.3. An alternative formulation is to directly embed the joint distribution of $(\mathrm{z}^{i_1}_T, \ldots, \mathrm{z}^{i_r}_T, \mathrm{y}_T)$:
$$\mathbf{s}^{i_1,\ldots,i_r}_T = \overline{g\left(\mathbf{z}^{i_1}_T, \ldots, \mathbf{z}^{i_r}_T, \mathbf{y}_T\right)}. \tag{5}$$
Both formulations (4) and (5) work well in the numerical studies presented in Section 5.
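A NumPy sketch of the joint formulation (5) with $r = 2$: each pair of transformed features is passed, together with the label, through a shared map $g$ and averaged over the batch. Here $g$ is a fixed stand-in for the learned network, and all names are our own:

```python
import numpy as np
from itertools import combinations

def g(u):
    """Stand-in for the learned embedding network g: a fixed feature map
    producing a 4-dim vector per example (illustrative only)."""
    return np.stack([u[:, 0], u[:, 1], u[:, 0] * u[:, 1], u[:, -1]], axis=1)

def joint_subset_embeddings(Z, y, r=2):
    """Formulation (5): for each r-subset of features, apply the shared
    map g to (z_i1, ..., z_ir, y) and average over the batch."""
    n, d = Z.shape
    emb = {}
    for subset in combinations(range(d), r):
        u = np.column_stack([Z[:, list(subset)], y])
        emb[subset] = g(u).mean(axis=0)   # batch average layer
    return emb

rng = np.random.default_rng(2)
Z = rng.normal(size=(64, 5))                       # transformed batch, d = 5
y = rng.integers(0, 2, size=64).astype(float)
emb = joint_subset_embeddings(Z, y)
# One fixed-size embedding per pair of features: C(5, 2) = 10 subsets.
```

Because each sub-embedding has a fixed dimension regardless of $d$, the same $g$ applies unchanged to tasks with different numbers of features.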

4.2. CLASSIFICATION WITH DISTRIBUTION EMBEDDING

With a distribution embedding $\mathbf{s}^{i_1,\ldots,i_r}_T$ for each subset $\{i_1, \ldots, i_r\}$, we group the embedding $\mathbf{s}^{i_1,\ldots,i_r}_T$ with its associated transformed features $(\mathrm{z}^{i_1}_T, \ldots, \mathrm{z}^{i_r}_T)$, and then use a Deep Sets structure (Zaheer et al., 2017) over the set of all $r$-subsets of features:
$$\hat{y} = \psi\left(\binom{d}{r}^{-1} \sum_{\{i_1,\ldots,i_r\} \subset \{1,\ldots,d\}} h\left(\mathrm{z}^{i_1}_T, \ldots, \mathrm{z}^{i_r}_T, \mathbf{s}^{i_1,\ldots,i_r}_T\right)\right), \tag{6}$$
which accommodates a variable number of $h$ terms, and in turn a variable number of input features. Based on Theorem 7 in Zaheer et al. (2017), (6) can approximate any continuous function that is permutation invariant in the inputs of $h$. Note that $\mathbf{s}^{i_1,\ldots,i_r}_T$ is an average embedding, and all examples within the same batch/task share the same $\mathbf{s}^{i_1,\ldots,i_r}_T$ (see Section 4.3 for details).
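Equation (6) can be sketched as follows, with small stand-ins for the learned networks $h$ and $\psi$ (all function names and dimensions here are our own illustrative choices):

```python
import numpy as np
from itertools import combinations

def h(z_pair, s):
    """Stand-in for the learned per-subset network h (illustrative)."""
    return np.tanh(np.concatenate([z_pair, s]))

def psi(v):
    """Stand-in for the learned output network psi: logistic of a sum."""
    return 1.0 / (1.0 + np.exp(-v.sum()))

def deep_sets_predict(z, subset_embeddings, r=2):
    """Equation (6): average h over all r-subsets, then apply psi."""
    subsets = list(combinations(range(len(z)), r))
    pooled = np.mean([h(z[list(s)], subset_embeddings[s]) for s in subsets],
                     axis=0)
    return psi(pooled)

rng = np.random.default_rng(3)
d = 5
z = rng.normal(size=d)                     # one PLF-transformed example
embeds = {s: rng.normal(size=4) for s in combinations(range(d), 2)}
p = deep_sets_predict(z, embeds)           # a probability in (0, 1)
```

Because the pooling averages over all subsets, the number of $h$ evaluations grows and shrinks with $d$, which is what lets the same classifier handle variable-length inputs.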

4.3. TRAINING AND INFERENCE

Figure 1 shows a high-level summary of our model graph during training, where $T_1, \ldots, T_M$ are training tasks. In each gradient step, we first randomly sample a task $T_i \in \{T_1, \ldots, T_M\}$. We then sample two disjoint batches $A$ and $B$ of training features and labels from $(\mathbf{X}_{T_i}, \mathbf{y}_{T_i})$. The two batches of features are first transformed using the PLFs in (3). We then use (4) or (5) to obtain a distribution embedding, taking the average with respect to the sample in batch $B$. Next, we use the average distribution embedding to make predictions on batch $A$ using the Deep Sets formulation in (6). The use of two batches reflects the inference scenario, in which we use the support set (i.e., batch $B$) to obtain the distribution embedding of the task, with which we classify query set examples (i.e., batch $A$). This procedure is similar to the episode-based training introduced in Vinyals et al. (2016). Note that during training, $\mathbf{s}_{T_i}$ is identical across examples within the same batch, but it can vary (even within the same task) across batches.

If the PLFs are able to approximately transform the features into the same parametric family, then the rest of the network can be task-agnostic. Thus, after training on $T_1, \ldots, T_M$, for each new task $S$ we learn a new set of PLFs (3), while keeping the weights of the other layers fixed. Because PLFs have only a small number of parameters, they can be trained on a small support set $(\mathbf{X}_S, \mathbf{y}_S)$. During inference, we first use the learned PLFs (3) to transform features in both the support set and the query set. We then utilize the learned distribution embedding module to obtain $\mathbf{s}_S$, where the average in (4) or (5) is taken over the whole support set. Finally, the embedding $\mathbf{s}_S$ and the PLF-transformed query set features are used to classify query set examples using (6).
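The two-batch episode sampling described above can be sketched as follows (a NumPy skeleton of the sampling logic only; the model and the gradient update are omitted):

```python
import numpy as np

def sample_episode(n, batch_size, rng):
    """Return index sets of two disjoint batches from a task with n
    examples: batch B plays the support set, batch A the query set."""
    idx = rng.permutation(n)
    return idx[:batch_size], idx[batch_size:2 * batch_size]

rng = np.random.default_rng(4)
# Three synthetic training tasks: (features, labels).
tasks = [(rng.normal(size=(200, 4)), rng.integers(0, 2, size=200))
         for _ in range(3)]

for step in range(5):                        # a few illustrative steps
    X, y = tasks[rng.integers(len(tasks))]   # sample a training task
    A, B = sample_episode(len(y), 32, rng)
    # In DEN: transform X[A] and X[B] with the PLFs (3), compute the
    # distribution embedding (4)/(5) on batch B, predict on batch A
    # with (6), and take a gradient step on the loss (omitted here).
```

Sampling the two batches from a shared permutation guarantees they are disjoint, so the embedding batch never leaks labels of the examples being classified.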

5. NUMERICAL STUDIES

In this section, we numerically compare the two formulations of DEN, Conditional DEN (4) and Joint DEN (5) (with $r = 2$), with a range of baseline methods, including Prototypical Net (Snell et al., 2017), Relation Net (Sung et al., 2018) and MAML (Finn et al., 2017) applied to a ReLU-activated deep neural network (DNN). All hyperparameters are chosen based on cross-validation on training tasks, and are summarized in Appendix B. We also compare our method against DNNs trained directly on the support set (Direct DNN), whose hyperparameters were chosen based on cross-validation on the support set. We will make the code of the experiments of Section 5.1 publicly available upon acceptance of this paper.

5.1. GENERATE TRAINING TASKS THROUGH CONTROLLED SIMULATION

To generate training tasks, we adopt a novel approach. We take seven multiclass image classification datasets: CIFAR-10, CIFAR-100 (Krizhevsky, 2009), MNIST (LeCun et al., 2010), Fashion MNIST (Xiao et al., 2017), EMNIST (Cohen et al., 2017), Kuzushiji MNIST (KMNIST; Clanuwat et al., 2018) and Street View House Numbers (SVHN; Netzer et al., 2011). On each dataset, we pick nine equally spaced cutoffs and binarize the labels based on whether the class id is below the cutoff or not. This gives rise to nine binary classification tasks for each dataset, with positive label proportions in $\{0.1, 0.2, \ldots, 0.9\}$. In summary, we collect $7 \times 9 = 63$ binary classification tasks. To generate features for DEN, we build 50 image classifiers on each task $T_i \in \{T_1, \ldots, T_{63}\}$, and take their classification scores on the test set as features $\mathbf{x}_{T_i} \in (0, 1)^{50}$. The 50 convolutional image classifiers all have the structure
Conv(f, k) → MaxPool(p) → Conv(f, k) → MaxPool(p) → ... → Conv(f, k) → Dense(u) → Dense(u) → ... → Dense(1).
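The cutoff-based task simulation can be sketched as follows (a NumPy sketch; the function name is ours):

```python
import numpy as np

def binarize_tasks(class_ids, num_classes=10):
    """Turn one multiclass dataset into num_classes - 1 binary tasks by
    thresholding the class id at equally spaced cutoffs, as described
    in the text."""
    cutoffs = range(1, num_classes)   # cutoffs 1..9 for 10 classes
    return {c: (class_ids < c).astype(int) for c in cutoffs}

# Roughly uniform class ids give positive label proportions 0.1, ..., 0.9.
rng = np.random.default_rng(5)
ids = rng.integers(0, 10, size=100_000)
binary_tasks = binarize_tasks(ids)
props = [binary_tasks[c].mean() for c in sorted(binary_tasks)]
```

The same recipe applied to all seven image datasets yields the 63 training tasks used throughout Section 5.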

5.2. META-LEARNING ON MODEL AGGREGATION

In this section, we study the performance of DEN in aggregating the outputs of an ensemble of classifiers. We also study the impact of fine-tuning the PLFs on the performance of DEN. As discussed in Section 4, fine-tuning the PLFs is a powerful way to make DEN adapt to a wide variety of tasks with different input distributions. However, when the distributions of the training and target tasks are similar, fine-tuning the PLFs can introduce overfitting that outweighs its benefit, especially when the support set is small compared to the number of features. To obtain similar training and target tasks, we use the $5 \times 9 = 45$ tasks derived from CIFAR-10, CIFAR-100, MNIST, Fashion MNIST and EMNIST as training tasks. The entire training and fine-tuning process is repeated 5 times. For each test task, we report the average AUC score across the $5 \times 100$ aggregation sub-tasks and its estimated standard error. In Table 1, we present the results of aggregating $C = 25$ classifiers. The average and product rules are simple heuristics that make predictions based on the average and product of the 25 classification outputs, respectively. In Table 2, we show the results when the number of classifiers $C$ to be aggregated is sampled uniformly from $[13, 25]$ across those 500 trials. To allow baseline methods to take a varying number of features, we repeat and append a random subset of $(25 - C)$ features so that all inputs have 25 features. We observe that DEN significantly outperforms other methods in seven out of eight tasks. Tables 1 and 2 also show that in 15 out of 16 comparisons, DEN without fine-tuning is statistically no worse than DEN with fine-tuning on the PLF layer. This suggests that fine-tuning the PLF layer may not be necessary when the input distribution is similar among tasks.
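The feature-padding trick used for the baselines (repeating a random subset of $(25 - C)$ existing score columns so that every input has 25 features) can be sketched as:

```python
import numpy as np

def pad_features(X, target_dim, rng):
    """Pad an (n, C) matrix of classifier scores to target_dim columns by
    appending a random subset of (target_dim - C) existing columns, so
    that fixed-input baselines can handle a variable C. Sampling without
    replacement is valid here since target_dim - C <= C for C in [13, 25]."""
    n, C = X.shape
    extra = rng.choice(C, size=target_dim - C, replace=False)
    return np.concatenate([X, X[:, extra]], axis=1)

rng = np.random.default_rng(6)
scores = rng.uniform(size=(50, 13))     # C = 13 classifier outputs
padded = pad_features(scores, 25, rng)  # padded to 25 features
```

DEN itself needs no such padding, since the Deep Sets pooling in (6) accepts any number of features directly.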

5.3. META-LEARNING ON REAL DATASETS

Finally, we apply DEN to seven target tasks from three real datasets: Diamonds, Nomao and Puzzles. We give a short description of each dataset below, and list the features in Appendix D.

• With the Diamonds data (https://www.kaggle.com/shivam2503/diamonds), we use five features to classify whether the price of a diamond is above $5325, $2402, or $951. This results in three binary classification tasks with positive label proportions of 25%, 50%, and 75%, respectively.

• With the Nomao data (https://archive.ics.uci.edu/ml/datasets/Nomao), we use seven features to classify whether two business entities are identical. Positive examples account for 71% of the data.

• With the Puzzles data (https://www.kaggle.com/dbahri/puzzles), we use six features to classify whether the number of units sold for each puzzle in a six-month period is above 93, 45, or 24. The support set (with 155 puzzles) covers the first six-month period, whereas the query set (with 367 puzzles) covers the second and third six-month periods. The three cutoffs result in positive label proportions of 25%, 49% and 74% in the support set, and 23%, 51% and 73% in the query set.

Under review as a conference paper at ICLR 2021

In this study, we use all 63 tasks described in Section 5.1 during training. We train a single Joint DEN and a single Conditional DEN model for all seven tasks with $C \le 9$, and fine-tune the PLFs on 50 support set examples for the Diamonds and Nomao tasks, and on 155 support set examples for the Puzzles tasks. For the other methods, we train a different model for each dataset (three in total), using a suitable number of features. In the Nomao data, six out of seven features contain missing values. For DEN, we learn a missing value embedding via the PLF, whereas for the other methods, we add six missing value indicator features, each corresponding to a feature with missing values. We repeat the whole procedure 20 times with different random seeds for initialization and selection of training batches, and report the average AUC and standard error in Tables 3, 4 and 5.
DEN significantly outperforms other methods in six out of seven tasks. Compared with other methods, DEN is especially impressive in the Nomao task, in which all features are monotonic, and six of them include missing values. Direct DNN performs well in two of the Puzzles tasks with 155 support set examples, indicating that the benefit of DEN diminishes with a large support set. Note that fine-tuning the PLF greatly improved the performance of DEN in these seven tasks, which shows that fine-tuning the PLF is helpful when the training and target tasks have different distributions. 

6. CONCLUSION

In this paper, we presented a novel meta-learning algorithm that can be applied to settings where both the distribution and the number of input features vary across tasks; most other meta-learning techniques do not readily handle such settings. In numerical studies, we demonstrated that an application of the proposed method to binary classification problems outperforms several meta-learning baselines. The permutation invariance of the proposed structure is an interesting avenue to explore in future work; see Bloem-Reddy & Teh (2020) for a thorough discussion of this topic. Furthermore, a Set Transformer (Lee et al., 2019) could be used in place of the Deep Sets structure in the classifier block of DEN; we leave this comparison to future work. Our proposed method can be extended to multiclass classification by revising the formulations in (4) and (5). DEN can also be combined with optimization-based meta-learning methods, e.g., MAML, to further improve its performance. Finally, although convenient and intuitive, the one-dimensional PLFs are not able to account for associations between features; improving the first layer of DEN may lead to further performance gains.

A APPENDIX: PROOF OF LEMMAS

Proof of Lemma 1. For simplicity, we assume the joint distribution $P_T$ has probability density $q(\cdot\,; \eta)$. Define a map $\mathcal{L}: \Theta \times \mathbb{R}^r \to \mathbb{R}_+$ by
$$\mathcal{L}(\theta, \eta) := \mathbb{E}_{(\mathbf{x}, \mathrm{y}) \sim P_T}\left[L(f(\mathbf{x}; \theta), \mathrm{y})\right] = \int L(f(\mathbf{x}; \theta), y)\, q(\mathbf{x}, y; \eta)\, d\mathbf{x}\, dy,$$
where $\mathcal{L}$ depends on the loss $L$, the model $f$ and the density $q$. Then the problem becomes $\theta^* = \arg\min_{\theta \in \Theta} \mathcal{L}(\theta, \eta)$. Hence, $\theta^*$ is of the form $\phi^*_{L,f,q}(\eta)$.

Proof of Lemma 2. By independence, the joint density of $\{u_i\}_{i=1}^n$ is
$$\prod_{i=1}^n q(u_i; \eta) = \left[\prod_{i=1}^n B(u_i)\right] \exp\left[\lambda(\eta) \cdot \sum_{i=1}^n S(u_i) - nA(\eta)\right].$$
According to the Fisher-Neyman factorization theorem (Halmos & Savage, 1949), the statistic $s(U) := \sum_{i=1}^n S(u_i)$ is sufficient for $\eta$.

B APPENDIX: MODEL STRUCTURES AND HYPERPARAMETERS

In this section, we report the model structures and hyperparameters of all models considered in Section 5. We use cross-validation on the $5 \times 9$ tasks derived from CIFAR-10, CIFAR-100, MNIST, Fashion MNIST and EMNIST to tune the hyperparameters. Note that those 45 tasks were treated as training tasks in all experiments presented in Section 5. For DEN, Prototypical Net and Relation Net, we trained each model for 200 epochs using the Adam optimizer with batch size 256 and the TensorFlow default learning rate, 0.001. Their model-related hyperparameters were chosen based on 5-fold cross-validation: in each fold, we trained the model on all but one of the training datasets and evaluated it on the remaining one (for example, in one of the five folds, we trained the models on the $4 \times 9$ tasks derived from CIFAR-10, CIFAR-100, MNIST and Fashion MNIST, and evaluated them on the 9 tasks derived from EMNIST), feeding it 500 support examples; we then selected the model with the highest AUC score averaged over the 5 folds. For DNN with MAML, after training it on the 36 tasks, we fine-tuned the last layer of the trained DNN on a support set of size 500 for each validation task, then evaluated it on the remaining examples of that task. For the DNN model trained directly on the support set, its hyperparameters were selected on the support set of each test task, so they can vary from task to task; we only report its model structure since there are many test tasks.

Based on cross-validation, we arrived at the following model architectures, which we use across all of our experiments. We use m-Dense(u) to denote m consecutive ReLU-activated dense layers with u units each. With a slight abuse of notation, we use Dense(1) to denote a sigmoid-activated dense layer with 1 unit. We denote by BatchAvg an average layer over examples in a batch, and by PairAvg an average layer over pairs of features.

• Direct DNN: m-Dense(u) → Dense(1), where m and u are chosen based on the specific test task and can vary from task to task.

• Prototypical Net: the embedding model is 2-Dense(32).

Remark: during hyperparameter tuning, the maximum number of ReLU-activated dense layers we tried was 12 and the maximum number of units was 64 for all models. The proposed Joint DEN and Conditional DEN tend to prefer larger models than the other methods.

C APPENDIX: ADDITIONAL RESULTS FOR MODEL AGGREGATION EXPERIMENTS

In this section, we present results for Joint DEN and Conditional DEN with fine-tuning in model aggregation experiments. For comparison, we also report results for Direct DNN and DNN with MAML.

C.1 FIXED NUMBER OF FEATURES

We consider a support set of size 500. For Joint DEN and Conditional DEN, we use cross-validation on the support set to tune three hyperparameters: the number of tuning epochs, the initial learning rate and the batch size. For DNN with MAML, we also tune the number of fine-tuned layers (e.g., fine-tuning the last one or two dense layers), in addition to the above three hyperparameters.

C.2 VARIABLE NUMBER OF FEATURES

We repeat the procedures of the above section for the case when the number of features can vary.

D APPENDIX: DESCRIPTION OF REAL DATASETS

We use three datasets in our real data analysis.

• With the Diamonds data, we use numeric carat, cut in ordinal scale, color in ordinal scale, clarity in ordinal scale, numeric depth and numeric table to classify whether the price of a diamond is above $951, above $2402, or above $5325, which correspond to the 25% quantile, the median and the 75% quantile in the data. Four of the six features, carat, cut, color and clarity, are expected to be monotonic, whereas depth and table have unknown monotonicity.

• With the Puzzles data, we use has photo (whether the review has a photo), is amazon (whether the review was on Amazon), the number of times users found the review to be helpful, the total number of reviews, the age of the review and the number of words in the review to classify whether the number of units sold for the respective puzzle is above 24, above 45 or above 93. Most of the features have unknown monotonicity in their effect on the label.






We train DEN on the training tasks $\{T_1, \ldots, T_M\}$. Given a new task $S$ with a small set of labeled examples, we fine-tune the trained model on $S$ using these labeled examples. The final model is then applied to unlabeled examples in $S$ for classification. The set of labeled examples is called the support set and the set of unlabeled examples is called the query set.

Figure 1: Block diagram of the model graph during training. We resample the batches A and B and the task $T_i$ after each gradient step.

• Relation Net: the embedding model is 2-Dense(32), and the relation model is 2-Dense(16) → Dense(1).

• DNN + MAML: 4-Dense(64) → Dense(1), 1200 training batches of size 256 where each batch contains 10 tasks; the inner learning rate is 0.01, the outer learning rate is 0.005, and we fine-tune for 6 epochs.

• Joint DEN: PLF with 10 calibration keypoints; the distribution embedding model is 3-Dense(64) → BatchAvg, and the Deep Sets classification model is 3-Dense(64) → PairAvg → 3-Dense(64) → Dense(1).

• Conditional DEN: PLF with 10 calibration keypoints; the distribution embedding model is 4-Dense(16) → BatchAvg, and the Deep Sets classification model is 4-Dense(64) → PairAvg → 4-Dense(64) → Dense(1).

For the 50 image classifiers described in Section 5.1, the number of dense layers $d \in [1, 4]$, the number of convolutional layers $c \in [0, 3]$, the filters and units $f, u \in [2, 511]$, the kernel and pool sizes $k, p \in [2, 5]$, and the number of training epochs $e \in [1, 24]$ are uniformly sampled. Note that these 50 classifiers range from linear classifiers to ReLU-activated deep convolutional neural networks, with accuracies ranging from below 0.6 to over 0.99. Finally, to augment the training data, we apply sub-sampling during training. In each training step, after selecting a task and two disjoint batches of training examples as discussed in Section 4.3, we randomly pick $C < 50$ classifiers to construct a sub-task that aims to aggregate the classification scores of the $C$ classifiers. Here $C$ can vary across training steps, and thus so can the joint distribution of the resulting sub-task.

We share the PLFs during training across tasks for simplicity. We then pick four test tasks from SVHN and KMNIST (two from each dataset) of different difficulties; the average AUCs across the 50 classifiers are 68.28%, 78.11%, 91.51%, and 87.58%, respectively. Given a test task, we randomly select 100 sets of $C$ classifiers among the 50 candidate classifiers, which results in 100 aggregation sub-tasks for each of the four test tasks. For each aggregation sub-task, we form a support set with 50 labeled examples and a disjoint query set with 8000 examples. Direct DNN is trained directly on the 50 support set examples. For DNN with MAML, we fine-tune the last DNN layer for 5 epochs on the support set. For DEN, we either fine-tune the first layer of $C$ PLFs for 10 epochs, or take the $C$ PLFs directly from the last training epoch without fine-tuning.

Table 1: Test % AUC (standard error) when aggregating 25 classifiers. Bold fonts indicate the best method at the 95% significance level.

Table 2: Test % AUC (standard error) when aggregating a variable number of classifiers. Bold fonts indicate the best method at the 95% significance level.

Table 3: Test % AUC (standard error) on the Diamonds data. We binarize price at $951, $2402 and $5325 to generate three tasks. Bold fonts indicate the best method at the 95% significance level.

Table 4: Test % AUC (standard error) on the Nomao data. Bold fonts indicate the best method at the 95% significance level.

Table 5: Test % AUC (standard error) on the Puzzles data. We binarize the number of units sold at 24, 45 and 93 to generate three tasks. Bold fonts indicate the best method at the 95% significance level.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747, 2017.

Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R. Salakhutdinov, and Alexander J. Smola. Deep sets. In Advances in Neural Information Processing Systems 30, pp. 3391-3401, 2017.

Test % AUC (standard error) when aggregating 25 classifiers with 500 support examples. Bold fonts indicate the best method at the 95% significance level.

Test % AUC (standard error) when aggregating a variable number of classifiers with 500 support examples.

With the Nomao data, we use the fax trigram similarity score, street number trigram similarity score, phone trigram similarity score, clean name trigram similarity score, geocoder input address trigram similarity score, coordinates longitude trigram similarity score and coordinates latitude trigram similarity score to classify whether two businesses are identical. All of the features are outputs from other models, and are expected to be monotonic. Six of the seven features have missing values: the fax trigram similarity score is missing 97% of the time; the phone trigram similarity score is missing 58% of the time; the street number trigram similarity score is missing 35% of the time; the geocoder input address trigram similarity score is missing 0.1% of the time; and both the coordinates longitude and coordinates latitude trigram similarity scores are missing 55% of the time.

