DISTRIBUTION EMBEDDING NETWORK FOR META-LEARNING WITH VARIABLE-LENGTH INPUT

Abstract

We propose the Distribution Embedding Network (DEN) for meta-learning, designed for applications where both the data distribution and the number of features can vary across tasks. DEN first transforms features using a learned piecewise-linear function, then learns an embedding of the transformed data distribution, and finally classifies examples based on the distribution embedding. We show that the parameters of the distribution-embedding and classification modules can be shared across tasks. We propose a novel methodology to mass-simulate binary classification training tasks, and demonstrate in numerical studies that DEN outperforms existing methods on a number of test tasks.

1. INTRODUCTION

Deep learning has made substantial progress in a variety of tasks in image classification (e.g., He et al., 2016), object detection (e.g., Redmon & Farhadi, 2017; He et al., 2017), machine translation (e.g., Vaswani et al., 2017), and natural language understanding (e.g., Devlin et al., 2019). These achievements rely on efficient gradient-based optimization algorithms (e.g., Duchi et al., 2011; Sutskever et al., 2013; Kingma & Ba, 2015) as well as large numbers of labeled examples to train highly flexible deep learning models. However, in many applications it is prohibitively expensive or impossible to collect a large amount of labeled training data, calling for techniques that can learn from small amounts of labeled data. Meta-learning aims to tackle the small-data problem by training a model on labeled data from a number of related tasks, with the goal of learning a model that can perform well on similar but unseen future tasks given only a small amount of labeled training data. In this work, we propose a meta-learning model for classification, the Distribution Embedding Network (DEN). Unlike many existing meta-learning algorithms that assume a fixed feature set across tasks, DEN is designed for applications where both the distribution of features and the number of features can vary across tasks. For example, we may use DEN to learn the optimal aggregator of an ensemble of models to replace the naive majority vote, where in different aggregation tasks the distribution of model outputs and the number of models in the ensemble may differ. At a high level, DEN first applies a learned feature transformation that maps features into the same distribution family across tasks. It then uses a neural network to learn an embedding of the transformed data distribution.
Finally, given the learned distribution embedding, together with the transformed features, DEN classifies examples using a Deep Sets architecture (Zaheer et al., 2017), which enables it to handle variable-length inputs. To adapt the model to a new task, we only update the feature transformations, which involve relatively few parameters.

2. RELATED WORK

Several generic techniques have been applied to the meta-learning problem in the literature. The first camp of approaches learns similarities between pairs of examples: when presented with a new task with a small set of labeled examples, these methods classify unlabeled data based on their similarity to the labeled ones. They include Matching Net (Vinyals et al., 2016) and Prototypical Net (Snell et al., 2017), which learn a distance metric between examples. Siamese Net (Koch et al., 2015) and Relation Net (Sung et al., 2018) use twin towers to learn the relationship between examples, and Learn Net (Bertinetto et al., 2016) proposes class-specific weights for the towers. Satorras & Estrach (2018) learn the similarity metric using a graph neural network, and Transductive Propagation Network (Liu et al., 2019) classifies all unlabeled data at once by exploiting the manifold structure of the new class space.

The second camp, optimization-based meta-learning, aims to find a good starting-point model, so that when presented with a new task, the meta-model can quickly adapt to perform well on it with a small number of gradient steps. MAML (Finn et al., 2017) designs a learning algorithm such that the expected loss of the learned meta-model on new tasks after one gradient step is minimized. Meta-Learner LSTM (Ravi & Larochelle, 2017) replaces the classical gradient steps with learned gradient update weights, trained to minimize the validation loss. More recently, LEO (Rusu et al., 2019) extends MAML by using Relation Net to learn a low-dimensional latent embedding of the model parameters and performing optimization-based meta-learning in that latent space.

A third camp of methods uses internal or external memory for meta-learning, including MANN (Santoro et al., 2016) and Meta Net (Munkhdalai & Yu, 2017), which store past knowledge in external memory and internal model activations, respectively. New examples are classified by retrieving relevant information from the memory.

Our proposal takes none of these three routes. Rather, DEN is conceptually similar to topic modeling, which learns a latent context variable. For example, Neural Statistician (Edwards & Storkey, 2017) considers a hierarchical generative process and uses a variational autoencoder (Kingma & Welling, 2014) to learn, in an unsupervised fashion, a latent vector that summarizes the dataset. Similar proposals include the variational homoencoder (Hewitt et al., 2018) and CNP (Garnelo et al., 2018). In comparison, DEN is a supervised procedure that learns an embedding of the data distribution and then uses this embedding to classify unseen examples.

3. NOTATION

In this paper, we use bold upper-case letters to denote matrices (e.g., $\mathbf{X}$), bold lower-case letters to denote vectors (e.g., $\mathbf{x}$), italic lower-case letters to denote scalars (e.g., $x$), normal text to denote random variables (e.g., $\mathrm{x}$), and bold normal text to denote random vectors (e.g., $\mathbf{x}$). Let $T_1, \ldots, T_M$ be $M$ training tasks drawn from some task distribution $P$. In training task $T_i$, we observe a set of $n_i$ independent feature–label pairs $(\mathbf{X}_{T_i}, \mathbf{y}_{T_i})$, where $\mathbf{X}_{T_i} = [\mathbf{x}^{1}_{T_i}, \ldots, \mathbf{x}^{d_i}_{T_i}] \in \mathbb{R}^{n_i \times d_i}$ is the $d_i$-dimensional real-valued feature matrix, $\mathbf{x}^{j}_{T_i} \in \mathbb{R}^{n_i}$ is the $j$-th feature of task $T_i$, and $\mathbf{y}_{T_i} \in \{0, 1\}^{n_i}$ is the binary label vector. We assume binary labels for simplicity of presentation; the proposed model extends trivially to multiclass classification. We use $P_{T_i}$ to denote the joint distribution of $(\mathbf{x}_{T_i}, \mathrm{y}_{T_i})$.

4. DISTRIBUTION EMBEDDING NETWORK

To motivate our proposal, we first consider the problem of minimizing the risk on a single task $T$:
$$\hat{\theta}_T = \arg\min_{\theta \in \Theta} \mathbb{E}_{(\mathbf{x}_T, \mathrm{y}_T) \sim P_T}\big[L(f(\mathbf{x}_T; \theta), \mathrm{y}_T)\big],$$
where $f(\cdot\,; \theta)$ is a model with parameter $\theta$ and $L$ is the loss function.

Lemma 1. Assume the joint distribution $P_T$ has a probability density (mass) function $q(\cdot\,; \eta_T)$. Then the optimizer $\hat{\theta}_T$ is of the form $\phi^{*}_{L,f,q}(\eta_T)$, where $\phi^{*}_{L,f,q}$ is a deterministic function depending on the loss $L$, the model $f$, and the density $q$.

The proof of Lemma 1 can be found in Appendix A. The lemma suggests that, when the joint distribution $P_T$ lies in a parametric family, the dependence of $\hat{\theta}_T$ on the task $T$ is through two parts: the functional form of the density $q$ and the distribution parameter $\eta_T$.
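As a concrete illustration of Lemma 1 (our example, not from the paper): under squared loss with a linear model, the risk minimizer depends on $P_T$ only through its second moments, a special case of $\phi^{*}_{L,f,q}(\eta_T)$:

```latex
% Illustration (not from the paper): squared loss, linear model.
% With L(f, y) = (f - y)^2 and f(\mathbf{x}; \theta) = \theta^\top \mathbf{x},
\begin{align*}
\hat{\theta}_T
  &= \arg\min_{\theta} \;
     \mathbb{E}_{(\mathbf{x}_T, \mathrm{y}_T) \sim P_T}
     \big[(\theta^\top \mathbf{x}_T - \mathrm{y}_T)^2\big] \\
  &= \mathbb{E}\big[\mathbf{x}_T \mathbf{x}_T^\top\big]^{-1}\,
     \mathbb{E}\big[\mathbf{x}_T \mathrm{y}_T\big],
\end{align*}
% so \hat{\theta}_T depends on P_T only through the moments
% \eta_T = (\mathbb{E}[\mathbf{x}_T \mathbf{x}_T^\top],\,
%           \mathbb{E}[\mathbf{x}_T \mathrm{y}_T]).
```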

We first train DEN on the training tasks $\{T_1, \ldots, T_M\}$. Given a new task $S$ with a small set of labeled examples, we fine-tune the trained model on $S$ using these labeled examples, and then apply the final model to the unlabeled examples in $S$ for classification. The set of labeled examples is called the support set, and the set of unlabeled examples is called the query set.
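The train / fine-tune / classify protocol above can be sketched with a toy stand-in for DEN (a sketch under stated assumptions, not the paper's model): a shared, length-invariant scorer is meta-trained across simulated tasks, and only a per-task affine feature transform $x \mapsto ax + b$, playing the role of DEN's few fine-tuned transformation parameters, is updated on a new task's support set.

```python
import numpy as np

# Toy protocol sketch (ours, not the paper's exact model). The scorer
# averages a nonlinearity over features, so the SAME shared weights
# apply to tasks with different numbers of features.
rng = np.random.default_rng(0)

def make_task(n, d):
    # synthetic binary task: the label depends on the mean feature value
    X = rng.normal(size=(n, d))
    y = (X.mean(axis=1) > 0).astype(float)
    return X, y

def forward(X, a, b, w1, w0):
    m = np.tanh(a * X + b).mean(axis=1)          # length-invariant summary
    return 1.0 / (1.0 + np.exp(-(w1 * m + w0)))  # predicted P(y = 1)

def grads(X, y, a, b, w1, w0):
    # manual gradients of the mean logistic loss
    t = np.tanh(a * X + b)
    m = t.mean(axis=1)
    p = 1.0 / (1.0 + np.exp(-(w1 * m + w0)))
    ds = p - y                                   # dLoss/dscore per example
    sech2 = 1.0 - t ** 2
    da = np.mean(ds * (w1 * (sech2 * X).mean(axis=1)))
    db = np.mean(ds * (w1 * sech2.mean(axis=1)))
    return da, db, np.mean(ds * m), np.mean(ds)

# Meta-training: update ALL parameters on many simulated tasks,
# each with its own number of features.
a, b, w1, w0 = 1.0, 0.0, 1.0, 0.0
lr = 0.5
for _ in range(200):
    X, y = make_task(n=64, d=int(rng.integers(3, 10)))
    da, db, dw1, dw0 = grads(X, y, a, b, w1, w0)
    a, b, w1, w0 = a - lr * da, b - lr * db, w1 - lr * dw1, w0 - lr * dw0

# New task S: fine-tune ONLY the feature transform (a, b) on the
# labeled support set, then classify the unlabeled query set.
X_sup, y_sup = make_task(n=20, d=7)
X_qry, y_qry = make_task(n=200, d=7)
for _ in range(50):
    da, db, _, _ = grads(X_sup, y_sup, a, b, w1, w0)
    a, b = a - lr * da, b - lr * db

acc = np.mean((forward(X_qry, a, b, w1, w0) > 0.5) == (y_qry > 0.5))
print(f"query accuracy: {acc:.2f}")
```

The split between shared weights (`w1`, `w0`) and per-task transform parameters (`a`, `b`) mirrors the paper's design choice of adapting only the feature transformations at fine-tuning time.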


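To make the three-stage pipeline (piecewise-linear transform, distribution embedding, Deep Sets classification) concrete, here is a minimal forward-pass sketch with untrained random weights. This is our illustration only: the pooled statistics, layer sizes, and knot placement are assumptions, not the paper's design. The sum-pooling over features is what lets one set of parameters score tasks with different numbers of features.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 4   # distribution-embedding size (assumed)
H = 8   # hidden width of the per-feature network (assumed)

# Shared, task-independent weights (untrained here).
W_embed = rng.normal(size=(K, 4))   # pooled stats -> embedding
V = rng.normal(size=(H, 1 + K))     # per-feature phi, conditioned on embedding
u = rng.normal(size=H)              # linear head over pooled phi outputs

def transform(X, knots_x, knots_y):
    # Stage 1: piecewise-linear calibration (one shared calibrator here
    # for simplicity; DEN learns these transformations).
    return np.interp(X, knots_x, knots_y)

def embed(Z):
    # Stage 2: permutation-invariant embedding of the transformed
    # support-set distribution from simple pooled statistics.
    stats = np.concatenate([Z.mean(axis=0), Z.std(axis=0)])  # per-feature
    pooled = np.array([stats.mean(), stats.std(), stats.min(), stats.max()])
    return np.tanh(W_embed @ pooled)                         # shape (K,)

def score(z, e):
    # Stage 3: Deep Sets over features -- phi per feature value
    # (conditioned on the embedding e), sum-pool, then a linear head.
    inp = np.vstack([z, np.tile(e[:, None], (1, z.size))])   # (1+K, d)
    return float(u @ np.tanh(V @ inp).sum(axis=1))           # logit

knots_x = np.array([-3.0, 0.0, 3.0])
knots_y = np.array([-1.0, 0.0, 1.0])

# The SAME weights handle tasks with different feature counts:
logits = []
for d in (5, 8):
    X = rng.normal(size=(30, d))     # support set of a toy task
    Z = transform(X, knots_x, knots_y)
    e = embed(Z)
    logits.append(score(Z[0], e))
    print(d, logits[-1])
```

Because both the embedding and the scorer reduce over the feature axis, reordering the features of a task leaves the output unchanged, matching the permutation-invariance property of the Deep Sets construction.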