DISTRIBUTION EMBEDDING NETWORK FOR META-LEARNING WITH VARIABLE-LENGTH INPUT

Abstract

We propose Distribution Embedding Network (DEN) for meta-learning, which is designed for applications where both the data distribution and the number of features could vary across tasks. DEN first transforms features using a learned piecewise linear function, then learns an embedding of the transformed data distribution, and finally classifies examples based on the distribution embedding. We show that the parameters of the distribution embedding and the classification modules can be shared across tasks. We propose a novel methodology to mass-simulate binary classification training tasks, and demonstrate that DEN outperforms existing methods in a number of test tasks in numerical studies.

1. INTRODUCTION

Deep learning has made substantial progress in a variety of tasks in image classification (e.g., He et al., 2016), object detection (e.g., Redmon & Farhadi, 2017; He et al., 2017), machine translation (e.g., Vaswani et al., 2017) and natural language understanding (e.g., Devlin et al., 2019). These achievements rely on efficient gradient-based optimization algorithms (e.g., Duchi et al., 2011; Sutskever et al., 2013; Kingma & Ba, 2015) as well as a large number of labeled examples to train highly flexible deep learning models. However, in many applications, it is prohibitively expensive or impossible to collect a large amount of labeled training data, calling for techniques that can learn from small labeled datasets. Meta-learning aims to tackle the small-data problem by training a model on labeled data from a number of related tasks, with the goal of learning a model that performs well on similar, but unseen, future tasks given only a small amount of labeled training data.

In this work, we propose a meta-learning model for classification using Distribution Embedding Networks (DEN). Unlike many existing meta-learning algorithms that assume a fixed feature set across tasks, DEN is designed for applications where both the distribution of features and the number of features could vary across tasks. For example, we may use DEN to learn the optimal aggregator of an ensemble of models, replacing the naive majority vote, where across aggregation tasks both the distribution of model outputs and the number of models in the ensemble may differ. At a high level, DEN first applies a learned feature transformation that maps features into the same distribution family across tasks. It then uses a neural network to learn an embedding of the transformed data distribution.
Finally, given the learned distribution embedding together with the transformed features, DEN classifies examples using a Deep Sets architecture (Zaheer et al., 2017), enabling it to handle variable-length inputs. To adapt the model to a new task, we update only the feature transformations, which have relatively few parameters.
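The three-stage pipeline above can be sketched in a few lines of NumPy. This is an illustrative assumption rather than the paper's exact architecture: the hinge-style piecewise linear calibrators, the tanh feature embedding, and the names `calibrate` and `deep_sets` are ours, chosen only to show how a per-feature transform followed by sum-pooling yields a classifier that is indifferent to both the order and the number of input features.

```python
import numpy as np

def calibrate(x, knots, weights, bias):
    """Piecewise linear feature transform (sketch of DEN's learned
    calibration): linear pieces hinged at the knots, so the slope
    changes by weights[k] at knots[k]."""
    return bias + np.maximum(0.0, x[..., None] - knots) @ weights

def deep_sets(features, phi_w, rho_w):
    """Permutation-invariant score over a variable-length feature set:
    embed each scalar feature with phi, sum-pool across features,
    then map the pooled vector to a score with rho."""
    per_feature = np.tanh(features[..., None] * phi_w)  # phi: (d, h)
    pooled = per_feature.sum(axis=-2)                   # sum over the d features
    return pooled @ rho_w                               # rho: scalar score

rng = np.random.default_rng(0)
knots, weights, bias = np.array([-1.0, 0.0, 1.0]), rng.normal(size=3), 0.1
phi_w, rho_w = rng.normal(size=8), rng.normal(size=8)

x3 = calibrate(rng.normal(size=3), knots, weights, bias)  # task with 3 features
x5 = calibrate(rng.normal(size=5), knots, weights, bias)  # task with 5 features
s3, s5 = deep_sets(x3, phi_w, rho_w), deep_sets(x5, phi_w, rho_w)
```

Because the pooling step sums over the feature axis, `phi_w` and `rho_w` are shared across tasks regardless of feature count, which is what lets DEN reuse its embedding and classification modules while only the calibration parameters are adapted per task.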

2. RELATED WORK

There are multiple generic techniques applied to the meta-learning problem in the literature. The first camp of approaches learns similarities between pairs of examples. When presented with a new task with a small set of labeled examples, these methods classify unlabeled data based on their similarities with the labeled ones. These methods include Matching Net (Vinyals et al., 2016) and Prototypical Net (Snell et al., 2017), which learn a distance metric between examples. Siamese Net (Koch et al., 2015) and Relation Net (Sung et al., 2018) use twin towers to learn the relationship between examples. Learn Net (Bertinetto et al., 2016) proposes having class-specific weights for

