CHAMELEON: LEARNING MODEL INITIALIZATIONS ACROSS TASKS WITH DIFFERENT SCHEMAS

Abstract

Parametric models, and particularly neural networks, require weight initialization as a starting point for gradient-based optimization. Recent work shows that an initial parameter set can be learned from a population of supervised learning tasks that enables fast convergence on unseen tasks even when only a handful of instances is available (model-agnostic meta-learning). Currently, methods for learning model initializations are limited to a population of tasks sharing the same schema, i.e., the same number, order, type, and semantics of predictor and target variables. In this paper, we address the problem of meta-learning weight initializations across tasks with different schemas, for example, when the number of predictors varies across tasks while the tasks still share some variables. We propose Chameleon, a model that learns to align different predictor schemas to a common representation. In experiments on 23 datasets of the OpenML-CC18 benchmark, we show that Chameleon can successfully learn parameter initializations across tasks with different schemas, presenting, to the best of our knowledge, the first cross-dataset few-shot classification approach for unstructured data.

1. INTRODUCTION

Humans require only a few examples to correctly classify new instances of previously unknown objects. For example, it is sufficient to see a handful of images of a specific type of dog before being able to classify dogs of this type consistently. In contrast, deep learning models optimized in a classical supervised setup usually require a vast number of training examples to match human performance. A striking difference is that a human has already learned to classify countless other objects, whereas the parameters of a neural network are typically initialized randomly. Previous approaches improved this starting point for gradient-based optimization by choosing a more robust random initialization (He et al., 2015) or by starting from a pretrained network (Pan & Yang, 2010). Still, models do not learn from only a handful of training examples even when applying these techniques. Moreover, established hyperparameter optimization methods (Schilling et al., 2016) are not capable of optimizing the model initialization due to the high-dimensional parameter space.

Few-shot classification aims at correctly classifying unseen instances of a novel task when only a few labeled training instances are given. This is typically accomplished by meta-learning across a set of training tasks, each consisting of training and validation examples with given labels for a set of classes. The field has gained immense popularity among researchers after recent meta-learning approaches showed that it is possible to learn a weight initialization across different tasks that facilitates faster convergence and thus enables classifying novel classes after seeing only a few instances (Finn et al., 2018). However, training a single model across different tasks is only feasible if all tasks share the same schema, meaning that all instances share one set of features in identical order.
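The weight-initialization meta-learning referred to above can be sketched with a Reptile-style outer update, which repeatedly moves a shared initialization toward the weights obtained after a few gradient steps on a sampled task. The toy quadratic task losses, step sizes, and number of steps below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def adapt(theta, task_optimum, inner_steps=5, inner_lr=0.1):
    """Run a few SGD steps on one task, starting from the shared init theta.

    Toy task loss: ||w - task_optimum||^2, so the gradient is 2*(w - optimum).
    """
    w = theta.copy()
    for _ in range(inner_steps):
        grad = 2.0 * (w - task_optimum)
        w -= inner_lr * grad
    return w

def reptile(theta, tasks, meta_lr=0.5, epochs=100):
    """Move the shared initialization toward each task's adapted weights."""
    for _ in range(epochs):
        task = tasks[rng.integers(len(tasks))]
        w_adapted = adapt(theta, task)
        theta += meta_lr * (w_adapted - theta)  # Reptile outer update
    return theta

# task optima cluster around 1.0, so the learned init should land near 1.0,
# giving every task a good starting point for fast adaptation
tasks = [np.full(3, 1.0 + 0.1 * i) for i in range(-2, 3)]
theta = reptile(np.zeros(3), tasks)
```

The key point is that the learned initialization sits close to all task optima, so a handful of gradient steps suffices on a new task drawn from the same population.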
For that reason, most approaches demonstrate their performance on image data, which can easily be scaled to a fixed shape, whereas transforming unstructured data to a uniform schema is not trivial. We want to extend popular approaches to operate invariant of schema, i.e., independent of feature order and shape, making it possible to use meta-learning approaches on unstructured data with varying feature spaces, e.g., learning a model from heart disease data that can accurately classify a few-shot task for diabetes detection that relies on similar features. Thus, we require a schema-invariant encoder that maps heart disease and diabetes data to one feature representation, which can then be used to train a single model via popular meta-learning algorithms like REPTILE (Nichol et al., 2018b). We propose a set-wise feature transformation model called CHAMELEON, named after a reptile capable of adjusting its colors according to the environment in which it is located. CHAMELEON projects different schemas to a fixed input space while keeping features from different tasks that are of the same type or distribution in the same position, as illustrated in Figure 1. Our model learns to compute a task-specific reordering matrix that, when multiplied with the original input, aligns the schema of unstructured tasks to a common representation while behaving invariant to the order of input features.

Our main contributions are as follows: (1) We show how our proposed method CHAMELEON can learn to align varying feature spaces to a common representation. (2) We propose the first approach to tackle few-shot classification for tasks with different schemas. (3) In experiments on 23 datasets of the OpenML-CC18 benchmark collection (Bischl et al., 2017), we demonstrate how current meta-learning approaches can successfully learn a model initialization across tasks with different schemas as long as they share some variables with respect to their type or semantics. (4) Although an alignment makes little sense on top of structured data such as images, which can easily be rescaled, we demonstrate how CHAMELEON can align latent embeddings of two image datasets generated with different neural networks.
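The task-specific reordering idea can be sketched as follows. Each input feature is described by order-invariant statistics, scored against K fixed output positions by a shared (trainable) weight matrix, and softly assigned via a softmax; multiplying the input with the resulting matrix casts any schema to a fixed-width representation. The descriptor statistics, shapes, and column-wise softmax are our illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

def feature_stats(X):
    """Order-invariant per-feature descriptors (mean, std, min, max)."""
    return np.stack([X.mean(0), X.std(0), X.min(0), X.max(0)], axis=1)

def alignment_matrix(X, W, K):
    """Score every input feature against each of K target positions and
    normalize per feature with a softmax, yielding a soft (F_task x K)
    reordering matrix."""
    scores = feature_stats(X) @ W                      # (F_task, K)
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# two tasks from the same domain with different schemas (4 vs. 6 features)
X_a = rng.normal(size=(32, 4))
X_b = rng.normal(size=(32, 6))
K = 8                               # fixed size of the shared latent schema
W = rng.normal(size=(4, K))         # shared scoring weights (4 = #descriptors)

# both tasks are cast to the same (n_samples, K) representation
Z_a = X_a @ alignment_matrix(X_a, W, K)
Z_b = X_b @ alignment_matrix(X_b, W, K)
```

Because the scoring operates per feature on order-invariant statistics, permuting the input columns permutes the rows of the alignment matrix accordingly, and the aligned output stays unchanged.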

2. RELATED WORK

Our goal is to extend recent few-shot classification approaches that make use of optimization-based meta-learning by adding a feature-alignment component that casts different inputs to a common schema, presenting the first approach that works across tasks with different schemas. In this section, we discuss various works related to our approach.

Research on transfer learning (Pan & Yang, 2010; Sung et al., 2018; Gligic et al., 2020) has shown that training a model on different auxiliary tasks before actually fitting it to the target problem can provide better results if training data is scarce. Motivated by this, few-shot learning approaches try to generalize to novel tasks with unseen classes given only a few instances by first meta-learning across a set of training tasks (Duan et al., 2017; Finn et al., 2017b; Snell et al., 2017). A task τ consists of predictor data X_τ, a target Y_τ, and a predefined training/test split τ = (X_τ^train, Y_τ^train, X_τ^test, Y_τ^test). Typically, an N-way K-shot problem refers to a few-shot learning problem where each task consists of N classes with K training samples per class.

Heterogeneous transfer learning tackles a problem setting similar to the one described in this work. In contrast to regular transfer learning, the feature spaces of the auxiliary tasks and the actual task differ and are often non-overlapping (Day & Khoshgoftaar, 2017). Many approaches require co-occurrence data, i.e., instances that can be found in both datasets (Wu et al., 2019; Qi et al., 2011), rely on jointly optimizing separate models for each dataset to propagate information (Zhao & Hoi, 2010; Yan et al., 2016), or utilize meta-features (Feuz & Cook, 2015). Oftentimes, these approaches operate on structured data, e.g., images and text, with different data distributions for the tasks at hand (Li et al., 2019; He et al., 2019). These datasets can thus be embedded in a shared space with standard models such as convolutional neural networks and transformer-based language models. However, none of these approaches are capable of training a single encoder that operates across a meta-dataset of tasks with different schemas for unstructured data.

Figure 1: Chameleon Pipeline: Chameleon aims to encode tasks with different schemas to a shared representation with a uniform feature space, which can then be processed by any classifier. The left block represents tasks of the same domain with different schemas. The middle represents the aligned features in a fixed schema.
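The N-way K-shot setup described in this section can be sketched as an episode sampler that draws N classes and splits their instances into K training shots and a query set per class. The dataset format and split sizes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_episode(X, y, n_way=5, k_shot=1, k_query=5):
    """Sample one few-shot task: N classes with K train and k_query test shots each."""
    classes = rng.choice(np.unique(y), size=n_way, replace=False)
    X_tr, y_tr, X_te, y_te = [], [], [], []
    for label, c in enumerate(classes):
        # shuffle this class's instances and take K + k_query of them
        idx = rng.permutation(np.flatnonzero(y == c))[: k_shot + k_query]
        X_tr.append(X[idx[:k_shot]]); y_tr += [label] * k_shot
        X_te.append(X[idx[k_shot:]]); y_te += [label] * k_query
    return (np.concatenate(X_tr), np.array(y_tr),
            np.concatenate(X_te), np.array(y_te))

# toy dataset: 10 classes, 20 instances each, 8 features
y = np.repeat(np.arange(10), 20)
X = rng.normal(size=(200, 8))
X_tr, y_tr, X_te, y_te = sample_episode(X, y, n_way=5, k_shot=1, k_query=5)
```

Meta-learning then iterates over many such episodes, adapting on the (X_tr, y_tr) split and evaluating on the (X_te, y_te) split of each sampled task.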

