CHAMELEON: LEARNING MODEL INITIALIZATIONS ACROSS TASKS WITH DIFFERENT SCHEMAS

Abstract

Parametric models, and particularly neural networks, require weight initialization as a starting point for gradient-based optimization. Recent work shows that an initial parameter set can be learned from a population of supervised learning tasks that enables fast convergence on unseen tasks even when only a handful of instances is available (model-agnostic meta-learning). Currently, methods for learning model initializations are limited to a population of tasks sharing the same schema, i.e., the same number, order, type, and semantics of predictor and target variables. In this paper, we address the problem of meta-learning weight initializations across tasks with different schemas, for example, when the number of predictors varies across tasks while the tasks still share some variables. We propose Chameleon, a model that learns to align different predictor schemas to a common representation. In experiments on 23 datasets of the OpenML-CC18 benchmark, we show that Chameleon can successfully learn parameter initializations across tasks with different schemas, presenting, to the best of our knowledge, the first cross-dataset few-shot classification approach for unstructured data.

1. INTRODUCTION

Humans require only a few examples to correctly classify new instances of previously unknown objects. For example, it is sufficient to see a handful of images of a specific type of dog before being able to classify dogs of this type consistently. In contrast, deep learning models optimized in a classical supervised setup usually require a vast number of training examples to match human performance. A striking difference is that a human has already learned to classify countless other objects, while the parameters of a neural network are typically initialized randomly. Previous approaches improved this starting point for gradient-based optimization by choosing a more robust random initialization (He et al., 2015) or by starting from a pretrained network (Pan & Yang, 2010). Still, models do not learn from only a handful of training examples even when applying these techniques. Moreover, established hyperparameter optimization methods (Schilling et al., 2016) are not capable of optimizing the model initialization due to the high-dimensional parameter space. Few-shot classification aims at correctly classifying unseen instances of a novel task with only a few labeled training instances given. This is typically accomplished by meta-learning across a set of training tasks, which consist of training and validation examples with given labels for a set of classes. The field has gained immense popularity among researchers after recent meta-learning approaches have shown that it is possible to learn a weight initialization across different tasks, which facilitates faster convergence and thus enables classifying novel classes after seeing only a few instances (Finn et al., 2018). However, training a single model across different tasks is only feasible if all tasks share the same schema, meaning that all instances share one set of features in identical order.
For that reason, most approaches demonstrate their performance on image data, which can be easily scaled to a fixed shape, whereas transforming unstructured data to a uniform schema is not trivial. We want to extend popular approaches to operate invariant of schema, i.e., independent of order and shape, making it possible to use meta-learning approaches on unstructured data with varying feature spaces, e.g., learning a model from heart disease data that can accurately classify a few-shot task for diabetes detection that relies on similar features. Thus, we require a schema-invariant encoder that maps heart disease and diabetes data to one feature representation, which can then be used to train a single model via popular meta-learning algorithms like REPTILE (Nichol et al., 2018b). We propose a set-wise feature transformation model called CHAMELEON, named after a REPTILE capable of adjusting its colors according to the environment in which it is located. CHAMELEON projects different schemas to a fixed input space while keeping features from different tasks but of the same type or distribution in the same position, as illustrated by Figure 1. Our model learns to compute a task-specific reordering matrix that, when multiplied with the original input, aligns the schema of unstructured tasks to a common representation while behaving invariant to the order of input features. Our main contributions are as follows: (1) We show how our proposed method CHAMELEON can learn to align varying feature spaces to a common representation. (2) We propose the first approach to tackle few-shot classification for tasks with different schemas. (3) In experiments on 23 datasets of the OpenML-CC18 benchmark collection (Bischl et al., 2017), we demonstrate how current meta-learning approaches can successfully learn a model initialization across tasks with different schemas as long as they share some variables with respect to their type or semantics.
(4) Although an alignment makes little sense to be performed on top of structured data such as images which can be easily rescaled, we demonstrate how CHAMELEON can align latent embeddings of two image datasets generated with different neural networks.

2. RELATED WORK

Our goal is to extend recent few-shot classification approaches that make use of optimization-based meta-learning by adding a feature alignment component that casts different inputs to a common schema, presenting the first approach working across tasks with different schemas. In this section, we discuss various works related to our approach. Research on transfer learning (Pan & Yang, 2010; Sung et al., 2018; Gligic et al., 2020) has shown that training a model on different auxiliary tasks before actually fitting it to the target problem can provide better results if training data is scarce. Motivated by this, few-shot learning approaches try to generalize to novel tasks with unseen classes given only a few instances by first meta-learning across a set of training tasks (Duan et al., 2017; Finn et al., 2017b; Snell et al., 2017). A task τ consists of predictor data X_τ, a target Y_τ, a predefined training/test split τ = (X_τ^train, Y_τ^train, X_τ^test, Y_τ^test), and a loss function L_τ. Typically, an N-way K-shot problem refers to a few-shot learning problem where each task consists of N classes with K training samples per class. Heterogeneous transfer learning tries to tackle a problem setting similar to the one described in this work. In contrast to regular transfer learning, the feature spaces of the auxiliary tasks and the actual task differ and are often non-overlapping (Day & Khoshgoftaar, 2017). Many approaches require co-occurrence data, i.e., instances that can be found in both datasets (Wu et al., 2019; Qi et al., 2011), rely on jointly optimizing separate models for each dataset to propagate information (Zhao & Hoi, 2010; Yan et al., 2016), or utilize meta-features (Feuz & Cook, 2015). Oftentimes, these approaches operate on structured data, e.g., images and text with different data distributions for the tasks at hand (Li et al., 2019; He et al., 2019).
These datasets can thus be embedded in a shared space with standard models such as convolutional neural networks and transformer-based language models. However, none of these approaches is capable of training a single encoder that operates across a meta-dataset of tasks with different schemas for unstructured data. Early approaches like Fe-Fei et al. (2003) already investigated the few-shot learning setting by representing prior knowledge as a probability density function. In recent years, various works proposed new model-based meta-learning approaches which rapidly improved the state of the art on few-shot learning benchmarks. Most prominently, this includes methods which rely on learning an embedding space for non-parametric metric approaches during inference time (Vinyals et al., 2016; Snell et al., 2017), and approaches which utilize an external memory that stores information about previously seen classes (Santoro et al., 2016; Munkhdalai & Yu, 2017). Several more recent meta-learning approaches introduce architectures and parameterization techniques specifically suited for few-shot classification (Mishra et al., 2018; Shi et al., 2019; Wang & Chen, 2020), while others try to extract useful meta-features from datasets to improve hyperparameter optimization (Jomaa et al., 2019). In contrast, Finn et al. (2017a) showed that an optimization-based approach, which solely adapts the learning paradigm, can be sufficient for learning across tasks. Model-Agnostic Meta-Learning (MAML) describes a model initialization algorithm that is capable of training an arbitrary model f across different tasks. Instead of sequentially training the model one task at a time, it uses update steps from different tasks to find a common gradient direction that achieves a fast convergence. In other words, for each meta-learning update, we need an initial value for the model parameters θ.
Then, we sample a batch of tasks T, and for each task τ ∈ T we find an updated version of θ using N examples from the task by performing gradient descent with learning rate α as in: θ_τ ← θ − α∇_θ L_τ(f_θ). The final update of θ with step size β is then:

θ ← θ − β (1/|T|) Σ_{τ∈T} ∇_θ L_τ(f_{θ_τ})    (1)

Finn et al. (2017a) state that MAML does not require learning an update rule (Ravi & Larochelle, 2016) or restricting the model architecture (Santoro et al., 2016). They extended their approach by incorporating a probabilistic component such that, for a new task, the model is sampled from a distribution of models to guarantee a higher model diversification for ambiguous tasks (Finn et al., 2018). However, MAML requires computing second-order derivatives, resulting in a computationally heavy approach. Nichol et al. (2018b) extend upon the first-order approximation given as an ablation by Finn et al. (2018), which numerically approximates Equation (1) by replacing the second derivative with the difference between the initial and updated weights, such that the update rule used in REPTILE is given by:

θ ← θ + β (1/|T|) Σ_{τ∈T} (θ_τ − θ)

which means we can use the difference between the previous and updated parameters as an approximation of the second-order derivatives to reduce the computational cost. The serial version is presented in Algorithm 1. All of these approaches rely on a fixed schema, i.e. the same set of features with identical alignment across all tasks. However, many similar datasets only share a subset of their features, while oftentimes having a different order or representation, e.g. latent embeddings for two different image datasets generated by training two similar architectures. Most current few-shot classification approaches sample tasks from a single dataset by selecting a random subset of classes, although it is possible to train a single meta-model on two different image datasets, as shown by Munkhdalai & Yu (2017) and Tseng et al. (2020), since the images can be scaled to a fixed size.
Further research demonstrates that it is possible to learn a single model across different output sizes (Drumond et al., 2020). Recently, a meta-dataset for few-shot classification of image tasks was also published to promote meta-learning across multiple datasets (Triantafillou et al., 2020). Optimizing a single model across various datasets requires a shared feature space. Thus, it is required to align the features, which is achieved by simply rescaling all instances in the case of image data but is not trivial for unstructured data. Recent work relies on preprocessing images to a one-dimensional latent embedding with an additional deep neural network. Rusu et al. (2019) train a Wide Residual Network (Zagoruyko & Komodakis, 2016) on the meta-training data of MiniImageNet (Vinyals et al., 2016) to compute latent embeddings of the data, which are then used for few-shot classification, demonstrating state-of-the-art results. Finding a suitable initialization for deep networks has long been a focus of machine learning research. Especially the initializations of Glorot & Bengio (2010) and later He et al. (2015), which emphasize the importance of a scaled variance that depends on the layer inputs, are widely used. Similar findings are also reported by Cao et al. (2019). Recently, Dauphin & Schoenholz (2019) showed that it is possible to learn a suitable initialization by optimizing the norms of the respective weights. So far, none of these methods has tried to learn a common initialization across tasks with different schemas. We propose a novel feature alignment component named CHAMELEON, which enables state-of-the-art methods to learn how to work on top of tasks whose feature vectors differ not only in their length but also in their concrete alignment.
Our model shares resemblance with the scaled dot-product attention popularized by Vaswani et al. (2017):

Attention(Q, K, V) = softmax(QKᵀ / √d_K) V

where Q, K and V are matrices describing queries, keys and values, and d_K is the dimensionality of the keys, such that the softmax computes an attention mask which is then multiplied with the values V. In contrast to this, we pretrain the parametrized model CHAMELEON to compute a soft permutation matrix which can realign features across tasks with varying schema when multiplied with V, instead of computing a simple attention mask.

Algorithm 1 REPTILE (Nichol et al., 2018b)
Input: Meta-dataset T = {(X_1, Y_1, L_1), ..., (X_|T|, Y_|T|, L_|T|)}, learning rates α, β
1: Randomly initialize parameters θ of model f
2: for iteration = 1, 2, ... do
3:   Sample task (X_τ, Y_τ, L_τ) ∼ T
4:   θ′ ← θ
5:   for k = 1, 2, ... do
6:     θ′ ← θ′ − α∇_{θ′} L_τ(Y_τ, f(X_τ; θ′))
7:   end for
8:   θ ← θ + β(θ′ − θ)
9: end for
10: return parameters θ of model f
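The serial REPTILE update of Algorithm 1 can be sketched in a few lines of NumPy. The toy quadratic task family, the `inner_steps`, `alpha`, and `beta` values, and the closed-form gradient below are illustrative assumptions for the sketch, not the paper's experimental setup.

```python
import numpy as np

def reptile_step(theta, sample_task, inner_steps=3, alpha=0.01, beta=0.1):
    """One serial REPTILE meta-update: adapt a copy of theta on a sampled
    task, then move theta toward the adapted weights (no second derivatives)."""
    grad_fn = sample_task()                 # gradient of the task loss L_tau
    theta_tau = theta.copy()
    for _ in range(inner_steps):            # inner SGD steps on the task
        theta_tau = theta_tau - alpha * grad_fn(theta_tau)
    # meta-update: theta <- theta + beta * (theta_tau - theta)
    return theta + beta * (theta_tau - theta)

def make_task_sampler(rng):
    """Toy task family: minimize ||theta - c||^2 for a random center c."""
    def sample_task():
        c = rng.normal(size=3)
        return lambda th: 2.0 * (th - c)    # gradient of the quadratic loss
    return sample_task

rng = np.random.default_rng(0)
theta = rng.normal(size=3)                  # random initialization
sampler = make_task_sampler(rng)
for _ in range(200):                        # meta-training loop
    theta = reptile_step(theta, sampler)
```

Note how the meta-update only needs the difference between the adapted and the initial weights, which is exactly the first-order approximation that makes REPTILE cheaper than MAML.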

3.1. PROBLEM SETTING

We describe a classification dataset with vector-shaped predictors (i.e., no images, time series, etc.) by a pair (X, Y) ∈ ℝ^(N×F) × {0, ..., C}^N, with predictors X and targets Y, where N denotes the number of instances, F the number of predictors and C the number of classes. Let D_F := ⋃_{N∈ℕ} ℝ^(N×F) × {0, ..., C}^N be the space of all such datasets with F predictors and D := ⋃_{F∈ℕ} D_F be the space of any such dataset. Let us also denote the space of all predictor matrices with F predictors by X_F := ⋃_{N∈ℕ} ℝ^(N×F) and the space of all predictor matrices by X := ⋃_{F∈ℕ} X_F. Then a dataset τ = (X, Y) ∈ D equipped with a predefined training/test split, i.e. the quadruplet τ = (X_τ^train, Y_τ^train, X_τ^test, Y_τ^test), is called a task. A collection of such tasks T ⊂ D is called a meta-dataset. Similar to splitting a single dataset into a training and a test part, one can split a meta-dataset T = T^train ∪ T^test. The schema of a task τ then describes not only the number and order, but also the semantics of the predictor variables {p_τ,1, p_τ,2, ..., p_τ,F_τ} in X_τ^train. Consider a meta-dataset of correlated tasks T ⊂ D, such that the predictor variables {p_τ,1, p_τ,2, ..., p_τ,F_τ} of any individual task τ are contained in a common set of predictor variables {p_1, p_2, ..., p_K}. Methods like REPTILE and MAML try to find the best initialization for a specific model, in this work referred to as ŷ, to operate on a set T of similar tasks. However, every task τ has to share the same schema of common size K, where similar features shared across tasks are in the same position. A feature-order invariant encoder is needed to map the data representation X_τ of tasks with varying input schema and feature length F_τ to a shared latent representation X̃_τ with fixed feature length K:

enc: X → X_K,  X_τ ∈ ℝ^(N×F_τ) ↦ X̃_τ ∈ ℝ^(N×K)    (4)

where N represents the number of instances in X_τ, F_τ is the number of features of task τ, which varies across tasks, and K is the size of the desired feature space. By combining this encoder with a model ŷ that works on a fixed input size K and outputs the predicted target, e.g. a binary classification, it is possible to apply the REPTILE algorithm to learn an initialization θ_init across tasks with different schemas. The optimization objective then becomes the meta-loss for the combined network f = ŷ • enc over a set of tasks T:

argmin_{θ_init} E_{τ∼T} L_τ(Y_τ^test, f(X_τ^test; θ_τ^(u)))  s.t.  θ_τ^(u) = A^(u)(X_τ^train, Y_τ^train, L_τ, f; θ_init)    (5)

where θ_init is the set of initial weights for the combined network f, consisting of enc with parameters θ_enc and model ŷ with parameters θ_ŷ, and θ_τ^(u) are the updated weights after applying the learning procedure A for u iterations on task τ, as defined in Algorithm 1 for the inner updates of REPTILE. (Figure 2 sketches the CHAMELEON architecture: the input X_τ of size N × F_τ is transposed and passed through three Conv1D layers with channel sizes 8, 16 and K, followed by a softmax, yielding the reordering matrix Π̂(X_τ) of size F_τ × K and the aligned output X̃_τ = X_τ · Π̂(X_τ) of size N × K.) It is important to mention that learning one weight parameterization across any heterogeneous set of tasks is extremely difficult, since it is most likely impossible to find one initialization for two tasks with a vastly different number and type of features. By contrast, if two tasks share similar features, one can align the similar features to a common representation so that a model can directly learn across different tasks by transforming the tasks as illustrated in Figure 1.

3.2. CHAMELEON

Consider a set of tasks where a right stochastic matrix Π_τ exists for each task that reorders the predictor data X_τ into X̃_τ having the same schema for every task τ ∈ T:

X̃_τ = X_τ · Π_τ,  with X_τ ∈ ℝ^(N×F_τ), Π_τ ∈ ℝ^(F_τ×K) and X̃_τ ∈ ℝ^(N×K)    (6)

Every x_{m,n} represents feature n of sample m. Every π_{m,n} represents how much of feature m (from the samples in X_τ) is shifted to position n in the adapted input X̃_τ. Finally, every x̃_{m,n} represents the new feature n of sample m in X̃_τ with the adapted shape and size. In order to obtain the same X̃_τ when permuting two features of a task X_τ, we must simply permute the corresponding rows in Π_τ. Since Π_τ is a right stochastic matrix, every row of Π_τ sums to one, i.e. Σ_i π_{j,i} = 1, so that each value in Π_τ simply states how much of a feature is shifted to the corresponding position. For example, consider that task a has the features [apples, bananas, melons] and task b the features [lemons, bananas, apples]. Both can be transformed to the same representation [apples, lemons, bananas, melons] by replacing missing features with zeros and reordering them. This transformation must give the same result for a and b independent of their feature order. In a real-life scenario, features might come with different names, or their similarity may not be clear to the human eye. Note that a classic autoencoder is not capable of this, as it is not invariant to the order of the features. Our proposed component, denoted by Φ, takes a task as input and outputs the corresponding reordering matrix:

Φ(X_τ; θ_enc) = Π̂_τ    (7)

The function Φ is a neural network parameterized by θ_enc. It consists of three 1D-convolutions, where the last one is the output layer that estimates the alignment matrix via a softmax activation.
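The fruit example above can be written out with hard (0/1) reordering matrices; the feature values below are made up purely for illustration.

```python
import numpy as np

# Task a has features [apples, bananas, melons]; task b has
# [lemons, bananas, apples]. Shared schema: [apples, lemons, bananas, melons].
X_a = np.array([[1.0, 2.0, 3.0]])          # one instance of task a (made-up values)
X_b = np.array([[9.0, 2.0, 1.0]])          # one instance of task b (made-up values)

# Hard right stochastic reordering matrices (each row sums to 1): row i
# says where feature i of the task belongs in the shared schema (K = 4).
Pi_a = np.array([[1, 0, 0, 0],             # apples  -> position 0
                 [0, 0, 1, 0],             # bananas -> position 2
                 [0, 0, 0, 1]], float)     # melons  -> position 3
Pi_b = np.array([[0, 1, 0, 0],             # lemons  -> position 1
                 [0, 0, 1, 0],             # bananas -> position 2
                 [1, 0, 0, 0]], float)     # apples  -> position 0

X_a_aligned = X_a @ Pi_a                   # [[1., 0., 2., 3.]]
X_b_aligned = X_b @ Pi_b                   # [[1., 9., 2., 0.]]
# Permuting the features of X_a and the corresponding rows of Pi_a
# leaves the aligned result unchanged.
```

Both aligned instances now place apples at position 0 and bananas at position 2, with zeros where a feature is absent, which is exactly the shared representation a downstream model can be trained on.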
The input is first transposed to size [F_τ × N] (where N is the number of samples), i.e., each feature is represented by a vector of instances. Each convolution has kernel length 1 (as the order of instances is arbitrary and thus needs to be treated permutation invariantly) and a channel output size of 8, 16, and lastly K. The result is a reordering matrix displaying the relation of every original feature to each of the K features in the target space. Each of these vectors passes through a softmax layer, computing the ratio of features in X_τ shifted to each position of X̃_τ. Finally, the reordering matrix can be multiplied with the input to compute the aligned task as defined in Equation (6). By using a kernel length of 1 in combination with the final matrix multiplication, the full architecture becomes permutation invariant in the feature dimension. Column-wise permuting the features of an input task leads to the corresponding row-wise permutation of the reordering matrix. Thus, multiplying both matrices results in the same aligned output independent of the permutation. The overall architecture can be seen in Figure 2. The encoder necessary for training across tasks with different predictor vectors with REPTILE, by optimizing Equation (5), is then given as:

enc: X_τ ↦ X_τ · Φ(X_τ; θ_enc) = X_τ · Π̂_τ    (8)
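Since a Conv1D with kernel length 1 over the transposed input [F_τ × N] is just a dense layer applied independently to each feature's instance vector, the forward pass of Φ can be sketched with plain matrix products. The layer widths (8, 16, K) follow the text, while the random weights, the ReLU activations, and the toy sizes are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def chameleon_forward(X, params):
    """Return the soft reordering matrix Pi_hat (F x K) and the aligned
    task X @ Pi_hat (N x K). Kernel-length-1 convolutions act on each
    feature row of X^T independently, so they reduce to matmuls."""
    H = X.T                                   # (F, N): one instance-vector per feature
    W1, W2, W3 = params                       # shapes (N, 8), (8, 16), (16, K)
    H = np.maximum(H @ W1, 0.0)               # Conv1D, 8 output channels + ReLU
    H = np.maximum(H @ W2, 0.0)               # Conv1D, 16 output channels + ReLU
    Pi_hat = softmax(H @ W3, axis=1)          # (F, K), each row sums to 1
    return Pi_hat, X @ Pi_hat

rng = np.random.default_rng(1)
N, F, K = 5, 3, 4                             # toy sizes
params = (rng.normal(size=(N, 8)),
          rng.normal(size=(8, 16)),
          rng.normal(size=(16, K)))
X = rng.normal(size=(N, F))

Pi_hat, X_aligned = chameleon_forward(X, params)
# Permuting the input features permutes the rows of Pi_hat, so the
# aligned output is unchanged (feature-order invariance).
perm = [2, 0, 1]
_, X_aligned_perm = chameleon_forward(X[:, perm], params)
assert np.allclose(X_aligned, X_aligned_perm)
```

The final assertion demonstrates the permutation-invariance argument of this section: column-wise permuting the features only row-wise permutes Π̂, and the product X_τ · Π̂_τ is unchanged.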

3.3. REORDERING TRAINING

Only jointly training the network ŷ • enc as described above will not teach CHAMELEON, denoted by Φ, how to reorder the features to a shared representation. That is why it is necessary to train Φ specifically with the objective of reordering features (reordering training). In order to do so, we optimize Φ to align novel tasks by training on a set of tasks for which the reordering matrix Π_τ exists such that it maps τ to the shared representation. In other words, we require a meta-dataset that contains not only a set of similar tasks τ ∈ T with different schemas, but also the position of each feature in the shared representation, given by a permutation matrix. If Π_τ is known beforehand for each τ ∈ T, optimizing CHAMELEON becomes a simple supervised classification task based on predicting the new position of each feature in τ. Thus, we can minimize the expected reordering loss over the meta-dataset:

θ_enc = argmin_{θ_enc} E_{τ∼T} L_Φ(Π_τ, Π̂_τ)    (9)

where L_Φ is the softmax cross-entropy loss, Π_τ is the ground truth (a one-hot encoding of the new position for each variable), and Π̂_τ is the prediction. This training procedure is shown in Algorithm 2. The trained CHAMELEON model can then be used to compute Π̂_τ for any unseen task τ ∈ T.
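With Π_τ known, the reordering loss L_Φ is a row-wise cross-entropy between the one-hot ground-truth positions and the predicted reordering matrix. The matrices below are hypothetical values for a 3-feature task mapped into a K = 4 shared schema.

```python
import numpy as np

def reordering_loss(Pi_true, Pi_hat, eps=1e-12):
    """Cross-entropy between one-hot ground-truth feature positions
    (rows of Pi_true) and the predicted reordering matrix Pi_hat,
    averaged over features."""
    return float(-(Pi_true * np.log(Pi_hat + eps)).sum(axis=1).mean())

# Hypothetical ground truth: feature 0 -> position 0, 1 -> 2, 2 -> 3.
Pi_true = np.array([[1, 0, 0, 0],
                    [0, 0, 1, 0],
                    [0, 0, 0, 1]], float)
good = np.array([[0.90, 0.04, 0.03, 0.03],   # confident, mostly correct
                 [0.05, 0.05, 0.85, 0.05],
                 [0.02, 0.03, 0.05, 0.90]])
bad = np.full((3, 4), 0.25)                  # uniform, uninformative prediction
assert reordering_loss(Pi_true, good) < reordering_loss(Pi_true, bad)
```

Gradient descent on this loss with respect to θ_enc, as in Algorithm 2, pushes each row of Π̂_τ toward the one-hot target position of the corresponding feature.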

Algorithm 2 Reordering Training

Input: Meta-dataset T = {(X_1, Π_1), ..., (X_|T|, Π_|T|)}, latent dimension K, learning rate γ
1: Randomly initialize parameters θ_enc of the CHAMELEON model Φ
2: for epoch = 1, 2, ... do
3:   Sample task (X_τ, Π_τ) ∼ T
4:   θ_enc ← θ_enc − γ∇_{θ_enc} L_Φ(Π_τ, Φ(X_τ, θ_enc))
5: end for
6: return Trained parameters θ_enc of the CHAMELEON model

After this training procedure, we can use the learned weights as initialization for Φ before optimizing ŷ • enc with REPTILE, without further using L_Φ. Experiments show that this procedure improves our results significantly compared to only optimizing the joint meta-loss. Training the CHAMELEON component to reorder similar tasks to a shared representation requires not only a meta-dataset but one where the true reordering matrix Π_τ is provided for every task. In application, this means manually matching similar features of different training tasks so that novel tasks can be matched automatically. However, it is possible to sample a broad number of tasks from a single dataset by sampling smaller sub-tasks from it, selecting a random subset of features in arbitrary order for N random instances. Thus, it is not necessary to manually match the features, since all these sub-tasks share the same Π_τ apart from the respective row permutations mentioned above.
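Sampling permuted sub-tasks from a single aligned dataset, together with the ground-truth reordering matrix implied by the chosen feature indices, might look like the following sketch (the dataset, sizes, and sampling scheme are placeholders; label handling is omitted).

```python
import numpy as np

def sample_subtask(X, n_instances, n_features, rng):
    """Sample a sub-task: random instances and a random subset of features
    in arbitrary order, plus the one-hot reordering matrix Pi that maps
    each sampled feature back to its original (shared) position."""
    N, K = X.shape
    rows = rng.choice(N, size=n_instances, replace=False)
    cols = rng.choice(K, size=n_features, replace=False)   # arbitrary order
    Pi = np.zeros((n_features, K))
    Pi[np.arange(n_features), cols] = 1.0                  # feature i -> position cols[i]
    return X[np.ix_(rows, cols)], Pi

rng = np.random.default_rng(42)
X_full = rng.normal(size=(100, 10))          # placeholder aligned dataset
X_tau, Pi_tau = sample_subtask(X_full, 20, 6, rng)
# Multiplying by Pi recovers the shared schema (zeros for absent features).
X_aligned = X_tau @ Pi_tau
```

Because the permutation target comes for free from the sampled column indices, no manual feature matching is needed to build the reordering-training meta-dataset.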

4. EXPERIMENTAL RESULTS

Baseline and Setup In order to evaluate the proposed method, we investigate the combined model ŷ • enc with the initialization for enc obtained by pretraining CHAMELEON as defined in Equation (9) before using REPTILE to jointly optimize ŷ • enc. We compare the performance with an initialization obtained by running REPTILE on the base model ŷ by training on tasks padded to a fixed size K, as ŷ is not schema invariant. Both initializations are then compared to the performance of model ŷ with random Glorot initialization (Glorot & Bengio, 2010) (referred to as Random). In all of our experiments, we measure the performance of a model and its initialization by evaluating the validation data of a task after performing three update steps on the respective training data. All experiments are conducted in two variants: In Split experiments, test tasks contain novel features in addition to features seen during meta-training. In contrast, test tasks in No-Split experiments only consist of features seen during meta-training. While the Split experiments evaluate the performance of the model when faced with novel features during meta-testing, the No-Split experiments can be used to compare against a perfect alignment by repeating the baseline experiment with tasks that are already aligned (referred to as Oracle). A detailed description of the utilized models is found in Appendix B. Meta-Datasets For our main experiments, we utilize a single dataset as meta-dataset by sampling the training and test tasks from it. This allows us to evaluate our method on different domains without matching related datasets, since Π_τ is naturally given for a subset of permuted features. Novel features can also be introduced during testing by splitting not only the instances but also the features of a dataset into train and test partitions (Split). Training tasks are then sampled by selecting a random subset of the training features in arbitrary order for N instances.
Stratified sampling guarantees that test tasks contain features from both the train and the test split, while the instances are sampled from the test set only. For all experiments, 75% of the instances are used for the reordering training of CHAMELEON and the joint training of the full architecture, and 25% for sampling test tasks. For Split experiments, we further impose a train-test split on the features (20% of the features are restricted to the test split). Our work is built on top of REPTILE (Nichol et al., 2018b) but can be used in conjunction with any model-agnostic meta-learning method. We opted for REPTILE since it does not require second-order derivatives and its code is publicly available (Nichol et al., 2018a), while also being easy to adapt to our problem.

Main Results

We evaluate our approach using the OpenML-CC18 benchmark (Bischl et al., 2017), from which we selected 23 datasets for few-shot classification. The details of all datasets utilized in this work are summarized in Appendix B. The results in Figure 3 display the model performance after performing three update steps on a novel test task to illustrate the faster convergence. The graph shows a clear performance lift when using the proposed architecture after pretraining it to reorder tasks. This demonstrates, to the best of our knowledge, the first few-shot classification approach that successfully learns across tasks with varying schemas (contribution 2). Furthermore, in the No-Split results one can see that the performance of the proposed method approaches the Oracle performance, which suggests an ideal feature alignment. When adding novel features during test time (Split), CHAMELEON is still able to outperform the other setups, although by a smaller margin. Ablations We visualize the result of pretraining CHAMELEON on the Wine dataset (from OpenML-CC18) in Figure 6 to show that the proposed model is capable of learning the correct alignment between tasks. One can see that the component manages to learn the true feature position in almost all cases. Moreover, this illustration also shows that CHAMELEON can be used to compute the similarity between different features by indicating which pairs are confused most often. For example, features two and four show a strong correlation, which is very plausible since they depict the free sulfur dioxide and total sulfur dioxide levels of the wine. This demonstrates that our proposed architecture is able to learn an alignment between different feature spaces (contribution 1). Furthermore, we repeat the experiments on the OpenML-CC18 benchmark in two ablation studies to measure the impact of joint training and the proposed reordering training (Algorithm 2).
First, we do not train CHAMELEON with Equation (9) but only jointly train ŷ • enc with REPTILE, to evaluate the influence of adding additional parameters to the network without pretraining it. Secondly, we use REPTILE only to update the initialization for the parameters of ŷ while freezing the pretrained parameters of enc, in order to assess the effect of jointly training both network components. These two variants are referred to as Untrain and Frozen. We compare these ablations to our approach by conducting a Wilcoxon signed-rank test (Wilcoxon, 1992) with Holm's alpha correction (Holm, 1979). The results are displayed in the form of a critical difference diagram (Demšar, 2006; Ismail Fawaz et al., 2019) presented in Figure 4. The diagram shows the ranked performance of each model and whether the differences are statistically significant. The results confirm that our approach leads to statistically significant improvements over the random and REPTILE baselines when pretraining CHAMELEON. Similarly, our approach is also significantly better than jointly training the full architecture without pretraining CHAMELEON (Untrain), confirming that the improvements do not stem from the increased model capacity. Finally, comparing the results to the Frozen model shows improvements that are not significant, indicating that a near-optimal alignment was already found during pretraining. A detailed overview of all experimental results is given in Appendix C.

Latent Embeddings Experiments

Learning to align features is only feasible for unstructured data, since this approach would not preserve any structure. However, it is a widespread practice among few-shot classification methods, and computer vision approaches in general, to use a pretrained model to embed image data into a latent space before applying further operations. We can use CHAMELEON to align the latent embeddings of image datasets that are generated with different networks. Thus, it is possible to use latent embeddings for meta-training while evaluating on novel tasks that are not yet embedded, in case the embedding network is not available or the complexity of different datasets requires models with different capacities to extract useful features. We conduct an additional experiment for which we combine two similar image datasets, namely EMNIST-Digits and EMNIST-Letters (Cohen et al., 2017), training on one while evaluating on tasks sampled from the other. In the combined experiments, the full training is performed on the EMNIST-Letters dataset, while EMNIST-Digits is used for testing. Splitting the features is not necessary, as the train and test features come from different datasets. The results of this experiment are displayed in Figure 5. It shows the accuracy on EMNIST-Digits averaged across 5 runs with 1,600 generated tasks per run during the REPTILE training on EMNIST-Letters for the different model variants. Each test task is evaluated by performing 3 update steps on the training samples and measuring the accuracy on its validation data afterward. One can see that our proposed approach reports a significantly higher accuracy than the REPTILE baseline after performing three update steps on a task (contribution 4), showing that CHAMELEON is able to transfer knowledge from one dataset to another. Moreover, simply adding CHAMELEON without pretraining it to reorder tasks (Untrain) does not lead to any improvement.
This might be caused by the CHAMELEON component having a much lower number of parameters than the base network. Only by adding the reordering training does the model manage to converge to a suitable initialization. In contrast to our experiments on the OpenML datasets, freezing the weights of CHAMELEON after pretraining also fails to give an improvement, suggesting that the pretraining did not capture the ideal alignment but enables learning it during joint training. Our code is available at BLIND-REVIEW.

5. CONCLUSION

In this paper, we presented, to the best of our knowledge, the first approach to tackle few-shot classification for unstructured tasks with different schemas. Our model component CHAMELEON is capable of embedding tasks into a common representation by computing a matrix that reorders their features. For this, we propose a novel pretraining framework that is shown to learn useful permutations across tasks in a supervised fashion without requiring actual labels. In experiments on 23 datasets of the OpenML-CC18 benchmark, our method shows significant improvements even when presented with features not seen during training. Furthermore, by aligning different latent embeddings, we demonstrate how a single meta-model can be used to learn across multiple image datasets, each embedded with a distinct network.

A APPENDIX - INNER TRAINING

We visualize the inner training for one of the experiments in Figure 7. It shows two exemplary snapshots of the inner test loss when training on a sampled task with the current initialization θ_init, before meta-learning and after 20,000 meta-epochs. It is compared to the test loss of the model when trained on the same task starting from a random initialization. For this experiment, models were trained until convergence. Note that the two losses are not identical at meta-epoch 0 because the CHAMELEON component is already pretrained. The snapshots show the expected REPTILE behavior, namely a faster convergence when using the currently learned initialization compared to a random one.
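The inner/outer dynamic described above can be sketched on a toy objective. The snippet below implements the generic REPTILE meta-update (a few inner gradient steps on a sampled task, then interpolating the initialization toward the adapted weights); the quadratic toy task, learning rates, and all names here are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def reptile_step(theta_init, task_grad_fn, inner_steps=3, alpha=0.1, beta=0.1):
    """One REPTILE meta-update: adapt a copy of theta_init on a sampled
    task via gradient descent, then move theta_init toward the result."""
    theta = theta_init.copy()
    for _ in range(inner_steps):
        theta -= alpha * task_grad_fn(theta)       # inner update steps
    return theta_init + beta * (theta - theta_init)  # meta-update

# toy task distribution: minimize ||theta - t||^2 for a sampled optimum t
rng = np.random.default_rng(0)
theta = np.zeros(4)                                # random-ish initialization
for _ in range(2000):
    t = rng.normal(1.0, 0.1, size=4)               # sample a task
    theta = reptile_step(theta, lambda th: 2.0 * (th - t))
# theta drifts toward the mean task optimum (~1.0 per coordinate),
# i.e., an initialization from which every sampled task converges quickly
```

The interpolation in the last line of `reptile_step` is what makes REPTILE first-order: no second derivatives through the inner loop are needed, only the adapted parameters.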

B APPENDIX - EXPERIMENTAL DETAILS

The features of each dataset are normalized between 0 and 1. The Split experiments are limited to the 21 datasets that have more than four features, in order to allow a feature split. We sample 10 training and 10 validation instances per label for a new task, and 16 tasks per meta-batch. The number of classes in a task is given by the number of classes of the respective dataset, as shown in Table 1. During the reordering-training phase and the inner updates of REPTILE, specified in line 6 of Algorithm 1, we use the ADAM optimizer (Kingma & Ba, 2014) with an initial learning rate of 0.0001 and 0.001, respectively. The meta-updates of REPTILE are carried out with a learning rate β of 0.01. The reordering-training phase is run for 4000 epochs. All results reported in this work are averaged over 5 runs.
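For quick reference, the hyperparameters listed above can be collected in one place (a plain summary; the key names are ours, not the paper's):

```python
# Hyperparameters of the OpenML-CC18 experiments as stated in the text.
# Key names are our own shorthand.
CONFIG = {
    "instances_per_label": 10,   # 10 training + 10 validation per class
    "tasks_per_meta_batch": 16,
    "inner_optimizer": "ADAM",   # Kingma & Ba, 2014
    "reordering_lr": 1e-4,       # ADAM lr for the reordering-training phase
    "inner_lr": 1e-3,            # ADAM lr for REPTILE inner updates
    "meta_lr_beta": 0.01,        # REPTILE meta-update learning rate
    "reordering_epochs": 4000,
    "runs": 5,                   # all results averaged over 5 runs
}
```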

OpenML-CC18

All experiments on the OpenML-CC18 benchmark are conducted with the same model architecture. The base model ŷ is a feed-forward neural network with two dense hidden layers of 16 neurons each. CHAMELEON consists of two 1D-convolutions with 8 and 16 filters, respectively, and a final convolution that maps the task to the feature-length K, as shown in Figure 2. We selected datasets that have up to 33 features and a minimum of 90 instances per class. We limited the number of features and the model capacity because this work seeks to establish a proof of concept for learning across data with different schemas. In contrast, very high-dimensional data would require tuning a more complex CHAMELEON architecture.
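A minimal numpy sketch of this encoder may help fix ideas. With kernel length 1, the 1D convolutions over the feature axis reduce to per-feature dense layers; the network scores each of the F input features against the K target positions, and a column-wise softmax yields a soft reordering matrix. Treating the N instance values of each feature as input channels (which assumes a fixed N per task), the weight shapes below are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def chameleon_encode(x, w1, w2, w3):
    """Sketch of the CHAMELEON encoder: three kernel-length-1 convolutions
    over the feature axis produce an F x K score matrix; a column-wise
    softmax turns it into a soft reordering matrix pi, and x @ pi maps
    the task from R^{N x F} into the shared space R^{N x K}."""
    h = relu(x.T @ w1)          # [F, 8]   first conv, 8 filters
    h = relu(h @ w2)            # [F, 16]  second conv, 16 filters
    scores = h @ w3             # [F, K]   final conv maps to feature-length K
    e = np.exp(scores - scores.max(axis=0, keepdims=True))
    pi = e / e.sum(axis=0, keepdims=True)  # each output feature is a convex
    return x @ pi               # [N, K]   combination of input features

rng = np.random.default_rng(0)
x = rng.random((10, 5))                  # toy task: N=10 instances, F=5 features
w1 = rng.normal(size=(10, 8))            # instance values act as input channels
w2 = rng.normal(size=(8, 16))
w3 = rng.normal(size=(16, 12))           # K=12 target features
aligned = chameleon_encode(x, w1, w2, w3)
```

Because each column of pi sums to one, the aligned features stay within the value range of the (normalized) input features, and a near-one-hot pi recovers a hard feature permutation.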

C APPENDIX - TABLES WITH EXPERIMENT RESULTS

The following tables show the detailed results of our experiments on the OpenML-CC18 datasets for the Split and No-Split settings. The tables contain the loss and accuracy for the base model ŷ trained from a random initialization and with REPTILE, and for our proposed model ŷ ∘ enc with the additional ablation studies Untrain and Frozen.

D APPENDIX - PROBLEM SETTING: GENERAL MULTI-TASK LEARNING

We describe a classification dataset with vector-shaped predictors (i.e., no images, time series, etc.) by a pair (X, Y) ∈ R^{N×F} × {0, ..., C}^N, with predictors X and targets Y, where N denotes the number of instances, F the number of predictors, and C the number of classes. Let D_F := ⋃_{N∈N} R^{N×F} × {0, ..., C}^N be the space of all such datasets with F predictors, and D := ⋃_{F∈N} D_F the space of all such datasets. Let us also denote the space of all predictor matrices with F predictors by X_F := ⋃_{N∈N} R^{N×F}, and the space of all predictor matrices by X := ⋃_{F∈N} X_F.

Consider a meta-dataset of correlated tasks T ⊂ D, such that the predictor variables {p_τ1, p_τ2, ..., p_τF} of any individual task τ are contained in a common set of predictor variables {p_1, p_2, ..., p_K}. As elucidated in the previous section, our goal is to construct an encoder that learns to match these predictors and maps the features of any task τ ∈ T into a shared latent space R^K:

enc : X → X_K,   X ∈ R^{N×F} ↦ enc(X) ∈ R^{N×K}   (10)

This encoder can be combined with a parametric model of fixed input size ŷ : R^K → {0, 1} (e.g., a neural network or an SVM) such that an initialization θ_init for the joint model ŷ ∘ enc can be learned via MAML or REPTILE across all tasks, even when those do not share the same predictor vector. Just as with MAML, this initialization facilitates rapid convergence of the combined model ŷ ∘ enc on any new, previously unseen task T ∈ T_test.
More explicitly, the ultimate goal is to minimize the meta test loss

L(θ_init) := E_{T_τ ∼ T_test} [ L_τ( Y_τ^test , (ŷ ∘ enc)( X_τ^test ; θ_τ^{(u)} ) ) ]   (11)

where L_τ is the task-specific loss (e.g., the misclassification rate) of the model on the test data of T_τ, using the updated parameters θ_τ^{(u)}. MAML and REPTILE solve the sub-problem in which the number F of features is fixed and the predictors of all tasks are the same and aligned, i.e., the same predictor always occurs at the same position within the predictor vector, so that the identity can be used as predictor encoder. This problem can alternatively be described as a supervised learning problem with a multivariate or structured target.



Note that REPTILE does not require validation instances during meta-learning.



Figure 1: Chameleon Pipeline: Chameleon aims to encode tasks with different schemas into a shared representation with a uniform feature space, which can then be processed by any classifier. The left block represents tasks of the same domain with different schemas. The middle block represents the aligned features in a fixed schema.

Figure 2: The Chameleon Architecture: N represents the number of samples in τ, Fτ is the number of features in τ, and K is the number of features in the desired feature space. "Conv(a × b × c)" denotes a convolution operation with a input channels, b filters, and kernel length c.

Figure 3: Accuracy improvement for each method over the Glorot initialization (Glorot & Bengio, 2010): The difference is plotted in negative log scale to account for the varying performance scales across the different datasets (higher points are better; a value of -1 is equivalent to the Glorot initialization). Graph (a) shows the Split experiments while (b) depicts the No-Split experiments. Note that the oracle has been omitted from the Split experiments since there is no true feature alignment for unseen features. The dataset axis is sorted by the performance of REPTILE on the base model to improve readability. All results are averaged over 5 runs.

Figure 4: Critical Difference Diagram for Split (Left) and No-Split (Right) showing results of Wilcoxon signed-rank test with Holm's alpha correction and 5% significance level. Models are ranked by their performance and a thicker horizontal line indicates pairs that are not statistically different.

Figure 5: Latent embedding results. Meta test accuracy on the EMNIST-Digits data set while training on EMNIST-Letters. Each point represents the accuracy on 1600 test tasks after performing three update steps on the respective training data. Results are averaged over 5 runs.

Figure 6: Heat map of the feature shifts for the Wine data computed with CHAMELEON after reordering-training: The x-axis represents the twelve features of the original dataset in the correct order and the y-axis shows which position these features are shifted to when presented in a permuted subset.

Figure 7: Snapshots visualizing the inner training. Validation cross-entropy loss for a task sampled from the wall-robot-navigation data set during inner training starting from the current initialization (blue) and random initialization (red).

Then a dataset τ = (X, Y) ∈ D equipped with a predefined training/test split, i.e., the quadruplet τ = (X_τ^train, Y_τ^train, X_τ^test, Y_τ^test), is called a task. A collection of such tasks T ⊂ D is called a meta-dataset. Similar to splitting a single dataset into a training and test part, one can split a meta-dataset T = T_train ∪ T_test.

The latter are the updated parameters of the joint model ŷ ∘ enc, obtained by minimizing L_τ on the training data (X_τ^train, Y_τ^train) of T_τ via some iterative learning algorithm A (e.g., gradient descent) for u iterations:

θ_τ^{(u)} = A^{(u)}( X_τ^train , Y_τ^train , L_τ , ŷ ∘ enc ; θ_init )   (12)

[Algorithm 1 (reordering-training of CHAMELEON; excerpt): Input: learning rate γ. 1: Randomly initialize parameters θ_enc of the CHAMELEON model. 2: for training iteration = 1, 2, ... do ...]

The details for each dataset are summarized in Table 1. When sampling a task in Split, we sample between 40% and 60% of the respective training features. For test tasks in Split experiments, 20% of the features are sampled from the set of test features to evaluate performance on similar tasks with partially novel features. For each experimental run, the different variants are tested on the same data split, and we sample 1600 test tasks beforehand, while the training tasks are randomly sampled each epoch. All experiments are repeated five times with different instance splits and, in the case of Split, different feature splits, and the results are averaged.

Latent Embeddings

Both networks used for generating the latent embeddings consist of two convolutional and two dense hidden layers with 64 neurons each, but the number of neurons in the output layer is 32 for EMNIST-Digits and 64 for EMNIST-Letters. For these experiments, the CHAMELEON component still has two convolutional layers with 8 and 16 filters, while we use a larger base network with two feed-forward layers of 64 neurons each. All experimental results are averaged over five runs.

Table 1: Information for the 23 OpenML-CC18 datasets used in this paper. *These datasets were embedded using our embedding neural network (see Appendix B).
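The Split sampling scheme above can be sketched as follows. The function draws 40-60% of the training features for a task and, in test mode, replaces 20% of the drawn features with held-out test features before permuting the schema; function and argument names are ours, not the paper's.

```python
import numpy as np

def sample_split_task(x, train_feats, test_feats, rng, test_mode=False):
    """Draw the feature subset for one task in the Split setting:
    40-60% of the training features; a test task additionally swaps
    20% of its features for held-out test features. The resulting
    columns are shuffled, so tasks arrive with permuted schemas."""
    lo, hi = int(0.4 * len(train_feats)), int(0.6 * len(train_feats))
    n_feats = int(rng.integers(lo, hi + 1))
    chosen = rng.choice(train_feats, size=n_feats, replace=False)
    if test_mode:
        n_novel = max(1, round(0.2 * n_feats))     # partially novel features
        chosen[-n_novel:] = rng.choice(test_feats, size=n_novel, replace=False)
    rng.shuffle(chosen)
    return x[:, chosen], chosen

# hypothetical dataset with 20 features: 16 for training, 4 held out
rng = np.random.default_rng(0)
x = rng.random((30, 20))
train_feats, test_feats = np.arange(16), np.arange(16, 20)
x_task, cols = sample_split_task(x, train_feats, test_feats, rng, test_mode=True)
```

Sampling each training task afresh per epoch while fixing the 1600 test tasks beforehand, as described above, amounts to calling this function once per task with the appropriate `test_mode`.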

Loss and accuracy scores of each model variant for the No-Split experiments. The values depict the mean and standard deviation across 5 runs for each dataset with 1600 sampled test tasks per run. Best results are boldfaced (excluding ORACLE).

Loss and accuracy scores of each model variant for the Split experiments. The values depict the mean and standard deviation across 5 runs for each dataset with 1600 sampled test tasks per run. Best results are boldfaced.

