TASK-SIMILARITY AWARE META-LEARNING THROUGH NONPARAMETRIC KERNEL REGRESSION

Abstract

This paper investigates the use of nonparametric kernel regression to obtain a task-similarity aware meta-learning algorithm. Our hypothesis is that the use of task-similarity helps meta-learning when the available tasks are limited and may contain outlier/dissimilar tasks. While existing meta-learning approaches implicitly assume the tasks to be similar, it is generally unclear how this task-similarity could be quantified and used in the learning. As a result, most popular meta-learning approaches do not actively use the similarity/dissimilarity between the tasks, but rely on the availability of a large number of tasks in order to work. Our contribution is a novel framework for meta-learning that explicitly uses task-similarity in the form of kernels, and an associated meta-learning algorithm. We model the task-specific parameters as belonging to a reproducing kernel Hilbert space where the kernel function captures the similarity across tasks. The proposed algorithm iteratively learns a meta-parameter which is used to assign a task-specific descriptor to every task. The task descriptors are then used to quantify the task-similarity through the kernel function. We show how our approach conceptually generalizes the popular meta-learning approaches of model-agnostic meta-learning (MAML) and meta-stochastic gradient descent (Meta-SGD). Numerical experiments with regression and classification tasks show that our algorithm outperforms these approaches when the number of tasks is limited, even in the presence of outlier or dissimilar tasks. This supports our hypothesis that task-similarity helps improve the meta-learning performance in task-limited and adverse settings.

1. INTRODUCTION

Meta-learning seeks to abstract a general learning rule applicable to a class of different learning problems or tasks, given the knowledge of a set of training tasks from the class (Finn & Levine, 2018; Denevi et al., 2018; Hospedales et al., 2020; Grant et al., 2018; Yoon et al., 2018). The setting is such that the data available for solving each task is often severely limited, resulting in poor performance when the tasks are solved independently of each other. This also sets meta-learning apart from the transfer-learning paradigm, where the focus is to transfer a well-trained network from an existing domain to another (Pan & Yang, 2010). While existing meta-learning approaches implicitly assume the tasks to be similar, it is generally unclear how this task-similarity could be quantified and used in the learning. As a result, most popular meta-learning approaches do not actively use the similarity/dissimilarity between the tasks, but rely on the availability of a large number of tasks in order to work. In many practical applications, the number of tasks could be limited and the tasks may not always be very similar. There might even be 'outlier' or 'out-of-distribution' tasks that are less similar to, or dissimilar from, the rest of the tasks. Our conjecture is that the explicit incorporation or awareness of task-similarity helps improve meta-learning performance in such task-limited and adverse settings. The goal of this paper is to test this hypothesis by developing a task-similarity aware meta-learning algorithm using nonparametric kernel regression. Specifically, our contribution is a novel meta-learning algorithm called Task-similarity Aware Nonparametric Meta-Learning (TANML) that:
• Explicitly employs similarity across the tasks to fast-adapt the meta-information to a given task, by using nonparametric kernel regression.
• Models the parameters of a task as belonging to a reproducing kernel Hilbert space (RKHS), obtained by viewing the popular meta-learning approaches of MAML and Meta-SGD through the lens of linear/kernel regression.
• Uses task-descriptors through a kernel function to quantify task-similarity/dissimilarity.
• Offers a general framework for incorporating task-similarity in the meta-learning process. Though we pursue the algorithm with a specific choice of task-descriptors, the proposed RKHS task-similarity aware framework can be extended to other formulations.
We wish to emphasize that our goal is not to propose another meta-learning algorithm that outperforms the state-of-the-art, but rather to investigate whether task-similarity can be explicitly incorporated and used to advantage in a meaningful manner. We show how this is achieved as a consequence of viewing the popular MAML and Meta-SGD formulations through the lens of nonparametric kernel regression. In order to keep the comparison fair on an apples-to-apples basis, we compare the performance of TANML with that of the MAML and Meta-SGD algorithms.

1.1. MATHEMATICAL OVERVIEW OF THE PROPOSED TASK-SIMILARITY AWARE FRAMEWORK

Given pairs of data (x_k, y_k) ∈ R^{n_x} × R^{n_y}, where k ∈ {1, 2, …, K}, generated by an unknown data source, we are interested in learning a predictor R^{n_x} × R^D ∋ (x, θ) ↦ f(x, θ) ∈ R^{n_y} from the given data. For example, f(x, θ) could be a function defined by an artificial neural network (ANN). We collect the pairs of data in X = (x_1, x_2, …, x_K) and Y = (y_1, y_2, …, y_K) and define the loss function R^{Kn_x} × R^{Kn_y} × R^D ∋ (X, Y, θ) ↦ L(X, Y, θ) ∈ R, which we then minimize with respect to θ. This constitutes the training of the predictor. In the case of an ANN, L(X, Y, θ) could be the mean-square loss or the cross-entropy function. The data X, Y used for training is referred to as the training data. Let θ̂ denote the optimal value of θ obtained from training. Given a new x ∈ R^{n_x}, we use ŷ = f(x, θ̂) to predict y. The goodness of θ̂ is evaluated by comparing ŷ with y on a sequence of new data pairs called the test data X̄, Ȳ, defined similarly to X and Y but with K̄ data pairs. The training of the predictor for the given data source is defined as a task.

Now, consider that we are interested in carrying out several such tasks for data coming from different but similar sources. Let X_i, Y_i, X̄_i, Ȳ_i, i = 1, …, T_tr, denote the data from T_tr different data sources, defined similarly to X, Y, X̄, Ȳ above. We refer to the training of the predictor for data X_i, Y_i, X̄_i, Ȳ_i as the i-th training task, and θ_i is referred to as the parameter for the task. Meta-learning captures similarities across the tasks by learning a common θ (which we denote by θ_0) from the data of these T_tr tasks (called the meta-training data), such that θ_0 can be quickly adapted to train a predictor for data from new and different, but similar, data sources.
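The task structure above can be sketched in code. The following is a minimal illustration, not the paper's implementation: each task draws data from a linear source y = a·x + b whose parameters (a, b) vary slightly across tasks (an assumption made purely for illustration), each task has K training and K̄ test pairs, and each task-specific parameter θ_i is obtained by independently minimizing the squared loss via gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(K=10, K_bar=10):
    """One task: (X, Y) pairs from a source y = a*x + b; (a, b) differ across tasks."""
    a, b = rng.normal(1.0, 0.2), rng.normal(0.0, 0.2)  # similar but distinct sources
    x = rng.uniform(-1.0, 1.0, size=K + K_bar)
    y = a * x + b
    return (x[:K], y[:K]), (x[K:], y[K:])  # training data (X, Y), test data (X_bar, Y_bar)

def train(x, y, steps=500, lr=0.1):
    """Minimize the squared loss L(X, Y, theta) over theta = (slope, intercept) by gradient descent."""
    theta = np.zeros(2)
    for _ in range(steps):
        resid = y - (theta[0] * x + theta[1])
        # descent direction for the mean squared residual (up to a constant factor)
        theta += lr * np.array([np.mean(resid * x), np.mean(resid)])
    return theta

# Meta-training data: T_tr tasks drawn from similar sources.
T_tr = 5
tasks = [make_task() for _ in range(T_tr)]
thetas = [train(*tr) for tr, _ in tasks]  # task-specific parameters theta_i, trained independently
```

With the very limited per-task data used here, such independent training is exactly the regime where meta-learning a shared θ_0 becomes useful.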
Depending on how θ is obtained from θ_0, various meta-learning algorithms exist (Denevi et al., 2018; Finn & Levine, 2018; Allen et al., 2019). The performance of the meta-learning algorithm is evaluated on previously unseen data from several other similar sources X^v_i, Y^v_i, X̄^v_i, Ȳ^v_i, i = 1, …, T_v (called the meta-test data), defined similarly to X, Y, X̄, Ȳ; this constitutes the meta-test phase. The training of the predictor for test data X^v_i, Y^v_i, X̄^v_i, Ȳ^v_i is referred to as the i-th test task, and θ^v_i denotes the parameter for the i-th test task. In the existing meta-learning approaches, both θ_i and θ^v_i are obtained by adapting θ_0 using the gradient of L(X_i, Y_i, θ) and L(X^v_i, Y^v_i, θ), respectively, evaluated at θ_0. In our work, we propose a meta-learning algorithm where θ_i explicitly uses a similarity between the i-th training task and all the training tasks. Similarly, the parameters θ^v_i for the test tasks explicitly use a similarity between the i-th test task and all the training tasks. As specified later, we define this task-similarity between two tasks through kernel regression, and our algorithm learns the kernel regression coefficients Ψ as meta-parameters in addition to θ_0.

A motivating example. Let us now consider a specific loss function given by L(X, Y, θ) = Σ_{k=1}^{K} ‖y_k − f(x_k, θ)‖_2^2. Training tasks independently with limited training data will typically result in a predictor that overfits to X, Y and generalizes poorly to X̄, Ȳ. MAML-type meta-learning approaches (Finn et al., 2017) solve this by inferring the information across tasks in the form of a good initialization θ_0, specialized/adapted to a new task using the adaptation function R^D × R^{Kn_x} × R^{Kn_y} ∋ (θ_0, X, Y) ↦ g_MAML(θ_0, X, Y) ∈ R^D defined as:

g_MAML(θ_0, X, Y) ≜ θ_0 − α ∇_{θ_0} L(X, Y, θ_0)

