TASK-SIMILARITY AWARE META-LEARNING THROUGH NONPARAMETRIC KERNEL REGRESSION

Abstract

This paper investigates the use of nonparametric kernel regression to obtain a task-similarity aware meta-learning algorithm. Our hypothesis is that task-similarity helps meta-learning when the available tasks are limited and may contain outlier or dissimilar tasks. While existing meta-learning approaches implicitly assume the tasks to be similar, it is generally unclear how this task-similarity could be quantified and used in the learning. As a result, most popular meta-learning approaches do not actively use the similarity or dissimilarity between the tasks, but rely on the availability of a huge number of tasks. Our contribution is a novel framework for meta-learning that explicitly uses task-similarity in the form of kernels, together with an associated meta-learning algorithm. We model the task-specific parameters as belonging to a reproducing kernel Hilbert space, where the kernel function captures the similarity across tasks. The proposed algorithm iteratively learns a meta-parameter which is used to assign a task-specific descriptor to every task. The task descriptors are then used to quantify the task-similarity through the kernel function. We show how our approach conceptually generalizes the popular meta-learning approaches of model-agnostic meta-learning (MAML) and meta-stochastic gradient descent (Meta-SGD). Numerical experiments with regression and classification tasks show that our algorithm outperforms these approaches when the number of tasks is limited, even in the presence of outlier or dissimilar tasks. This supports our hypothesis that task-similarity helps improve meta-learning performance in task-limited and adverse settings.

1. INTRODUCTION

Meta-learning seeks to abstract a general learning rule applicable to a class of different learning problems or tasks, given the knowledge of a set of training tasks from the class (Finn & Levine, 2018; Denevi et al., 2018; Hospedales et al., 2020; Grant et al., 2018; Yoon et al., 2018). The setting is such that the data available for solving each task is often severely limited, resulting in poor performance when the tasks are solved independently of each other. This also sets meta-learning apart from the transfer learning paradigm, where the focus is to transfer a well-trained network from an existing domain to another (Pan & Yang, 2010). While existing meta-learning approaches implicitly assume the tasks to be similar, it is generally unclear how this task-similarity could be quantified and used in the learning. As a result, most popular meta-learning approaches do not actively use the similarity or dissimilarity between the tasks, but rely on the availability of a huge number of tasks. In many practical applications, the number of tasks could be limited and the tasks may not always be very similar. There might even be 'outlier tasks' or 'out-of-the-distribution tasks' that are less similar or dissimilar from the rest of the tasks. Our conjecture is that the explicit incorporation or awareness of task-similarity helps improve meta-learning performance in such task-limited and adverse settings. The goal of this paper is to test this hypothesis by developing a task-similarity aware meta-learning algorithm using nonparametric kernel regression. Specifically, our contribution is a novel meta-learning algorithm called Task-similarity Aware Nonparametric Meta-Learning (TANML) that:

• Explicitly employs similarity across the tasks to fast-adapt the meta-information to a given task, using nonparametric kernel regression.
• Models the parameters of a task as belonging to a reproducing kernel Hilbert space (RKHS), obtained by viewing the popular MAML and Meta-SGD meta-learning approaches through the lens of linear/kernel regression.
• Uses task descriptors through a kernel function to quantify task-similarity/dissimilarity.
• Offers a general framework for incorporating task-similarity in the meta-learning process. Though we pursue the algorithm with a specific choice of task descriptors, the proposed RKHS task-similarity aware framework can be extended to other formulations.

We wish to emphasize that our goal is not to propose another meta-learning algorithm that outperforms the state of the art, but rather to investigate whether task-similarity can be explicitly incorporated and used to advantage in a meaningful manner. We show how this is achieved as a consequence of viewing the popular MAML and Meta-SGD formulations through the lens of nonparametric kernel regression. In order to keep the comparison fair on an apples-to-apples level, we compare the performance of TANML with that of the MAML and Meta-SGD algorithms.

1.1. MATHEMATICAL OVERVIEW OF THE PROPOSED TASK-SIMILARITY AWARE FRAMEWORK

Given pairs of data $(\mathbf{x}_k, \mathbf{y}_k) \in \mathbb{R}^{n_x} \times \mathbb{R}^{n_y}$, $k \in \{1, 2, \cdots, K\}$, generated by an unknown data source, we are interested in learning a predictor $f : \mathbb{R}^{n_x} \times \mathbb{R}^{D} \ni (\mathbf{x}, \boldsymbol{\theta}) \mapsto f(\mathbf{x}, \boldsymbol{\theta}) \in \mathbb{R}^{n_y}$ from the given data. For example, $f(\mathbf{x}, \boldsymbol{\theta})$ could be a function defined by an artificial neural network (ANN). We collect the pairs of data in $\mathcal{X} = (\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_K)$ and $\mathcal{Y} = (\mathbf{y}_1, \mathbf{y}_2, \cdots, \mathbf{y}_K)$ and define the loss function $L : \mathbb{R}^{K n_x} \times \mathbb{R}^{K n_y} \times \mathbb{R}^{D} \ni (\mathcal{X}, \mathcal{Y}, \boldsymbol{\theta}) \mapsto L(\mathcal{X}, \mathcal{Y}, \boldsymbol{\theta}) \in \mathbb{R}$, which we then minimize with respect to $\boldsymbol{\theta}$. This constitutes the training of the predictor. In the case of an ANN, $L(\mathcal{X}, \mathcal{Y}, \boldsymbol{\theta})$ could be the mean-square loss or the cross-entropy function. The data $\mathcal{X}, \mathcal{Y}$ used for training is referred to as the training data. Let $\boldsymbol{\theta}^{*}$ denote the optimal value of $\boldsymbol{\theta}$ obtained from training. Given a new $\mathbf{x} \in \mathbb{R}^{n_x}$, we use $\hat{\mathbf{y}} = f(\mathbf{x}, \boldsymbol{\theta}^{*})$ to predict $\mathbf{y}$. The goodness of $\boldsymbol{\theta}^{*}$ is evaluated by comparing $\hat{\mathbf{y}}$ with $\mathbf{y}$ on a set of new data pairs called the test data $\bar{\mathcal{X}}, \bar{\mathcal{Y}}$, defined similarly to $\mathcal{X}$ and $\mathcal{Y}$ but with $\bar{K}$ data pairs. The training of the predictor for the given data source is defined as a task. Now, consider that we are interested in carrying out several such tasks for data coming from different but similar sources. Let $\mathcal{X}_i, \mathcal{Y}_i, \bar{\mathcal{X}}_i, \bar{\mathcal{Y}}_i$, $i = 1, \cdots, T_{tr}$, denote the data from $T_{tr}$ different data sources, defined similarly to $\mathcal{X}, \mathcal{Y}, \bar{\mathcal{X}}, \bar{\mathcal{Y}}$ above. We refer to the training of the predictor for data $\mathcal{X}_i, \mathcal{Y}_i, \bar{\mathcal{X}}_i, \bar{\mathcal{Y}}_i$ as the $i$th training task, and $\boldsymbol{\theta}_i$ is referred to as the parameter for the task. Meta-learning captures similarities across the tasks by learning a common $\boldsymbol{\theta}$ (which we denote by $\boldsymbol{\theta}_0$) from the data of these $T_{tr}$ tasks (called the meta-training data), such that $\boldsymbol{\theta}_0$ can be quickly adapted to train a predictor for data from new, different but similar data sources.
Depending on how $\boldsymbol{\theta}$ is obtained from $\boldsymbol{\theta}_0$, various meta-learning algorithms exist (Denevi et al., 2018; Finn & Levine, 2018; Allen et al., 2019). The performance of the meta-learning algorithm is evaluated on previously unseen data from several other similar sources $\mathcal{X}^{v}_i, \mathcal{Y}^{v}_i, \bar{\mathcal{X}}^{v}_i, \bar{\mathcal{Y}}^{v}_i$, $i = 1, \cdots, T_{v}$ (called the meta-test data), defined similarly to $\mathcal{X}, \mathcal{Y}, \bar{\mathcal{X}}, \bar{\mathcal{Y}}$; this constitutes the meta-test phase. The training of the predictor for the data $\mathcal{X}^{v}_i, \mathcal{Y}^{v}_i, \bar{\mathcal{X}}^{v}_i, \bar{\mathcal{Y}}^{v}_i$ is referred to as the $i$th test task, and $\boldsymbol{\theta}^{v}_i$ denotes the parameter for the $i$th test task. In the existing meta-learning approaches, both $\boldsymbol{\theta}_i$ and $\boldsymbol{\theta}^{v}_i$ are obtained by adapting $\boldsymbol{\theta}_0$ using the gradients of $L(\mathcal{X}_i, \mathcal{Y}_i, \boldsymbol{\theta})$ and $L(\mathcal{X}^{v}_i, \mathcal{Y}^{v}_i, \boldsymbol{\theta})$, respectively, evaluated at $\boldsymbol{\theta}_0$. In our work, we propose a meta-learning algorithm where $\boldsymbol{\theta}_i$ explicitly uses a similarity between the $i$th training task and all the training tasks. Similarly, the parameters $\boldsymbol{\theta}^{v}_i$ for the test tasks also explicitly use a similarity between the $i$th test task and all the training tasks. As specified later, we define this task-similarity between two tasks through kernel regression, and our algorithm learns the kernel regression coefficients $\boldsymbol{\Psi}$ as meta-parameters in addition to $\boldsymbol{\theta}_0$.

A motivating example. Let us now consider the specific loss function $L(\mathcal{X}, \mathcal{Y}, \boldsymbol{\theta}) = \sum_{k=1}^{K} \|\mathbf{y}_k - f(\mathbf{x}_k, \boldsymbol{\theta})\|_2^2$. Training the tasks independently with limited training data will typically result in a predictor that overfits to $\mathcal{X}, \mathcal{Y}$ and generalizes poorly to $\bar{\mathcal{X}}, \bar{\mathcal{Y}}$.
MAML-type meta-learning approaches (Finn et al., 2017) solve this by inferring the information across tasks in the form of a good initialization $\boldsymbol{\theta}_0$, specialized/adapted to a new task using the adaptation function $g_{\mathrm{MAML}} : \mathbb{R}^{D} \times \mathbb{R}^{K n_x} \times \mathbb{R}^{K n_y} \ni (\boldsymbol{\theta}_0, \mathcal{X}, \mathcal{Y}) \mapsto g_{\mathrm{MAML}}(\boldsymbol{\theta}_0, \mathcal{X}, \mathcal{Y}) \in \mathbb{R}^{D}$ defined as
$$g_{\mathrm{MAML}}(\boldsymbol{\theta}_0, \mathcal{X}, \mathcal{Y}) \triangleq \boldsymbol{\theta}_0 - \alpha \nabla_{\boldsymbol{\theta}_0} L(\mathcal{X}, \mathcal{Y}, \boldsymbol{\theta}_0).$$
The parameters for the training and test tasks are obtained through adaptation of $\boldsymbol{\theta}_0$ as $\boldsymbol{\theta}_i = g_{\mathrm{MAML}}(\boldsymbol{\theta}_0, \mathcal{X}_i, \mathcal{Y}_i)$, $i = 1, \cdots, T_{tr}$, and $\boldsymbol{\theta}^{v}_i = g_{\mathrm{MAML}}(\boldsymbol{\theta}_0, \mathcal{X}^{v}_i, \mathcal{Y}^{v}_i)$, $i = 1, \cdots, T_{v}$. The meta-parameter $\boldsymbol{\theta}_0$ is learnt by iteratively taking gradient-descent steps with respect to the test loss on the training tasks, given by $\sum_{i=1}^{T_{tr}} L(\bar{\mathcal{X}}_i, \bar{\mathcal{Y}}_i, g_{\mathrm{MAML}}(\boldsymbol{\theta}_0, \mathcal{X}_i, \mathcal{Y}_i))$. The parameters for a task are obtained directly from $\boldsymbol{\theta}_0$ and do not make use of any information from the other training tasks. As a result, the common $\boldsymbol{\theta}_0$ learnt during meta-training treats all tasks equally: the algorithm implicitly assumes similarity of all tasks, but is not able to discern or quantify the degree of similarity or dissimilarity among the tasks. In contrast, our algorithm involves an adaptation function $g_{\mathrm{TANML}}$ (to be defined later) that explicitly uses a notion of similarity between the tasks to predict the parameters for a task. As a result, we expect our approach to help train predictors even when the data sources are not very similar to each other. In our numerical experiments in Section 4, we see that this is indeed the case with the sinusoidal function as the data source.
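To make the inner/outer structure concrete, the following is a minimal first-order MAML sketch on a toy quadratic per-task loss. The loss, step sizes, and task set are illustrative stand-ins, not the paper's implementation:

```python
import numpy as np

# Toy quadratic loss per task: L_i(theta) = ||theta - c_i||^2,
# a stand-in for the network loss L(X_i, Y_i, theta).
def loss_grad(theta, c):
    return 2.0 * (theta - c)

def g_maml(theta0, c, alpha=0.1):
    # Inner update: one gradient step from the shared initialization theta0.
    return theta0 - alpha * loss_grad(theta0, c)

def maml_outer_step(theta0, task_centers, alpha=0.1, beta=0.05):
    # First-order MAML outer update: the test-loss gradient is evaluated
    # at the adapted parameters, ignoring second-order terms.
    grad = np.zeros_like(theta0)
    for c in task_centers:
        theta_i = g_maml(theta0, c, alpha)
        grad += loss_grad(theta_i, c)
    return theta0 - beta * grad / len(task_centers)

rng = np.random.default_rng(0)
centers = [rng.normal(size=3) for _ in range(8)]  # one optimum per task
theta0 = np.zeros(3)
for _ in range(200):
    theta0 = maml_outer_step(theta0, centers)
```

For this toy loss the outer iteration contracts toward the mean of the task optima, illustrating how $\boldsymbol{\theta}_0$ summarizes the training tasks without distinguishing between them.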

1.2. RELATED WORK

The structural characterization of tasks and the use of task-dependent knowledge have gained interest in meta-learning recently. Edwards & Storkey (2017) proposed a variational-autoencoder-based approach to generate task/dataset statistics used to measure similarity. Ruder & Plank (2017) considered domain similarity and diversity measures in the context of transfer learning. The study of how task properties affect catastrophic forgetting in continual learning was pursued by Nguyen et al. (2019). Lee et al. (2020) proposed a task-adaptive meta-learning approach for classification that adaptively balances meta-learning and task-specific learning differently for every task and class. Bayesian approaches have been proposed to capture the similarity across tasks in the form of task hyperpriors (Yoon et al., 2018; Finn et al., 2018; Grant et al., 2018; Rothfuss et al., 2020). Task-similarity defined through the effective sample size has been used to develop a new off-policy algorithm for meta-reinforcement learning (Fakoor et al., 2020). It was shown by Oreshkin et al. (2018) that the performance of few-shot learning shows significant improvements with the use of task-dependent metrics. While the use of kernels or similarity metrics is not new in meta-learning, they are typically seen in the context of defining relations between the classes or samples within a given task (Qiao et al., 2018; Rusu et al., 2019; Vinyals et al., 2016; Snell et al., 2017; Oreshkin et al., 2018; Fortuin & Rätsch, 2019; Goo & Niekum, 2020). Qiao et al. (2018) use similarity metrics in the activation space to predict parameters for novel categories in few-shot learning with zero training. Information-theoretic ideas have also been used in the study of the topology and the geometry of task spaces by Nguyen et al. (2019); Achille et al. (2018). Achille et al. (2019) construct vector representations for tasks using partially trained probe networks, based on which task-similarity metrics are developed. Task descriptors have been of interest especially in vision-related tasks in the context of transfer learning (Zamir et al., 2018; Achille et al., 2019; Tran et al., 2019). Recently, neural tangent kernels were proposed for the asymptotic analysis of meta-learning for infinitely wide neural networks, by considering gradient-based kernels across tasks (Wang et al., 2020). The work of Wang et al. (2020) is the closest in spirit to ours in that they consider kernels across meta-learning tasks. However, the premise of their work is very different. Their aim is to show how the global convergence behaviour of popular MAML-type task-specific adaptation can be asymptotically described using specific kernels, when every task involves training deep neural networks of asymptotically infinite or very large widths. Our premise, on the other hand, is entirely different: we consider a task-specific adaptation that actively employs the similarity of tasks in the form of valid kernel functions, in order to improve meta-learning performance. Our work does not make assumptions on the nature of the kernel, the structure of the learner, or its dimensions.

2. REVIEW OF MAML AND META-SGD

To facilitate our analysis, we first review MAML and Meta-SGD approaches and highlight the relevant aspects necessary for our discussion. We shall then show how these approaches lead to the definition of a generalized meta-SGD and consequently, to our TANML approach.

2.1. MODEL-AGNOSTIC META-LEARNING

Model-agnostic meta-learning proceeds in two stages iteratively. As discussed in the motivating example, the parameter $\boldsymbol{\theta}_i$ for the $i$th training task $\mathcal{X}_i, \mathcal{Y}_i, \bar{\mathcal{X}}_i, \bar{\mathcal{Y}}_i$, $i = 1, \cdots, T_{tr}$, is obtained by applying the adaptation function $g_{\mathrm{MAML}}$ to $\boldsymbol{\theta}_0$ as $\boldsymbol{\theta}_i = g_{\mathrm{MAML}}(\boldsymbol{\theta}_0, \mathcal{X}_i, \mathcal{Y}_i)$. This is called the inner update. Once $\boldsymbol{\theta}_i$ is obtained for all the training tasks, $\boldsymbol{\theta}_0$ is then updated by running one gradient-descent step on the total test loss $\sum_{i=1}^{T_{tr}} L(\bar{\mathcal{X}}_i, \bar{\mathcal{Y}}_i, g_{\mathrm{MAML}}(\boldsymbol{\theta}_0, \mathcal{X}_i, \mathcal{Y}_i))$. This is called the outer update. Each outer update involves $T_{tr}$ inner updates corresponding to the $T_{tr}$ training tasks. The outer updates are run for $N_{iter}$ iterations. This constitutes the meta-training phase of MAML, described in Algorithm 1. Once the meta-training phase is complete and $\boldsymbol{\theta}_0$ is learnt, the parameters for a new test task are obtained by applying the inner update on the training data of the test task. We note that MAML as described in Algorithm 1 is the first-order MAML (Finn et al., 2018), as opposed to the general MAML where the inner update may contain several gradient-descent steps; whenever we speak of MAML in this paper, we refer to the first-order MAML. A schematic of MAML is presented in Figure 1.

Algorithm 1: Model-agnostic meta-learning (MAML)
  Initialize $\boldsymbol{\theta}_0$
  for $N_{iter}$ iterations do
    for $i = 1, \cdots, T_{tr}$ do
      $g_{\mathrm{MAML}}(\boldsymbol{\theta}_0, \mathcal{X}_i, \mathcal{Y}_i) = \boldsymbol{\theta}_0 - \alpha \nabla_{\boldsymbol{\theta}_0} L(\mathcal{X}_i, \mathcal{Y}_i, \boldsymbol{\theta}_0)$  [Inner update]
    end
    $\boldsymbol{\theta}_0 = \boldsymbol{\theta}_0 - \beta \nabla_{\boldsymbol{\theta}_0} \sum_{i=1}^{T_{tr}} L(\bar{\mathcal{X}}_i, \bar{\mathcal{Y}}_i, g_{\mathrm{MAML}}(\boldsymbol{\theta}_0, \mathcal{X}_i, \mathcal{Y}_i))$  [Outer update]
  end

2.2. META-STOCHASTIC GRADIENT DESCENT (META-SGD)

Meta-stochastic gradient descent (Meta-SGD) is a variant of MAML that learns the component-wise step sizes for the inner update jointly with $\boldsymbol{\theta}_0$. Let $\boldsymbol{\alpha}$ denote the vector of step sizes for the different components of $\boldsymbol{\theta}$. As with MAML, the meta-training phase of Meta-SGD also involves an inner and an outer update.
The outer update computes the values of $\boldsymbol{\theta}_0$ and $\boldsymbol{\alpha}$; the inner update computes the parameter values $\boldsymbol{\theta}_i$ using the adaptation function $g_{\mathrm{MSGD}} : \mathbb{R}^{D} \times \mathbb{R}^{D} \times \mathbb{R}^{K n_x} \times \mathbb{R}^{K n_y} \ni (\boldsymbol{\theta}_0, \boldsymbol{\alpha}, \mathcal{X}_i, \mathcal{Y}_i) \mapsto g_{\mathrm{MSGD}}(\boldsymbol{\theta}_0, \boldsymbol{\alpha}, \mathcal{X}_i, \mathcal{Y}_i) \in \mathbb{R}^{D}$ defined as
$$g_{\mathrm{MSGD}}(\boldsymbol{\theta}_0, \boldsymbol{\alpha}, \mathcal{X}_i, \mathcal{Y}_i) \triangleq \boldsymbol{\theta}_0 - \boldsymbol{\alpha} \circ \nabla_{\boldsymbol{\theta}_0} L(\mathcal{X}_i, \mathcal{Y}_i, \boldsymbol{\theta}_0),$$
where the $\circ$ operator denotes the point-wise vector product. The outer update is run for $N_{iter}$ iterations. The meta-training phase of Meta-SGD is described in Algorithm 2.

Algorithm 2: Meta-stochastic gradient descent
  Initialize $[\boldsymbol{\theta}_0, \boldsymbol{\alpha}]$
  for $N_{iter}$ iterations do
    for $i = 1, \cdots, T_{tr}$ do
      $g_{\mathrm{MSGD}}(\boldsymbol{\theta}_0, \boldsymbol{\alpha}, \mathcal{X}_i, \mathcal{Y}_i) = \boldsymbol{\theta}_0 - \boldsymbol{\alpha} \circ \nabla_{\boldsymbol{\theta}_0} L(\mathcal{X}_i, \mathcal{Y}_i, \boldsymbol{\theta}_0)$  [Inner update]
    end
    $[\boldsymbol{\theta}_0, \boldsymbol{\alpha}] = [\boldsymbol{\theta}_0, \boldsymbol{\alpha}] - \beta \nabla_{[\boldsymbol{\theta}_0, \boldsymbol{\alpha}]} \sum_{i=1}^{T_{tr}} L(\bar{\mathcal{X}}_i, \bar{\mathcal{Y}}_i, g_{\mathrm{MSGD}}(\boldsymbol{\theta}_0, \boldsymbol{\alpha}, \mathcal{X}_i, \mathcal{Y}_i))$  [Outer update]
  end

The predictor for the $i$th test task is then trained by applying the inner update on $\mathcal{X}^{v}_i, \mathcal{Y}^{v}_i$, using the values of $\boldsymbol{\theta}_0$ and $\boldsymbol{\alpha}$ obtained in the meta-training phase. We notice that the inner update is expressible as
$$g_{\mathrm{MSGD}}(\boldsymbol{\theta}_0, \boldsymbol{\alpha}, \mathcal{X}_i, \mathcal{Y}_i) = \mathbf{W}^{\top} \mathbf{z}_i(\boldsymbol{\theta}_0), \quad \text{where } \mathbf{W} \triangleq [\mathbf{I}, -\mathrm{diag}(\boldsymbol{\alpha})]^{\top} \text{ and } \mathbf{z}_i(\boldsymbol{\theta}_0) \triangleq \begin{bmatrix} \boldsymbol{\theta}_0 \\ \nabla_{\boldsymbol{\theta}_0} L(\mathcal{X}_i, \mathcal{Y}_i, \boldsymbol{\theta}_0) \end{bmatrix} \quad (1)$$
The matrix $\mathbf{W}^{\top}$ denotes the transpose of $\mathbf{W}$, $\mathbf{I}$ denotes the identity matrix, and $\mathrm{diag}(\boldsymbol{\alpha})$ denotes the diagonal matrix whose diagonal is equal to the vector $\boldsymbol{\alpha}$. We refer to $\mathbf{z}_i(\boldsymbol{\theta}_0)$ as the task descriptor of the $i$th training task. Thus, $g_{\mathrm{MSGD}}(\boldsymbol{\theta}_0, \boldsymbol{\alpha}, \mathcal{X}_i, \mathcal{Y}_i)$ takes the form of a linear predictor for $\boldsymbol{\theta}_i$ with $\mathbf{z}_i(\boldsymbol{\theta}_0)$ as the input and regression coefficients $\mathbf{W}$. The adaptation $g_{\mathrm{MSGD}}(\boldsymbol{\theta}_0, \boldsymbol{\alpha}, \mathcal{X}_i, \mathcal{Y}_i)$ can be generalized to the case of $\mathbf{W}$ being a full matrix that is learnt from the training tasks.
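The equivalence between the Meta-SGD inner update and a linear predictor on the task descriptor can be checked numerically. In this minimal sketch, $\boldsymbol{\theta}_0$ and the gradient are random stand-ins:

```python
import numpy as np

D = 4
rng = np.random.default_rng(1)
theta0 = rng.normal(size=D)
grad = rng.normal(size=D)           # stand-in for the gradient of L at theta0
alpha = rng.uniform(0.01, 0.1, D)   # per-component step sizes

# Meta-SGD inner update: theta0 - alpha ∘ grad (point-wise product).
msgd = theta0 - alpha * grad

# The same update written as a linear predictor W^T z_i(theta0),
# with task descriptor z_i = [theta0; grad] and W = [I, -diag(alpha)]^T.
z = np.concatenate([theta0, grad])            # descriptor of length 2D
W = np.vstack([np.eye(D), -np.diag(alpha)])   # shape (2D, D)
linear = W.T @ z
# msgd and linear coincide, motivating the generalization to a full W.
```

The two expressions agree exactly, which is what licenses replacing the structured $\mathbf{W}$ of equation 1 with a fully learnable matrix.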
This generalization results in the adaptation function $g_{\mathrm{GMSGD}} : \mathbb{R}^{D} \times \mathbb{R}^{2D \times D} \times \mathbb{R}^{K n_x} \times \mathbb{R}^{K n_y} \ni (\boldsymbol{\theta}_0, \mathbf{W}, \mathcal{X}, \mathcal{Y}) \mapsto g_{\mathrm{GMSGD}}(\boldsymbol{\theta}_0, \mathbf{W}, \mathcal{X}, \mathcal{Y}) \in \mathbb{R}^{D}$ given by
$$g_{\mathrm{GMSGD}}(\boldsymbol{\theta}_0, \mathbf{W}, \mathcal{X}_i, \mathcal{Y}_i) = \mathbf{W}^{\top} \mathbf{z}_i(\boldsymbol{\theta}_0) = \mathbf{W}_1 \boldsymbol{\theta}_0 + \mathbf{W}_2 \nabla_{\boldsymbol{\theta}_0} L(\mathcal{X}_i, \mathcal{Y}_i, \boldsymbol{\theta}_0),$$
where $\mathbf{W}_1, \mathbf{W}_2 \in \mathbb{R}^{D \times D}$ are the submatrices of $\mathbf{W}$ such that $\mathbf{W}^{\top} = [\mathbf{W}_1 \ \mathbf{W}_2]$. Expressed in this manner, we notice how $g_{\mathrm{GMSGD}}$ performs a parameter update similar to a second-order gradient update, with $\mathbf{W}_2$ taking a role similar to the Hessian matrix. We refer to the resulting meta-learning algorithm as the Generalized Meta-SGD, described in Algorithm 3. The second term $\Omega(\mathbf{W})$ in the outer-loop cost function is a regularization that ensures $\mathbf{W}$ is bounded and avoids overfitting. On setting $\mu = 0$ and using $\mathbf{W}$ as defined in equation 1, the Generalized Meta-SGD reduces to the Meta-SGD. The Generalized Meta-SGD is thus a more general form of MAML, arrived at by viewing MAML/Meta-SGD as a linear regression.

Algorithm 3: Generalized Meta-SGD
  Initialize $[\boldsymbol{\theta}_0, \mathbf{W} \in \mathbb{R}^{2D \times D}]$
  for $N_{iter}$ iterations do
    for $i = 1, \cdots, T_{tr}$ do
      $g_{\mathrm{GMSGD}}(\boldsymbol{\theta}_0, \mathbf{W}, \mathcal{X}_i, \mathcal{Y}_i) = \mathbf{W}^{\top} \mathbf{z}_i(\boldsymbol{\theta}_0)$  [Inner update]
    end
    $[\boldsymbol{\theta}_0, \mathbf{W}] = [\boldsymbol{\theta}_0, \mathbf{W}] - \beta \nabla_{[\boldsymbol{\theta}_0, \mathbf{W}]} \left( \sum_{i=1}^{T_{tr}} L(\bar{\mathcal{X}}_i, \bar{\mathcal{Y}}_i, g_{\mathrm{GMSGD}}(\boldsymbol{\theta}_0, \mathbf{W}, \mathcal{X}_i, \mathcal{Y}_i)) + \mu \Omega(\mathbf{W}) \right)$  [Outer update]
  end

3. TASK-SIMILARITY AWARE META-LEARNING

It is well known that the expressive power of linear regression is limited, due to both its linear nature and the finite dimension of the input. Further, since the dimension of the linear regression matrix $\mathbf{W}$ grows quadratically with the dimension of $\boldsymbol{\theta}$, a large amount of training data would be necessary to estimate it. A transformation of linear regression in the form of the 'kernel substitution' or 'kernel trick' results in the more general nonparametric or kernel regression (Bishop, 2006; Schölkopf & Smola, 2002). Kernel regression essentially performs linear regression in an infinite-dimensional space, making it a simple yet powerful and effective nonlinear approach.
This motivates us to use the kernel regression model as the natural next step from the Generalized Meta-SGD developed in the earlier section. By generalizing the linear regression model in $g_{\mathrm{GMSGD}}$, we propose an adaptation function $g_{\mathrm{TANML}} : \mathbb{R}^{D} \times \mathbb{R}^{T_{tr} \times D} \times \mathbb{R}^{K n_x} \times \mathbb{R}^{K n_y} \ni (\boldsymbol{\theta}_0, \boldsymbol{\Psi}, \mathcal{X}, \mathcal{Y}) \mapsto g_{\mathrm{TANML}}(\boldsymbol{\theta}_0, \boldsymbol{\Psi}, \mathcal{X}, \mathcal{Y}) \in \mathbb{R}^{D}$ in the form of the nonparametric or kernel regression model
$$g_{\mathrm{TANML}}(\boldsymbol{\theta}_0, \boldsymbol{\Psi}, \mathcal{X}_i, \mathcal{Y}_i) = \sum_{j=1}^{T_{tr}} \boldsymbol{\psi}_j \, k(\mathbf{z}_i(\boldsymbol{\theta}_0), \mathbf{z}_j(\boldsymbol{\theta}_0)) = \boldsymbol{\Psi}^{\top} \mathbf{k}(\boldsymbol{\theta}_0, i), \quad (2)$$
where $k : \mathbb{R}^{2D} \times \mathbb{R}^{2D} \to \mathbb{R}$ denotes a valid kernel function, $\mathbf{k}(\boldsymbol{\theta}_0, i) \triangleq [k(\mathbf{z}_i(\boldsymbol{\theta}_0), \mathbf{z}_1(\boldsymbol{\theta}_0)), \cdots, k(\mathbf{z}_i(\boldsymbol{\theta}_0), \mathbf{z}_{T_{tr}}(\boldsymbol{\theta}_0))]^{\top}$ is the vector of kernel values between the $i$th training task and all the $T_{tr}$ training tasks, and $\boldsymbol{\Psi} = [\boldsymbol{\psi}_1, \cdots, \boldsymbol{\psi}_{T_{tr}}]^{\top}$ is the matrix of kernel regression coefficients stacked along the rows. The kernel coefficient matrix $\boldsymbol{\Psi}$ and the parameter $\boldsymbol{\theta}_0$ are learnt in the meta-training phase by iteratively performing an outer update, as in the case of the Generalized Meta-SGD. The computed $\boldsymbol{\Psi}$ and $\boldsymbol{\theta}_0$ are then used to train the predictor for a new test task by applying the inner update on its training data. We call our approach Task-similarity Aware Nonparametric Meta-Learning (TANML), since the kernel measures the similarity between tasks through the task descriptors. As we show in the Appendix, TANML reduces to Meta-SGD and MAML for specific choices of the kernel $k$ and the regression coefficient matrix $\boldsymbol{\Psi}$. The kernel regression in equation 2 models $\boldsymbol{\theta}_i$ as belonging to the space of functions
$$\mathcal{H} = \left\{ \boldsymbol{\theta} : \boldsymbol{\theta}(\mathbf{z}) = \sum_{i'=1}^{T_{tr}} \boldsymbol{\psi}_{i'} \, k(\mathbf{z}, \mathbf{z}_{i'}(\boldsymbol{\theta}_0)), \ \boldsymbol{\psi}_{i'} \in \mathbb{R}^{D}, \ i' = 1, \cdots, T_{tr} \right\}.$$
The space $\mathcal{H}$ is referred to as the reproducing kernel Hilbert space (RKHS) associated with the kernel $k(\cdot, \cdot)$. We refer the reader to (Hofmann et al., 2008) and (Schölkopf & Smola, 2002) for further reading on RKHS.
The space $\mathcal{H}$ has an important structure: each function in $\mathcal{H}$ uses the information (the coefficients $\boldsymbol{\psi}$) from all the $T_{tr}$ training tasks, weighted by the kernel that essentially quantifies a similarity or correlation between the tasks through the task descriptors defined earlier. While all meta-learning approaches use similarity implicitly, the number of works actively using similarity with the training data at test time is limited (Fakoor et al., 2020).

[Figure 1: (left) MAML: meta-learning of $\boldsymbol{\theta}_0$ and task adaptation via the task gradients $\nabla L_i$; (right) TANML: task descriptors $\mathbf{z}_i$, kernel computations $k(\mathbf{z}_i, \mathbf{z}_j)$, and RKHS-based adaptation.]

Computing the optimal values of the kernel coefficients $\boldsymbol{\Psi}$ and $\boldsymbol{\theta}_0$, which forms the meta-training phase, is then equivalent to solving the functional minimization problem
$$\arg\min_{\boldsymbol{\theta}_0, \, \boldsymbol{\theta} \in \mathcal{H}} \ \sum_{i=1}^{T_{tr}} L(\bar{\mathcal{X}}_i, \bar{\mathcal{Y}}_i, \boldsymbol{\theta}) + \mu \|\boldsymbol{\theta}\|^2_{\mathcal{H}},$$
where the regularization term is the squared norm in the RKHS, which promotes smoothness and controls overfitting, $\mu$ being the regularization constant. The squared norm in the RKHS is defined as $\|\boldsymbol{\theta}\|^2_{\mathcal{H}} \triangleq \sum_{i=1}^{T_{tr}} \sum_{i'=1}^{T_{tr}} \boldsymbol{\psi}_i^{\top} \boldsymbol{\psi}_{i'} \, k(\mathbf{z}_i(\boldsymbol{\theta}_0), \mathbf{z}_{i'}(\boldsymbol{\theta}_0)) = \boldsymbol{\Psi}^{\top} \mathbf{K}(\boldsymbol{\theta}_0) \boldsymbol{\Psi}$, where $\mathbf{K}(\boldsymbol{\theta}_0) \in \mathbb{R}^{T_{tr} \times T_{tr}}$ is the matrix of kernels evaluated across all the training tasks. This novel connection between meta-learning and RKHS obtained from TANML could potentially help the mathematical understanding of existing algorithms, and help develop new meta-learning algorithms in the light of RKHS theory (Hofmann et al., 2008; Schölkopf & Smola, 2002). The meta-training phase of TANML is described in Algorithm 4, where we use $\Omega(\boldsymbol{\Psi}) = \boldsymbol{\Psi}^{\top} \mathbf{K}(\boldsymbol{\theta}_0) \boldsymbol{\Psi}$. In general, other regularizations such as the $\ell_1$ or $\ell_2$ norms could also be used.
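A minimal sketch of the TANML inner update and the RKHS regularizer follows. The descriptors and coefficients are random stand-ins, the Gaussian kernel and its bandwidth are illustrative choices, and the scalar regularizer is computed as the trace of the quadratic form $\boldsymbol{\Psi}^{\top}\mathbf{K}\boldsymbol{\Psi}$, summing over the $D$ parameter components:

```python
import numpy as np

def gaussian_kernel(za, zb, sigma=1.0):
    # RBF kernel on task descriptors; sigma is an illustrative bandwidth.
    return np.exp(-np.sum((za - zb) ** 2) / sigma ** 2)

rng = np.random.default_rng(2)
T_tr, D = 6, 4
Z = rng.normal(size=(T_tr, 2 * D))   # descriptors z_j(theta0), one per task
Psi = rng.normal(size=(T_tr, D))     # kernel regression coefficients

# Inner update for task i: theta_i = Psi^T k(theta0, i).
i = 0
k_vec = np.array([gaussian_kernel(Z[i], Z[j]) for j in range(T_tr)])
theta_i = Psi.T @ k_vec              # length-D task parameter

# RKHS-norm regularizer: trace of Psi^T K(theta0) Psi.
K = np.array([[gaussian_kernel(Z[a], Z[b]) for b in range(T_tr)]
              for a in range(T_tr)])
omega = np.trace(Psi.T @ K @ Psi)
# omega >= 0, since K is positive semidefinite for a valid kernel.
```

Note how every task parameter is a kernel-weighted combination of the coefficient vectors of all training tasks, which is precisely the similarity-awareness the framework is built on.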
We also note from equation 2 that the TANML approach is a general framework: any kernel and any task descriptor which meaningfully captures the information in the task could be employed. What constitutes a meaningful descriptor for a task is an open question; while there have been studies on deriving features and metrics for understanding the notion of similarity between data sources or datasets, they have mostly been domain-specific and often require separate 'probe' neural networks for the extraction of features (Kim et al., 2019). The particular form of the task descriptors used in our derivation is the result of taking MAML/Meta-SGD as the starting point, and follows naturally from analyzing them through the lens of linear and kernel regression. A schematic describing the task-descriptor based TANML and the intuition behind its working is shown in Figure 1.

Algorithm 4: Task-similarity Aware Nonparametric Meta-Learning (TANML)
  Initialize $[\boldsymbol{\theta}_0, \boldsymbol{\Psi} \in \mathbb{R}^{T_{tr} \times D}]$
  for $N_{iter}$ iterations do
    for $i = 1, \cdots, T_{tr}$ do
      $g_{\mathrm{TANML}}(\boldsymbol{\theta}_0, \boldsymbol{\Psi}, \mathcal{X}_i, \mathcal{Y}_i) = \boldsymbol{\Psi}^{\top} \mathbf{k}(\boldsymbol{\theta}_0, i)$  [Inner update]
    end
    $[\boldsymbol{\theta}_0, \boldsymbol{\Psi}] = [\boldsymbol{\theta}_0, \boldsymbol{\Psi}] - \beta \nabla_{[\boldsymbol{\theta}_0, \boldsymbol{\Psi}]} \left( \sum_{i=1}^{T_{tr}} L(\bar{\mathcal{X}}_i, \bar{\mathcal{Y}}_i, g_{\mathrm{TANML}}(\boldsymbol{\theta}_0, \boldsymbol{\Psi}, \mathcal{X}_i, \mathcal{Y}_i)) + \mu \Omega(\boldsymbol{\Psi}) \right)$  [Outer update]
  end

On the choice of kernels and sequential training. While the expressive power of kernels is immense, it is also known that the performance can vary depending on the choice of the kernel function (Schölkopf & Smola, 2002). The kernel function that works best for a dataset is usually found by trial and error. A possible approach is to use multi-kernel regression, where one lets the data decide which of a pre-specified set of kernels are relevant (Sonnenburg & Schäfer, 2005; Gönen & Alpaydin, 2011). Domain-specific knowledge may also be incorporated in the choice of kernels. In our analysis, we use two of the popular kernel functions: the Gaussian or radial basis function (RBF) kernel, and the cosine kernel.
We note that since MAML, Meta-SGD, and similar approaches perform the inner update independently for every task, they naturally admit a sequential or batch-based training. Since TANML uses an inner update in the form of a nonparametric kernel regression, it inherits one of the limitations of kernel-based approaches: all training data is used simultaneously for every task. As a result, the task losses and the associated gradients for all the training tasks are used at every inner update of TANML. One way to overcome this limitation would be the use of online or sequential kernel regression techniques (Lu et al., 2016; Sahoo et al., 2019; Vermaak et al., 2003). We will pursue this in future work.

4. NUMERICAL EXPERIMENTS

We evaluate the performance of TANML and compare it with MAML and Meta-SGD on two synthesized regression datasets, five real-world time-series prediction problems from the Physionet 2012 Challenge dataset (Silva et al., 2012; Rothfuss et al., 2020), and on the few-shot classification dataset Omniglot (Lake et al., 2015). The synthesized regression tasks have been used by previous works in meta-learning (Denevi et al., 2018; Finn et al., 2017; 2018) as a baseline for evaluating performance on regression tasks. In every experiment, the predictor $f(\mathbf{x}, \boldsymbol{\theta})$ is the output of a fully connected four-layer feed-forward neural network with the rectified linear unit (ReLU) as the activation function; $\boldsymbol{\theta}$ is the vector of all the weights and biases in the neural network, and the predicted output $\hat{\mathbf{y}}$ is a scalar or a vector for the regression and classification tasks, respectively. We consider two kernel functions for TANML: the Gaussian kernel $k(\mathbf{z}_i(\boldsymbol{\theta}_0), \mathbf{z}_{i'}(\boldsymbol{\theta}_0)) = \exp\!\left(-\|\mathbf{z}_i(\boldsymbol{\theta}_0) - \mathbf{z}_{i'}(\boldsymbol{\theta}_0)\|_2^2 / \sigma^2\right)$, and the cosine kernel $k(\mathbf{z}_i(\boldsymbol{\theta}_0), \mathbf{z}_{i'}(\boldsymbol{\theta}_0)) = \dfrac{\mathbf{z}_i(\boldsymbol{\theta}_0)^{\top} \mathbf{z}_{i'}(\boldsymbol{\theta}_0)}{\|\mathbf{z}_i(\boldsymbol{\theta}_0)\| \, \|\mathbf{z}_{i'}(\boldsymbol{\theta}_0)\|}$. In order that the structural similarities are better expressed, we perform the kernel regression for the parameters of the different layers separately. This is because using a single adaptation function for all components of $\boldsymbol{\theta}_i$ might result in certain parameters dominating the kernel regression, especially when the dimension of the parameters becomes large. Hence, we perform the adaptation separately for the components of $\boldsymbol{\theta}_i$ corresponding to the different layers $l = 1, \cdots, L$, using the adaptation functions $g_{\mathrm{TANML},l} : \mathbb{R}^{D_l} \times \mathbb{R}^{T_{tr} \times D_l} \times \mathbb{R}^{K n_x} \times \mathbb{R}^{K n_y} \ni (\boldsymbol{\theta}_{0,l}, \boldsymbol{\Psi}_l, \mathcal{X}, \mathcal{Y}) \mapsto g_{\mathrm{TANML},l}(\boldsymbol{\theta}_{0,l}, \boldsymbol{\Psi}_l, \mathcal{X}, \mathcal{Y}) \in \mathbb{R}^{D_l}$ for the parameters $\boldsymbol{\theta}_{i,l}$ belonging to the $l$th network layer:
$$\boldsymbol{\theta}_{i,l} = g_{\mathrm{TANML},l}(\boldsymbol{\theta}_{0,l}, \boldsymbol{\Psi}_l, \mathcal{X}_i, \mathcal{Y}_i) = \sum_{i'=1}^{T_{tr}} \boldsymbol{\psi}_{i',l} \, k(\mathbf{z}_i(\boldsymbol{\theta}_{0,l}), \mathbf{z}_{i'}(\boldsymbol{\theta}_{0,l})), \quad l = 1, \cdots, L.$$
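The two kernel functions used in the experiments can be sketched as follows; the bandwidth $\sigma = 1$ and the small `eps` guard against division by zero are illustrative choices:

```python
import numpy as np

def rbf_kernel(zi, zj, sigma=1.0):
    # Gaussian/RBF kernel between task descriptors.
    return np.exp(-np.sum((zi - zj) ** 2) / sigma ** 2)

def cosine_kernel(zi, zj, eps=1e-12):
    # Cosine kernel: inner product of the normalized descriptors.
    return float(zi @ zj) / (np.linalg.norm(zi) * np.linalg.norm(zj) + eps)

rng = np.random.default_rng(3)
z1, z2 = rng.normal(size=8), rng.normal(size=8)
# Both are valid (positive semidefinite) kernels. The cosine kernel is
# scale-invariant in each argument; the RBF kernel is not.
```

The scale-invariance of the cosine kernel means it compares the direction of the task descriptors rather than their magnitude, which may explain its different behaviour from the Gaussian kernel in the experiments.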
For the regression tasks, the performance of the various meta-learning approaches is compared using the normalized mean-squared error (NMSE) on the test tasks,
$$\mathrm{NMSE} \triangleq \frac{\sum_{i=1}^{T_v} \sum_{k=1}^{\bar{K}} (y_k - \hat{y}_k)^2}{\sum_{i=1}^{T_v} \sum_{k=1}^{\bar{K}} y_k^2},$$
obtained on the meta-test set by averaging over 30 Monte Carlo realizations of tasks. The numerical details of the experiments not mentioned in the manuscript, such as the learning rate and other hyper-parameters, are given in the Appendix due to space constraints.

Experiment 1. We consider the task of training linear predictors of the form $f(\mathbf{x}, \boldsymbol{\theta}) = \boldsymbol{\theta}^{\top} \mathbf{x}$. The task data pairs $(\mathbf{x}, y) \in \mathbb{R}^{16} \times \mathbb{R}$ are generated by a linear model $y = \boldsymbol{\beta}^{\top} \mathbf{x} + e$. The regression coefficient vector $\boldsymbol{\beta}$ for the different tasks is sampled with equal probability from one of two isotropic Gaussian distributions, with means $\boldsymbol{\beta}_0 = -\mathbf{4}$ and $\boldsymbol{\beta}_0 = \mathbf{4}$, where $\mathbf{4}$ denotes the vector of all fours. The additive noise $e$ is white, drawn from the standard normal distribution, and uncorrelated with $\mathbf{x}$, which follows a 16-dimensional multivariate normal distribution. We consider two cases of $T_{tr} = 32$ and $T_{tr} = 64$ training tasks, each task with $K = 4$ samples in the training and test sets. We evaluate the performance of MAML, Meta-SGD, and TANML on a test set of $T_v = 64$ tasks. The NMSE test performance is reported in Table 1. We observe that both MAML and Meta-SGD perform very poorly in comparison to TANML; Meta-SGD performs slightly better than MAML. Further, we observe that TANML with the cosine kernel performs the best among the four algorithms. The superior performance of TANML could be ascribed to its nonlinear nature, with the gradients entering the estimation through the kernels: the adaptation function involves terms with products of different gradients, acting in the spirit of a higher-order method, unlike MAML/Meta-SGD which use a first-order adaptation. This also corroborates the findings of the recent theoretical work by Saunshi et al.
(2020), where they show that MAML-type approaches can fail in convex settings (as is the case in this experiment). It is also interesting to note that the value of $\boldsymbol{\theta}_0$ obtained by TANML almost coincides with $\mathbf{0}$, which is the value of $\boldsymbol{\theta}_0$ that theoretically minimizes the average error for this problem. We also find that both the cosine and Gaussian kernels typically converge in about 5000 iterations, whereas MAML and Meta-SGD show no improvement in NMSE even after 30000 iterations.

Experiment 2. In this experiment, we consider the training of nonlinear predictors corresponding to a fully connected ANN. We consider data pairs $(x, y) \in \mathbb{R} \times \mathbb{R}$ generated from the sinusoidal data source $y = A \sin(\omega x)$, where $x$ is drawn randomly from the interval $[-1, 1]$, and $A$ and $\omega$ differ across tasks. We do not use the knowledge that the data comes from a sinusoidal source while training the predictors. We are given $K = 4$ shots or data pairs in each task. In order to illustrate the potential of TANML in using the similarity/dissimilarity among tasks, we consider a fixed fraction of the tasks to be outliers, that is, generated from a non-sinusoidal data source in both the meta-training and meta-test data, as described next. We consider two different regression experiments: (1) Experiment 2a, fixed frequency and varying amplitude: the data for the different tasks is generated from sinusoids with $A$ drawn randomly from $(0, 1]$ and $\omega = 1$; the outlier task data is generated as $y(x) = Ax$. (2) Experiment 2b, fixed amplitude and varying frequency: the data for the different tasks is generated from sinusoids with $\omega$ drawn randomly from $[1, 1.5]$ and $A = 1$; the outlier task data is generated as $y(x) = \omega x$. We perform the experiments with the number of meta-training tasks equal to $T_{tr} = 256$ and $T_{tr} = 512$. The NMSE performance on the test tasks, obtained by averaging over 100 Monte Carlo realizations of tasks, is reported in Table 2.
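The task-generation protocol of Experiment 2a can be sketched as follows; the outlier fraction, seed, and helper names are illustrative, and the paper's exact sampling details may differ:

```python
import numpy as np

def make_task(rng, outlier=False, K=4):
    # One K-shot regression task: sinusoid y = A sin(x) with A ~ (0, 1],
    # or the linear outlier source y = A x (Experiment 2a).
    A = rng.uniform(1e-3, 1.0)
    x = rng.uniform(-1.0, 1.0, size=2 * K)
    y = A * x if outlier else A * np.sin(x)
    # First K pairs form the training split, the remaining K the test split.
    return (x[:K], y[:K]), (x[K:], y[K:])

def make_meta_dataset(rng, T_tr=256, outlier_frac=0.1):
    # A fixed fraction of the tasks are outliers, shuffled among the rest.
    n_out = int(outlier_frac * T_tr)
    flags = np.array([True] * n_out + [False] * (T_tr - n_out))
    rng.shuffle(flags)
    return [make_task(rng, outlier=bool(f)) for f in flags]

rng = np.random.default_rng(0)
tasks = make_meta_dataset(rng, T_tr=256, outlier_frac=0.1)
```

Because the outlier tasks come from a different source, their task descriptors should fall far from the sinusoidal tasks under a suitable kernel, which is what lets TANML down-weight them.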
We observe that TANML outperforms both MAML and Meta-SGD in test prediction by a significant margin, even when the fraction of outlier tasks is 10% or 20%. This clearly supports our intuition that an explicit awareness or notion of similarity aids the learning, especially when the number of training tasks is limited. We also observe that, on average, TANML with the cosine kernel performs better than with the Gaussian kernel. We note that the performance of the approaches in Experiment 2a is better than that in Experiment 2b. This is because there is higher variation among the tasks in Experiment 2b (changing frequencies) than in Experiment 2a (changing amplitudes). We also observe that the performance improves slightly as $T_{tr}$ increases.

Experiment 3. In this experiment, we consider the regression task of time-series prediction on the ICU patient dataset from the Physionet 2012 Challenge. The dataset consists of measurements of various vital characteristics of different patients monitored over a period of 48 hours, logged at various non-uniform time instants. For a given vital characteristic, the time series of every individual patient forms a task: the time of measurement and the measured value are the input and output, respectively. The goal of the experiment is to predict the time series for new, unseen patients given the measurements taken during the first 24 hours. The experiment is done in five different settings corresponding to five different vital characteristics ($V_1$ to $V_5$). Since our interest is in evaluating the performance in the limited-task setting, we consider 100 tasks each for meta-training, meta-validation, and meta-testing. Further details of the implementation are given in the Appendix. The NMSE values obtained for the test tasks are shown in Table 3.
We observe that in all the settings TANML with the Gaussian kernel results in the least NMSE, though the performance of Meta-SGD coincides with that of TANML in some of them:

Algorithm        V1    V2    V3    V4    V5
MAML             0.56  0.50  0.12  0.13  1.2
Meta-SGD         0.03  0.18  0.04  0.03  0.63
TANML-Cosine     0.13  0.25  0.05  0.06  0.56
TANML-Gaussian   0.03  0.12  0.04  0.02  0.42

On comparing with the results from Experiment 2, we observe that while Meta-SGD shows a significant improvement over MAML, it exhibits severe instability in the presence of outliers. TANML, on the other hand, predicts well in both experiments.

Ablation Study An ablation study of some aspects of TANML is given in Section B of the Appendix. The study is made on the data from Experiment 3.

Few-shot learning We now consider the application of our approach to few-shot learning on the character classification data from the Omniglot dataset. We consider 5-way 1-shot learning: each task is a 5-class problem where each class is given two samples, one for training and the other for validation. The classes correspond to different written characters/symbols of a language, and the input is the image in gray-scale. Given one training image per class, the goal of each task is to build a 5-way classifier and evaluate its performance on another set of five images. Unlike the state-of-the-art approaches where the goal is to perform few-shot learning over a large number of training tasks (Li et al., 2017), we restrict our experiments to the limited-task setting. We consider two settings with T_tr = 50 and T_tr = 100 meta-training tasks, and evaluate the performance on a test set of 100 tasks. The cross-entropy loss is used as the objective function for training.
The test performance is measured in terms of accuracy, defined as the ratio of correctly classified samples to the total number of samples in each task, and is reported in Table 4. We observe that TANML with both the cosine and the Gaussian kernels outperforms MAML and Meta-SGD. The accuracy values are low due to the limited-task setting that we have considered. Nevertheless, the experiments demonstrate the potential of TANML to improve few-shot learning in extremely data-limited scenarios. We expect a similar trend even when the number of training tasks is large.
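The per-task accuracy defined above can be computed as below; this is a simple sketch with made-up logits, not the paper's evaluation code.

```python
import numpy as np

def task_accuracy(logits, labels):
    """Fraction of correctly classified samples in one N-way task.

    logits: (num_samples, num_classes) classifier scores
    labels: (num_samples,) integer class ids
    """
    preds = np.argmax(logits, axis=1)
    return float(np.mean(preds == np.asarray(labels)))

# 5-way task with 5 evaluation samples (one per class), illustrative values:
logits = np.eye(5) * 2.0          # each row peaks at its own class
labels = np.arange(5)
print(task_accuracy(logits, labels))   # 1.0
```

The reported figure would then be this quantity averaged over the 100 test tasks.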

5. CONCLUSION

We proposed a task-similarity aware meta-learning algorithm that explicitly quantifies and employs the similarity between tasks through nonparametric kernel regression. We showed how our approach draws a novel connection between meta-learning and reproducing kernel Hilbert spaces. Our hypothesis was that an explicit incorporation of task-similarity helps improve meta-learning performance in the task-limited setting with possible outlier tasks. Experiments with regression and classification tasks support our hypothesis, and our algorithm was shown to outperform popular meta-learning algorithms by a significant margin. The aim of the current contribution was to investigate how task-similarity could be meaningfully employed and used to advantage in meta-learning. To that end, we wish to reiterate that the study is an ongoing one and the experiments considered in this paper are in no way exhaustive. An important next step for our approach is the use of online/sequential kernel regression techniques to run our algorithm in a sequential or batch-based manner and scale to scenarios with a large number of tasks. The nonparametric kernel regression framework also opens the door to a probabilistic or Bayesian treatment of meta-learning that we plan to pursue in the near future.

This in turn indicates that the regularization helps TANML avoid overfitting the training data, which is perhaps why TANML exhibits superior performance in comparison with the other two methods: as we have seen in Section A of the Appendix, MAML and Meta-SGD are obtained when TANML uses a linear kernel and zero regularization.

C NUMERICAL EXPERIMENTS

We compare four different approaches: MAML, Meta-SGD, TANML-Cosine, and TANML-Gaussian. All the algorithms were trained for 60000 meta-iterations, where each outer update uses the entire set of training tasks rather than a stochastic subset. All the experiments were performed on an NVIDIA Tesla K80 GPU on the Microsoft Azure platform.
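The full-batch outer update described above (averaging over all training tasks rather than sampling a stochastic subset) can be sketched generically. The actual TANML/MAML update rules are defined elsewhere in the paper; `meta_grad` here is a hypothetical placeholder, and the quadratic toy loss is ours.

```python
import numpy as np

def outer_update(theta, tasks, meta_grad, lr=5e-4):
    """One full-batch meta-iteration: average the meta-gradient over ALL
    training tasks (no stochastic task sampling), then take a gradient step."""
    g = np.mean([meta_grad(theta, task) for task in tasks], axis=0)
    return theta - lr * g

# Toy check with a per-task quadratic loss 0.5*||theta - t||^2 (gradient: theta - t);
# full-batch descent converges to the mean of the task optima.
tasks = [np.array([1.0, 1.0]), np.array([3.0, -1.0])]
grad = lambda theta, t: theta - t
theta = np.zeros(2)
for _ in range(20000):
    theta = outer_update(theta, tasks, grad)
```

With the toy loss, `theta` approaches the task mean [2, 0]; in the paper's setting `meta_grad` would instead be the meta-gradient of the chosen algorithm.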

D EXPERIMENT 1 HYPERPARAMETERS

In this experiment, we use a linear predictor of dimension 16, the same as the input dimension. Each task has K = 4 training input-output pairs and 4 test input-output pairs. The input x ∈ R^16 is drawn from N(0, I), and the additive noise e is white, drawn from N(0, 1).
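The Experiment 1 data model can be sketched as follows. The names are ours, and the per-task generation of the linear predictor's parameter is not specified in this excerpt, so the draw of `theta` below is a hypothetical placeholder.

```python
import numpy as np

def linear_task(rng, d=16, K=4):
    """One Experiment-1 style task: K training and K test pairs with
    x ~ N(0, I_16), y = theta^T x + e, and white noise e ~ N(0, 1).
    The per-task theta is a placeholder draw, not specified in the excerpt."""
    theta = rng.standard_normal(d)

    def draw(n):
        X = rng.standard_normal((n, d))          # x ~ N(0, I)
        y = X @ theta + rng.standard_normal(n)   # additive white Gaussian noise
        return X, y

    return draw(K), draw(K)

rng = np.random.default_rng(0)
(X_tr, y_tr), (X_te, y_te) = linear_task(rng)
```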

D.1 MAML

• Inner update learning rate α: 0.01
• Outer update learning rate: 5 × 10^-4
• Optimizer: Adam

D.2 META-SGD

• Inner update learning rate α: learnt, initialized with values randomly drawn from [0.001, 0.01]
• Outer update learning rate for θ0: 5 × 10^-4
• Outer update learning rate for α: 1 × 10^-6
• Optimizer: Adam

D.3 TANML-GAUSSIAN

• Outer update learning rate for θ0: 1 × 10^-3
• Outer update learning rate for Ψ: 5 × 10^-5
• µ = 0.1
• σ² = 0.5
• Optimizer: Adam

D.4 TANML-COSINE

• Outer update learning rate for θ0: 5 × 10^-4
• Outer update learning rate for Ψ: 1 × 10^-5
• µ = 0.1
• Optimizer: Adam

D.5 EXPERIMENT 2 HYPER-PARAMETERS

The hyper-parameters for the four approaches are listed below. The learning-rate parameters were chosen such that the training error converged without instability.
(1) Experiment 2a (fixed frequency, varying amplitude): the data for the different tasks is generated from sinusoids with A drawn randomly from (0, 1] and ω = 1; the outlier task data is generated as y(x) = Ax.
(2) Experiment 2b (fixed amplitude, varying frequency): the data for the different tasks is generated from sinusoids with ω drawn randomly from [1, 1.5] and A = 1; the outlier task data is generated as y(x) = ωx.



A valid kernel function is one for which the kernel matrix evaluated at any set of datapoints is symmetric and positive semidefinite; cf. (Bishop, 2006).
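This validity condition is easy to check numerically for a given kernel: build the Gram matrix on a set of random points and verify symmetry and positive semidefiniteness. A minimal sketch (the tolerance is ours):

```python
import numpy as np

def gaussian_kernel(zi, zj, sigma2=0.5):
    """Gaussian (RBF) kernel k(zi, zj) = exp(-||zi - zj||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((zi - zj) ** 2) / (2.0 * sigma2))

def is_valid_kernel_matrix(Z, kernel, tol=1e-10):
    """Check that the Gram matrix on the rows of Z is symmetric and PSD."""
    K = np.array([[kernel(zi, zj) for zj in Z] for zi in Z])
    symmetric = np.allclose(K, K.T)
    psd = np.min(np.linalg.eigvalsh(K)) >= -tol   # all eigenvalues nonnegative
    return bool(symmetric and psd)

Z = np.random.default_rng(0).standard_normal((10, 5))
print(is_valid_kernel_matrix(Z, gaussian_kernel))   # True
```

The same check passes for the linear (inner-product) kernel, whose Gram matrix Z Z^T is positive semidefinite by construction.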



Figure 1: Left: schematic of MAML. Right: schematic of TANML. Only the computation of θ1 is shown to keep the diagram uncluttered.

Table 1: NMSE on test tasks for regression experiment 1.

Table 2: NMSE on test tasks for regression experiment 2.

Table 3: NMSE on test tasks for the regression experiment on the Physionet 2012 dataset.

Table 4: Test accuracy (%) on the Omniglot dataset.

Table 5: Effect of µ for the regression experiment on the Physionet 2012 dataset, for Ex1, Ex2, Ex3.

A CONNECTION OF TANML TO META-SGD AND MAML

We note that when the kernel is the linear kernel, k(z_i(θ0), z_j(θ0)) = z_i(θ0)^T z_j(θ0), TANML contains the Generalized Meta-SGD (GMSGD) proposed in Section 2.2 as a special case. GMSGD in turn reduces to Meta-SGD when the linear regression matrix W is constrained to the form in equation 1 and the regularization µ in the outer loop is set to zero, and to MAML when the regression coefficients are fixed to W = [I, -αI], a fixed linear transform acting on the task descriptor. Thus, TANML reduces to Meta-SGD and MAML as special cases when the kernel is linear and µ = 0.

To see this, consider again the adaptation rule of TANML. With the linear kernel, the vector of kernel evaluations for task i is κ(z_i(θ0)) = Z(θ0)^T z_i(θ0), where Z(θ0) is the matrix of the task descriptors of all the training tasks arranged column-wise. Constraining the kernel regression matrix to Ψ = (Z(θ0))^† [I, -diag(α)]^T, where † denotes the pseudo-inverse, yields the adapted parameter

Ψ^T κ(z_i(θ0)) = [I, -diag(α)] Z(θ0) (Z(θ0))^† z_i(θ0) = [I, -diag(α)] z_i(θ0),

since Z(θ0)(Z(θ0))^† is the projection onto the column space of Z(θ0), of which z_i(θ0) is a column. This is exactly the Meta-SGD update; in other words, TANML with the kernel regression matrix constrained to this form and with no regularization (µ = 0) reduces to Meta-SGD. By the same argument, fixing Ψ = (Z(θ0))^† [I, -αI]^T recovers MAML.

This also gives an intuition for why TANML might perform better: MAML and Meta-SGD are obtained by restricting the set of possible Ψ to a constrained set, whereas TANML obtains Ψ through an unconstrained search over R^{T_tr × D}. Thus, MAML and Meta-SGD are obtained as special cases of TANML when the kernel is the linear (simple inner-product) kernel and the regression matrix Ψ takes these special forms.
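The reduction above can be verified numerically: with the linear kernel, constraining the kernel-regression matrix via the pseudo-inverse of the descriptor matrix reproduces the GMSGD linear map exactly on the training tasks. A sanity-check sketch under the column-wise descriptor convention used above; the sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T_tr, Dz, P = 12, 8, 4    # tasks, descriptor dim, parameter dim (hypothetical sizes)

Z = rng.standard_normal((Dz, T_tr))  # training-task descriptors z_i as columns of Z(theta_0)
W = rng.standard_normal((P, Dz))     # any GMSGD regression matrix, e.g. [I, -diag(alpha)]

# Constrained kernel-regression matrix: Psi = Z^dagger W^T  (shape T_tr x P)
Psi = np.linalg.pinv(Z) @ W.T

for i in range(T_tr):
    z_i = Z[:, i]
    kappa_i = Z.T @ z_i           # linear-kernel evaluations against all training tasks
    theta_i = Psi.T @ kappa_i     # TANML adaptation with the linear kernel
    # Z Z^dagger projects onto the column space of Z, and z_i is a column of Z,
    # so the adaptation coincides with the GMSGD linear map W z_i
    assert np.allclose(theta_i, W @ z_i)
print("TANML with the linear kernel reproduces the GMSGD update on all training tasks")
```

For a descriptor outside the column space of Z(θ0), the constrained map applies W to the projection of the descriptor onto that space, which is where the unconstrained Ψ of TANML gains its extra flexibility.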

B ABLATION STUDY

We consider the influence of two aspects on TANML. As in any regularized estimation or modelling approach, the motivation for including the regularization parameter µ is to avoid overfitting of Ψ, especially in the task-limited setting. Keeping all other parameters unchanged, we vary µ; the resulting test NMSE is shown in Table 5. We observe that both a very large and a very small µ degrade the test performance. This is because, unlike MAML and Meta-SGD, TANML estimates more parameters from the same training data and hence tends to quickly overfit the training data to almost zero error; a nonzero value of µ helps curtail this overfitting. However, setting µ to a large value biases the parameters such that very little learning takes place beyond the initial few meta-iterations, which also results in poor test performance. Thus, we see the best test prediction being achieved with µ set to 0.01.

For the test tasks, the training set consists of samples from the first half of the time series, and the goal is to predict the rest of the time series. For the training tasks, the training and validation sets are drawn randomly from the entire time series. We consider only those patients with at least four measurements; tasks with fewer than four datapoints are not considered, and the number of datapoints per task varies from 4 to 48. To reduce the dynamic range of the measurements, we divided all the true measurements by 150. Time was measured in minutes. The five vital characteristics considered are as given in (Silva et al., 2012). The following hyper-parameter settings were used for all five vital characteristics.

E.1 MAML

• Inner update learning rate α: 10^-3
• Outer update learning rate: 5 × 10^-3
• Total ANN layers: 4, with 2 hidden layers, 8 neurons per layer

E.2 META-SGD

• Outer update learning rate for θ0: 5 × 10^-3
• Outer update learning rate for α: 1 × 10^-5
• Total ANN layers: 4, with 2 hidden layers, 8 neurons per layer
• Non-linearity: ReLU
• Optimizer: Adam

E.3 TANML-GAUSSIAN

• Outer update learning rate for θ0: 1 × 10^-3
• Outer update learning rate for Ψ: 5 × 10^-5
• µ = 0.01
• Total ANN layers: 4, with 2 hidden layers
• Non-linearity: ReLU
• Optimizer: Adam

E.4 TANML-COSINE

• Outer update learning rate for θ0: 5 × 10^-4
• Outer update learning rate for Ψ: 1 × 10^-5
• µ = 0.01
• Outer update learning rate for α: 1 × 10^-5
• Total ANN layers: 4, with 2 hidden layers, 8 neurons per layer

