GRADIENT-BASED TRANSFER LEARNING

Abstract

We formulate transfer learning as a meta-learning problem by extending the current meta-learning paradigm such that support and query data are drawn from different, but related, distributions of tasks. Inspired by the success of Gradient-Based Meta-Learning, we extend it to the transfer learning setting by constructing a general encoder-decoder architecture that learns a map between functionals of different domains. This is achieved by leveraging the idea that the task-adapted parameters of a meta-learner can serve as an informative representation of the task itself. We demonstrate the proposed method on regression, prediction of dynamical systems, and meta-imitation learning problems.

1. INTRODUCTION

The ability to quickly adapt to unseen conditions is a necessary skill for any intelligent system. It provides the means to generalize outside of the training conditions as well as the capacity to extract unobservable features affecting the learner (Lake et al. (2017)). Adaptation to a new task involves two steps. The first is inferring the characterizing information of the task at hand. The second is regressing the function representing the task. The importance of this ability is reflected in the considerable volume of work conducted on the matter in recent years, e.g. Hospedales et al. (2021); Ben-David et al. (2006); Ljung (2010). The field of meta-learning provides the means to unify these two steps and learn them simultaneously in a fully data-driven fashion (Huisman et al. (2021)). The learning process comprises multiple datasets representing the different conditions, or tasks, the learner is concurrently exposed to. Adaptation is performed by extracting the relevant information about each task from a small set of data sampled from that task. In this paper we consider the case of transferring knowledge, using a small set of data, from one task to another, different, task. In this regard, we build upon the framework of few-shot learning (Wang et al. (2020)), which can be summarized as estimating an optimal learner for any task with the fewest data samples possible. Recent work has explored the case where the data used for adaptation and the downstream task's data are subject to a distributional shift in their domain, referred to as support-query-shift (Bennequin et al. (2021)). Here, we assume the more general formulation of meta-transfer, where the shift can take place in both the domain and the co-domain of the underlying function generating the data. This brings us beyond the problem of domain shift and into the more general notion of learning to transfer between a support task and a query task. The need for transfer emerges in a multitude of situations.
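The few-shot transfer setting described above can be sketched as episode construction in which the support and query sets are generated by different, but related, functions sharing a latent task parameter. The following minimal sketch is illustrative only: the function name `sample_episode` and the sine/affine task pair are assumptions standing in for, e.g., dynamics data and policy data, not the paper's actual benchmarks.

```python
import numpy as np

def sample_episode(rng, n_support=10, n_query=10):
    """One meta-transfer episode: support and query sets are generated by
    *different* functions that share the same latent task parameter `a`."""
    a = rng.uniform(0.5, 2.0)            # hidden parameter characterizing the task
    x_s = rng.uniform(-1.0, 1.0, n_support)
    x_q = rng.uniform(-1.0, 1.0, n_query)
    y_s = np.sin(a * x_s)                # support task: one function of a
    y_q = a * x_q + a                    # query task: a different, related function
    return (x_s, y_s), (x_q, y_q), a

rng = np.random.default_rng(0)
(x_s, y_s), (x_q, y_q), a = sample_episode(rng)
```

Unlike the standard few-shot setup, the learner here must infer `a` from the support set and then solve a query task whose generating function differs in both domain behavior and co-domain.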
Sequential decision-making problems are one of them. Real-world dynamical systems, for example, are often only partially observable. They require an initial exploration phase to gather the necessary information before a suitable policy can be estimated. In this case, we need a way to transfer the knowledge acquired about the dynamics of the system to the estimation of the target policy; that is, to transfer from a dynamics prediction model to the estimation of a policy in a control problem. Moreover, transfer learning can be used in situations where we have access to labeled data for a simple problem but would like to solve a more complex, but related, problem: for example, transfer from a single inverted pendulum to a double pendulum with the same dynamics, e.g. the same pole lengths, gravity, and friction coefficients. To this end, we present an approach to transfer learning through adaptation. Inspired by Gradient-Based Meta-Learning (GBML), we propose a method for meta-transfer learning in a general encoder-decoder model. This can be used independently of the shift between the support task and the query task and is agnostic to architectural changes between the meta-learner and the base-learner (see Figure 1). The main idea of this work is that the parameters of a learner that is optimal for a given task contain all the relevant information about that task (Tegnér et al. (2022)). The proposed model learns a map from the gradients used to adapt the parameters of a meta-learner to the parameters of a base learner.

Figure 1: Visual depiction of the proposed model. The representation of the task is the gradient, ∇_θL, of a meta-learner model, i.e. the green arrow on the blue loss surface on the left. This representation is then mapped through M_ϕ (light blue arrow) to the parameters, ψ, of the main network, which is optimal for the given task (minimizes the orange loss function on the right).
This is, in fact, a map between functions, which has proven to be effective in different contexts, e.g. Xu et al. (2020); Dupont et al. (2022). We argue that representing the task's parameters as the gradients of the meta-learner is more robust to noise and bias in the data. We empirically support this claim with a number of experiments on synthetic regression, dynamical system prediction, and meta-imitation learning. Our contributions are as follows:

• We extend the formulation of support-query shift to the problem of transfer learning.

• We describe a meta-transfer learning method that builds upon previous gradient-based methodologies.

• We provide an empirical evaluation of the advantages of gradient-based task representations on a variety of problems.
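The core mechanism, mapping a support-set gradient of a meta-learner to the parameters of a base learner for a different task, can be sketched numerically. This is a toy illustration under strong assumptions: the meta-learner is linear so its gradient has closed form, and the map M_ϕ (here `transfer_map` with matrix `Phi`) is set analytically rather than learned as in the paper; all names are hypothetical.

```python
import numpy as np

def support_gradient(theta, X_s, y_s):
    """Gradient of the squared loss of a linear meta-learner f_theta(x) = x @ theta
    on the support set; this gradient serves as the task representation."""
    return 2.0 * X_s.T @ (X_s @ theta - y_s) / len(y_s)

def transfer_map(grad, Phi, b):
    """Stand-in for M_phi: maps the gradient-based task representation to the
    parameters psi of a base learner solving the (different) query task."""
    return Phi @ grad + b

rng = np.random.default_rng(0)
a = 1.5                                   # latent parameter shared by both tasks
X_s = rng.normal(size=(16, 1))
y_s = a * X_s[:, 0]                       # support task: y = a * x

theta = np.zeros(1)                       # meta-learner initialization
g = support_gradient(theta, X_s, y_s)     # task representation (proportional to -a)

# With theta = 0, g = -2 * a * mean(x^2); dividing by c below recovers a, so a
# closed-form Phi exists for this toy case (in general M_phi is learned).
c = -2.0 * np.mean(X_s[:, 0] ** 2)
Phi = np.array([[1.0 / c], [1.0 / c]])
psi = transfer_map(g, Phi, np.zeros(2))   # base-learner parameters [slope, bias]

x_q = np.linspace(-1.0, 1.0, 5)
y_q_pred = psi[0] * x_q + psi[1]          # query task: y = a * x + a
```

The point of the sketch is the information flow: the base learner never sees query-task labels; everything it needs about the task is carried by the meta-learner's gradient.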

2. RELATED WORK

In this section we review the relevant literature on adaptation methods. The idea of adapting the learning system has been widely studied in the past years (Naik & Mammone (1992); Bengio et al. (1990); Hochreiter et al. (2001)). Adaptation is performed using data-points that uniquely characterize the task. Different approaches can be included in this definition depending on the framework they abide by and the assumptions they make about the adaptation process.

Transfer Learning. It refers to the problem of learning algorithms that extract knowledge from one task to solve a second task (Weiss et al. (2016); Zhuang et al. (2020); Pan & Yang (2009)). These methods cannot be considered to perform general adaptation strategies; however, they require the identification of useful information for a task from a different distribution. They are generally limited to two tasks only, and most involve aligning the distributions of these two tasks. Wu & He (2022) propose the use of meta-learning for a transfer learning problem; their method is limited to matching the empirical distributions of dynamic source and target tasks.

Parameter Identification. A more general form of adaptation to dynamical systems can be identified in the early work on system identification (Åström & Eykhoff (1971)). More precisely, parameter identification refers to the estimation of unobservable parameters influencing the considered dynamical system from a sequence of observations. Most of these studies, however, consider only one of the two steps required for adaptation. In fact, they assume the law governing the process is known and estimate the conditioning parameters (Bhat et al. (2002); Yu et al. (2017)), impose a suitable inductive bias to guide the learning process (Sanchez-Gonzalez et al. (2018)), or use a hybrid approach to learn a residual of an imperfect but known system (Ajay et al. (2019)).
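The one-step nature of classical parameter identification can be made concrete with a small sketch: the governing law is assumed known (a linear decay) and only its unobservable coefficient is estimated from a trajectory by least squares. The system and the names `simulate` and `identify_k` are illustrative assumptions, not taken from the works cited above.

```python
import numpy as np

def simulate(k, x0=1.0, dt=0.1, steps=50):
    """Trajectory of the discretized linear system x_{t+1} = x_t - dt * k * x_t,
    where the decay coefficient k is treated as unobservable."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - dt * k * xs[-1])
    return np.array(xs)

def identify_k(xs, dt=0.1):
    """Least-squares estimate of k from a trajectory, assuming the governing
    law itself is known (only the inference step of adaptation is performed).
    From x_next - x_t = -dt * k * x_t:  k = -(x_next - x_t)·x_t / (dt * x_t·x_t)."""
    x_t, x_next = xs[:-1], xs[1:]
    return -np.dot(x_next - x_t, x_t) / (dt * np.dot(x_t, x_t))

trajectory = simulate(k=0.7)
k_hat = identify_k(trajectory)
```

Contrast this with the meta-transfer setting above, where neither the governing law nor the task parameter is assumed known, and both steps of adaptation must be learned from data.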

