TOWARDS EFFICIENT GRADIENT-BASED META-LEARNING IN HETEROGENEOUS ENVIRONMENTS

Anonymous authors
Paper under double-blind review

Abstract

Few-shot learning is a challenging problem for machine learning, as models trained with SGD traditionally require many training samples to converge. Since meta-learning models have strong fine-tuning capabilities for a distribution of tasks, many of them have been applied to few-shot learning; model-agnostic meta-learning (MAML) is among the most popular. Recent studies showed that MAML-trained models tend to reuse learned features and do not perform strong adaptation, especially in the earlier layers. This paper presents a detailed analysis of this phenomenon by examining MAML's components across different variants. Our results show an interesting relationship between the importance of fine-tuning earlier layers and the distribution shift between training and testing. From this, we identify a fundamental weakness of existing MAML variants when the task distribution is heterogeneous, e.g., when the number of classes or the domain does not match between testing and training. We propose a novel nonparametric version of MAML that overcomes these issues while still being able to perform cross-domain adaptation.

1. INTRODUCTION

Learning tasks from only a few observations is known as few-shot learning and is of major interest in the machine learning community (Finn et al., 2017; Vinyals et al., 2016; Snell et al., 2017; Cai & Shen, 2020; Tseng et al., 2020). Usually, a problem is solved by minimizing the empirical risk over many training samples in many iterations. Humans, however, learn new tasks very quickly by drawing on knowledge obtained earlier in their lives (Salakhutdinov et al., 2012). Meta-learning is motivated by how humans learn: the goal is learning how to learn, and it is a common approach to few-shot learning problems due to its ability to efficiently leverage information from many tasks. Model-Agnostic Meta-Learning (MAML) has been one of the most successful meta-learning algorithms for few-shot learning in recent years (Finn et al., 2017). In MAML, the network is meta-optimized for fast gradient-descent-based fine-tuning on an unseen task. Its formulation of the meta-learning objective inspired a plethora of research (Yoon et al., 2018; Li et al., 2017; Vuorio et al., 2019; Finn et al., 2018), to the extent that MAML exists both as a concrete meta-learning algorithm and as a paradigm that influences meta-learning methods to this day. Previous work has discussed whether MAML actually enables rapid fine-tuning or simply leverages its meta-representations effectively (called feature reuse). Raghu et al. (2020) found that freezing the earlier layers of a network during fine-tuning improves performance, meaning that fine-tuning the network body is not the major factor contributing to its few-shot capabilities, which indicates feature reuse. Oh et al. (2021) discovered that in the case of cross-domain adaptation, a change in the earlier layers is beneficial, and proposed to fix the network head instead to enforce earlier weight change, a method they call body only inner loop (BOIL).
However, as we will argue in Section 3, BOIL's fixed final layer is impractical when the number of classes differs across tasks, which is a considerable limitation in real-world scenarios. In this paper, we develop a novel technique called NP-MAML, which has a nonparametric head but remains trainable via gradients. Similar to BOIL, NP-MAML enforces changes in earlier layers to solve cross-domain tasks; in addition, it is flexible with respect to heterogeneous task distributions. We compare the performance and representation change of these approaches under different challenges: cross-domain adaptation and varying task dimensionality. We further analyze the different components of the network and characterize their respective roles with regard to fine-tuning and task adaptation.

2. META-LEARNING FOR FEW-SHOT LEARNING

In few-shot learning, a dataset consists of tasks drawn from a task distribution. Each task consists of a support set S of labeled samples and a query set Q whose samples have to be predicted (Vinyals et al., 2016). A typical support set contains K examples for each of N classes, which is why a problem is usually described as an N-way K-shot classification problem. Next to non-episodic approaches (Gidaris & Komodakis, 2018; Qi et al., 2018; Chen et al., 2019), many meta-learning methods have been applied to few-shot learning problems (Vinyals et al., 2016; Li et al., 2017; Yoon et al., 2018; Snell et al., 2017). They are particularly useful as they can learn a configuration from which it is easy to solve various tasks: such a configuration captures features and recurring patterns shared across the training tasks, which can be transferred to novel unseen tasks.
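The episodic N-way K-shot structure described above can be sketched as a small sampling routine. This is an illustrative assumption about the data layout (flat `features` and `labels` arrays), not part of the paper's setup:

```python
import numpy as np

def sample_episode(features, labels, n_way=5, k_shot=1, q_queries=15, rng=None):
    """Sample one N-way K-shot episode: a support set with k_shot labeled
    examples per class and a query set on which predictions are evaluated."""
    if rng is None:
        rng = np.random.default_rng()
    # Pick n_way distinct classes for this episode.
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support, query = [], []
    for episode_label, c in enumerate(classes):
        # Shuffle all indices of class c, then split into support and query.
        idx = rng.permutation(np.flatnonzero(labels == c))
        support += [(features[i], episode_label) for i in idx[:k_shot]]
        query += [(features[i], episode_label) for i in idx[k_shot:k_shot + q_queries]]
    return support, query
```

Note that class labels are re-indexed per episode (0 to N-1), since episodic evaluation only requires discriminating the N sampled classes, not the full label space.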

2.1. OPTIMIZATION-BASED META-LEARNING

Optimization-based meta-learning (Ravi & Larochelle, 2017) follows the idea of meta-learning a task-specific optimizer $U$ that transforms some initial set of parameters $\theta$ into task parameters $\phi$. Although $U$ can be chosen arbitrarily, it is typically modeled as an $m$-step, gradient-based update scheme, denoted $U^{(m)}(\theta)$. While methods like Meta-LSTMs model $U^{(m)}(\theta)$ explicitly via an LSTM (Ravi & Larochelle, 2017), which iteratively transforms $\theta$ given both the loss and the loss gradient, the popularity and success of MAML is due to its efficient and model-agnostic design: $U^{(m)}$ may take any differentiable form, and it is the initial parameters $\theta$ that are meta-optimized, leading to superior performance and more flexibility.
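A minimal sketch of such an $m$-step, gradient-based choice of $U^{(m)}(\theta)$, assuming a user-supplied gradient function of the support-set loss (the function and argument names here are illustrative, not from the paper):

```python
import numpy as np

def inner_update(theta, grad_fn, support, m=5, alpha=0.1):
    """One concrete choice of U^(m)(theta): take m plain gradient steps on
    the support-set loss, mapping initial parameters theta to task
    parameters phi. grad_fn(phi, support) returns the loss gradient."""
    phi = np.array(theta, dtype=float)  # do not modify theta in place
    for _ in range(m):
        phi = phi - alpha * grad_fn(phi, support)
    return phi
```

An LSTM-based optimizer would replace the fixed `phi - alpha * grad` rule with a learned update computed from the loss and its gradient; the gradient-descent form above is the one MAML assumes.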

2.2. MODEL-AGNOSTIC META-LEARNING

MAML (Finn et al., 2017) introduces a couple of new interpretations of the meta-learning problem that allow its efficient and flexible training. Firstly, task data is assumed to come from a task distribution $p(\tau)$, which in turn allows us to form an expectation over task performance. Secondly, let $\mathcal{L}_\tau(x; \theta^*)$ denote the loss of a model $f_\theta$ on task data $x = \{(x_1, y_1), \ldots, (x_T, y_T)\}$, parameterized by $\theta^*$. To stay consistent with the notation introduced hitherto, we denote by $S_\tau$ the support set and by $Q_\tau$ the query set of task $\tau$. Then, we can express the optimization objective of MAML as
$$\min_\theta \; \mathbb{E}_{\tau \sim p(\tau)} \left[ \mathcal{L}_\tau\big(Q_\tau; U^{(m)}(S_\tau; \theta)\big) \right],$$
where we write $U^{(m)}(S_\tau; \theta)$ to denote an $m$-step optimizer transforming the meta-parameters $\theta$ given the support set $S_\tau$. Intuitively, MAML improves on-task performance on average by optimizing the initial parameters of model $f_\theta$ and subsequently fine-tuning those parameters on the support set $S_\tau$, where on-task performance is measured by evaluating the fine-tuned model on the query set $Q_\tau$. As $p(\tau)$ is high-dimensional and typically unknown, computing the actual expectation integral is not feasible, which is why we define the meta-loss of MAML as
$$\mathcal{L}(\theta) = \frac{1}{|T|} \sum_{\tau \in T} \mathcal{L}_\tau\big(Q_\tau; U^{(m)}(S_\tau; \theta)\big), \qquad (2)$$
where $T$ is a batch of tasks sampled from $p(\tau)$ and the expectation is replaced with an empirical mean. This meta-loss is then optimized with standard gradient descent with step size $\beta$, i.e.,
$$\theta^{(t)} = \theta^{(t-1)} - \beta \, \nabla_{\theta^{(t-1)}} \mathcal{L}\big(\theta^{(t-1)}\big). \qquad (3)$$
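The meta-loss and its outer gradient-descent update can be sketched on a toy family of linear-regression tasks. Several simplifying assumptions apply: closed-form gradients replace autodiff, the tasks are synthetic, and the outer gradient is taken directly at the adapted parameters rather than differentiated through $U^{(m)}$, i.e., this is the common first-order approximation of MAML, not the full second-order algorithm:

```python
import numpy as np

def loss(theta, X, y):
    """Mean squared error of the linear model X @ theta against targets y."""
    r = X @ theta - y
    return float(r @ r) / len(y)

def grad(theta, X, y):
    """Closed-form gradient of the mean squared error w.r.t. theta."""
    return 2.0 * X.T @ (X @ theta - y) / len(y)

def adapt(theta, X_s, y_s, m=1, alpha=0.05):
    """U^(m)(S_tau; theta): m inner gradient steps on the support set."""
    phi = theta.copy()
    for _ in range(m):
        phi = phi - alpha * grad(phi, X_s, y_s)
    return phi

def maml_outer_step(theta, tasks, m=1, alpha=0.05, beta=0.01):
    """One outer update of the meta-loss over a task batch, averaging the
    query-set gradient at the adapted parameters (first-order MAML)."""
    g = np.zeros_like(theta)
    for X_s, y_s, X_q, y_q in tasks:
        phi = adapt(theta, X_s, y_s, m, alpha)
        g += grad(phi, X_q, y_q)
    return theta - beta * g / len(tasks)
```

Full MAML would additionally backpropagate through the inner update steps in `adapt`, picking up second-order terms; Finn et al. (2017) report that dropping them, as above, is often a close approximation in practice.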

