META GRADIENT BOOSTING NEURAL NETWORKS

Abstract

Meta-optimization is an effective approach that learns a shared set of parameters across tasks for parameter initialization in meta-learning. A key challenge for meta-optimization-based approaches is to determine whether an initialization condition can generalize to tasks with diverse distributions and thereby accelerate learning. To address this issue, we design a meta-gradient boosting framework that uses a base learner to learn shared information across tasks and a series of gradient-boosted modules to capture task-specific information for fitting diverse distributions. We evaluate the proposed model on both regression and classification tasks with multi-mode distributions. The results demonstrate both the effectiveness of our model in modulating task-specific meta-learned priors and its advantages on multi-mode distributions.

1. INTRODUCTION

While humans can learn quickly from a few samples by drawing on prior knowledge and experience, artificial intelligence algorithms struggle in such situations. Learning to learn, or meta-learning (Vilalta & Drissi, 2002), has emerged as the common practice to address this challenge by leveraging transferable knowledge learned from previous tasks to improve learning on new tasks (Hospedales et al., 2020). An important direction in meta-learning research is meta-optimization frameworks (Lee & Choi, 2018; Nichol & Schulman, 2018; Rusu et al., 2019), a.k.a. model-agnostic meta-learning (MAML) (Finn et al., 2017). Such frameworks learn initial model parameters from similar tasks and aim to achieve superior performance, through fast adaptation, on new tasks drawn from the same distribution. They offer excellent flexibility in model choice and demonstrate appealing performance in various domains, such as image classification (Li et al., 2017; Finn et al., 2017), language modeling (Vinyals et al., 2016), and reinforcement learning (Fernando et al., 2018; Jaderberg et al., 2019). Generally, such frameworks define a target model F_θ and a meta-learner M. The learning tasks T = {T_train, T_test} are divided into training and testing tasks, where T is generated from the meta-dataset D, i.e., T ∼ P(D). Each task contains a support set D^S and a query set D^Q for training and evaluating a local model. The initialization of the model parameter θ is learned by the meta-learner, i.e., θ ← M(T_train). We denote the meta-learned parameter by φ, so that θ ← φ. For each task, the model obtains a locally optimal parameter θ by minimizing the loss L(F_θ(D^S)). The meta parameter φ is then updated across all training tasks by minimizing the loss Σ_{T∈T_train} L(F_θ(D^Q)).
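The inner/outer optimization described above can be sketched as follows. This is a minimal, first-order (FOMAML-style) illustration on toy 1-D linear-regression tasks, not the method proposed in this paper; the task family, learning rates, and single inner-loop step are assumptions made purely for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_grad(theta, x, y):
    # MSE loss L = mean((theta*x - y)^2) and its gradient w.r.t. theta
    pred = theta * x
    grad = np.mean(2 * (pred - y) * x)
    loss = np.mean((pred - y) ** 2)
    return loss, grad

def sample_task():
    # A task T ~ P(D): 1-D regression y = a*x with a random slope a,
    # split into a support set D^S and a query set D^Q
    a = rng.uniform(0.5, 1.5)
    x_s, x_q = rng.uniform(-1, 1, 10), rng.uniform(-1, 1, 10)
    return (x_s, a * x_s), (x_q, a * x_q)

phi = 0.0                # meta-learned initialization (the shared parameter)
alpha, beta = 0.1, 0.05  # inner-loop / outer-loop learning rates

for step in range(2000):
    (xs, ys), (xq, yq) = sample_task()
    # inner loop: adapt theta from the shared initialization phi on D^S
    _, g = loss_grad(phi, xs, ys)
    theta = phi - alpha * g
    # outer loop (first-order approximation): update phi with the
    # query-set gradient evaluated at the adapted parameters
    _, gq = loss_grad(theta, xq, yq)
    phi -= beta * gq

# phi converges toward an initialization from which one gradient step
# adapts well to any task in the family (here, near the mean slope)
```

Full MAML differentiates through the inner-loop update (a second-order term); the first-order variant above drops that term, which is a common and much cheaper approximation.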
Generally, only a small number of epochs is needed to learn locally optimal parameters across training tasks, so the meta-learned parameter φ can quickly converge to an optimal parameter for new tasks. Most methods assume some transferable knowledge across all tasks and rely on a single shared meta parameter. However, the success of such meta-learners is limited to similar task families, and a single shared meta parameter cannot adequately support fast learning on diverse tasks (e.g., a large meta-dataset) or task distributions (e.g., T generated from multiple meta-datasets) due to conflicting gradients across those tasks (Hospedales et al., 2020). Recent efforts have studied multiple initial conditions to address these challenges. Some employ probabilistic models (Rusu et al., 2019; Finn et al., 2018; Yoon et al., 2018), while others incorporate task-specific information (Lee & Choi, 2018; Vuorio et al., 2019; Alet et al., 2018). The former learn to approximate the posterior of an unseen task yet need sufficient samples to obtain reliable data distributions; the latter conduct task-specific parameter initialization using multiple meta-learners, which requires expensive computation and cannot transfer knowledge across different modes of task distributions.

