

Abstract

Model-based reinforcement learning (MBRL) has been applied to meta-learning settings and has demonstrated high sample efficiency. However, in previous MBRL methods for meta-learning, policies are optimized via rollouts that fully rely on a predictive model of the environment. Performance in the real environment therefore tends to degrade when the predictive model is inaccurate. In this paper, we prove that this performance degradation can be suppressed by using branched meta-rollouts. On the basis of this theoretical analysis, we propose Meta-Model-based Meta-Policy Optimization (M3PO), in which branched meta-rollouts are used for policy optimization. We demonstrate that M3PO outperforms existing meta-RL methods on continuous-control benchmarks.

1. INTRODUCTION

Reinforcement learning (RL) methods have achieved remarkable success in many decision-making tasks, such as playing video games or controlling robots (e.g., Gu et al. (2017); Mnih et al. (2015)). In conventional RL methods, when multiple tasks are to be solved, a policy is learned independently for each task. In general, each such learning run requires millions of training samples from the environment. This independent learning with a large number of samples prevents conventional RL methods from being applied to practical multi-task problems (e.g., robotic manipulation problems involving grasping or moving different types of objects (Yu et al., 2019)). Meta-learning methods (Schmidhuber et al., 1996; Thrun & Pratt, 1998) have recently gained much attention as a promising solution to this problem (Finn et al., 2017). They learn a structure shared across tasks by using a large number of samples collected from a subset of the tasks. Once this structure is learned, these methods can adapt quickly to new (or the remaining) tasks with only a small number of samples. Meta-RL methods have previously been introduced into both model-free and model-based settings. For model-free settings, two main types of approaches have been proposed so far: recurrent-based policy adaptation (Duan et al., 2017; Mishra et al., 2018; Rakelly et al., 2019; Wang et al., 2016) and gradient-based policy adaptation (Al-Shedivat et al., 2018; Finn & Levine, 2018; Finn et al., 2017; Gupta et al., 2018; Rothfuss et al., 2019; Stadie et al., 2018). In these approaches, policies adapt to a new task by leveraging the history of past trajectories. Following previous work (Clavera et al., 2018), we refer to these adaptive policies as meta-policies in our paper. In these model-free meta-RL methods, in addition to learning control policies, the learning of policy adaptation is also required (Mendonca et al., 2019). Thus, these methods require more training samples than conventional RL methods.
For model-based settings, relatively few approaches have been proposed so far. Saemundsson et al. (2018) and Perez et al. (2020) use a predictive model (i.e., a transition model) conditioned on a latent variable for model predictive control. Nagabandi et al. (2019a;b) introduced both recurrent-based and gradient-based meta-learning methods into model-based RL. In these approaches, the predictive models adapt to a new task by leveraging the history of past trajectories. By analogy with the meta-policy, we refer to these adaptive predictive models as meta-models in our paper. Generally, these model-based meta-RL approaches are more sample efficient than the model-free approaches. However, in these approaches, the meta-policy (or the course of actions) is optimized via rollouts that rely fully on the meta-model. Thus, performance in a real environment tends to degrade when the meta-model is inaccurate. In this paper, we address this performance degradation problem in model-based meta-RL.
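To make the branched-rollout idea concrete, the following is a minimal, hypothetical Python sketch (not the authors' implementation): instead of rolling a learned model out from the initial state for a full episode, short rollouts of length `horizon` are branched from states sampled from real-environment experience, which limits how far model errors can compound. The names `branched_rollouts`, `real_buffer`, and the toy `model` and `policy` are illustrative assumptions, not from the paper.

```python
import random

def branched_rollouts(real_buffer, model, policy, num_branches, horizon):
    """Generate short model rollouts branched from real states.

    Each rollout starts from a state sampled from real experience and
    runs only `horizon` steps under the learned model, so synthetic
    data stays close to the real state distribution.
    """
    synthetic = []
    for _ in range(num_branches):
        state = random.choice(real_buffer)  # branch point from real data
        for _ in range(horizon):
            action = policy(state)
            next_state, reward = model(state, action)
            synthetic.append((state, action, reward, next_state))
            state = next_state  # continue the short model rollout
    return synthetic

# Toy 1-D example: the model adds the action to the state and rewards
# states near zero; the policy nudges the state toward zero.
model = lambda s, a: (s + a, -abs(s))
policy = lambda s: -0.1 * s
data = branched_rollouts([1.0, -2.0, 0.5], model, policy,
                         num_branches=4, horizon=3)
```

The synthetic transitions in `data` would then be mixed with real transitions when optimizing the meta-policy, rather than relying on long model-only rollouts.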

