PERIL: PROBABILISTIC EMBEDDINGS FOR HYBRID META-REINFORCEMENT AND IMITATION LEARNING

Abstract

Imitation learning is a natural way for a human to describe a task to an agent, and it can be combined with reinforcement learning to enable the agent to solve that task through exploration. However, traditional methods that combine imitation learning and reinforcement learning require a very large amount of interaction data to learn each new task, even when bootstrapping from a demonstration. One solution is meta-reinforcement learning (meta-RL), which enables an agent to quickly adapt to new tasks at test time. In this work, we introduce a new method that combines imitation learning with meta-RL: Probabilistic Embeddings for hybrid meta-Reinforcement and Imitation Learning (PERIL). Dual inference strategies allow PERIL to precondition exploration policies on demonstrations, which greatly improves adaptation rates on unseen tasks. In contrast to pure imitation learning, our approach is capable of exploring beyond the demonstration, making it robust to task alterations and uncertainties. By exploiting the flexibility of meta-RL, we show that PERIL can interpolate within previously learnt dynamics to adapt to unseen tasks, as well as to unseen task families, on a set of meta-RL benchmarks under sparse rewards.

1. INTRODUCTION

Reinforcement Learning (RL) and Imitation Learning (IL) are two popular approaches for teaching an agent, such as a robot, a new task. However, in their standard forms, both require a very large amount of data: exploration in the case of RL, and demonstrations in the case of IL. In recent years, meta-RL and meta-IL have emerged as promising solutions, leveraging a meta-training dataset of tasks to learn representations which can quickly adapt to new tasks at test time. However, both methods have their own limitations. Meta-RL typically requires hand-crafted, shaped reward functions to describe each new task, which is tedious and impractical for non-experts. A more natural way to describe a task is to provide demonstrations, as with meta-IL. But after adaptation, these methods cannot continue to improve the policy in the way that RL methods can, and they are restricted by the similarity between the new task and the meta-training dataset. A third limitation, shared by both methods, is that defining a low-dimensional representation of the environment for efficient learning (as opposed to learning directly from high-dimensional images) requires hand-crafting this representation. As with rewards, this is impractical for non-experts, but more importantly, it prevents generalisation across different task families, since each task family would require its own unique representation and dimensionality. In this work, we propose a new method, PERIL, which addresses all three of these limitations in a hybrid framework that combines the merits of both RL and IL. Our method allows tasks to be defined using demonstrations only, as with IL, but upon adaptation to the demonstrations it also allows for continual policy improvement through further exploration of the task.
Furthermore, we define the state representation using only the agent's internal sensors, such as the position encoders of a robot arm, and through interaction we implicitly recover the state of the external environment, such as the poses of the objects the robot is interacting with. Overall, this framework allows new tasks to be learned without requiring any expert knowledge from the human teacher.

Figure 1: Overview of our proposed method. We obtain a set of demonstrations of an unseen task and adapt to it through efficient demonstration-conditioned exploration.

PERIL operates by implicitly representing a task with a latent vector, which is inferred for a new task through two means. First, during meta-training we encourage high mutual information between the demonstration data and the latent space; at test time, this forms a prior over the latent space and represents the agent's task belief from the demonstrations alone. Second, we allow further exploration to continually update this belief, by conditioning the encoder on the robot's states and actions collected while exploring the new task. We model the latent space via an encoder, from which posterior sampling encodes the agent's current task belief. In essence, the encoder aims to learn an embedding which can simultaneously (i) infer the task intent and (ii) support a policy which can solve the inferred task. During meta-training, PERIL is optimised end-to-end, simultaneously learning both the policy and the embedding function on which the policy is conditioned. In our experiments, we find that PERIL achieves exceptional adaptation rates and is capable of exploiting demonstrations to efficiently explore unseen tasks. Through structured exploration, PERIL outperforms other meta-RL and meta-IL baselines and is capable of zero-shot learning.
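The dual inference described above can be illustrated with a minimal NumPy sketch. Here the demonstration encoder and exploration encoder are hypothetical stand-ins for PERIL's learned networks, and the Gaussian belief forms and variances are assumptions for illustration only: the demonstrations yield a broad prior over the latent task variable z, which exploration data then sharpens via a precision-weighted Bayesian update.

```python
import numpy as np

def gaussian_product(mu1, var1, mu2, var2):
    """Combine two independent Gaussian beliefs over the latent task
    variable z (precision-weighted product, i.e. a Bayesian update)."""
    var = 1.0 / (1.0 / var1 + 1.0 / var2)
    mu = var * (mu1 / var1 + mu2 / var2)
    return mu, var

def encode_demo(demo, z_dim=2):
    """Hypothetical demonstration encoder: maps a demo trajectory to the
    parameters of a Gaussian prior over z (a learned network in PERIL)."""
    return demo.mean(axis=0)[:z_dim], np.full(z_dim, 1.0)   # broad prior

def encode_transitions(transitions, z_dim=2):
    """Hypothetical exploration encoder: maps collected transitions to a
    Gaussian evidence term over z."""
    return transitions.mean(axis=0)[:z_dim], np.full(z_dim, 0.25)

rng = np.random.default_rng(0)
demo = rng.normal(size=(10, 4))       # one demonstration of the unseen task
explore = rng.normal(size=(50, 4))    # transitions gathered while exploring

mu_p, var_p = encode_demo(demo)                   # demo-conditioned prior
mu_e, var_e = encode_transitions(explore)         # exploration evidence
mu_post, var_post = gaussian_product(mu_p, var_p, mu_e, var_e)

z = rng.normal(mu_post, np.sqrt(var_post))        # posterior sample fed to the policy
```

Note how the posterior variance is strictly smaller than the prior variance: exploration can only tighten the task belief that the demonstrations initialise, which is what lets demonstration-conditioned exploration adapt faster than exploring from scratch.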
We show how our method is capable of multi-family meta-learning as well as out-of-family meta-learning by clustering distinct meta-trained latent-space representations. As an extension, we also show how privileged information can be used during training to create an auxiliary loss for the embedding function, which helps to form a stronger relationship between the latent space and the true underlying state which defines the task. Supplementary videos are available at our anonymous webpage https://sites.google.com/view/peril-iclr-2021

2. RELATED WORK

Meta-RL was first conceptualised as an RNN-based task. Developed by Wang (2016) and Duan et al., these methods feed a history of transitions into an RNN so that the policy can internalise the task dynamics. On another line, Finn et al. (2017) developed a learning-to-learn strategy, model-agnostic meta-learning (MAML), which meta-learns an initialisation of the policy network's parameters that is fine-tuned during meta-testing. Although MAML showed promising results in simple goal-finding tasks, MAML-based methods fail to produce efficient stochastic exploration policies and to adapt to complex tasks (Gupta et al., 2018). Meta-learning robust exploration strategies is key to improving sample efficiency and allowing fast adaptation at test time. In light of this, context-based RL was developed with the aim of reducing the uncertainty of newly explored tasks. These methods map transitions τ collected from an unseen task into a latent space z via an encoder q_φ(z|τ), such that the conditioned policy π_θ(a|s, z) can efficiently solve said task (Rakelly et al.; Wang & Zhou, 2020). An underlying benefit of decoupling task encoding from the policy is that it disentangles task inference from reward maximisation, whereas gradient-based and RNN-based meta-RL policies do this internally. An important remark regarding the training conditions of the discussed meta-RL methods is that they are typically meta-trained using dense reward functions (Rakelly et al.; Wang, 2016; Wang & Zhou, 2020), which provide information-rich contexts of the unseen task. Considering that the ultimate goal is to allow agents to solve new tasks in the real world, adaptation at test time must be robust to sparse reward feedback (Schoettler et al., 2020).

In the context of RL, incorporating demonstrations has proven successful in aiding exploration strategies, stabilising learning and increasing sample efficiency (Vecerik et al., 2017). However, learning expressive policies from a set of demonstrations requires a vast number of expert trajectories (Ross et al.), particularly in high-dimensional state-action spaces (Rajeswaran et al.). In contrast, meta-IL can be implemented in meta-RL by conditioning the agent on expert trajectories. Alternatively, Zhou et al. propose a MAML-based meta-IL approach which averages the objective across demonstrations and learns to adapt at test time from binary rewards. A similar strategy was also developed by Mendonca et al. The caveats of these approaches remain those of traditional IL: (i) imitating expert trajectories prevents the policy from doing better than the demonstrations; (ii) cloning behaviours reduces flexibility and generalisation capacity.
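The decoupling of task inference from control in context-based meta-RL can be sketched as follows. The encoder and policy below are toy linear stand-ins (not the architectures used by any of the cited methods): the encoder pools a batch of transitions into a latent task descriptor z, and the policy acts on the current state concatenated with z, so the two roles stay cleanly separated.

```python
import numpy as np

rng = np.random.default_rng(1)

class ContextEncoder:
    """Toy stand-in for q_phi(z | tau): pools a batch of transitions
    into a single latent task embedding (a trained network in practice)."""
    def __init__(self, obs_dim, z_dim):
        self.W = rng.normal(scale=0.1, size=(obs_dim, z_dim))

    def __call__(self, transitions):
        # Permutation-invariant pooling over the transition batch.
        return np.tanh(transitions @ self.W).mean(axis=0)

class ConditionedPolicy:
    """Toy stand-in for pi_theta(a | s, z): the policy sees the state
    concatenated with the inferred task embedding."""
    def __init__(self, obs_dim, z_dim, act_dim):
        self.W = rng.normal(scale=0.1, size=(obs_dim + z_dim, act_dim))

    def __call__(self, s, z):
        return np.tanh(np.concatenate([s, z]) @ self.W)

obs_dim, z_dim, act_dim = 4, 3, 2
encoder = ContextEncoder(obs_dim, z_dim)
policy = ConditionedPolicy(obs_dim, z_dim, act_dim)

tau = rng.normal(size=(20, obs_dim))   # transitions collected on the new task
z = encoder(tau)                       # task inference ...
a = policy(tau[-1], z)                 # ... decoupled from action selection
```

Because z is computed separately from action selection, improving task inference (e.g. with denser contexts or demonstration priors) requires no change to the policy interface, which is the disentanglement benefit noted above.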

