PERIL: PROBABILISTIC EMBEDDINGS FOR HYBRID META-REINFORCEMENT AND IMITATION LEARNING

Abstract

Imitation learning is a natural way for a human to describe a task to an agent, and it can be combined with reinforcement learning to enable the agent to solve that task through exploration. However, traditional methods that combine imitation learning and reinforcement learning require a very large amount of interaction data to learn each new task, even when bootstrapping from a demonstration. One solution to this is to use meta-reinforcement learning (meta-RL) to enable an agent to quickly adapt to new tasks at test time. In this work, we introduce a new method that combines imitation learning with meta-reinforcement learning: Probabilistic Embeddings for hybrid meta-Reinforcement and Imitation Learning (PERIL). Dual inference strategies allow PERIL to precondition exploration policies on demonstrations, which greatly improves adaptation rates on unseen tasks. In contrast to pure imitation learning, our approach is capable of exploring beyond the demonstration, making it robust to task alterations and uncertainties. By exploiting the flexibility of meta-RL, we show how PERIL is capable of interpolating within previously learnt dynamics to adapt to unseen tasks, as well as unseen task families, across a set of meta-RL benchmarks with sparse rewards.

1. INTRODUCTION

Reinforcement Learning (RL) and Imitation Learning (IL) are two popular approaches for teaching an agent, such as a robot, a new task. However, in their standard forms, both require a very large amount of data to learn: exploration in the case of RL, and demonstrations in the case of IL. In recent years, meta-RL and meta-IL have emerged as promising solutions to this, by leveraging a meta-training dataset of tasks to learn representations that can be quickly adapted to new tasks. However, both of these methods have their own limitations. Meta-RL typically requires hand-crafted, shaped reward functions to describe each new task, which is tedious and not practical for non-experts. A more natural way to describe a task is to provide demonstrations, as with meta-IL. But after adaptation, these methods cannot continue to improve the policy in the way that RL methods can, and they are restricted by the similarity between the new task and the meta-training dataset. A third limitation, from which both methods can suffer, is that defining a low-dimensional representation of the environment for efficient learning (as opposed to learning directly from high-dimensional images) requires hand-crafting this representation. As with rewards, this is not practical for non-experts, but more importantly, it does not allow generalisation across different task families, since each task family would require its own unique representation and dimensionality.

In this work, we propose a new method, PERIL, which addresses all three of these limitations in a hybrid framework that combines the merits of both RL and IL. Our method allows for tasks to be defined using demonstrations alone, as with IL, but upon adaptation to the demonstrations, it also allows for continual policy improvement through further exploration of the task.
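The adapt-then-improve structure described above can be sketched with toy stand-ins. This is a minimal illustrative sketch only: the encoder, the linear policy, and the hill-climbing stand-in for RL are all hypothetical assumptions for exposition, not PERIL's actual architecture or training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_demo(demo):
    """Toy demonstration encoder (hypothetical): mean-pool the demo
    transitions into a fixed-size latent task embedding z."""
    return demo.mean(axis=0)

def policy(state, z, theta):
    """Toy linear policy conditioned on the task embedding z."""
    return theta @ np.concatenate([state, z])

def rl_improve(theta, z, episodes=50):
    """Toy stand-in for the RL phase: hill-climb on policy parameters,
    keeping a perturbation whenever the (sparse) return improves."""
    def ret(p):
        s = np.ones(2)
        a = policy(s, z, p)
        return -abs(a - 1.0)  # toy reward: peaks when action == 1.0
    best = ret(theta)
    for _ in range(episodes):
        cand = theta + 0.1 * rng.standard_normal(theta.shape)
        r = ret(cand)
        if r > best:
            theta, best = cand, r
    return theta, best

demo = rng.standard_normal((10, 3))      # a demonstration: 10 steps, 3-dim features
z = encode_demo(demo)                    # IL-style adaptation: infer the task from the demo
theta0 = np.zeros(2 + z.shape[0])
theta, final_ret = rl_improve(theta0, z) # RL-style phase: keep improving via exploration
```

The point of the sketch is only the two-phase structure: the demonstration fixes a task representation cheaply, after which exploration can improve the policy beyond what the demonstration alone supports.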
Furthermore, we define the state representation using only the agent's internal sensors, such as the position encoders of a robot arm, and through interaction we implicitly recover the state of the external environment, such as the poses of the objects with which the robot is interacting. Overall, this framework allows new tasks to be learnt without requiring any expert knowledge on the part of the human teacher. PERIL operates by implicitly representing each task as a latent vector, which is inferred during learning of a new task through two means. First, during meta-training we encourage high mutual information between the demonstration data and the latent space, which during testing forms a prior for

