OFFLINE META LEARNING OF EXPLORATION

Abstract

Consider the following problem: given the complete training histories of N conventional RL agents, trained on N different tasks, design a meta-agent that can quickly maximize reward in a new, unseen task from the same task distribution. In particular, while each conventional RL agent explored and exploited only its own task, the meta-agent must identify regularities in the data that lead to effective exploration/exploitation in the unseen task. This meta-learning problem is an instance of a setting we term Offline Meta Reinforcement Learning (OMRL). To solve this challenge, we take a Bayesian RL (BRL) view, and seek to learn a Bayes-optimal policy from the offline data. We extend the recently proposed VariBAD BRL algorithm to the off-policy setting, and demonstrate learning of approximately Bayes-optimal exploration strategies from offline data using deep neural networks. For the particular problem described above, our method learns effective exploration behavior that is qualitatively different from the exploration used by any RL agent in the data. Furthermore, when applied to the online meta-RL setting (where the agent simultaneously collects data and improves its meta-RL policy), our method is significantly more sample-efficient than the state-of-the-art VariBAD.

1. INTRODUCTION

A central question in reinforcement learning (RL) is how to learn quickly (i.e., with few samples) in a new environment. Meta-RL addresses this question by assuming a distribution over possible environments, and access to a large set of environments from this distribution during training (Duan et al., 2016; Finn et al., 2017). Intuitively, the meta-RL agent can learn regularities in the environments that allow quick learning in any environment sharing a similar structure. Indeed, recent work demonstrated this by training memory-based controllers that can 'identify' the domain (Duan et al., 2016; Rakelly et al., 2019; Humplik et al., 2019), or by learning a parameter initialization that leads to good performance after only a few gradient updates (Finn et al., 2017).

Another approach to quick RL is Bayesian RL (BRL; Ghavamzadeh et al., 2016). In BRL, the environment parameters are treated as unobserved variables with a known prior distribution. Consequently, the standard objective of maximizing expected returns, taken with respect to the posterior distribution, explicitly accounts for the environment uncertainty (see the objective sketched at the end of this section), and its solution is a Bayes-optimal policy, wherein actions optimally balance exploration and exploitation. Recently, Zintgraf et al. (2020) showed that meta-RL is in fact an instance of BRL, where the meta-RL environment distribution is simply the BRL prior. Furthermore, a Bayes-optimal policy can be trained using standard policy gradient methods, simply by augmenting the state with the posterior belief over the environment parameters (a minimal illustration of such belief augmentation also appears below). The VariBAD algorithm (Zintgraf et al., 2020) is an implementation of this approach that uses a variational autoencoder (VAE) for parameter estimation and deep neural network policies.

Most meta-RL studies, including VariBAD, have focused on the online setting, where, during training, the meta-RL policy is continually updated using data collected by running it in the training environments. In domains where data collection is expensive, such as robotics and healthcare, online training is a limiting factor. For standard RL, offline (a.k.a. batch) RL mitigates this problem by learning from data collected beforehand by an arbitrary policy (Ernst et al., 2005; Levine et al., 2020). In this work we investigate the offline approach to meta-RL (OMRL). It is well known that any offline RL approach is heavily influenced by the data collection policy. To ground our investigation, we focus on the following practical setting: we assume that data has been collected by running standard RL agents on a set of environments from the environment distribution.
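To make the Bayes-optimal objective concrete, the following is a minimal formalization in standard Bayes-adaptive MDP notation; the symbols M (the unknown environment), b_t (the posterior belief), and H (the horizon) are introduced here purely for illustration, and later sections may use different notation:

J(\pi) \;=\; \mathbb{E}_{M \sim P(M)}\, \mathbb{E}_{\tau \sim (\pi, M)} \Big[ \sum_{t=0}^{H-1} \gamma^{t} r_{t} \Big], \qquad b_{t} \;=\; P\big(M \,\big|\, s_{0}, a_{0}, r_{0}, \ldots, s_{t}\big),

where the policy acts on the belief-augmented state (s_t, b_t). Because the expectation is taken over the unknown environment M, a policy maximizing J must trade off actions that reduce uncertainty in b_t (exploration) against actions that exploit the current belief.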
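The belief-augmentation idea also lends itself to a compact implementation. Below is a minimal, self-contained sketch of a belief-conditioned policy; it is ours, for illustration only, and omits how the belief is computed (VariBAD learns it with a VAE over a latent task variable). The class name BeliefAugmentedPolicy and all dimensions are hypothetical.

# Minimal sketch of a belief-augmented policy (illustrative; not the
# authors' implementation). The unknown task parameters are summarized
# by a belief vector, e.g., the mean and variance of a learned latent
# task posterior, which is simply concatenated to the environment state.
import torch
import torch.nn as nn


class BeliefAugmentedPolicy(nn.Module):
    """Policy pi(a | s, b): maps (state, belief) to action logits."""

    def __init__(self, state_dim, belief_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + belief_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state, belief):
        # Conditioning on the belief lets a standard policy-gradient learner
        # represent Bayes-optimal behavior: exploration becomes implicit,
        # since actions may depend on the current task uncertainty.
        return self.net(torch.cat([state, belief], dim=-1))


# Usage with hypothetical dimensions: the belief could be the
# (mean, log-variance) of a VAE's latent task posterior.
policy = BeliefAugmentedPolicy(state_dim=4, belief_dim=10, n_actions=2)
logits = policy(torch.randn(1, 4), torch.randn(1, 10))
action = torch.distributions.Categorical(logits=logits).sample()

In VariBAD, the belief vector fed to such a policy is updated online from the agent's own history within the episode, which is what allows a fixed policy network to exhibit task-adaptive, exploratory behavior.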

