OFFLINE META LEARNING OF EXPLORATION

Abstract

Consider the following problem: given the complete training histories of N conventional RL agents, trained on N different tasks, design a meta-agent that can quickly maximize reward in a new, unseen task from the same task distribution. In particular, while each conventional RL agent explored and exploited its own different task, the meta-agent must identify regularities in the data that lead to effective exploration/exploitation in the unseen task. This meta-learning problem is an instance of a setting we term Offline Meta Reinforcement Learning (OMRL). To solve this challenge, we take a Bayesian RL (BRL) view, and seek to learn a Bayes-optimal policy from the offline data. We extend the recently proposed VariBAD BRL algorithm to the off-policy setting, and demonstrate learning of approximately Bayes-optimal exploration strategies from offline data using deep neural networks. For the particular problem described above, our method learns effective exploration behavior that is qualitatively different from the exploration used by any RL agent in the data. Furthermore, we find that when applied to the online meta-RL setting (where the agent simultaneously collects data and improves its meta-RL policy), our method is significantly more sample efficient than the state-of-the-art VariBAD.

1. INTRODUCTION

A central question in reinforcement learning (RL) is how to learn quickly (i.e., with few samples) in a new environment. Meta-RL addresses this issue by assuming a distribution over possible environments, and having access to a large set of environments from this distribution during training (Duan et al., 2016; Finn et al., 2017). Intuitively, the meta-RL agent can learn regularities in the environments, which allow quick learning in any environment that shares a similar structure. Indeed, recent work demonstrated this by training memory-based controllers that can 'identify' the domain (Duan et al., 2016; Rakelly et al., 2019; Humplik et al., 2019), or by learning a parameter initialization that can lead to good performance with only a few gradient updates (Finn et al., 2017).

Another approach to quick RL is Bayesian RL (BRL; Ghavamzadeh et al., 2016). In BRL, the environment parameters are treated as unobserved variables, with a known prior distribution. Consequently, the standard problem of maximizing expected returns (taken with respect to the posterior distribution) explicitly accounts for the environment uncertainty, and its solution is a Bayes-optimal policy, in which actions optimally balance exploration and exploitation. Recently, Zintgraf et al. (2020) showed that meta-RL is in fact an instance of BRL, where the meta-RL environment distribution is simply the BRL prior. Furthermore, a Bayes-optimal policy can be trained using standard policy gradient methods, simply by adding to the state the posterior belief over the environment parameters. The VariBAD algorithm (Zintgraf et al., 2020) is an implementation of this approach that uses a variational autoencoder (VAE) for parameter estimation and deep neural network policies. Most meta-RL studies, including VariBAD, have focused on the online setting, where, during training, the meta-RL policy is continually updated using data collected by running it in the training environments.
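To make the belief-augmentation idea concrete, the following minimal sketch forms a policy input by concatenating the raw environment state with the parameters of a Gaussian posterior over a latent task variable. All names and dimensions here are illustrative assumptions, not the VariBAD implementation.

```python
import numpy as np

def augment_state(state, belief_mean, belief_logvar):
    """Concatenate the environment state with the parameters of the
    (Gaussian) posterior belief over the latent task variable, yielding
    the input to a belief-conditioned policy."""
    return np.concatenate([state, belief_mean, belief_logvar])

state = np.array([0.5, -0.2])           # environment observation
belief_mean = np.array([0.1, 0.0])      # posterior mean over latent task
belief_logvar = np.array([-1.0, -1.2])  # posterior log-variance

# Policy input dimension = |S| + 2 * latent dimension.
aug = augment_state(state, belief_mean, belief_logvar)
```

Because the belief is part of the policy input, the same network weights can express different behavior under high versus low task uncertainty, which is what allows exploration and exploitation to be balanced.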
In domains where data collection is expensive, such as robotics and healthcare, online training is a limiting factor. For standard RL, offline (a.k.a. batch) RL mitigates this problem by learning from data collected beforehand by an arbitrary policy (Ernst et al., 2005; Levine et al., 2020). In this work we investigate the offline approach to meta-RL (OMRL). It is well known that any offline RL approach is heavily influenced by the data collection policy. To ground our investigation, we focus on the following practical setting: we assume that data has been collected by running standard RL agents on a set of environments from the environment distribution. Importantly, we do not allow any modification to the RL algorithms used for data collection, and the meta-RL learner must make use of data that was not specifically collected for the meta-RL task. Nevertheless, we hypothesize that regularities between the training domains can still be learned, to provide faster learning in new environments. Figure 1 illustrates our problem: in this navigation task, each RL agent in the data learned to find its own goal, and converged to a behavior that quickly navigates toward it. The meta-RL agent, on the other hand, needs to learn a completely different behavior that effectively searches for the unknown goal position.

Our key idea for solving OMRL is an off-policy variant of the VariBAD algorithm, obtained by replacing the on-policy policy gradient optimization in VariBAD with an off-policy Q-learning based method. This, however, requires some care, as Q-learning applies to states of fully observed systems. We show that the VariBAD approach of augmenting states with the belief applies to the off-policy setting as well, leading to an effective algorithm we term Off-Policy VariBAD.
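As a hedged illustration of the off-policy component, the sketch below applies a standard Q-learning update to transitions indexed by belief-augmented states; a tabular Q-function stands in for the neural networks used in practice, and all quantities are illustrative, not our implementation.

```python
import numpy as np

gamma, alpha = 0.99, 0.1
n_aug_states, n_actions = 4, 2   # "states" here index (state, belief) pairs
Q = np.zeros((n_aug_states, n_actions))

def q_update(Q, s, a, r, s_next, done):
    """Standard off-policy TD(0) update on belief-augmented indices."""
    target = r + (0.0 if done else gamma * Q[s_next].max())
    Q[s, a] += alpha * (target - Q[s, a])

# Replay a batch of offline transitions (s, a, r, s', done). The data may
# come from any behavior policy, which is what makes the update off-policy
# and therefore applicable to pre-collected histories.
batch = [(0, 1, 1.0, 2, False), (2, 0, 0.0, 3, True)]
for s, a, r, s_next, done in batch:
    q_update(Q, s, a, r, s_next, done)
```

The point of the augmentation is that the (state, belief) pair is a sufficient statistic of the history, so the Markov assumption behind this update holds even though the task itself is unobserved.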
The offline setting, however, brings about another challenge: when the agent visits different parts of the state space in different environments, it becomes difficult to obtain an accurate belief estimate, a problem we term MDP ambiguity. When the ambiguity is due to differences in the reward between the environments, we propose a simple solution, based on a reward relabelling trick that significantly improves the performance of the VariBAD VAE trained on offline data. Our experimental results show that our method can learn effective exploration policies from offline data on both discrete and continuous control problems.

Our main contributions are as follows: (1) to our knowledge, this is the first study of meta-learning exploration in the offline setting; (2) we provide the necessary theory to extend VariBAD to off-policy RL; (3) we show that a key difficulty in OMRL is MDP ambiguity, and propose an effective solution for the case where tasks differ in their rewards; (4) we present non-trivial empirical results that demonstrate significantly better exploration than meta-RL methods based on Thompson sampling, such as PEARL (Rakelly et al., 2019), even when these methods are allowed to train online; and (5) our method can also be applied in the online setting, where it demonstrates significantly improved sample efficiency compared to conventional VariBAD.
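The reward relabelling idea can be sketched as follows, under the assumption that each training task's reward function is known (here a hypothetical sparse goal-reaching reward; the function names and `goal` parameter are illustrative): transitions collected in one task are assigned the rewards another task would have emitted, so the belief model sees each task's reward on state-action coverage from all tasks.

```python
import numpy as np

def sparse_reward(state, goal, radius=0.2):
    """Illustrative sparse reward: 1.0 inside a ball around the goal."""
    return 1.0 if np.linalg.norm(state - goal) < radius else 0.0

def relabel(transitions, goal):
    """Replace each stored reward with the reward the *target* task
    would have emitted for the same next state, keeping (s, a, s')."""
    return [(s, a, sparse_reward(s_next, goal), s_next)
            for (s, a, r, s_next) in transitions]

# Transitions collected by an agent solving a *different* task:
data = [(np.zeros(2), 0, 0.0, np.array([1.0, 0.0])),
        (np.zeros(2), 1, 0.0, np.array([0.0, 1.0]))]
relabelled = relabel(data, goal=np.array([1.0, 0.0]))
```

This directly targets reward-induced MDP ambiguity: after relabelling, two tasks that differ only in reward are distinguished by the rewards themselves rather than by which parts of the state space happen to appear in their data.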

2. BACKGROUND

Our work leverages ideas from meta-RL, BRL and the VariBAD algorithm, as we now recapitulate.

Meta-RL:

In meta-RL, a distribution over tasks is assumed. A task T_i is described by a Markov Decision Process (MDP; Bertsekas, 1995) M_i = (S, A, R_i, P_i), where the state space S and the action space A are shared across tasks, and R_i and P_i are task-specific reward and transition functions. Thus, we write the task distribution as p(R, P). For simplicity, we assume throughout that the initial state distribution P_init(s_0) is the same for all MDPs. The goal in meta-RL is to train an agent that can quickly maximize reward in new, unseen tasks drawn from p(R, P). To do so, the agent must leverage any shared structure among tasks, which can typically be learned from a set of training tasks.
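As a toy instance of such a task distribution, the sketch below (illustrative assumptions throughout, in the spirit of the navigation example of Figure 1) shares S, A, and the dynamics across tasks and varies only the reward, which is induced by a goal sampled uniformly on the unit semi-circle.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """Draw a goal uniformly on the unit semi-circle; the goal fully
    determines the task-specific reward function R_i."""
    theta = rng.uniform(0.0, np.pi)
    return np.array([np.cos(theta), np.sin(theta)])

def reward(state, goal, radius=0.2):
    # Sparse reward, identical in form across tasks and parameterized
    # only by the sampled goal.
    return 1.0 if np.linalg.norm(state - goal) < radius else 0.0

goals = [sample_task() for _ in range(5)]
```

Sampling tasks this way makes the shared structure explicit: an agent that has learned the semi-circle geometry can search along it rather than explore the whole plane.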

Bayesian Reinforcement Learning:

The goal in BRL is to find the optimal policy π in an MDP, when the transitions and rewards are not known in advance. Similar to meta-RL, we assume a prior



Figure 1: Illustration of offline meta-RL: the task is to navigate to a goal position that can be anywhere on the semi-circle. The reward is sparse (light blue), and the offline data (left) contains training histories of conventional RL agents trained to find individual goals. The meta-RL agent (right) needs to find a policy that quickly locates the unknown goal, here by searching across the semi-circle. Note that this search behavior is completely different from the dominant behaviors in the data.

