INTRINSICALLY GUIDED EXPLORATION IN META REINFORCEMENT LEARNING

Abstract

Deep reinforcement learning algorithms generally require large amounts of data to solve a single task. Meta reinforcement learning (meta-RL) agents learn to adapt to novel unseen tasks with high sample efficiency by extracting useful prior knowledge from previous tasks. Despite recent progress, efficient exploration in meta-training and adaptation remains a key challenge in sparse-reward meta-RL tasks. We propose a novel off-policy meta-RL algorithm to address this problem, which disentangles exploration and exploitation policies and learns intrinsically motivated exploration behaviors. We design novel intrinsic rewards derived from information gain to reduce task uncertainty and encourage the explorer to collect informative trajectories about the current task. Experimental evaluation shows that our algorithm achieves state-of-the-art performance on various sparse-reward MuJoCo locomotion tasks and more complex Meta-World tasks.

1. INTRODUCTION

Human intelligence can transfer knowledge across tasks and acquire new skills from limited experience. However, most reinforcement learning (RL) agents still require large amounts of data to achieve human-level performance (Silver et al., 2017; Hessel et al., 2018; Vinyals et al., 2019). Meta reinforcement learning (meta-RL) takes a step toward such efficient learning by extracting prior knowledge from a set of tasks to quickly adapt to new tasks. However, efficient exploration in meta-RL must consider the training and adaptation phases simultaneously (Ishii et al., 2002), which makes it a key challenge.

One branch of meta-RL algorithms (Finn et al., 2017; Stadie et al., 2018; Rothfuss et al., 2019; Gurumurthy et al., 2019) utilizes policies injected with time-irrelevant random noise for meta-training exploration, while another branch (Duan et al., 2016; Mishra et al., 2017; Gupta et al., 2018; Zintgraf et al., 2019; Rakelly et al., 2019) introduces memories or latent variables that enable temporally-extended exploration behaviors. EPI (Zhou et al., 2018) introduces intrinsic rewards based on the improvement of dynamics prediction. However, these exploration mechanisms remain inefficient in either meta-training or adaptation and underperform in complex sparse-reward tasks.

To address this challenge, we introduce information-theoretic intrinsic motivations for learning to collect informative trajectories, enabling efficient exploration in both meta-training and adaptation. Inspired by the task-inference component common to context-based meta-RL algorithms such as PEARL (Rakelly et al., 2019) and VariBAD (Zintgraf et al., 2019), we leverage the insight that exploration behaviors should collect trajectories that carry rich information about the current task, and design an exploration objective that maximizes the information gain for inferring tasks.
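To make this concrete, one illustrative way to write such an exploration objective (the notation here is a sketch; the exact formulation is developed in Section 3) is as maximizing the mutual information between the task identity and the trajectory induced by the exploration policy:

```latex
\max_{\pi^e} \; I(\mathcal{T}; \tau_{\pi^e})
  \;=\; \mathcal{H}(\mathcal{T}) \;-\; \mathcal{H}(\mathcal{T} \mid \tau_{\pi^e}),
```

where $\tau_{\pi^e}$ denotes a trajectory collected by $\pi^e$ under a task drawn from $p(\mathcal{T})$. A trajectory is valuable exactly insofar as observing it reduces the entropy of the task posterior, i.e., yields information gain about $\mathcal{T}$.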
Based on this objective, we derive an intrinsic reward for learning an effective exploration policy. To reduce the variance of estimating information-gain intrinsic rewards in complex domains, we further simplify it into an intrinsic reward based on prediction errors, which achieves superior stability and scalability. We develop a novel off-policy meta-RL algorithm, Meta-RL with effiCient Uncertainty Reduction Exploration (MetaCURE), which incorporates our intrinsic rewards and separates exploration and exploitation policies. MetaCURE learns sequential exploration behaviors that reduce task uncertainty across adaptation episodes, and maximizes the expected extrinsic return in the last episode of the adaptation phase. During meta-training, MetaCURE collects training data with both the exploration and exploitation policies. During adaptation, the exploration policy collects informative trajectories, and the exploitation policy then utilizes the gained experience to maximize final performance. We evaluate our algorithm on various sparse-reward MuJoCo locomotion tasks as well as sparse-reward Meta-World tasks; empirical results show that it outperforms baseline algorithms by a large margin. We also visualize how our algorithm explores in novel tasks and discuss the pros and cons of the two proposed intrinsic rewards.
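The prediction-error idea can be sketched as follows. This is an illustrative toy, not the paper's actual network: `predict` is an assumed interface for any context-conditioned reward model, and `toy_predict` is a hypothetical stand-in that simply averages the rewards seen so far. The intrinsic reward is the reduction in prediction error once the newest transition joins the context.

```python
import numpy as np

def toy_predict(ctx, s, a):
    """Hypothetical reward model: predicts the mean reward seen in the
    context (a real model would condition on (s, a) and a latent task z)."""
    rewards = [r for (_a, r, _s_next) in ctx]
    return np.mean(rewards) if rewards else 0.0

def prediction_error_intrinsic_reward(predict, ctx_before, ctx_after, s, a, r):
    """Intrinsic reward = reduction in reward-prediction error after the
    newest transition is incorporated into the context. It is positive
    when the added experience makes the task easier to predict,
    i.e., when the transition is informative about the task."""
    err_before = (predict(ctx_before, s, a) - r) ** 2
    err_after = (predict(ctx_after, s, a) - r) ** 2
    return err_before - err_after
```

For example, observing the first reward in an unseen task is maximally informative under this toy model: with an empty context the predictor is wrong, and with the transition included it is exact, so the intrinsic reward equals the full squared error.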

2. BACKGROUND

In meta-RL, we consider a distribution of tasks $p(\mathcal{T})$, with each task $\mathcal{T}$ modelled as a Markov Decision Process (MDP) consisting of a state space, an action space, a transition function and a reward function. In common meta-RL settings (Duan et al., 2016; Finn et al., 2017; Zintgraf et al., 2019; Rakelly et al., 2019), tasks differ in the transition and/or reward function, and we can describe a task $\mathcal{T}$ with a tuple $(p_0^{\mathcal{T}}(s_0), p^{\mathcal{T}}(s'|s,a), r^{\mathcal{T}}(s,a))$, where $p_0^{\mathcal{T}}(s_0)$ is the initial state distribution, $p^{\mathcal{T}}(s'|s,a)$ the transition probability and $r^{\mathcal{T}}(s,a)$ the reward function. We denote by $c_n^{\mathcal{T}} = (a_n, r_n, s_{n+1})$ the experience tuple collected at the $n$-th step of adaptation in task $\mathcal{T}$, and we use $c_{-1:T-1}^{\mathcal{T}} = (c_{-1}^{\mathcal{T}}, c_0^{\mathcal{T}}, \dots, c_{T-1}^{\mathcal{T}})$ to denote all experiences collected in the first $T$ timesteps.

A common objective for meta-RL is to optimize final performance after few-shot adaptation (Finn et al., 2017; Gupta et al., 2018; Stadie et al., 2018; Rothfuss et al., 2019). During adaptation, an agent first explores with some exploration policy $\pi^e$ for a few episodes, and then updates an exploitation policy $\pi$ to maximize expected return. This meta-RL objective can be formulated as

$$\max_{\pi, \pi^e} \; \mathbb{E}_{\mathcal{T}}\big[R(\mathcal{T}, \pi(c_{\pi^e}^{\mathcal{T}}))\big],$$

where $c_{\pi^e}^{\mathcal{T}}$ is the set of experiences collected by $\pi^e$, and $R$ is the expected return of $\pi$ in the last episode. The exploitation policy $\pi$ is adapted with $c_{\pi^e}^{\mathcal{T}}$ to optimize final performance.
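The few-shot adaptation protocol above can be sketched in a few lines. Here `DummyTask` and its `rollout` interface are hypothetical stand-ins used only for illustration, with transitions stored as $(a, r, s')$ tuples following the context definition above.

```python
class DummyTask:
    """Hypothetical environment interface (illustration only): each rollout
    returns a single (a, r, s') transition, and the reward is 1.0 once any
    context has been gathered, mimicking exploration data being useful."""
    def rollout(self, policy, context):
        reward = 1.0 if context else 0.0
        return [(0, reward, 0)]

def adapt_and_evaluate(task, explore_policy, exploit_policy, num_explore_episodes=2):
    """Few-shot adaptation: run the exploration policy for a few episodes
    to gather context, then score the exploitation policy's return in the
    final episode, matching the objective max E[R(T, pi(c))]."""
    context = []
    for _ in range(num_explore_episodes):
        context.extend(task.rollout(explore_policy, context))
    final_episode = task.rollout(exploit_policy, context)
    return sum(r for (_a, r, _s_next) in final_episode)
```

In this toy setting the exploitation episode earns reward only because the exploration episodes populated the context first, which is exactly the dependency the objective formalizes.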

3. METACURE

To support efficient exploration in both meta-training and adaptation, we propose MetaCURE, a novel off-policy meta-RL algorithm that learns separate exploration and exploitation policies. The exploration policy aims to collect trajectories that maximize the agent's information gain, reducing the uncertainty of task inference. The exploitation policy maximizes the expected extrinsic return in the last episode of adaptation. In this section, we first present the MetaCURE framework, and then discuss its intrinsic reward design for learning an exploration policy that is efficient in both meta-training and adaptation.

3.1. THE METACURE FRAMEWORK

As shown in Figure 1, MetaCURE is composed of three main components: (i) a task encoder $q_\phi(z|c)$ that extracts information from context $c$ and estimates the posterior task belief $z$, (ii) an Explorer that learns the behavior (exploration) policy, and (iii) an Exploiter that learns the target (exploitation) policy. We utilize variational inference methods (Kingma & Welling, 2013; Alemi et al., 2016; Rakelly et al., 2019) to train the task encoder $q_\phi$. To learn effective task encodings, its decoder is designed to recover the action-value function of the Exploiter, which captures rich, temporally-extended information about the current task. The algorithm utilizes two separate replay buffers, $B$ and $B_{enc}$: $B$ is used to train both the Explorer and the Exploiter, while $B_{enc}$ is used to train the task encoder. During meta-training, MetaCURE iteratively infers the posterior task belief from contexts and runs both the exploration and exploitation policies to collect data; $B$ stores experiences from both policies, while $B_{enc}$ stores only exploration trajectories. During adaptation, only the Explorer is used to collect informative trajectories for inferring the posterior task belief, and the Exploiter utilizes this posterior belief to maximize final performance.
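The two-buffer bookkeeping can be sketched as a minimal data structure. This is an illustrative assumption, not MetaCURE's actual implementation; real buffers would also store per-task experience tuples and support sampling.

```python
class ReplayBuffers:
    """Illustrative two-buffer bookkeeping: B trains both policies, while
    B_enc (the encoder buffer) keeps only exploration trajectories, so the
    task encoder is trained on the same distribution of contexts it will
    see during adaptation (where only the Explorer collects data)."""
    def __init__(self):
        self.B = []      # experiences from both Explorer and Exploiter
        self.B_enc = []  # exploration trajectories only, for the encoder

    def store(self, trajectory, from_explorer):
        self.B.append(trajectory)
        if from_explorer:
            self.B_enc.append(trajectory)
```

Keeping $B_{enc}$ exploration-only matters because, at adaptation time, the posterior task belief is inferred exclusively from the Explorer's trajectories; training the encoder on exploitation data would create a train/test mismatch in the context distribution.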



Footnote: for $n = -1$, we define $c_{-1}^{\mathcal{T}} = (0, 0, s_0)$. In the following derivations, we may drop $\mathcal{T}$ for brevity.

