INTRINSICALLY GUIDED EXPLORATION IN META REINFORCEMENT LEARNING

Abstract

Deep reinforcement learning algorithms generally require large amounts of data to solve even a single task. Meta reinforcement learning (meta-RL) agents learn to adapt to unseen tasks with high sample efficiency by extracting useful prior knowledge from previous tasks. Despite recent progress, efficient exploration during meta-training and adaptation remains a key challenge in sparse-reward meta-RL tasks. We propose a novel off-policy meta-RL algorithm to address this problem, which disentangles exploration and exploitation policies and learns intrinsically motivated exploration behaviors. We design novel intrinsic rewards derived from information gain to reduce task uncertainty and encourage the explorer to collect informative trajectories about the current task. Experimental evaluation shows that our algorithm achieves state-of-the-art performance on various sparse-reward MuJoCo locomotion tasks and more complex Meta-World tasks.

1. INTRODUCTION

Human intelligence can transfer knowledge across tasks and acquire new skills from limited experience. Most reinforcement learning (RL) agents, in contrast, still require large amounts of data to achieve human-level performance (Silver et al., 2017; Hessel et al., 2018; Vinyals et al., 2019). Meta reinforcement learning (meta-RL) takes a step toward such efficient learning by extracting prior knowledge from a set of training tasks in order to adapt quickly to new ones. However, efficient exploration in meta-RL must consider the training and adaptation phases simultaneously (Ishii et al., 2002), which makes it a key challenge for meta-RL.

One branch of meta-RL algorithms (Finn et al., 2017; Stadie et al., 2018; Rothfuss et al., 2019; Gurumurthy et al., 2019) explores during meta-training with policies injected with time-irrelevant random noise, while another branch (Duan et al., 2016; Mishra et al., 2017; Gupta et al., 2018; Zintgraf et al., 2019; Rakelly et al., 2019) introduces memories or latent variables that enable temporally extended exploration behaviors. EPI (Zhou et al., 2018) introduces intrinsic rewards based on the improvement of dynamics prediction. However, these exploration mechanisms remain inefficient in either meta-training or adaptation and underperform in complex sparse-reward tasks.

To address this challenge, we introduce information-theoretic intrinsic motivations for learning to collect informative trajectories, enabling efficient exploration in both meta-training and adaptation. Inspired by the task-inference component common to context-based meta-RL algorithms such as PEARL (Rakelly et al., 2019) and VariBAD (Zintgraf et al., 2019), we leverage the insight that exploration should collect trajectories carrying rich information about the current task, and design an exploration objective that maximizes the information gain for task inference.
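This information-gain objective can be sketched formally. The notation below (z for the unknown task, \tau_{:t} for the experience collected up to step t) is assumed for illustration rather than taken from the paper:

```latex
% Exploration objective: maximize mutual information between the task z
% and the collected experience \tau (equivalently, reduce task uncertainty).
J_{\mathrm{explore}} \;=\; I(z;\,\tau) \;=\; H(z) - H(z \mid \tau),
\qquad
r_t^{\mathrm{int}} \;=\; H\!\left(z \mid \tau_{:t-1}\right) - H\!\left(z \mid \tau_{:t}\right).
```

Under this sketch, the per-step intrinsic rewards telescope, so summing them over a trajectory recovers the total information gained about the task.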
Based on this objective, we derive an intrinsic reward for learning an effective exploration policy. Because estimating information-gain intrinsic rewards in complex domains suffers from high variance, we further derive a simplified intrinsic reward based on prediction errors, which achieves superior stability and scalability. We develop a novel off-policy meta-RL algorithm, Meta-RL with effiCient Uncertainty Reduction Exploration (MetaCURE), that incorporates our intrinsic rewards and separates the exploration and exploitation policies. MetaCURE learns sequential exploration behaviors that reduce task uncertainty across adaptation episodes, and maximizes the expected extrinsic return in the last episode of the adaptation phase. During meta-training, MetaCURE collects training data with both the exploration and exploitation policies. During adaptation, the exploration policy collects informative trajectories about the current task, which the exploitation policy then leverages to maximize extrinsic return.
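The prediction-error idea can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the linear next-state models and the reward definition (error of a task-agnostic predictor minus error of a task-conditioned predictor) are assumptions chosen to show the principle that transitions where task knowledge improves prediction are the informative ones.

```python
import numpy as np

def prediction_error(w, s, a, s_next):
    """Squared error of a linear next-state model s' ~ w @ [s; a] (illustrative)."""
    x = np.concatenate([s, a])
    return float(np.sum((w @ x - s_next) ** 2))

def intrinsic_reward(w_task, w_prior, s, a, s_next):
    """Hypothetical prediction-error intrinsic reward: how much a
    task-conditioned predictor (w_task) beats a task-agnostic one (w_prior)
    on this transition. Large values mark task-informative transitions."""
    return prediction_error(w_prior, s, a, s_next) - prediction_error(w_task, s, a, s_next)

# Toy check: the task-conditioned model matches the true dynamics of the
# current task, while the task-agnostic model is perturbed away from them.
rng = np.random.default_rng(0)
w_true = rng.normal(size=(3, 5))             # true dynamics (s_dim=3, a_dim=2)
w_prior = w_true + rng.normal(size=(3, 5))   # task-agnostic model is off
s, a = rng.normal(size=3), rng.normal(size=2)
s_next = w_true @ np.concatenate([s, a])     # noiseless transition
r_int = intrinsic_reward(w_true, w_prior, s, a, s_next)
print(r_int > 0)  # informative transition earns positive intrinsic reward
```

An explorer trained on such rewards is steered toward regions of the state-action space where knowing the task identity actually matters for prediction, which is exactly where experience helps task inference.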

