REGIONED EPISODIC REINFORCEMENT LEARNING

Abstract

Goal-oriented reinforcement learning algorithms are often good at exploration but not exploitation, while episodic algorithms excel at exploitation but not exploration. As a result, neither approach alone yields a sample-efficient algorithm in complex environments with high-dimensional state spaces and delayed rewards. Motivated by these observations and shortcomings, in this paper we introduce Regioned Episodic Reinforcement Learning (RERL), which combines the strengths of episodic and goal-oriented learning and leads to a more sample-efficient and effective algorithm. RERL achieves this by decomposing the state space into several sub-space regions and constructing each region so that it leads to more effective exploration and high-value trajectories. Extensive experiments on various benchmark tasks show that RERL outperforms existing methods in terms of sample efficiency and final rewards.

1. INTRODUCTION

Despite its notable success, the application of reinforcement learning (RL) still suffers from poor sample efficiency in real-world applications. To achieve human-level performance, episodic RL (Pritzel et al., 2017; Lee et al., 2019) was proposed to construct an episodic memory that enables the agent to assimilate new experiences and act upon them rapidly. While episodic algorithms work well for tasks where it is easy to collect valuable trajectories and to design dense reward functions, both requirements become roadblocks in complex environments with sparse rewards. Goal-oriented RL (Andrychowicz et al., 2017; Paul et al., 2019) decomposes the task into several goal-conditioned tasks, where the intrinsic reward for each goal is defined as the probability that the current policy reaches it, and the sequence of goals guides the agent toward the final target state. These methods aim to explore more unique trajectories but use all trajectories in the training procedure, which may involve unrelated ones and result in inefficient exploitation. In this paper, we propose a novel framework that combines the strengths of episodic and goal-oriented algorithms and can thus efficiently explore and rapidly exploit high-value trajectories.

The inefficient learning of deep RL has several plausible explanations. In this work, we focus on addressing the following challenges. (C1) Environments with a sparse reward signal can be difficult to learn in, as there may be very few transitions where the reward is non-zero. Goal-oriented RL can mitigate this issue by building intrinsic reward signals (Ren et al., 2019), but suffers from the difficulty of generating appropriate goals in a high-dimensional space. (C2) Training goal-oriented RL models on all historical trajectories rather than selected ones involves unrelated trajectories in training.
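To make the goal-conditioned setup concrete, the sketch below shows a sparse goal-reaching reward together with HER-style trajectory relabeling in the spirit of Andrychowicz et al. (2017). This is an illustrative simplification, not RERL's actual reward: the distance tolerance `tol` and the choice of the trajectory's own final state as the relabeled goal are assumptions made here for brevity.

```python
import numpy as np

def goal_conditioned_reward(state, goal, tol=0.05):
    """Sparse goal-conditioned reward: 1.0 if the state is within
    `tol` (Euclidean distance) of the goal, else 0.0."""
    return float(np.linalg.norm(np.asarray(state) - np.asarray(goal)) <= tol)

def relabel_trajectory(states, actions, original_goal):
    """HER-style relabeling: even if the trajectory never reached
    `original_goal`, treat the state it actually achieved (its final
    state) as the goal, so the trajectory carries a non-zero reward
    signal. Returns (state, action, relabeled_goal, reward) tuples."""
    achieved = states[-1]
    return [
        (s, a, achieved, goal_conditioned_reward(s_next, achieved))
        for s, a, s_next in zip(states[:-1], actions, states[1:])
    ]
```

With this relabeling, the final transition of any trajectory receives reward 1.0 under the substituted goal, which is what turns reward-free rollouts into usable training signal in sparse-reward environments.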
The training process of goal-generation algorithms can be unstable and inefficient (Kumar et al., 2019), as the data distribution shifts when the goal changes. Training can be far more efficient if updates are performed only with highly related trajectories. (C3) Redundant exploration is another issue that limits performance, as it is wasteful for the agent to explore the same areas repeatedly (Ostrovski et al., 2017). Instead, it is much more sensible for the agent to divide the task into several sub-tasks so as to avoid redundant exploration.

In this paper, we propose Regioned Episodic Reinforcement Learning (RERL), which tackles the limitations of deep RL listed above and demonstrates dramatic improvements in a wide range of environments. Our work is, in part, inspired by studies in psychology and cognitive neuroscience (Lengyel & Dayan, 2008; Manns et al., 2003), which find that when we observe an event, we scan through the memory storing this kind of event and seek experiences related to
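One simple way to picture the regioned decomposition behind (C3) is to partition visited states into coarse cells and direct the next goal toward the least-visited cell, so the agent does not repeatedly explore the same area. The grid discretization, the cell size `cell`, and the visit-count heuristic below are assumptions made for illustration; they are not the paper's actual region-construction method.

```python
import numpy as np
from collections import defaultdict

def region_id(state, cell=0.5):
    """Map a continuous state to a coarse grid cell (a 'region')."""
    return tuple(np.floor(np.asarray(state, dtype=float) / cell).astype(int))

class RegionVisitCounter:
    """Tracks how often each region has been visited and proposes a
    goal from the least-visited region, discouraging redundant
    exploration of already well-covered areas."""

    def __init__(self, cell=0.5):
        self.cell = cell
        self.counts = defaultdict(int)    # region id -> visit count
        self.members = defaultdict(list)  # region id -> visited states

    def add(self, state):
        rid = region_id(state, self.cell)
        self.counts[rid] += 1
        self.members[rid].append(np.asarray(state, dtype=float))

    def least_visited_goal(self):
        """Return a representative state from the least-visited region."""
        rid = min(self.counts, key=self.counts.get)
        return self.members[rid][-1]
```

In a full system one would replace the fixed grid with a learned or adaptive partition, but even this counter captures the intent of (C3): exploration effort is allocated per region rather than globally.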

