REGIONED EPISODIC REINFORCEMENT LEARNING

Abstract

Goal-oriented reinforcement learning algorithms are often good at exploration but not exploitation, while episodic algorithms excel at exploitation but not exploration. As a result, neither approach alone yields a sample-efficient algorithm in complex environments with high-dimensional state spaces and delayed rewards. Motivated by these observations and shortcomings, in this paper we introduce Regioned Episodic Reinforcement Learning (RERL), which combines the strengths of episodic and goal-oriented learning into a more sample-efficient and effective algorithm. RERL achieves this by decomposing the state space into several sub-space regions and constructing regions that lead to more effective exploration and high-value trajectories. Extensive experiments on various benchmark tasks show that RERL outperforms existing methods in terms of sample efficiency and final rewards.

* The idea our method shares with neuroscience is the use of highly related information to promote learning efficiency. The difference is that memories are regioned according to the generated goals in this paper, and according to fictions in cognitive neuroscience.

1. INTRODUCTION

Despite its notable success, the application of reinforcement learning (RL) still suffers from poor sample efficiency in real-world applications. To achieve human-level performance, episodic RL (Pritzel et al., 2017; Lee et al., 2019) has been proposed to construct an episodic memory, enabling the agent to assimilate new experiences and act upon them rapidly. While episodic algorithms work well for tasks where it is easy to collect valuable trajectories and to design dense reward functions, both requirements become roadblocks when applied to complex environments with sparse rewards. Goal-oriented RL (Andrychowicz et al., 2017; Paul et al., 2019) decomposes the task into several goal-conditioned tasks, where the intrinsic reward is defined by the success probability of reaching each goal under the current policy, and the goals guide the agent toward the final target state. These methods aim to explore more unique trajectories but use all trajectories in the training procedure, which may involve unrelated ones and result in inefficient exploitation. In this paper, we propose a novel framework that combines the strengths of episodic and goal-oriented algorithms and can thus efficiently explore and rapidly exploit high-value trajectories.

The inefficient learning of deep RL has several plausible explanations. In this work, we focus on addressing the following challenges: (C1) Environments with a sparse reward signal can be difficult to learn in, as there may be very few instances where the reward is non-zero. Goal-oriented RL can mitigate this issue by building intrinsic reward signals (Ren et al., 2019), but suffers from the difficulty of generating appropriate goals in a high-dimensional space. (C2) Training goal-oriented RL models on all historical trajectories rather than selected ones involves unrelated trajectories in training.
The training process of goal generation algorithms can be unstable and inefficient (Kumar et al., 2019), as the data distribution shifts when the goal changes. Training can be far more efficient if updates happen only on highly related trajectories. (C3) Redundant exploration is another issue that limits performance, as it is inefficient for the agent to explore the same areas repeatedly (Ostrovski et al., 2017). Instead, it is much more sensible for the agent to learn to divide the task into several sub-tasks to avoid redundant exploration.

In this paper, we propose Regioned Episodic Reinforcement Learning (RERL), which tackles the limitations of deep RL listed above and demonstrates dramatic improvements in a wide range of environments. Our work is, in part, inspired by studies in psychology and cognitive neuroscience (Lengyel & Dayan, 2008; Manns et al., 2003), which find that when we observe an event, we scan through the memory storing this kind of event and seek experiences related to it. Our agent regionalizes the historical trajectories into several region-based memories*. At each timestep, the region controller evaluates each region and selects one for further exploration and exploitation. Each memory binds a specific goal to a series of goal-oriented trajectories and uses a value-based look-up to retrieve highly related, high-quality trajectories when updating the value function. We adopt hindsight (i.e., the goal state is always generated from visited states in the memory) and diversity (i.e., the goal state should be distant from previous goal states in other memories) constraints in goal generation, for goal reachability and agent exploration, respectively. This architecture conveys several benefits: (1) We can automatically construct region-based memory by goal-oriented exploration, where trajectories guided by the same goal share one memory (see Section 3.1).
(2) Within each memory, we alleviate the high-dimensionality issue (C1) by enforcing that the goal space is a set of visited states (see Section 3.2). (3) To improve efficiency in exploitation (C2), our architecture stabilizes training by using trajectories within the memory instead of randomly selected transitions (see Section 3.3 for details). (4) Our algorithm takes previous goals in other memories into account when generating a goal in the current memory. Specifically, we propose the diversity constraint to encourage the agent to explore unknown states (see Section 3.2), which improves exploration efficiency (C3).

The contributions of this paper are as follows: (1) We introduce RERL, a novel framework that combines the strengths of episodic RL and goal-oriented RL for efficient exploration and exploitation. (2) We propose hindsight and diversity constraints in goal generation, which allow the agent to construct and update the regioned memories automatically. (3) We evaluate RERL in challenging robotic environments and show that our method can naturally handle sparse-reward environments without any additional prior knowledge or manually modified reward functions. RERL can be closely incorporated with various policy networks, such as deep deterministic policy gradient (DDPG (Lillicrap et al., 2015)) and proximal policy optimization (PPO (Schulman et al., 2017)). Further, ablation studies demonstrate that our exploration strategy is robust across a wide set of hyper-parameters.
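The regioned-memory architecture described above can be illustrated with a minimal sketch. All names, the best-return value estimate, and the ε-greedy region selection rule below are illustrative assumptions, not the paper's exact algorithm; the sketch only shows the shape of the idea: one memory per goal, a controller that picks a region, and goal generation under the hindsight (candidates are visited states) and diversity (far from other regions' goals) constraints.

```python
import math
import random


class Region:
    """One region: a goal plus the goal-oriented trajectories that pursued it."""

    def __init__(self, goal):
        self.goal = goal
        self.trajectories = []    # each trajectory: list of (state, action, reward)
        self.visited_states = []

    def store(self, trajectory):
        self.trajectories.append(trajectory)
        self.visited_states.extend(s for s, _, _ in trajectory)

    def value(self):
        """Crude value estimate for the controller: best return seen in this region."""
        if not self.trajectories:
            return 0.0
        return max(sum(r for _, _, r in traj) for traj in self.trajectories)


def generate_goal(region, other_goals, num_candidates=5):
    """Hindsight constraint: candidate goals are visited states of this region.
    Diversity constraint: pick the candidate farthest from other regions' goals."""
    candidates = random.sample(region.visited_states,
                               min(num_candidates, len(region.visited_states)))
    if not other_goals:
        return random.choice(candidates)
    return max(candidates,
               key=lambda s: min(math.dist(s, g) for g in other_goals))


def select_region(regions, eps=0.2):
    """Region controller: mostly exploit the highest-value region,
    occasionally explore a random one."""
    if random.random() < eps:
        return random.choice(regions)
    return max(regions, key=lambda r: r.value())
```

In this sketch, exploitation stays region-local (updates would draw only from the selected region's trajectories), while the diversity term in `generate_goal` pushes new goals away from already-covered regions.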

2. PRELIMINARIES

In RL (Sutton & Barto, 2018), the goal of an agent is to maximize its expected cumulative reward by interacting with a given environment. The RL problem can be formulated as a Markov Decision Process (MDP) defined by a tuple (S, A, P, r, γ), where S is the state space, A is the action space, P : S × A → ∆(S) is the state transition probability distribution, r : S × A → [0, 1] is the reward function, and γ ∈ [0, 1) is the discount factor for future rewards. Our objective is to find a stochastic policy π : S × A → [0, 1] that maximizes the expected cumulative reward R_t = Σ_{k=0}^{T} γ^k r_{t+k} within the MDP, where T is the episode length. In the finite-horizon setting, the state-action value function Q^π(s, a) = E[R_t | s_t = s, a_t = a] is the expected return for executing action a in state s and following π afterward. The value function can be defined as V^π(s) := E[Σ_{k=0}^{T} γ^k r(s_{t+k}, a_{t+k}) | s_t = s, π], ∀s ∈ S, and the goal of the agent is to maximize the expected return of each state s_t.

Deep Q Network (DQN (Mnih et al., 2015)) utilizes an off-policy learning strategy, which samples (s_t, a_t, r_t, s_{t+1}) tuples from a replay buffer for training. It is a typical parametric RL method and suffers from sample inefficiency due to slow gradient-based updates. The key idea of episodic RL is to store good past experiences in a tabular, non-parametric memory and rapidly latch onto past successful policies when encountering similar states, instead of waiting for many optimization steps. However, in environments with sparse rewards, there may be very few instances where the reward is non-zero, making it difficult for an agent to find good past experiences. Goal-oriented RL is proposed to address this issue. In the goal-conditioned setting that we use here, the policy and the reward are also conditioned on a goal g ∈ G (Schaul et al., 2015).
The distance function d (used to define goal completion and to generate a sparse reward upon completion of the goal) can be exposed as a shaped intrinsic reward without any additional domain knowledge: r(s_t, a_t | g) = 1 if d(φ(s_{t+1}), g) ≤ δ, and r(s_t, a_t | g) = −d(φ(s_{t+1}), g) otherwise, where φ : S → G maps states to the goal space.
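As a concrete illustration, this goal-conditioned reward can be written in a few lines. The Euclidean distance, the identity state-to-goal mapping, and the threshold value here are assumptions chosen for the sketch; any d, φ, and δ fitting the definitions above would do.

```python
import math


def goal_reward(next_state, goal, phi=lambda s: s, delta=0.05):
    """Goal-conditioned reward: a sparse success bonus of 1 when the mapped
    next state is within delta of the goal, otherwise a shaped reward equal
    to the negative distance to the goal (increasing toward 0 as we approach).

    phi maps states to goal space (identity here, for illustration);
    delta is the goal-completion threshold."""
    d = math.dist(phi(next_state), goal)
    return 1.0 if d <= delta else -d
```

Because the shaped term is simply −d, the agent receives a dense learning signal everywhere, while the bonus of 1 marks actual goal completion.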

