EXPLORE WITH DYNAMIC MAP: GRAPH STRUCTURED REINFORCEMENT LEARNING

Abstract

In reinforcement learning, a map with states and transitions built using historical trajectories is often helpful in exploration and exploitation. Even so, learning and planning on such a map within a sparse environment remains a challenge. As a step towards this goal, we propose Graph Structured Reinforcement Learning (GSRL), which leverages graph structure in historical trajectories to slowly adjust exploration directions and rapidly update value function estimation with related experiences. GSRL constructs a dynamic graph on top of state transitions in the replay buffer based on historical trajectories and develops an attention strategy on the map to select an appropriate goal direction, which decomposes the task of reaching a distant goal state into a sequence of easier tasks. We also leverage graph structure to sample related trajectories for efficient value learning. Results demonstrate that GSRL can outperform the state-of-the-art algorithms in terms of sample efficiency on benchmarks with sparse reward functions.

1. INTRODUCTION

How can humans learn to solve tasks with complex structure and delayed, sparse feedback? Take a complicated navigation task as an example, in which the goal is to travel from a start state to an end state. A straightforward approach is to decompose this complicated task into a sequence of easier ones, identify intermediate goals that help reach the final goal, and finally choose the route with the highest return among a pool of candidate routes that lead to the final goal. Despite recent success and progress in reinforcement learning (RL) approaches (Mnih et al., 2013; Fakoor et al., 2020; Hansen et al., 2018; Huang et al., 2019; Silver et al., 2016), there is still a huge gap between how humans and RL agents learn. In many real-world applications, an RL agent only has access to sparse and delayed rewards, which leads to two major challenges. (C1) The first is: how can an agent effectively learn from sparse and delayed rewards? One possible solution is to build a new reward function, known as an intrinsic reward, that helps expedite the learning process and ultimately solve the given task. Although this may seem like an appealing solution, it is often not obvious how to formulate an effective intrinsic reward. Recent works have introduced goal-oriented RL (Schaul et al., 2015; Chiang et al., 2019) as a way of constructing intrinsic rewards. However, most goal-oriented RL algorithms require large amounts of reward shaping (Chiang et al., 2019) or human demonstrations (Nair et al., 2018). (C2) The second is: how can an agent retrieve relevant past experiences that help it learn faster and improve sample efficiency? Much recent attention has focused on episodic RL (Blundell et al., 2016), which builds a non-parametric episodic memory to store past experiences and can thus rapidly retrieve related ones through similarity search.
However, most episodic RL algorithms require additional memory space (Pritzel et al., 2017). Motivated by these challenges, we propose a method that addresses the first problem (C1) by learning to generate an appropriate goal g^e based on the map G_0^e at episode e, forming intrinsic rewards during goal generation. Goal generation can be optimized in hindsight: we update the goal-generation model under supervision produced by planning algorithms on the map G_0^{e+1} at episode e + 1. We address the second problem (C2) by updating the value estimate of the current state with trajectories that pass through neighboring states. Note that GSRL draws a map to help the agent plan an appropriate direction over a long horizon and learn an optimal strategy over a short horizon, and it can be combined with various learning algorithms such as DQN (Mnih et al., 2013), DDPG (Lillicrap et al., 2015), etc. However, arriving at a well-explored map of state transitions from scratch remains a challenging problem; GSRL must therefore jointly optimize map construction, goal selection, and value estimation. As illustrated in Figure 2, GSRL follows four main steps: (a) at the beginning, the agent is located at the initial state and aims for the terminal state with an empty map; (b) the agent then explores the environment with the current policy and records the explored states on the map; (c) because the terminal state is distant and rewards are sparse during exploration, we decompose the whole task into a sequence of easier tasks guided by goals; (d) we leverage structured map information to select related trajectories for updating the current policy. The agent repeats procedures (b)-(d) and finally reaches the terminal state with a well-explored map.
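Steps (b) and (c) can be sketched in a few lines. The following is a minimal illustration with invented names (`DynamicMap`, `propose_goal`), not the paper's implementation: the map is a transition graph grown from trajectories, and a visit-count heuristic stands in for the learned attention strategy that proposes intermediate goals; the goal-reaching intrinsic reward is a common goal-oriented form, and the paper's exact shaping may differ.

```python
from collections import defaultdict

def intrinsic_reward(state, goal):
    """Goal-reaching intrinsic reward: 1 when the proposed goal is reached."""
    return 1.0 if state == goal else 0.0

class DynamicMap:
    """Sketch of steps (a)-(d): a state-transition graph built from
    historical trajectories (b), used to propose intermediate goals (c)."""
    def __init__(self):
        self.edges = defaultdict(set)   # state -> observed successor states
        self.visits = defaultdict(int)  # visit counts per state

    def record(self, trajectory):
        # (b) record explored states and transitions on the map
        for s, s_next in zip(trajectory, trajectory[1:]):
            self.edges[s].add(s_next)
            self.visits[s] += 1
        self.visits[trajectory[-1]] += 1

    def propose_goal(self):
        # (c) proxy for the attention strategy: direct the agent toward
        # the least-visited known state, splitting the long-horizon task
        return min(self.visits, key=self.visits.get)
```

In the full algorithm this heuristic is replaced by a learned attention mechanism over the graph, supervised in hindsight by planning on the next episode's map.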
Our primary contribution is a novel algorithm that constructs a dynamic state-transition graph from scratch and leverages its structure both to select an appropriate goal over a long horizon and to sample related trajectories for value estimation over a short horizon. This framework provides several benefits. (1) Balancing exploration and exploitation is a fundamental problem in reinforcement learning (Andrychowicz et al., 2017; Nachum et al., 2018); GSRL learns an attention mechanism on the graph to decide whether to explore states with high uncertainty or exploit states with well-estimated values. (2) Most existing reinforcement learning approaches (Paul et al., 2019; Eysenbach et al., 2019) update the current policy with randomly sampled trajectories, which is sample-inefficient; GSRL instead selects highly related trajectories for updates through the local structure of the graph. (3) Unlike previous approaches (Kipf et al., 2020; Shang et al., 2019), GSRL constructs a dynamic graph purely through agent exploration, without any additional object-representation techniques, and is therefore not limited to a few specific environments. Empirically, we find that our method constructs a useful map and learns an efficient policy for highly structured tasks, and comparisons with state-of-the-art RL methods show that GSRL substantially boosts performance.
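Benefit (2), graph-guided experience selection, can be illustrated with a small sketch. The names below (`TrajectoryIndex`, `related`) are ours, not the paper's: trajectories are indexed by the states they visit, and retrieval expands the map around the current state instead of sampling the buffer uniformly at random.

```python
from collections import defaultdict, deque

class TrajectoryIndex:
    """Sketch of selecting related trajectories via local graph structure."""
    def __init__(self):
        self.by_state = defaultdict(list)   # state -> ids of trajectories visiting it
        self.neighbors = defaultdict(set)   # undirected view of the transition graph
        self.trajectories = []

    def add(self, trajectory):
        tid = len(self.trajectories)
        self.trajectories.append(trajectory)
        for s, s_next in zip(trajectory, trajectory[1:]):
            self.neighbors[s].add(s_next)
            self.neighbors[s_next].add(s)
        for s in set(trajectory):
            self.by_state[s].append(tid)

    def related(self, state, hops=1):
        # BFS over the graph to collect states within `hops` of `state`,
        # then return every stored trajectory touching that neighborhood.
        seen, frontier = {state}, deque([(state, 0)])
        while frontier:
            s, d = frontier.popleft()
            if d == hops:
                continue
            for nxt in self.neighbors[s]:
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, d + 1))
        tids = {t for s in seen for t in self.by_state[s]}
        return [self.trajectories[t] for t in sorted(tids)]
```

Under this scheme, a value update for a state draws only on experiences that pass through its graph neighborhood, which is the intuition behind GSRL's short-horizon value estimation.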

2. PRELIMINARIES

Let M = (S, A, T, P, R) be a Markov Decision Process (MDP), where S is the state space, A is the action space (whose size is bounded by a constant), T ∈ Z+ is the episode length, P : S × A → ∆(S) is the transition function, which takes a state-action pair and returns a distribution over states, and R : S × A → ∆(R) is the reward distribution. The rewards from the environment, called extrinsic,
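To make the notation concrete, the following sketch (our own illustration, not an environment from the paper) instantiates a tiny sparse-reward MDP M = (S, A, T, P, R) as a chain of states in which only the step onto the terminal state yields extrinsic reward:

```python
class ChainMDP:
    """Tiny sparse-reward MDP: states 0..n-1, actions {0: left, 1: right}.
    Reward is 1 only when the agent steps onto the terminal state n-1."""
    def __init__(self, n=6, episode_len=20):
        self.n = n                # |S|
        self.T = episode_len      # episode length T
        self.actions = (0, 1)     # A, size bounded by a constant
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Deterministic P: move left or right along the chain.
        nxt = max(0, self.state - 1) if action == 0 else min(self.n - 1, self.state + 1)
        reward = 1.0 if nxt == self.n - 1 else 0.0  # sparse extrinsic R
        self.state = nxt
        return nxt, reward, nxt == self.n - 1
```

A uniformly random policy reaches the rewarding state only rarely as n grows, which is exactly the sparse-feedback setting that motivates intrinsic goals.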



Figure 1: An illustration of the motivation for GSRL. In this work, we propose Graph Structured Reinforcement Learning (GSRL), which constructs a state-transition graph, leverages structural graph information, and presents solutions to the problems raised above. As shown in Figure 1, when we encounter a complex task, we draw a map for planning over a long horizon (Eysenbach et al., 2019) and scan the graph for related experiences over a short horizon (Lee et al., 2019).

Figure 2: GSRL allows joint optimization of map construction and agent exploration from scratch. (a) An agent is located at the initial state and aims for the terminal state with an empty map at the beginning. (b) The agent explores the environment and records the explored states on the map. (c) The agent generates a goal from previous states to decompose the whole task into a sequence of easier tasks. (d) The agent updates the current policy with related trajectories selected using structured map information.



