EXPLORE WITH DYNAMIC MAP: GRAPH STRUCTURED REINFORCEMENT LEARNING

Abstract

In reinforcement learning, a map of states and transitions built from historical trajectories is often helpful for exploration and exploitation. Even so, learning and planning on such a map in a sparse-reward environment remains a challenge. As a step towards this goal, we propose Graph Structured Reinforcement Learning (GSRL), which leverages the graph structure of historical trajectories to slowly adjust exploration directions and rapidly update value estimates with related experiences. GSRL constructs a dynamic graph on top of the state transitions in the replay buffer and develops an attention strategy on this map to select an appropriate goal direction, which decomposes the task of reaching a distant goal state into a sequence of easier tasks. We also leverage the graph structure to sample related trajectories for efficient value learning. Results demonstrate that GSRL outperforms state-of-the-art algorithms in terms of sample efficiency on benchmarks with sparse reward functions.

* In this paper, we use 'degree' to mean 'out-degree', unless otherwise stated.
† We define the fully explored graph G_full as the graph covering the optimal solution, and the whole graph G_whole as the graph covering all state transitions. Hence, G_full is a sub-graph of G_whole (i.e., G_full ⊆ G_whole).

1. INTRODUCTION

How do humans learn to solve tasks with complex structure and delayed, sparse feedback? Take a complicated navigation task as an example, in which the goal is to go from a start state to an end state. A straightforward approach is to decompose this complicated task into a sequence of easier ones, identify or form intermediate goals that help reach the final goal, and finally choose the route with the highest return among a pool of candidate routes that lead to the final goal. Despite recent success and progress in reinforcement learning (RL) approaches (Mnih et al., 2013; Fakoor et al., 2020; Hansen et al., 2018; Huang et al., 2019; Silver et al., 2016), there is still a huge gap between how humans and RL agents learn. In many real-world applications, an RL agent only has access to sparse and delayed rewards, which by itself leads to two major challenges. (C1) The first is: how can an agent effectively learn from sparse and delayed rewards? One possible solution is to build a new reward function, known as an intrinsic reward, that helps expedite the learning process and ultimately solve a given task. Although this may seem like an appealing solution, it is often not obvious how to formulate an effective intrinsic reward. Recent works have introduced goal-oriented RL (Schaul et al., 2015; Chiang et al., 2019) as a way of constructing intrinsic rewards. However, most goal-oriented RL algorithms require large amounts of reward shaping (Chiang et al., 2019) or human demonstrations (Nair et al., 2018). (C2) The second is: how can an agent retrieve relevant past experiences that help it learn faster and improve sample efficiency? Recent attention has focused on episodic RL (Blundell et al., 2016), which builds a non-parametric episodic memory to store past experiences and can thus rapidly latch onto related ones through similarity search.
However, most episodic RL algorithms require additional memory space (Pritzel et al., 2017). In this work, we propose Graph Structured Reinforcement Learning (GSRL), which constructs a state-transition graph, leverages structural graph information, and presents solutions to the problems raised above. As shown in Figure 1, when we encounter a complex task, we draw a map for planning over a long horizon (Eysenbach et al., 2019) and scan through the graph to seek related experiences within a short horizon (Lee et al., 2019).

Under review as a conference paper at ICLR 2021

Inspired by this, we address the first problem (C1) by learning to generate an appropriate goal g_e based on map G_0^e at episode e and forming intrinsic rewards during goal generation. Goal generation can be optimized in hindsight: we update the goal generation model under supervision generated by planning algorithms on map G_0^{e+1} at episode e+1. We address the second problem (C2) by updating the value estimate of the current state with trajectories that include neighborhood states. Note that GSRL draws a map to help the agent plan an appropriate direction over a long horizon and learn an optimal strategy over a short horizon, and can be widely combined with various learning strategies such as DQN (Mnih et al., 2013), DDPG (Lillicrap et al., 2015), etc. However, arriving at a well-explored map of state transitions from scratch remains a challenging problem. In this regard, GSRL must jointly optimize map construction, goal selection, and value estimation.
As illustrated in Figure 2, we design GSRL around four steps: (a) at the beginning, the agent is located at the initial state and aims at the terminal state with an empty map; (b) the agent then explores the environment with the current policy and records the explored states on the map; (c) considering the long distance to the terminal state and the sparse reward during exploration, we decompose the whole task into a sequence of easier tasks guided by goals; (d) we leverage the structured map information to select related trajectories for updating the current policy. The agent repeats procedures (b)-(d) and finally reaches the terminal state with a well-explored map. Our primary contribution is a novel algorithm that constructs a dynamic state-transition graph from scratch and leverages structured information both to find an appropriate goal over a long horizon and to sample related trajectories for value estimation over a short horizon. This framework provides several benefits: (1) Balancing exploration and exploitation is a fundamental problem in reinforcement learning (Andrychowicz et al., 2017; Nachum et al., 2018). GSRL learns an attention mechanism on the graph to decide whether to explore states with high uncertainty or exploit states with well-estimated values. (2) Almost all existing reinforcement learning approaches (Paul et al., 2019; Eysenbach et al., 2019) use randomly sampled trajectories to update the current policy, leading to inefficiency. GSRL selects highly related trajectories for updates through the local structure of the graph. (3) Unlike previous approaches (Kipf et al., 2020; Shang et al., 2019), GSRL constructs a dynamic graph through agent exploration without any additional object-representation techniques and thus is not limited to a few specific environments. Empirically, we find that our method constructs a useful map and learns an efficient policy for highly structured tasks.
Comparisons with state-of-the-art RL methods show that GSRL is substantially more effective on these highly structured tasks.

2. PRELIMINARIES

Let M = (S, A, T, P, R) be a Markov Decision Process (MDP) where S is the state space, A is the action space whose size is bounded by a constant, T ∈ Z+ is the episode length, P : S × A → Δ(S) is the transition function which takes a state-action pair and returns a distribution over states, and R : S × A → Δ(R) is the reward distribution. The rewards from the environment, called extrinsic rewards, are usually sparse and therefore difficult to learn from. To address this, goal-oriented RL introduces a goal g ∈ G to build dense intrinsic rewards R_g : S × A × G → Δ(R). To keep the learning procedure stable, the goal is only updated episodically. A policy π : S × G → Δ(A) prescribes a distribution over actions for each state and goal. At each timestep, the agent samples an action a ∼ π(s, g) and receives a corresponding reward r_g(s, a) that indicates whether or not the agent has reached the goal. The episode terminates only after T timesteps, even when the agent reaches the goal. The agent's task is to maximize its cumulative discounted future reward. We use an off-policy algorithm to learn such a policy, as well as its associated goal-oriented Q-function and value function:

Q^π(s, a, g) = E[ Σ_{t=0}^{T-1} γ^t · r_g(s_t, a_t) | s_0 = s, a_0 = a, π ],   V*(s, g) = max_a Q*(s, a, g).   (1)

We use the off-policy algorithms DQN (Mnih et al., 2013) for discrete action spaces and DDPG (Lillicrap et al., 2015) for continuous action spaces to learn Q-functions and policies from off-policy data (i.e., data stored in the replay buffer).
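To make the goal-oriented setup concrete, the Q-function and value function of Eq. (1) can be sketched in tabular form. This is an illustrative Python sketch under our own simplifying assumptions, not the paper's implementation: the `goal_reward` helper and the class are hypothetical names, and the actual experiments use DQN/DDPG function approximators rather than a table.

```python
from collections import defaultdict

def goal_reward(state, goal):
    """Sparse intrinsic reward r_g: 1 when the agent has reached the goal, else 0.
    (Hypothetical helper; the paper leaves the exact shaping to the goal-oriented setup.)"""
    return 1.0 if state == goal else 0.0

class GoalConditionedQ:
    """Tabular stand-in for the goal-conditioned Q-function Q(s, a, g) of Eq. (1)."""
    def __init__(self, actions, gamma=0.99, lr=0.5):
        self.q = defaultdict(float)   # keys: (state, action, goal)
        self.actions = actions
        self.gamma, self.lr = gamma, lr

    def update(self, s, a, g, r, s_next):
        # One-step TD target: r_g + gamma * max_a' Q(s', a', g)
        target = r + self.gamma * max(self.q[(s_next, b, g)] for b in self.actions)
        self.q[(s, a, g)] += self.lr * (target - self.q[(s, a, g)])

    def value(self, s, g):
        # V(s, g) = max_a Q(s, a, g)
        return max(self.q[(s, a, g)] for a in self.actions)
```

With a learning rate of 1.0 in a deterministic toy chain, one update on a goal-reaching transition propagates value backwards exactly as the Bellman backup prescribes.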

3. GRAPH STRUCTURED REINFORCEMENT LEARNING

As illustrated in Figure 1, GSRL constructs a dynamic graph on top of state transitions. However, learning and planning on the state-transition graph are non-trivial. In this section, we provide a theoretical analysis showing that exploration without any constraint leads to an explosion of the graph. To guide exploration with directions, we develop a novel goal-oriented RL framework which incorporates the structural information of the state-transition graph. Specifically, we first divide states into several groups according to their uncertainty and location in the graph. We then adopt an attention mechanism to select an appropriate group and assign the state with the highest value in the graph as a goal to encourage further exploration. We also propose to update goal generation in hindsight and value estimation with related trajectories, to help RL agents learn efficiently.

3.1. EXPLORE WITH DYNAMIC GRAPH

We directly build a dynamic, directed graph from sampled trajectories in the replay buffer. Let an agent start from an initial state s_0. Then G_t^e grows from a single node (i.e., G_0^0 = {s_0}) and expands at each timestep, yielding the sequence {{G_t^1}_{t=0}^{T-1}, {G_t^2}_{t=0}^{T-1}, ..., {G_t^E}_{t=0}^{T-1}}, where T denotes the episode length and E the number of episodes. Since the state-transition graph G_t^e is always a connected directed graph, we describe its expansion behavior as consecutive expansion, meaning no jumping across neighborhoods is allowed. Denoting the graph increment at timestep t in episode e as ΔG_t^e, we can ensure that G_t^e ∪ ΔG_t^e = G_{t+1}^e ⊆ G_t^e ∪ ∂G_t^e.

Proposition 1 (adapted from Xu et al. (2020)). Assume that the probability of the degree of an arbitrary state in the whole state-transition graph G_whole being at most d is larger than p (i.e., P(deg(s) ≤ d) > p, ∀s ∈ S_G). Consider a sequence of consecutively expanding sub-graphs {{G_t^1}_{t=0}^{T-1}, ..., {G_t^E}_{t=0}^{T-1}}, starting with G_0^1 = {s_0}, for all t ≥ 0, e ≥ 1. Then P(|S_{G_t^e}| ≤ ℓ) > p, where ℓ = (d·(d-1)^{T·e+t} - 2)/(d-2) when d > 2, and ℓ = 1, 3 when d = 1, 2, respectively.

Proof. The basic proof structure follows Xu et al. (2020); however, because our task and detailed setting are considerably different, several modifications are required, as described in Appendix D.1.

The proposition implies that even if the whole state-transition graph G_whole has a small d and a large p (i.e., a sparse graph), the guaranteed upper bound on |S_{G_t^e}| becomes looser and weaker as t and e grow. In other words, randomly appending to G_t^e through exploration increases the computational complexity of learning and planning on the dynamic graph.
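The consecutive expansion of G_t^e can be sketched with a simple adjacency structure. The class below is a hypothetical, minimal Python illustration (the paper builds the graph over the replay buffer, and we assume a discrete, deterministic environment with a known action-space size); the out-degree count anticipates the certainty measure of Definition 1, and the boundary is the set of states with untried actions.

```python
class TransitionGraph:
    """Directed state-transition graph G_t^e grown from replay-buffer transitions.
    Out-degree approximates the 'certainty' of a state (Definition 1); boundary
    states are those with untried actions (out-degree < |A|). A sketch under a
    deterministic, discrete-state assumption, not the paper's implementation."""
    def __init__(self, s0, num_actions):
        self.adj = {s0: set()}        # adjacency: state -> set of successor states
        self.num_actions = num_actions

    def add_transition(self, s, s_next):
        self.adj.setdefault(s, set())
        self.adj.setdefault(s_next, set())
        self.adj[s].add(s_next)       # consecutive expansion: s is already in G

    def certainty(self, s):
        # cert(s) ~ deg(s): number of distinct actions already taken from s
        return len(self.adj[s])

    def boundary(self):
        # ∂G: states whose out-degree is below the action-space size
        return {s for s in self.adj if len(self.adj[s]) < self.num_actions}
```

Each new transition can only attach to a state already on the graph, so the graph stays connected, mirroring the consecutive-expansion property G_t^e ∪ ΔG_t^e = G_{t+1}^e.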
In order to prevent G_t^e from exploding, we need to constrain ΔG_t^e by guiding exploration with goals. In this paper, we propose to leverage the graph structure to assist agent exploration. Before detailing the exploration strategy, we first introduce the notion of certainty.

Definition 1. Given a state s, we define its certainty as the number of candidate actions that have already been taken.

In this paper, we build the state-transition graph G on the replay buffer and can thus adopt the out-degree* of each state to approximately measure certainty; that is, cert(s) ≈ deg(s), where deg(s) denotes the degree of node s. Note that the certainty of Definition 1 serves as a local measurement of the extent of exploration at a state, which is proportional to the global measurement (i.e., the number of visited states) in a deterministic environment (see Appendix E.1 for details). Analogously, we define the uncertainty of a state as the number of candidate actions not yet taken.

Exploration Strategy. A simple but effective approach is to explore via goal-oriented RL, which decomposes the whole task into a sequence of goal-oriented tasks. In this case, the states newly explored in each episode are guided by a specific goal. We therefore constrain the state-transition graph from exploding, and balance exploration and exploitation, by assigning an appropriate goal to the agent. To make "appropriate" precise, we first define the optimal goal. As shown in Figure 4, once we have access to the whole state-transition graph G_whole, we can obtain the optimal solution with a shortest-path planning algorithm such as Dijkstra's algorithm. We define the optimal goal as the terminal state on this path.

Definition 2.
For a fully explored state-transition graph G_full†, we define the optimal goal g*_full as the terminal state (e.g., s_10 in Figure 4(a)). For any not fully explored state-transition graph G_0^e, we define the optimal goal g*_e in hindsight: we generate the optimal solution path P_{e+1} by shortest-path planning on the next-episode graph G_0^{e+1} and take the reachable terminal state included in both G_0^e and P_{e+1} as g*_e (e.g., s_6 in Figure 4(c)).

Note that the optimal goal is generated in hindsight and keeps approaching the terminal state during exploration. We further analyze and discuss the relations between the goals in this paper and the prior goal-oriented RL literature in Appendix E.2.

Proposition 2. Assume that the Q-value of each state is well estimated (i.e., Q = Q*). Then the optimal goal g*_e at the beginning timestep of any episode e is always included in the boundary of the state-transition graph G_0^e (i.e., ∂G_0^e).

Proof. The proof of Proposition 2 can be found in Appendix D.2.

The proposition implies that the candidate set for goal generation can be limited to the boundary. However, learning to generate an appropriate goal is non-trivial on a dynamic, large-scale state-transition graph G_t^e. One intuitive solution is to divide the whole candidate goal space (i.e., ∂G_t^e) into several candidate groups C_1, ..., C_N, where N is the number of groups. To both cover all the states and control the exploration size, the group segmentation should satisfy ∪_{n=1}^N C_n = ∂G_t^e and C_m ∩ C_n = ∅ for any m ≠ n; m, n = 1, 2, ..., N. Here, we set the last visited state s_last of the previous episode as C_0, since s_last is often both close to the terminal state and of high uncertainty. As shown in Figure 3, based on the last explored state, we build groups C_n, n = 1, ..., N by extending scales from two different perspectives.
We list the two perspectives as follows:

• Extending from Neighbor Nodes. One perspective is to leverage the local structure of the graph, where C_n := {s : s ∈ N_n(s_last) ∩ ∂G_t^e} for n = 1, 2, ..., N-1, and C_N := ∂G_t^e \ ∪_{n=1}^{N-1} C_n. Here N_n(s_last) denotes the n-hop neighbors of s_last. The motivation is to keep the learning procedure stable by slowly adjusting the goal to explore the surrounding environment. To keep all groups non-empty, we append s_last to them at the beginning timestep of each episode.

However, RL algorithms face an exploration-exploitation dilemma, which requires the agent to make an appropriate selection over these groups rather than explore them at random. To trade off exploration and exploitation, we apply an attention mechanism that selects an appropriate group for the current situation. For each group, we learn an embedding vector representing its feature (e.g., certainty, when extending from uncertain nodes). The attention module runs over the N groups at each episode and selects the appropriate one, which requires taking the features of the other groups into consideration; it is therefore natural to adopt a self-attention mechanism (Vaswani et al., 2017). For simplicity, we denote all the group features as [f_1, ..., f_N] and define Q = K = V := (f_1, ..., f_N)^T ∈ R^{N×d}. The self-attention is then ATT_φ(C_1, ..., C_N) = softmax(QK^T/√N)·V, where ATT_φ denotes the self-attention function parameterized by φ. Its output is fed to a multi-layer perceptron (MLP) with ReLU activations, whose output lies in R^{N×1}. The selected group is obtained as C_ATT = argmax_{C_n} σ(MLP(ATT_φ(C_1, ..., C_N))), where σ(·) denotes a sigmoid function and C_ATT is the group selected according to the attention score.
At the beginning timestep of any episode e, we have access to the state-transition graph G_0^e and its boundary ∂G_0^e. We divide ∂G_0^e into groups C_1, ..., C_N and obtain C_ATT through the attention mechanism above. We then select the state with the highest value in C_ATT as the goal: g_e = argmax_{s ∈ C_ATT} V(s, g_{e-1}). In this way, we generate an appropriate goal g_e to guide the agent's exploration in episode e.
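The attention-based group selection and the subsequent goal pick can be sketched numerically. This is an illustrative sketch, not the paper's trained model: the learned MLP score head is replaced by an untrained mean over feature dimensions purely so the example is self-contained, and the feature vectors and value table below are invented.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def select_group(features):
    """Self-attention over group feature vectors [f_1, ..., f_N] with Q = K = V,
    scaled by sqrt(N) as in the paper; returns the index of the selected group
    C_ATT. The learned MLP head is replaced by a mean over feature dimensions
    purely for illustration."""
    F = np.asarray(features, dtype=float)          # shape (N, d)
    N = F.shape[0]
    att = softmax(F @ F.T / np.sqrt(N)) @ F        # ATT_phi(C_1, ..., C_N)
    scores = att.mean(axis=1)                      # stand-in for sigmoid(MLP(.))
    return int(np.argmax(scores))

def select_goal(group_states, values):
    """g_e = argmax_{s in C_ATT} V(s, g_{e-1}), given a value lookup table."""
    return max(group_states, key=lambda s: values[s])
```

The attention pass lets each group's score depend on its similarity to the other groups' features, which is the stated reason for preferring self-attention over scoring groups independently.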

3.2. LEARN WITH GRAPH STRUCTURED REINFORCEMENT LEARNING

As illustrated in Figure 1, we update goal generation with supervised learning and value estimation with reinforcement learning. Specifically, we introduce the learning strategies as follows:

Goal Learning Strategy. As mentioned above, we learn goal generation in hindsight. As illustrated in Figure 4, at the beginning timestep of each episode e+1, we first compute the shortest path between the initial state (s_0 in (c)) and the highest-value state (s_9 in (c)) to obtain the solution path (P_{e+1} = s_0, s_1, s_3, s_6, s_9 in (d)). We then find the optimal goal g*_e at episode e based on the solution path generated at episode e+1. Specifically, we search for the intersection state between P_{e+1} and ∂G_t^e in the reverse order of the solution path. Since s_9 is unexplored and unreachable in episode e, we select s_6 as the optimal goal g*_e. We then find the optimal group C* that contains g*_e. With this hindsight supervision on group selection, we are able to update the attention mechanism in Eq. (3) via a standard supervised learning algorithm, with objective

L_φ = E_{(s,a,g,s',r)∼D} [ (C* - C_ATT)^2 ] + α·‖φ‖^2,

where ‖φ‖^2 is a regularizer and α the corresponding hyper-parameter. Updating goal generation under the supervision of the group rather than the goal eliminates the instability introduced by potentially inaccurate value estimation, because our group division does not involve value estimates.

Value Learning Strategy.
With a goal generated at the beginning of each episode, we can build the central component of our method: a goal-conditioned policy and its associated value function. We consider a goal-reaching agent interacting with an environment. The agent observes its current state s ∈ S and a goal state g ∈ G; the dynamics are governed by the distribution P(s_{t+1}|s_t, a_t). At every timestep, the agent samples an action a ∼ π(a|s, g) and receives a corresponding reward r_g(s, a) that indicates whether the agent has reached the goal. The episode terminates after T timesteps, even if the agent reaches the goal. The agent's task is to maximize its cumulative discounted reward. We use an off-policy algorithm to learn such a policy, as well as its associated goal-conditioned Q-function; for example, we obtain a policy by acting greedily w.r.t. the Q-function, with update Q(s, a, g) ← r_g(s, a) + γ·max_{a'} Q(s', a', g). To improve data efficiency and obtain good value estimates, we choose an off-policy RL algorithm with goal relabelling, such as Andrychowicz et al. (2017), and update parameters with related trajectories. We define related data as those trajectories that contain neighborhood nodes of the current state. Formally, when we update the Q-function of state s, we sample D_related := {τ} from the replay buffer D, where τ denotes a trajectory containing at least one state in s's neighborhood N_1(s). The Q-network is learned by minimizing the objective

L_θ = E_{(s,a,g,s',r)∼D_related} [ (r_g + γ·max_{a'} Q_θ(s', a', g) - Q_θ(s, a, g))^2 ] + β·‖θ‖^2,

where β is the weight of the regularization term. Moreover, as we show in Proposition 3, our GSRL algorithm converges to a unique optimal point when the Q-learning strategy is adopted.

Overall Algorithm.
We provide the overall algorithm in Algorithm 1 in Appendix B, and an illustrated example in Figure 8 in Appendix C. One can see that our state-transition graph is applied in both the learning and planning phases, whereas previous graph-based RL algorithms focus on either learning (Zhu et al., 2019) or planning (Eysenbach et al., 2019) with the structured information of the state-transition graph.
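The two learning strategies above rely on shortest-path planning over the graph and on neighborhood-based trajectory filtering. Both can be sketched as follows. This is an illustrative Python sketch with hypothetical data layouts (adjacency dict, trajectories as lists of transition tuples); on an unweighted graph, shortest-path planning such as Dijkstra's algorithm reduces to breadth-first search.

```python
from collections import deque

def shortest_path(adj, start, target):
    """BFS shortest path on an unweighted directed graph (Dijkstra reduces to
    BFS here). Returns the path as a list of states, or None if unreachable."""
    prev, seen = {}, {start}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        if s == target:
            path = [s]
            while path[-1] != start:
                path.append(prev[path[-1]])
            return path[::-1]
        for nxt in adj.get(s, ()):
            if nxt not in seen:
                seen.add(nxt)
                prev[nxt] = s
                queue.append(nxt)
    return None

def hindsight_optimal_goal(path_next, nodes_current):
    """Optimal goal g*_e (Definition 2): scan the next-episode solution path
    P_{e+1} in reverse and return the first state already present in G_0^e."""
    for s in reversed(path_next):
        if s in nodes_current:
            return s
    return None

def related_trajectories(replay_buffer, state, neighbors):
    """D_related: trajectories containing at least one state in N_1(s) or s
    itself. replay_buffer is a list of trajectories; each trajectory is a list
    of (s, a, g, s_next, r) tuples. A sketch of the sampling rule only."""
    related = set(neighbors) | {state}
    return [traj for traj in replay_buffer
            if any(s in related or s_next in related
                   for (s, a, g, s_next, r) in traj)]
```

On the paper's Figure 4 example, planning on the next-episode graph yields the path s_0, s_1, s_3, s_6, s_9, and scanning it in reverse against the current graph recovers s_6 as g*_e.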

4. EXPERIMENTS

In this section, we perform an experimental evaluation of our algorithm and compare it with other state-of-the-art methods. Our experiment environments are based on the standard robotic manipulation environments in OpenAI Gym (Brockman et al., 2016). We provide experimental results to answer the following questions: 1. Can GSRL obtain better convergence in various environments? 2. Can GSRL tackle a high-dimensional and continuous environment with obstacles? 3. Does GSRL achieve higher sample efficiency? 4. What is the influence of group selection for goal generation and of related experience for value estimation in GSRL?
To answer the first question, we demonstrate our method on various robotic manipulation tasks, including a reach task (see Figure 5(a)) and a fetch task (see Figure 5(c)). To answer the second, we investigate how our method performs in an environment with high-dimensional state and action spaces (see Figure 5(b)) and with an obstacle (see Figure 5(d)). We answer the third by comparing the convergence of learning curves in Figure 6; regarding sample efficiency, we further analyze exploration performance in Appendix F.5. Specifically, we conduct experiments against previous approaches as baselines. Note that the graph structure in GSRL is constructed on top of the replay buffer for goal generation and value estimation, and can be closely combined with policy networks such as DQN (Mnih et al., 2013), DDPG (Lillicrap et al., 2015), etc. To demonstrate the real performance gain from GSRL, we use DDPG as the policy network for GSRL and all baselines. Detailed descriptions of the environments, experiment settings, and implementation can be found in Appendix F.1.

Maze. We first test our method and other strong baselines in the Maze environment, where the ant agent learns to reach a specified target position (the ball depicted in red) located at the other end of a U-turn, as shown in Figure 5(a). This environment is quite simple, and most RL approaches converge with a success rate of 1. As Figure 6(a) illustrates, GSRL performs with high sample efficiency.

AntMaze. We show that GSRL is efficient in a complex robotic-agent navigation environment, as illustrated in Figure 5. Since the fetch tasks are more complicated than the reach ones in the maze, GSRL also yields a large performance gain, as shown in Figure 6(c).

FetchPush with Obstacle.
As Figure 5(d) illustrates, we create an environment based on FetchPush with a rigid obstacle, where the brown block is a static wall that cannot be moved. Experimental results in Figure 6(d) suggest that the graph structure of state transitions provides additional useful information, which leads to better guidance for goal generation and value estimation.

Impact of Environment Size. To investigate whether GSRL adapts well to environments of different sizes, we extend the Maze environment shown in Figure 5(a); a larger maze usually means sparser rewards. We report performance comparisons on three environments of SMALL, MEDIUM, and LARGE sizes in Figure 6(e). More details of these environments are available in Appendix F.1. One can see that GSRL addresses the sparse-reward issue well across these environments.

Impact of Environment Complexity. To investigate whether GSRL adapts well to environments of different complexity levels, we extend the Maze environment with a more complex, high-dimensional structure: we create the AntMaze with Obstacle environment by adding an obstacle to AntMaze. The performance of GSRL in this environment is illustrated in Figure 6(f). More details of these environments are available in Appendix F.1. One can see that GSRL adapts well to environments of varying complexity.

Impact of Group Selection. We provide two strategies to divide the boundary of the graph into groups, namely extending from neighbor nodes and extending from uncertain nodes (Section 3). We demonstrate the performance of these two strategies in Figure 6(g). We denote GSRL without the attention strategy over groups as NOGroup. We adopt the first strategy with the number of groups set to 3, which corresponds to NEIGH3.
We then show the performance of the second strategy with the number of groups equal to 2, 3, and 4, which correspond to UNCERT2, UNCERT3, and UNCERT4. Results show that GSRL without any group selection leads to inefficiency, since the goal-generation strategy at the beginning is almost random. Both strategies perform well, as illustrated. We also provide a complexity analysis in Appendix E.3. In the main experiment, we adopt extending from uncertain nodes with 3 groups (i.e., UNCERT3). Impact of Discretization. We apply the K-bins discretization technique (Kotsiantis & Kanellopoulos, 2006) to discretize the continuous state and action spaces, where a wrapper converts a Box observation into a single integer. DISCRETE10, DISCRETE20, and DISCRETE30 in Figure 6(h) denote the performance of K = 10, 20, and 30, respectively. We find that for simple tasks the choice of K is not critical; K is set to 20 in the main experiment. Impact of Related Experience. We further study the impact of using the neighborhood structure of the state-transition graph to select related experience for efficient value estimation. We build a variant that updates the value estimate from transitions drawn uniformly from the replay buffer D, denoted ALL; RELATED denotes GSRL, where D_related is used instead of D.
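For concreteness, the K-bins discretization described above can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function names and the mixed-radix combination of per-dimension bin indices into a single integer are our assumptions; the paper only states that a wrapper converts a Box observation into one integer.

```python
import numpy as np

def make_kbins_discretizer(low, high, k):
    """Map a continuous Box observation to a single integer (hypothetical sketch).

    Each dimension is split into k equal-width bins; per-dimension bin
    indices are then combined via a mixed-radix (base-k) encoding.
    """
    low, high = np.asarray(low, dtype=float), np.asarray(high, dtype=float)
    width = (high - low) / k

    def discretize(obs):
        obs = np.asarray(obs, dtype=float)
        idx = np.clip(((obs - low) / width).astype(int), 0, k - 1)
        code = 0
        for i in idx:                      # mixed-radix: base-k digits
            code = code * k + int(i)
        return code

    return discretize

# Example: a 2D state in [0, 15]^2 with K = 20 bins per dimension
disc = make_kbins_discretizer([0.0, 0.0], [15.0, 15.0], k=20)
```

With K = 20 and a 2D maze state, every observation collapses to one of 400 integer codes, which is what allows the state-transition graph to have a finite node set.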

5. RELATED WORK

Structured Models of Environments. Recent work has investigated leveraging structured environments to make great strides in improving predictive accuracy (Kipf et al., 2018; Xu et al., 2019; Chang et al., 2017; Battaglia et al., 2016; Watters et al., 2017; Sanchez-Gonzalez et al., 2018) and accelerating reinforcement learning procedures (Shang et al., 2019; Kipf et al., 2020; Eysenbach et al., 2019; Wang et al., 2018). These methods typically construct some form of graph neural network, where node update functions model the dynamics of individual objects, parts, or agents, and edge update functions model their interactions and relations, through unsupervised object discovery (Xu et al., 2019), contrastive learning (Kipf et al., 2020), or agent exploration (Eysenbach et al., 2019). Unlike these previous works based on well-organized graphs, our model GSRL constructs and makes use of the graph from scratch. Specifically, GSRL builds a dynamic map via exploration and drives efficient exploration with structured information from this map, which allows our approach to be widely deployed in more complex environments. Goal-oriented Reinforcement Learning. Goal-oriented RL allows an agent to generate intrinsic rewards, defined with respect to target subsets of the state space called goals (Florensa et al., 2018; Paul et al., 2019). Recently, goal-oriented RL has been investigated widely in various deep RL scenarios, such as imitation learning (Pathak et al., 2018; Srinivas et al., 2018), disentangling task knowledge from the environment (Mao et al., 2018a; Ghosh et al., 2019), constituting lower-level controllers in hierarchical RL (Shang et al., 2019; Nachum et al., 2018), and organizing cooperation in multi-agent RL (Jin et al., 2019; Yang et al., 2020). Hence, generating appropriate goals is the essential technique in any goal-oriented RL method (Andrychowicz et al., 2017; Ren et al., 2019). Eysenbach et al.
(2019) proposed to utilize shortest-path search on the replay buffer to generate a sequence of goals. Instead of planning goals from a well-explored state-transition graph, in this paper we investigate guiding exploration with structured information from a dynamic graph, which can largely improve sample efficiency in tasks with sparse reward signals. Zhao et al. (2019) proposed to maximize the entropy of selected trajectories; our method instead utilizes the structure information in the state-transition graph to select related trajectories for learning. Hierarchical Reinforcement Learning. Hierarchical RL learns a set of primitive tasks that together help an agent learn a complex task. There are mainly two lines of work. One class of algorithms (Shang et al., 2019; Nachum et al., 2018; Bacon et al., 2017; Vezhnevets et al., 2017) jointly learns a low-level policy together with a high-level policy, where the low-level policy interacts directly with the environment to achieve each task, while the high-level policy instructs the low-level policy via high-level actions or goals to sequence these tasks into the complex task. The other class of methods (Bagaria & Konidaris, 2019; Fox et al., 2017; Hartikainen et al., 2019; Pitis et al., 2020; Pong et al., 2019) focuses on discovering sub-goals that are easy to reach in a short time, or options, lower-level control primitives that can be invoked by a meta-control policy. The common idea GSRL shares with hierarchical RL is to decompose the complex task into several sub-tasks. The key difference is that we propose to build the state-transition graph and utilize its structure information for goal generation and value estimation.

6. CONCLUSION

In this paper, we propose a novel framework called GSRL, which leverages structure information of the state-transition graph for efficient goal generation and value estimation. We provide theoretical analysis to show the efficiency and convergence property of our method.

A STATE-TRANSITION GRAPH CONSTRUCTION

In this paper, we construct a state-transition graph on top of the replay buffer. As Figure 7(a) shows, we build the graph G^e_t based on historical explored trajectories at timestep t in episode e. For any not fully explored state-transition graph, there exist many poorly explored states. We measure the exploration (i.e., certainty in Definition 1) of these states according to the number of their untaken candidate actions. As illustrated in Figure 7(b), we define the boundary of G^e_t as the set of states at least one of whose candidate actions has not yet been taken. Each untaken action may lead to unvisited states (denoted by a ? icon). We denote the boundary as ∂G^e_t. As illustrated in Figure 7(c), at timestep t + 1 the agent explores new state-transitions, denoted ∆G^e_t. Then, G^e_t and ∆G^e_t together make up the dynamic graph at timestep t + 1, denoted G^e_{t+1}.
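The dynamic graph and its boundary ∂G described above can be sketched with a few lines of Python. This is a hedged sketch under our own assumptions (a known, finite candidate-action set per state; the class and method names are hypothetical), not the authors' code.

```python
from collections import defaultdict

class TransitionGraph:
    """Minimal sketch of the dynamic state-transition graph.

    `add` grows the graph by one observed transition (the increment ΔG),
    and `boundary` returns ∂G: states with at least one untaken action.
    """
    def __init__(self, n_actions):
        self.n_actions = n_actions       # size of the candidate-action set
        self.taken = defaultdict(set)    # state -> actions already taken there
        self.edges = defaultdict(set)    # state -> successor states

    def add(self, s, a, s_next):
        self.taken[s].add(a)
        self.edges[s].add(s_next)
        _ = self.taken[s_next]           # register s_next as a node of G

    def boundary(self):
        # ∂G: every state with at least one candidate action not yet taken
        return {s for s, acts in self.taken.items()
                if len(acts) < self.n_actions}
```

A state leaves the boundary only once all of its candidate actions have been tried, which matches the certainty measure of Definition 1.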

B ALGORITHM

Under review as a conference paper at ICLR 2021

Algorithm 1 Graph Structured Reinforcement Learning (GSRL)
1: Initialize replay buffer D = {s_0} and state-transition graph G = {s_0}
2: for episode number e = 1, 2, . . . , E do
3:   Select an appropriate group for exploration according to Eq. (3)
4:   Generate goal g_e according to Eq. (4)
5:   for timestep t = 0, 1, 2, . . .
     . . .
17:  Compute optimal goal g*_e according to Definition 2
18:  Update parameter φ using Eq. (5)
19: end for

We provide the overall algorithm in Algorithm 1. The key contribution of our paper is to leverage structured information in the state-transition graph for efficient goal generation and value estimation, represented in lines 4 and 13, respectively. We then describe the overall procedure of GSRL according to Algorithm 1 as follows. When the task starts, there is no graph structure to support the agent; hence, the agent initializes the replay buffer D and the state-transition graph G in line 1. At the beginning timestep of each episode e, we divide the boundary of the state-transition graph ∂G^e_0 into N groups and adopt an attention mechanism to select an appropriate one for exploration in line 3. Within the selected group, we choose the state with the highest value as the generated goal g_e in line 4. The agent tries to reach the goal state through the current policy-based Q value in line 7 and records the interaction history in the replay buffer in line 9. As goal-oriented RL provides the agent an intrinsic reward conditioned on the current goal, the agent relabels the reward with r_g conditioned on g_e in line 10. The agent then updates the state-transition graph in line 11. In order to update the policy efficiently, the agent samples related trajectories that contain at least one neighbor state of the current state in line 13. In line 14, the policy is updated with DDPG (Lillicrap et al., 2015).
At the end of each episode e, the state-transition graph is built for episode e + 1, denoted G^{e+1}_0. We provide an illustrated example of GSRL in Figure 8. When an agent encounters a complex environment, GSRL first discretizes the continuous space into a discrete one, as illustrated in (a). We color the last state visited in the previous episode (i.e., s_last = s_7) red. Meanwhile, the historical trajectories stored in the replay buffer can be represented as several paths (e.g., s_0, s_1, s_3, s_6). With these paths, we can easily construct a state-transition graph at the beginning timestep of the episode (i.e., G^e_0). As illustrated in (b), we can find the boundary of the graph (i.e., ∂G^e_t) as described in Appendix A. In this case, s_5 and s_6 are in the boundary of the current graph (i.e., ∂G^e_0). We then follow extending from uncertain nodes to divide the boundary into several groups. The number of groups is a hyper-parameter, and we choose 3 here. We include s_last in each group, since s_last often has a high value and is close to the terminal state. As shown in (c), we then adopt an attention mechanism to select one group C_ATT and assign the state with the highest value in the selected group as the goal g_e. With the generated goal, the agent can perform goal-oriented exploration, dividing the complex task into several sub-tasks. There are two learned parts in GSRL, namely the attention network for goal generation and the value network for value estimation. We design a hindsight learning approach to update the attention network at the beginning timestep of each episode (as shown in Figure 4), and use related experiences to update the value network at each update interval (as shown in Figure 2).
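The goal-generation step described above (group the boundary, attend over groups, take the highest-value state) can be sketched as follows. The grouping rule (equal-size chunks of the boundary sorted by certainty), the softmax attention over per-group logits, and all names are our assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def generate_goal(boundary, certainty, value, attn_logits, n_groups=3, s_last=None):
    """Hedged sketch of GSRL goal generation (Algorithm 1, line 4).

    boundary: set of boundary states ∂G; certainty/value: per-state scores;
    attn_logits: one learned logit per group (placeholder for the
    attention network's output).
    """
    ordered = sorted(boundary, key=lambda s: certainty[s])
    groups = [list(ordered[i::n_groups]) for i in range(n_groups)]
    if s_last is not None:                      # s_last joins every group
        for g in groups:
            g.append(s_last)
    probs = np.exp(attn_logits - np.max(attn_logits))
    probs /= probs.sum()                        # softmax attention over groups
    chosen = groups[int(np.random.choice(n_groups, p=probs))]
    return max(chosen, key=lambda s: value[s])  # highest-value state as goal
```

Sampling (rather than arg-maxing) the group keeps some stochasticity in the exploration direction while the attention network is still learning.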

D PROOFS D.1 PROOF OF PROPOSITION 1

Proposition 1 (adopted from Xu et al. (2020)). We assume that the probability of the degree of an arbitrary state in the whole state-transition graph G_whole being less than or equal to d is larger than p (i.e., P(deg(s) ≤ d) > p, ∀s ∈ S_G). We then consider a sequence of consecutively expanding sub-graphs {{G^1_t}_{t=0}^{T-1}, {G^2_t}_{t=0}^{T-1}, . . . , {G^E_t}_{t=0}^{T-1}}, starting with G^1_0 = {s_0}, for all t ≥ 0, e ≥ 1. We can ensure that P(|S_{G^e_t}| ≤ ε) > p̂, where ε = (d·(d−1)^{T·e+t} − 2)/(d−2) and p̂ = p^{(d·(d−1)^{T·e+t−1} − 2)/(d−2)} when d > 2, and ε = 1, 3 when d = 1, 2, respectively.

Proof. In Xu et al. (2020), a new node of the current graph can be sampled directly from the neighborhood. As illustrated in Figure 9, since the graph in this paper is a state-transition graph, we cannot sample directly from the neighborhood; we can only obtain new states through explored trajectories (e.g., s_0, s_1, s_3, s_6, s_9 in Figure 9). Therefore, we propose to consecutively expand the sub-graphs at the timestep level (i.e., {{G^1_t}_{t=0}^{T-1}, . . . , {G^E_t}_{t=0}^{T-1}}, starting with G^1_0 = {s_0}). Although we follow the framework in Xu et al. (2020), several modifications are required. We provide the detailed proof as follows. We consider the extreme case of greedy consecutive expansion at each timestep t in any episode e, where G^e_{t+1} = G^e_t ∪ ∆G^e_t = G^e_t ∪ ∂G^e_t, since if this case satisfies the inequality, any case of consecutive expansion also satisfies it. By definition, every sub-graph G^e_t is a connected graph. Here, we write ∆S^e_t for S_{∆G^e_t} for short. In each episode, we can ensure that the newly added nodes ∆S^e_t at timestep t only belong to the neighborhood of the last added nodes ∆S^e_{t−1}. Within each episode e, we study the sequence {∆G^e_0, ∆G^e_1, . . . , ∆G^e_{T−1}}, where T is the episode length.
In this case, each node in ∆S^e_t already has at least one edge within ∆G^e_{t−1}, due to the definition of connected graphs. We thus have P(|∆S^e_t| ≤ |∆S^e_{t−1}| · (d − 1)) > p^{|∆S^e_{t−1}|}. For e = 1 and t = 0, we have P(|∆S^1_1| ≤ d) > p and thus P(|S_{G^1_0}| ≤ 1 + d) > p. For e ≥ 1 and t ≥ 1, we analyze the consecutive expansion of the state-transition graph G as

G^1 → G^2 → · · · → G^E ⇒ G^1_0 → G^1_1 → · · · → G^1_{T−1} (= G^1) → G^2_0 → G^2_1 → · · · → G^2_{T−1} (= G^2) → · · · → G^E_0 → G^E_1 → · · · → G^E_{T−1} (= G^E). (10)

Given that |∆S_{G^e_t}| ≥ 1, ∀t ∈ [0, T−1], we consider the extreme case |∆S_{G^e_t}| = 1, ∀t ∈ [0, T−1], which means that every exploration results in a newly explored state and corresponds to the upper bound of the explosion. Based on |S_{G^e_t}| = 1 + |∆S_{G^1_0}| + |∆S_{G^1_1}| + · · · + |∆S_{G^1_{T−1}}| + |∆S_{G^2_0}| + · · · + |∆S_{G^2_{T−1}}| + · · · + |∆S_{G^e_0}| + · · · + |∆S_{G^e_t}|, we have

P(|S_{G^e_t}| ≤ 1 + d + d·(d−1) + · · · + d·(d−1)^{e·T+t−1}) > p^{1 + d + d·(d−1) + · · · + d·(d−1)^{e·T+t−2}}. (11)

When d = 1, there can be only one node, so in this case ε = 1. When d = 2, we follow Eq. (11) and derive that ε = 3. When d > 2, it holds that

P(|S_{G^e_t}| ≤ (d·(d−1)^{e·T+t} − 2)/(d−2)) > p^{(d·(d−1)^{e·T+t−1} − 2)/(d−2)}. (12)

We can find that t = 0 also satisfies this inequality.
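For completeness, the closed form of the bound in Eq. (12) is just the geometric-series sum of the expansion terms in Eq. (11); for d > 2:

```latex
|S_{G^e_t}| \;\le\; 1 + d\sum_{k=0}^{eT+t-1} (d-1)^k
           \;=\; 1 + d\cdot\frac{(d-1)^{eT+t}-1}{(d-1)-1}
           \;=\; \frac{(d-2) + d\,(d-1)^{eT+t} - d}{d-2}
           \;=\; \frac{d\,(d-1)^{eT+t}-2}{d-2}.
```

The exponent of p in Eq. (12) follows from the same sum truncated one term earlier.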

D.2 PROOF OF PROPOSITION 2

Proposition 2. Assume that the Q-value of each state is well estimated (i.e., Q̂ = Q*). Then the optimal goal g*_e at the beginning timestep of any episode e is always included in the boundary of the state-transition graph G^e_0 (i.e., ∂G^e_0).

Proof. According to Definition 2, as shown in Figure 10(a), in the fully explored graph G_full the optimal goal g*_full is the terminal state of the optimal solution P_full, which is also the terminal state of the environment (i.e., s_10). The intuition is natural: the environment in this case is fully explored, and thus the agent is ready to target the terminal state. In the other cases, we generate the optimal goal g*_e of episode e at episode e + 1. Specifically, we find the shortest path to the highest-value state in G^{e+1}_0 as the optimal solution path P_{e+1}. As Figure 10 illustrates, in episode e + 1 the highest-value state is s_9 and the optimal solution path is P_{e+1} = s_0, s_1, s_3, s_6, s_9. We then compare the explored states in G^e_0 with the states in P^inverse_{e+1}, where P^inverse_{e+1} = s_9, s_6, s_3, s_1, s_0 is the reverse order of P_{e+1}. As Figure 10(d) shows, we finally obtain s_6 as the optimal goal g*_e. From the above, it is easy to see that there are two cases in optimal goal generation: the optimal goal is either the last node of the solution path P_{e+1}, or one of the remaining nodes of P_{e+1}. We now prove that in both cases the optimal goal g*_e is always included in the boundary of the state-transition graph ∂G^e_0. Case I: Node at Last. If the Q-value of each state is well estimated, i.e., Q̂ = Q*, then the optimal solution path P_{e+1} at episode e + 1 should be close to the optimal solution path P_full in the full graph G_full, and the last state of the path P_{e+1} should be closest to the terminal state. Hence, if g*_e is not in the boundary, there must be one neighbor node closer to the terminal state.
Otherwise, g*_e would be a dead end and thus should not be regarded as the optimal goal. And if there is a neighbor node closer to the terminal state, then that state should be regarded as the optimal goal instead. Therefore, we obtain a contradiction. Case II: Node Not at Last. If the optimal goal is not the last state, then there must exist a state unexplored at episode e. Take Figure 10 as an example: if we take s_6 as the optimal goal g*_e in (d), state s_9 must be unexplored in G^e_0 in (c) and explored in G^{e+1}_0 in (b). If g*_e were not included in ∂G^e_0, then, by the definition of the boundary of the graph, no unexplored state could be included in its neighborhood, which contradicts the existence of such an unexplored neighbor. In summary, we have proved the proposition in both cases by contradiction.
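The optimal-goal construction used in this proof (walk the next episode's solution path backwards and take the first state already explored at episode e) can be written as a tiny helper. The function name and the set-based interface are hypothetical; the example reproduces Figure 10's scenario as described above.

```python
def optimal_goal(explored_prev, solution_path):
    """Sketch of the g*_e construction in the proof of Proposition 2.

    explored_prev: states explored in G^e_0;
    solution_path: the optimal solution path P_{e+1} of episode e+1.
    Returns the first state of the inverse path already explored at episode e.
    """
    for s in reversed(solution_path):   # traverse P^inverse_{e+1}
        if s in explored_prev:
            return s
    return None

# Figure 10's example: P_{e+1} = s0, s1, s3, s6, s9 with s9 unexplored in G^e_0
assert optimal_goal({0, 1, 3, 6}, [0, 1, 3, 6, 9]) == 6
```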

D.3 PROOF OF PROPOSITION 3

Proposition 3. Denote the Bellman backup operator in Eq. (6) as B : R^{|S|×|A|×|G|} → R^{|S|×|A|×|G|} and a mapping Q : S × A × G → R^{|S|×|A|×|G|} with |S| < ∞ and |A| < ∞. Repeated applications of the operator B to our graph-based state-action value estimate Q̂_G converge to a unique optimal value Q*_{G*}, with the well-explored graph G* (i.e., the fully explored graph G_full) including the optimal solution path.

Proof. The proof of Proposition 3 proceeds in two main steps. The first step is to show that our state-transition graph G converges to the well-explored graph G*. Here, we define G* as the graph that includes the optimal path (i.e., P_full in Definition 2). In the second step, we prove that given graph G, our graph-based method converges to the unique optimal value Q*_G.

Step I. Since |S| < ∞ and |A| < ∞, we obtain that |V_G| < ∞ and |E_G| < ∞. Note that the state-transition graph G is a dynamic graph, and goals g generated on G are updated at the beginning timestep of each episode. Hence, there is a sequence of goals denoted (g_1, g_2, · · · , g_E) and a corresponding sequence of graphs denoted (G^1_0, G^2_0, · · · , G^E_0), where E is the number of episodes. Given that |S| < ∞ and |A| < ∞, the number of nodes and edges in the full graph G_full is also bounded. Based on the exploration strategy introduced in Section 3, we know that goal-oriented RL will first search for a path leading to the terminal state. After that, the terminal state will be included in G. Then the agent will seek the shortest path to the terminal state, because the agent is given a negative reward at each timestep. Hence, the optimal solution path P_full will be involved, and we obtain

G^1_0 ⊆ G^2_0 ⊆ · · · ⊆ G* ⇒ G → G*. (13)

Assuming E is large enough, our state-transition graph G finally converges to the well-explored graph G*.

Step II.
Note that the proof of convergence for our graph-based goal-oriented RL is quite similar to that of Q-learning (Bellman, 1966; Bertsekas, 1995; Sutton & Barto, 2018). The differences between our approach and Q-learning are that the Q value Q(s, a, g) is also conditioned on the goal g, and that the state-transition probability P_G(s′|s, a) is reflected by the graph G. We provide the detailed proof as follows. For any state-transition graph G, we can obtain a goal g ∈ G conditioned on G from Step I. Based on that, the Bellman backup of our estimated graph-based action-value function Q̂_G is defined as

B Q̂_G(s, a, g) = R(s, a, g) + γ · max_{a′∈A} Σ_{s′∈S} P_G(s′|s, a) · Q̂_G(s′, a′, g).

For any two action-value estimates Q̂¹_G, Q̂²_G, we study

|B Q̂¹_G(s, a, g) − B Q̂²_G(s, a, g)|
= γ · |max_{a′∈A} Σ_{s′∈S} P_G(s′|s, a) · Q̂¹_G(s′, a′, g) − max_{a′∈A} Σ_{s′∈S} P_G(s′|s, a) · Q̂²_G(s′, a′, g)|
≤ γ · max_{a′∈A} |Σ_{s′∈S} P_G(s′|s, a) · Q̂¹_G(s′, a′, g) − Σ_{s′∈S} P_G(s′|s, a) · Q̂²_G(s′, a′, g)|
≤ γ · max_{a′∈A} Σ_{s′∈S} P_G(s′|s, a) · |Q̂¹_G(s′, a′, g) − Q̂²_G(s′, a′, g)|
≤ γ · max_{s∈S, a∈A} |Q̂¹_G(s, a, g) − Q̂²_G(s, a, g)|.

So the contraction property of the Bellman operator holds:

max_{s∈S, a∈A} |B Q̂¹_G(s, a, g) − B Q̂²_G(s, a, g)| ≤ γ · max_{s∈S, a∈A} |Q̂¹_G(s, a, g) − Q̂²_G(s, a, g)|.

For the fixed point Q*_G, we have

max_{s∈S, a∈A} |B Q̂_G(s, a, g) − Q*_G(s, a, g)| ≤ γ · max_{s∈S, a∈A} |Q̂_G(s, a, g) − Q*_G(s, a, g)| ⇒ Q̂_G → Q*_G.

Combining Steps I and II, we conclude that our graph-based estimated state-action value Q̂_G converges to the unique optimal value Q*_{G*}.
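The contraction argument in Step II can be checked numerically on a toy problem. The following is an illustrative sketch only: a 3-state deterministic chain with a fixed goal (so the goal argument is dropped), reward −1 per step and 0 at the goal state — all assumptions of ours, not the paper's setup. Two different Q initializations are driven to the same fixed point by repeated backups.

```python
import numpy as np

np.random.seed(0)
n_s, n_a, gamma = 3, 2, 0.9
P = np.array([[1, 0], [2, 0], [2, 2]])   # P[s, a] = deterministic next state
R = np.where(P == 2, 0.0, -1.0)          # reward 0 on reaching goal state s=2

def backup(Q):
    # B Q(s, a) = R(s, a) + γ · max_{a'} Q(s', a') with s' = P[s, a]
    return R + gamma * Q[P].max(axis=2)

Q1 = np.zeros((n_s, n_a))                # two different initializations
Q2 = np.random.randn(n_s, n_a)
for _ in range(200):
    Q1, Q2 = backup(Q1), backup(Q2)
```

Since the operator is a γ-contraction in the max norm, the gap between the two iterates shrinks by a factor of at least γ per backup, so after 200 backups both estimates coincide with the fixed point to numerical precision.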

E DISCUSSIONS E.1 DISCUSSION ON CERTAINTY OF STATE

In this section, we further discuss the relationship between the certainty of a state and the number of visited states. In the previous exploration RL literature (Ostrovski et al., 2017; Bellemare et al., 2016), the performance of exploration is often measured by the number of visited states: given a fixed number of episodes, the more visited states, the better the performance. In this paper, we propose a new measurement, the certainty of a state, as given in Definition 1. We summarize the relation between certainty and the number of visited states in Proposition 4.

E.2 DISCUSSION ON GOAL GENERATION

In the previous goal-oriented RL literature (Andrychowicz et al., 2017; Ren et al., 2019), which generated goals help the agent efficiently learn a well-performing policy is one of the key questions to be answered. The basic idea of the goal-oriented RL architecture is to generate goals that decompose the complex task into several goal-oriented tasks. In this paper, we analyze our generated goals from two perspectives, namely reachability and curriculum. Reachability. The first property required of the optimal goal is that the generated goal is guaranteed to be reachable for the agent. To this end, in this paper the candidate goal set is constrained to the visited states. In other words, a goal generated in episode e must have been visited before episode e. Therefore, we can guarantee that the generated goal is reachable. Curriculum. The second property is the curriculum, which means that our optimal goals are required to approach the terminal state during exploration. If the Q-value of each state is well estimated, our goal generation, under the supervision of forward-looking planning at the next episode, will focus on the potential highest-value states in the future, which is exactly the terminal state once the agent has full observation of the states.
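For concreteness, one way to express the certainty measure of Definition 1 (which this chunk only paraphrases as "the number of untaken candidate actions") is as the fraction of a state's candidate actions already taken. This fractional form, and the function interface, are our assumptions; the paper's exact definition may differ.

```python
def certainty(taken_actions, candidate_actions):
    """Hypothetical sketch of Definition 1's certainty measure.

    A fresh state (no actions taken) gets certainty 0; a fully explored
    state (no untaken candidate actions) gets certainty 1.
    """
    return len(taken_actions & candidate_actions) / max(len(candidate_actions), 1)
```

Under this reading, the boundary ∂G of Appendix A is exactly the set of states with certainty strictly below 1.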

E.3 DISCUSSION ON GROUP DIVISION

Motivation. The intuitive motivation behind the group division is natural. Proposition 1 implies that exploration on the state-transition graph G^e_t at timestep t in episode e without any constraint may lead to an explosion of the graph and inefficient exploration. Therefore, the agent is expected to explore within a limited domain. Considering that G^e_t is always changing and the number of nodes (i.e., |S_{G^e_t}|) keeps increasing, it is non-trivial for the agent to learn to select a state as the goal for further exploration. Hence, we first restrict exploration to the boundary of the state-transition graph ∂G^e_t according to Proposition 2. We then consider partitioning ∂G^e_t into several groups. We set the last visited state s_last as the origin, because s_last is likely to be close to the target state and reachable for the current policy. As introduced in Section 3, we propose to extend groups from s_last from two possible perspectives, namely neighbor nodes and uncertain nodes.
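One plausible reading of the "extending from neighbor nodes" strategy is to bucket boundary states into rings by their BFS distance from the origin s_last. The ring-bucketing rule below is our own guess at the mechanism, purely for illustration; the paper's actual division may differ.

```python
from collections import deque

def groups_by_neighbor_rings(adj, boundary, s_last, n_groups=3):
    """Hypothetical sketch of the neighbor-based boundary division.

    adj: adjacency lists of the state-transition graph;
    boundary: the set ∂G; s_last: last visited state (the origin).
    Boundary states are grouped into n_groups rings of BFS distance.
    """
    dist, q = {s_last: 0}, deque([s_last])
    while q:                                   # plain BFS from s_last
        u = q.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    far = max((dist.get(s, 0) for s in boundary), default=0)
    groups = [[] for _ in range(n_groups)]
    for s in boundary:
        ring = min(dist.get(s, far) * n_groups // (far + 1), n_groups - 1)
        groups[ring].append(s)
    return groups
```

The "extending from uncertain nodes" alternative would replace the BFS distance with the certainty score of each boundary state.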

F EXPERIMENTS F.1 ENVIRONMENT CONFIGURATION

Maze. As shown in Figure 12(a), in the Maze environment, a point agent in a 2D U-maze aims to reach a goal represented by a red point. The size of the maze is 15 × 15, the state space lies in this 2D U-maze, and the goal is uniformly generated on the segment from (0, 0) to (15.0, 15.0). The action space ranges from (-1.0, -1.0) to (1.0, 1.0), which represents movement in the x and y directions. AntMaze. As shown in Figure 12(b), in the AntMaze environment, an ant is put in a U-maze of size 12 × 12. The ant is placed at a random location on the segment from (-2.0, -2.0) to (10.0, 10.0), and the goal is uniformly generated on the segment from (-2.0, -2.0) to (10.0, 10.0).
The state of the ant is 30-dimensional, including its positions and velocities. FetchPush. As shown in Figure 12(c), in the fetch environment, the agent is trained to move an object from an initial position (rectangle depicted in green) to a distant position (rectangle depicted in red). Let the origin (0, 0, 0) denote the projection of the gripper's initial coordinate on the table. The object is uniformly generated on the segment from (-0.0, -0.0, 0) to (8, 8, 0), and the goal is uniformly generated on the segment from (-0.0, -0.0, 0) to (8, 8, 0). FetchPush with Obstacle. As shown in Figure 12(d), we create an environment based on FetchPush with a rigid obstacle, where the brown block is a static wall that cannot be moved. The object is uniformly generated on the segment from (-0.0, -0.0, 0) to (8, 8, 0), and the goal is uniformly generated on the segment from (-0.0, -0.0, 0) to (8, 8, 0). AntMaze with Obstacle. This environment is an extended version of AntMaze, where a 1 × 1 rigid obstacle is put in the U-maze.

F.2 EVALUATION DETAILS

• All curves presented in this paper are plotted from 10 runs with random task initializations and seeds.
• The shaded region indicates the 60% population around the median.
• All curves are plotted using the same hyper-parameters (except in the ablation section).
• Following Andrychowicz et al. (2017), an episode is considered successful if ‖g − s_object‖_2 ≤ δ_g, where s_object is the object position at the end of the episode and δ_g is the threshold.
• The max timestep for each episode is set to 200 for training and 500 for testing.
• The average success rate used in the curves is estimated from 10^2 samples.
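The success criterion and the shaded-region statistic above can be sketched in a few lines. This is a plotting-side sketch under our assumptions about the aggregation (central quantiles around the median); the function names are illustrative.

```python
import numpy as np

def success(goal, s_object, delta_g):
    """Episode success test: final object position within δ_g of the goal
    (the ‖g − s_object‖₂ ≤ δ_g criterion of Andrychowicz et al., 2017)."""
    return np.linalg.norm(np.asarray(goal) - np.asarray(s_object)) <= delta_g

def median_band(runs, coverage=0.6):
    """Median curve plus the central `coverage` fraction of runs,
    matching the shaded 60% population region described above."""
    runs = np.asarray(runs)                  # shape: (n_runs, n_points)
    lo, hi = np.quantile(runs, [(1 - coverage) / 2, (1 + coverage) / 2], axis=0)
    return np.median(runs, axis=0), lo, hi
```

With 10 runs, the 60% band spans roughly the 3rd through 8th order statistics at each evaluation point.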

F.3 HYPER-PARAMETERS

Almost all hyper-parameters for DDPG (Lillicrap et al., 2015) and HER (Andrychowicz et al., 2017) are kept the same as in the benchmark results, except these:
• Number of MPI workers: 1;
• Actor and critic networks: 3 layers with 256 units and ReLU activation;
• Adam optimizer with 5 × 10^-4 learning rate;
• Polyak-averaging coefficient: 0.98;
• Action l2-norm penalty coefficient: 0.5;
• Batch size: 256;
• Probability of random actions: 0.2;
• Scale of additive Gaussian noise: 0.2;
• Probability of HER experience replay: 0.8;
• Number of batches to replay after collecting one trajectory: 50.
Hyper-parameters in goal generation:
• Adam optimizer with 1 × 10^-3 learning rate;
• K of K-bins discretization: 20;
• Number of groups to partition the graph: 3.
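Collected into one place, the hyper-parameters above might be passed to a DDPG+HER training script as a single config dict. The key names below are illustrative (not the authors' code); only the values come from the list above.

```python
# Hypothetical config dict mirroring the hyper-parameter list above.
CONFIG = {
    "n_mpi_workers": 1,
    "hidden_layers": [256, 256, 256],   # 3 layers, 256 units, ReLU
    "lr_actor_critic": 5e-4,            # Adam
    "polyak": 0.98,                     # Polyak-averaging coefficient
    "action_l2": 0.5,                   # action l2-norm penalty
    "batch_size": 256,
    "random_eps": 0.2,                  # probability of random actions
    "noise_eps": 0.2,                   # scale of additive Gaussian noise
    "her_replay_p": 0.8,                # probability of HER experience replay
    "n_batches": 50,                    # batches per collected trajectory
    "lr_goal_gen": 1e-3,                # Adam, goal-generation networks
    "k_bins": 20,                       # K of K-bins discretization
    "n_groups": 3,                      # groups partitioning the boundary
}
```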

F.4 IMPACT OF GROUP SELECTION

We provide a separate numerical table of asymptotic performance values in addition to Figure 6(g), since the curves of all the models (GSRL and its variants) are close to each other.

Model          NOGroup       UNCERT2       UNCERT3       UNCERT4       NEIGH3
Success Rate   0.98 ± 0.03   0.98 ± 0.02   0.99 ± 0.02   0.97 ± 0.03   0.97 ± 0.03

Table 1: Results of the investigation on the impact of group selection. Note that the success rate is bounded between 0.00 and 1.00.

F.5 COMPARISON ON EXPLORATION

To compare exploration in terms of sample efficiency, we count the number of unique states visited and actions taken: given a fixed number of episodes, visiting more unique states and taking more actions usually indicates more efficient exploration. We report the log files of GSRL and HER in the Maze environment at 10, 50, and 100 episodes, which contain the number of visited nodes and the actions taken.

==================== Graph Structured Reinforcement Learning (GSRL) ====================
episode is: 10
nodes: [22, 21, 11, 31, 42, 32, 41, 23, 43, 33, 44, 34, 13, 14, 15, 26, 25, 36, 35, 45, 16, 24, 37, 47, 38, 48, 46, 12, 49, 59, 69, 79, 80, 78, 90, 89, 99, 109, 110, 100, 27, 28, 39, 50, 40]
number of nodes: 45
edges: [(22, 22), (22, 21), (22, 23), (22, 32), (22, 33), (22, 13), (22, 31), (22, 12), (21, 21), (21, 11), (21, 31), (21, 32), (21, 22), (11, 11), (11, 21), (11, 12), (31, 31), (31, 42), (31, 41), (31, 22), (31, 32), (31, 21), (42, 42), (42, 32), (42, 43), (42, 41), (42, 31), (32, 31), (32, 32), (32, 42), (32, 43), (32, 33), (32, 23), (32, 22), (41, 41), (41, 31), (23, 22), (23, 23), (23, 32), (23, 34), (23, 33), (23, 24), (43, 32), (43, 43), (43, 42), (43, 33), (43, 44), (43, 34), (33, 33), (33, 44), (33, 32), (33, 43), (33, 23), (33, 34), (44, 44), (44, 33), (44, 43), (44, 35), (44, 34), (44, 45), (34, 43), (34, 33), (34, 34), (34, 24), (34, 35), (34, 44), (34, 45), (13, 13), (13, 14), (13, 23), (14, 15), (14, 25), (14, 14), (15, 15), (15, 26), (15, 25), (15, 14), (15, 16), (26, 26), (26, 25), (26, 36), (26, 16), (26, 37), (26, 27), (25, 26), (25, 25), (25, 15), (25, 36), (36, 35), (36, 26), (36, 36), (36, 37), (36, 27), (35, 35), (35, 45), (35, 26), (35, 34), (35, 36), (35, 44), (35, 25), (45, 35), (45, 44), (45, 45), (16, 26), (24, 35), (24, 34), (24, 25), (37, 47), (37, 37), (37, 38), (37, 48), (47, 37), (47, 47), (47, 48), (47, 46), (38, 47), (38, 48), (38, 37), (38, 49), (38, 28), (38, 39), (48, 38), (48, 48), (48, 49), (46, 47), (46, 46), (12, 11), (12, 21), (12, 23), (12, 22), (12, 13), (49, 48), (49, 59), (49, 50), (59, 69), (69, 69), (69, 79), (79, 80), (79, 78), (79, 79), (79, 90)

episode is: 50
nodes: [22, 21, 11, 31, 42, 32, 41, 23, 43, 33, 44, 34, 13, 14, 15, 26, 25, 36, 35, 45, 16, 24, 37, 47, 38, 48, 46, 12, 49, 59, 69, 79, 80, 78, 90, 89, 99, 109, 110, 100, 27, 28, 39, 50, 40, 29, 30, 51, 57, 18, 19, 58, 68, 17, 60, 20, 67, 70, 71, 61]
number of nodes: 60
edges: [(22, 22), (22, 21), (22, 23), (22, 32), (22, 33), (22, 13), (22, 31), (22, 12), (22, 11), (21, 21), (21, 11), (21, 31), (21, 32), (21, 22), (21, 12), (11, 11), (11, 21), (11, 12), (11, 22), (31, 31), (31, 42), (31, 41), (31, 22), (31, 32), (31, 21), (31, 40), (31, 30), (42, 42), (42, 32), (42, 43), (42, 41), (42, 31), (32, 31), (32, 32), (32, 42), (32, 43), (32, 33), (32, 23), (32, 22), (32, 41), (32, 21), (41, 41), (41, 31), (41, 40), (41, 32), (23, 22), (23, 23), (23, 32), (23, 34), (23, 33), (23, 24), (23, 13), (23, 14), (23, 12), (43, 32), (43, 43), (43, 42), (43, 33), (43, 44), (43, 34), (33, 33), (33, 44), (33, 32), (33, 43), (33, 23), (33, 34), (33, 22), (33, 42), (33, 24), (44, 44), (44, 33), (44, 43), (44, 35), (44, 34), (44, 45), (34, 43), (34, 33), (34, 34), (34, 24), (34, 35), (34, 44), (34, 45), (34, 25), (13, 13), (13, 14), (13, 23), (13, 12), (13, 22), (13, 24), (14, 15), (14, 25), (14, 14), (14, 24), (14, 13), (14, 23), (15, 15), (15, 26), (15, 25), (15, 14), (15, 16), (26, 26), (26, 25), (26, 36), (26, 16), (26, 37), (26, 27), (26, 35), (26, 17), (25, 26), (25, 25), (25, 15), (25, 36), (25, 24), (25, 35), (25, 16), (25, 34), (36, 35), (36, 26), (36, 36), (36, 37), (36, 27), (36, 46), (36, 47), (36, 45), (35, 35), (35, 45), (35, 26), (35, 34), (35, 36), (35, 44), (35, 25), (35, 46), (35, 24), (45, 35), (45, 44), (45, 45), (45, 46), (45, 36), (45, 34), (16, 26), (16, 16), (16, 27), (16, 17), (16, 15), (16, 25), (24, 35), (24, 34), (24, 25), (24, 24), (24, 14), (24, 23), (24, 15), (24, 33), (37, 47), (37, 37), (37, 38), (37, 48), (37, 28), (37, 36), (37, 46), (37, 27), (47, 37), (47, 47), (47, 48), (47, 46), (47, 38), (47, 58), (47, 57), (47, 36), (38, 47), (38, 48), (38, 37), (38, 49), (38, 28), (38, 39), (38, 38), (38, 27), (48, 38), (48, 48), (48, 49), (48, 57), (48, 47), (48, 58), (48, 59), (46, 47), (46, 46), (46, 37), (46, 45), (46,



‡ Similar to (Andrychowicz et al., 2017), we relabel all the rewards when the goal changes. Therefore, V(s, g_{e-1}) is used here. § Note that we define the target state and the optimal solution path in episode e + 1, and the optimal goal in episode e.



Figure 1: An illustration for motivation of GSRL.

Figure 2: GSRL allows joint optimization of map construction and agent exploration from scratch. (a) An agent is located at the initial state and aims at the terminal state with an empty map at the beginning. (b) The agent explores the environment and records the explored states on the map. (c) The agent generates a goal from previous states to decompose the whole task into a sequence of easier tasks. (d) The agent updates current policy with related trajectories selected by structured map information.

Figure 3: An illustration of exploration strategy.

Figure 4: Illustrations of optimal goals. States surrounded by red circles and connected by red arrows form the optimal solution path, which can be obtained by a shortest-path planning algorithm on the fully explored graph G_full in (a). When the graph is not fully explored (G_0^e in (b)), we generate the optimal solution path in hindsight, where we regard the state with the highest value in the next episode (G_0^{e+1}) as the optimal goal.

Denote the Bellman backup operator in Eq. (6) as B : R^{|S|×|A|×|G|} → R^{|S|×|A|×|G|} and a mapping Q : S × A × G → R with |S| < ∞ and |A| < ∞. Repeated application of the operator B to our graph-based state-action value estimate Q_G converges to a unique optimal value Q*_{G*} with the well-explored graph G* (i.e., the fully explored graph G_full). Proof. The proof is shown in Appendix D.3.
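The convergence claim rests on B being a γ-contraction in the sup-norm, after which the Banach fixed-point theorem gives a unique fixed point. A standard sketch, assuming the backup in Eq. (6) has the usual goal-conditioned form (BQ)(s, a, g) = r(s, a, g) + γ E_{s'}[max_{a'} Q(s', a', g)] (the full proof is in Appendix D.3):

```latex
% gamma-contraction of the Bellman backup B in the sup-norm
\begin{aligned}
\|\mathcal{B}Q_1 - \mathcal{B}Q_2\|_\infty
  &= \gamma \max_{s,a,g} \Big| \mathbb{E}_{s'}\big[\max_{a'} Q_1(s',a',g)
      - \max_{a'} Q_2(s',a',g)\big] \Big| \\
  &\le \gamma \max_{s',a',g} \big| Q_1(s',a',g) - Q_2(s',a',g) \big|
   \;=\; \gamma \,\|Q_1 - Q_2\|_\infty .
\end{aligned}
```

Since γ < 1, B is a contraction on the complete metric space (R^{|S|×|A|×|G|}, ‖·‖_∞), so repeated application converges to the unique fixed point.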


• HER: Andrychowicz et al. (2017) generated imaginary goals in a simple heuristic way to tackle the sparse-reward issue.
• MAP: Huang et al. (2019) explicitly modeled the environment in a hierarchical manner, with a high-level map abstracting the state space and a low-level value network to derive local decisions.
• GoalGAN: Florensa et al. (2018) leveraged Least-Squares GAN (Mao et al., 2018b) to mimic the set of Goals of Intermediate Difficulty as an automatic goal generator.
• CHER: Fang et al. (2019) proposed to enforce more curiosity in earlier stages and change to larger goal proximity later.

Figure 6: Learning curves of GSRL, HER, MAP, GoalGAN, and CHER on various environments with 10 random seeds, where the solid curves depict the mean, the shaded areas indicate the standard deviation, and dashed horizontal lines show the asymptotic performance.

Figure 7: An illustrated example of notations in Graph Structured Reinforcement Learning (GSRL).

Figure 8: An illustrated example of Graph Structured Reinforcement Learning (GSRL).

Figure 9: An illustrated example of the proof of Proposition 1 and the key difference between the proposition in this paper and the one in the previous work.

Figure 10: An illustrated example of the relationship between the optimal goal and the boundary. In the fully explored graph G_full, the red circled states together show the optimal solution path P_full = (s_0, s_1, s_3, s_6, s_9, s_10), with terminal state s_10 as the optimal goal, in (a). In any other not fully explored state-transition graph G_0^e at the beginning timestep of an episode e in (b), we regard the reachable state in the dashed circle (s_6), found through planning in the next episode G_0^{e+1} in (c), as the optimal goal in (d).

Figure 11: An illustrated example of the relationship between certainty and the number of visited states.

Proposition 4. Given a whole state-transition graph G_whole, we can regard the certainty of states as a local measurement and the number of visited states as a global measurement for exploration, which share a similar trend during agent exploration.

Proof. We illustrate and prove the proposition in hindsight. If we have full observation of the states, as shown in Figure 11(a), we can model the agent finding new states as connecting new states to visited states. In other words, since the state-transition graph G_t^e remains a connected graph at any timestep t in any episode e, adding new states to the visited state set can always be regarded as adding a new edge between the new states and the visited state set. Each directed edge in the state-transition graph, as shown in Figure 11(b), is determined by the action and the state-transition function. If the environment is deterministic, we can therefore roughly regard the number of edges as an approximate measurement of exploration; the certainty of states is the local perspective on this same measurement.

E.2 DISCUSSION ON OPTIMAL GOAL


Figure 12: Visualization of robotic manipulation environments.



Extending from Uncertain Nodes. The other perspective is to utilize certainty information to guide goal generation, where C_n := {s : s ∈ S_{d=|A|−n} ∩ ∂G_t^e} for n = 1, 2, ..., N − 1, and C_N := ∂G_t^e \ ∪_{n=1}^{N−1} C_n. Here S_{d=|A|−n} denotes the set of states whose out-degree equals |A| − n, and |A| is the size of the action space. The intuitive motivation behind this is to eliminate uncertainty in the graph through exploration. It should be noted that extending from either neighbor or uncertain nodes guarantees that the number of groups C_1, ..., C_N on a dynamic graph G_t^e is fixed.
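The uncertainty-based grouping can be sketched as follows, assuming the boundary ∂G_t^e is given as a mapping from each boundary state to its out-degree; the function and variable names are illustrative, not from the paper's code.

```python
def group_by_uncertainty(boundary_out_degree, num_actions, n_groups):
    """Partition boundary states into groups C_1..C_N by out-degree.

    C_n collects states with out-degree |A| - n (fewer tried actions means
    higher uncertainty, hence larger n), and the last group C_N collects
    all remaining boundary states."""
    groups = [[] for _ in range(n_groups)]
    for state, degree in boundary_out_degree.items():
        n = num_actions - degree          # uncertainty level of this state
        if 1 <= n < n_groups:
            groups[n - 1].append(state)   # C_n for n = 1, ..., N-1
        else:
            groups[-1].append(state)      # residual group C_N
    return groups

# With |A| = 4 actions: a state with out-degree 3 has one untried action (C_1).
C = group_by_uncertainty({"s1": 3, "s2": 2, "s3": 1, "s4": 0},
                         num_actions=4, n_groups=3)
```

Here `C[0]` and `C[1]` hold the most- and next-most-certain boundary states, and `C[2]` collects everything left, mirroring the residual definition of C_N above.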



Complexity. Let d_{∂G_t^e} denote the maximum out-degree of states in ∂G_t^e, and |S_{∂G_t^e}| denote the number of states in ∂G_t^e. Note that ∂G_t^e is always a directed connected graph. If we want to find the n-hop neighbors of s_last, we need to iteratively go through related nodes' neighborhoods; in other words, the computational complexity is O(d^n_{∂G_t^e}). Hence, the complexity of constructing C_1, ..., C_N by extending from neighbor nodes is O(d^{N−1}_{∂G_t^e}). If we want to find nodes whose uncertainty equals n, we only need to go through the graph once, so the computational complexity is O(|S_{∂G_t^e}|). Hence, the complexity of constructing C_1, ..., C_N by extending from uncertain nodes is O(|S_{∂G_t^e}|).
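The n-hop neighbor search behind the O(d^n) bound can be sketched as a depth-bounded breadth-first search from s_last; the adjacency-list representation and names below are illustrative assumptions.

```python
from collections import deque

def n_hop_neighbors(adj, source, n):
    """All nodes reachable from `source` within n hops in a directed graph
    given as adjacency lists. Worst case O(d^n) node visits, where d is
    the maximum out-degree."""
    seen = {source: 0}          # node -> hop distance at discovery
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if seen[u] == n:        # do not expand beyond n hops
            continue
        for v in adj.get(u, ()):
            if v not in seen:
                seen[v] = seen[u] + 1
                queue.append(v)
    seen.pop(source)            # the source itself is not a neighbor
    return set(seen)

adj = {"a": ["b"], "b": ["c"], "c": ["d"]}  # toy boundary subgraph
```

Calling `n_hop_neighbors(adj, "a", 2)` returns `{"b", "c"}`: each extra hop multiplies the frontier by at most d, which is exactly the O(d^n) cost discussed above.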

DQN parameterizes a deep neural network Q_θ(s, a) and uses Q-learning (Watkins & Dayan, 1992) to decide which action a_t is best to take in each state s_t at time step t. The parameters θ are optimized by minimizing the L_2 difference between the network's output and the target y_t = r_t + γ max_a Q_θ⁻(s_{t+1}, a), where θ⁻ are the parameters of a past version of the value network that is updated periodically. DQN uses experience replay, which samples (s_t, a_t, r_t, s_{t+1}) tuples from a replay buffer for training.
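The target computation described above can be sketched in a few lines; for brevity this uses a tabular Q instead of a neural network, and the function name and toy table are illustrative.

```python
import numpy as np

def dqn_target(reward, next_state, q_target_table, gamma=0.99, done=False):
    """Standard DQN target y = r + gamma * max_a Q_target(s', a), computed
    with a periodically-updated target table (standing in for Q_theta-)."""
    if done:
        return reward           # no bootstrap past a terminal state
    return reward + gamma * np.max(q_target_table[next_state])

# Q_target[s, a] for 2 states x 2 actions (toy values).
q_target = np.array([[0.0, 1.0],
                     [2.0, 0.5]])
y = dqn_target(reward=1.0, next_state=1, q_target_table=q_target, gamma=0.9)
# y = 1.0 + 0.9 * 2.0 = 2.8
```

The online network's prediction Q_θ(s_t, a_t) is then regressed toward `y` with an L_2 loss, while the target table/network is only refreshed periodically.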

