TOMA: TOPOLOGICAL MAP ABSTRACTION FOR REINFORCEMENT LEARNING

Abstract

Animals are able to discover the topological map (graph) of their surrounding environment and use it for navigation. Inspired by this biological phenomenon, researchers have recently proposed learning a graph representation for a Markov decision process (MDP) and using such graphs for planning in reinforcement learning (RL). However, existing learning-based graph generation methods suffer from several drawbacks. One drawback is that existing methods do not learn an abstraction of the graph, which results in high memory and computation cost. This drawback also makes the generated graph non-robust, which degrades planning performance. Another drawback is that existing methods cannot be used to facilitate exploration, which is important in RL. In this paper, we propose a new method, called topological map abstraction (TOMA), for learning-based graph generation. TOMA can learn an abstract graph representation for an MDP, which costs much less memory and computation than existing methods. Furthermore, TOMA can be used to facilitate exploration. In particular, we propose planning to explore, in which TOMA is used to accelerate exploration by guiding the agent towards unexplored states. A novel experience replay module called vertex memory is also proposed to improve exploration performance. Experimental results show that TOMA outperforms existing methods and achieves state-of-the-art performance.

1. INTRODUCTION

Animals are able to discover the topological map (graph) of their surrounding environment (O'Keefe and Dostrovsky, 1971; Moser et al., 2008), which is used as a hint for navigation. For example, early maze experiments on rats (O'Keefe and Dostrovsky, 1971) reveal that rats can create a mental representation of the maze and use this representation to reach food placed in the maze. In the cognitive science community, researchers summarize these discoveries in cognitive map theory (Tolman, 1948), which states that animals can extract and encode the structure of the environment in a compact and abstract map representation. Inspired by this biological phenomenon, researchers have proposed generating a topological graph representation for a Markov decision process (MDP) and using such graphs for planning in reinforcement learning (RL). Early graph generation methods (Mannor et al., 2004) are usually prior-based: they apply some human prior to aggregate similar states into vertices. Recently, researchers have proposed learning-based graph generation algorithms that learn such state aggregation automatically. These methods have been shown to outperform prior-based methods (Metzen, 2013). They generally treat the states in a replay buffer as vertices. For the edges of the graph, some methods like SPTM (Savinov et al., 2018) train a reachability predictor via self-supervised learning and combine it with human experience to construct the edges. Other methods like SoRB (Eysenbach et al., 2019) exploit a goal-conditioned agent to estimate the distance between vertices, based on which edges are constructed. These existing methods suffer from the following drawbacks. Firstly, these methods do not learn an abstraction of the graph and usually consider all the states in the buffer as vertices (Savinov et al., 2018), which results in high memory and computation cost. This drawback also makes the generated graph non-robust, which degrades planning performance.
Secondly, existing methods cannot be used to facilitate exploration, which is important in RL. In particular, methods like SPTM rely on human-sampled trajectories to generate the graph, which is infeasible in RL exploration. Methods like SoRB require training another goal-conditioned agent. Such a training procedure assumes knowledge of the environment, since it requires generating several goal-reaching tasks to train the agent. This practice is also intractable in RL exploration.

In this paper, we propose a new method, called TOpological Map Abstraction (TOMA), for learning-based graph generation. The main contributions of this paper are outlined as follows:

• TOMA can learn to generate an abstract graph representation for an MDP. Different from existing methods in which each vertex of the graph represents a single state, each vertex in TOMA represents a cluster of states. As a result, compared with existing methods, TOMA has much lower memory and computation cost and can generate a more robust graph for planning.

• TOMA can be used to facilitate exploration. In particular, we propose planning to explore, in which TOMA is used to accelerate exploration by guiding the agent towards unexplored states. A novel experience replay module called vertex memory is also proposed to improve exploration performance.

• Experimental results show that TOMA can robustly generate abstract graph representations on several 2D world environments with different types of observation and can outperform previous learning-based graph generation methods to achieve state-of-the-art performance.

2. ALGORITHM

2.1. NOTATIONS

In this paper, we model an RL problem as a Markov decision process (MDP). An MDP is a tuple M = (S, A, R, γ, P), where S is the state space, A is the action space, R : S × A → ℝ is a reward function, γ is a discount factor and P(s_{t+1} | s_t, a_t) is the transition dynamics. ρ(x, y) = ‖x − y‖₂ denotes the Euclidean distance. G(V, E) denotes a graph, where V is its vertex set and E is its edge set. For any set X, we define its indicator function 1_X(x) as follows: 1_X(x) = 1 if x ∈ X, and 1_X(x) = 0 if x ∉ X.
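The two helper notations above can be sketched directly in code; this is a minimal illustration of the definitions only, with `rho` and `indicator` as hypothetical names not used elsewhere in the paper.

```python
import numpy as np

def rho(x, y):
    """Euclidean distance rho(x, y) = ||x - y||_2 between two states."""
    return float(np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(y, dtype=float)))

def indicator(X):
    """Indicator function 1_X of a set X: 1_X(x) = 1 if x in X, else 0."""
    return lambda x: 1 if x in X else 0
```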

2.2. TOMA

Figure 1 gives an illustration of TOMA, which maps states to an abstract graph. A landmark set L is a subset of S, and each landmark l_i ∈ L corresponds one-to-one to a vertex v_i in the graph. Each l_i and v_i represents a cluster of states. To decide which vertex a state s ∈ S corresponds to, we first use a locality sensitive embedding function φ_θ to calculate its latent representation z = φ_θ(s) in the embedding space Z. Then, if z's nearest neighbor in the embedded landmark set φ_θ(L) = {φ_θ(l) | l ∈ L} is φ_θ(l_i), we map s to vertex v_i ∈ V.

Figure 1: Illustration of TOMA. We pick some states as landmarks (colored triangles) in the state space of the original MDP M. Each landmark l_i corresponds one-to-one to a vertex v_i (colored circles) in graph G and covers some area of S. The embedding φ_θ is trained by self-supervised learning. We label each state on a trajectory with a corresponding vertex and use it to generate the graph dynamically.
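The state-to-vertex assignment described above can be sketched as a nearest-landmark lookup in embedding space. This is a simplified sketch, not the paper's implementation: `phi` is a placeholder for the learned embedding φ_θ, and landmarks are assumed to be given as a list indexed consistently with the vertices.

```python
import numpy as np

def map_state_to_vertex(s, landmarks, phi):
    """Map state s to the index i of the vertex v_i whose embedded
    landmark phi(l_i) is nearest to phi(s) in the embedding space Z.

    s         : a state from the state space S
    landmarks : list of landmark states l_i, one per vertex v_i
    phi       : embedding function (stand-in for the learned phi_theta)
    """
    z = phi(s)
    # Euclidean distance from z to each embedded landmark phi(l_i)
    dists = [np.linalg.norm(z - phi(l)) for l in landmarks]
    return int(np.argmin(dists))
```

With an identity embedding and two landmarks, a state is simply assigned to the geometrically closest landmark; with a learned locality sensitive embedding, states that are close along trajectories would be assigned to the same vertex.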

