DISCRETE STATE-ACTION ABSTRACTION VIA THE SUCCESSOR REPRESENTATION

Abstract

While the difficulty of reinforcement learning problems is typically related to the complexity of their state spaces, abstraction proposes that solutions often lie in simpler underlying latent spaces. Prior works have focused on learning either a continuous or dense abstraction, or require a human to provide one. Information-dense representations capture features irrelevant for solving tasks, and continuous spaces can struggle to represent discrete objects. In this work we automatically learn a sparse discrete abstraction of the underlying environment. We do so using a simple end-to-end trainable model based on the successor representation and max-entropy regularization. We describe an algorithm to apply our model, named Discrete State-Action Abstraction (DSAA), which computes an action abstraction in the form of temporally extended actions, i.e., Options, to transition between discrete abstract states. Empirically, we demonstrate the effects of different exploration schemes on our resulting abstraction, and show that it is efficient for solving downstream tasks.

1. INTRODUCTION

Reinforcement learning (RL) provides a general framework for solving search problems through the formalism of a Markov Decision Process (MDP); yet with that generality, it sacrifices some basic properties one might expect of a search algorithm. In particular, one should not explore a state multiple times, which is why basic search, such as Dijkstra's algorithm, keeps track of the explored frontier. Basic search methods perform well when state and action spaces are small and discrete, but don't necessarily translate to more complex environments, such as large or continuous ones.

Humans have several intuitive ways to efficiently explore in complex scenarios. One is by abstracting states, thereby exploring a simpler, more structured model of the environment. For example, consider searching for an exit in a large room using only touch: we would never blindly roam the center, but rather follow the walls. By abstracting states based on the property "can contain exit", we greatly reduce the set of states we have to explore in the first place. Another closely related way we explore efficiently is by abstracting actions. Ingrained skills, or temporally extended actions, impose a prior on the types of action sequences that help solve problems. For example, when exploring we don't move forward and then immediately backwards. Randomly choosing the next direction to move is rarely a good idea, and so skills ensure that we explore new environments efficiently.

In this work we are concerned with incorporating the intuitive concepts of state abstraction and action abstraction into reinforcement learning. Our method for state abstraction is based on the Successor Representation, which intuitively characterizes states based on "what happens after visiting this state".
By learning a discrete state abstraction, we take advantage of a simple and natural definition for action abstraction: abstract actions are policies which help the agent navigate between pairs of abstract states (Abel et al., 2020). We motivate our interest in discrete abstractions in two ways. Firstly, many of the decisions an agent must make in the world are discrete and depend on discrete objects or properties. For example, the length of the optimal path out of a room does not change continuously as a function of the number of doors. While we can still model such decision problems using continuous representations, it is known that discrete metrics cannot be perfectly embedded in continuous spaces (Bourgain, 1985), and it has been shown empirically that policies trained in such continuous spaces struggle precisely at points of discontinuity (Tang & Hauser, 2019). The second motivation is that classical algorithms for planning in discrete spaces are better understood and provide stronger guarantees; in fact, we often deal with continuous spaces by discretizing them with the help of local planners (Kavraki et al., 1996). Our action abstraction is like a local planner, except learned rather than specified ahead of time. Moreover, our discrete abstraction represents an explicit reduction in the size of the state space, in which both the depth and branching factor of future search can be easily controlled. Thus, it is simpler and more efficient to reuse it to navigate the environment.
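In the tabular case the Successor Representation has a simple closed form, which makes the "what happens after visiting this state" intuition concrete. The sketch below (the four-state chain, random-walk policy, and discount factor are our own illustration, not taken from this paper) computes the SR of a small chain MDP and checks that nearby states have more similar SR rows than distant ones:

```python
import numpy as np

# Toy 4-state chain under a fixed random-walk policy (illustrative only).
# P[s, s'] is the state-to-state transition probability induced by the policy.
P = np.array([
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.5, 0.5],
])
gamma = 0.95

# Tabular successor representation: M = (I - gamma * P)^(-1).
# Row s holds the expected discounted future occupancy of every state when
# starting from s -- i.e., "what happens after visiting s" under this policy.
M = np.linalg.inv(np.eye(4) - gamma * P)

# Adjacent states have closer SR rows than states at opposite ends of the
# chain, which is what makes the SR a useful similarity signal for abstraction.
near = np.linalg.norm(M[0] - M[1])
far = np.linalg.norm(M[0] - M[3])
print(near < far)  # True
```

In deep RL the SR is typically estimated with temporal-difference learning rather than a matrix inverse, but the induced notion of state similarity is the same.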

1.1. CONTRIBUTIONS

Our main contribution is a novel method to learn a discrete abstraction by partitioning an arbitrary state space from a dataset of transitions which explore that same space. In particular, we cluster states with a similar Successor Representation (SR) as being part of the same abstract state. Intuitively, if the dataset of transitions was generated by some policy, states from which that policy visits similar states are in turn marked as similar. Unlike prior works on the SR (e.g., Machado et al. (2018b); Ramesh et al. (2019)), our approach is end-to-end trainable and uses a comparatively weaker max-entropy regularization. Our neural network model resembles a discrete variational autoencoder, in which an encoder computes the abstraction and a decoder computes the SR. To demonstrate the effectiveness of our method, we propose an algorithm, Discrete State-Action Abstraction (DSAA), which creates a discrete state-action abstraction pair, using options for modeling the action abstraction. Since the SR depends on the policy used to generate data, we report the effect of changing the exploration method on the resulting abstraction, in contrast to prior work which has focused on uniform random exploration. We additionally compare DSAA to related works on both discrete and continuous tasks, demonstrating the value of learning a simple reusable representation.
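The clustering idea can be illustrated in the tabular setting. DSAA itself learns the abstraction end-to-end with a neural encoder and max-entropy regularization; as a minimal stand-in for that model (the six-state chain and the nearest-centroid assignment below are our own assumptions), the following toy groups exact SR rows and recovers the two halves of a chain as two abstract states:

```python
import numpy as np

# Toy "two-room" chain 0-1-2-3-4-5 under a random-walk policy (our own
# illustrative setup; DSAA instead trains an encoder end-to-end).
n, gamma = 6, 0.95
P = np.zeros((n, n))
for s in range(n):
    if s == 0:
        P[0, 0] = P[0, 1] = 0.5          # left end: stay or move right
    elif s == n - 1:
        P[s, s] = P[s, s - 1] = 0.5      # right end: stay or move left
    else:
        P[s, s - 1] = P[s, s + 1] = 0.5  # interior: move left or right

# Exact successor representation of each state under this policy.
M = np.linalg.inv(np.eye(n) - gamma * P)

# Group states with similar SR rows: assign each state to the nearer of two
# centroids seeded at the chain's endpoints (a crude stand-in for the
# learned discrete encoder).
centroids = M[[0, n - 1]]
dists = np.linalg.norm(M[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)
print(labels)  # [0 0 0 1 1 1]: each half of the chain becomes one abstract state
```

The toy version only shows why similar SR rows indicate a shared abstract state; in DSAA the assignment is produced by a trained encoder, with the entropy term keeping all abstract states in use.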

2.1. REINFORCEMENT LEARNING

We consider the model-free reinforcement learning (RL) problem with an underlying infinite-horizon Markov Decision Process (MDP): M = (X, A, p, r, γ, x_0), with state space X, action space A, unknown environment dynamics p(x' | x, a) giving the probability of transitioning to state x' having taken action a at state x, reward function r : X × A → [0, 1], discount factor 0 < γ < 1, and initial state x_0. Let x_t ∈ X and a_t ∈ A be the agent's state and action respectively at time t, and let π(a_t | x_t) be the agent's policy, determining the probability distribution over actions at state x_t. Given a policy π, the expected return (i.e., discounted sum of rewards) obtained by taking action a at state x is described by the Q-value function

Q^π(x, a) = E_{p,π}[ Σ_{t=0}^∞ γ^t r(x_t, a_t) | a_0 = a, x_0 = x ],

where a_t ∼ π(· | x_t) and x_{t+1} ∼ p(· | x_t, a_t). The agent's goal is to compute a policy π which maximizes the expected return from the starting state x_0, i.e., π ∈ arg max_π E_{a∼π(·|x_0)}[Q^π(x_0, a)]. Since the environment dynamics p are unknown, a common approach in RL is to iteratively improve an estimate of the Q-value function while simultaneously exploring the environment using the induced policy. However, we highlight that when the reward function is sparse, such methods suffer from a long and uninformed (unrewarding) random exploration phase.

2.2. OPTIONS FRAMEWORK

Sutton et al. (1999) presented the options framework to extend the classic formalism of an MDP to a semi-MDP, in which primitive single-step actions can be replaced with temporally extended policies in the form of options. An option in an MDP is a 3-tuple o = (I_o, π_o, T_o), where I_o, T_o ⊆ X are the initiation and termination sets respectively, and π_o is a policy that initiates in some state x_0 ∈ I_o and terminates upon reaching any state in T_o. Intuitively, an option describes a local subproblem of navigating or funneling the agent between regions of the state space. A common approach is to create options so that the termination set of one lies in the initiation set of another, thus allowing for option chaining (Bagaria & Konidaris, 2019). In

