A SIMPLE APPROACH FOR STATE-ACTION ABSTRACTION USING A LEARNED MDP HOMOMORPHISM

Abstract

Animals are able to rapidly infer, from limited experience, when sets of state-action pairs have equivalent reward and transition dynamics. On the other hand, modern reinforcement learning systems must painstakingly learn through trial and error that sets of state-action pairs are value equivalent, often requiring a prohibitively large number of samples from their environment. MDP homomorphisms have been proposed that reduce the observed MDP of an environment to an abstract MDP, which can enable more sample efficient policy learning. Consequently, impressive improvements in sample efficiency have been achieved when a suitable MDP homomorphism can be constructed a priori, usually by exploiting a practitioner's knowledge of environment symmetries. We propose a novel approach to constructing a homomorphism in discrete action spaces, which uses a partial model of environment dynamics to infer which state-action pairs lead to the same state, reducing the size of the state-action space by a factor equal to the cardinality of the action space. We call this method equivalent effect abstraction. On MDP homomorphism benchmarks, we demonstrate improved sample efficiency over previous attempts to learn MDP homomorphisms, while achieving comparable sample efficiency to approaches that rely on prior knowledge of environment symmetries.

1. INTRODUCTION

Modern reinforcement learning (RL) agents outperform humans in previously impregnable benchmarks such as Go (Silver et al., 2016) and Starcraft (Vinyals et al., 2019). However, the computational expense of RL hinders its deployment in promising real world applications. In environments with large state spaces, RL agents demand hundreds of millions of samples (or even hundreds of billions) to learn a policy, either within an environment model (Hafner et al., 2020) or by direct interaction (Mnih et al., 2013). Function approximation can enable some generalisation within a large state space, but most RL agents still struggle to extrapolate value judgements to equivalent states. In contrast, animals can easily abstract away details about states that do not affect their values. For example, a foraging mouse understands that approaching a goal state (shown as a piece of cheese in Figure 1) while travelling east will have the same value as approaching the same goal state from the west. These sorts of state abstractions have been defined in RL as Markov decision process (MDP) homomorphisms (Ravindran & Barto, 2001; van der Pol et al., 2020b). MDP homomorphisms reduce large state-action spaces to smaller abstract state-action spaces by collapsing equivalent state-action pairs in an observed MDP onto a single abstract state-action pair in an abstract MDP (van der Pol et al., 2020b). Given a mapping between an abstract MDP and an experienced MDP, policies can be learned efficiently in the smaller abstract space and then mapped back to the experienced MDP when interacting with the environment (Ravindran & Barto, 2001). However, obtaining mappings to and from the abstract state-action space is challenging.
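To illustrate the idea of collapsing equivalent state-action pairs, here is a minimal sketch in a hypothetical one-dimensional corridor MDP (the environment and function names are our own illustrative assumptions, not from the paper). Two state-action pairs that reach the same next state are mapped to one abstract entry, so a value learned for one pair is immediately available for the other:

```python
# Minimal sketch (hypothetical 1-D corridor MDP): two state-action pairs
# that lead to the same next state are collapsed onto one abstract pair.
from collections import defaultdict

def next_state(s, a, n=5):
    """Deterministic corridor dynamics: a=+1 moves right, a=-1 moves left."""
    return min(max(s + a, 0), n - 1)

def abstract(s, a):
    # In this deterministic sketch, the abstract state-action pair is
    # identified with the next state both pairs lead to.
    return next_state(s, a)

q_abstract = defaultdict(float)  # one table entry per abstract pair

# (s=1, a=+1) and (s=3, a=-1) both reach s'=2, so they share one entry.
assert abstract(1, +1) == abstract(3, -1) == 2
q_abstract[abstract(1, +1)] = 0.7
print(q_abstract[abstract(3, -1)])  # -> 0.7, learned for the other pair
```

The point of the sketch is only the book-keeping: writing a value under one abstract key makes it readable from every equivalent state-action pair.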
Previous works hard code homomorphisms into a policy (van der Pol et al., 2020b), but learning homomorphic mappings online is an unsolved problem. We develop equivalent effect abstraction, a method that constructs MDP homomorphisms from experience via a dynamics model, leveraging the fact that state-action pairs leading to the same next state usually have equivalent value. Consider the example of a gridworld maze: moving to a given cell has the same value whether you approached from the right or the left. Therefore, if we know the value of approaching from the right, then we also know the value of approaching from the left. By extrapolating value judgements between equivalent state-action pairs, we can use this equivalence to reduce the amount of experience required to learn a policy. An important distinction from previous works is that we do not use environment symmetries to reduce the size of the state-action space. Instead, we exploit a separate redundancy common to many MDPs: for a given state, there are often multiple state-action pairs that lead to that state. While we do use a partial model in our approach, equivalent effect abstraction is different to model based RL because we focus on reducing the size of the state-action space rather than augmenting experience with predicted trajectories. Additionally, unlike model-based RL, equivalent effect abstraction can be plugged into model-free algorithms without a reward model. Our contributions are as follows:

1. We develop a novel approach for constructing MDP homomorphisms (equivalent effect abstraction) that requires no prior knowledge from a practitioner.

2. In the tabular setting, we show equivalent effect abstraction can improve the planning efficiency of model-based algorithms and the sample efficiency of model-free algorithms.

3. In the deep RL setting, we show equivalent effect abstraction can be learned from experience and then leveraged to improve sample efficiency.

In Section 2 we formally describe MDP homomorphisms and then introduce equivalent effect abstraction in Section 3. In Section 4 we empirically validate our approach using benchmarks from the MDP homomorphism literature. Related work, limitations and conclusions are in Sections 5, 6 and 7.
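To make the approach concrete before the formal treatment, below is a minimal tabular sketch in the spirit of equivalent effect abstraction: Q-values are keyed by the next state that a partial dynamics model (tabulated from experience) predicts, so every state-action pair leading to the same cell shares one value. The gridworld, hyperparameters and variable names are our own illustrative assumptions, not the paper's implementation.

```python
# Sketch: tabular Q-learning where values are shared between state-action
# pairs that a learned partial dynamics model predicts lead to the same state.
import random
from collections import defaultdict

random.seed(0)
N = 4                                   # hypothetical N x N gridworld
GOAL = (N - 1, N - 1)                   # reward 1 on reaching the goal cell
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]

def step(s, a):
    """Deterministic dynamics: move and clip to the grid."""
    s2 = (min(max(s[0] + a[0], 0), N - 1), min(max(s[1] + a[1], 0), N - 1))
    return s2, (1.0 if s2 == GOAL else 0.0)

model = {}                              # partial dynamics model: (s, a) -> s'
Q = defaultdict(float)                  # abstract table, keyed by predicted s'

def q(s, a):
    # Fall back to the raw pair until the model has seen (s, a).
    return Q[model[(s, a)]] if (s, a) in model else Q[(s, a)]

gamma, alpha, eps = 0.9, 0.5, 0.2
for episode in range(200):
    s = (0, 0)
    for t in range(50):
        if random.random() < eps:       # epsilon-greedy exploration
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda a_: q(s, a_))
        s2, r = step(s, a)
        model[(s, a)] = s2              # update the partial dynamics model
        target = r + gamma * max(q(s2, a_) for a_ in ACTIONS)
        Q[s2] += alpha * (target - Q[s2])   # one update serves all pairs
        s = s2                              # that lead to s2
        if s == GOAL:
            break
```

Because the update writes to `Q[s2]`, a single experienced transition improves the value estimate of every action, from every neighbouring cell, that reaches `s2`.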

2. MDP HOMOMORPHISMS

Using the definition from Silver (2015), an MDP M can be described by a tuple ⟨S, A, P, R, γ⟩ where S is the set of all states, A is the set of all actions, R(s, a) = E[R_{t+1} | S_t = s, A_t = a] is the reward function that determines the scalar reward received at each state, P(s′ | s, a) = P[S_{t+1} = s′ | S_t = s, A_t = a] is the transition function of the environment describing the probability of moving from one state to another for a given action, and γ ∈ [0, 1] is the discount factor describing how much an agent should favour immediate rewards over those in future states. An agent interacts with an environment through its policy π(a | s) = P[A_t = a | S_t = s] (Silver, 2015), which maps the current state to a given action. To solve an MDP, an RL agent must develop a policy that maximises the return G, which is equal to the sum of discounted future rewards G = Σ_{t=0}^{T} γ^t R_{t+1} (where t is the current timestep and T is the number of timesteps in a learning episode). It is worth briefly mentioning that equivalent effect abstraction assumes an MDP definition where the reward function is defined by a given state rather than by how an agent travels to that state (i.e. reward functions are defined as R(s′) rather than R(s, a, s′)); in the vast majority of RL benchmarks this is a safe assumption. Ravindran & Barto (2001) introduced the concept of a homomorphism which, using the notation and definitions from van der Pol et al. (2020b), is a mapping h = ⟨σ, {α_s | s ∈ S}⟩ from an agent's



Figure 1: State-action pairs that lead to the same state often have equivalent values (shown by the purple arrows). Rather than learning these equivalent values individually through reinforcement, we learn the value of one abstract action that represents both purple arrows. These values are looked up and learned with a backwards dynamics model during training, meaning no prior knowledge is required from a practitioner.
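As a concrete illustration of the return G = Σ_{t=0}^{T} γ^t R_{t+1} defined in Section 2, here is a minimal sketch (the function name is our own, not from the paper):

```python
# Minimal sketch: discounted return G = sum_{t=0}^{T} gamma^t * R_{t+1}
# for a finite sequence of rewards from one learning episode.
def discounted_return(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A reward of 1 arriving two steps in the future is discounted twice.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # gamma**2 * 1.0 = 0.81
```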

