A SIMPLE APPROACH FOR STATE-ACTION ABSTRACTION USING A LEARNED MDP HOMOMORPHISM

Abstract

Animals are able to rapidly infer, from limited experience, when sets of state-action pairs have equivalent reward and transition dynamics. Modern reinforcement learning systems, on the other hand, must painstakingly learn through trial and error that sets of state-action pairs are value equivalent, often requiring a prohibitively large number of samples from their environment. MDP homomorphisms have been proposed that reduce the observed MDP of an environment to an abstract MDP, enabling more sample-efficient policy learning. Consequently, impressive improvements in sample efficiency have been achieved when a suitable MDP homomorphism can be constructed a priori, usually by exploiting a practitioner's knowledge of environment symmetries. We propose a novel approach to constructing a homomorphism in discrete action spaces, which uses a partial model of environment dynamics to infer which state-action pairs lead to the same state, reducing the size of the state-action space by a factor equal to the cardinality of the action space. We call this method equivalent effect abstraction. On MDP homomorphism benchmarks, we demonstrate improved sample efficiency over previous attempts to learn MDP homomorphisms, while achieving comparable sample efficiency to approaches that rely on prior knowledge of environment symmetries.

1. INTRODUCTION

Modern reinforcement learning (RL) agents outperform humans in previously impregnable benchmarks such as Go (Silver et al., 2016) and Starcraft (Vinyals et al., 2019). However, the computational expense of RL hinders its deployment in promising real-world applications. In environments with large state spaces, RL agents demand hundreds of millions of samples (or even hundreds of billions) to learn a policy, either within an environment model (Hafner et al., 2020) or by direct interaction (Mnih et al., 2013). Function approximation can enable some generalisation within a large state space, but most RL agents still struggle to extrapolate value judgements to equivalent states. In contrast, animals can easily abstract away details about states that do not affect their values. For example, a foraging mouse understands that approaching a goal state (shown as a piece of cheese in Figure 1) while travelling east will have the same value as approaching the same goal state from the west. These sorts of state abstractions have been defined in RL as Markov decision process (MDP) homomorphisms (Ravindran & Barto, 2001; van der Pol et al., 2020b). MDP homomorphisms reduce large state-action spaces to smaller abstract state-action spaces by collapsing equivalent state-action pairs in an observed MDP onto a single abstract state-action pair in an abstract MDP (van der Pol et al., 2020b). Given a mapping between an abstract MDP and an experienced MDP, policies can be learned efficiently in the smaller abstract space and then mapped back to the experienced MDP when interacting with the environment (Ravindran & Barto, 2001). However, obtaining mappings to and from the abstract state-action space is challenging.
Previous works hard code homomorphisms into a policy (van der Pol et al., 2020b), but learning homomorphic mappings online is an unsolved problem. We develop equivalent effect abstraction, a method that constructs MDP homomorphisms from experience via a dynamics model, leveraging the fact that state-action pairs leading to the same next state usually have equivalent value. Consider the example of a gridworld maze: moving to a given cell has the same value whether you approached it from the right or the left. Therefore, if we know the
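To make the gridworld intuition concrete, the sketch below collapses state-action pairs by the next state they produce under a known deterministic transition model. All names here (the grid size, action set, and `step` function) are illustrative assumptions, not the paper's implementation; in equivalent effect abstraction the dynamics model would be learned from experience rather than given.

```python
# Minimal sketch: grouping state-action pairs that have equivalent effect
# in a small deterministic gridworld (hypothetical example environment).
from collections import defaultdict

ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # east, west, south, north
SIZE = 4  # 4x4 grid, assumed for illustration

def step(state, action):
    """Deterministic transition model: move one cell, clipped to the grid."""
    x, y = state
    dx, dy = action
    nx = min(max(x + dx, 0), SIZE - 1)
    ny = min(max(y + dy, 0), SIZE - 1)
    return (nx, ny)

# Collapse every (state, action) pair onto the next state it produces.
# Pairs that reach the same cell share one abstract entry, so a single
# learned value per next state covers every pair that leads to it.
abstract = defaultdict(list)
for x in range(SIZE):
    for y in range(SIZE):
        for a in ACTIONS:
            abstract[step((x, y), a)].append(((x, y), a))

print(len(abstract), "abstract states cover",
      SIZE * SIZE * len(ACTIONS), "state-action pairs")
```

In this toy grid, 64 state-action pairs collapse onto 16 abstract states, a reduction by the cardinality of the action space, which is the best case described in the abstract.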

