MDP PLAYGROUND: CONTROLLING ORTHOGONAL DIMENSIONS OF HARDNESS IN TOY ENVIRONMENTS

Anonymous

Abstract

We present MDP Playground, an efficient benchmark for Reinforcement Learning (RL) algorithms with various dimensions of hardness that can be controlled independently to challenge algorithms in different ways and to obtain varying degrees of hardness in generated environments. We consider and allow control over a wide variety of key hardness dimensions, including delayed rewards, rewardable sequences, sparsity of rewards, stochasticity, image representations, irrelevant features, time unit, and action range. While it is very time consuming to run RL algorithms on standard benchmarks, we define a parameterised collection of fast-to-run toy benchmarks in OpenAI Gym by varying these dimensions. Despite their toy nature and low compute requirements, we show that these benchmarks present substantial challenges to current RL algorithms. Furthermore, since we can generate environments with a desired value for each of the dimensions, in addition to having fine-grained control over the environments' hardness, we also have the ground truth available for evaluating algorithms. Finally, we evaluate what kinds of transfer from our benchmarks to more complex benchmarks may be expected for these dimensions. We believe that MDP Playground is a valuable testbed for researchers designing new, adaptive and intelligent RL algorithms and for those wanting to unit test their algorithms.

1. INTRODUCTION

RL has succeeded at many disparate tasks, such as helicopter aerobatics, game playing and continuous control (Abbeel et al., 2010; Mnih et al., 2015; Silver et al., 2016; Chua et al., 2018; Fujimoto et al., 2018; Haarnoja et al., 2018), and yet much of the low-level workings of RL algorithms is not well understood. This is exacerbated by the absence of a unifying benchmark for all of RL. There are many different types of benchmarks, as many as there are kinds of tasks in RL (e.g. Todorov et al., 2012; Bellemare et al., 2013; Cobbe et al., 2019; Osband et al., 2019), each specialising in a specific kind of task. And yet the underlying assumptions are nearly always those of a Markov Decision Process (MDP). We propose a benchmark which distills difficulties for MDPs that generalise across RL problems and allows us to control these difficulties for more precise experiments.

RL algorithms face a variety of challenges in environments. For example, when the underlying environment is an MDP but the information state, i.e., the state representation used by the agent, is not Markov, we have partial observability (see, e.g., Mnih et al., 2015; Jaakkola et al., 1995). Additional aspects of environments, such as irrelevant features, multiple representations for the same state, and the action range, also significantly affect agent performance. We aim to study what kinds of failure modes can occur when an agent is faced with such environments, and to give other researchers the benefit of the same platform so they can create their own experiments and gain high-level insights. We identify dimensions of MDPs which help characterise environments, and we provide a platform with different instantiations of these dimensions in order to better understand the workings of different RL algorithms. To this end, we implemented a Python package, MDP Playground, that gives us complete control over these dimensions to generate flexible environments.
Furthermore, commonly used environments that RL algorithms are tested on tend to take a long time to run. For example, a single DQN run on Atari took us 4 CPU days and 64 GB of memory. Our platform can be used as a low-cost testbed early in the RL agent development pipeline to develop and unit test algorithms and to gain quick, coarse insights into algorithms. The main contributions of this paper are:

• We identify dimensions of MDPs that have a significant effect on agent performance, both for discrete and continuous environments;
• We open-source a platform with fine-grained control over these dimensions; baseline RL algorithms can be run on the resulting environments in as little as 30 seconds on a single core of a laptop;
• We study the impact of the dimensions on baseline agents in our environments;
• We evaluate transfer of some of these dimensions to more complex environments.
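To give a concrete sense of how small such a toy environment can be, the following is a minimal sketch of a deterministic chain MDP together with a rollout helper. The class, its chain dynamics and the helper are illustrative stand-ins written for this sketch, not MDP Playground's actual API.

```python
# A minimal sketch of a deterministic MDP given as the 7-tuple
# (S, A, P, R, rho_0, gamma, T). All names here are illustrative.
class DeterministicMDP:
    def __init__(self, n_states=5, gamma=0.9):
        self.S = list(range(n_states))   # state set S: a chain of states
        self.A = [0, 1]                  # action set A: 0 = left, 1 = right
        self.gamma = gamma               # discount factor
        self.T = {n_states - 1}          # terminal states: rightmost state
        self.rho_0 = 0                   # deterministic initial state

    def P(self, s, a):
        """Deterministic transition P: S x A -> S (clipped chain walk)."""
        return max(0, min(len(self.S) - 1, s + (1 if a == 1 else -1)))

    def R(self, s, a):
        """Reward R: S x A -> R; +1 for stepping into a terminal state."""
        return 1.0 if self.P(s, a) in self.T else 0.0


def rollout_return(mdp, policy, max_steps=50):
    """Discounted return of a deterministic policy from the initial state."""
    s, ret, disc = mdp.rho_0, 0.0, 1.0
    for _ in range(max_steps):
        a = policy(s)
        ret += disc * mdp.R(s, a)
        disc *= mdp.gamma
        s = mdp.P(s, a)
        if s in mdp.T:
            break
    return ret
```

On a 5-state chain, a policy that always moves right collects the terminal reward on its fourth step, i.e., a return of γ³; a rollout of this kind completes in microseconds, which is what makes sweeping many environment configurations cheap.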

2. DIMENSIONS OF HARDNESS

We begin by defining a basic deterministic MDP, followed by a POMDP, and then motivate the dimensions of hardness. We define a deterministic MDP as a 7-tuple (S, A, P, R, ρ_0, γ, T), where S is the set of states, A is the set of actions, P : S × A → S describes the transition dynamics, R : S × A → R describes the reward dynamics, ρ_0 : S → R+ is the initial state distribution, γ is the discount factor and T is the set of terminal states. We define a POMDP as a 9-tuple (S, O, A, P, Ω, R, ρ_0, γ, T), where O represents the set of observations, Ω : S × A × O → [0, 1] describes the probability of an observation given a state and action, and the rest of the terms have the same meaning as for the MDP above.

To identify the dimensions of hardness, we went over the components of MDPs and POMDPs and tried to exhaustively list dimensions that could make an environment harder. This resulted in many dimensions and a highly parameterisable platform. To aid understanding, we categorise the dimensions according to the component of the (PO)MDP they affect. To clarify terminology, we will use information state to mean the state representation used by the agent, and belief state to be equivalent to the full observation history. If the belief state were used as the information state by an agent, it would be sufficient to compute an optimal policy. However, since the full observation history is not tractable to store for many environments, agents in practice stack the last few observations to use as their information state, which renders it non-Markov.

An implicit assumption of many agents is that the immediate reward depends only on the current information state and action. However, this is not true even for many simple environments. In many situations, agents receive delayed rewards (see, e.g., Arjona-Medina et al., 2019). For example, shooting at an enemy ship in Space Invaders leads to a reward much later than the action of shooting.
Any action taken after that is inconsequential to obtaining the reward for destroying that enemy ship.

In many environments, we obtain a reward for a sequence of actions taken, and not just for the current information state and action. A simple example is executing a tennis serve, where we need a sequence of actions which results in a point if we serve an ace. In contrast to delayed rewards, rewardable sequences capture exactly those actions that are consequential to obtaining a reward. Sutton et al. (1999) present a framework for temporal abstraction in RL to deal with such sequences.

Environments can also be characterised by their reward sparsity (Gaina et al., 2019), i.e., the supervisory reward signal is 0 throughout the trajectory and a single non-zero reward is received at its end. This also holds for our example of the tennis serve above.

Another characteristic of environments that can significantly impact the performance of algorithms is stochasticity. The environment, i.e., the dynamics P and R, may be stochastic, or may seem stochastic to the agent due to partial observability or sensor noise. A robot equipped with a rangefinder, for example, has to deal with various sources of noise in its sensors (Thrun et al., 2005).

Environments also tend to have many irrelevant features (Rajendran et al., 2018) that one need not focus on. For example, in certain racing car games, though the whole screen is visible, we only need to concentrate on the road, and an agent would be more memory-efficient if it did. This holds both for table-based learners and for function approximators such as Neural Networks (NNs). NNs can additionally fit even random noise (Zhang et al., 2017), so irrelevant features are likely to degrade their performance.
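One way such dimensions could be exposed is as knobs in a configuration dict that parameterises a generated environment. The sketch below shows this for two stochasticity knobs; the factory function, the key names ("transition_noise", "reward_noise") and the toy dynamics are hypothetical illustrations, not MDP Playground's actual configuration interface.

```python
import random

def make_env(config, seed=0):
    """Build a toy chain environment from a dict of hardness knobs."""
    rng = random.Random(seed)
    n = config.get("n_states", 8)

    class ToyEnv:
        def __init__(self):
            self.s = 0

        def reset(self):
            self.s = 0
            return self.s

        def step(self, action):
            intended = (self.s + action) % n
            # transition_noise: with this probability, jump to a random state
            if rng.random() < config.get("transition_noise", 0.0):
                intended = rng.randrange(n)
            self.s = intended
            reward = 1.0 if self.s == n - 1 else 0.0
            # reward_noise: additive Gaussian noise on the scalar reward
            reward += rng.gauss(0.0, config.get("reward_noise", 0.0))
            return self.s, reward, False

    return ToyEnv()
```

Setting both noise knobs to 0 recovers a deterministic environment, which is what makes a known ground-truth optimal policy available for evaluating agents at every setting of the dimensions.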

