MDP PLAYGROUND: CONTROLLING ORTHOGONAL DIMENSIONS OF HARDNESS IN TOY ENVIRONMENTS

Anonymous

Abstract

We present MDP Playground, an efficient benchmark for Reinforcement Learning (RL) algorithms with various dimensions of hardness that can be controlled independently to challenge algorithms in different ways and to obtain varying degrees of hardness in generated environments. We consider and allow control over a wide variety of key hardness dimensions, including delayed rewards, rewardable sequences, sparsity of rewards, stochasticity, image representations, irrelevant features, time unit, and action range. While running RL algorithms on standard benchmarks is very time-consuming, we define a parameterised collection of fast-to-run toy benchmarks in OpenAI Gym by varying these dimensions. Despite their toy nature and low compute requirements, we show that these benchmarks present substantial challenges to current RL algorithms. Furthermore, since we can generate environments with a desired value for each of the dimensions, we not only have fine-grained control over the environments' hardness but also have the ground truth available for evaluating algorithms. Finally, we evaluate the extent to which insights about these dimensions may be expected to transfer from our benchmarks to more complex benchmarks. We believe that MDP Playground is a valuable testbed for researchers designing new, adaptive and intelligent RL algorithms and for those wanting to unit test their algorithms.

1. INTRODUCTION

RL has succeeded at many disparate tasks, such as helicopter aerobatics, game-playing and continuous control (Abbeel et al., 2010; Mnih et al., 2015; Silver et al., 2016; Chua et al., 2018; Fujimoto et al., 2018; Haarnoja et al., 2018), and yet much of the low-level workings of RL algorithms is not well understood. This is exacerbated by the absence of a unifying benchmark for all of RL. There are many different types of benchmarks, as many as there are kinds of tasks in RL (e.g. Todorov et al., 2012; Bellemare et al., 2013; Cobbe et al., 2019; Osband et al., 2019), each specialising in a specific kind of task. And yet the underlying assumptions are nearly always those of a Markov Decision Process (MDP). We propose a benchmark which distills difficulties for MDPs that generalise across RL problems and which allows controlling these difficulties for more precise experiments.

RL algorithms face a variety of challenges in environments. For example, when the underlying environment is an MDP but the information state, i.e., the state representation used by the agent, is not Markov, we have partial observability (see, e.g., Mnih et al., 2015; Jaakkola et al., 1995). There are additional aspects of environments, such as irrelevant features, multiple representations for the same state, and the action range, that significantly affect agent performance. We aim to study what kinds of failure modes can occur when an agent is faced with such environments, and to give other researchers the benefit of the same platform, so that they can create their own experiments and gain high-level insights. We identify dimensions of MDPs which help characterise environments, and we provide a platform with different instantiations of these dimensions in order to better understand the workings of different RL algorithms. To this end, we implemented a Python package, MDP Playground, that gives us complete control over these dimensions to generate flexible environments.
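To make one such dimension concrete, the following is a minimal, self-contained sketch of a toy MDP with a controllable reward-delay dimension. This is purely illustrative and is not MDP Playground's actual API; the class name, parameters (`n_states`, `delay`), and the Gym-style `reset`/`step` interface are all assumptions for exposition.

```python
class DelayedRewardToyMDP:
    """Toy discrete MDP illustrating one hardness dimension: reward delay.

    NOTE: an illustrative sketch, not MDP Playground's API. The agent moves
    on a ring of `n_states` states with 2 actions (0 = stay, 1 = advance);
    reaching the last state yields reward 1, but every reward is withheld
    for `delay` time steps before being emitted to the agent.
    """

    def __init__(self, n_states=8, delay=0):
        self.n_states = n_states
        self.delay = delay
        self.reward_buffer = []  # FIFO queue holding withheld rewards
        self.state = 0

    def reset(self):
        self.state = 0
        # Pre-fill the buffer so the first `delay` emitted rewards are 0.
        self.reward_buffer = [0.0] * self.delay
        return self.state

    def step(self, action):
        # Deterministic ring transition: action 1 advances, 0 stays put.
        if action == 1:
            self.state = (self.state + 1) % self.n_states
        # Underlying (undelayed) reward: +1 for reaching the goal state.
        immediate = 1.0 if self.state == self.n_states - 1 else 0.0
        # Push the new reward and emit the one from `delay` steps ago.
        self.reward_buffer.append(immediate)
        reward = self.reward_buffer.pop(0)
        return self.state, reward, False, {}
```

With `delay=0` the agent sees the reward at the step it reaches the goal; with `delay=2` the same reward arrives two steps later, so a credit-assignment mechanism must bridge that gap. Other dimensions named above (sparsity, stochasticity, irrelevant features) could be parameterised analogously.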
Furthermore, the environments that RL algorithms are commonly tested on tend to take a long time to run. For example, a single DQN run on Atari took us 4 CPU days and 64 GB of memory. Our

