MODEL-BASED MICRO-DATA REINFORCEMENT LEARNING: WHAT ARE THE CRUCIAL MODEL PROPERTIES AND WHICH MODEL TO CHOOSE?

Abstract

We contribute to micro-data model-based reinforcement learning (MBRL) by rigorously comparing popular generative models using a fixed (random shooting) control agent. We find that on an environment that requires multimodal posterior predictives, mixture density nets outperform all other models by a large margin. When multimodality is not required, our surprising finding is that we do not need probabilistic posterior predictives: deterministic models are on par, and in fact they consistently (although not significantly) outperform their probabilistic counterparts. We also find that heteroscedasticity at training time, perhaps acting as a regularizer, improves predictions at longer horizons. On the methodological side, we design metrics and an experimental protocol which can be used to evaluate the various models, predicting their asymptotic performance when they are used on the control problem. Using this framework, we improve the state-of-the-art sample complexity of MBRL on Acrobot by a factor of two to four, using an aggressive training schedule which is outside of the hyperparameter interval usually considered.

1. INTRODUCTION

Unlike computers, physical systems do not get faster with time (Chatzilygeroudis et al., 2020). This is arguably one of the main reasons why recent beautiful advances in deep reinforcement learning (RL) (Silver et al., 2018; Vinyals et al., 2019; Badia et al., 2020) stay mostly in the realm of simulated worlds and do not immediately translate to practical success in the real world. Our long term research agenda is to bring RL to controlling real engineering systems. Our effort is hindered by slow data generation and rigorously controlled access to the systems. Micro-data RL is the term for using RL on systems where the main bottleneck or source of cost is access to data (as opposed to, for example, computational power). The term was introduced in robotics research (Mouret, 2016; Chatzilygeroudis et al., 2020). This regime requires performance metrics that put as much emphasis on sample complexity (learning speed with respect to sample size) as on asymptotic performance, and algorithms that are designed to make efficient use of small data. Engineering systems are both tightly controlled for safety and security reasons, and physical by nature (so do not get faster with time), making them a primary target of micro-data RL. At the same time, engineering systems are the backbone of today's industrial world: controlling them better may lead to multi-billion dollar savings per year, even if we only consider energy efficiency.¹

Model-based RL (MBRL) builds predictive models of the system based on historical data (logs, trajectories), referred to here as traces. Besides improving the sample complexity of model-free RL by orders of magnitude (Chua et al., 2018), these models can also contribute to adoption from the human side: system engineers can "play" with the models (data-driven generic "neural" simulators) and build trust gradually instead of having to adopt a black-box control algorithm at once (Argenson & Dulac-Arnold, 2020).
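The fixed control agent mentioned in the abstract, random shooting, can be sketched as follows. The `model(state, action) -> (next_state, reward)` interface, the uniform action sampling range, and all hyperparameter values are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

def random_shooting(model, state, horizon=10, n_candidates=1000,
                    action_dim=1, rng=None):
    """Return the first action of the highest-return random action sequence.

    `model(state, action)` is assumed to return (next_state, reward);
    in practice `model` would be the learned generative model.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Sample candidate action sequences uniformly in [-1, 1] (an assumption).
    candidates = rng.uniform(-1.0, 1.0,
                             size=(n_candidates, horizon, action_dim))
    returns = np.zeros(n_candidates)
    for i, seq in enumerate(candidates):
        s = state
        for a in seq:
            # Roll the candidate sequence through the learned model.
            s, r = model(s, a)
            returns[i] += r
    # Execute only the first action; replan at the next step (MPC style).
    return candidates[np.argmax(returns), 0]
```

At each control step only the first action of the best sequence is executed and planning is repeated, which is why the quality of the learned model over the planning horizon matters so much.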
Engineering systems suit MBRL particularly well in the sense that most system variables that are measured and logged are relevant, either to be fed to classical control or to a human operator. This means that, as opposed to games in which only a few variables (pixels) are relevant for winning, learning a forecasting model in engineering systems for the full set of logged variables is arguably an efficient use of predictive power. It also combines well with the micro-data learning principle of using every bit of the data to learn about the system.
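A mixture density net, the model family that the abstract reports as best when multimodal posterior predictives are required, outputs the parameters of a Gaussian mixture and is trained by negative log-likelihood. A minimal one-dimensional sketch of that loss (the function and argument names are ours, and a real model would produce these parameters from a network conditioned on state and action):

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log(sum(exp(x)))."""
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

def mdn_nll(pi_logits, mu, log_sigma, y):
    """Negative log-likelihood of scalar target y under a K-component
    1-D Gaussian mixture with logits pi_logits, means mu, and log
    standard deviations log_sigma (all arrays of shape (K,))."""
    log_pi = pi_logits - logsumexp(pi_logits)          # log mixture weights
    sigma = np.exp(log_sigma)
    # Per-component Gaussian log-densities at y.
    log_comp = (-0.5 * np.log(2.0 * np.pi) - log_sigma
                - 0.5 * ((y - mu) / sigma) ** 2)
    return -logsumexp(log_pi + log_comp)
```

With a single component this reduces to the usual Gaussian NLL; with several components the model can place probability mass on multiple plausible next states, which a unimodal posterior predictive cannot.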



¹1% of the yearly energy cost of the US manufacturing sector is roughly a billion dollars [link, link].

