MODEL-BASED MICRO-DATA REINFORCEMENT LEARNING: WHAT ARE THE CRUCIAL MODEL PROPERTIES AND WHICH MODEL TO CHOOSE?

Abstract

We contribute to micro-data model-based reinforcement learning (MBRL) by rigorously comparing popular generative models using a fixed (random shooting) control agent. We find that on an environment that requires multimodal posterior predictives, mixture density nets outperform all other models by a large margin. When multimodality is not required, our surprising finding is that probabilistic posterior predictives are not needed: deterministic models are on par, and in fact consistently (although not significantly) outperform their probabilistic counterparts. We also find that heteroscedasticity at training time, perhaps acting as a regularizer, improves predictions at longer horizons. On the methodological side, we design metrics and an experimental protocol that can be used to evaluate the various models and to predict their asymptotic performance on the control problem. Using this framework, we improve the state-of-the-art sample complexity of MBRL on Acrobot by a factor of two to four, using an aggressive training schedule outside the hyperparameter range usually considered.

1. INTRODUCTION

Unlike computers, physical systems do not get faster with time (Chatzilygeroudis et al., 2020). This is arguably one of the main reasons why recent beautiful advances in deep reinforcement learning (RL) (Silver et al., 2018; Vinyals et al., 2019; Badia et al., 2020) stay mostly in the realm of simulated worlds and do not immediately translate to practical success in the real world. Our long-term research agenda is to bring RL to the control of real engineering systems. This effort is hindered by slow data generation and rigorously controlled access to the systems. Micro-data RL is the term for using RL on systems where the main bottleneck or source of cost is access to data (as opposed to, for example, computational power); the term was introduced in robotics research (Mouret, 2016; Chatzilygeroudis et al., 2020). This regime requires performance metrics that put as much emphasis on sample complexity (learning speed with respect to sample size) as on asymptotic performance, and algorithms designed to make efficient use of small data. Engineering systems are both tightly controlled for safety and security reasons and physical by nature (so they do not get faster with time), making them a primary target of micro-data RL. At the same time, engineering systems are the backbone of today's industrial world: controlling them better may lead to multi-billion-dollar savings per year, even if we only consider energy efficiency.[1]

Model-based RL (MBRL) builds predictive models of the system based on historical data (logs, trajectories), referred to here as traces. Besides improving the sample complexity of model-free RL by orders of magnitude (Chua et al., 2018), these models can also contribute to adoption on the human side: system engineers can "play" with the models (data-driven generic "neural" simulators) and build trust gradually, instead of having to adopt a black-box control algorithm at once (Argenson & Dulac-Arnold, 2020).
Engineering systems suit MBRL particularly well in the sense that most system variables that are measured and logged are relevant, either to be fed to classical control or to a human operator. This means that, as opposed to games in which only a few variables (pixels) are relevant for winning, learning a forecasting model for the full set of logged variables in engineering systems is arguably an efficient use of predictive power. It also combines well with the micro-data learning principle of using every bit of the data to learn about the system. Robust and computationally efficient probabilistic generative models are the crux of many machine learning applications. In particular, they are one of the important bottlenecks in MBRL (Deisenroth & Rasmussen, 2011; Ke et al., 2019; Chatzilygeroudis et al., 2020). System modelling for MBRL is essentially a supervised learning problem with AutoML (Zhang et al., 2021): models need to be retrained and, if needed, even retuned hundreds of times, on different distributions and data sets whose size may vary by orders of magnitude, with little human supervision. That said, there is little prior work on rigorous comparisons of system modelling algorithms. Models are often part of a larger system, experiments are slow, and it is hard to know whether the limitation or success comes from the model or from the control learning algorithm. System modelling is hard because i) data sets are non-i.i.d., and ii) classical metrics on static data sets may not be predictive of performance on the dynamic system. There is no canonical data-generating distribution as assumed on the first page of machine learning textbooks, which makes it hard to adopt the classical train/test paradigm. At the same time, predictive system modelling is a great playground, and it can be considered an instantiation of self-supervised learning, which some consider the "greatest challenge in ML and AI of the next few years".[2]
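To make the supervised-learning view concrete, here is a minimal sketch of turning logged traces into one-step training pairs. The helper name and the delta-target convention are our own illustration, not the paper's implementation (predicting the change in observables rather than the absolute next state is a common MBRL choice, e.g. Chua et al., 2018):

```python
import numpy as np

def one_step_pairs(traces):
    """Turn logged trajectories into supervised (input, target) pairs.

    Each trace is (states, actions) with states of shape (T+1, d) and
    actions of shape (T, k); the model learns to predict the change in
    the observables from the current state and action.
    """
    X, Y = [], []
    for states, actions in traces:
        X.append(np.hstack([states[:-1], actions]))   # inputs: (s_t, a_t)
        Y.append(states[1:] - states[:-1])            # targets: s_{t+1} - s_t
    return np.vstack(X), np.vstack(Y)
```

Retraining the model hundreds of times, as described above, then amounts to rerunning this extraction on the growing trace buffer and refitting on the result.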
We propose to compare popular probabilistic models on the Acrobot system to study the model properties required to achieve state-of-the-art performance. We believe that such ablation studies are missing from existing "horizontal" benchmarks, whose main focus is on state-of-the-art combinations of models and planning strategies (Wang et al., 2019). We start from a family of flexible probabilistic models, autoregressive mixtures learned by deep neural nets (DARMDN) (Bishop, 1994; Uria et al., 2013), and assess the performance of its members when removing autoregressivity, multimodality, and heteroscedasticity. We favor this family of models because it is easy i) to compare them on static data, since they come with an exact likelihood, ii) to simulate from them, and iii) to incorporate prior knowledge on feature types. Their greatest advantage is modelling flexibility: they can be trained with a loss allowing heteroscedasticity and, unlike Gaussian processes (Deisenroth & Rasmussen, 2011; Deisenroth et al., 2014), deterministic neural nets (Nagabandi et al., 2018; Lee et al., 2019), multivariate Gaussian mixtures (Chua et al., 2018), variational autoencoders (VAE) (Kingma & Welling, 2014; Rezende et al., 2014), and normalizing flows (Rezende & Mohamed, 2015), deep (autoregressive) mixture density nets can naturally and effortlessly represent a multimodal posterior predictive and what we will call y-interdependence (dependence among system observables even after conditioning on the history). We chose Acrobot with continuous rewards (Sutton, 1996; Wang et al., 2019), which we could call the "MNIST of MBRL", for three reasons. First, it is simple enough to answer experimental questions rigorously, yet it exhibits some properties of more complex environments, so we believe that our findings will contribute to solving higher-dimensional systems with better sample efficiency, as well as to better understanding the existing state-of-the-art solutions.
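To illustrate the mixture-density-net idea, here is a minimal numpy sketch (our own illustration, not the paper's DARMDN implementation) of the exact mixture negative log-likelihood such a net head is trained with, for a scalar target. Heteroscedasticity comes from the net predicting per-example log-standard-deviations; multimodality from using K > 1 components:

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along an axis."""
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def mdn_nll(y, logit_pi, mu, log_sigma):
    """Exact negative log-likelihood of scalar targets y (shape (n,))
    under a Gaussian mixture; logit_pi, mu, log_sigma (each (n, K)) are
    what a neural-net head would output per example."""
    # log mixing weights via log-softmax
    log_w = logit_pi - logsumexp(logit_pi, axis=1)[:, None]
    # per-component Gaussian log-density
    z = (y[:, None] - mu) / np.exp(log_sigma)
    log_comp = -0.5 * z**2 - log_sigma - 0.5 * np.log(2 * np.pi)
    return -logsumexp(log_w + log_comp, axis=1).mean()
```

For a multivariate target, the autoregressive variant would use one such head per observable dimension, each conditioned on the previously generated dimensions, which is what yields the y-interdependence described above.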
Second, Acrobot is one of the systems where i) random shooting applied to the real dynamics is state of the art in an experimental sense, and ii) random shooting combined with good models is the best approach among MBRL (and even model-free) techniques (Wang et al., 2019). This means that by matching the optimal performance, we essentially "solve" Acrobot with a sample complexity that will be hard to beat. Third, using a single system allows both a deeper and a simpler investigation of what might explain the success of popular methods. Although studying scientific hypotheses on a single system is not without precedent (Abbas et al., 2020), we leave open the possibility that our findings are valid only on Acrobot (in which case we definitely need to understand what makes Acrobot special). There are three complementary explanations for why model limitations lead to suboptimal performance in MBRL (compared to model-free RL). First, MBRL learns fast, but it converges to suboptimal models because of the lack of exploration down the line (Schaul et al., 2019; Abbas et al., 2020). We argue that there might be a second reason: the limited approximation capacity of these models. The two reasons may be intertwined: not only do we require the model family to contain the real system dynamics, but we also want it to be able to represent posterior predictive distributions that i) are consistent with the limited data used to train the model, ii) are consistent with (learnable) physical constraints of the system, and iii) allow efficient exploration. This is not the "classical" notion of approximation; it may not be alleviated by simply adding more capacity to the function representation, and needs to be tackled by properly defining the output of the model. Third, models are trained to predict the system one step ahead, while the planners need unbiased multi-step predictions.
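As a sketch of how the learned model is consumed downstream, the following shows a generic random-shooting planner; the `model(state, action) -> (next_state, reward)` interface is our own assumption for illustration, not the paper's code. Note how the multi-step rollout compounds one-step prediction errors, which is exactly the mismatch raised in the third point above:

```python
import numpy as np

def random_shooting(model, state, horizon, n_candidates, action_dim, rng):
    """Random-shooting planner: sample candidate action sequences, roll
    each one out through the learned model, and return the first action
    of the sequence with the highest predicted return."""
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        seq = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, total = state, 0.0
        for a in seq:            # multi-step rollout through the model
            s, r = model(s, a)
            total += r
        if total > best_return:
            best_return, best_first_action = total, seq[0]
    return best_first_action
```

In a receding-horizon loop, only this first action is executed on the system before replanning from the newly observed state.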



[1] 1% of the yearly energy cost of the US manufacturing sector is roughly a billion dollars [link, link].
[2] https://www.facebook.com/722677142/posts/10155934004262143/

