DEEP AUTOREGRESSIVE DENSITY NETS VS NEURAL ENSEMBLES FOR MODEL-BASED OFFLINE REINFORCEMENT LEARNING

Abstract

We consider the problem of offline reinforcement learning, where only a set of system transitions is made available for policy optimization. Following recent advances in the field, we consider a model-based reinforcement learning algorithm that infers the system dynamics from the available data and performs policy optimization on imaginary model rollouts. This approach is vulnerable to exploiting model errors, which can lead to catastrophic failures on the real system. The standard solution is to rely on ensembles for uncertainty heuristics and to avoid exploiting the model where it is too uncertain. We challenge the popular belief that we must resort to ensembles by showing that better performance can be obtained with a single well-calibrated autoregressive model on the D4RL benchmark. We also analyze static metrics of model learning and identify the model properties that are important for the final performance of the agent.

1. INTRODUCTION

Reinforcement learning (RL) consists of learning a control agent (policy) by interacting with a dynamical system (environment) and collecting its feedback (rewards). This paradigm has proven able to solve some of the world's most difficult problems (Silver et al., 2017; 2018; Mnih et al., 2015; Vinyals et al., 2019). However, the scope of systems that RL can solve remains restricted to the simulated world and does not extend to real engineering systems. Two of the main reasons are i) limited data due to operational constraints and ii) the safety standards of such systems. In an attempt to bridge the gap between RL and engineering systems, we motivate the setting of offline reinforcement learning (Levine et al., 2020), which removes the need to query a dynamical system by using a previously collected dataset of controller-system interactions. In this light, we view the setting as a supervised learning problem where one tries to approximate the underlying distribution of the data at hand, hoping to generalize to out-of-distribution samples. This turns out to be difficult for classical RL algorithms because of the distribution shift between the dataset and the learned policy during training (Fujimoto et al., 2019; Levine et al., 2020). We therefore need algorithms that are well suited to offline reinforcement learning. A common idea in this field is conservatism: trusting the learned agent only when the input states are close to the support of the offline dataset. Depending on the algorithm, conservatism can take multiple forms, ranging from penalized Q-targets (Kumar et al., 2020) to uncertainty-penalized Markov decision processes (Kidambi et al., 2020; Yu et al., 2020). To develop this direction further, we distinguish between model-free and model-based RL (MBRL) algorithms.
Model-free algorithms learn a policy and/or a value function by observing reward signal realizations and the underlying dynamics of the system, which in most environments requires a significant number of interactions to achieve good performance (Haarnoja et al., 2018). In this category, one way to incorporate conservatism is to penalize the value targets of data points that are distant from the offline dataset (Kumar et al., 2020). Other methods include behavior-regularized policy optimization (Wu et al., 2020). Model-based algorithms are composed of two independent (and often alternating) steps: i) model learning, a supervised learning problem of learning the dynamics (and sometimes also the reward function) of the system of interest; and ii) policy optimization, where we sample from the learned dynamics to learn a policy and/or a value function. MBRL is known to be sample-efficient, since policy/value learning is done (completely or partially) from imaginary model rollouts (also called background planning), which are cheaper and more accessible than rollouts in the true dynamics (Janner et al., 2019). Furthermore, a predictive model with good out-of-distribution performance affords easy transfer to new tasks or areas not covered in the offline dataset (Yu et al., 2020). Conservatism in MBRL is frequently achieved by uncertainty-based penalization of the model predictions. This relies on well-calibrated estimation of the epistemic uncertainty of the learned dynamics, which is the main limitation of the approach. It is of great interest to build models that know when (and how much) they do not know; uncertainty estimation thus remains a central problem in MBRL. Many recent works have made progress in this direction (Osband et al., 2021).
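The uncertainty-based penalization mentioned above can be sketched as follows. This is a minimal NumPy sketch in the spirit of uncertainty-penalized MDPs (Yu et al., 2020; Kidambi et al., 2020); the function name, the max-deviation disagreement metric, and the toy data are illustrative assumptions, not the exact formulation of any cited work:

```python
import numpy as np

def ensemble_penalized_reward(means, rewards, lam=1.0):
    # means:   (E, D) array of each ensemble member's predicted next-state mean
    # rewards: (E,)   array of each member's predicted reward
    # Disagreement heuristic: largest deviation of any member's mean from
    # the ensemble average (one of several choices used in the literature).
    center = means.mean(axis=0)
    disagreement = np.max(np.linalg.norm(means - center, axis=1))
    # Conservative (penalized) reward fed to the policy optimizer.
    return rewards.mean() - lam * disagreement

# Toy example: 4 ensemble members predicting a 3-dimensional next state.
rng = np.random.default_rng(0)
means = rng.normal(size=(4, 3))
rewards = rng.normal(size=4)
r_pen = ensemble_penalized_reward(means, rewards, lam=0.5)
```

The penalty coefficient `lam` trades off return maximization against model exploitation: the larger the disagreement at a state-action pair, the less the agent is rewarded for visiting it in imagination.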
The most common approach to date is bootstrap ensembles: a population of predictive models (most often probabilistic neural networks) is constructed, and disagreement metrics serve as the uncertainty measure. The sources of randomness in this case are the random initialization of the network parameters and the subset of the training data that each model sees. When the environment is stochastic, ensembles help to separate aleatoric uncertainty (the intrinsic randomness of the environment) from epistemic uncertainty (Chua et al., 2018). When the environment is deterministic (which is the case for the D4RL MuJoCo benchmark environments considered in most of the offline RL literature (Fu et al., 2021a)), the error is fully epistemic: it consists of the estimation error (due to lack of training data) and the approximation error (mismatch between the model class and the true distribution) (Hüllermeier & Waegeman, 2021). This highlights the need for well-calibrated probabilistic models whose posterior variance can serve as an uncertainty measure in conservative MBRL. In this work, we compare autoregressive dynamics models (Uria et al., 2016) to ensembles of probabilistic feedforward models, both in terms of static evaluation (supervised learning metrics on the task of learning the system dynamics) and dynamic evaluation (final performance of the MBRL agent that uses the model). Autoregressive models learn a conditional distribution of each dimension of the next state, conditioned on the input of the model (current state and action) and the previously generated dimensions of the next state. Probabilistic feedforward models, in contrast, learn a multivariate distribution of the next state conditioned on the current state and action. We argue that autoregressive models can learn the implicit functional dependence between state dimensions, which makes them well calibrated, leading to good uncertainty estimates suitable for conservatism in MBRL.
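The autoregressive sampling scheme can be sketched as follows. The factorization is the one described above; the linear-Gaussian conditionals and all function names are hypothetical stand-ins for learned per-dimension networks, not the architecture used in the paper:

```python
import numpy as np

def sample_next_state(state, action, conditionals, rng):
    # Autoregressive factorization:
    #   p(s' | s, a) = prod_d p(s'_d | s, a, s'_1, ..., s'_{d-1})
    # conditionals[d] maps the context [s, a, s'_{<d}] to the (mean, std)
    # of a one-dimensional Gaussian over dimension d of the next state.
    generated = []
    for cond in conditionals:
        ctx = np.concatenate([state, action, np.array(generated)])
        mu, sigma = cond(ctx)
        generated.append(rng.normal(mu, sigma))
    return np.array(generated)

# Toy linear-Gaussian conditionals standing in for learned networks;
# dimension d conditions on the 3-dim state, 1-dim action, and d
# previously generated next-state dimensions.
def make_conditional(dim_in, rng):
    w = rng.normal(size=dim_in, scale=0.1)
    return lambda ctx: (float(w @ ctx), 0.05)

rng = np.random.default_rng(0)
state, action = rng.normal(size=3), rng.normal(size=1)
conditionals = [make_conditional(3 + 1 + d, rng) for d in range(3)]
next_state = sample_next_state(state, action, conditionals, rng)
```

Because each dimension conditions on the dimensions already generated, the model can capture cross-dimensional dependencies that a feedforward model with a factorized output distribution cannot.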
Our key contributions are the following.
• We apply autoregressive dynamics models in the context of offline model-based reinforcement learning and show that they improve over neural ensembles both in static evaluation metrics and in the final performance of the agent.
• We introduce an experimental setup that decouples model selection from agent selection to reduce the burden of hyperparameter optimization in offline RL.
• We study the impact of static metrics on the dynamic performance of the agents and conclude on the importance of single-step calibration in model-based offline RL.

2. RELATED WORK

Offline RL has been an active area of research owing to its numerous applications in domains such as robotics (Chebotar et al., 2021), healthcare (Gottesman et al., 2018), recommendation systems (Strehl et al., 2010), and autonomous driving (Kiran et al., 2022). Despite outstanding advances in online RL (Haarnoja et al., 2018; Silver et al., 2017; Mnih et al., 2015) and iterated offline RL (Wang et al., 2019; Wang & Ba, 2020; Matsushima et al., 2021; Kégl et al., 2021), offline RL remains a challenging problem due to its dependence on the data collection procedure and the potential lack of exploration (Levine et al., 2020). Although any off-policy model-free RL agent can in theory be applied to offline RL (Haarnoja et al., 2018; Degris et al., 2012; Lillicrap et al., 2016; Munos et al., 2016), it has been shown that these algorithms suffer from distribution shift and yield poor performance (Fujimoto et al., 2019;

