DEEP AUTOREGRESSIVE DENSITY NETS VS NEURAL ENSEMBLES FOR MODEL-BASED OFFLINE REINFORCEMENT LEARNING

Abstract

We consider the problem of offline reinforcement learning, where only a fixed set of system transitions is available for policy optimization. Following recent advances in the field, we study a model-based reinforcement learning algorithm that infers the system dynamics from the available data and performs policy optimization on imaginary model rollouts. This approach is vulnerable to exploiting model errors, which can lead to catastrophic failures on the real system. The standard remedy is to rely on ensembles for uncertainty heuristics and to avoid exploiting the model where it is too uncertain. We challenge the popular belief that we must resort to ensembles by showing that better performance can be obtained with a single well-calibrated autoregressive model on the D4RL benchmark. We also analyze static metrics of model learning and identify the model properties that matter most for the final performance of the agent.

1. INTRODUCTION

Reinforcement learning consists of learning a control agent (policy) by interacting with a dynamical system (environment) and collecting its feedback (rewards). This learning paradigm has proved able to solve some of the world's most difficult problems (Silver et al., 2017; 2018; Mnih et al., 2015; Vinyals et al., 2019). However, the scope of systems that RL can solve remains restricted to the simulated world and does not extend to real engineering systems. Two of the main reasons are i) small data due to operational constraints and ii) the safety standards of such systems. In an attempt to bridge the gap between RL and engineering systems, we motivate the setting of offline reinforcement learning (Levine et al., 2020), which removes the need to query a dynamical system by using a previously collected dataset of controller-system interactions. From this perspective, we view the setting as a supervised learning problem in which one tries to approximate the underlying distribution of the data at hand, hoping to generalize to out-of-distribution samples. This turns out to be a difficult task for classical RL algorithms because of the distribution shift that occurs between the dataset and the learned policy during training (Fujimoto et al., 2019; Levine et al., 2020). We therefore need to design algorithms that are well suited to offline reinforcement learning. A common idea in this field is conservatism, where one trusts the learned agent only when the input states are close to the support of the offline dataset. Depending on the algorithm, conservatism can take multiple forms, ranging from penalized Q-targets (Kumar et al., 2020) to uncertainty-penalized Markov decision processes (Kidambi et al., 2020; Yu et al., 2020). To develop this direction further, we distinguish between model-free and model-based RL (MBRL) algorithms.
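To make the uncertainty-penalized view concrete, the model reward can be lowered by a penalty proportional to model uncertainty, as in MOPO (Yu et al., 2020). The sketch below is illustrative rather than the setup used in this paper: it assumes a hypothetical ensemble of linear dynamics models (stand-ins for neural networks), measures uncertainty as ensemble disagreement, and uses a penalty coefficient `lam` chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ensemble of linear dynamics models; each member maps a
# state to a predicted next state. Real implementations use neural nets.
ensemble = [np.eye(3) + 0.1 * rng.normal(size=(3, 3)) for _ in range(5)]

def penalized_reward(state, reward, lam=1.0):
    """Conservative reward: subtract an uncertainty penalty, where
    uncertainty is the ensemble disagreement (the largest distance of a
    member's prediction from the mean prediction)."""
    preds = np.stack([W @ state for W in ensemble])        # (5, 3)
    disagreement = np.max(
        np.linalg.norm(preds - preds.mean(axis=0), axis=1)
    )
    return reward - lam * disagreement

s = np.ones(3)
r_tilde = penalized_reward(s, reward=1.0)
assert r_tilde <= 1.0  # the penalty can only lower the model reward
```

Since the disagreement is non-negative, the penalized reward never exceeds the raw model reward, which discourages the policy from visiting states where the model is uncertain.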
Model-free algorithms learn a policy and/or a value function by observing reward realizations and the underlying dynamics of the system, which in most environments requires a significant number of interactions to achieve good performance (Haarnoja et al., 2018). In this category, one way to incorporate conservatism is to penalize the value targets of data points that are distant from the offline dataset (Kumar et al., 2020); other methods include behavior-regularized policy optimization (Wu et al., 2020). Model-based algorithms are composed of two independent (and often alternating) steps: i) model learning: a supervised learning problem of learning the dynamics (and sometimes also the reward

