AUTOREGRESSIVE DYNAMICS MODELS FOR OFFLINE POLICY EVALUATION AND OPTIMIZATION

Abstract

Standard dynamics models for continuous control make use of feedforward computation to predict the conditional distribution of next state and reward given current state and action using a multivariate Gaussian with a diagonal covariance structure. This modeling choice assumes that different dimensions of the next state and reward are conditionally independent given the current state and action and may be driven by the fact that fully observable physics-based simulation environments entail deterministic transition dynamics. In this paper, we challenge this conditional independence assumption and propose a family of expressive autoregressive dynamics models that generate different dimensions of the next state and reward sequentially conditioned on previous dimensions. We demonstrate that autoregressive dynamics models indeed outperform standard feedforward models in log-likelihood on heldout transitions. Furthermore, we compare different model-based and model-free off-policy evaluation (OPE) methods on RL Unplugged, a suite of offline MuJoCo datasets, and find that autoregressive dynamics models consistently outperform all baselines, achieving a new state-of-the-art. Finally, we show that autoregressive dynamics models are useful for offline policy optimization by serving as a way to enrich the replay buffer through data augmentation and improving performance using model-based planning.

1. INTRODUCTION

Model-based Reinforcement Learning (RL) aims to learn an approximate model of the environment's dynamics from existing logged interactions to facilitate efficient policy evaluation and optimization. Early work on Model-based RL uses simple tabular (Sutton, 1990; Moore and Atkeson, 1993; Peng and Williams, 1993) and locally linear (Atkeson et al., 1997) dynamics models, which often result in a large degree of model bias (Deisenroth and Rasmussen, 2011). Recent work adopts feedforward neural networks to model complex transition dynamics and improve generalization to unseen states and actions, achieving a high level of performance on standard RL benchmarks (Chua et al., 2018; Wang et al., 2019). However, standard feedforward dynamics models assume that different dimensions of the next state and reward are conditionally independent given the current state and action, which may lead to poor uncertainty estimates with unclear effects on downstream RL applications. In this work, we propose a new family of autoregressive dynamics models and study their effectiveness for off-policy evaluation (OPE) and offline policy optimization on continuous control. Autoregressive dynamics models generate each dimension of the next state conditioned on previous dimensions of the next state, in addition to the current state and action (see Figure 1). Sampling a next state from an autoregressive dynamics model therefore requires n sequential steps, where n is the number of state dimensions, plus one more step to generate the reward. By contrast, standard feedforward dynamics models take the current state and action as input and predict the distribution of the next state and reward as a multivariate Gaussian with a diagonal covariance structure (e.g., Chua et al. (2018); Janner et al. (2019)). This modeling choice assumes that different state dimensions are conditionally independent.
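The sequential sampling procedure described above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `predict_gaussian_params` is a hypothetical stand-in for a learned network that outputs the mean and standard deviation of one next-state dimension given the state, action, and the next-state dimensions generated so far.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_gaussian_params(s, a, prefix):
    # Hypothetical stand-in for a learned conditional network:
    # returns (mean, std) for the next dimension given the current
    # state s, action a, and previously generated dimensions `prefix`.
    x = np.concatenate([s, a, prefix])
    mean = 0.1 * np.tanh(x).sum()
    std = 0.1
    return mean, std

def sample_next_state_autoregressive(s, a, n_dims):
    """Sample s' one dimension at a time: p(s'_i | s, a, s'_{<i}),
    then one extra step for the reward conditioned on all of s'."""
    next_state = []
    for _ in range(n_dims):
        mean, std = predict_gaussian_params(s, a, np.array(next_state))
        next_state.append(rng.normal(mean, std))
    r_mean, r_std = predict_gaussian_params(s, a, np.array(next_state))
    reward = rng.normal(r_mean, r_std)
    return np.array(next_state), reward

s = np.zeros(4)   # toy 4-dimensional state
a = np.zeros(2)   # toy 2-dimensional action
s_next, r = sample_next_state_autoregressive(s, a, n_dims=4)
```

A feedforward model with diagonal covariance would instead produce all n means and standard deviations in a single forward pass, sampling every dimension independently; the autoregressive factorization trades that parallelism for the ability to capture dependencies between dimensions.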
Autoregressive generative models have seen success in generating natural images (Parmar et al., 2018), text (Brown et al., 2020), and speech (Oord et al., 2016), but they have not seen use in Model-based RL for continuous control. We find that autoregressive dynamics models achieve higher log-likelihood compared to their feedforward counterparts on heldout validation transitions of all DM continuous control tasks (Tassa et al., 2018) from the RL Unplugged dataset (Gulcehre et al., 2020). To determine the impact of improved transition dynamics models, we primarily focus on OPE because it allows us to isolate contributions of the dynamics model in value estimation vs. the many other factors of variation in policy optimization and data collection. We find that autoregressive dynamics models consistently outperform existing Model-based and Model-free OPE baselines on continuous control in both ranking and value estimation metrics. We expect that our advances in model-based OPE will improve offline policy selection for offline RL (Paine et al., 2020). Finally, we show that our autoregressive dynamics models can help improve offline policy optimization by model predictive control, achieving a new state-of-the-art on cheetah-run and fish-swim from RL Unplugged (Gulcehre et al., 2020).

Key contributions of this paper include:

• We propose autoregressive dynamics models to capture dependencies between state dimensions in forward prediction. We show that autoregressive models improve log-likelihood over non-autoregressive models for continuous control tasks from the DM Control Suite (Tassa et al., 2018).
• We apply autoregressive dynamics models to Off-Policy Evaluation (OPE), surpassing the performance of state-of-the-art baselines in median absolute error, rank correlation, and normalized top-5 regret across 9 control tasks.
• We show that autoregressive dynamics models are more useful than feedforward models for offline policy optimization, serving as a way to enrich experience replay by data augmentation and improving performance via model-based planning.

2. PRELIMINARIES

Here we introduce relevant notation and discuss off-policy (offline) policy evaluation (OPE). We refer the reader to Lange et al. (2012) and Levine et al. (2020) for background on offline RL, which is also known as batch RL in the literature. A finite-horizon Markov Decision Process (MDP) is defined by a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, T, d_0, r, \gamma)$, where $\mathcal{S}$ is a set of states $s \in \mathcal{S}$, $\mathcal{A}$ is a set of actions $a \in \mathcal{A}$, $T$ defines transition probability distributions $p(s_{t+1} \mid s_t, a_t)$, $d_0$ defines the initial state distribution $d_0 \equiv p(s_0)$, $r$ defines a reward function $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, and $\gamma$ is a scalar discount factor. A policy $\pi(a \mid s)$ defines a conditional distribution over actions conditioned on states. A trajectory consists of a sequence of states and actions $\tau = (s_0, a_0, s_1, a_1, \ldots, s_H)$ of horizon length $H$. We use $s_{t,i}$ to denote the $i$-th dimension of the state at time step $t$ (and similarly for actions). In reinforcement learning, the objective is to maximize the expected sum of discounted rewards over the trajectory distribution induced by the policy:

$$V_\gamma(\pi) = \mathbb{E}_{\tau \sim p_\pi(\tau)} \left[ \sum_{t=0}^{H} \gamma^t \, r(s_t, a_t) \right]. \tag{1}$$

The trajectory distribution is characterized by the initial state distribution, policy, and transition probability distribution:

$$p_\pi(\tau) = d_0(s_0) \prod_{t=0}^{H-1} \pi(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t).$$

In offline RL, we are given access to a dataset of transitions $\mathcal{D} = \{(s_t^i, a_t^i, r_{t+1}^i, s_{t+1}^i)\}_{i=1}^{N}$ and a set of initial states $S_0$. Offline RL is inherently a data-driven approach since the agent needs to optimize the same objective as in Eq. (1) but is not allowed additional interactions with the environment. Even though offline RL offers the promise of leveraging existing logged datasets, current offline RL algorithms (Fujimoto et al., 2019; Agarwal et al., 2020; Kumar et al., 2019) are typically evaluated using online interaction, which limits their applicability in the real world.
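The return in Eq. (1) can be checked numerically with a short sketch (the function name `discounted_return` is ours, introduced only for illustration):

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Compute sum_{t=0}^{H} gamma^t * r_t for one trajectory's rewards."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

# Three steps of reward 1.0 with gamma = 0.9:
# 1*1.0 + 0.9*1.0 + 0.81*1.0 ≈ 2.71
ret = discounted_return([1.0, 1.0, 1.0], gamma=0.9)
```

An OPE method estimates this quantity for a target policy using only the logged dataset $\mathcal{D}$ and the initial states $S_0$, without further environment interaction.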

