DEEP AUTOREGRESSIVE DENSITY NETS VS NEURAL ENSEMBLES FOR MODEL-BASED OFFLINE REINFORCEMENT LEARNING

Abstract

We consider the problem of offline reinforcement learning where only a set of system transitions is made available for policy optimization. Following recent advances in the field, we consider a model-based reinforcement learning algorithm that infers the system dynamics from the available data and performs policy optimization on imaginary model rollouts. This approach is vulnerable to exploiting model errors which can lead to catastrophic failures on the real system. The standard solution is to rely on ensembles for uncertainty heuristics and to avoid exploiting the model where it is too uncertain. We challenge the popular belief that we must resort to ensembles by showing that better performance can be obtained with a single well-calibrated autoregressive model on the D4RL benchmark. We also analyze static metrics of model-learning and conclude on the important model properties for the final performance of the agent.

1. INTRODUCTION

Reinforcement learning consists in learning a control agent (policy) by interacting with a dynamical system (environment) and collecting its feedback (rewards). This learning paradigm turned out to be able to solve some of the world's most difficult problems (Silver et al., 2017; 2018; Mnih et al., 2015; Vinyals et al., 2019) . However, the scope of the systems that RL is capable of solving remains restricted to the simulated world and does not extend to real engineering systems. Two of the main reasons are i) small data due to operational constraints and ii) safety standards of such systems. In an attempt to bridge the gap between RL and engineering systems, we motivate the setting of offline reinforcement learning (Levine et al., 2020) . Offline reinforcement learning removes the need to query a dynamical system by using a previously collected dataset of controller-system interactions. In this optic, we view this setting as a supervised learning problem where one tries to approximate the underlying distribution of the data at hand, and hopefully be able to generalize to out-of-distribution samples. This turns out to be a difficult task for classical RL algorithms because of the distribution shift that occurs between the dataset and the learned policy during the learning process (Fujimoto et al., 2019; Levine et al., 2020) . Thus we need to design algorithms that are well-suited for offline reinforcement learning. A common idea in this field is conservatism where one would only consider the learned agent when the input states are close to the support of the offline dataset. Depending on the algorithm, conservatism can take multiple forms, ranging from penalized Q-targets (Kumar et al., 2020) to uncertainty-penalized Markov decision processes (Kidambi et al., 2020; Yu et al., 2020) . To develop further into this direction, we make the distinction between model-free and model-based RL (MBRL) algorithms. 
Model-free algorithms learn a policy and/or a value function by observing the reward signal realizations and the underlying dynamics of the system, which in most environments requires a significant number of interactions for achieving good performance (Haarnoja et al., 2018) . In this category, a way to incorporate conservatism is to penalize the value targets of data points that are distant from the offline dataset (Kumar et al., 2020) . Other methods include behavior regularized policy optimization (Wu et al., 2020) . Model-based algorithms are composed of two independent (and often alternating) steps: i) model learning: a supervised learning problem of learning the dynamics (and sometimes also the reward function) of the system of interest; and ii) policy optimization, where we sample from the learned dynamics to learn a policy and/or a value function. MBRL is known to be sample-efficient, since policy/value learning is done (completely or partially) from imaginary model rollouts (also called background planning) that are cheaper and more accessible than rollouts in the true dynamics (Janner et al., 2019) . Furthermore, a predictive model with good out-of-distribution performance affords easy transfer of the true model to new tasks or areas not covered in the offline dataset (Yu et al., 2020) . Conservatism in MBRL is frequently achieved by uncertainty-based penalization of the model predictions. This relies on well-calibrated estimation of the epistemic uncertainty of the learned dynamics, which is a limitation of this approach. It is of great interest to build models that know when (and how much) they do not know, thus uncertainty estimation remains a central problem in MBRL. Many recent works have made progress in this direction (Osband et al., 2021) . 
The most common approach to date is bootstrap ensembles: we construct a population of predictive models (most often probabilistic neural networks) and consider disagreement metrics as our uncertainty measurement. The source of randomness in this case is the random initialization of the parameters of neural networks and the subset of the training data that each model sees. When the environment is stochastic, ensembles help to separate the aleatory uncertainty (intrinsic randomness of the environment) and the epistemic uncertainty (Chua et al., 2018) . When the environment is deterministic (which is the case of the D4RL Mujoco benchmark environments considered in most of the offline RL literature (Fu et al., 2021a) ), the error is fully epistemic: it consists of the estimation error (due to lack of training data) and the approximation error (mismatch between the model class and the true distribution) (Hüllermeier & Waegeman, 2021) . This highlights the need of well-calibrated probabilistic models whose posterior variance can be used as an uncertainty measurement in conservative MBRL. In this work, we propose to compare autoregressive dynamics models (Uria et al., 2016) to ensembles of probabilistic feedforward models, both in terms of static evaluation (supervised learning metrics on the task of learning the system dynamics) and dynamic evaluation (final performance of the MBRL agent that uses the model). Autoregressive models learn a conditional distribution of each dimension of the next state conditioned on the input of the model (current state and action) and the previously generated dimensions of the next state. Meanwhile, probabilistic feedforward models learn a multivariate distribution of the next state conditioned on the current state and action. We argue that autoregressive models can learn the implicit functional dependence between state dimensions, which makes them well-calibrated, leading to good uncertainty estimates suitable for conservatism in MBRL. 
Our key contributions are the following.
• We apply autoregressive dynamics models in the context of offline model-based reinforcement learning and show that they improve over neural ensembles in terms of static evaluation metrics and the final performance of the agent.
• We introduce an experimental setup that decouples model selection from agent selection to reduce the burden of hyperparameter optimization in offline RL.
• We study the impact of static metrics on the dynamic performance of the agents, and conclude on the importance of single-step calibratedness in model-based offline RL.

2. RELATED WORK

Offline RL has been an active area of research following its numerous applications in domains such as robotics (Chebotar et al., 2021) , healthcare (Gottesman et al., 2018) , recommendation systems (Strehl et al., 2010) , and autonomous driving (Kiran et al., 2022) . Despite outstanding advances in online RL (Haarnoja et al., 2018; Silver et al., 2017; Mnih et al., 2015) and iterated offline RL (Wang et al., 2019; Wang & Ba, 2020; Matsushima et al., 2021; Kégl et al., 2021) , offline RL remained a challenging problem due to the dependency on the data collection procedure and its potential lack of exploration (Levine et al., 2020) . Although any off-policy model-free RL agent can theoretically be applied to offline RL (Haarnoja et al., 2018; Degris et al., 2012; Lillicrap et al., 2016; Munos et al., 2016) , it has been shown that these algorithms suffer from distribution shift and yield poor performance (Fujimoto et al., 2019; Levine et al., 2020) . To alleviate the problem of distribution shift, conservatism was introduced successfully by several techniques, such as BEAR (Kumar et al., 2019) , AlgaeDICE (Nachum et al., 2019) , AWR (Peng et al., 2020) , BRAC (Wu et al., 2020) , and CQL Kumar et al. (2020) . The general objective of these methods is to keep the model-free policy close to the behavioral policy, in other words, to avoid wandering into regions of the state/action space where no data was collected. Model-based RL has been successfully applied to the online RL setting by alternating model learning and planning (Deisenroth & Rasmussen, 2011; Hafner et al., 2021; Gal et al., 2016; Levine & Koltun, 2013; Chua et al., 2018; Janner et al., 2019; Kégl et al., 2021) . 
Planning is done either at decision time via model-predictive control (Draeger et al., 1995; Chua et al., 2018; Hafner et al., 2019; Pinneri et al., 2020; Kégl et al., 2021), or Dyna-style by learning a model-free RL agent on imagined model rollouts (Janner et al., 2019; Sutton, 1991; Sutton et al., 1992; Ha & Schmidhuber, 2018). For instance, MBPO (Janner et al., 2019) trains an ensemble of feed-forward models and generates imaginary rollouts to train a soft actor-critic, whose policy is then used to generate new data for model learning. MBPO has been shown to achieve state-of-the-art performance on continuous control tasks while requiring the fewest environment samples. An adaptation of MBPO to the offline setting is MOPO (Yu et al., 2020). MOPO incorporates conservatism via a surrogate MDP where the rewards are penalized with the uncertainty of the model. While MOPO relies on disagreement metrics between the members of the learned ensemble, we suggest the use of well-calibrated autoregressive models whose learned variance is a good proxy for the model estimation error. Similar uncertainty-penalized policy search is used in a number of other works (Kidambi et al., 2020; Lee et al., 2021; Shen et al., 2021; Swazinna et al., 2021; Depeweg et al., 2018), while others explore pessimism-based decision-time planning (Argenson & Dulac-Arnold, 2021; Zhan et al., 2021) or conservative value learning (Yu et al., 2021; Liu et al., 2021).

Autoregressive models have been studied in a number of previous works for generative modeling in general (Uria et al., 2016; 2013; Papamakarios et al., 2017; Van Den Oord et al., 2016). However, only a handful of papers use them in the context of MBRL (Kégl et al., 2021; Zhang et al., 2021b; Zhan et al., 2021). Zhang et al. (2021b) used autoregressive models for model-based off-policy evaluation, while we focus our study on the important model properties for offline policy optimization. We also adapt metrics from Kégl et al.
(2021) to provide a complete guide on model selection for offline MBRL. Previous works have tackled hyperparameter selection in online RL (Andrychowicz et al., 2021; Engstrom et al., 2020) , MBRL (Zhang et al., 2021a) , and offline RL (Paine et al., 2020) , showing the sensibility of existing algorithms to hyperparameter choices. Lu et al. (2022) perform a similar analysis to this work. Similarly to us, they base their analysis on MOPO, but they focus on the uncertainty-related hyperparameters while we revisit the model design and architecture.

3. PRELIMINARIES

The standard framework of RL is the finite-horizon Markov decision process (MDP) $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, p, r, \mu_0, \gamma \rangle$ where $\mathcal{S}$ represents the state space, $\mathcal{A}$ the action space, $p : \mathcal{S} \times \mathcal{A} \rightsquigarrow \mathcal{S}$ the (possibly stochastic) transition dynamics, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ the reward function, $\mu_0$ the initial state distribution, and $\gamma \in [0, 1]$ the discount factor. The goal of RL is to find, for each state $s \in \mathcal{S}$, a distribution $\pi(s)$ over the action space $\mathcal{A}$, called the policy, that maximizes the expected sum of discounted rewards

$$J(\pi, \mathcal{M}) := \mathbb{E}_{s_0 \sim \mu_0,\, a_t \sim \pi,\, s_{t>0} \sim p}\Big[\sum_{t=0}^{H} \gamma^t r(s_t, a_t)\Big],$$

where $H$ is the MDP horizon. Under a policy $\pi$, we define the state-action value function (Q-function) at an $(s, a) \in \mathcal{S} \times \mathcal{A}$ pair as the expected sum of discounted rewards, starting from the state $s$, taking the action $a$, and following the policy $\pi$ afterwards until termination:

$$Q^\pi(s, a) = \mathbb{E}_{a_{t>0} \sim \pi,\, s_{t>0} \sim p}\Big[\sum_{t=0}^{H} \gamma^t r(s_t, a_t) \,\Big|\, s_0 = s, a_0 = a\Big].$$

We can similarly define the state value function by taking the expectation with respect to the initial action $a_0$:

$$V^\pi(s) = \mathbb{E}_{a_t \sim \pi,\, s_{t>0} \sim p}\Big[\sum_{t=0}^{H} \gamma^t r(s_t, a_t) \,\Big|\, s_0 = s\Big].$$

In offline RL, we are given a set of transitions $\mathcal{D} = \{(s_t^i, a_t^i, r_t^i, s_{t+1}^i)\}_{i=1}^{N}$, where $N$ is the size of the set, generated by an unknown behavioral policy $\pi_\beta$. The difficulty of offline RL comes from the fact that we are not allowed to interact further with the environment $\mathcal{M}$ even though we aim to optimize the objective $J(\pi, \mathcal{M})$ with $\pi \neq \pi_\beta$. In practice, the current offline RL algorithms are still provided with an online evaluation budget, a setting we will follow in the rest of the paper. The question of offline policy evaluation (or budget-limited policy evaluation) is an active research direction (see, e.g., Fu et al. (2021b)) and is beyond the scope of this paper.

Model-based RL algorithms use an offline dataset $\mathcal{D}$ to solve the supervised learning problem of estimating the dynamics of the environment $p$ and/or the reward function $r$.
For various reasons (stochastic environment, ability to represent the uncertainty of the predictions), the loss function is usually the log-likelihood

$$\mathcal{L}(\mathcal{D}; \hat{p}) = \frac{1}{N} \sum_{i=1}^{N} \log \hat{p}(s_{t+1}^i \mid s_t^i, a_t^i).$$

The learned model can then be used for policy search under the MDP $\hat{\mathcal{M}} = \langle \mathcal{S}, \mathcal{A}, \hat{p}, \hat{r}, \mu_0, \gamma \rangle$, which has the same state and action spaces $\mathcal{S}, \mathcal{A}$ as the true environment $\mathcal{M}$, but whose transition probability $\hat{p}$ and reward function $\hat{r}$ are learned from the offline data $\mathcal{D}$. The obtained optimal policy $\hat{\pi} = \operatorname{argmax}_\pi J(\pi, \hat{\mathcal{M}})$ is not guaranteed to be optimal under the true MDP $\mathcal{M}$ due to distribution shift and model bias. $J(\hat{\pi}, \hat{\mathcal{M}})$ and $J(\hat{\pi}, \mathcal{M})$ are somewhat analogous to training and test scores in supervised learning, with two fundamental differences: i) they are only loosely connected to the actual supervised loss $\mathcal{L}(\mathcal{D}; \hat{p})$ that we can optimize and measure on a data set, and ii) because we are not allowed to collect data using $\hat{\pi}$, there is a distribution shift between training and test.

Regarding the type of model, the usual choice is a probabilistic model that learns the parameters of a multivariate Gaussian over the next state and reward, conditioned on the current state and action: $s_{t+1}, r_t \sim \hat{p}_\theta(\cdot \mid s_t, a_t) = \mathcal{N}\big(\mu_\theta(s_t, a_t), \sigma_\theta(s_t, a_t)\big)$, where $\theta$ represents the parameters of the predictive model. In practice, we use fully connected neural networks as they have proven to be powerful function approximators (Nagabandi et al., 2018; Chua et al., 2018), and because they are better suited to high-dimensional environments than simpler non-parametric models such as Gaussian processes. Following previous work (Chua et al., 2018), we assume a diagonal covariance matrix for which we learn the logarithm of the diagonal entries: $\sigma_\theta = \mathrm{Diag}(\exp(l_\theta))$ with $l_\theta$ output by the neural network.
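As an illustration of the log-likelihood objective with a diagonal Gaussian, the following is a minimal sketch in plain Python. It assumes that the network output `log_var` holds the logarithm of the diagonal covariance entries (one reading of $\sigma_\theta = \mathrm{Diag}(\exp(l_\theta))$); the function names are hypothetical and are not taken from the paper's code base.

```python
import math


def diag_gaussian_log_likelihood(target, mean, log_var):
    """Log-density of `target` under N(mean, Diag(exp(log_var))).

    Assumption: `log_var` is the log of the diagonal covariance entries.
    """
    total = 0.0
    for y, mu, l in zip(target, mean, log_var):
        var = math.exp(l)
        total += -0.5 * (math.log(2.0 * math.pi) + l + (y - mu) ** 2 / var)
    return total


def mean_log_likelihood(targets, means, log_vars):
    """Average log-likelihood over N transitions (the quantity to maximize)."""
    n = len(targets)
    return sum(
        diag_gaussian_log_likelihood(y, m, l)
        for y, m, l in zip(targets, means, log_vars)
    ) / n
```

Because the covariance is diagonal, the joint log-density is a plain sum over state dimensions, which is exactly the conditional independence assumption discussed next.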
One of the assumptions of such a joint model is the conditional independence of the dimensions of the predicted state, which is a strong assumption, especially in the case of functional (or physical) dependency. y-interdependence (Kégl et al., 2021) happens, for example, when angles are represented by sine and cosine. For this purpose, we study autoregressive models that learn a single model per dimension, conditioned on the input of the model $(s_t, a_t)$ and the previously generated dimensions. Formally,

$$\hat{p}_\theta(s_{t+1} \mid s_t, a_t) = \hat{p}_{\theta_1}(s_{t+1}^1 \mid s_t, a_t) \prod_{j=2}^{d_s} \hat{p}_{\theta_j}(s_{t+1}^j \mid s_{t+1}^1, \ldots, s_{t+1}^{j-1}, s_t, a_t),$$

where $d_s$ is the dimension of the state space $\mathcal{S}$.

Conservatism in MBRL requires an uncertainty estimate $\hat{u}(s, a)$ reflecting the quality of the model in different regions of the state/action space. For this purpose, probabilistic models provide an uncertainty estimate by learning the variance of the predictions (in this case under a Gaussian distribution). In noisy environments, this uncertainty estimate represents both the aleatory uncertainty (intrinsic randomness of the environment) and the epistemic uncertainty (model estimation and approximation errors). Conservative MBRL uses the epistemic uncertainty only, so, in practice, the problem of separating the aleatory uncertainty and the epistemic uncertainty is addressed through the use of bootstrap ensembles (Chua et al., 2018). Ensembling consists in having $D \in \mathbb{N}^* \setminus \{1\}$ models, each initialized randomly and trained on a set $\mathcal{D}_\ell$ for $\ell \in \{1, \ldots, D\}$ generated by sampling with replacement from a common dataset $\mathcal{D}$. Using ensembles, we can compute a disagreement metric to capture the epistemic uncertainty, as opposed to the aleatory uncertainty learned by each member of the ensemble. A detailed discussion about these uncertainty heuristics is provided in Section 4.
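The autoregressive factorization above can be sketched as a loop over state dimensions, each conditional receiving the dimensions generated so far. This is only an illustrative sketch: `conditionals` is a hypothetical list of callables returning a Gaussian mean and standard deviation, standing in for the per-dimension networks, and the toy conditionals in the usage note are invented for the example.

```python
import math
import random


def gaussian_logpdf(y, mu, sigma):
    """Log-density of a univariate Gaussian with standard deviation sigma."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2


def autoregressive_log_likelihood(conditionals, state, action, next_state):
    """Chain-rule log-likelihood: sum over j of log p_j(s^j | s^1..s^{j-1}, s, a).

    conditionals[j] is a hypothetical callable mapping
    (state, action, previous_dims) to (mean, std) for dimension j.
    """
    total = 0.0
    for j, cond in enumerate(conditionals):
        mu, sigma = cond(state, action, next_state[:j])
        total += gaussian_logpdf(next_state[j], mu, sigma)
    return total


def autoregressive_sample(conditionals, state, action, rng):
    """Sample the next state one dimension at a time."""
    generated = []
    for cond in conditionals:
        mu, sigma = cond(state, action, generated)
        generated.append(rng.gauss(mu, sigma))
    return generated


# Toy conditionals (invented): the second dimension is nearly a function of the first.
toy_conditionals = [
    lambda s, a, prev: (s[0] + a[0], 0.1),
    lambda s, a, prev: (2.0 * prev[0], 0.01),
]
sample = autoregressive_sample(toy_conditionals, [1.0], [0.5], random.Random(0))
```

In the toy example, the second conditional can use a very small standard deviation because it sees the sampled first dimension, which is the kind of functional dependence a joint diagonal Gaussian cannot represent.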

4. A BASELINE: MODEL-BASED OFFLINE POLICY OPTIMIZATION (MOPO)

Models $\hat{p}$ in MBRL are not used in isolation. Their likelihood ratio, precision, and calibratedness (LR, R2, and KS in Section 5.1 and Appendix C) are good proxies, but ultimately their quality is judged when they are used to learn a policy. To compare the dynamic performance of the models $\hat{p}$, we fix the policy optimization algorithm to MOPO (Yu et al., 2020), a conservative agent-learning algorithm. MOPO uses a pessimistic MDP (P-MDP) to ensure that the performance of the policy with the model is a lower bound of the performance of the policy on the real system. Yu et al. (2020) show a theoretical lower bound on the true return based on the estimation error of the learned dynamics:

$$J(\pi, \mathcal{M}) \geq \mathbb{E}_{a \sim \pi,\, s \sim \hat{p}}\Big[r(s, a) - \gamma\, |G^\pi_{\hat{\mathcal{M}}}(s, a)|\Big].$$

In this formula, $G^\pi_{\hat{\mathcal{M}}}(s, a)$ is defined by

$$G^\pi_{\hat{\mathcal{M}}}(s, a) = \mathbb{E}_{s' \sim \hat{p}(s, a)}\big[V^\pi_{\mathcal{M}}(s')\big] - \mathbb{E}_{s' \sim p(s, a)}\big[V^\pi_{\mathcal{M}}(s')\big],$$

which quantifies the effect of the model error on the return. However, this requires access to the value function of the policy $\pi$ under the true MDP $\mathcal{M}$, which is not given in practice. To derive an algorithm based on this theoretical bound, MOPO relies on an upper bound of $G^\pi_{\hat{\mathcal{M}}}(s, a)$ based on the integral probability metric:

$$G^\pi_{\hat{\mathcal{M}}}(s, a) \leq \sup_{f \in \mathcal{F}} \Big| \mathbb{E}_{s' \sim \hat{p}}[f(s')] - \mathbb{E}_{s' \sim p}[f(s')] \Big|,$$

where $\mathcal{F}$ is an arbitrary set of functions. In practice, the authors use ensemble-based uncertainty heuristics to set an upper bound on the true error of the model. The maximum standard deviation among the ensemble members (labeled max aleatory or MA) is used to define a penalized reward $\tilde{r}(s, a) = \hat{r}(s, a) - \lambda \hat{u}(s, a)$, where $\hat{u}(s, a) = \max_{\ell = 1, \ldots, N} \|\sigma_\theta^\ell(s, a)\|_F$ and $\lambda$ is a penalty hyperparameter. Yu et al. (2020) then define the associated P-MDP $\tilde{\mathcal{M}} = \langle \mathcal{S}, \mathcal{A}, \hat{p}, \tilde{r}, \mu_0, \gamma \rangle$ on which a soft actor-critic (SAC) (Haarnoja et al., 2018) agent is trained until convergence (Algorithm 1). This algorithm is based on model-based policy optimization (MBPO) (Janner et al., 2019), which alternates between model learning and agent learning.
MOPO can be described as one iteration of MBPO, which learns the dynamics model (a bootstrap ensemble of probabilistic neural networks) from the offline dataset and then learns the off-policy agent on a buffer of rollouts in the P-MDP $\tilde{\mathcal{M}}$. Using this P-MDP prevents the agent from exploiting rewards of highly uncertain regions.

Algorithm 1: MOPO. Yu et al. (2020) use $\hat{u}(s, a) = \max_{\ell = 1, \ldots, N} \|\sigma_\theta^\ell(s, a)\|_F$; we also experimented with two other penalty heuristics by Lu et al. (2022).

While Yu et al. (2020) only tried the max aleatory estimator for the uncertainty heuristic, Lu et al. (2022) introduced alternative ensemble-based uncertainty heuristics from recent works and deployed them in MOPO. Among these, we chose the following two, which show competitive performance in benchmarks.
• Max pairwise difference (MPD) (Kidambi et al., 2020): $\hat{u}(s, a) = \max_{l, l'} \|\mu^l_{\theta_l}(s, a) - \mu^{l'}_{\theta_{l'}}(s, a)\|_2$ for $l \neq l' \in \{1, \ldots, D\}$. This metric captures the largest disagreement among ensemble members as an indicator of model error.
• Ensemble standard deviation (ESD) (Lakshminarayanan et al., 2017): $\hat{u}(s, a) = \sqrt{\frac{1}{D} \sum_{l=1}^{D} \big( (\sigma^l_{\theta_l}(s, a))^2 + (\mu^l_{\theta_l}(s, a))^2 \big) - (\bar{\mu}(s, a))^2}$ with $\bar{\mu}(s, a) = \frac{1}{D} \sum_{l=1}^{D} \mu^l_{\theta_l}(s, a)$, is the standard deviation of the ensemble, i.e., the standard deviation of the equally-weighted mixture of the Gaussian densities.
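To make the three heuristics concrete, here is a minimal sketch in plain Python for a single state-action pair. It assumes that `means` and `stds` are lists with one predicted mean vector and one predicted standard-deviation vector per ensemble member, that the Frobenius norm of a diagonal matrix is computed as the Euclidean norm of its diagonal, and that the per-dimension ESD is aggregated with a Euclidean norm (an assumption, since the aggregation over dimensions is not spelled out here). All function names are hypothetical.

```python
import math


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def max_aleatory(stds):
    """MA: largest norm of the predicted standard deviations over members."""
    return max(l2_norm(s) for s in stds)


def max_pairwise_difference(means):
    """MPD: largest Euclidean distance between two members' predicted means."""
    best = 0.0
    for i in range(len(means)):
        for j in range(i + 1, len(means)):
            diff = [a - b for a, b in zip(means[i], means[j])]
            best = max(best, l2_norm(diff))
    return best


def ensemble_std(means, stds):
    """ESD: std of the equally-weighted Gaussian mixture, per dimension,
    aggregated with a Euclidean norm (aggregation is an assumption)."""
    d = len(means)
    per_dim = []
    for k in range(len(means[0])):
        mean_bar = sum(m[k] for m in means) / d
        second_moment = sum(s[k] ** 2 + m[k] ** 2 for m, s in zip(means, stds)) / d
        per_dim.append(math.sqrt(max(second_moment - mean_bar ** 2, 0.0)))
    return l2_norm(per_dim)


def penalized_reward(reward, uncertainty, lam):
    """Reward of the pessimistic MDP: r - lambda * u."""
    return reward - lam * uncertainty


# Toy ensemble of two members predicting a two-dimensional next state.
toy_means = [[0.0, 0.0], [3.0, 4.0]]
toy_stds = [[1.0, 0.0], [0.0, 2.0]]
```

On the toy ensemble, MA only looks at the members' own variances, MPD only at the spread of the means, and ESD combines both, which is why ESD is the largest of the three quantities there relative to its inputs.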

5. EXPERIMENTAL SETUP

We implement our MOPO baseline based on the MBRL library released by Kégl et al. (2021), which is built on top of the RAMP framework (Kégl et al., 2018). We run our experiments with the following models: the autoregressive model DARMDN, the feedforward model DMDN, and ENS, an ensemble of DMDN models. For the single models (DARMDN and DMDN), we consider their learned standard deviation ($\sigma_\theta$) as the uncertainty heuristic to use for reward penalization, which is equivalent to the max aleatory heuristic for an ensemble of a single member. For ENS, we follow the scheme of Lu et al. (2022) and tune the uncertainty heuristic as an additional hyperparameter among MA, MPD, and ESD, defined in Section 4.

In a typical MBRL loop, the experimental setup consists of alternating model learning and agent learning until the potential convergence of the dynamic performance (episodic return) of the agent on the real environment. For computationally limited hyperparameter optimization, this setup provides continuous feedback on the return of a given model, which helps to early-stop unpromising experiments. This is not possible in single-iteration offline RL as we only have access to a static dataset for model learning, and we have to run the whole pipeline to compute the evaluation score of a given model. For this purpose, we propose to decouple model selection and agent selection in an attempt to reduce the overall computational budget of the approach. The experimental setup is then separated into two independent parts:
• Static evaluation of the models: Starting from a dataset $\mathcal{D}$, we evaluate the different models by computing supervised-learning evaluation metrics (Section 5.1 and Appendix C) on a held-out validation set. We then select the best model hyperparameters based on these metrics.
• Dynamic evaluation of the agents: After selecting the best model $\hat{p}$, we train agents by interacting with the P-MDP defined on the learned dynamics of the model.
During training, we evaluate the agents by repeatedly rolling-out trajectories in the real environment and computing their average episodic return. For this purpose, we assume access to the true simulator at evaluation time, although the recorded episodes are not made available to training. A limitation of this approach comes from the fact that static supervised learning metrics do not necessarily reflect the quality of the model for agent learning. We thus investigate how these static metrics predict the overall dynamic performance in Section 6.

5.1. STATIC METRICS

We use metrics introduced by Kégl et al. (2021) in the context of iterated offline reinforcement learning. These metrics are designed to assess different aspects of model quality: precision, calibratedness, and sensitivity to compounding errors via long-horizon metrics. Precision is evaluated using the explained variance (R2), which we prefer over the standard mean squared error (MSE) because it is normalized and can be aggregated over multiple dimensions. Calibratedness is measured using the Kolmogorov-Smirnov statistic (KS) between the ground truth validation quantiles and a uniform distribution. This metric indicates whether the ground truth values are distributed according to the predicted distributions. In the Gaussian case, it is equivalent to the predicted standard deviation being of the same order of magnitude as the true model error (although a bad KS may also indicate that the model errors are not Gaussian). We also use the likelihood ratio with respect to a baseline score (LR), and the outlier ratio (OR), the rate of data points on which the likelihood is close to zero. For the impact of compounding errors, we sample a population of trajectories (following ground truth actions) and compute Monte-Carlo estimates of the long-horizon metrics (R2(L) and KS(L) for L ∈ {1, . . . , 20}). The formal definition of these metrics can be found in Appendix C.
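The calibration check described above can be sketched as follows for one state dimension with Gaussian predictions: each validation target is mapped to its quantile under the predicted distribution, and the empirical distribution of these quantiles is compared with the uniform distribution through the Kolmogorov-Smirnov statistic. This is a minimal sketch under these assumptions; the exact definition used in the paper is in Appendix C and may differ in its details, and the function name is hypothetical.

```python
from statistics import NormalDist


def ks_calibration(targets, means, stds):
    """KS distance between the predicted quantiles of the targets and U(0, 1).

    A value close to 0 means the targets are spread as the predicted
    Gaussians say they should be.
    """
    quantiles = sorted(
        NormalDist(m, s).cdf(y) for y, m, s in zip(targets, means, stds)
    )
    n = len(quantiles)
    ks = 0.0
    for i, q in enumerate(quantiles, start=1):
        # Compare the empirical CDF just after and just before each quantile.
        ks = max(ks, i / n - q, q - (i - 1) / n)
    return ks


# Toy check: errors spread exactly like a standard Gaussian.
n = 100
errors = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
well_calibrated = ks_calibration(errors, [0.0] * n, [1.0] * n)
overestimated = ks_calibration(errors, [0.0] * n, [10.0] * n)
```

In the toy check, predicting a standard deviation ten times too large pushes all quantiles toward 0.5 and yields a much larger KS value than the well-calibrated prediction.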

5.2. DYNAMIC METRICS

Similarly to Kidambi et al. (2020) and Wu et al. (2020), we compute the average episodic return (undiscounted sum of rewards) of the agent on the real system during training, formally $R\big(\{(s_t, a_t, r_t, s_{t+1})\}_{t=1}^{H}\big) = \sum_{t=1}^{H} r_t$, where $H$ is the episode length. We then keep track of the agent with the highest return for the final evaluation. This is not what we should do if the goal were to develop a standalone offline RL algorithm (we could not use the real return to select the agent), but our goal in this paper is to compare models $\hat{p}$ of the system dynamics, so as long as the agent is selected in the same way for all the models, the comparison is fair. We use the normalized scores introduced in the D4RL benchmark. This metric is a linear transformation of the episodic return and takes values between 0 and 100, with 0 corresponding to the score of a randomly initialized SAC agent, and 100 to a SAC agent that is trained until convergence on the real system.
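The two quantities above can be written in a few lines. The sketch below assumes the usual linear normalization between a random-agent return and an expert return; the reference returns are passed as arguments, and the values used in the usage line are placeholders for illustration, not the D4RL reference constants.

```python
def episodic_return(rewards):
    """Undiscounted sum of the rewards collected during one episode."""
    return sum(rewards)


def normalized_score(ret, random_return, expert_return):
    """Linear map sending the random-agent return to 0 and the expert return to 100."""
    return 100.0 * (ret - random_return) / (expert_return - random_return)


# Placeholder reference returns (hypothetical values for illustration only).
score = normalized_score(episodic_return([10.0, 20.0, 20.0]), 0.0, 200.0)
```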

6. EXPERIMENTS & RESULTS

Figure 1: Hopper.

All our experiments are conducted in the continuous control environment Hopper. We use the implementation of OpenAI Gym (Brockman et al., 2016) that is based on the Mujoco physics simulator (Todorov et al., 2012). A description of this environment can be found in Appendix B. For static datasets, we use the D4RL Hopper benchmark that provides four static sets generated by different behavior policies (random: 1M steps generated by a randomly initialized SAC agent, medium: 1M steps generated by a SAC agent trained until half the score at convergence, medium-replay: all the traces collected by a SAC agent trained until half the score at convergence, medium-expert: 2M steps consisting of the medium dataset and 1M steps generated by an expert SAC agent). The results of the static evaluation of the models are summarized in Table 1. The reported scores are validation scores on a held-out 10% validation set from the D4RL datasets.

Table 1: Model evaluation results on static datasets. ↓ and ↑ mean lower and higher the better, respectively. Unit is given after the / sign.

One-step metrics (LR, OR, R2, and KS). We first observe that single models are consistently better than the ensemble in terms of one-step metrics. To better understand this result, we propose to use the ground truth test quantiles as a debugging tool on the calibratedness of the models. Figure 2 and Appendix E show that the ensemble model overestimates its error because the ground truth quantiles are concentrated around 0.5. We suggest that this is because each DMDN ensemble member has a well-calibrated variance, but when we treat the ensemble as a mixture model, the variance of the mean adds to the individual variances, "diluting" the uncertainty.
Regarding the comparison between DMDN and the autoregressive DARMDN, we observe that although they have similar R2 scores, DARMDN consistently beats DMDN in terms of KS and LR, which also depend on accurate and well-calibrated sigma estimates, an important property for conservative MBRL. To push the analysis further, we look at the dimension-wise static metrics, reported in Appendix D. The results depend on the dataset, yet some results are consistent and help explain the improvement that autoregressive models bring over their counterparts. For instance, in three out of four datasets, the LR score of the thigh and thigh dot dimensions is an order of magnitude higher for the autoregressive model. We suggest that this is due to the functional dependence that might exist between the different observables, which is easily captured by the autoregressive model as it uses the previously predicted dimensions as input to the next ones.

Long-horizon metrics. Unlike in single-step metrics, here we observe a significant degradation in the performance of DARMDN, both in terms of R2(L) and KS(L) for L ∈ {1, . . . , 20} (Figure 3 and Appendix F). We suggest that this is due to optimizing the models for single-step likelihood. Outliers (last bin of Figure 2) count little in the single-step likelihood, but may compound when recursing the model through L steps.

Dynamic evaluation. Table 2 shows the episodic return achieved by the best agent throughout one million steps of SAC training. SAC agents that were trained using DARMDN models performed better on the real system despite their suboptimal long-horizon performance. We suggest that for an agent trained by one-step Q-learning, such as SAC, only one-step errors matter.
Ensemble models improve over DMDN in the random dataset, but scores are comparable or worse in the remaining tasks, although none of the differences are highly significant (they depend on a couple of lucky seeds; a phenomenon that muddies the offline RL field). One result seems remarkable: DARMDN models seem to be able to consistently generate agents that go beyond Hopper simply standing up (score of about 30). Correlating static metrics and dynamic scores. The experimental setup we introduce has the advantage of reducing the combinatorics of the hyperparameter optimization process. However, the best agents do not necessarily come from the models with the best static metrics, since these are measured on static data not representative of the distribution on which they are applied in the dynamic run. In an attempt to optimize model selection, we investigate the model properties (static metrics) that are most important for dynamic scores. For this, we compute Spearman rank correlation (ρ) and Pearson bivariate correlation (r) between the static score obtained for all models and their respective dynamic scores. metrics that evaluate the calibratedness of the models. This underlines the fact that autoregressive models yield the best agent because of their ability to learn one-step uncertainty estimates that represent well their true errors. D4RL benchmark. We compare the scores obtained with our best agent (based on an autoregressive model) with existing literature in the D4RL benchmark and include the results in Table 3 . Table 3 : Results on the D4RL benchmark. The scores indicate the mean ± standard deviation across 3 seeds (6 seeds for MOPO) of the normalized episodic return. 
We take the scores of MBRL algorithms from their respective papers, and the scores of the model-free algorithms and behavior cloning (BC) from the D4RL paper (Fu et al., 2021a). Our algorithm achieves performance better than or similar to (medium-replay) that of MOPO, suggesting that the improvement is potentially brought by autoregressive models over neural ensembles, which supports the case for single well-calibrated models. However, we would like to emphasize that there may be other reasons behind such differences in performance. For instance, Kidambi et al. (2020) append the unobserved x velocity to the observations to get access to the full state of the true MDP. The D4RL dataset version (v0 or v2) has also been criticized as providing different qualities for the same dataset (we use v2, like Kidambi et al. (2020) and Yu et al. (2021), while Yu et al. (2020) use v0). Another important point is the evaluation protocol, which sometimes assumes access to the real system for policy evaluation (Kidambi et al., 2020; Wu et al., 2020; Fujimoto et al., 2019; Kumar et al., 2019), and sometimes only reports the online evaluation score of the policy at the last agent-training iteration (Yu et al., 2020; 2021). Finally, the architectural choices of the model design and the chosen policy optimization algorithm can also impact the performance. Consequently, we believe that, beyond designing benchmark data sets, providing a unified evaluation framework for offline RL is highly necessary. We plan to explore this direction in future work.
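The Spearman rank correlation ($\rho$) and Pearson correlation ($r$) used earlier in this section to relate static metrics to dynamic scores can be computed as in the following sketch. It is a plain-Python illustration with average ranks for ties, not the code used for the reported numbers, and the input lists in the usage lines are invented toy values.

```python
import math


def pearson(x, y):
    """Pearson linear correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values: a monotone but nonlinear relation between a metric and a score.
toy_static = [1.0, 2.0, 3.0, 4.0]
toy_dynamic = [1.0, 4.0, 9.0, 16.0]
rho, r = spearman(toy_static, toy_dynamic), pearson(toy_static, toy_dynamic)
```

Reporting both coefficients is useful because a static metric may order the models correctly (high $\rho$) without being linearly related to the dynamic score (lower $r$), as in the toy values.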

7. CONCLUSION

In this paper, we ask which neural-network-based dynamic system models, estimating their own uncertainty, are best suited for conservatism-based MBRL algorithms. We build on previous work by Yu et al. (2020) (MOPO: model-based offline policy optimization), who use bootstrap ensembles. Through a rigorous empirical study incorporating metrics that assess different aspects of the model (precision, calibratedness, long-horizon performance), we show that deep autoregressive models can improve upon the baseline in Hopper, one of the D4RL benchmark environments. Our results exhibit the importance of calibratedness when the learned variance is used as an uncertainty heuristic for reward penalization. Future work includes confirming our results on other benchmarks and designing a unified offline RL evaluation protocol.

A IMPLEMENTATION DETAILS

MOPO implementation details. Following MBPO, MOPO uses a bootstrap ensemble of probabilistic neural networks $\{p^{\ell}_{\theta} = \mathcal{N}(\mu^{\ell}_{\theta}, \sigma^{\ell}_{\theta})\}_{\ell=1}^{D}$ trained independently by log-likelihood maximization. The dynamics model is a four-layer neural network with 200 units per layer, swish activation functions, and ridge regularization with a different weight decay on each hidden layer. During the model rollout generation phase, MOPO first samples initial states from the offline dataset, then performs short rollouts on the learned dynamics (with the horizon h ∈ {1, 5}).

Our implementation details. For all the models, we use a neural network composed of a number of common hidden layers and two output heads (with Tanh activation functions) for the mean and standard deviation of the learned probabilistic dynamics. We use batch normalization (Ioffe & Szegedy, 2015) and dropout layers (Srivastava et al., 2014), and treat the learning rate of the Adam optimizer (Kingma & Ba, 2015), the number of common layers, and the number of hidden units as hyperparameters, which we tune using the built-in hyperoptimization engine of the RAMP framework (Kégl et al., 2018). For the ensemble implementation, we replicate the DMDN model with the optimal hyperparameters and train the replicas by shuffling the training set (a practical variation on bootstrapping (Chua et al., 2018; Pineda et al., 2021)). In all experiments, we use an ensemble of three models. Table 4 shows the grid search ranges for the hyperparameters of our models.
Using the one-million-timestep D4RL datasets, we first determine the best model hyperparameters (in terms of the aggregate validation static metrics) on a subset of 50K training points (and 500K validation points), then we train the best models on 90% of the whole datasets. For the dynamic scores, we use Ray Tune (Moritz et al., 2018) to find the optimal hyperparameters (short rollout horizon h ∈ {1, 5, 50, 100}, uncertainty penalty λ ∈ {0.1, 1, 5, 25}, and uncertainty heuristic for ensembles u ∈ {Max aleatory (MA), Max pairwise difference (MPD), Ensemble standard deviation (ESD)}) on each model/data pair. We use the implementation of the open-source library StableBaselines3 (Raffin et al., 2021) for the SAC agents. We give the best hyperparameters for each model/data pair in Table 5.

The Hopper environment consists of a robot leg with 11 observations (rootz, rooty, thigh, leg, foot, rootx dot, rootz dot, rooty dot, thigh dot, leg dot, foot dot), comprising the angular positions and velocities of the leg joints and excluding the x position of the root joint. The action is a control signal applied by three actuators located in the three joints. The goal of the system is to hop forward as fast as possible (maximizing the velocity in the direction of x) while applying the smallest possible control (measured by $\|a_t\|_2^2$), and without falling into unhealthy states (terminal states where the position of the leg is physically infeasible). We detail the characteristics of the environment in Table 6.

$$\mathrm{R2}(D; \theta; j \in \{1, \ldots, d_s\}) = 1 - \frac{\frac{1}{N}\sum_{i=1}^{N}\left(s^j_{i,t+1} - \mu^j_\theta(s_{i,t}, a_{i,t})\right)^2}{\frac{1}{N}\sum_{i=1}^{N}\left(s^j_{i,t+1} - \bar{s}^j_{t+1}\right)^2}$$

where θ are the model parameters and $\bar{s}^j_{t+1}$ is the sample mean of the j-th dimension of $s_{t+1}$. R2 is at most 1 and typically lies between 0 and 1 (it becomes negative only when the model predicts worse than the constant sample mean); the higher the better.
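To make the explained variance metric concrete, the following is a minimal sketch for a single state dimension j, assuming the ground truth values and the model mean predictions are available as plain Python lists. The function name `r2_score` and the toy inputs are illustrative assumptions, not part of our released code.

```python
from statistics import fmean


def r2_score(targets, predictions):
    """Explained variance of one state dimension.

    Computes 1 - MSE(model) / MSE(constant sample-mean predictor).
    """
    if not targets or len(targets) != len(predictions):
        raise ValueError("targets and predictions must be non-empty and aligned")
    mean_target = fmean(targets)
    # Mean squared error of the model mean prediction.
    model_mse = fmean([(y - p) ** 2 for y, p in zip(targets, predictions)])
    # Mean squared error of the constant predictor (sample mean of the targets).
    baseline_mse = fmean([(y - mean_target) ** 2 for y in targets])
    return 1.0 - model_mse / baseline_mse


# Hypothetical toy values for illustration only.
ground_truth = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score(ground_truth, ground_truth)
constant = r2_score(ground_truth, [2.5, 2.5, 2.5, 2.5])
poor = r2_score(ground_truth, [0.0, 0.0, 0.0, 0.0])
```

In this toy example, a perfect predictor gives a score of 1, the constant mean predictor gives 0, and a predictor that is worse than the constant mean gives a negative score.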

LIKELIHOOD RATIO (LR):

The average log-likelihood evaluated on D is defined as

$$L(D; \theta; j \in \{1, \ldots, d_s\}) = \frac{1}{N}\sum_{i=1}^{N} \log p^j_\theta(s_{i,t+1} \mid s_{i,t}, a_{i,t})$$

where $p_\theta$ is the PDF of the Gaussian distribution induced by the learned parameters: $\mathcal{N}\left(\mu_\theta(s_t, a_t), \sigma_\theta(s_t, a_t)\right)$. The log-likelihood is a unitless measure that is hard to interpret on its own and that we ideally want to maximize. Following Kégl et al. (2021), we normalize $L$ with the log-likelihood of a multivariate unconditional Gaussian distribution ($L_{\text{baseline}}$) whose parameters are estimated from the dataset D:

$$\mathrm{LR}(D; \theta; j \in \{1, \ldots, d_s\}) = \frac{e^{L(D; \theta; j \in \{1, \ldots, d_s\})}}{e^{L_{\text{baseline}}(D; j \in \{1, \ldots, d_s\})}} \qquad (3)$$

OUTLIER RATE (OR):

In practice, the log-likelihood estimator is dominated by out-of-distribution test points where the likelihood tends to zero. For this reason, we omit the data points that have a likelihood smaller than or equal to $p_{\min} = 1.47 \times 10^{-6}$ when computing the LR. The OR metric is the proportion of data points that fall in this category. Formally:

$$\mathrm{OR}(D; \theta; j \in \{1, \ldots, d_s\}) = 1 - \frac{\left|\left\{(s_t, a_t, s_{t+1}) \in D : p^j_\theta(s_{t+1} \mid s_t, a_t) > p_{\min}\right\}\right|}{N}$$

OR is between 0 and 1, the lower the better.
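The interplay between LR and OR can be sketched as follows for one state dimension, assuming univariate Gaussian predictions given as lists of means and standard deviations. Two choices here are assumptions made for the illustration and are not specified above: the baseline Gaussian is fitted on all targets, and its log-likelihood is averaged over the same retained (non-outlier) points as the model log-likelihood. The function names are hypothetical.

```python
import math
from statistics import fmean, pstdev

P_MIN = 1.47e-6  # likelihood threshold below which a point counts as an outlier


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def likelihood_ratio_and_outlier_rate(targets, means, stds, p_min=P_MIN):
    """Return (LR, OR) for one state dimension."""
    model_p = [gaussian_pdf(y, m, s) for y, m, s in zip(targets, means, stds)]
    kept = [i for i, p in enumerate(model_p) if p > p_min]
    outlier_rate = 1.0 - len(kept) / len(targets)
    if not kept:
        raise ValueError("all points are outliers, LR is undefined")
    # Assumption: unconditional baseline fitted on the whole dimension.
    base_mean, base_std = fmean(targets), pstdev(targets)
    model_ll = fmean([math.log(model_p[i]) for i in kept])
    base_ll = fmean(
        [math.log(gaussian_pdf(targets[i], base_mean, base_std)) for i in kept]
    )
    # LR = exp(L) / exp(L_baseline), computed in log space for stability.
    return math.exp(model_ll - base_ll), outlier_rate


# Hypothetical toy values: the last prediction is far off and becomes an outlier.
lr, outlier_rate = likelihood_ratio_and_outlier_rate(
    targets=[0.0, 1.0, 2.0, 3.0],
    means=[0.1, 0.9, 2.1, 50.0],
    stds=[0.5, 0.5, 0.5, 0.5],
)
```

An LR above 1 means that the model assigns, on average, a higher likelihood to the retained points than the unconditional baseline does.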

CALIBRATEDNESS (KS):

This metric is computed using the quantile (under the model distribution) of the ground truth values. Hypothetically, these quantiles are uniform if the error we make on the ground truth is a random variable distributed according to a Gaussian having the predicted standard deviation, a property we characterize as calibratedness. To assess this, we compute the Kolmogorov-Smirnov (KS) statistic. Formally, starting from the model cumulative distribution function (CDF) $F_\theta(s_{t+1} \mid s_t, a_t)$, we define the empirical CDF of the quantiles of ground truth values by

$$\mathcal{F}_{\theta,j}(x) = \frac{\left|\left\{(s_t, a_t, s_{t+1}) \in D : F^j_\theta(s_{t+1} \mid s_t, a_t) \le x\right\}\right|}{N}$$

for $x \in [0, 1]$. We denote by $U(x)$ the CDF of the uniform distribution over the interval [0, 1], and we define the KS statistic as the largest absolute difference between the two CDFs across the dataset D:

$$\mathrm{KS}(D; \theta; j \in \{1, \ldots, d_s\}) = \max_{i \in \{1, \ldots, N\}} \left| \mathcal{F}_{\theta,j}\left(F^j_\theta(s_{i,t+1} \mid s_{i,t}, a_{i,t})\right) - U\left(F^j_\theta(s_{i,t+1} \mid s_{i,t}, a_{i,t})\right) \right| \qquad (5)$$

The KS score is between zero and one, the lower the better.

LONG HORIZON METRICS KS(L) AND R2(L):

Although the models are trained to optimize the one-step prediction log-likelihood score, we want to assess their precision and calibratedness at a longer horizon. Indeed, during the agent learning phase we sample trajectories of multiple steps, which can lead to uncertain regions in the case of significant compounding errors down the horizon. For this purpose, we use ground truth actions from a system trace to generate a population of $n \in \mathbb{N}$ trajectories of length $L_{\max}$: $Y_L = [\hat{s}_{\ell,t+1:t+L_{\max}}]_{\ell=1}^{n}$, and use the mean predictions to compute a Monte-Carlo estimate of the R2(L) metric, for $L = 0, \ldots, L_{\max}$, using the sample mean $\hat{\mu}_\theta(s_{t+L} \mid s_t, a_t) = \frac{1}{n}\sum_{\hat{s} \in Y_L} \hat{s}_{t+L}$ as approximate prediction. For the KS(L) metric, we estimate the model CDF with the order statistic $F_\theta(s_{t+L} \mid s_t, a_t) = \frac{|\{\hat{s} \in Y_L : \hat{s}_{t+L} \le s_{t+L}\}|}{n}$ among the population of trajectories.
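The calibratedness computation can be sketched as follows for one state dimension, assuming univariate Gaussian predictions. The statistic is evaluated only at the observed quantiles, as in the definition of this section, which differs slightly from the classical two-sided KS statistic. The function names are illustrative assumptions; `order_statistic_cdf` mimics the sample-based CDF estimate used for KS(L).

```python
import bisect
import math


def gaussian_cdf(x, mean, std):
    """CDF of a univariate Gaussian at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def ks_statistic(quantiles):
    """Largest gap between the empirical CDF of the quantiles and U(x) = x."""
    if not quantiles:
        raise ValueError("at least one quantile is required")
    ordered = sorted(quantiles)
    n = len(ordered)
    # bisect_right counts the quantiles that are <= q (ties included).
    return max(abs(bisect.bisect_right(ordered, q) / n - q) for q in ordered)


def one_step_ks(targets, means, stds):
    """KS calibratedness of one-step Gaussian predictions."""
    quantiles = [gaussian_cdf(y, m, s) for y, m, s in zip(targets, means, stds)]
    return ks_statistic(quantiles)


def order_statistic_cdf(sampled_values, true_value):
    """Sample-based CDF estimate: share of sampled values <= the true value."""
    return sum(1 for v in sampled_values if v <= true_value) / len(sampled_values)


# Hypothetical toy values: every target equals its predicted mean, so all
# quantiles are 0.5, which is far from uniform.
overconfident_free = one_step_ks([1.0, 1.0], [1.0, 1.0], [0.3, 2.0])
uniform_like = ks_statistic([0.25, 0.5, 0.75, 1.0])
long_horizon_quantile = order_statistic_cdf([1.0, 2.0, 3.0, 4.0], 2.5)
```

For the long-horizon metric, the quantiles passed to `ks_statistic` would come from `order_statistic_cdf` applied to the population of sampled trajectories instead of from the closed-form Gaussian CDF.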

D PER-DIMENSION STATIC METRICS

In all plots, as in Table 1, the KS score is multiplied by 1000, and the OR and R2 scores are multiplied by 10000.



Initially, MOPO selected 5% of the batch from the real system dataset D, and 95% from model rollouts. However, Yu et al. (2020) show that this does not influence the performance of the algorithm.

As a related remark, we consider the very large variance of the return, both across seeds and across training iterations of the agent, crucial, arguably the most important problem of offline RL, but outside the scope of this paper.

The medium-expert dataset contains 2M timesteps, which is costly in compute and memory. We therefore omit this experiment.

Some issues have been raised about the dataset versions in prior work: Lu et al. (2022), Issue 1, Issue 2.



• DARMDN(D): Deep autoregressive mixture density net. $d_s \in \mathbb{N}^*$ feed-forward neural networks that learn the parameters (mean and log-standard deviation) and the weights of $D \in \mathbb{N}^*$ univariate Gaussian distributions ($d_s$ being the dimension of the state space S). Although our implementation is general, for the rest of the paper we only consider DARMDN(1) because of a runtime bottleneck, and we refer to it simply as DARMDN.
• DMDN(D): Deep mixture density net. A feed-forward neural network that learns the parameters (mean and log-standard deviation) and the weights of $D \in \mathbb{N}^*$ multivariate Gaussian distributions. For similar reasons as for DARMDN, we only consider DMDN(1) and refer to it as DMDN.
• ENS: Ensemble of $D \in \mathbb{N}^*$ DMDN models. We implement a vectorized version that is optimized to run on graphics processing units (GPUs). Notice that ENS is equivalent to the original model used by MOPO, modulo architectural choices.
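To illustrate the autoregressive structure of DARMDN(1), the following sketch samples the next state one dimension at a time, each dimension being conditioned on the current state, the action, and the previously sampled dimensions of the next state. The hand-written linear conditionals stand in for the $d_s$ neural networks and are purely hypothetical, as are all function names.

```python
import math
import random


def gaussian_logpdf(x, mean, std):
    """Log-density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))


def sample_next_state(conditionals, state, action, rng):
    """Sample s_{t+1} dimension by dimension (one conditional per dimension)."""
    next_state = []
    for conditional in conditionals:
        # Each conditional sees the dimensions of s_{t+1} sampled so far.
        mean, std = conditional(state, action, list(next_state))
        next_state.append(rng.gauss(mean, std))
    return next_state


def joint_logpdf(conditionals, state, action, next_state):
    """Joint log-likelihood as the sum of the per-dimension conditionals."""
    total = 0.0
    for j, conditional in enumerate(conditionals):
        mean, std = conditional(state, action, next_state[:j])
        total += gaussian_logpdf(next_state[j], mean, std)
    return total


# Hypothetical stand-ins for the learned per-dimension networks.
def toy_dim0(state, action, previous):
    return state[0] + action[0], 0.1


def toy_dim1(state, action, previous):
    # Depends on the already sampled first dimension of the next state.
    return state[1] + 0.5 * previous[0], 0.2


TOY_CONDITIONALS = [toy_dim0, toy_dim1]
sample = sample_next_state(TOY_CONDITIONALS, [1.0, 2.0], [0.5], random.Random(0))
logp = joint_logpdf(TOY_CONDITIONALS, [1.0, 2.0], [0.5], [1.5, 2.75])
```

Because the joint likelihood factorizes into univariate terms, the per-dimension static metrics of Appendix C can be computed directly from each conditional.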

Figure 2: Histogram of Hopper's thigh ground truth quantiles, under the model distribution (D4RL medium dataset). The legend also includes the value of the KS calibratedness metric. The dotted red line indicates the ideal case when the quantiles follow a uniform distribution.

Figure 3: Long horizon explained variance R2(L) in the D4RL random dataset.

Figure 4: The Spearman and Pearson correlations between the episodic return and LR/-KS metrics on the D4RL medium dataset.

A value of ρ = 1 indicates that the static metric preserves the ranking observed in the dynamic evaluation (sufficient for model selection), while r = 1 indicates that the gaps observed in the static metric are on the same scale as those observed in the dynamic performance (linear correlation). The results in Figure 4 and Appendix G show that in most datasets, the two most correlating metrics are LR (ρ = 1.0 and r = 0.93) and KS(1) (ρ = 1.0 and r = 0.83), metrics that evaluate the calibratedness of the models. This underlines the fact that autoregressive models yield the best agent because of their ability to learn one-step uncertainty estimates that represent their true errors well.
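The difference between the two correlation coefficients can be illustrated with a short sketch, assuming one static score and one dynamic score per model. The implementation below uses only the standard library; the function names and the toy scores are hypothetical and do not correspond to the values reported in our tables.

```python
import math


def pearson(xs, ys):
    """Pearson bivariate correlation r between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical scores: the ranking is preserved but the relation is nonlinear.
static_scores = [1.0, 2.0, 3.0, 4.0]
dynamic_scores = [1.0, 4.0, 9.0, 16.0]
rho = spearman(static_scores, dynamic_scores)
r = pearson(static_scores, dynamic_scores)
```

In this toy example the rank correlation equals 1 because the ordering of the models is identical under both scores, while the Pearson correlation is slightly below 1 because the gaps are not on the same scale.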

Figure 5: Kolmogorov-Smirnov (KS) statistic (in red) of the predicted reward.

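$\mathcal{F}_{\theta,j}(x) = \frac{|\{(s_t, a_t, s_{t+1}) \in D : F^j_\theta(s_{t+1} \mid s_t, a_t) \le x\}|}{N}$ for $x \in [0, 1]$. We denote by $U(x)$ the CDF of the uniform distribution over the interval [0, 1], and we define the KS statistic as the largest absolute difference between the two CDFs across the dataset D: $\mathrm{KS}(D; \theta; j \in \{1, \ldots, d_s\}) = \max_{i \in \{1, \ldots, N\}} \left| \mathcal{F}_{\theta,j}\left(F^j_\theta(s_{i,t+1} \mid s_{i,t}, a_{i,t})\right) - U\left(F^j_\theta(s_{i,t+1} \mid s_{i,t}, a_{i,t})\right) \right|$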

Figure 6: Per-dimension static metrics in the random dataset. The metrics include: R2, KS, LR, and OR. They are computed for all Hopper observables, in addition to the predicted reward (labeled obs reward). The dots show the mean ± the standard deviation among the training and the validation scores for each metric.

Figure 11: Per-dimension Error quantile histograms in the medium dataset. The plot shows the ground truth validation quantiles under the model distribution. The legend includes the value of the KS calibratedness metric, and the dotted red line indicates the ideal case when the quantiles follow a uniform distribution. The histograms are computed for all Hopper observables, in addition to the predicted reward (labeled obs reward).

Figure 14: Long horizon explained variance R2(L) and calibratedness KS(L). The metric is aggregated by averaging over Hopper's observables and predicted reward.

penalty coefficient λ, rollout horizon h, number of SAC training batches B, conservatism penalty û(s, a).
Train dynamics model p on offline dataset D;
Initialize SAC policy π and empty replay buffer $D_{\text{model}}$;
for 1, 2, . . .
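The role of the conservatism penalty û(s, a) and of the penalty coefficient λ can be sketched as follows for an ensemble of Gaussian predictions. The three heuristics are written the way MA, MPD, and ESD are commonly defined in the literature (maximum norm of the predicted standard deviations, maximum distance between predicted means, and norm of the standard deviation of the equally weighted mixture); these exact formulas, the function names, and the toy values are assumptions made for the illustration and may differ in detail from our implementation.

```python
import math
from itertools import combinations


def l2_norm(vector):
    return math.sqrt(sum(x * x for x in vector))


def max_aleatoric(means, stds):
    """MA: largest norm of the predicted standard deviations."""
    return max(l2_norm(s) for s in stds)


def max_pairwise_difference(means, stds):
    """MPD: largest distance between the mean predictions of two members."""
    return max(
        (l2_norm([a - b for a, b in zip(m1, m2)]) for m1, m2 in combinations(means, 2)),
        default=0.0,
    )


def ensemble_std(means, stds):
    """ESD: norm of the per-dimension std of the equally weighted mixture."""
    n = len(means)
    total_var = 0.0
    for j in range(len(means[0])):
        mean_j = sum(m[j] for m in means) / n
        second_moment = sum(s[j] ** 2 + m[j] ** 2 for m, s in zip(means, stds)) / n
        total_var += max(second_moment - mean_j ** 2, 0.0)
    return math.sqrt(total_var)


def penalized_reward(reward, means, stds, penalty_coefficient, heuristic):
    """Uncertainty-penalized reward: r - lambda * u(s, a)."""
    return reward - penalty_coefficient * heuristic(means, stds)


# Hypothetical predictions of a two-member ensemble for a 2-dimensional state.
toy_means = [[0.0, 0.0], [3.0, 4.0]]
toy_stds = [[1.0, 0.0], [0.0, 2.0]]
ma = max_aleatoric(toy_means, toy_stds)
mpd = max_pairwise_difference(toy_means, toy_stds)
esd = ensemble_std(toy_means, toy_stds)
r_tilde = penalized_reward(10.0, toy_means, toy_stds, 1.0, max_pairwise_difference)
```

For a single model such as DARMDN or DMDN, only the predicted standard deviation is available, so the penalty reduces to the norm of that standard deviation.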

Model dynamic evaluation: mean ± std over 3 seeds of the hyperoptimal SAC agents. The reported score is the D4RL normalized score explained in Section 5.


(MOPO: model-based offline policy optimization) who use bootstrap ensembles. Through a rigorous empirical study incorporating metrics that assess different aspects of the model (precision, calibratedness, long-horizon performance), we show that deep autoregressive models can improve upon the baseline in Hopper, one of the D4RL benchmark environments. Our results exhibit the importance of calibratedness when the learned variance is used as an uncertainty heuristic for reward penalization. Future work includes confirming our results on other benchmarks and designing a unified offline RL evaluation protocol.

Baohe Zhang, Raghunandan Rajan, Luis Pineda, Nathan Lambert, André Biedenkapp, Kurtland Chua, Frank Hutter, and Roberto Calandra. On the importance of hyperparameter optimization for model-based reinforcement learning. In AISTATS, 2021a.

Michael R Zhang, Thomas Paine, Ofir Nachum, Cosmin Paduraru, George Tucker, Ziyu Wang, and Mohammad Norouzi. Autoregressive dynamics models for offline policy evaluation and optimization. In International Conference on Learning Representations, 2021b. URL https://openreview.net/forum?id=kmqjgSNXby.

Model hyperparameters Grid search range.

The optimal hyperparameters for all model/data pairs.

Hopper characteristics.

All metrics will be evaluated on a dataset D of size N, consisting of transitions in the real system. D stands for a held-out validation set of the offline training datasets.

REPRODUCIBILITY STATEMENT

To ensure reproducibility, we will release the code at <URL hidden for review> once the paper has been accepted. Finally, the hyperparameters of the algorithms are listed in Appendix A and the pseudocode is shown in Section 4.

G STATIC AND DYNAMIC METRICS CORRELATIONS

G.1 RANDOM DATASET

Figure 15: The Spearman and Pearson correlations between the episodic return and the static metrics (LR, negative OR, R2(1), negative KS(1), R2(10), negative KS(10), R2(20), negative KS(20)) in the random dataset. To uniformly evaluate the metrics' positive correlation with the episodic return, we take the negative of the metrics for which smaller is better (KS(L) and OR).

G.2 MEDIUM DATASET

Figure 16: The Spearman and Pearson correlations between the episodic return and the static metrics (LR, negative OR, R2(1), negative KS(1), R2(10), negative KS(10), R2(20), negative KS(20)) in the medium dataset. To uniformly evaluate the metrics' positive correlation with the episodic return, we take the negative of the metrics for which smaller is better (KS(L) and OR).

G.3 MEDIUM-REPLAY DATASET

Figure 17: The Spearman and Pearson correlations between the episodic return and the static metrics (LR, negative OR, R2(1), negative KS(1), R2(10), negative KS(10), R2(20), negative KS(20)) in the medium-replay dataset. To uniformly evaluate the metrics' positive correlation with the episodic return, we take the negative of the metrics for which smaller is better (KS(L) and OR).

G.4 MEDIUM-EXPERT DATASET

Figure 18: The Spearman and Pearson correlations between the episodic return and the static metrics (LR, negative OR, R2(1), negative KS(1), R2(10), negative KS(10), R2(20), negative KS(20)) in the medium-expert dataset. To uniformly evaluate the metrics' positive correlation with the episodic return, we take the negative of the metrics for which smaller is better (KS(L) and OR).

