ENTROPY-REGULARIZED MODEL-BASED OFFLINE REINFORCEMENT LEARNING Anonymous authors Paper under double-blind review

Abstract

Model-based approaches to offline Reinforcement Learning (RL) aim to remedy the problem of sample complexity in offline learning by first estimating a pessimistic Markov Decision Process (MDP) from offline data and then freely exploring in the learned MDP for policy optimization. Recent advances in model-based RL mainly rely on an ensemble of models to quantify the uncertainty of the empirical MDP, which is leveraged to penalize out-of-distribution state-action pairs during policy learning. However, the performance of ensembles for uncertainty quantification depends heavily on how they are implemented in practice, which can be a limiting factor. In this paper, we propose a systematic way to measure the epistemic uncertainty and present EMO, an Entropy-regularized Model-based Offline RL approach, which provides a smooth error estimate when leaving the support of the data toward uncertain areas. We optimize a single neural architecture that maximizes the likelihood of the offline data distribution while regularizing the transitions that lie outside of the data support. Empirical results demonstrate that our framework achieves competitive performance compared to state-of-the-art offline RL methods on D4RL benchmark datasets.

1. INTRODUCTION

Following the major success of deep Reinforcement Learning (RL) in numerous applications (Mnih et al., 2013; 2015; Silver et al., 2018), offline RL has emerged to cope with problems where simulation or online interaction is impractical, costly, and/or dangerous, making it possible to automate a wide range of decision-making problems from healthcare and education to finance and robotics (Levine et al., 2020). The primary challenge in these scenarios, however, is that learning new policies from data collected with a different (possibly sub-optimal) policy, known as the behavior policy, suffers from distributional shift. This results in extrapolation error, which cannot be corrected due to the lack of additional exploration (Fujimoto et al., 2019; Kumar et al., 2019). This is why standard (online) RL methods perform poorly in offline settings (Yu et al., 2020). Consequently, several model-free offline RL algorithms have been introduced to regularize the learned policies to stay close to the behavior policy by constraining out-of-distribution trajectories (Fujimoto et al., 2019; Kumar et al., 2019; Wu et al., 2019; Kumar et al., 2020; Agarwal et al., 2020). In model-free methods, policy optimization is limited to already observed states, which most likely do not provide sufficient coverage of the entire state space. Alternatively, model-based methods first learn the corresponding empirical Markov Decision Process (MDP) from the offline dataset and then freely explore in the learned environment for policy optimization, which can attain excellent sample efficiency compared to model-free methods (Chua et al., 2018; Janner et al., 2019).
Most recently, model-based algorithms have been designed specifically for offline settings to address distributional shift in the learned model, and have proven effective in certain problems compared to their model-free counterparts (Yu et al., 2020; 2021; Kidambi et al., 2020; Zhan et al., 2021; Swazinna et al., 2021; Chen et al., 2021; Rigter et al., 2022). However, prominent model-based methods, i.e., MOPO (Yu et al., 2020) and MOReL (Kidambi et al., 2020), mainly leverage an ensemble of models for uncertainty quantification. Ensemble uncertainty quantification is a special case of uncertainty quantification in Bayesian neural networks with latent variables using nearest-neighbor methods, introduced by Depeweg et al. (2018), where each model in the ensemble corresponds to a sample from the posterior distribution. In these methods, a measure of ensemble discrepancy determines the estimation error. This can be particularly restrictive when theoretical assumptions on the ensemble do not hold in practical scenarios. In practice, an ensemble usually consists of a small number of models, where each model is a different initialization of the same neural architecture, trained on the same data. Hence, the models in the ensemble are likely to be correlated with one another after training, which might make their variation a poor indicator of uncertainty. Yu et al. (2021) study this behavior and demonstrate that the uncertainty estimated via maximum variance over the learned ensemble (as in MOPO) struggles to accurately predict the model's error and could lead to poor performance (see Fig. 2 in Yu et al. (2021)). Accordingly, there have been efforts to eliminate the need for bootstrap ensembles for uncertainty estimation in model-based offline RL methods. Yu et al. (2021) propose utilizing model rollouts to conservatively learn the Q-function by penalizing the values over out-of-distribution areas, while Rigter et al.
(2022) introduce an adversarial framework for training the policy and the model at the same time, such that at each step, the policy is trained to maximize the return, while the model is tuned to minimize it. In addition, Tennenholtz et al. (2021) propose to quantify uncertainty using a k-nearest neighbors approach, where the distance measure is defined as an approximate metric on the learned (Riemannian) manifold in a latent space encoded by a VAE. Although RAMBO-RL (Rigter et al., 2022) and COMBO (Yu et al., 2021) have shown promising empirical results on standard benchmark datasets, they both forgo the modularity of methods such as MOPO and MOReL, and GELATO (Tennenholtz et al., 2021) is computationally expensive. Instead, we aim to get the best of both worlds: a general-purpose, task-agnostic, computationally efficient framework that learns a pessimistic model of the environment without ensemble learning, and that remains modular in the sense that the trained model can be combined with arbitrary RL algorithms to learn arbitrary tasks. To this end, we present EMO, an Entropy-regularized Model-based Offline RL approach, which learns a pessimistic MDP using only a single model. This model provides accurate estimates of the dynamics in the support of the offline data while softly quantifying an upper bound on the uncertainty of model predictions when leaving the data support. Specifically, we devise a regularized loss function that minimizes the negative log-likelihood of the model w.r.t. the offline data distribution and simultaneously maximizes the entropy of predictions outside of the data support, all in a single model.
Furthermore, we propose to warm-start the learning procedure by optimizing only the unconstrained objective function; the initial model learned in this step is then used to generate rollouts for optimizing the uncertainty estimation. Our extensive empirical study illustrates that our approach achieves performance better than or on par with state-of-the-art (SOTA) offline RL techniques, both model-free and model-based, on D4RL benchmark datasets for MuJoCo environments.

2. RELATED WORK

Offline reinforcement learning (Lange et al., 2012), which allows for optimizing policies from static offline datasets, has received a lot of attention in recent years, as the practical issues of applying online RL to many real-world scenarios have become more apparent. Model-free offline RL approaches optimize a policy solely based on the visited states from the static offline data, without utilizing a learned model of the environment. Constraining the policy to be close to the behavior policy (Kumar et al., 2019; Fujimoto et al., 2019; Wu et al., 2019; Fujimoto & Gu, 2021), conservative estimation of value functions (Kumar et al., 2020; Kostrikov et al., 2021), incorporating the uncertainty of predictions to stabilize Q-functions (Agarwal et al., 2020; Wu et al., 2021), and adversarial training of actor and critic (Cheng et al., 2022) are among the active lines of work in model-free offline RL. However, due to their limited generalization, the performance of model-free methods is highly reliant on the optimality of the offline data. Model-based approaches, on the other hand, incorporate a model of the environment to improve generalization and sample efficiency; the model is used as a surrogate for the actual MDP to optimize a policy, combined with the original offline data. MOPO (Yu et al., 2020) and MOReL (Kidambi et al., 2020) incorporate ensemble uncertainty estimation to penalize highly uncertain transitions. COMBO (Yu et al., 2021) combines the idea of conservative estimation of value functions in CQL (Kumar et al., 2020) with a model-based learning framework. RAMBO-RL (Rigter et al., 2022) proposes an adversarial framework for training the policy and the model at the same time, such that at each step, the policy is trained to maximize the return, while the model is tuned to minimize it. As discussed by Yu et al. (2021), algorithms like MOPO and MOReL that rely on ensemble uncertainty estimation are prone to erroneous estimations of uncertainty.
On the other hand, COMBO and RAMBO-RL are not modular frameworks; COMBO applies the idea of conservative learning of value function as a part of its proposed RL algorithm, and RAMBO-RL tunes the model to be pessimistic with respect to the current policy of the agent; therefore, the resulting model from learning a particular task with either of these methods cannot be utilized to learn a new task in the environment. EMO, however, tries to eliminate the need for ensemble uncertainty quantification, while remaining a modular framework that can be extended to various tasks and RL algorithms.

3.1. PRELIMINARIES

The Reinforcement Learning (RL) problem is characterized via a Markov Decision Process (MDP) (Sutton & Barto, 2018), represented by $M = (S, A, T, r, \gamma, \rho_0)$, where $S$ and $A$ denote the state space and action space, respectively, $T(s'|s, a)$ is the transition distribution, $r(s, a)$ is the reward function, $\gamma \in (0, 1)$ is the discount factor, and $\rho_0$ is the distribution of the initial state. A policy $\pi(a|s)$ is defined as a mapping from states to a distribution over actions, $\pi : S \times A \to [0, 1]$, and the goal is to learn a policy $\pi^*$ that maximizes the expected discounted return $\eta_M(\pi)$ when followed:

$$\pi^* = \arg\max_\pi \eta_M(\pi), \quad \text{where} \quad \eta_M(\pi) = \mathbb{E}_{\pi, \rho_0, T}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right].$$

However, in offline settings, we are not able to accurately evaluate the return values under different policies, as there is no further interaction with the environment. Instead, a static dataset $D = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{n}$, collected under an arbitrary policy in the environment, is provided. Therefore, the objective turns into finding the best policy that can be optimized solely on the available offline data $D$. Note that in this context, the best policy might be different from the optimal policy, as the performance of the resulting policy, regardless of the learning algorithm, is affected and capped by factors such as the distribution and optimality of the static data. In this paper, we focus on model-based offline reinforcement learning, where an empirical model of the environment is estimated and leveraged to enhance sample efficiency over model-free approaches. In this framework, the offline dataset is used to train a pessimistic MDP $\widetilde{M}$, which is then employed as a surrogate for the actual model of the environment. Subsequently, the RL agent interacts with this model and optimizes its policy based on both the acquired information and the original transitions from the offline data.
Ideally, we aim to find a policy $\hat{\pi}^*$ with minimum sub-optimality with respect to the optimal policy $\pi^*$, i.e., $\hat{\pi}^* = \arg\min_\pi \left[\eta_M(\pi^*) - \eta_M(\pi)\right]$. The general workflow of EMO is depicted in Figure 1. In this figure, the offline dataset is first used to train a pessimistic model of the environment in a two-phase training regime. The model then serves as a surrogate for the actual environment to train a policy $\hat{\pi}^*(a|s)$.
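As a small illustration of the quantities defined in this section, the following Python sketch computes the discounted return of a single finite trajectory and the sub-optimality gap between two return values. The function names are hypothetical and not part of the original method; in practice $\eta_M(\pi)$ is an expectation and would be estimated by averaging over many trajectories.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t for one finite trajectory of rewards."""
    total = 0.0
    # Accumulate backwards using G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def suboptimality(eta_optimal, eta_policy):
    """Gap eta_M(pi*) - eta_M(pi); non-negative when pi* is optimal."""
    return eta_optimal - eta_policy


if __name__ == "__main__":
    # Three unit rewards with gamma = 0.5: 1 + 0.5 + 0.25.
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```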

3.2. MODEL OPTIMIZATION

In this section, we present a novel approach for model optimization, the main part of EMO, which learns a pessimistic MDP using only a single neural network with two aims. First, the estimated model should accurately capture the dynamics of the environment (which can only be reliable in the support of the offline data), and second, it should relatively quantify the uncertainty associated with model predictions in the form of an error estimator. We address these goals in a regularized optimization setting, where both the likelihood of the data and the error estimation are jointly optimized in a single model. The error estimator, denoted by $u$, allows the framework to relatively differentiate between the reliable and unreliable predictions of the model, which can then be used to penalize unreliable predictions based on their estimated uncertainty. Let $r(s, a)$ and $u(s, a)$ be the associated reward function and estimated error for a particular state-action pair $(s, a)$, respectively. A pessimistic model can be determined by defining an alternative reward value of the form $\tilde{r}(s, a) = r(s, a) - \lambda u(s, a)$, where $\lambda$ is a constant that controls the amount of penalty associated with the state-action pair. As a result, the policies will be prevented from exploiting unreliable predictions, which can translate to conservative exploration in the actual environment. Inspired by Yu et al. (2020) and Kidambi et al. (2020), we characterize the one-step model of the environment with a Gaussian distribution over the next state and reward, conditioned on the current state and action, i.e., $P(s', r \mid s, a) = \mathcal{N}\left(\mu_\theta(s, a), \Sigma_\phi(s, a)\right)$, where $\theta$ and $\phi$ are the weights of the corresponding neural networks. The mean vector $\mu_\theta(s, a)$ has dimension $d + 1$ and the covariance matrix $\Sigma_\phi(s, a)$ has size $(d + 1) \times (d + 1)$, where $d$ is the dimensionality of the state space and the covariance matrix is assumed to be diagonal.
In this work, the main reason for explicitly estimating the covariance matrix lies in the fact that the entropy of a Gaussian distribution, as a quantitative measure of uncertainty, is a function of the covariance matrix,

$$H = \frac{1}{2} \log \det \Sigma_\phi(s, a) + \frac{d + 1}{2} \left(1 + \log(2\pi)\right),$$

and assuming that the covariance matrix is diagonal, we have:

$$H = \frac{1}{2} \sum_{i=1}^{d+1} \log \Sigma_{\phi,i}(s, a) + \frac{d + 1}{2} \left(1 + \log(2\pi)\right). \tag{1}$$

Consequently, $\Sigma_\phi(s, a)$ alone can act as the uncertainty/error estimator, while $\mu_\theta(s, a)$ is trained to model the transition dynamics. In this way, both terms can be optimized in a single model to learn a pessimistic MDP that allows for utilizing the offline data in a more efficient and reliable way. Furthermore, we propose a two-step algorithmic framework for learning the pessimistic model of the environment. During the first phase, referred to as the warm-up phase, the model is trained via maximum likelihood following prior work (Yu et al., 2020; Kidambi et al., 2020). In the second phase, which we call the regularization phase, the model is leveraged to generate synthetic rollouts, which are then used to maximize the entropy of model predictions over the out-of-distribution data points. Note that the warm-up phase is essential to the next phase, since the additional data generated from the model needs to be as close to the actual dynamics of the environment as possible, particularly in the support of the offline data and possibly its generalizable neighborhood.
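The diagonal-covariance entropy in Equation 1 can be checked numerically with a short Python sketch. The function name `diag_gaussian_entropy` is a hypothetical helper introduced here for illustration; its input is the list of diagonal variances $\Sigma_{\phi,i}(s, a)$ for one state-action pair.

```python
import math


def diag_gaussian_entropy(variances):
    """Differential entropy of a Gaussian with diagonal covariance.

    `variances` holds the k = d + 1 diagonal entries of Sigma.
    """
    k = len(variances)
    # For a diagonal matrix, log det(Sigma) is the sum of the log-variances.
    log_det = sum(math.log(v) for v in variances)
    return 0.5 * log_det + 0.5 * k * (1.0 + math.log(2.0 * math.pi))


if __name__ == "__main__":
    # Larger predicted variances give a larger entropy (more uncertainty).
    print(diag_gaussian_entropy([1.0, 1.0]))
    print(diag_gaussian_entropy([4.0, 1.0]))
```

Because the second term is constant in $\Sigma_\phi$, only the sum of log-variances changes between state-action pairs, which is why the covariance head alone can serve as the uncertainty estimator.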

3.2.1. WARM-UP PHASE

In this phase, we employ the Gaussian negative log-likelihood (NLL) loss to train an initial model $M_{\text{init}}$ of the environment on the offline dataset. Let $B_D$ be a batch sampled from the offline data $D$. The NLL objective function, denoted by $L_1$, can be written as in Equation 2:

$$L_1(\theta, \phi; B_D) = \frac{1}{|B_D|} \sum_{(s, a, r, s') \in B_D} \sum_{i=1}^{d+1} \frac{1}{2} \left[ \log \Sigma_{\phi,i}(s, a) + \frac{\left(\mu_{\theta,i}(s, a) - [s', r]_i\right)^2}{\Sigma_{\phi,i}(s, a)} \right]. \tag{2}$$

At the end of the warm-up phase, we expect $\mu_\theta(s, a)$ to provide accurate and reliable predictions for $(s', r)$ conditioned on $(s, a)$ in the support of the offline data. Additionally, depending on the generalizability of the model, the performance can be extended to a neighborhood around the offline data distribution. Similarly, $\Sigma_\phi(s, a)$ can capture the uncertainty in the areas supported by the offline data. However, no such claim can be made about $\mu_\theta$ and $\Sigma_\phi$ over out-of-distribution (OOD) data points, and thus they might predict arbitrary values in those regions.

Algorithm 1 Generating Rollouts
Require: $D$, $\mu_\theta$, $\Sigma_\phi$, $\pi_e$, batch size $b$, rollout horizon $h$, penalty coefficient $\lambda$
 1: Set $B_{\pi_e} \leftarrow \emptyset$
 2: for $1, 2, \ldots, b$ (in parallel) do
 3:     Set $\rho \leftarrow \emptyset$
 4:     Sample state $s_1$ from $D$ for the initialization of the rollout.
 5:     for $j = 1, 2, \ldots, h$ do
 6:         Sample an action $a_j \sim \pi_e(s_j)$
 7:         Obtain $s_{j+1}, r_j = \mu_\theta(s_j, a_j)$
 8:         (Optional) Compute $\tilde{r}_j = r_j - \lambda \, \mathrm{tr}(\Sigma_\phi(s_j, a_j))$
 9:         Add $(s_j, a_j, \tilde{r}_j, s_{j+1})$ to $\rho$
10:     end for
11:     Add $\rho$ to $B_{\pi_e}$
12: end for
13: return $B_{\pi_e}$
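As a rough illustration of the warm-up objective $L_1$ (Equation 2), the sketch below evaluates the diagonal-Gaussian NLL, with the additive constant dropped, over a batch of transitions. It is a plain-Python sketch under stated assumptions, not the authors' implementation: `mean_fn` and `var_fn` are hypothetical callables standing in for the two network heads $\mu_\theta$ and $\Sigma_\phi$, each returning a list of length $d + 1$, and no gradient computation is shown.

```python
import math


def gaussian_nll(batch, mean_fn, var_fn):
    """Average diagonal-Gaussian NLL (constant term dropped) over a batch.

    batch: list of (s, a, r, s_next) tuples, with s_next a list of floats.
    mean_fn(s, a), var_fn(s, a): hypothetical stand-ins for the mean and
    variance heads, each returning d + 1 values (next state, then reward).
    """
    total = 0.0
    for s, a, r, s_next in batch:
        target = list(s_next) + [r]  # the concatenated target [s', r]
        mu = mean_fn(s, a)
        var = var_fn(s, a)
        for m, v, y in zip(mu, var, target):
            # Log-variance term plus squared error scaled by the variance.
            total += 0.5 * (math.log(v) + (m - y) ** 2 / v)
    return total / len(batch)


if __name__ == "__main__":
    batch = [([0.0], [0.0], 1.0, [2.0])]
    exact = gaussian_nll(batch, lambda s, a: [2.0, 1.0], lambda s, a: [1.0, 1.0])
    off = gaussian_nll(batch, lambda s, a: [3.0, 1.0], lambda s, a: [1.0, 1.0])
    print(exact, off)  # the inaccurate mean prediction has the larger loss
```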

3.2.2. REGULARIZATION PHASE

If we directly employ the trained model from the warm-up phase as a surrogate for the actual environment and optimize a policy, the learned policy will most likely perform poorly in the real MDP (Levine et al., 2020). This happens because, while training the policy, the RL algorithm will also query the model at OOD data points, meaning that it will rely on potentially inaccurate predictions of the model, which might lead to inferior policies due to over- or under-estimation of the values. This is not an inherent issue in standard model-based RL, as the model can be improved over time by collecting more transitions, whereas in offline settings the problem remains due to the lack of additional interactions with the real environment. Therefore, we aim to find an efficient way to either prevent the policies from exploiting unreliable predictions, or prevent the unreliable predictions from affecting the values of state-action pairs. To address this problem, we propose using a regularized objective function and a second phase for training the model. Once we have a preliminary empirical model, we expand the loss function to include a second objective that maximizes the entropy of model predictions over the OOD domains. Since the model is characterized by a Gaussian distribution, its entropy is a monotonically increasing function of $\det \Sigma_\phi(s, a) = \prod_i \Sigma_{\phi,i}(s, a)$ (Equation 1). Hence, higher entropy, which translates to higher uncertainty, results in a higher value of $\det \Sigma_\phi(s, a)$, and vice versa. As a result, the maximized entropy, which can be attained by maximizing $\det \Sigma_\phi(s, a)$, provides an upper-bound estimate of the uncertainty of model predictions.
Let $\mu_\theta(s, a)$ and $\Sigma_\phi(s, a)$ be the mean and covariance of the Gaussian distribution over the next state and reward, i.e., $P(s', r \mid s, a) = \mathcal{N}\left(\mu_\theta(s, a), \Sigma_\phi(s, a)\right)$, and let $B_{\pi_e}$ be a sample batch of rollouts generated from the trained dynamics $\mu_\theta(s, a)$ using an exploration policy $\pi_e(a|s)$, i.e., $(s', r) = \mu_\theta(s, a)$ with $a \sim \pi_e(a|s)$. The second loss $L_2$ is then defined as

$$L_2(\phi; B_{\pi_e}) = \frac{1}{|B_{\pi_e}|} \sum_{(s, a, r, s') \in B_{\pi_e}} \sum_{i=1}^{d+1} -\frac{1}{2} \log \Sigma_{\phi,i}(s, a), \tag{3}$$

which together with $L_1$ introduced in Equation 2 yields the hybrid loss

$$L(\theta, \phi; B_D, B_{\pi_e}) = L_1(\theta, \phi; B_D) + \alpha L_2(\phi; B_{\pi_e}), \tag{4}$$

where $\alpha$ is a fixed constant that controls the effect of regularization on the NLL loss.

Algorithm 2
    [...]
 2: for $K_1$ iterations do ▷ Warm-up phase
 3:     Sample a batch of transitions $B_D$ from the offline dataset $D$.
 4:     Compute $L_1(\theta, \phi; B_D)$ (Equation 2).
 5:     Compute gradients and update $\theta$ and $\phi$.
 6: end for
 7: for $K_2$ iterations do ▷ Regularization phase
 8:     Sample a batch of transitions $B_D$ from the offline dataset $D$.
 9:     Compute $L_1(\theta, \phi; B_D)$ (Equation 2).
10:     Generate a batch of transitions $B_{\pi_e}$ using model rollouts (Algorithm 1).
11:     Compute $L_2(\phi; B_{\pi_e})$ (Equation 3).
12:     Compute $L(\theta, \phi; B_D, B_{\pi_e}) = L_1(\theta, \phi; B_D) + \alpha L_2(\phi; B_{\pi_e})$.
13:     Compute gradients and update $\theta$ and $\phi$.
14: end for

In Algorithm 1, which generates rollouts under the exploration policy $\pi_e(a|s)$, $\mu_\theta(s, a)$ models the transition dynamics and $\Sigma_\phi(s, a)$ is used as the uncertainty estimator. We leverage an exploratory policy together with the dynamics model, which is initially trained in the warm-up phase, to generate rollouts in addition to the samples from the offline data; both are employed in the regularized optimization problem. Note that the dynamics model is gradually updated as training progresses. Accordingly, Algorithm 2 outlines the overall learning procedure of EMO, in which batches of data from both the offline dataset and the rollouts generated from the model are used to minimize the hybrid loss in Equation 4. In the regularization phase, the model is trained to maximize the likelihood of the offline data, collected by a (possibly unknown) behavior policy, while maximizing the entropy of predictions over the distribution induced by an exploratory policy. Hence, $L_1$ ensures that $\mu_\theta$ and $\Sigma_\phi$ maintain their accuracy in the support of the offline data, while $L_2$ aims to increase $\Sigma_\phi$ as we leave the support of the offline dataset (over the distribution induced by the exploration policy $\pi_e$). As a result, $\Sigma_\phi(s, a)$ can be used as an upper bound indicating how reliable/accurate the trained model is for a certain pair of $s$ and $a$, as long as the distribution of rollouts generated by the exploration policy covers this particular state-action pair. In other words, as long as the distribution of rollouts generated from $\pi_e$ covers the potential distributions of other exploratory policies (which will be used later during policy optimization in Section 3.3), $\Sigma_\phi(s, a)$ can be leveraged as the error estimator to relatively penalize unreliable predictions.
Furthermore, by choosing a small value for $\alpha$, we ensure that the effect of $L_2$ is negligible compared to $L_1$ wherever the distribution of the offline data $D$ overlaps with the distribution of rollouts under the exploratory policy $\pi_e$. In this way, the performance of the model is practically unaffected in the support of the offline data and possibly other generalizable neighborhoods. Note that $L_2$ remains effective in the OOD areas, since $L_1$ has no terms in those regions.
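The interplay between the two terms of the hybrid loss (Equations 3 and 4) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the authors' training code: `var_fn` is a hypothetical stand-in for the variance head $\Sigma_\phi$, the rollout batch is a flat list of transitions, and the value of $L_1$ is passed in as a number.

```python
import math


def entropy_regularizer(rollout_batch, var_fn):
    """L2: mean over rollout transitions of sum_i -0.5 * log(var_i).

    Minimizing this term pushes the predicted variances (and hence the
    entropy) up on the state-action pairs visited by the rollouts.
    """
    total = 0.0
    for s, a, _r, _s_next in rollout_batch:
        total += sum(-0.5 * math.log(v) for v in var_fn(s, a))
    return total / len(rollout_batch)


def hybrid_loss(l1_value, l2_value, alpha):
    """L = L1 + alpha * L2, with alpha weighting the regularizer."""
    return l1_value + alpha * l2_value


if __name__ == "__main__":
    rollouts = [(None, None, 0.0, None)]
    low_var = entropy_regularizer(rollouts, lambda s, a: [1.0, 1.0])
    high_var = entropy_regularizer(rollouts, lambda s, a: [10.0, 10.0])
    # Higher variance on rollout points gives a lower (better) L2.
    print(low_var, high_var)
    print(hybrid_loss(l1_value=2.0, l2_value=high_var, alpha=0.01))
```

With a small `alpha`, the second call shows that the regularizer shifts the total loss only slightly relative to the NLL term, matching the discussion of overlapping regions above.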

3.3. POLICY LEARNING

Once the model is trained using EMO, we utilize its components to define a pessimistic MDP, which can be coupled with any policy optimization technique to obtain the output policy $\hat{\pi}^*$. The overall learning framework, which also adheres to prior work (Yu et al., 2020; Kidambi et al., 2020), is summarized in Algorithm 3. In this algorithm, the offline dataset $D$ is first used to train a dynamics model $\mu_\theta$ along with an admissible error/uncertainty estimator $\Sigma_\phi$. Next, a new pessimistic MDP is defined as $\widetilde{M} = (S, A, T_\mu, \tilde{r}, \gamma, \rho_0)$, with $T_\mu = \{\mu_{\theta,i}(s, a)\}_{i=1}^{d}$ and $\tilde{r}(s, a) = r(s, a) - \lambda u(s, a)$, where $u(s, a) = \mathrm{tr}(\Sigma_\phi(s, a))$ (see Section 3.2). Lastly, the pessimistic MDP $\widetilde{M}$ is leveraged as a surrogate model to train an RL algorithm and obtain $\hat{\pi}^*$.
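A single transition of the pessimistic MDP described above can be sketched as follows. The names `pessimistic_step`, `mean_fn`, and `var_fn` are hypothetical and introduced only for illustration; the sketch assumes the mean head returns the $d$ next-state components followed by the reward, and that the variance head returns the diagonal of $\Sigma_\phi$.

```python
def pessimistic_step(s, a, mean_fn, var_fn, lam):
    """One step of the pessimistic MDP.

    The next state comes from the first d entries of the predicted mean and
    the reward is penalized by lam * tr(Sigma).
    """
    mu = mean_fn(s, a)   # length d + 1: next state, then reward
    var = var_fn(s, a)   # diagonal entries of Sigma
    next_state, reward = mu[:-1], mu[-1]
    penalty = lam * sum(var)  # the trace of a diagonal matrix is the sum of its entries
    return next_state, reward - penalty


if __name__ == "__main__":
    next_state, penalized_reward = pessimistic_step(
        s=[0.0, 0.0],
        a=[0.0],
        mean_fn=lambda s, a: [1.0, 2.0, 3.0],
        var_fn=lambda s, a: [0.25, 0.25, 0.5],
        lam=2.0,
    )
    print(next_state, penalized_reward)
```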

3.4. THEORETICAL GROUNDS OF EMO

We argue that EMO is an extension of MOPO (Yu et al., 2020) that addresses MOPO's limitations regarding learning an ensemble of models for uncertainty estimation. Accordingly, we extend the theoretical grounds of MOPO to EMO, where we guarantee conservative policy evaluation and safe policy improvement of EMO, regardless of the stochasticity of the environment. For a detailed discussion of the theoretical analysis, refer to Appendix A.1.

4. EMPIRICAL STUDY

In this section, we evaluate the performance of our approach on D4RL benchmark datasets for MuJoCo environments (Fu et al., 2020). We include data from three different environments (halfcheetah, hopper, and walker2d) and four different types from each (random, medium, medium-replay, and medium-expert), which results in twelve different datasets. Throughout the experiments, we aim to investigate the following questions: (1) How does EMO perform compared to SOTA methods on standard offline RL benchmarks? (2) What impact do entropy regularization and reward penalty have on the performance of the trained policies? (3) What is the effect of the exploration policy $\pi_e$ on the generalization ability of the trained models and the performance of the resulting policies? (4) What is the effect of the regularization coefficient $\alpha$ on the performance of EMO?

4.1. EXPERIMENTAL SETUPS

Following the setup in Yu et al. (2020), we characterize the model as a 4-layer feed-forward neural network across all domains, with 200 hidden units in each layer. The output of the last hidden layer is fed into a two-head network architecture to generate $\mu_\theta(s, a)$ and $\Sigma_\phi(s, a)$, where $\mu_\theta$ and $\Sigma_\phi$ are two outputs of a single neural network, i.e., they share the same network except for their output layers. Instead of directly estimating the reward function, the model predicts the center-of-mass velocity, and the reward is calculated afterwards based on its formulation in each domain. For all the experiments, a soft actor-critic (SAC) agent (Haarnoja et al., 2018) is used as the reinforcement learning agent for policy optimization on the trained pessimistic MDP $\widetilde{M}$. At each time step, a batch of k-step rollouts is generated and added to the replay memory of the SAC agent, where the actions for generating the rollouts are taken based on the current policy of the agent, while transitions and rewards are produced by $\widetilde{M}$.
In the general framework, the agent then optimizes its policy based on samples from both its replay memory and the offline data $D$. In our implementation, however, the agent uses only its own replay memory for policy optimization, meaning that samples from the offline data are not directly utilized for policy optimization (see Algorithm 5 in Appendix A.5). Finally, the resulting policies from SAC are evaluated in the real MuJoCo environment for testing.
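The rollout generation used to fill the agent's replay memory can be sketched as below, loosely following the structure of Algorithm 1. This is an illustrative sketch under stated assumptions: `step_fn` is a hypothetical callable standing in for one step of the learned pessimistic model (returning the next state and the, possibly penalized, reward), `policy_fn` stands in for the current policy, and the rollouts are returned as a flat list of transitions rather than run in parallel.

```python
import random


def generate_rollout_transitions(dataset_states, step_fn, policy_fn,
                                 batch_size, horizon, rng=None):
    """Generate batch_size rollouts of length horizon from a learned model.

    Each rollout starts from a state sampled from the offline data, and all
    transitions (s, a, r, s_next) are collected in one flat list.
    """
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    transitions = []
    for _ in range(batch_size):
        s = rng.choice(dataset_states)  # initialize the rollout from D
        for _ in range(horizon):
            a = policy_fn(s, rng)
            s_next, r = step_fn(s, a)   # model prediction, not the real env
            transitions.append((s, a, r, s_next))
            s = s_next
    return transitions


if __name__ == "__main__":
    # Toy stand-ins: a 1-D integrator model and a constant policy.
    toy_step = lambda s, a: ([s[0] + a], 1.0)
    toy_policy = lambda s, rng: 1.0
    replay = generate_rollout_transitions([[0.0]], toy_step, toy_policy,
                                          batch_size=2, horizon=3)
    print(len(replay), replay[2])
```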

4.2. OVERALL PERFORMANCE

We compare the performance of EMO to SOTA model-based offline algorithms, i.e., MOPO (Yu et al., 2020), MOReL (Kidambi et al., 2020), COMBO (Yu et al., 2021), RAMBO-RL (Rigter et al., 2022), and GELATO (Tennenholtz et al., 2021), as well as three model-free counterparts, i.e., UWAC (Wu et al., 2021), CQL (Kumar et al., 2020), and ATAC (Cheng et al., 2022). The performance results of both model-based and model-free techniques are summarized in Table 1. All the presented scores are normalized according to the procedure proposed in Fu et al. (2020). For EMO, the results are the performance of the policy at the last iteration of training, averaged over 3 random seeds ± standard error. The results for CQL are taken from the D4RL benchmark white paper (Fu et al., 2020). The results for the other methods are taken from their respective papers. The results in Table 1 demonstrate that EMO outperforms MOPO and GELATO in almost all cases, which shows that EMO, as an extension of MOPO, clearly improves upon its predecessor by replacing ensemble uncertainty quantification with entropy regularization. Furthermore, EMO achieves competitive results compared to COMBO and MOReL, outperforming both on 5 out of 12 datasets, which places EMO among the highest-performing model-based algorithms. Although EMO can achieve results comparable to RAMBO-RL in certain scenarios, its performance falls short in some cases, and it outperforms RAMBO-RL on only one dataset. This can be attributed to the fact that RAMBO-RL is a task-specific method, which tunes the model to be pessimistic with respect to the current policy of the agent, while EMO utilizes a general-purpose, task-agnostic model of the environment for policy optimization. In addition, bear in mind that EMO achieves this level of performance using only a single model of the environment, while an ensemble of models is utilized in other model-based methods.
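The reporting convention described above (normalized scores, averaged over seeds with a standard error) can be sketched in Python as follows. The normalization follows the D4RL convention, in which 0 corresponds to a random policy and 100 to an expert policy; the reference returns and per-seed values used in the example are hypothetical placeholders, not numbers from this paper, and the use of the sample standard deviation (with $n - 1$) for the standard error is an assumption.

```python
import math


def normalized_score(raw_return, random_return, expert_return):
    """D4RL-style normalization: 0 ~ random policy, 100 ~ expert policy."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


def mean_and_standard_error(values):
    """Mean and standard error of the mean over a list of per-seed scores."""
    n = len(values)
    mean = sum(values) / n
    # Sample variance (n - 1 in the denominator), then SE = sqrt(var / n).
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var / n)


if __name__ == "__main__":
    # Hypothetical raw returns for 3 seeds and hypothetical reference returns.
    raw_returns = [4000.0, 4200.0, 4100.0]
    scores = [normalized_score(r, random_return=0.0, expert_return=5000.0)
              for r in raw_returns]
    mean, se = mean_and_standard_error(scores)
    print(f"{mean:.1f} ± {se:.1f}")
```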
Moreover, Table 1 illustrates that EMO outperforms CQL and UWAC on 8 out of 12 datasets and performs competitively against ATAC, outperforming it on 4 out of 12 datasets, which highlights the effectiveness of our simple, efficient method in keeping up with the SOTA baselines.

Table 2: Performance results of policy optimization on different model configurations

Dataset         Environment   EMO          NLL+SAC (λ ≠ 0)   NLL+SAC (λ = 0)
medium          halfcheetah   68.5 ± 0.9   35.1 ± 8.0        0.9 ± 0.9
medium          walker2d      84.1 ± 2.5   13.0 ± 13.0       5.6 ± 5.1
medium-replay   halfcheetah   56.7 ± 5.9   4.0 ± 3.8         10.8 ± 1.6
medium-replay   walker2d      83.6 ± 1.3   18.7 ± 15.8       6.4 ± 1.2

4.3. ABLATION STUDIES

In order to address question (2), we conduct an experiment to compare the performance of EMO with two simplified versions of the algorithm: (i) training the policy on a model that is only trained via the NLL loss (i.e., the resulting model from the warm-up phase) without utilizing the reward penalty (λ = 0), denoted by NLL+SAC (λ = 0), and (ii) the same model with the reward penalty (λ ≠ 0), leveraging the covariance matrix from the warm-up phase, denoted by NLL+SAC (λ ≠ 0). The outcomes of the experiment on 4 different datasets are outlined in Table 2. The results confirm that although penalizing the reward values in OOD areas improves the performance of the model in most cases, it is still insufficient to achieve results comparable to the state of the art. The reason is that the penalties depend on the arbitrary predictions of the covariance matrix in the OOD domains. By incorporating the entropy regularization step, however, we ensure that reward penalties are proportional to the associated uncertainty of the predictions, which leads to a considerable gap between the performance of the resulting policies. This finding empirically validates the impact of entropy regularization coupled with penalizing the reward values in uncertain regions. In the second ablation study, we investigate the impact of using an informed exploration policy vs. a random policy to address question (3). One could argue that for a specific task, utilizing an informed policy (i.e., the current policy of the agent) instead of a random policy can be beneficial. In this case, the entropy regularization phase can be guided more efficiently toward the areas that are more likely to be explored by the learning agent. The coverage of the rollouts generated under an informed policy is then expected to be a small subset of the coverage of the rollouts generated from a random policy.
Consequently, we propose a modified version of EMO, summarized in Algorithm 4 in Appendix A.5, in which the (SAC) agent and the model (only the regularization phase) are trained simultaneously rather than in a modular fashion. In this variant, the model utilizes the current policy of the agent as the exploration policy for entropy regularization in order to benefit from an informed exploration scheme. We further examine this modification on two offline datasets, namely halfcheetah-medium-replay and walker2d-medium-replay, compared to RAMBO-RL. The results summarized in Table 3 indicate that informed exploration can indeed improve the performance of trained policies. Such improvements are expected, since the modified version of EMO has a more specialized, task-specific approach to model optimization than the original version. Furthermore, although EMO can achieve results comparable to RAMBO-RL and even outperform it on one dataset (see Table 1), the modified version can tighten the gap and make the results more competitive with RAMBO-RL. However, we still prioritize the original EMO, as it has a general-purpose, task-agnostic pessimistic model of the environment, which can be employed in any modular framework with any RL algorithm for training policies. This cannot be achieved using the modified version of EMO or methods such as RAMBO-RL.

Figure 2: Performance of EMO variants in terms of α on walker2d-medium-replay.

To answer question (4), we report the performance of EMO and modified EMO for different configurations of the regularization coefficient α on the walker2d-medium-replay dataset in Figure 2. The figure illustrates how α affects the performance of both methods, also compared to RAMBO-RL. As we increase α, the regularization term in Equation 4 becomes more dominant, leading to more conservative algorithms, and vice versa.
As a result, smaller values of α are preferred to achieve the best performance when the offline data has limited coverage of the state-action space or contains less informative data. Conversely, when the offline data provides better coverage, higher values of α are preferred. In addition, since the datasets from the medium-replay category are considerably smaller than other D4RL benchmark data (i.e., they have limited coverage), both EMO and modified EMO are expected to perform better when α is set to smaller values, as shown in Figure 2. However, if α is too small, the algorithms will not regularize well, which can lead to a decrease in performance. The same holds when the algorithm becomes too conservative, which is the case when α is set too large.

5. CONCLUSIONS

In this paper, we presented EMO, an entropy-regularized optimization algorithm to learn a pessimistic MDP for model-based offline RL problems. In this framework, we devised a hybrid loss function to minimize the NLL of the model on the distribution of offline data while maximizing the entropy over OOD domains. We thus optimized both objectives in a single model rather than an ensemble of models as in SOTA model-based approaches. Moreover, our empirical study on D4RL benchmark data showed that our approach competes with SOTA offline RL techniques.

A APPENDIX

A.1 THEORETICAL GROUNDS OF EMO

As discussed in Section 3.4, EMO can be considered as an extension to MOPO (Yu et al., 2020), with the aim to address the limitations of ensemble uncertainty estimation. Accordingly, in this section, we expand upon Lemma 4.1 from MOPO in order to establish the theoretical grounds for EMO.

A.1.1 PRELIMINARIES

We specify a Markov decision process (MDP) by the tuple $M = (S, A, T, r, \mu_0, \gamma)$, where $S$ and $A$ denote the state space and action space, respectively, $T(s' \mid s, a)$ is the transition dynamics, $r(s, a)$ is the reward function, $\gamma \in (0, 1)$ is the discount factor, and $\mu_0$ is the distribution of the initial state. A policy $\pi(a \mid s)$ maps each state to a distribution over actions, $\pi : S \times A \to [0, 1]$. Let $P(s_t = s \mid \mu_0, \pi, T)$ be the probability of being in state $s$ at time step $t$ when following policy $\pi$ from an initial state sampled from $\mu_0$, in an environment with transition dynamics $T$. The discounted state-action visitation distribution of policy $\pi$ under dynamics $T$ is then defined as

$$\rho^{\pi}_{T}(s, a) = (1 - \gamma)\, \pi(a \mid s) \sum_{t=0}^{\infty} \gamma^{t} P(s_t = s \mid \mu_0, \pi, T).$$

The goal is to learn a policy that maximizes the expected discounted return $\eta_M(\pi)$:

$$\max_{\pi} \eta_M(\pi) := \frac{1}{1 - \gamma}\, \mathbb{E}_{(s,a) \sim \rho^{\pi}_{T}(s,a)} [r(s, a)].$$

The value function, $V^{\pi}_{M}(s) = \mathbb{E}_{\pi, T}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \mid s_0 = s\right]$, is the value of a particular state $s$ under the policy $\pi$, defined as the expected discounted return under $\pi$ when starting from $s$. In the offline RL setting, we only have access to a static dataset of transition tuples $D = \{(s, a, r, s')\}$, collected by running a behavior policy $\pi_B$ in the real environment, and the goal is to find the best possible policy using only this dataset.

A.1.2 THEORETICAL FORMULATION

We start by expanding the theoretical formulation of EMO and proving that EMO can guarantee conservative policy evaluation and safe policy improvement. First, we quantify the relationship between the performance of a policy $\pi$ under two arbitrary MDPs. From the theoretical formulation of MOPO (Yu et al., 2020), we have:

Lemma A.1.1 (Telescoping lemma). Let $M$ and $\hat{M}$ be two MDPs with the same reward function $r$, but different dynamics $T$ and $\hat{T}$ respectively. For any arbitrary policy $\pi$, let $G^{\pi}_{\hat{M}}(s, a) := \mathbb{E}_{s' \sim \hat{T}(s,a)}[V^{\pi}_{M}(s')] - \mathbb{E}_{s' \sim T(s,a)}[V^{\pi}_{M}(s')]$. Then,

$$\eta_{\hat{M}}(\pi) - \eta_{M}(\pi) = \frac{\gamma}{1 - \gamma}\, \mathbb{E}_{(s,a) \sim \rho^{\pi}_{\hat{T}}(s,a)} \left[G^{\pi}_{\hat{M}}(s, a)\right].$$

Next, we expand on Lemma A.1.1 to include MDPs with different rewards as well.

Lemma A.1.2. Let $M$ and $\hat{M}$ be two MDPs with reward functions $r$ and $\hat{r}$, and transition dynamics $T$ and $\hat{T}$ respectively. For any arbitrary policy $\pi$, let $G^{\pi}_{\hat{M}}(s, a) := \mathbb{E}_{s' \sim \hat{T}(s,a)}[V^{\pi}_{M}(s')] - \mathbb{E}_{s' \sim T(s,a)}[V^{\pi}_{M}(s')]$, and $\hat{r}(s, a) - r(s, a) = \delta r(s, a)$. Then,

$$\eta_{\hat{M}}(\pi) - \eta_{M}(\pi) = \frac{1}{1 - \gamma}\, \mathbb{E}_{(s,a) \sim \rho^{\pi}_{\hat{T}}(s,a)} \left[\delta r(s, a) + \gamma G^{\pi}_{\hat{M}}(s, a)\right].$$

Proof. Considering $\hat{r}(s, a) = r(s, a) + \delta r(s, a)$, we have

$$\eta_{\hat{M}}(\pi) = \frac{1}{1 - \gamma}\, \mathbb{E}_{(s,a) \sim \rho^{\pi}_{\hat{T}}(s,a)} [r(s, a)] + \frac{1}{1 - \gamma}\, \mathbb{E}_{(s,a) \sim \rho^{\pi}_{\hat{T}}(s,a)} [\delta r(s, a)]. \tag{7}$$

The first term in the RHS of Equation 7 corresponds to the return of the policy under dynamics $\hat{T}$ and reward $r$. According to Lemma A.1.1, we have

$$\frac{1}{1 - \gamma}\, \mathbb{E}_{(s,a) \sim \rho^{\pi}_{\hat{T}}(s,a)} [r(s, a)] = \eta_{M}(\pi) + \frac{\gamma}{1 - \gamma}\, \mathbb{E}_{(s,a) \sim \rho^{\pi}_{\hat{T}}(s,a)} \left[G^{\pi}_{\hat{M}}(s, a)\right]. \tag{8}$$

Now, by substituting Equation 8 into Equation 7, we get

$$\begin{aligned} \eta_{\hat{M}}(\pi) - \eta_{M}(\pi) &= \frac{\gamma}{1 - \gamma}\, \mathbb{E}_{(s,a) \sim \rho^{\pi}_{\hat{T}}(s,a)} \left[G^{\pi}_{\hat{M}}(s, a)\right] + \frac{1}{1 - \gamma}\, \mathbb{E}_{(s,a) \sim \rho^{\pi}_{\hat{T}}(s,a)} [\delta r(s, a)] \\ &= \frac{1}{1 - \gamma}\, \mathbb{E}_{(s,a) \sim \rho^{\pi}_{\hat{T}}(s,a)} \left[\delta r(s, a) + \gamma G^{\pi}_{\hat{M}}(s, a)\right]. \end{aligned} \tag{9}$$

Lemma A.1.2 quantifies the difference between the return of a policy under two arbitrary MDPs, as long as they share the same $(S, A, \mu_0, \gamma)$. For the specific case of EMO, we can consider $M$ as the actual MDP, and $\hat{M}$ as the MDP learned on the offline data with the EMO algorithm, before applying the reward penalty.
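Lemma A.1.2 can be checked numerically on a small tabular example. The sketch below is only an illustration under stated assumptions: the two-state, two-action MDPs, the policy, and the helper names (`policy_value`, `visitation`, `expected_return`) are all made up for this check and do not come from the paper's code. Values and visitation distributions are computed by plain fixed-point iteration.

```python
def policy_value(T, r, pi, gamma, iters=3000):
    """Iterative policy evaluation for V^pi under dynamics T and reward r."""
    n, m = len(T), len(T[0])
    V = [0.0] * n
    for _ in range(iters):
        V = [sum(pi[s][a] * (r[s][a] + gamma * sum(T[s][a][k] * V[k] for k in range(n)))
                 for a in range(m)) for s in range(n)]
    return V

def visitation(T, pi, mu0, gamma, iters=3000):
    """rho(s, a) = (1 - gamma) * pi(a|s) * sum_t gamma^t P(s_t = s)."""
    n, m = len(T), len(T[0])
    p, d, w = list(mu0), [0.0] * n, 1.0
    for _ in range(iters):
        d = [d[s] + w * p[s] for s in range(n)]          # accumulate gamma^t P(s_t = s)
        p = [sum(p[s] * pi[s][a] * T[s][a][k] for s in range(n) for a in range(m))
             for k in range(n)]                          # one-step state distribution update
        w *= gamma
    return [[(1 - gamma) * pi[s][a] * d[s] for a in range(m)] for s in range(n)]

def expected_return(T, r, pi, mu0, gamma):
    """eta(pi) = 1/(1 - gamma) * E_{(s,a) ~ rho}[r(s, a)]."""
    rho = visitation(T, pi, mu0, gamma)
    return sum(rho[s][a] * r[s][a] for s in range(len(T)) for a in range(len(T[0]))) / (1 - gamma)

# Hypothetical toy problem (all numbers are illustrative assumptions).
gamma = 0.9
mu0 = [0.7, 0.3]
pi = [[0.6, 0.4], [0.3, 0.7]]
T = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.0, 1.0]]]       # true dynamics
T_hat = [[[0.8, 0.2], [0.3, 0.7]], [[0.4, 0.6], [0.1, 0.9]]]   # learned dynamics
r = [[1.0, 0.0], [0.5, 2.0]]                                   # true reward
r_hat = [[0.9, 0.2], [0.4, 2.1]]                               # learned reward

V = policy_value(T, r, pi, gamma)                # V^pi_M
rho_hat = visitation(T_hat, pi, mu0, gamma)      # visitation under the learned dynamics

# Left-hand side: return gap between the learned and the true MDP.
lhs = expected_return(T_hat, r_hat, pi, mu0, gamma) - expected_return(T, r, pi, mu0, gamma)
# Right-hand side: 1/(1 - gamma) * E[delta_r + gamma * G].
rhs = sum(
    rho_hat[s][a] * ((r_hat[s][a] - r[s][a])
                     + gamma * sum((T_hat[s][a][k] - T[s][a][k]) * V[k] for k in range(2)))
    for s in range(2) for a in range(2)
) / (1 - gamma)

print(lhs, rhs)
```

The two printed numbers should agree up to floating-point error, which is what the lemma states for any pair of MDPs sharing $(S, A, \mu_0, \gamma)$.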
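For now, we model the penalized reward function by $\tilde{r}(s, a) = \hat{r}(s, a) - \lambda \Delta r(s, a)$, where $\lambda \Delta r(s, a)$ is the penalty we apply to the learned reward, such that $\lambda \geq 0$. Let $\tilde{M} = (S, A, \hat{T}, \tilde{r}, \mu_0, \gamma)$ denote the MDP with transition dynamics $\hat{T}$ and reward function $\tilde{r}(s, a)$. We have

$$\eta_{\tilde{M}}(\pi) = \frac{1}{1 - \gamma}\, \mathbb{E}_{(s,a) \sim \rho^{\pi}_{\hat{T}}(s,a)} [\tilde{r}(s, a)] = \frac{1}{1 - \gamma}\, \mathbb{E}_{(s,a) \sim \rho^{\pi}_{\hat{T}}(s,a)} [\hat{r}(s, a)] - \frac{\lambda}{1 - \gamma}\, \mathbb{E}_{(s,a) \sim \rho^{\pi}_{\hat{T}}(s,a)} [\Delta r(s, a)]. \tag{10}$$

The first term in the RHS of Equation 10 corresponds to the return of the policy under dynamics $\hat{T}$ and reward $\hat{r}$, namely $\eta_{\hat{M}}(\pi)$. Using Lemma A.1.2, we can rewrite this term as

$$\eta_{\hat{M}}(\pi) = \frac{1}{1 - \gamma}\, \mathbb{E}_{(s,a) \sim \rho^{\pi}_{\hat{T}}(s,a)} [\hat{r}(s, a)] = \eta_{M}(\pi) + \frac{1}{1 - \gamma}\, \mathbb{E}_{(s,a) \sim \rho^{\pi}_{\hat{T}}(s,a)} \left[\delta r(s, a) + \gamma G^{\pi}_{\hat{M}}(s, a)\right]. \tag{11}$$

By substituting Equation 11 into Equation 10, we have

$$\eta_{\tilde{M}}(\pi) - \eta_{M}(\pi) = \frac{1}{1 - \gamma}\, \mathbb{E}_{(s,a) \sim \rho^{\pi}_{\hat{T}}(s,a)} \left[\delta r(s, a) + \gamma G^{\pi}_{\hat{M}}(s, a)\right] - \frac{\lambda}{1 - \gamma}\, \mathbb{E}_{(s,a) \sim \rho^{\pi}_{\hat{T}}(s,a)} [\Delta r(s, a)]. \tag{12}$$

In order to achieve conservative policy evaluation, we need to ensure that the performance of any policy is not overestimated under the penalized MDP $\tilde{M}$. Thus, we need to ensure that $\eta_{\tilde{M}}(\pi) - \eta_{M}(\pi) \leq 0$ for all $\pi$.

Proposition A.1. The penalized MDP $\tilde{M}$ preserves conservative policy evaluation if

$$\lambda\, \mathbb{E}_{(s,a) \sim \rho^{\pi}_{\hat{T}}(s,a)} [\Delta r(s, a)] \geq \mathbb{E}_{(s,a) \sim \rho^{\pi}_{\hat{T}}(s,a)} \left[\delta r(s, a) + \gamma G^{\pi}_{\hat{M}}(s, a)\right], \quad \forall \pi \in \Pi. \tag{13}$$

Proposition A.1 establishes the condition on the adjustment $\lambda \Delta r(s, a)$ to guarantee conservative policy evaluation. Once again, note that in order to preserve conservative policy evaluation in the penalized MDP $\tilde{M}$, the condition stated in Inequality 13 should hold for any arbitrary policy $\pi$. A direct implication of Proposition A.1 is that training any policy in $\tilde{M}$ is equivalent to optimizing a lower bound on the return under the real MDP $M$.

Practical Implication. If we can ensure $\mathbb{E}_{(s,a) \sim \rho^{\pi}_{\hat{T}}(s,a)} [\Delta r(s, a)] \geq 0$, $\forall \pi \in \Pi$, we can satisfy the condition on conservative policy evaluation by choosing $\lambda$ large enough, assuming that $\delta r(s, a)$ and $G^{\pi}_{\hat{M}}(s, a)$ are bounded.

In addition to conservative policy evaluation, we want to guarantee safe policy improvement over the behavior policy $\pi_B$ as well.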
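To make the practical implication of Proposition A.1 concrete, the following sketch computes the smallest penalty coefficient that satisfies the conservative-evaluation condition over a finite set of candidate policies. It is a worked arithmetic illustration only: the per-policy expectations are hypothetical numbers, the function names are invented here, and a finite policy set with strictly positive expected penalties is assumed.

```python
def return_gap(model_error, penalty, lam, gamma):
    """Return gap between the penalized and the true MDP for one policy:
    (E[delta_r + gamma * G] - lam * E[Delta_r]) / (1 - gamma)."""
    return (model_error - lam * penalty) / (1 - gamma)

def smallest_conservative_lambda(model_errors, penalties):
    """Smallest lam >= 0 with lam * E[Delta_r] >= E[delta_r + gamma * G] for every policy.
    Assumes every expected penalty is strictly positive."""
    return max(0.0, max(e / p for e, p in zip(model_errors, penalties)))

# Hypothetical expectations for three candidate policies (illustrative values only).
model_errors = [0.3, -0.1, 0.5]   # E[delta_r + gamma * G] under each policy
penalties = [0.2, 0.4, 0.25]      # E[Delta_r] under each policy

lam = smallest_conservative_lambda(model_errors, penalties)
gaps = [return_gap(e, p, lam, gamma=0.9) for e, p in zip(model_errors, penalties)]
print(lam, gaps)
```

With this choice of λ every gap is non-positive, so no policy in the candidate set is overestimated under the penalized MDP; any larger λ keeps the condition satisfied at the cost of a looser lower bound.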
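Let $\tilde{\pi}$ be the optimal policy trained on the penalized MDP $\tilde{M}$. We first quantify the performance difference between $\pi_B$ and $\tilde{\pi}$ under the actual MDP $M$:

$$\begin{aligned} \eta_{M}(\tilde{\pi}) - \eta_{M}(\pi_B) \overset{(12)}{=}{} & \eta_{\tilde{M}}(\tilde{\pi}) - \frac{1}{1 - \gamma}\, \mathbb{E}_{(s,a) \sim \rho^{\tilde{\pi}}_{\hat{T}}(s,a)} \left[\delta r(s, a) + \gamma G^{\tilde{\pi}}_{\hat{M}}(s, a)\right] + \frac{\lambda}{1 - \gamma}\, \mathbb{E}_{(s,a) \sim \rho^{\tilde{\pi}}_{\hat{T}}(s,a)} [\Delta r(s, a)] \\ & - \eta_{\tilde{M}}(\pi_B) + \frac{1}{1 - \gamma}\, \mathbb{E}_{(s,a) \sim \rho^{\pi_B}_{\hat{T}}(s,a)} \left[\delta r(s, a) + \gamma G^{\pi_B}_{\hat{M}}(s, a)\right] - \frac{\lambda}{1 - \gamma}\, \mathbb{E}_{(s,a) \sim \rho^{\pi_B}_{\hat{T}}(s,a)} [\Delta r(s, a)] \\ ={} & \eta_{\tilde{M}}(\tilde{\pi}) - \eta_{\tilde{M}}(\pi_B) - \frac{1}{1 - \gamma}\, \mathbb{E}_{(s,a) \sim \rho^{\tilde{\pi}}_{\hat{T}}(s,a)} \left[\delta r(s, a) + \gamma G^{\tilde{\pi}}_{\hat{M}}(s, a)\right] + \frac{1}{1 - \gamma}\, \mathbb{E}_{(s,a) \sim \rho^{\pi_B}_{\hat{T}}(s,a)} \left[\delta r(s, a) + \gamma G^{\pi_B}_{\hat{M}}(s, a)\right] \\ & + \frac{\lambda}{1 - \gamma} \left(\mathbb{E}_{(s,a) \sim \rho^{\tilde{\pi}}_{\hat{T}}(s,a)} [\Delta r(s, a)] - \mathbb{E}_{(s,a) \sim \rho^{\pi_B}_{\hat{T}}(s,a)} [\Delta r(s, a)]\right). \end{aligned}$$

Note that since $\tilde{\pi}$ is the optimal policy under the penalized MDP $\tilde{M}$, we have $\eta_{\tilde{M}}(\tilde{\pi}) - \eta_{\tilde{M}}(\pi_B) = C_{\tilde{M}}(\pi_B) \geq 0$.

Proposition A.2. Let $\tilde{\pi}(a \mid s)$ be the optimal policy trained on the penalized MDP $\tilde{M}$. Then, $\tilde{\pi}$ is a safe policy improvement over $\pi_B$, i.e. $\eta_{M}(\tilde{\pi}) \geq \eta_{M}(\pi_B)$, if

$$\lambda \left(\mathbb{E}_{(s,a) \sim \rho^{\tilde{\pi}}_{\hat{T}}(s,a)} [\Delta r(s, a)] - \mathbb{E}_{(s,a) \sim \rho^{\pi_B}_{\hat{T}}(s,a)} [\Delta r(s, a)]\right) \geq \mathbb{E}_{(s,a) \sim \rho^{\tilde{\pi}}_{\hat{T}}(s,a)} \left[\delta r(s, a) + \gamma G^{\tilde{\pi}}_{\hat{M}}(s, a)\right] - \mathbb{E}_{(s,a) \sim \rho^{\pi_B}_{\hat{T}}(s,a)} \left[\delta r(s, a) + \gamma G^{\pi_B}_{\hat{M}}(s, a)\right] - (1 - \gamma)\, C_{\tilde{M}}(\pi_B), \tag{14}$$

which satisfies

$$\begin{aligned} \eta_{M}(\tilde{\pi}) - \eta_{M}(\pi_B) ={} & C_{\tilde{M}}(\pi_B) - \frac{1}{1 - \gamma}\, \mathbb{E}_{(s,a) \sim \rho^{\tilde{\pi}}_{\hat{T}}(s,a)} \left[\delta r(s, a) + \gamma G^{\tilde{\pi}}_{\hat{M}}(s, a)\right] + \frac{1}{1 - \gamma}\, \mathbb{E}_{(s,a) \sim \rho^{\pi_B}_{\hat{T}}(s,a)} \left[\delta r(s, a) + \gamma G^{\pi_B}_{\hat{M}}(s, a)\right] \\ & + \frac{\lambda}{1 - \gamma} \left(\mathbb{E}_{(s,a) \sim \rho^{\tilde{\pi}}_{\hat{T}}(s,a)} [\Delta r(s, a)] - \mathbb{E}_{(s,a) \sim \rho^{\pi_B}_{\hat{T}}(s,a)} [\Delta r(s, a)]\right). \end{aligned} \tag{15}$$

Conservative policy evaluation of EMO. In EMO, we define the penalty term as a positive, increasing function of the entropy, $\Delta r(s, a) = u(s, a) = \det(\Sigma_{\phi}(s, a))$. Thus, we have $\Delta r(s, a) \geq 0$ for all $(s, a) \in S \times A$, and it follows that $\mathbb{E}_{(s,a) \sim \rho^{\pi}_{\hat{T}}(s,a)} [\Delta r(s, a)] \geq 0$ for all $\pi \in \Pi$; as a result, according to the practical implications of Proposition A.1, conservative policy evaluation can be achieved by choosing $\lambda$ large enough.

Safe policy improvement of EMO. Although we cannot theoretically guarantee safe policy improvement for $\Delta r(s, a) = \det(\Sigma_{\phi}(s, a))$ (which is also the case for MOPO (Yu et al., 2020), MOReL (Kidambi et al., 2020), and COMBO (Yu et al., 2021)), it is reasonable to assume that in practice, regardless of the stochasticity of the environment, $\Delta r(s, a)$ is smaller over $\rho^{\pi_B}_{\hat{T}}(s, a)$ than over $\rho^{\pi}_{\hat{T}}(s, a)$ for any arbitrary $\pi$. Note that throughout the regularization phase of EMO, we specifically try to maximize the entropy over the OOD domains, and as a result, we expect lower entropy over the support of data, which we can assume has a distribution very close to $\rho^{\pi_B}_{\hat{T}}(s, a)$. By defining the penalty term as a positive, increasing function of the entropy, we can practically assume that $\mathbb{E}_{(s,a) \sim \rho^{\pi}_{\hat{T}}(s,a)} [\Delta r(s, a)] \geq \mathbb{E}_{(s,a) \sim \rho^{\pi_B}_{\hat{T}}(s,a)} [\Delta r(s, a)]$ for all $\pi \in \Pi$, and guarantee safe policy improvement according to the practical implications of Proposition A.2, by choosing $\lambda$ large enough.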
Thus, EMO in its original formulation can be applied to any environment, whether deterministic or stochastic, and guarantee conservative policy evaluation and safe policy improvement.

A.2 DIFFERENTIATION BETWEEN ALEATORIC AND EPISTEMIC UNCERTAINTY AND APPLICABILITY TO STOCHASTIC ENVIRONMENTS

Our model does not explicitly differentiate between aleatoric and epistemic uncertainties over OOD domains, and penalizes the rewards based on an upper bound over total uncertainty, instead of only epistemic uncertainty. However, we argue that this is a good approach in practice, even in the case of stochastic environments. In general, uncertainty, be it aleatoric or epistemic, is a source of error in RL algorithms. If we have a reliable and accurate estimation of either of these uncertainties, we can calculate their effect in our evaluations and algorithms. Otherwise, we should find ways to indirectly account for these sources of error, in order to prevent our methods from exploiting potential errors caused by them (e.g. by forming performance lower bounds with reward penalties). As with any other measure, we argue that aleatoric uncertainty cannot be reliably quantified over OOD domains either, especially when we only have a single model. Even if we have an ensemble of models, where we can estimate the aleatoric uncertainty by averaging over the variances of each model (Depeweg et al., 2018), the estimation would be inaccurate, since in practice the sample size from the posterior distribution of parameters (the number of models in the ensemble) is small and the samples are potentially correlated, for the reasons discussed in the paper. Thus, aleatoric uncertainty over OOD domains becomes an unmeasurable source of error itself. As a result, it is reasonable to penalize a measure of total uncertainty over OOD domains, rather than only penalizing epistemic uncertainty, as we cannot confidently rely on the estimated aleatoric uncertainty over such domains.
As for the in-distribution aleatoric uncertainty, we will show that our method preserves its reliable estimation. For this, we discuss the characteristics of a model trained with EMO over the support of data as well as over OOD domains. We first restate the formulation of the proposed hybrid loss in Equation 4:

$$L(\theta, \phi; B_D, B_{\pi_e}) = L_1(\theta, \phi; B_D) + \alpha L_2(\phi; B_{\pi_e}),$$

where $L_1$ corresponds to the NLL loss, $L_2$ corresponds to the entropy regularization term, and $\alpha$ is the regularization coefficient. As discussed in Section 3.2.2 of the paper, we assume $\alpha$ has a small value.

(i) Since we assume that $\alpha$ is small, the hybrid loss will be dominated by the NLL loss ($L_1$) over the support of offline data. As a result, $\Sigma_{\phi}(s, a)$ will actually correspond to the aleatoric uncertainty over those regions (which, in the case of deterministic environments, is close to zero). Note that this estimation is reliable since it is learned over the support of offline data.

(ii) Over OOD domains, the NLL loss ($L_1$) does not exist (as we do not have any supervised data over OOD domains), and the hybrid loss will be dominated by the entropy regularization term. Thus, $\Sigma_{\phi}(s, a)$ will correspond to an upper bound of total uncertainty/error over OOD domains.

As a result, the estimated aleatoric uncertainty over the in-distribution data is preserved by $\Sigma_{\phi}(s, a)$, as discussed in (i). Still, we would like to mention that EMO in its original form proposed in this paper theoretically guarantees conservative policy evaluation and safe policy improvement, regardless of the stochasticity of the environment (please refer to our theoretical analysis in Appendix A.1).
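The structure of the hybrid loss in Equation 4 can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a diagonal Gaussian model, takes $L_1$ to be the average Gaussian NLL over data transitions, and takes $L_2$ to be the negative average Gaussian entropy over rollout transitions (so that minimizing the loss raises entropy on OOD samples). The exact forms of Equations 2 and 3 are not reproduced in this section, so the sign convention and all function names here are assumptions.

```python
import math

def gaussian_nll(y, mu, var):
    """Negative log-likelihood of y under a diagonal Gaussian N(mu, diag(var))."""
    return 0.5 * sum(math.log(2 * math.pi * v) + (yi - m) ** 2 / v
                     for yi, m, v in zip(y, mu, var))

def gaussian_entropy(var):
    """Differential entropy of a diagonal Gaussian with variances var."""
    return 0.5 * sum(math.log(2 * math.pi * math.e * v) for v in var)

def hybrid_loss(data_batch, rollout_vars, alpha):
    """L = L1 + alpha * L2 (assumed forms).
    data_batch: list of (target, predicted_mean, predicted_variance) on offline data.
    rollout_vars: list of predicted variances on model-generated (likely OOD) samples."""
    l1 = sum(gaussian_nll(y, mu, var) for y, mu, var in data_batch) / len(data_batch)
    l2 = -sum(gaussian_entropy(var) for var in rollout_vars) / len(rollout_vars)
    return l1 + alpha * l2

# Hypothetical one-dimensional predictions.
data_batch = [([1.0], [1.0], [1.0])]
low_entropy = hybrid_loss(data_batch, [[1.0]], alpha=0.1)
high_entropy = hybrid_loss(data_batch, [[4.0]], alpha=0.1)
print(low_entropy, high_entropy)
```

Under these assumptions, a larger predicted variance on rollout samples lowers the loss, while the NLL term alone determines the fit on offline data, which matches points (i) and (ii) above.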

A.3 PRACTICAL EFFECTIVENESS OF ENTROPY MAXIMIZATION

We present in Table 5 the average error (penalty) predicted by EMO trained on 3 different datasets over: (1) samples drawn from the offline dataset D; (2) generated rollouts of horizon H = 2 using a random exploration policy; and (3) generated rollouts of horizon H = 5 using a random exploration policy. Note that the predicted penalty for each transition directly corresponds to the upper bound of total uncertainty attributed to the transition by EMO. The difference between the average predicted errors shows that our method of entropy maximization is indeed effective and trustworthy, as there is a distinguishable difference between the uncertainty estimates over in-distribution data (samples from D) and over datasets that are dominated by OOD samples (generated rollouts). The difference between average predicted errors over rollouts of different horizons, however, depends on the value of the hyperparameter α. When α is large, the algorithm is more conservative, so we expect a small margin between the average errors for rollouts of different horizons, as is the case for walker2d-medium and walker2d-medium-expert; when α is comparatively smaller, we expect a larger margin between the average errors over rollouts of varying horizons, as is the case for walker2d-medium-replay. The difference between the relative scale of predicted errors over different datasets comes from the upper bound on the predicted entropy. We view the optimal relative scale of OOD uncertainty against in-distribution uncertainty as a function of environment and dataset characteristics (e.g. coverage, optimality). As a result, in our practical implementation of EMO, we set an upper bound on the uncertainty prediction of the model, i.e. $\Sigma_{\phi}(s, a) \leq \Sigma_{\max}$, $\forall (s, a) \in S \times A$, which indirectly controls the scale of OOD uncertainty against the in-distribution uncertainty.
For each environment and dataset configuration, we treat $\Sigma_{\max}$ as a hyperparameter and optimize it along with the other hyperparameters of EMO. Although our observations are neither conclusive nor generalizable, we have observed that environment-dataset configurations which allow models to generalize better (e.g. stable environments such as halfcheetah, and datasets that provide broad coverage such as medium datasets) have smaller optimal values for $\Sigma_{\max}$. In addition, we have observed that the optimality of the dataset can affect the optimal value of $\Sigma_{\max}$: datasets with more optimal transitions tend to have larger optimal values for $\Sigma_{\max}$. On top of that, we conduct another experiment to compare the average predicted penalties of EMO and MOPO associated with OOD samples. In this experiment, we train the models of EMO and MOPO on the walker2d-medium-replay dataset, and calculate the average predicted penalty of each model on rollouts generated under the learned models using a random exploration policy. For rollout horizon H = 2, the average predicted penalty was 7.198 for EMO compared to 2.818 for MOPO; for rollout horizon H = 5, it was 8.544 for EMO against 3.011 for MOPO. This shows that EMO takes a more conservative approach by penalizing an upper bound of the uncertainty rather than a (potentially inaccurate) estimate of it.
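One way to enforce an upper bound such as $\Sigma_{\phi}(s, a) \leq \Sigma_{\max}$ on a per-dimension variance is a smooth cap in log-variance space. The sketch below is an assumption about how such a bound could be implemented, borrowed from common practice in probabilistic dynamics-model code; the text above does not state how the bound is enforced in EMO, and the names `bounded_variance` and `var_max` are hypothetical.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def bounded_variance(raw_log_var, var_max):
    """Smoothly cap a predicted variance at var_max.
    Works in log space, so the result is always positive and does not exceed
    var_max (up to floating-point rounding)."""
    max_log_var = math.log(var_max)
    log_var = max_log_var - softplus(max_log_var - raw_log_var)
    return math.exp(log_var)

# Small raw variances pass through almost unchanged; large ones saturate near var_max.
small = bounded_variance(math.log(0.01), var_max=2.0)
large = bounded_variance(math.log(1e6), var_max=2.0)
print(small, large)
```

A smooth cap keeps gradients non-zero near the bound, which can matter when the entropy term pushes OOD variances upward; a hard `min` would be simpler but has zero gradient once the cap is hit.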

A.5 REFERENCED ALGORITHMS

Algorithm 4 Modified Version of EMO
Require: Offline data D, batch size b, rollout horizon h, penalty coefficient λ, regularization coefficient α
1: Initialize θ and ϕ
2: Initialize policy π
3: for $K_1$ iterations do ▷ Warm-up phase
4:     Sample a batch of transitions $B_D$ from the offline dataset D.
5:     Compute $L_1(\theta, \phi; B_D)$ (Equation 2).
6:     Compute gradients and update θ and ϕ.
7: end for
8: for $K_2$ iterations do

        Sample the initial state $s_1$ of the rollout by sampling from offline data D.
        for j = 1, 2, ..., h do
6:          Sample an action based on the current policy of the agent, $a_j \sim \pi(s_j)$.
7:          Compute the next state in the pessimistic MDP, $s'_j = T_{\mu}(s_j, a_j)$.
8:          Compute the reward in the pessimistic MDP, $\tilde{r}_j = \tilde{r}(s_j, a_j)$.



https://github.com/pranz24/pytorch-soft-actor-critic



Figure 1: General scheme of EMO.

Algorithm 1 summarizes the procedure to generate a sample batch of b rollouts of length h given an offline dataset D, a trained model in the form of $N(\mu_{\theta}(s, a), \Sigma_{\phi}(s, a))$, and an exploration policy

Algorithm 2 EMO
Require: Offline data D, exploration policy $\pi_e$, batch size b, rollout horizon h, penalty coefficient λ, regularization coefficient α
1: Initialize θ and ϕ
2: for $K_1$ iterations do ▷ Warm-up phase
3:

General Framework for Model-based Offline RL
Require: Offline dataset $D = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{n}$; penalty coefficient λ.
1: Train the dynamics model $\mu_{\theta}$ and admissible uncertainty estimator $\Sigma_{\phi}$ using D. (Algorithm 2)
2: Define $u(s, a) = \det \Sigma_{\phi}(s, a)$.
3: Define empirical MDP $\tilde{M}$ with dynamics $\mu_{\theta}$ and reward $\tilde{r}(s, a) = \hat{r}(s, a) - \lambda u(s, a)$.
4: Run any RL algorithm on $\tilde{M}$ until convergence to obtain $\pi^{*} = \arg\max_{\pi} \eta_{\tilde{M}}(\pi)$.
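Steps 2 and 3 of the framework above amount to a simple reward transformation. The following sketch illustrates it under the assumption of a diagonal covariance, for which the determinant is the product of the per-dimension variances; whether the paper's covariance is diagonal is not stated here, and the function names are hypothetical.

```python
import math

def uncertainty(diag_cov):
    """u(s, a) = det(Sigma_phi(s, a)); for a diagonal covariance this is the product
    of the diagonal entries (diagonal structure is an assumption of this sketch)."""
    return math.prod(diag_cov)

def penalized_reward(r_hat, diag_cov, lam):
    """Penalized reward: learned reward minus lam times the uncertainty u(s, a)."""
    return r_hat - lam * uncertainty(diag_cov)

# Hypothetical predictions for one in-distribution and one OOD transition.
in_dist = penalized_reward(r_hat=1.0, diag_cov=[0.5, 0.2], lam=2.0)
ood = penalized_reward(r_hat=1.0, diag_cov=[3.0, 2.0], lam=2.0)
print(in_dist, ood)
```

With the same learned reward, the transition with larger predicted covariance receives a much lower penalized reward, which is how the policy is discouraged from exploiting uncertain regions of the learned model.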

Proposition A.2 establishes the condition on the penalty $\lambda \Delta r(s, a)$ to guarantee safe policy improvement over $\pi_B$ in the form of Inequality 14,

$$\lambda \left(\mathbb{E}_{(s,a) \sim \rho^{\tilde{\pi}}_{\hat{T}}(s,a)} [\Delta r(s, a)] - \mathbb{E}_{(s,a) \sim \rho^{\pi_B}_{\hat{T}}(s,a)} [\Delta r(s, a)]\right) \geq \mathbb{E}_{(s,a) \sim \rho^{\tilde{\pi}}_{\hat{T}}(s,a)} \left[\delta r(s, a) + \gamma G^{\tilde{\pi}}_{\hat{M}}(s, a)\right] - \mathbb{E}_{(s,a) \sim \rho^{\pi_B}_{\hat{T}}(s,a)} \left[\delta r(s, a) + \gamma G^{\pi_B}_{\hat{M}}(s, a)\right] - (1 - \gamma)\, C_{\tilde{M}}(\pi_B), \tag{14}$$

and quantifies the improvement over $\pi_B$ in the form of Equation 15,

$$\begin{aligned} \eta_{M}(\tilde{\pi}) - \eta_{M}(\pi_B) ={} & C_{\tilde{M}}(\pi_B) - \frac{1}{1 - \gamma}\, \mathbb{E}_{(s,a) \sim \rho^{\tilde{\pi}}_{\hat{T}}(s,a)} \left[\delta r(s, a) + \gamma G^{\tilde{\pi}}_{\hat{M}}(s, a)\right] + \frac{1}{1 - \gamma}\, \mathbb{E}_{(s,a) \sim \rho^{\pi_B}_{\hat{T}}(s,a)} \left[\delta r(s, a) + \gamma G^{\pi_B}_{\hat{M}}(s, a)\right] \\ & + \frac{\lambda}{1 - \gamma} \left(\mathbb{E}_{(s,a) \sim \rho^{\tilde{\pi}}_{\hat{T}}(s,a)} [\Delta r(s, a)] - \mathbb{E}_{(s,a) \sim \rho^{\pi_B}_{\hat{T}}(s,a)} [\Delta r(s, a)]\right). \end{aligned} \tag{15}$$

Practical Implication. If we can ensure $\mathbb{E}_{(s,a) \sim \rho^{\tilde{\pi}}_{\hat{T}}(s,a)} [\Delta r(s, a)] \geq \mathbb{E}_{(s,a) \sim \rho^{\pi_B}_{\hat{T}}(s,a)} [\Delta r(s, a)]$, we can satisfy the condition for safe policy improvement by choosing $\lambda$ large enough, assuming that $\delta r(s, a)$ and $G^{\pi}_{\hat{M}}(s, a)$ are bounded. Another similar yet more practical approach could be to ensure that $\mathbb{E}_{(s,a) \sim \rho^{\pi}_{\hat{T}}(s,a)} [\Delta r(s, a)] \geq \mathbb{E}_{(s,a) \sim \rho^{\pi_B}_{\hat{T}}(s,a)} [\Delta r(s, a)]$ for all $\pi \in \Pi$, which subsumes the original condition.

        of transitions $B_D$ from the offline dataset D.
11:     Compute $L_1(\theta, \phi; B_D)$ (Equation 2).
12:     Generate a batch of transitions $B_{\pi}$ using model rollouts (Algorithm 1).
13:     Compute $L_2(\phi; B_{\pi_e})$ (Equation 3).
14:     Compute $L(\theta, \phi; B_D, B_{\pi_e}) = L_1(\theta, \phi; B_D) + \alpha L_2(\phi; B_{\pi_e})$.

empirical MDP $\tilde{M}$ with dynamics $\mu_{\theta}$ and reward $\tilde{r}(s, a) = \hat{r}(s, a) - \lambda u(s, a)$, where $u(s, a) = \det \Sigma_{\phi}(s, a)$.

Policy Optimization Method for Experiments
Require: pessimistic MDP $\tilde{M} = (S, A, T_{\mu}, \tilde{r}, \gamma, \rho_0)$, rollout batch size b, rollout horizon h, offline dataset D
1: Initialize policy π and empty replay buffer $D_{model} \leftarrow \emptyset$
2: for K iterations do
3:     for 1, 2, ..., b (in parallel) do
4:

        $s_j, a_j, \tilde{r}_j, s'_j$) to $D_{model}$.
        $D \cup D_{model}$, use SAC to update π.
13: end for
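The rollout procedure described by these steps (sample a start state from the offline data, act with the current policy, step the learned model, record the penalized reward) can be sketched as below. This is a toy illustration, not the paper's code: the one-dimensional dynamics, the placeholder policy, and the names `generate_rollouts`, `model_step`, and `reward_fn` are all assumptions.

```python
import random

def generate_rollouts(start_states, policy, model_step, reward_fn, b, h, rng):
    """Collect b rollouts of length h inside a learned (pessimistic) model.
    Returns a list of (s, a, r, s_next) tuples, i.e. the contents of D_model."""
    buffer = []
    for _ in range(b):
        s = rng.choice(start_states)        # initial state drawn from the offline data
        for _ in range(h):
            a = policy(s)                   # action from the current policy
            s_next = model_step(s, a)       # next state predicted by the model
            r = reward_fn(s, a)             # reward under the penalized MDP
            buffer.append((s, a, r, s_next))
            s = s_next
    return buffer

# Hypothetical stand-ins for the offline states, policy, model, and penalized reward.
rng = random.Random(0)
start_states = [0.0, 1.0, 2.0]
policy = lambda s: 0.1 * rng.choice([-1.0, 1.0])
model_step = lambda s, a: s + a
reward_fn = lambda s, a: -abs(s)

buffer = generate_rollouts(start_states, policy, model_step, reward_fn, b=4, h=3, rng=rng)
print(len(buffer))
```

In the procedure above, the collected transitions are added to the model buffer and the policy is then updated with SAC on samples from the union of the offline data and the model buffer; that update step is omitted from this sketch.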

Performance results of offline RL algorithms on D4RL benchmark datasets.

Performance results of policy optimization for random vs. informed exploration policy

We also report the average value of $\det(\Sigma_{\phi}(s, a))$ over the samples drawn from the offline dataset D for: (1) a model trained based on EMO; and (2) an NLL model. The comparison indicates that the reliable aleatoric uncertainty is preserved in a model trained by EMO, and that it can be accounted for with minimal modifications to the original EMO algorithm, e.g. by excluding it from the penalty term.

6. REPRODUCIBILITY STATEMENT

To ensure the reproducibility of the results, the code is provided in the supplementary materials. Configurations and the choice of hyperparameters are also included in the supplementary materials, in the readme.txt file. To facilitate the understanding of the theoretical formulation and practical implementation, additional algorithms and theoretical results are included in the appendix.


A.4 COMPUTATION AND MEMORY EFFICIENCY

Although we cannot directly compare EMO against previous ensemble-based methods in terms of computational resources (their implementations differ, which can heavily affect such measures), we still provide some indirect indications of the computation and memory efficiency of EMO. EMO achieves this level of performance with about 0.34M parameters, while ensemble methods (MOPO, COMBO, RAMBO-RL) operate with around 1.1M parameters, which is an indirect indication that EMO improves upon SOTA in terms of memory and computational efficiency. In addition, we present in Figure 3 the performance progress of EMO against MOPO on the walker2d-medium-expert dataset, averaged over 3 random seeds, where the faster convergence of EMO serves as an indirect indicator of its computational efficiency compared to MOPO.

Figure 3: Performance progress of EMO against MOPO on walker2d-medium-expert dataset, averaged over 3 random seeds. The raw data is depicted in Figure 3a. In order to make the raw figure more interpretable, we apply two configurations of moving averages over the results of EMO and MOPO.
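The smoothing mentioned in the caption of Figure 3 can be illustrated with a trailing moving average. This is only a sketch: the window sizes used for the figure are not given in this section, so the window below and the function name are assumptions.

```python
def moving_average(values, window):
    """Trailing moving average; the first (window - 1) outputs average over
    however many points are available so far."""
    out, running = [], 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]   # drop the value that left the window
        out.append(running / min(i + 1, window))
    return out

# Hypothetical evaluation returns logged during training.
returns = [1.0, 2.0, 3.0, 4.0]
smoothed = moving_average(returns, window=2)
print(smoothed)
```

A wider window gives a smoother curve at the cost of lagging behind sharp changes, which is presumably why more than one configuration is shown.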

