POSTERIOR SAMPLING MODEL-BASED POLICY OPTIMIZATION UNDER APPROXIMATE INFERENCE

Anonymous

Abstract

Model-based reinforcement learning (MBRL) algorithms hold tremendous promise for improving sample efficiency in online RL. However, many popular existing MBRL algorithms cannot handle exploration and exploitation properly. Posterior sampling reinforcement learning (PSRL) is a promising approach for automatically trading off exploration and exploitation, but its theoretical guarantees only hold under exact inference. In this paper, we show that adopting the same methodology as in exact PSRL can be suboptimal under approximate inference. Motivated by this analysis, we propose an improved factorization for the posterior distribution of policies by removing the conditional independence between the policy and the data given the model. Using this posterior factorization, we further propose a general algorithmic framework for PSRL under approximate inference and a practical instantiation of it. Empirically, our algorithm surpasses baseline methods by a significant margin on both dense reward and sparse reward tasks from the DeepMind control suite, OpenAI Gym and Metaworld benchmarks.

1. INTRODUCTION

Model-based reinforcement learning has demonstrated great success in improving the sample efficiency of RL. However, many existing popular model-based algorithms (Kurutach et al., 2018; Chua et al., 2018; Janner et al., 2019) cannot handle exploration and exploitation properly, and hence may perform poorly when exploration is crucial. To trade off exploration and exploitation, most existing algorithms can be categorized into 1) optimism-based (Jaksch et al., 2010; Pacchiano et al., 2021; Curi et al., 2020); 2) posterior-sampling-based (Strens, 2000; Osband et al., 2013; 2018; Fan & Ming, 2021); and 3) information-directed sampling (Russo & Van Roy, 2014) approaches. As shown by Osband & Van Roy (2017), posterior sampling reinforcement learning (PSRL) can match the statistical efficiency (or regret bound) of optimism-based algorithms, but enjoys better computational efficiency. Information-directed sampling methods can be more statistically efficient when faced with a complex information structure (Russo & Van Roy, 2014), but require estimators of mutual information, which is difficult for high-dimensional random variables. Hence, in this paper, we focus on posterior sampling. For simplicity, we restrict attention to the episodic RL setting. Under the PSRL framework, one maintains a posterior $p(\mathcal{M}|D_E)$ over the Markov decision process (MDP) $\mathcal{M}$ given the observations $D_E$ collected in the real environment $E$. At the beginning of each episode, an MDP is sampled from the posterior, and we then compute the optimal policy $\pi(\mathcal{M})$ for the sampled model $\mathcal{M}$. Equivalently, we can view this policy as a sample from a "degenerate" posterior over policies of the form $p(\pi|D_E) = \int \delta(\pi|\mathcal{M})\, p(\mathcal{M}|D_E)\, d\mathcal{M}$, where $\delta(\pi|\mathcal{M}) = \delta(\pi - \pi(\mathcal{M}))$ is a Dirac delta distribution. This policy is then executed in the real environment to collect new data.
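For concreteness, the sample-then-act loop just described can be sketched in the simplest possible setting, a one-step MDP (a Bernoulli bandit, where PSRL reduces to Thompson sampling). The Beta-Bernoulli model, the reward means, and all names here are our own illustrative choices, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.7])      # unknown Bernoulli reward means (H = 1 MDP)
alpha = np.ones(2)                     # Beta(1, 1) prior per arm: the "model posterior"
beta = np.ones(2)

pulls = np.zeros(2)
for k in range(2000):                  # each episode: sample a model, act greedily on it
    sampled = rng.beta(alpha, beta)    # M ~ p(M | D_E)
    a = int(np.argmax(sampled))        # pi(M): the optimal policy for the sampled model
    r = rng.random() < true_means[a]   # execute in the real environment
    alpha[a] += r                      # posterior update from the new observation
    beta[a] += 1 - r
    pulls[a] += 1

print(pulls[1] / pulls.sum())          # most pulls concentrate on the better arm
```

The same structure carries over to the episodic case, with the Beta posterior replaced by a posterior over transition and reward models.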
Theoretically, such a simple strategy has been shown to achieve a Bayesian regret of $\tilde{O}(\sqrt{K})$ over $K$ episodes (Osband et al., 2013; Osband & Van Roy, 2014). However, the theoretical guarantees only hold under exact inference, i.e., when we have access to the true posterior over models $p(\mathcal{M}|D_E)$ and can compute the optimal policy, which is very unlikely in practice. A common heuristic approximation to PSRL (see e.g., Fan & Ming (2021)) is to replace the posterior over models with some approximation, such as Bayesian linear regression on top of the representations learned by a neural network. The resulting policy is then sampled from $q_\delta(\pi|D_E) = \int \delta(\pi|\mathcal{M})\, q(\mathcal{M}|D_E)\, d\mathcal{M}$, where we replace $p(\mathcal{M}|D_E)$ with the approximate posterior $q(\mathcal{M}|D_E)$. At first glance, such a heuristic choice is natural, as it shares the same form as the true posterior $p(\pi|D_E)$. However, in this paper, we prove that we can get lower regret if we replace the degenerate $\delta(\pi|\mathcal{M})$ with a non-degenerate distribution of the form $q(\pi|\mathcal{M}, D_E)$ that depends on the model as well as the empirical data $D_E$; this is necessary to compensate for the fact that the posterior over models $q(\mathcal{M}|D_E)$ may be suboptimal. By tuning the relative strength of the dependence of $\pi$ on $D_E$ and $\mathcal{M}$, we can find a sweet spot between maximizing data efficiency and minimizing the effect of approximate inference error. Furthermore, such a decomposition is guaranteed to be no worse than the standard approach of using $q_\delta(\pi|D_E)$. Building upon these results, we develop a generic framework for PSRL under approximate inference. To implement the method in practice, we combine deep ensembles (Lakshminarayanan et al., 2017) and Model-based Policy Optimization (MBPO) (Janner et al., 2019). We also propose two different sampling strategies for policy selection that exploit our posterior approximation.
Empirically, our algorithm significantly outperforms the baselines on both dense reward and sparse reward tasks (Brockman et al., 2016; Tunyasuvunakool et al., 2020; Yu et al., 2020). Additionally, we conduct various ablation studies to provide a better understanding of our algorithm. In summary, our contributions are:
1. We conduct a rigorous study of how approximate inference affects the Bayesian regret in PSRL, showing that adopting the same methodology as in exact PSRL may be suboptimal when the true posterior is unavailable (Section 2).
2. Motivated by our analysis, we develop a generic framework for PSRL under approximate inference, as well as a practical version of it based on deep ensembles and (optimistic) sampling approaches for the policies (Section 3).
3. We present empirical results on the DeepMind control suite, OpenAI Gym and Metaworld benchmarks to demonstrate the efficacy of the proposed approach (Section 4).

2. PROBLEM STATEMENT AND THEORETICAL RESULTS

We start by introducing notation and summarizing prior work, before presenting our new theoretical results, which form the basis for our algorithm.

Notation. We consider the finite-horizon episodic Markov decision process (MDP) problem, an instance of which we denote $\mathcal{M} := \{S, A, r_{\mathcal{M}}, p_{\mathcal{M}}, H, \rho\}$. For each instance $\mathcal{M}$, $S$ and $A$ denote the sets of states and actions, respectively; $r_{\mathcal{M}} : S \times A \to [0, R_{\max}]$ is the reward function; $p_{\mathcal{M}}$ is the transition distribution; $H$ is the length of the episode; and $\rho$ is the distribution of the initial state. We further define the value function of a policy $\pi$ under MDP $\mathcal{M}$ at timestep $i$ as
$$V^{\mathcal{M}}_{\pi,i}(s) := \mathbb{E}_{\mathcal{M},\pi}\Big[\textstyle\sum_{t=i}^{H} r_{\mathcal{M}}(s_t, a_t) \,\Big|\, s_i = s\Big],$$
where $s_{t+1} \sim p_{\mathcal{M}}(s|s_t, a_t)$ and $a_t \sim \pi(a|s_t)$. We say $\pi^\star$ is an optimal policy for an MDP $\mathcal{M}$ if $V^{\mathcal{M}}_{\pi^\star,i}(s) = \max_\pi V^{\mathcal{M}}_{\pi,i}(s)$ for all $s \in S$ and $i \in [1, H]$. We define the cumulative reward obtained by policy $\pi$ over $H$ steps sampled from model $\mathcal{M}$ as
$$R_{\mathcal{M}}(\pi) = \mathbb{E}_{\mathcal{M},\pi}\Big[\textstyle\sum_{t=1}^{H} r_{\mathcal{M}}(s_t, a_t)\Big], \quad \text{where } s_{t+1} \sim p_{\mathcal{M}}(s|s_t, a_t) \text{ and } a_t \sim \pi(a|s_t), \quad (2)$$
and the initial state is sampled from $s_1 \sim \rho(s)$.

Regret. For a given MDP $\mathcal{M}$, the regret is defined as the difference between the value of the optimal policy in hindsight and that of the policies actually executed by the algorithm $\mathcal{A}$,
$$\mathrm{Regret}(T, \mathcal{A}, \mathcal{M}) := \sum_{k=1}^{K} \underbrace{\int \rho(s_1)\big[V^{\mathcal{M}}_{\pi^\star,1}(s_1) - V^{\mathcal{M}}_{\pi_k,1}(s_1)\big]\, ds_1}_{:= \Delta_k},$$
where $\pi^\star$ is the optimal policy for $\mathcal{M}$, and $\pi_k$ is the policy employed by the algorithm in the $k$-th episode. Correspondingly, the Bayesian regret is defined as the expectation of the above regret, i.e.,
$$\mathrm{BayesianRegret}(T, \mathcal{A}, p(\mathcal{M})) := \mathbb{E}\left[\mathrm{Regret}(T, \mathcal{A}, \mathcal{M})\right] = \mathbb{E}\Big[\textstyle\sum_{k=1}^{K} \Delta_k\Big]. \quad (4)$$
Here the expectation is taken over the prior distribution of dynamics models $p(\mathcal{M})$ and the randomness in the algorithm $\mathcal{A}$ and the environment.

PSRL. Posterior sampling reinforcement learning, or PSRL (Strens, 2000), serves as a generic algorithmic framework for automatically trading off exploration and exploitation in online RL.
The core of PSRL is computing the posterior over MDPs (dynamics and reward models). Conditioned on the data $D_E$ collected from the environment, we denote the posterior over models by $p(\mathcal{M}|D_E)$. Hence, the posterior distribution over policies is $p(\pi|D_E) = \int p(\pi|\mathcal{M})\, p(\mathcal{M}|D_E)\, d\mathcal{M}$, where $p(\pi|\mathcal{M}) = \delta(\pi|\mathcal{M})$ is a Dirac delta distribution, defined by $\delta(\pi(\mathcal{M})|\mathcal{M}) = 1$, with $\pi(\mathcal{M}) = \arg\max_\pi R_{\mathcal{M}}(\pi)$ the optimal policy for the MDP $\mathcal{M}$. At the beginning of every episode, an MDP (or equivalently a policy) is sampled from the posterior distribution, and is then used for collecting new data. This simple algorithmic framework not only attains a Bayesian regret of $\tilde{O}(\sqrt{K})$ (Osband et al., 2013), where $K$ is the total number of episodes, but also enjoys better empirical performance than optimism-based methods in bandits (Chapelle & Li, 2011; Osband & Van Roy, 2017). Nevertheless, the theoretical results only hold under exact inference. In the rest of this section, we discuss how performance is affected when we use approximate inference.

Bayesian regret under approximate inference. Let us denote the approximate and true posterior distributions over policies at the $k$-th episode by
$$q_k(\pi) = q(\pi|D^k_E) \quad \text{and} \quad p_k(\pi) = \int \delta(\pi|\mathcal{M})\, p(\mathcal{M}|D^k_E)\, d\mathcal{M},$$
where $D^k_E$ denotes all the data collected from the environment $E$ up to the $k$-th episode. Next we characterize how approximate posterior inference affects the Bayesian regret.

Theorem 1 For $K$ episodes, the Bayesian regret of a posterior sampling reinforcement learning algorithm $\mathcal{A}$ with any approximate posterior distribution $q_k$ at episode $k$ is upper bounded by
$$\sqrt{C K (H R_{\max})^2\, \mathcal{H}(\pi^\star)} \;+\; 2 H R_{\max} \sum_{k=1}^{K} \mathbb{E}\Big[\sqrt{d_{KL}(q_k(\pi)\,\|\,p_k(\pi))}\Big],$$
where $\mathcal{H}(\pi^\star)$ is the entropy of the prior distribution of optimal policies, i.e., $p(\pi^\star) = \int \delta(\pi|\mathcal{M})\, p(\mathcal{M})\, d\mathcal{M}$, $C$ is a problem-dependent constant, and $d_{KL}(\cdot\,\|\,\cdot)$ is the KL divergence.

Theorem 1 (proved in the appendix) provides an upper bound on the Bayesian regret under approximate posterior inference.
The first term is $\tilde{O}(\sqrt{K})$. The second term is zero under exact posterior inference. Under approximate inference, however, it tells us that we should choose the approximate posterior distribution $q(\pi|D_E)$ to be as "close" to the true distribution $p(\pi|D_E)$ as possible. A natural and common choice of $q(\pi|D_E)$ is the following, which takes the same form as the true posterior:
$$q_\delta(\pi|D_E) := \int \delta(\pi|\mathcal{M})\, q(\mathcal{M}|D_E)\, d\mathcal{M}. \quad (8)$$
However, the following proposition shows that such a choice is suboptimal.

Proposition 1 Under approximate inference (i.e., $q(\mathcal{M}|D_E) \neq p(\mathcal{M}|D_E)$), the optimal $q(\pi|\mathcal{M})$ may not be a Dirac delta distribution; i.e., there exists another $q(\pi|D_E)$ such that $d_{KL}(q(\pi|D_E)\,\|\,p(\pi|D_E)) \le d_{KL}(q_\delta(\pi|D_E)\,\|\,p(\pi|D_E))$.

As an illustration, we provide the following example, which also serves as a constructive proof of Proposition 1.

EXAMPLE 1. SUBOPTIMALITY OF $q_\delta(\pi|\mathcal{M})$. Consider a toy setting where the support set of MDPs is $\{\mathcal{M}_1, \mathcal{M}_2\}$ and the support set of policies is $\{\pi_1, \pi_2\}$. Suppose that the true posterior distribution over MDPs is $p(\mathcal{M}_1|D_E) = 2/3$, $p(\mathcal{M}_2|D_E) = 1/3$, and the optimal policy per MDP satisfies $\delta(\pi_1|\mathcal{M}_1) = 1$ and $\delta(\pi_2|\mathcal{M}_2) = 1$. Thus we get the following exact distribution over policies:
$$p(\pi|D_E) = \underbrace{\begin{pmatrix} \delta(\pi_1|\mathcal{M}_1)=1 & \delta(\pi_1|\mathcal{M}_2)=0 \\ \delta(\pi_2|\mathcal{M}_1)=0 & \delta(\pi_2|\mathcal{M}_2)=1 \end{pmatrix}}_{\delta(\pi|\mathcal{M})} \underbrace{\begin{pmatrix} p(\mathcal{M}_1|D_E)=\tfrac{2}{3} \\ p(\mathcal{M}_2|D_E)=\tfrac{1}{3} \end{pmatrix}}_{p(\mathcal{M}|D_E)} = \begin{pmatrix} p(\pi_1|D_E)=\tfrac{2}{3} \\ p(\pi_2|D_E)=\tfrac{1}{3} \end{pmatrix} \quad (10)$$
Now suppose we use the approximate posterior distribution over models $q(\mathcal{M}_1|D_E) = 0$ and $q(\mathcal{M}_2|D_E) = 1$. We can optimize $q(\pi|\mathcal{M})$ by minimizing $d_{KL}(q(\pi|D_E)\,\|\,p(\pi|D_E))$. One solution is
$$q(\pi|D_E) = \underbrace{\begin{pmatrix} q(\pi_1|\mathcal{M}_1)=\tfrac{1}{2} & q(\pi_1|\mathcal{M}_2)=\tfrac{2}{3} \\ q(\pi_2|\mathcal{M}_1)=\tfrac{1}{2} & q(\pi_2|\mathcal{M}_2)=\tfrac{1}{3} \end{pmatrix}}_{q(\pi|\mathcal{M})} \underbrace{\begin{pmatrix} q(\mathcal{M}_1|D_E)=0 \\ q(\mathcal{M}_2|D_E)=1 \end{pmatrix}}_{q(\mathcal{M}|D_E)} = \begin{pmatrix} q(\pi_1|D_E)=\tfrac{2}{3} \\ q(\pi_2|D_E)=\tfrac{1}{3} \end{pmatrix} \quad (11)$$
We see that the optimal $q(\pi|\mathcal{M})$ requires modeling uncertainty in the policy even conditioned on the model.
By contrast, if we adopt $q_\delta(\pi|D_E)$ as our approximation, we have
$$d_{KL}\big(q_\delta(\pi|D_E)\,\|\,p(\pi|D_E)\big) = \log 3 = \max_{q \in \Delta^1} d_{KL}\big(q(\pi|D_E)\,\|\,p(\pi|D_E)\big),$$
where $\Delta^1$ denotes the one-dimensional probability simplex. This example tells us that $q_\delta(\pi|D_E)$ can perform as poorly as possible in terms of the KL divergence. Given this observation, a natural follow-up question is: is there a better choice than $q_\delta(\pi|D_E)$? We provide an answer in the next section.
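The arithmetic of Example 1 is easy to verify numerically; the following is a plain NumPy sketch (the helper name `kl` is ours):

```python
import numpy as np

def kl(q, p):
    """Discrete KL divergence d_KL(q || p), with the 0*log(0) = 0 convention."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

p_pi = [2/3, 1/3]        # exact p(pi | D_E) from Example 1

# degenerate choice: q(M_2|D_E) = 1 and delta(pi_2|M_2) = 1 put all mass on pi_2
q_delta = [0.0, 1.0]
# non-degenerate q(pi|M_2) = (2/3, 1/3) recovers p(pi | D_E) exactly
q_star = [2/3, 1/3]

print(kl(q_delta, p_pi))   # log 3 ~= 1.0986, the worst point of the simplex
print(kl(q_star, p_pi))    # 0.0
```

The degenerate posterior attains the maximal KL divergence over the simplex, while a non-degenerate conditional drives it to zero, as claimed.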

3. METHOD

Motivated by the results in the previous section, we first introduce a posterior decomposition that is guaranteed to be no worse than $q_\delta(\pi|D_E)$. We then introduce a practical version of the algorithm built upon deep ensembles (Lakshminarayanan et al., 2017). Finally, we propose two sampling approaches for encouraging efficient exploration.

3.1. POSTERIOR DECOMPOSITION

In Section 2, we showed that $q_\delta(\pi|D_E)$ is not a favorable choice, because it assumes that, once $\mathcal{M}$ is given, the policy $\pi$ is determined. This motivates us to consider the following more flexible posterior decomposition of $q(\pi|D_E)$:
$$q(\pi|D_E, \lambda) = \int q(\pi|\mathcal{M}, D_E, \lambda)\, q(\mathcal{M}|D_E)\, d\mathcal{M}. \quad (13)$$
Intuitively, such a posterior decomposition no longer assumes that the model can capture all the relevant properties of the data. We illustrate these two posterior approximations in Figure 1. The extra parameter $\lambda \in [0, 1]$ allows us to balance the importance of fictitious data ($D_{\mathcal{M}}$) from $\mathcal{M}$ and the real data ($D_E$) from the environment. In particular, we define
$$q(\pi|\mathcal{M}, D_E, \lambda = 0) = q(\pi|\mathcal{M}) = \delta(\pi|\mathcal{M}), \quad (14)$$
$$q(\pi|\mathcal{M}, D_E, \lambda = 1) = q(\pi|D_E). \quad (15)$$
Thus when $\lambda$ is small, we trust our model more, and when $\lambda$ is large we trust it less. In the extreme case where $\lambda = 0$, this framework reduces to the degenerate posterior $q_\delta(\pi|D_E)$. By adjusting $\lambda$, we can find a sweet spot between minimizing the effect of approximate inference error and maximizing data efficiency. More formally, the following proposition states the advantage of equation 13.

Proposition 2 By adopting the posterior decomposition of equation 13, we have
$$\min_\lambda d_{KL}\big(q(\pi|D_E, \lambda)\,\|\,p(\pi|D_E)\big) \le d_{KL}\big(q_\delta(\pi|D_E)\,\|\,p(\pi|D_E)\big). \quad (16)$$

[Algorithm 1: PSRL with approximate inference using Ensemble Sampling (PS-MBPO) or Optimistic Ensemble Sampling (OPS-MBPO). It initializes an ensemble of dynamics models $\Theta = \{\theta_n\}_{n=1}^N \overset{i.i.d.}{\sim} q(\theta)$ and an ensemble of policy networks $\Phi = \{\phi_{n,m}\}$, one policy $m$ per model $n$.]

The above proposition informs us that we can minimize the KL divergence, and hence the upper bound on the Bayesian regret, by carefully choosing the value of $\lambda$. As empirical evidence, in Figure 2 we plot the cumulative regret for different $\lambda$ in the cartpole-swingup environment with sparse reward. We see that a value of about $\lambda = 0.4$ or $\lambda = 0.5$ is best, and $\lambda = 0$ is worst in this case.
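One simple practical reading of the $\lambda$ interpolation, sketched here under our own assumptions (the paper's exact mechanism may differ), is to train the policy on a per-sample mixture of real data $D_E$ and fictitious model data $D_{\mathcal{M}}$, with $\lambda$ the probability of drawing from the real buffer:

```python
import numpy as np

def sample_policy_batch(real_data, model_data, lam, batch_size, rng):
    """Draw a policy-training batch: each element comes from the real buffer D_E
    with probability lam, and from the model rollout buffer D_M otherwise.
    lam = 0 recovers the degenerate q_delta (trust the model fully);
    lam = 1 trains on real data only. Names are illustrative."""
    from_real = rng.random(batch_size) < lam
    idx_real = rng.integers(len(real_data), size=batch_size)
    idx_model = rng.integers(len(model_data), size=batch_size)
    return np.where(from_real, real_data[idx_real], model_data[idx_model])

rng = np.random.default_rng(0)
# toy buffers: real samples are 0s, model samples are 1s, so the batch mean
# reveals the mixing proportion
batch = sample_policy_batch(np.zeros(100), np.ones(100), lam=0.4,
                            batch_size=10_000, rng=rng)
print(batch.mean())   # ~0.6: about 40% real, 60% model data
```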

3.2. PROPOSED ALGORITHM

Building upon the above observations, we introduce a simple and general algorithmic framework for PSRL under approximate inference, which differs from the standard PSRL framework only in the decomposition of the policy posterior (see Algorithm 2 in Appendix C). To implement the method in practice, we use deep ensembles (Lakshminarayanan et al., 2017) to approximate the posterior distributions $q(\mathcal{M}|D_E)$ and $q(\pi|\mathcal{M}, D_E)$ with $\Theta$ and $\Phi$ (see Algorithm 1). This is similar to ME-TRPO (Kurutach et al., 2018), PETS (Chua et al., 2018) and MBPO (Janner et al., 2019), except that we also model the uncertainty over policies (i.e., $q(\pi|\mathcal{M}, D_E)$) in addition to dynamics (i.e., $q(\mathcal{M}|D_E)$).

In more detail, each member of the deep ensemble outputs a conditional Gaussian distribution, characterized by its mean $\mu$ and variance $\sigma^2$. For multi-dimensional predictions, we treat each dimension independently, and only predict the marginal mean and marginal variance for simplicity. Each ensemble member is trained independently by minimizing the negative log-likelihood,
$$-\log p_\theta(y|x) = \frac{\log \sigma^2_\theta(x)}{2} + \frac{(y - \mu_\theta(x))^2}{2\sigma^2_\theta(x)} + \text{constant}. \quad (17)$$
We maintain $N$ different dynamics models. For each such model, we also compute $M$ different policies; we use the soft actor-critic (SAC) (Haarnoja et al., 2018) method (see the appendix for details). The policy network $\pi_{n,m}$ is updated based on synthetic data $D^{n,m}_{\mathcal{M}}$ generated by dynamics model $n$ and policy model $m$, as well as environmental data $D_E$ generated by interacting with the real-world dynamics using a sampled policy. See Algorithm 1 for the pseudo-code.
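The per-dimension negative log-likelihood of equation 17 can be written directly; a minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def gaussian_nll(mu, log_var, y):
    """Per-dimension Gaussian negative log-likelihood of eq. (17),
    dropping the additive constant. Parameterized by log-variance for
    numerical stability, as is common for deep ensembles."""
    return 0.5 * log_var + 0.5 * (y - mu) ** 2 / np.exp(log_var)

# the loss is minimized when mu = y and the predicted variance matches the error
y = 1.0
print(gaussian_nll(mu=1.0, log_var=0.0, y=y))   # 0.0: perfect mean, unit variance
print(gaussian_nll(mu=0.0, log_var=0.0, y=y))   # 0.5: unit squared-error penalty
```

The same expression, summed over output dimensions and averaged over a batch, is what each ensemble member minimizes.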

3.3. SAMPLING POLICIES

Ensemble Sampling. Given the posterior distributions, it remains to specify the sampling approach for policies. The simplest strategy is uniform sampling at the beginning of each episode,
$$\pi \sim U(\{\pi_{1,1}, \ldots, \pi_{N,M}\}). \quad (18)$$
In the case of bandits (where $N = 1$, since there is no transition model), such a simple strategy has been shown to achieve a regret of $\tilde{O}(\sqrt{T} + T\sqrt{A/M})$ over $T$ steps in Gaussian linear bandits (Lu & Van Roy, 2017; Qin et al., 2022), where $M$ is the size of the ensemble and $A$ is the number of arms. This regret bound tells us that adding more ensemble members reduces the regret, although it is not clear how this theoretical result extends to the RL setting.

Optimistic Ensemble Sampling. Unfortunately, ensemble sampling may over-explore unpromising regions, as it treats each member of the ensemble equally. This may lead to unnecessary or wasteful exploration. To mitigate this, we propose an optimistic version of ensemble sampling which we call OPS-MBPO. Specifically, we keep track of the performance of each ensemble member in terms of the accumulated episodic return (alternatively, one could also use the value function of each policy (Agarwal & Zhang, 2022)). We then use this performance to determine the probability of choosing each member, thus gradually discarding unpromising ensemble members. More precisely, at the beginning of the $k$-th episode, we sample the policy from the following Boltzmann distribution, instead of uniformly at random:
$$p_k(\pi = \pi_i) := \frac{\exp\big(\sum_{l=1}^{k} R_E(\pi_i, l)/\tau\big)}{\sum_{j=1}^{N \cdot M} \exp\big(\sum_{l=1}^{k} R_E(\pi_j, l)/\tau\big)}, \quad (19)$$
where $\tau$ is a temperature term controlling the level of optimism, and $R_E(\pi_i, l)$ is the empirical cumulative reward of $\pi_i$ in the $l$-th episode. When $\tau \to \infty$, we recover uniform ensemble sampling, which we call PS-MBPO.
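The Boltzmann rule of equation 19 is straightforward to implement; the following NumPy sketch (all names ours) also exercises the $\tau \to \infty$ uniform limit:

```python
import numpy as np

def optimistic_sample(cum_returns, tau, rng):
    """Sample an ensemble-member index from the Boltzmann distribution of
    eq. (19). cum_returns[i] holds sum_l R_E(pi_i, l); tau -> inf recovers
    uniform ensemble sampling (PS-MBPO)."""
    logits = np.asarray(cum_returns, float) / tau
    logits -= logits.max()                    # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(len(probs), p=probs), probs

rng = np.random.default_rng(0)
returns = [100.0, 300.0, 200.0]               # toy accumulated episodic returns
idx, probs = optimistic_sample(returns, tau=100.0, rng=rng)
print(np.round(probs, 3))                     # best-performing policy gets the largest weight
_, probs_uniform = optimistic_sample(returns, tau=1e9, rng=rng)
print(np.round(probs_uniform, 3))             # ~[0.333 0.333 0.333]
```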

4. EXPERIMENTS

Our empirical evaluation aims to 1) verify the effectiveness of our proposed methods; 2) offer a deeper understanding of the mechanisms that are key to the improved performance; and 3) provide additional ablation studies on non-key components. We start by introducing the experimental setup.

4.1. EXPERIMENTAL SETUP

We consider seven tasks from OpenAI Gym (Brockman et al., 2016), the DeepMind Control Suite (Tunyasuvunakool et al., 2020) and Metaworld (Yu et al., 2020): four dense reward tasks (ant, half-cheetah, walker2d and hopper), where the agent receives an immediate reward at each step, and three sparse reward tasks (ball-in-cup, cartpole-swingup and window-open-v2), where the agent receives a reward only upon finishing the corresponding task. For sparse reward tasks, efficient exploration of the environment is more crucial than for dense reward tasks. See Figure 3 for visualizations of the tasks. For baselines, we consider the model-based approaches SLBO (Luo et al., 2019), PETS (Chua et al., 2018) and MBPO (Janner et al., 2019), and the model-free method SAC (Haarnoja et al., 2018). We compare each method in terms of the average episode reward, where each episode ends when the time step reaches 1,000 or the agent reaches a terminal state. To draw more robust conclusions, we repeat each experiment with 10 different random seeds, and report the mean and the standard error. More details can be found in Appendices A and C.

4.2. COMPARISON WITH EXISTING METHODS

We report the results on the dense reward tasks in Figure 4. Our (O)PS-MBPO outperforms the baselines on all four tasks, including MBPO. (We verified that our implementation of MBPO is comparable, or superior, to the original implementation; see the appendix for details.) For example, on hopper, our method requires only roughly 40K steps to reach an average reward of around 3,500, whereas MBPO needs around 150K steps. We report the results on the sparse reward tasks in Figure 5. We first observe that both PS-MBPO and OPS-MBPO outperform MBPO on all three tasks, and the improvement is more significant on cartpole-swingup and window-open-v2. In contrast to Figure 4, we also observe that OPS-MBPO further improves on PS-MBPO on these three tasks by a significant margin. This confirms the advantage of adopting an optimistic sampling strategy in sparse reward tasks.

4.3. ABLATION STUDIES

In this section, we conduct a variety of ablation studies to better understand our proposed method. Additional experiments can be found in the appendix, including ablation studies with forced exploration (Phan et al., 2019), randomized prior function networks (Osband et al., 2018), etc.

Figure 6: Ablation study on the performance with (solid curves) and without (dashed curves) the sampling step.

Does the gain come from posterior sampling or the ensemble? To assess the importance of posterior sampling, we compare sampling a policy from the posterior at each episode against the standard approach of using the average policy, computed by averaging the distribution over actions across all ensemble members. The results are presented in Figure 6, with $N = 5$ dynamics networks and $M \in \{1, 3, 5\}$ policy networks. We see that when we disable the sampling procedure, the performance drops significantly. Another interesting observation is that, with posterior sampling, the performance improves as we add more ensemble members (i.e., increase $M$). By contrast, when posterior sampling is disabled, increasing $M$ does not seem to improve performance. This confirms that posterior sampling is the main factor behind the improved performance, not just the fact that we have a larger ensemble for both dynamics and policies.

Effect of N and M. Since we are using the deep ensemble approximation, it is natural to wonder whether we could get better performance with a larger ensemble of dynamics models and policies. In Figure 7, we plot the average reward of the last 10 evaluations over 10 different random seeds, using $N$ dynamics models and $M$ policy networks per dynamics model, varying $N$ and $M$ in $\{1, 2, 3, 4, 5\}$. We see that increasing both $N$ and $M$ improves performance, and that both forms of uncertainty (over policies and over dynamics) seem to matter.

Visualization of the state space.
To gain insight into the exploration behavior, we project the high-dimensional states of the trajectories collected by PS-MBPO and MBPO into a two-dimensional space using UMAP (McInnes et al., 2018). The visualization of these embeddings can be found in Figure 8. We see that, in the initial phase, PS-MBPO (top left three panels) is more explorative than MBPO (top right three panels), and that this also leads to better final performance (see Figure 5). In addition, we plot two representative trajectories of PS-MBPO and MBPO at the same training iterations for a qualitative comparison in Figure 8. Visually, we observe that PS-MBPO moves the robot arm to cover more diversified regions, whereas MBPO fails to do so and its two trajectories look very similar to each other (this is also reflected in the UMAP visualization).

5. RELATED WORKS

Model-based reinforcement learning (MBRL). Research in MBRL mainly concerns how to learn the dynamics model from data, and how to use the learned model. The most commonly adopted approach for model learning is the $L_2$ loss on one-step transitions (Kurutach et al., 2018; Luo et al., 2019; Chua et al., 2018; Janner et al., 2019; Rajeswaran et al., 2020), which is equivalent to maximum likelihood estimation under a Gaussian assumption. Beyond one-step training, Hafner et al. (2019); Asadi et al. (2019); Lutter et al. (2021) show that multi-step training can further improve prediction accuracy over long horizons, but at extra computational cost that scales quadratically with the number of prediction steps. Another line of work focuses on the objective mismatch problem in MBRL (Ziebart, 2010; Farahmand et al., 2017; Luo et al., 2019; Lambert et al., 2020; Eysenbach et al., 2021), modifying the model training objective to provide performance guarantees for the induced policy in the unknown real environment. Given the learned model, there are several ways to utilize it. Model predictive control (MPC) (Camacho & Alba, 2013) is a derivative-free optimization method that has been adopted in many prior works (Nagabandi et al., 2018; Chua et al., 2018; Hafner et al., 2019; Fan & Ming, 2021; Lutter et al., 2021). However, MPC is sensitive to the planning horizon and struggles with high-dimensional problems. As a mitigation, Kurutach et al. (2018); Luo et al. (2019); Janner et al. (2019) instead train a policy on top of the model to amortize the planning cost. Similarly, other works utilize the model to facilitate learning of value functions (Feinberg et al., 2018; Buckman et al., 2018). However, few of these works investigate the exploration-exploitation tradeoff.

Exploration and exploitation. Handling the exploration-exploitation tradeoff is the central problem in online learning.
Typical methods can be categorized into three classes: 1) optimism-based (Jaksch et al., 2010; Pacchiano et al., 2021; Curi et al., 2020); 2) posterior-sampling-based (Strens, 2000; Osband et al., 2013; Osband & Van Roy, 2014; Osband et al., 2018; Fan & Ming, 2021); and 3) information-directed sampling (Russo & Van Roy, 2014; Nikolov et al., 2019) approaches. Among them, optimism-based methods require constructing a confidence set that contains the target model/policy with high probability, which suffers from scalability issues (Osband & Van Roy, 2017); in addition, this approach empirically performs worse than Thompson sampling (Chapelle & Li, 2011). Information-directed sampling can be better than both optimism-based methods and Thompson sampling, as it directly minimizes the "regret per information bit" (Russo & Van Roy, 2014). However, it relies on estimating the mutual information between random variables, which is especially difficult for high-dimensional continuous random variables (McAllester & Stratos, 2020). Therefore, we focus on posterior sampling. Different from prior works, however, we study the effect of approximate inference in an RL setting.

6. SUMMARY AND FUTURE WORK

In this paper, we presented PS-MBPO and OPS-MBPO, two algorithms for efficient model-based reinforcement learning in complex environments. We demonstrated that both PS-MBPO and OPS-MBPO can greatly improve the sample efficiency of online reinforcement learning, and surpass various baseline methods by a large margin, especially on sparse reward tasks. We hope that, beyond our specific approach, our analysis can inspire future work on improved factorizations of the posterior over policies and models. In the future, we would like to explore automatically adapting the value of λ in an online fashion (see Section B.3 for some preliminary results). In addition, we would like to extend the result of Phan et al. (2019) and prove a sublinear regret bound for PSRL under approximate inference. It would also be interesting to explore epistemic neural networks (Osband et al., 2021) and transformers (Vaswani et al., 2017; Müller et al., 2021) as alternatives to deep ensembles for approximate posterior inference. Lastly, making information-directed sampling (Russo & Van Roy, 2014) practical for reinforcement learning problems is another promising direction for further improving sample efficiency.

Reproducibility Statement. We will open source the code for reproducing the results in our paper. For details about our experiments and algorithms, we encourage the reader to see Appendix A for extended background and Appendix C for the hyperparameters used in our experiments.

A EXTENDED BACKGROUND

A.1 DYNAMICS MODEL

We use a deep ensemble to fit the environment dynamics. Each network $f_\theta$ in the ensemble takes a whitened state-action pair as input, and predicts the residual of the next state as well as the reward, i.e.,
$$f_\theta\Big(\frac{s_t - \mu_s}{\sigma_s}, \frac{a_t - \mu_a}{\sigma_a}\Big) = \mathcal{N}\left(\begin{pmatrix} \Delta s_t \\ r(s_t, a_t) \end{pmatrix}, \begin{pmatrix} \mathrm{diag}(\sigma^2_{\Delta s_t}) & 0 \\ 0 & \sigma^2_{r_t} \end{pmatrix}\right),$$
where $\mu_s, \mu_a$ are the empirical means of the states and actions, and $\sigma_s, \sigma_a$ are their empirical standard deviations. The predictions of the next state and reward are then
$$\begin{pmatrix} s_{t+1} \\ r_t \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} s_t + \Delta s_t \\ r(s_t, a_t) \end{pmatrix}, \begin{pmatrix} \mathrm{diag}(\sigma^2_{\Delta s_t}) & 0 \\ 0 & \sigma^2_{r_t} \end{pmatrix}\right). \quad (21)$$
Each individual neural network is implemented in JAX (Bradbury et al., 2018).

A.2 SOFT ACTOR-CRITIC

We use SAC for learning the policies. At a high level, SAC is a maximum entropy RL algorithm, which optimizes the objective
$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s,a)\sim\rho_\pi}\left[r(s, a) + \alpha \mathcal{H}(\pi(\cdot|s))\right].$$
As a result, maximum entropy RL favors policies that not only optimize the reward, but also have large entropy; this can in turn improve the robustness of the optimized policy. SAC searches for the policy by iteratively solving the policy evaluation and policy improvement steps.

Policy evaluation:
$$Q^{\pi_t}(s, a) \leftarrow r(s, a) + \gamma \mathbb{E}_{s' \sim p(\cdot|s,a)}\left[V^{\pi_t}(s')\right], \quad (23)$$
$$V^{\pi_t}(s) = \mathbb{E}_{a \sim \pi_t(\cdot|s)}\left[Q^{\pi_t}(s, a) - \log \pi_t(a|s)\right].$$
Policy improvement:
$$\pi_{t+1} \leftarrow \arg\min_\pi d_{KL}\big(\pi(\cdot|s)\,\big\|\, \exp(Q^{\pi_t}(s, \cdot))\big).$$
In the practical implementation, SAC uses a separate function approximator for the state value to stabilize training. Specifically, there are three components in SAC: a parameterized state value function $V_\psi(s)$, a soft Q-function $Q_\theta(s, a)$, and a policy $\pi_\phi(a|s)$. The objectives for each component are
$$J_V(\psi) = \mathbb{E}_{s \sim D}\Big[\tfrac{1}{2}\big(V_\psi(s) - \mathbb{E}_{a \sim \pi_\phi(\cdot|s)}[Q_\theta(s, a) - \log \pi_\phi(a|s)]\big)^2\Big],$$
$$J_Q(\theta) = \mathbb{E}_{(s,a) \sim D}\Big[\tfrac{1}{2}\big(Q_\theta(s, a) - \bar{Q}(s, a)\big)^2\Big],$$
$$J_\pi(\phi) = \mathbb{E}_{s \sim D}\left[d_{KL}\big(\pi_\phi(\cdot|s)\,\big\|\, \exp(Q_\theta(s, \cdot))/Z_\theta(s)\big)\right],$$
where $Z_\theta(\cdot)$ is a normalizing constant, and $\bar{Q}(s, a)$ is defined as
$$\bar{Q}(s, a) := r(s, a) + \gamma \mathbb{E}_{s' \sim p(\cdot|s,a)}\left[V_{\bar\psi}(s')\right].$$
Additionally, $\bar\psi$ is an exponential moving average of the weights of the value network, and $J_\pi(\phi)$ can be optimized with the reparameterization trick in the Gaussian case, which further reduces the variance of the gradient estimator and hence stabilizes training. We adopt the SAC implementation from Acme (Hoffman et al., 2020).
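Returning to the dynamics model of A.1, the whitened residual parameterization of equation 21 can be made concrete with a minimal NumPy sketch; the stand-in network and all names here are our own illustration, not the paper's JAX code:

```python
import numpy as np

def predict_next(f_theta, s_t, a_t, stats):
    """Whiten (s_t, a_t), query one ensemble member for the state residual and
    reward, then un-shift the state: s_{t+1} = s_t + delta_s (eq. 21).
    f_theta is any function returning (mean, var) over [delta_s, r]."""
    x = np.concatenate([(s_t - stats["mu_s"]) / stats["sigma_s"],
                        (a_t - stats["mu_a"]) / stats["sigma_a"]])
    mean, var = f_theta(x)
    delta_s, r = mean[:-1], mean[-1]
    return s_t + delta_s, r, var

# stand-in "network": zero residual, reward 1, unit variance on every output
f = lambda x: (np.array([0.0, 0.0, 1.0]), np.ones(3))
stats = dict(mu_s=np.zeros(2), sigma_s=np.ones(2),
             mu_a=np.zeros(1), sigma_a=np.ones(1))
s_next, r, _ = predict_next(f, np.array([0.5, -0.5]), np.array([0.1]), stats)
print(s_next, r)   # [ 0.5 -0.5] 1.0
```

Predicting the residual $\Delta s_t$ rather than $s_{t+1}$ directly keeps the regression target near zero, which is the usual motivation for this parameterization.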

A.3 REVISITING MBPO, PETS AND ME-TRPO

Popular model-based reinforcement learning algorithms such as ME-TRPO (Kurutach et al., 2018), PETS (Chua et al., 2018) and MBPO (Janner et al., 2019) typically repeat the following three steps: 1) train a dynamics model (or an ensemble of models) $q(\mathcal{M}|D_E)$; 2) train/extract a policy $\pi^\star$ from the learned model; 3) collect data from the environment with the policy. Consequently, their policy is (approximately) equivalent to the one obtained by solving
$$\pi^\star = \arg\max_{\pi \in \Pi} \mathbb{E}_{\mathcal{M}}[R_{\mathcal{M}}(\pi)] = \arg\max_{\pi \in \Pi} \int R_{\mathcal{M}}(\pi)\, q(\mathcal{M}|D_E)\, d\mathcal{M},$$
where the posterior over the model $\mathcal{M}$ is approximated by an ensemble of neural networks, $\Pi$ is the search space of policies, and the cumulative reward $R_{\mathcal{M}}(\pi)$ of a policy $\pi$ over an episode of length $H$ under dynamics model $\mathcal{M}$ is defined as
$$R_{\mathcal{M}}(\pi) = \mathbb{E}\Big[\textstyle\sum_{t=1}^{H} r_{\mathcal{M}}(s_t, a_t)\Big], \quad \text{where } s_{t+1} \sim p_{\mathcal{M}}(s|s_t, a_t) \text{ and } a_t \sim \pi(a|s_t). \quad (31)$$
However, this strategy only accounts for exploitation, and hence leads to low data efficiency.
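The exploitation-only selection above amounts to scoring each candidate policy by its mean return across the model ensemble and keeping the argmax; a toy sketch follows, where `rollout_return` is a hypothetical return estimator of our own invention, not these libraries' APIs:

```python
import numpy as np

def exploit_only_policy(candidates, models, rollout_return):
    """MBPO/PETS-style selection: score each candidate policy by its mean
    return over the ensemble (approximating E_M[R_M(pi)]) and keep the best.
    No posterior sampling is involved, so this step never explores."""
    scores = [np.mean([rollout_return(m, pi) for m in models])
              for pi in candidates]
    return candidates[int(np.argmax(scores))]

# toy instance: "models" are zero-mean biases added to a policy's base return
models = [-0.1, 0.0, 0.1]
ret = lambda m, pi: pi + m        # stand-in return estimator
print(exploit_only_policy([1.0, 2.0, 1.5], models, ret))   # 2.0
```

Because the argmax integrates out the model uncertainty instead of sampling from it, every episode executes the same greedy policy; this is the behavior PSRL improves upon.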

A.4 POSTERIOR SAMPLING REINFORCEMENT LEARNING

The idea of PSRL was introduced by Strens (2000). The first regret bound, $\tilde{O}(HS\sqrt{AT})$, for PSRL was proved by Osband et al. (2013) in the tabular case, where $S$, $A$, $T$ and $H$ denote the number of states, the number of actions, the number of time steps, and the length of each episode, respectively. The bound was later improved in Osband & Van Roy (2017). In addition to bounds on the Bayesian regret, there is also a line of work studying worst-case or frequentist regret bounds for PSRL (Agrawal & Jia, 2017; Tiapkin et al., 2022b;a), achieving regret bounds of order $\tilde{O}(\sqrt{T})$. Nevertheless, all these regret bounds are derived under exact Thompson sampling.

B.1 HOW DO THE OPTIMISTIC WEIGHTS EVOLVE?

For OPS-MBPO, we maintain the weights of each policy throughout the entire process of online learning. To investigate how those weights evolve, we plot the weights of each policy in Figure 9 and Figure 10, covering one dense reward task and one sparse reward task. The leftmost figure corresponds to the weights in the initial phase, which change more rapidly than in later phases. We observe that the weights of some policies first go up and then go down, and eventually the distribution converges to a single policy. More interestingly, the reward curve in the rightmost figure is also consistent with the pattern in the optimistic weight curves.

B.2 HOW DOES THE TEMPERATURE TERM τ AFFECT THE PERFORMANCE?

We further study how the temperature term affects the performance on both dense reward and sparse reward tasks. The results can be found in Figure 11 and Figure 12. We observe that the temperature term affects the convergence speed of the reward on most tasks. For some tasks, such as Ant, Hopper and Walker2d, it also slightly affects the converged reward. In general, we recommend setting the temperature to around five times the best average episodic reward that can be achieved.

B.3 HOW DOES THE SCHEDULE OF λ AFFECT THE PERFORMANCE?

Since λ plays a role in balancing the effect of approximate inference error against data efficiency, we are interested in how different schedules of λ affect the reward curve. We consider two schemes for adjusting λ: 1) a constant schedule; and 2) a linear schedule. For the constant schedule, we fix the value of λ throughout training, whereas for the linear schedule, we decrease the value of λ from 1 to 0 linearly. The rightmost figure in Figure 13 visualizes the difference between these two schedules. To make them comparable, we ensure that the areas under both curves are of the same size, so that the total amount of real-world data is the same. The comparison on three tasks is shown in the left three figures of Figure 13. We observe that the linear schedule has very little effect on the dense reward tasks, though it slightly improves the final reward on Hopper. For Cartpole-swingup, the linear schedule improves performance faster than the constant schedule, but achieves a similar reward in the end. Nevertheless, we believe there might be more sophisticated schedules for λ that can outperform the constant schedule, e.g., adapting the value of λ based on the model's validation loss. For simplicity, we recommend using a constant schedule in practice. For sparse reward tasks, the search range of λ can be {0, 0.1, 0.3, 0.5, 0.7}, and {0, 0.05, 0.15, 0.3} for dense reward tasks. In addition to the grid search, we believe it is also possible to adapt the value of λ online by casting the problem of choosing the optimal value of λ as a bandit problem. The high-level idea is: 1) initialize a set of possible values for λ, and treat each value of λ as an arm of the bandit; 2) apply any no-regret learning algorithm to solve it, e.g., explore-then-commit. However, we have not tested this algorithm yet, and it would be an interesting future extension.

Figure 12: Ablation study on how the choice of the temperature affects the performance on sparse reward tasks. All the experimental setups are the same as those in our main paper, except for the temperature term.

C ADDITIONAL DETAILS ABOUT THE ALGORITHM AND EXPERIMENTS

Algorithm 2 PS-MBPO (abstract formulation)
Require: Prior distributions q(M), q(π) and tuning hyperparameter λ.
Require: Initialize an empty dataset D_E for storing data collected from the environment.
1: for K episodes do
2:    Fit the posterior of the policy q(π|D_E, λ) on data D_E using equation 13.
3:    Sample a policy π_k ∼ q(π|D_E, λ) from the posterior distribution.
4:    for H steps do
5:        Run the policy π_k in the environment and add the collected data to D_E.
6:    end for
7: end for

(Figure 14 caption, continued:) PS-MBPO with λ ∈ (0, 1). MBPO adopts a point estimate of the policy, obtained by MAP inference; thus, the propagation of uncertainty from the dynamics model to the policy is blocked. For (b), it implicitly assumes that the dynamics model captures all the relevant properties of the data. For (c), we add a shortcut from the data directly to the policy, which is controlled by the value of λ. Hence, the policy can further utilize the information in the data that is not captured by the dynamics model.
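The loop structure of Algorithm 2 can be sketched as follows. This is a minimal sketch, not the paper's implementation: `fit_posterior`, `env_step`, `env_reset` and the stub policies are hypothetical placeholders, and the finite list returned by `fit_posterior` stands in for the finite-particle posterior q(π|D_E, λ).

```python
import random

def ps_mbpo(fit_posterior, env_step, env_reset, K, H, lam, rng):
    """Abstract PS-MBPO loop (Algorithm 2): fit q(pi | D_E, lam), sample
    one policy per episode, roll it out for H steps, grow the dataset."""
    D_E = []
    for k in range(K):
        policies = fit_posterior(D_E, lam)   # finite-particle posterior
        pi_k = rng.choice(policies)          # posterior sampling step
        s = env_reset()
        for _ in range(H):
            a = pi_k(s)
            s_next, r = env_step(s, a)
            D_E.append((s, a, r, s_next))    # new real-environment data
            s = s_next
    return D_E

# Stub instantiation: two candidate policies, a trivial integrator env.
rng = random.Random(0)
data = ps_mbpo(
    fit_posterior=lambda D, lam: [lambda s: 0, lambda s: 1],
    env_step=lambda s, a: (s + a, float(a)),
    env_reset=lambda: 0,
    K=2, H=3, lam=0.5, rng=rng,
)
```

Note that, unlike MBPO's point estimate, a fresh policy is drawn from the (approximate) posterior at every episode, which is what drives exploration.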

C.1 ALGORITHM DETAILS

In Algorithm 1, we approximate the posteriors over MDPs and policies, i.e., q(M|D_E) and q(π|M, D_E, λ), using deep ensembles, which can be regarded as a finite-particle approximation to the posterior. Specifically, q(M|D_E) is approximated by {M_{θ_n}}_{n=1}^N, and q(π|M_{θ_n}, D_E, λ) is approximated by {π_{φ_{n,m}}}_{m=1}^M for all n ∈ [N], where both M_{θ_n} and π_{φ_{n,m}} are implemented as multi-layer perceptrons (MLPs), with parameters θ_n trained on D_E and φ_{n,m} trained on the mixed dataset λD_E + (1-λ)D_M^{n,m}, respectively. By the mixed dataset λD_E + (1-λ)D_M^{n,m}, we mean that each data point in the training batch is sampled from the real data D_E with probability λ and from the fictitious data D_M^{n,m} with probability 1-λ.
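The mixed-dataset sampling described above amounts to a per-element Bernoulli(λ) choice between the real and fictitious buffers. A minimal sketch (the function name and toy buffers are our own illustration):

```python
import random

def sample_mixed_batch(D_E, D_M, lam, batch_size, rng):
    """Draw a training batch from the mixed dataset lam*D_E + (1-lam)*D_M:
    each element independently comes from the real data D_E with
    probability lam, otherwise from the fictitious data D_M."""
    return [rng.choice(D_E) if rng.random() < lam else rng.choice(D_M)
            for _ in range(batch_size)]

rng = random.Random(0)
real = [("real", i) for i in range(10)]
fake = [("fake", i) for i in range(10)]
batch = sample_mixed_batch(real, fake, lam=0.7, batch_size=10_000, rng=rng)
frac_real = sum(tag == "real" for tag, _ in batch) / len(batch)  # ~ 0.7
```

With λ = 0 this recovers training purely on model-generated data (as in MBPO's policy update), and with λ = 1 it ignores the model entirely.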

C.2 IMPLEMENTATION DETAILS

In this section, we provide additional details about our algorithm and experiments. We give a detailed description of our approach in Algorithm 1 and a visual illustration of its difference from MBPO in Figure 14. In terms of hyperparameters, our choices are mostly the same as those adopted in MBPO (Janner et al., 2019) and Pineda et al. (2021) for Ant, Halfcheetah, Hopper, Walker2D and Cartpole-swingup, and in Eysenbach et al. (2021) for Window-open-v2, which were sufficiently optimized by the authors for MBPO. Specifically, the hyperparameters of MBPO are directly adopted from https://github.com/facebookresearch/mbrl-lib for dense reward tasks. For sparse reward tasks, the hyperparameters are adopted from Eysenbach et al. (2021). The hyperparameters for our method on each task are reported in Table 1. We will also release our code for reproducing all the experiments. Next, we introduce the details of each task.

C.3 TASK DETAILS

Ant, Halfcheetah, Hopper and Walker2D. These tasks are taken from the official GitHub repository of MBPO (Janner et al., 2019), https://github.com/jannerm/mbpo. Window-open-v2. This task is based on the original Window-open-v2 in the Metaworld benchmark (Yu et al., 2020). The sparse reward is 0 only if the window is within 3 units of the open position, and -10 for all other positions.
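The sparse reward for Window-open-v2 described above can be written as a one-line function. This is a sketch of the stated rule only; the function name is ours, and treating "within 3 units" as an inclusive threshold is our assumption:

```python
def sparse_window_reward(window_pos, open_pos, threshold=3.0):
    """Sparse reward for the modified Window-open-v2 task: 0 when the
    window is within `threshold` units of the open position (inclusive,
    our assumption), -10 for all other positions."""
    return 0.0 if abs(window_pos - open_pos) <= threshold else -10.0
```

Under this reward, the agent receives no gradient signal until it nearly solves the task, which is what makes exploration critical on this benchmark.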

D FORCED EXPLORATION

Forced exploration was proposed in Phan et al. (2019) to improve approximate Thompson sampling for bandit problems. Without properly dealing with the approximate inference error, there will be an extra term in the regret that is linear in T, regardless of how small the error is. In their paper, they use the α-divergence for measuring the approximate inference error, defined as
D_α(P, Q) = (1 - ∫ p(x)^α q(x)^{1-α} dx) / (α(1-α)).
The α-divergence captures many divergences, including the forward KL divergence (α → 1), the backward KL divergence (α → 0), the Hellinger distance (α = 0.5) and the χ² divergence (α = 2). Different inference methods give error guarantees measured by the α-divergence with different α. We are interested in the error guarantee under the reverse KL divergence, i.e., α → 0, as ensemble sampling (Lu & Van Roy, 2017) provides error guarantees under the reverse KL divergence. Phan et al. (2019) prove that forced exploration can make the posterior concentrate and hence restore the sub-linear regret bound if the inference error is bounded by the α-divergence with α ≤ 0; the reverse KL divergence falls in this case. Specifically, forced exploration is a simple method that maintains a probability of random exploration, which decays as t, the number of online steps, grows. Though the above results only hold in the bandit setting and it is unclear whether they extend to reinforcement learning, we are interested in testing its empirical performance in RL. In our experiments, we consider the following exploration rate:
p_k(random explore = True) = Bern(τ/k),
where k is the index of the episode, and τ is the hyperparameter controlling the frequency of forced exploration. As k increases, the random exploration probability decreases. In our experiments, we consider τ ∈ {1, 5, 10}. All other settings are the same as in our experiments in the main paper. The results are presented in Figure 16.
We observe that forced exploration is mostly not helpful in our experiments, except on the Hopper task. Moreover, increasing τ usually makes the performance even worse. On the other hand, this may not be so surprising, as forced exploration is designed for approximate Thompson sampling in the bandit setting, and the result may not necessarily generalize to the RL setting. We leave the theoretical analysis to future work.
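The forced-exploration schedule above is straightforward to implement; a minimal sketch, with the probability clipped to 1 (our assumption, since τ/k exceeds 1 for small k when τ ∈ {5, 10}) and a hypothetical function name:

```python
import random

def should_force_explore(k, tau, rng):
    """Forced exploration: in episode k, take a random exploratory action
    with probability min(tau / k, 1); the rate decays as k grows."""
    return rng.random() < min(tau / k, 1.0)

# The exploration probability decays like tau / k:
probs = [min(5 / k, 1.0) for k in (1, 5, 10, 50)]  # [1.0, 1.0, 0.5, 0.1]
```

In an RL loop, a `True` result would replace the sampled policy's action with one drawn uniformly from the action space for that episode or step.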

E RANDOM FUNCTION PRIOR

The random function prior (RFP) is proposed in Osband et al. (2018) for improving uncertainty estimation. The prior network is chosen to model the uncertainty that does not come from the observed data. RFPs can also be viewed as a regularization in the output space. In contrast to weight-space regularization, RFPs make it easier to incorporate properties of the function to be learned (e.g., periodicity) as prior information. More importantly, when using a deep ensemble, incorporating the RFP is fairly simple: it modifies the original training objective ℓ(f_θ, D) by adding an extra regularization term, ℓ_RFP(f_θ, D) := ℓ(f_θ + βf_{θ_0}, D), where β is a scaling term adjusting the effect of the prior, and f_{θ_0} is the prior network, which is held fixed during training. Hence, we also conduct experiments with RFPs to investigate how they affect the learning of dynamics models. We vary the value of β in {0.1, 0.3, 1}. Nevertheless, one interesting observation is that both forced exploration and RFPs seem to help on Hopper, and their overall patterns on the three tasks are somewhat consistent. Therefore, it would be interesting to figure out whether there is a deeper connection between forced exploration and RFPs.
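The RFP construction f_θ + βf_{θ_0} can be illustrated on a toy linear model, where the "networks" are just weight vectors. This is a sketch of the additive-prior mechanism only, not of Osband et al.'s network architecture; all names and the linear parameterization are our own illustration:

```python
import numpy as np

def make_rfp_predictor(w_train, w_prior, beta):
    """Random function prior on a toy linear model: the prediction is
    f_theta(x) + beta * f_theta0(x), where the prior weights w_prior
    stay fixed and only w_train would be updated during training."""
    def predict(x):
        return w_train @ x + beta * (w_prior @ x)
    return predict

w_train = np.array([1.0, 2.0])
w_prior = np.array([0.5, -0.5])   # frozen prior parameters theta_0
f = make_rfp_predictor(w_train, w_prior, beta=2.0)
y = f(np.array([1.0, 0.0]))       # 1.0 + 2.0 * 0.5 = 2.0
```

In a deep ensemble, each member would get its own independently initialized frozen prior network, so disagreement between members persists away from the training data.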

F ADDITIONAL UMAP VISUALIZATIONS

We provide the UMAP visualization of the state embeddings of PS-MBPO and MBPO during training in Figure 18 and Figure 19. We observe that PS-MBPO mostly covers a broader range of the embedding space, and its pattern also evolves more rapidly than MBPO's.

G PROOFS

In this section, we present the proof of Theorem 1. The proof is inspired by the techniques in Russo & Van Roy (2016), with some additional modifications to extend the results from the bandit setting to the reinforcement learning setting.

G.1 PROOF OF THEOREM 1

Theorem 1 For K episodes, the Bayesian regret of a posterior sampling reinforcement learning algorithm A with any approximate posterior distribution q_k at episode k is upper bounded by
√(CK((HR_max)²/2) H(π⋆)) + 2HR_max Σ_{k=1}^K E[√(d_KL(q_k(π) ‖ p_k(π)))],
where H(π⋆) is the entropy of the prior distribution over optimal policies, i.e., p(π) = ∫ δ(π|M)p(M)dM, C is a problem-dependent constant, and d_KL(·‖·) is the KL divergence.

Proof: Recall the definition of the Bayesian regret,
BayesianRegret(T, A, p(M)) := E[Regret(T, A, M)] = E[Σ_{k=1}^K ∆_k].
Let H_k denote the history at the beginning of episode k. Then we can rewrite the Bayesian regret as
BayesianRegret(T, A, p(M)) = Σ_{k=1}^K E_{H_k}[E[∆_k | H_k]],
so we can bound each term E[∆_k | H_k] separately. For convenience, define E_k[∆_k] := E[∆_k | H_k]. Then, by Lemma 1, we can decompose it as E[∆_k | H_k] = G_k + D_k, where
G_k := ∫ √(q_k(π)p_k(π)) (E_k[V^M_{π,1}(s_1) | π⋆ = π] − E_k[V^M_{π,1}(s_1)]) dπ
and
D_k := ∫ (√(p_k(π)) − √(q_k(π))) (√(p_k(π)) E_k[V^M_{π,1}(s_1) | π⋆ = π] + √(q_k(π)) E_k[V^M_{π,1}(s_1)]) dπ.
It then remains to bound Σ_{k=1}^K E[G_k] and Σ_{k=1}^K E[D_k]. By Lemma 2, Σ_{k=1}^K E[D_k] ≤ 2HR_max Σ_{k=1}^K E[√(d_KL(q_k ‖ p_k))]. By Lemma 3, Σ_{k=1}^K E[G_k] ≤ √(CK((HR_max)²/2) H(π⋆)). Hence, the term D_k captures the regret incurred by the approximate inference error, and G_k captures the standard regret of Thompson sampling, which is of order Õ(√K). Combining the two, we finally arrive at the upper bound
BayesianRegret(T, A, p(M)) ≤ √(CK((HR_max)²/2) H(π⋆)) + 2HR_max Σ_{k=1}^K E[√(d_KL(q_k(π) ‖ p_k(π)))]. □

G.2 SUPPORTING LEMMAS

Lemma 1 For each k = 1, ..., K, we have
E[∆_k | H_k] = E[V^M_{π⋆,1}(s_1) − V^M_{π_k,1}(s_1) | H_k] := E_k[V^M_{π⋆,1}(s_1) − V^M_{π_k,1}(s_1)] = G_k + D_k, (43)
where G_k and D_k are defined as in the proof of Theorem 1.

Proof: Conditioning on the history H_k, we can write the conditional Bayesian regret as
E_k[V^M_{π⋆,1}(s_1) − V^M_{π_k,1}(s_1)] (46)
= ∫ p_k(π) E_k[V^M_{π,1}(s_1) | π⋆ = π] dπ − ∫ q_k(π) E_k[V^M_{π,1}(s_1) | π_k = π] dπ (47)
= ∫ p_k(π) E_k[V^M_{π,1}(s_1) | π⋆ = π] dπ − ∫ q_k(π) E_k[V^M_{π,1}(s_1)] dπ (48)
= G_k + D_k,
where the second equality holds because the value function is independent of the instantiation of the policy π_k given the history H_k. □

Lemma 2 We have Σ_{k=1}^K E[D_k] ≤ 2HR_max Σ_{k=1}^K E[√(d_KL(q_k ‖ p_k))].

Proof: Recall D_k,
D_k := ∫ (√(p_k(π)) − √(q_k(π))) (√(p_k(π)) E_k[V^M_{π,1}(s_1) | π⋆ = π] + √(q_k(π)) E_k[V^M_{π,1}(s_1)]) dπ. (51)
By the Cauchy–Schwarz inequality, we have
D_k ≤ √(∫ (√(p_k(π)) − √(q_k(π)))² dπ) · (√(∫ p_k(π) (E[V^M_{π,1}(s_1) | π⋆ = π])² dπ) + √(∫ q_k(π) (E_k[V^M_{π,1}(s_1)])² dπ)).
By the definition of the Hellinger distance d_H(·‖·) between two distributions, we have
D_k ≤ d_H(q_k ‖ p_k) (√(∫ p_k(π) E[(V^M_{π,1}(s_1))² | π⋆ = π] dπ) + √(∫ q_k(π) E_k[(V^M_{π,1}(s_1))²] dπ)). (53)
Since [d_H(·‖·)]² ≤ d_KL(·‖·) and V^M_{π,1} is a bounded random variable with HR_max as its upper bound, we have
D_k ≤ 2 d_H(q_k ‖ p_k) HR_max ≤ 2 √(d_KL(q_k ‖ p_k)) HR_max.
Hence, Σ_{k=1}^K E[D_k] ≤ 2HR_max Σ_{k=1}^K E[√(d_KL(q_k ‖ p_k))]. □

Lemma 3 We have Σ_{k=1}^K E[G_k] ≤ √(CK((HR_max)²/2) H(π⋆)).

Proof: Recall the definition of G_k,
G_k := ∫ √(q_k(π)p_k(π)) (E_k[V^M_{π,1}(s_1) | π⋆ = π] − E_k[V^M_{π,1}(s_1)]) dπ.
Since V^M_{π,1} (here we drop the dependency on s_1 for clarity) is a bounded random variable, it is ((HR_max)/2)-sub-Gaussian. Hence, by Lemma 4, the following holds:
|E_k[V^M_{π,1} | π⋆ = π] − E_k[V^M_{π,1}]| ≤ (HR_max/2) √(2 d_KL(p_k(V^M_{π,1} | π⋆ = π) ‖ p_k(V^M_{π,1}))).
This gives us
G_k ≤ ∫ √(q_k(π)p_k(π)) (HR_max/2) √(2 d_KL(p_k(V^M_{π,1} | π⋆ = π) ‖ p_k(V^M_{π,1}))) dπ.
We can further rewrite the KL divergence using the conditional mutual information I_k(·;·) (i.e., conditioning on the history H_k):
∫∫ q_k(π) p_k(π′) d_KL(p_k(V^M_{π,1} | π⋆ = π′) ‖ p_k(V^M_{π,1})) dπ dπ′ = ∫ q_k(π) I_k(π⋆; V^M_{π,1}) dπ.
When conditioning on the history H_k, the optimal policy π⋆ and M are independent of π_k; hence
∫ q_k(π) I_k(π⋆; V^M_{π,1}) dπ = ∫ q_k(π) I_k(π⋆; V^M_{π_k,1} | π_k = π) dπ = I_k(π⋆; V^M_{π_k,1} | π_k).
Since π⋆ is jointly independent of V^M_{π_k,1} and π_k when conditioning on the history H_k, we have I_k(π⋆; π_k) = 0, so
I_k(π⋆; V^M_{π_k,1} | π_k) = I_k(π⋆; V^M_{π_k,1} | π_k) + I_k(π⋆; π_k).
By the chain rule of mutual information, we finally get
I_k(π⋆; V^M_{π_k,1} | π_k) + I_k(π⋆; π_k) = I_k(π⋆; (V^M_{π_k,1}, π_k)).
Now, define the function g_k and the constant C as
g_k(π, π′) := √(q_k(π) p_k(π′)) (E_k[V^M_{π,1} | π⋆ = π′] − E_k[V^M_{π,1}]),
C := max_{k ∈ Z₊} (∫ g_k(π, π) dπ)² / ∫∫ [g_k(π, π′)]² dπ dπ′.
Thus, we further have
I_k(π⋆; (V^M_{π_k,1}, π_k)) ≥ (2/(HR_max)²) ∫∫ [g_k(π, π′)]² dπ dπ′.
On the other hand, we can rewrite G_k as G_k = ∫ g_k(π, π) dπ. (67) By rearranging the terms, we get
G_k² / I_k(π⋆; (π_k, V^M_{π_k,1})) ≤ ((HR_max)²/2) (∫ g_k(π, π) dπ)² / ∫∫ [g_k(π, π′)]² dπ dπ′ ≤ C (HR_max)²/2. (68)
Hence, G_k ≤ √(C ((HR_max)²/2) I_k(π⋆; (π_k, V^M_{π_k,1}))). Summing over k and applying the Cauchy–Schwarz inequality,
Σ_{k=1}^K E[G_k] ≤ √(CK((HR_max)²/2) E[Σ_{k=1}^K I_k(π⋆; (π_k, V^M_{π_k,1}))]) ≤ √(CK((HR_max)²/2) H(π⋆)),
where the last inequality holds because of the chain rule of mutual information:
E[Σ_{k=1}^K I_k(π⋆; (π_k, V^M_{π_k,1}))] = E[Σ_{k=1}^K I(π⋆; (π_k, V^M_{π_k,1}) | H_k)] (75)
= I(π⋆; (π_1, V^M_{π_1,1}, π_2, V^M_{π_2,1}, ..., π_K, V^M_{π_K,1})) (76)
= H(π⋆) − E[H(π⋆ | (π_1, V^M_{π_1,1}, π_2, V^M_{π_2,1}, ..., π_K, V^M_{π_K,1}))] (77)
≤ H(π⋆). (78) □

Remark 1 When the number of policies |Π| is finite and the value function V^M_{π,1} is linear with its parameter living in R^d, then C can be upper bounded by d, i.e., C ≤ d.

Lemma 4 (Russo & Van Roy (2016)) Suppose there is an H_k-measurable random variable η such that, for each π ∈ Π, V^M_{π,1} is an η-sub-Gaussian random variable when conditioned on H_k. Then for every π, π′ ∈ Π, the following holds with probability 1:
|E_k[V^M_{π,1} | π⋆ = π′] − E_k[V^M_{π,1}]| ≤ η √(2 d_KL(p_k(V^M_{π,1} | π⋆ = π′) ‖ p_k(V^M_{π,1}))).



For simplicity, we assume that there is only one optimal policy for each MDP.



Figure 1: Graphical models for (a) the standard and (b) our posterior over policies π. Differences are shown in red.

Figure 2: A comparison of cumulative regret for different λ.

Figure 3: We consider seven tasks from three benchmarks: OpenAI Gym, DM Control and Metaworld. These seven tasks cover both dense reward and sparse reward tasks.

Figure 4: Comparisons on four tasks with dense rewards. The shaded region denotes the one-standard error. The dashed green curve corresponds to the asymptotic performance of SAC at 3M steps. PS-MBPO improves over MBPO across all of the four tasks, and the improvement is more significant on Ant and Walker2D. OPS-MBPO achieves similar sample efficiency with PS-MBPO.

Figure 5: Comparisons on three tasks with sparse rewards. PS-MBPO improves over MBPO, and OPS-MBPO further improves over PS-MBPO in terms of sample efficiency.

Figure 8: Visualization of the visited state space of PS-MBPO (top left) and MBPO (top right) on Window-open-v2 at the initial stage of training. We also present two representative trajectories of PS-MBPO (bottom left) and MBPO (bottom right), which are four frames taken from the videos.

Figure 7: Average reward for varying number of dynamics model (N ) and policies (M ).

In Osband & Van Roy (2017), the bound was improved to Õ(H√(SAT)). For the continuous case, Osband & Van Roy (2014) provide the first regret bound Õ(√(d_K d_E T)) based on the eluder dimension d_E and the Kolmogorov dimension d_K. More recently, Fan & Ming (2021) study the regret bound for PSRL under a Gaussian process assumption, and obtain a regret bound of Õ(H^{3/2} d√T).

Figure 9: Visualization of the optimistic weights of the first 40K iterations (left) and during the entire training process (middle), and the reward curve (right) on Hopper. Each chart in the left and middle figures corresponds to the weights of each single policy.

Figure 10: Visualization of the optimistic weights of the first 100K iterations (left) and during the entire training process (middle), and the reward curve (right) on Cartpole-Swingup. Each chart in the left and middle figures corresponds to the weights of each single policy.

Figure 11: Ablation study on how the choice of the temperature affects the performance on dense reward tasks. All the experimental setups are the same as those in our main paper, except for the temperature term.

Figure 13: Ablation study on how the schedule of λ affects the performance of PS-MBPO. All the experimental setups are the same as in the main paper.

Figure 15: Comparisons on four tasks with dense rewards: Halfcheetah, Ant, Hopper and Walker2D. MBPO⋆ is the curve from the original paper by Janner et al. (2019). Note that the original implementation of MBPO uses 7 networks for the dynamics-model ensemble, whereas our implementation uses only 5. Still, our implementation mostly reproduces their results and is sometimes even better.

Figure 16: Experiments of forced exploration on Walker2d, Hopper and Window-open-v2. The shaded region denotes the one-standard error. τ = 0 is the one without forced exploration.

Figure 17: Experiments with random function prior networks (RFPs) on Walker2d, Hopper and Window-open-v2. The shaded region denotes the one-standard error. β = 0 corresponds to the one without using random function prior networks.

Figure 18: Visualization of the Umap embeddings of PS-MBPO and MBPO from consecutive training periods on Hopper, Ant and Halfcheetah (from top to bottom).

Figure 19: Visualization of the Umap embeddings of PS-MBPO and MBPO from consecutive training periods on Walker2d, Ball-in-Cup and Cartpole-swingup (from top to bottom).


Table 1: Hyperparameters for each task. "x → y over episodes a → b" denotes a piecewise-linear schedule, f(k) = ⌊min(max(x + (k − a)/(b − a) · (y − x), x), y)⌋. We use Ball, Cart, Cheetah, Walker and Window as abbreviations for Ball-in-Cup, Cartpole-Swingup, Halfcheetah, Walker2D and Window-open-v2 to make the table fit on the page.
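Assuming an increasing schedule (x ≤ y), the piecewise-linear function in Table 1's notation can be sketched as follows (the function name and the example values are our own illustration):

```python
import math

def schedule(k, x, y, a, b):
    """Piecewise-linear hyperparameter schedule "x -> y over episodes
    a -> b": constant at x before episode a, linear in between, constant
    at y after episode b, with the floor applied to the result."""
    return math.floor(min(max(x + (k - a) / (b - a) * (y - x), x), y))

vals = [schedule(k, 1, 5, 10, 20) for k in (0, 10, 15, 20, 30)]  # [1, 1, 3, 5, 5]
```

The min/max clamping implements the flat segments before episode a and after episode b; for a decreasing schedule (y < x) the clamps would need to be swapped.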

The results are reported in Figure 17. First, by properly choosing the value of β, RFPs slightly improve the performance on Hopper and do not affect the performance much on Walker2D. However, for Window-open-v2, RFPs hurt the performance considerably. We conjecture that this might be because our choice of the prior function is not suitable for the task: the reward in Window-open-v2 is sparse, but the RFPs do not induce sparsity in the predictions.

