POSTERIOR SAMPLING MODEL-BASED POLICY OPTIMIZATION UNDER APPROXIMATE INFERENCE

Anonymous

Abstract

Model-based reinforcement learning (MBRL) algorithms hold tremendous promise for improving sample efficiency in online RL. However, many popular existing MBRL algorithms cannot trade off exploration and exploitation properly. Posterior sampling reinforcement learning (PSRL) is a promising approach for automatically trading off exploration and exploitation, but its theoretical guarantees only hold under exact inference. In this paper, we show that adopting the same methodology as in exact PSRL can be suboptimal under approximate inference. Motivated by this analysis, we propose an improved factorization for the posterior distribution over policies, obtained by removing the conditional independence between the policy and the data given the model. Adopting this posterior factorization, we further propose a general algorithmic framework for PSRL under approximate inference, along with a practical instantiation of it. Empirically, our algorithm surpasses baseline methods by a significant margin on both dense-reward and sparse-reward tasks from the DeepMind control suite, OpenAI Gym, and Metaworld benchmarks.

1. INTRODUCTION

Model-based reinforcement learning has demonstrated great success in improving the sample efficiency of RL. However, many popular existing model-based algorithms (Kurutach et al., 2018; Chua et al., 2018; Janner et al., 2019) cannot trade off exploration and exploitation properly, and hence may perform poorly when exploration is crucial. To trade off exploration and exploitation, most existing algorithms fall into one of three categories: 1) optimism-based (Jaksch et al., 2010; Pacchiano et al., 2021; Curi et al., 2020); 2) posterior-sampling-based (Strens, 2000; Osband et al., 2013; 2018; Fan & Ming, 2021); and 3) information-directed sampling (Russo & Van Roy, 2014) approaches. As shown by Osband & Van Roy (2017), posterior sampling reinforcement learning (PSRL) can match the statistical efficiency (or regret bound) of optimism-based algorithms while enjoying better computational efficiency. Information-directed sampling methods can be more statistically efficient when faced with complex information structure (Russo & Van Roy, 2014), but require estimators of the mutual information, which are difficult to construct for high-dimensional random variables. Hence, in this paper, we focus on posterior sampling. For simplicity, we restrict attention to the episodic RL setting. Under the PSRL framework, one maintains a posterior p(M|D_E) over the Markov decision process (MDP) M given the observations D_E collected in the real environment E. At the beginning of each episode, an MDP is sampled from the posterior, and we then compute the optimal policy π(M) for the sampled model M. Equivalently, we can view this policy as a sample from a "degenerate" posterior over policies of the form p(π|D_E) = ∫ δ(π|M) p(M|D_E) dM, where δ(π|M) = δ(π − π(M)) is a Dirac delta distribution. This policy is then executed in the real environment to collect new data. Theoretically, such a simple strategy has been shown to achieve a Bayesian regret of Õ(√K) over K episodes (Osband et al., 2013; Osband & Van Roy, 2014).

However, these theoretical guarantees only hold under exact inference, i.e., when we have access to the true posterior over models p(M|D_E) and can compute the optimal policy for any sampled model, which is very unlikely in practice. In practice, one typically resorts to an approximate posterior q(M|D_E), e.g., learned by a neural network. The resulting policy is then sampled from q_δ(π|D_E) = ∫ δ(π|M) q(M|D_E) dM, where we replace p(M|D_E) with the approximate posterior q(M|D_E). At first glance, such a heuristic choice is natural, as it shares the same form as the true posterior p(π|D_E). However, in this paper, we prove that we can achieve lower regret by replacing the degenerate δ(π|M) with a non-degenerate distribution of the form q(π|M, D_E), which depends on the empirical data D_E as well as the model; this is necessary to compensate for the fact that the posterior over models q(M|D_E) may be suboptimal. By tuning the relative strength of the dependence of π on D_E and M, we can find a sweet spot between maximizing data efficiency and minimizing the effect of approximate inference error. Furthermore, such a decomposition is guaranteed to be no worse than the standard approach of using q_δ(π|D_E). Building on these results, we develop a generic framework for PSRL under approximate inference. To implement the method in practice, we combine deep ensembles (Lakshminarayanan et al., 2017) with Model-based Policy Optimization (MBPO) (Janner et al., 2019). We also propose two different sampling strategies for policy selection that exploit our posterior approximation. Empirically, our algorithm significantly outperforms the baselines on both dense-reward and sparse-reward tasks (Brockman et al., 2016; Tunyasuvunakool et al., 2020; Yu et al., 2020). Additionally, we conduct various ablation studies to provide a better understanding of our algorithm. In summary, our contributions are threefold.
1. We conduct a rigorous study of how approximate inference affects the Bayesian regret in PSRL, showing that adopting the same methodology as in exact PSRL may be suboptimal when the true posterior is unavailable (Section 2).
2. Motivated by our analysis, we develop a generic framework for PSRL under approximate inference, as well as a practical version of it based on deep ensembles and (optimistic) sampling approaches for the policies (Section 3).
3. We present empirical results on the DeepMind control suite, OpenAI Gym, and Metaworld benchmarks to demonstrate the efficacy of the proposed approach (Section 4).
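To make the two factorizations concrete, the following toy sketch contrasts them on a one-step decision problem (all names and the mixture rule are our own illustrative assumptions, not the paper's method). An ensemble of reward models stands in for the approximate posterior q(M|D_E); the standard degenerate sampler q_δ(π|D_E) plans greedily against one sampled model, while a non-degenerate q(π|M, D_E) also lets the executed policy depend on the real data, with a `trust` knob tuning the relative strength of the two dependences.

```python
import random

# Toy one-step decision problem: two actions, and an ensemble of reward
# models standing in for the approximate posterior q(M | D_E).
ensemble = [
    {"a0": 1.0, "a1": 0.2},  # model 1's predicted rewards per action
    {"a0": 0.3, "a1": 0.9},  # model 2's predicted rewards per action
]

def plan(model):
    """delta(pi | M): the greedy (optimal) policy for the sampled model."""
    return max(model, key=model.get)

def sample_policy_degenerate(rng):
    """Standard PSRL under approximate inference: pi ~ q_delta(pi | D_E).
    The policy depends on D_E only through the sampled model."""
    model = rng.choice(ensemble)  # M ~ q(M | D_E)
    return plan(model)

def sample_policy_nondegenerate(rng, data_policy, trust=0.5):
    """Proposed factorization: pi ~ q(pi | M, D_E). Here, a hypothetical
    mixture between the model-optimal policy and a policy fit directly on
    real data D_E (`data_policy`); `trust` tunes the dependence strength."""
    model = rng.choice(ensemble)
    return plan(model) if rng.random() < trust else data_policy

rng = random.Random(0)
pi_1 = sample_policy_degenerate(rng)
pi_2 = sample_policy_nondegenerate(rng, data_policy="a1")
```

Setting `trust=1.0` recovers the degenerate sampler exactly, so the non-degenerate family strictly contains the standard approach.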

2. PROBLEM STATEMENT AND THEORETICAL RESULTS

We start by introducing some notation and summarizing prior work, before presenting our new theoretical results, which form the basis for our algorithm.

Notation. We consider the finite-horizon episodic Markov decision process (MDP) problem, an instance of which we denote as M := {S, A, r_M, p_M, H, ρ}. For each instance M, S and A denote the sets of states and actions, respectively; r_M : S × A → [0, R_max] is the reward function; p_M is the transition distribution; H is the length of an episode; and ρ is the distribution of the initial state. We further define the value function of a policy π under MDP M at timestep i as

V^M_{π,i}(s) := E_{M,π}[ Σ_{t=i}^{H} r_M(s_t, a_t) | s_i = s ],

where s_{t+1} ∼ p_M(·|s_t, a_t) and a_t ∼ π(·|s_t). We define π⋆ as the optimal policy for M if V^M_{π⋆,i}(s) = max_π V^M_{π,i}(s) for all s ∈ S and i ∈ [1, H], where the initial state is sampled as s_1 ∼ ρ(s).

Regret. For a given MDP M, the regret is defined as the difference between the value of the optimal policy in hindsight and that of the policies actually executed by the algorithm A over K episodes:

Regret(T, A, M) := Σ_{k=1}^{K} ∫ ρ(s_1) [ V^M_{π⋆,1}(s_1) − V^M_{π_k,1}(s_1) ] ds_1,

where we denote the k-th summand by Δ_k.
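The finite-horizon quantities above can be computed exactly in the tabular case by backward induction. The sketch below (a minimal illustration under our own assumptions: a small random tabular MDP, deterministic policies indexed by timestep and state) computes V^M_{π,i}, the optimal policy π⋆, and a per-episode regret term Δ_k.

```python
import numpy as np

# Tabular finite-horizon MDP M = {S, A, r_M, p_M, H, rho} with random
# dynamics/rewards, purely for illustration.
S, A, H = 3, 2, 5
gen = np.random.default_rng(0)
P = gen.dirichlet(np.ones(S), size=(S, A))  # p_M(s' | s, a), shape (S, A, S)
R = gen.uniform(0.0, 1.0, size=(S, A))      # r_M(s, a) in [0, R_max=1]

def value(policy):
    """V^M_{pi,i}(s) for i = 1..H by backward induction.
    policy[i, s] is the action taken at timestep i in state s."""
    V = np.zeros((H + 1, S))
    for i in range(H - 1, -1, -1):
        for s in range(S):
            a = policy[i, s]
            V[i, s] = R[s, a] + P[s, a] @ V[i + 1]
    return V[:H]

def optimal():
    """pi* and V^M_{pi*,i} via the Bellman optimality backup."""
    V = np.zeros((H + 1, S))
    pi = np.zeros((H, S), dtype=int)
    for i in range(H - 1, -1, -1):
        Q = R + P @ V[i + 1]  # Q-values at timestep i, shape (S, A)
        pi[i] = Q.argmax(axis=1)
        V[i] = Q.max(axis=1)
    return pi, V[:H]

pi_star, V_star = optimal()
rho = np.ones(S) / S  # uniform initial-state distribution
arbitrary = np.zeros((H, S), dtype=int)  # some suboptimal executed policy
# Delta_k = E_{s1 ~ rho}[ V^M_{pi*,1}(s1) - V^M_{pi_k,1}(s1) ] >= 0
delta_k = rho @ (V_star[0] - value(arbitrary)[0])
```

Summing Δ_k over the K executed episodes yields Regret(T, A, M) as defined above.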



For simplicity, we assume that there is only one optimal policy for each MDP.



We define the cumulative reward obtained by policy π over H steps under model M as R_M(π) := E_{M,π}[ Σ_{t=1}^{H} r_M(s_t, a_t) ].
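When the model is only available as a simulator, R_M(π) can be estimated by Monte Carlo rollouts. A minimal sketch, where the `step(s, a, rng) -> (s', r)` interface and the toy chain model are our own illustrative assumptions:

```python
import random

def rollout_return(step, policy, s0, H, rng):
    """One Monte Carlo sample of R_M(pi) = E_{M,pi}[sum_{t=1}^H r_M(s_t, a_t)]
    for a model M exposed as a step(s, a, rng) -> (s_next, r) simulator."""
    s, total = s0, 0.0
    for _ in range(H):
        a = policy(s)
        s, r = step(s, a, rng)
        total += r
    return total

# Toy deterministic chain: taking action 1 moves right (capped at state 1),
# and being in state 1 yields reward 1.
def step(s, a, rng):
    s_next = min(s + a, 1)
    return s_next, float(s_next == 1)

rng = random.Random(0)
always_right = lambda s: 1
est = sum(rollout_return(step, always_right, 0, 3, rng) for _ in range(10)) / 10
```

Averaging over rollouts gives an unbiased estimate of R_M(π); here the chain is deterministic, so every rollout returns the full 3-step reward.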

