POSTERIOR SAMPLING MODEL-BASED POLICY OPTIMIZATION UNDER APPROXIMATE INFERENCE

Anonymous

Abstract

Model-based reinforcement learning (MBRL) algorithms hold tremendous promise for improving sample efficiency in online RL. However, many popular existing MBRL algorithms do not handle exploration and exploitation properly. Posterior sampling reinforcement learning (PSRL) is a promising approach for automatically trading off exploration and exploitation, but its theoretical guarantees hold only under exact inference. In this paper, we show that adopting the same methodology as in exact PSRL can be suboptimal under approximate inference. Motivated by this analysis, we propose an improved factorization for the posterior distribution over policies that removes the conditional independence between the policy and the data given the model. Building on this factorization, we further propose a general algorithmic framework for PSRL under approximate inference and a practical instantiation of it. Empirically, our algorithm surpasses baseline methods by a significant margin on both dense-reward and sparse-reward tasks from the DeepMind Control Suite, OpenAI Gym, and Meta-World benchmarks.

1. INTRODUCTION

Model-based reinforcement learning has demonstrated great success in improving the sample efficiency of RL. However, many existing popular model-based algorithms (Kurutach et al., 2018; Chua et al., 2018; Janner et al., 2019) cannot deal with exploration and exploitation properly, and hence may perform poorly when exploration is crucial. To trade off exploration and exploitation, most existing algorithms fall into one of three categories: 1) optimism-based (Jaksch et al., 2010; Pacchiano et al., 2021; Curi et al., 2020); 2) posterior-sampling-based (Strens, 2000; Osband et al., 2013; 2018; Fan & Ming, 2021); and 3) information-directed sampling (Russo & Van Roy, 2014) approaches. As shown by Osband & Van Roy (2017), posterior sampling reinforcement learning (PSRL) can match the statistical efficiency (or regret bound) of optimism-based algorithms while enjoying better computational efficiency. Information-directed sampling methods can be more statistically efficient when faced with complex information structure (Russo & Van Roy, 2014), but they require estimators for the mutual information, which are difficult to obtain for high-dimensional random variables. Hence, in this paper, we focus on posterior sampling. For simplicity, we restrict attention to the episodic RL setting.

Under the PSRL framework, one maintains a posterior p(M|D_E) over the Markov decision process (MDP) M given the observations D_E collected in the real environment E. At the beginning of each episode, an MDP M is sampled from the posterior, and we then compute the optimal policy π(M) for the sampled model M. Equivalently, we can view this policy as a sample from a "degenerate" posterior¹ over policies of the form p(π|D_E) = ∫ δ(π|M) p(M|D_E) dM, where δ(π|M) = δ(π − π(M)) is a Dirac delta distribution. This policy is then executed in the real environment to collect new data.
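As an illustration, the PSRL loop described above can be sketched in a small tabular setting with a Dirichlet posterior over transition probabilities. The chain MDP, horizon, episode count, and prior below are illustrative choices for exposition, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "true" environment E: a 3-state, 2-action chain MDP (illustrative only).
N_S, N_A, HORIZON = 3, 2, 10
P_true = np.array([[[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]],
                   [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]],
                   [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]]])
R = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 1.0]])  # reward only in the last state

def optimal_policy(P, R, horizon):
    """pi(M): finite-horizon value iteration on the sampled MDP M = (P, R)."""
    V = np.zeros(P.shape[0])
    for _ in range(horizon):
        Q = R + P @ V                       # Q[s, a] = R[s, a] + E_{s' ~ P} V[s']
        pi, V = Q.argmax(axis=1), Q.max(axis=1)
    return pi

counts = np.zeros((N_S, N_A, N_S))          # sufficient statistics of D_E
for episode in range(20):
    # 1) sample a model M from the (Dirichlet) posterior p(M | D_E)
    P_sample = np.array([[rng.dirichlet(counts[s, a] + 1.0)
                          for a in range(N_A)] for s in range(N_S)])
    # 2) compute and execute the sampled model's optimal policy pi(M)
    pi = optimal_policy(P_sample, R, HORIZON)
    s = 0
    for _ in range(HORIZON):
        a = pi[s]
        s_next = rng.choice(N_S, p=P_true[s, a])
        counts[s, a, s_next] += 1           # 3) update the posterior with new data
        s = s_next
```

Because the policy is greedy with respect to a fresh posterior sample each episode, exploration is driven entirely by posterior uncertainty, with no explicit bonus terms.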
Theoretically, such a simple strategy has been shown to achieve a Bayesian regret of Õ(√K) over K episodes (Osband et al., 2013; Osband & Van Roy, 2014). However, the theoretical guarantees only hold under exact inference, i.e., when we have access to the true posterior over models p(M|D_E) and can compute the optimal policy exactly, which is very unlikely in practice. A common heuristic approximation to PSRL (see, e.g., Fan & Ming (2021)) is to replace the posterior over models with some approximation, such as Bayesian linear regression on top of the representations

¹For simplicity, we assume that there is only one optimal policy for each MDP.
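As one concrete instance of such an approximation, a conjugate Bayesian linear regression posterior over model weights, built on fixed features ("representations"), can be sampled in closed form. The feature dimensions, noise variance, and prior variance below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sample_blr_weights(Phi, y, noise_var=0.1, prior_var=1.0, rng=None):
    """Draw one sample from the Gaussian posterior p(w | Phi, y) of Bayesian
    linear regression y = Phi w + noise, with prior w ~ N(0, prior_var * I).
    Phi holds fixed features ("representations") of the observed inputs."""
    rng = rng or np.random.default_rng()
    d = Phi.shape[1]
    precision = Phi.T @ Phi / noise_var + np.eye(d) / prior_var  # posterior precision
    cov = np.linalg.inv(precision)
    mean = cov @ (Phi.T @ y) / noise_var                         # posterior mean
    return rng.multivariate_normal(mean, cov)

# Usage: each sampled weight vector defines one plausible linear model head,
# playing the role of a sample from the approximate posterior over models.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 4))              # 50 observations, 4-dim representation
w_true = np.array([1.0, -2.0, 0.5, 0.0])
y = Phi @ w_true + 0.1 * rng.normal(size=50)
w_sample = sample_blr_weights(Phi, y, rng=rng)
```

The conjugacy makes sampling cheap, but the approximation quality then hinges on how well the fixed representation captures the true dynamics, which is exactly where approximate inference can depart from the exact-PSRL guarantees.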

