EFFICIENT EXPLORATION FOR MODEL-BASED REINFORCEMENT LEARNING WITH CONTINUOUS STATES AND ACTIONS

Abstract

Balancing exploration and exploitation is crucial in reinforcement learning (RL). In this paper, we study the model-based posterior sampling algorithm in continuous state-action spaces theoretically and empirically. First, we improve the regret bound: under the assumption that reward and transition functions can be modeled as Gaussian processes with linear kernels, we develop a Bayesian regret bound of Õ(H^{3/2} d √T), where H is the episode length, d is the dimension of the state-action space, and T is the total number of time steps. Our bound can be extended to nonlinear cases as well: using linear kernels on the feature representation φ, the Bayesian regret bound becomes Õ(H^{3/2} d_φ √T), where d_φ is the dimension of the representation space. Moreover, we present MPC-PSRL, a model-based posterior sampling algorithm with model predictive control for action selection. To capture the uncertainty in models and realize posterior sampling, we use Bayesian linear regression on the penultimate layer (the feature representation layer φ) of neural networks. Empirical results show that our algorithm achieves the best sample efficiency in benchmark control tasks compared to prior model-based algorithms, and matches the asymptotic performance of model-free algorithms.
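The uncertainty-capturing step described above can be sketched as Bayesian linear regression on fixed features. This is an illustrative toy only: the feature matrix here is random, standing in for the learned penultimate-layer map φ(s, a), and all variable names are assumptions not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for the learned feature map phi(s, a):
# n visited (state, action) pairs, each mapped to a d_phi-dim feature.
n, d_phi = 50, 4
Phi = rng.normal(size=(n, d_phi))            # feature matrix of visited pairs
w_true = rng.normal(size=d_phi)              # unknown linear weights
y = Phi @ w_true + 0.1 * rng.normal(size=n)  # noisy observed targets (e.g. rewards)

# Conjugate Bayesian linear regression: prior w ~ N(0, lam^{-1} I),
# Gaussian noise with variance sigma2.
sigma2, lam = 0.1**2, 1.0
Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + lam * np.eye(d_phi))  # posterior covariance
mu = Sigma @ Phi.T @ y / sigma2                                    # posterior mean

# Posterior sampling: draw one weight vector from the posterior; a PSRL-style
# agent would plan with this sampled model for the next episode.
w_sample = rng.multivariate_normal(mu, Sigma)
```

The posterior covariance Σ shrinks along directions of feature space that have been visited often, so sampled weights (and hence sampled models) disagree mostly on unexplored directions, which is what drives directed exploration.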

1. INTRODUCTION

In reinforcement learning (RL), an agent interacts with an unknown environment which is typically modeled as a Markov decision process (MDP). Efficient exploration has been one of the main challenges in RL: the agent is expected to balance between exploring unseen state-action pairs to gain more knowledge about the environment, and exploiting existing knowledge to optimize rewards in the presence of known data. To achieve efficient exploration, Bayesian reinforcement learning has been proposed, where the MDP itself is treated as a random variable with a prior distribution. This prior distribution over MDPs provides an initial uncertainty estimate of the environment, and generally comprises distributions over transition dynamics and reward functions. The epistemic uncertainty (subjective uncertainty due to limited data) in reinforcement learning can be captured by posterior distributions given the data collected by the agent. Posterior sampling reinforcement learning (PSRL), motivated by Thompson sampling in bandit problems (Thompson, 1933), serves as a provably efficient algorithm under Bayesian settings. In PSRL, the agent maintains a posterior distribution over MDPs and, in each episode, follows an optimal policy with respect to a single MDP sampled from the posterior. Appealing results for PSRL in tabular RL were presented by both model-based (Osband et al., 2013; Osband & Van Roy, 2017) and model-free approaches (Osband et al., 2019) in terms of the Bayesian regret. For H-horizon episodic RL, PSRL was proved to achieve a regret bound of Õ(H √(SAT)), where S and A denote the number of states and actions, respectively. However, in continuous state-action spaces S and A can be infinite, hence the above results do not apply. Although PSRL in continuous spaces has also been studied in episodic RL, existing results either provide no guarantee or suffer from an exponential dependence on H.
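The PSRL loop described above — sample one MDP from the posterior, plan optimally in it, act for an episode, update the posterior — can be sketched in the tabular setting where those earlier regret bounds were proved. This is a minimal illustration, not the paper's continuous-space algorithm; the Dirichlet/Gaussian posteriors and all sizes are assumptions chosen for conjugacy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tabular MDP: S states, A actions, horizon H.
S, A, H = 5, 2, 10

# Conjugate posteriors maintained from counts:
# Dirichlet over transition rows, Gaussian-style estimate for mean rewards.
trans_counts = np.ones((S, A, S))   # Dirichlet(1, ..., 1) prior
reward_sum = np.zeros((S, A))
reward_cnt = np.ones((S, A))

def sample_mdp():
    """Thompson-sampling step: draw one MDP from the current posterior."""
    P = np.array([[rng.dirichlet(trans_counts[s, a]) for a in range(A)]
                  for s in range(S)])                        # (S, A, S)
    R = rng.normal(reward_sum / reward_cnt, 1.0 / np.sqrt(reward_cnt))
    return P, R

def plan(P, R):
    """Finite-horizon value iteration: the optimal policy for the sampled MDP."""
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = R + P @ V                # (S, A): reward plus expected next value
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy
```

Between episodes, the agent would increment `trans_counts[s, a, s']` and update `reward_sum`/`reward_cnt` with the observed transitions and rewards, so the sampled MDPs concentrate on the true one as data accumulates.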
In this paper, we achieve the first Bayesian regret bound for posterior sampling algorithms that is near-optimal in T (i.e., √T) and

