EFFICIENT EXPLORATION FOR MODEL-BASED REINFORCEMENT LEARNING WITH CONTINUOUS STATES AND ACTIONS

Abstract

Balancing exploration and exploitation is crucial in reinforcement learning (RL). In this paper, we study the model-based posterior sampling algorithm in continuous state-action spaces theoretically and empirically. First, we improve the regret bound: under the assumption that reward and transition functions can be modeled as Gaussian Processes with linear kernels, we develop a Bayesian regret bound of Õ(H^{3/2} d √T), where H is the episode length, d is the dimension of the state-action space, and T is the total number of time steps. Our bound extends to nonlinear cases as well: using linear kernels on a feature representation φ, the Bayesian regret bound becomes Õ(H^{3/2} d_φ √T), where d_φ is the dimension of the representation space. Moreover, we present MPC-PSRL, a model-based posterior sampling algorithm with model predictive control for action selection. To capture the uncertainty in models and realize posterior sampling, we use Bayesian linear regression on the penultimate layer (the feature representation layer φ) of neural networks. Empirical results show that our algorithm achieves the best sample efficiency in benchmark control tasks compared to prior model-based algorithms, and matches the asymptotic performance of model-free algorithms.

1. INTRODUCTION

In reinforcement learning (RL), an agent interacts with an unknown environment which is typically modeled as a Markov Decision Process (MDP). Efficient exploration has been one of the main challenges in RL: the agent is expected to balance between exploring unseen state-action pairs to gain more knowledge about the environment, and exploiting existing knowledge to optimize rewards in the presence of known data. To achieve efficient exploration, Bayesian reinforcement learning has been proposed, where the MDP itself is treated as a random variable with a prior distribution. This prior distribution over MDPs provides an initial uncertainty estimate of the environment, and generally comprises distributions over transition dynamics and reward functions. The epistemic uncertainty (subjective uncertainty due to limited data) in reinforcement learning can then be captured by posterior distributions given the data collected by the agent. Posterior sampling reinforcement learning (PSRL), motivated by Thompson sampling in bandit problems (Thompson, 1933), serves as a provably efficient algorithm in Bayesian settings. In PSRL, the agent maintains a posterior distribution over MDPs and, in each episode, follows an optimal policy with respect to a single MDP sampled from the posterior. Appealing results for PSRL in tabular RL have been presented by both model-based (Osband et al., 2013; Osband & Van Roy, 2017) and model-free approaches (Osband et al., 2019) in terms of the Bayesian regret. For H-horizon episodic RL, PSRL was proved to achieve a regret bound of Õ(H √(SAT)), where S and A denote the numbers of states and actions, respectively. However, in continuous state-action spaces S and A can be infinite, hence the above results do not apply. Although PSRL in continuous spaces has also been studied in episodic RL, existing results either provide no guarantee or suffer from an exponential dependence on H.
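As a reference point, the tabular PSRL loop analyzed in the works cited above can be sketched as follows. This is a minimal illustration with conjugate Dirichlet/Gaussian posteriors; the sizes, priors, and simulated environment are made up for the example and are not from any of the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 5, 2, 10  # made-up tabular sizes and horizon for illustration

# Conjugate posteriors: Dirichlet over transitions, Gaussian over mean rewards
trans_counts = np.ones((S, A, S))                   # Dirichlet(1, ..., 1) prior
rew_sum, rew_n = np.zeros((S, A)), np.ones((S, A))  # unit-variance Gaussian prior

def sample_mdp():
    # draw one full MDP from the current posterior
    P = np.array([[rng.dirichlet(trans_counts[s, a]) for a in range(A)]
                  for s in range(S)])
    R = rng.normal(rew_sum / rew_n, 1.0 / np.sqrt(rew_n))
    return P, R

def optimal_policy(P, R):
    # finite-horizon value iteration on the sampled MDP
    V = np.zeros(S)
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = R + P @ V            # Q[s, a] = R[s, a] + sum_s' P[s, a, s'] V[s']
        pi[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return pi

true_P, true_R = sample_mdp()    # stand-in for the unknown environment
for episode in range(20):
    P, R = sample_mdp()          # one posterior sample per episode ...
    pi = optimal_policy(P, R)    # ... act optimally w.r.t. that single sample
    s = 0
    for h in range(H):
        a = pi[h, s]
        s_next = rng.choice(S, p=true_P[s, a])
        r = true_R[s, a] + rng.normal(scale=0.1)
        trans_counts[s, a, s_next] += 1    # posterior updates from observed data
        rew_sum[s, a] += r
        rew_n[s, a] += 1
        s = s_next
```

The key PSRL step is that exactly one MDP is sampled per episode and the agent commits to its optimal policy for the whole episode; randomness in the posterior sample is what drives exploration.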
In this paper, we achieve the first Bayesian regret bound for posterior sampling algorithms in continuous state-action spaces that is near-optimal in T (i.e., √T) and polynomial in the episode length H. We will explain the limitations of previous works in Section 1.1, then summarize our approach and contributions in Section 1.2.

1.1. LIMITATIONS OF PREVIOUS BAYESIAN REGRET BOUNDS IN CONTINUOUS SPACES

The exponential order of H: In model-based settings, Osband & Van Roy (2014) derive a regret bound of Õ(σ_R √(d_K(R) d_E(R) T) + E[L*] σ_P √(d_K(P) d_E(P) T)), where L* is a global Lipschitz constant for the future value function defined in their eq. (3). However, L* depends on H: the difference between input states propagates over H steps, which results in an H-dependent term in the value function. The authors do not mention this dependency, so there is no clear dependence on H in their regret. Moreover, they use the Lipschitz constant of the underlying value function as an upper bound on L* in their corollaries, which yields an exponential order in H. Take their Corollary 2 on linear quadratic systems as an example: the regret bound is Õ(σ C λ_1 n^2 √T), where λ_1 is the largest eigenvalue of the matrix Q in the optimal value function V_1(s) = s^T Q s.[1] However, the largest eigenvalue of Q is actually exponential in H.[2] Even if we change the reward function from quadratic to linear, the Lipschitz constant of the optimal value function is still exponential in H.[3] Chowdhury & Gopalan (2019) maintain the assumption of this Lipschitz property, so a term E[L*] with no clear dependence on H appears in their regret; in their Corollary 2 on LQR, they follow the same steps as Osband & Van Roy (2014) and still retain a term with λ_1, which, as discussed, is actually exponential in H. Although Osband & Van Roy (2014) mention that system noise helps to smooth future values, they do not exploit this, even though the noise is assumed to be subgaussian; they directly use the Lipschitz continuity of the underlying value function in their analysis of LQR, and thus cannot avoid the exponential term in H. Chowdhury & Gopalan (2019) do not explore how system noise can improve the theoretical bound either. In model-free settings, Azizzadenesheli et al. (2018) develop a regret bound of Õ(d_φ √T) using a linear function approximator on top of the Q-network, where d_φ is the dimension of the feature representation of the state-action space, but their bound is still exponential in H, as mentioned in their paper.
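The exponential growth of the largest eigenvalue of (A^{H-1})^T P A^{H-1} in H can be checked numerically. The matrices below are an illustrative unstable system chosen for this sketch, not taken from the cited works:

```python
import numpy as np

# Illustrative unstable linear system (spectral radius 1.5) and quadratic cost
A = np.array([[1.5, 0.1],
              [0.0, 1.2]])
P = np.eye(2)

eigs = []
for H in range(1, 11):
    M = np.linalg.matrix_power(A, H - 1)
    # largest eigenvalue of (A^{H-1})^T P A^{H-1}, the dominant term in V_1
    eigs.append(np.linalg.eigvalsh(M.T @ P @ M).max())

growth = [eigs[i + 1] / eigs[i] for i in range(len(eigs) - 1)]
# the growth ratios approach rho(A)^2 = 2.25, i.e. exponential growth in H
```

Whenever A has spectral radius above 1, the consecutive ratios converge to rho(A)^2 > 1, so any bound that carries λ_1 (or the Lipschitz constant of V_1) inherits an exponential dependence on H.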

High dimensionality:

The eluder dimension of neural networks used in Osband & Van Roy (2014) can be infinite, and the information gain (Srinivas et al., 2012) used in Chowdhury & Gopalan (2019) is exponential in the state-action dimension d when nonlinear kernels, such as SE kernels, are used. Linear kernels avoid this, but they can only model linear functions, so the representation power is highly restricted if a polynomial order in d is desired.
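A GP with a linear kernel on a feature map φ(s, a) is equivalent to Bayesian linear regression on those features, which is what makes the posterior tractable. The sketch below shows the posterior update and the sampling step; the feature map, noise variance, and data are hypothetical stand-ins for illustration, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 0.1  # observation-noise variance (assumed for the example)

def phi(x):
    # hypothetical fixed feature map (stand-in for a network's penultimate layer)
    return np.concatenate([x, x ** 2])

d_phi = 4  # dimension of phi for 2-D inputs

# GP with linear kernel on phi == Bayesian linear regression, prior w ~ N(0, I)
Lam = np.eye(d_phi)   # posterior precision
b = np.zeros(d_phi)   # precision-weighted mean

def update(x, y):
    global Lam, b
    f = phi(x)
    Lam = Lam + np.outer(f, f) / sigma2
    b = b + f * y / sigma2

def sample_weights():
    # draw one plausible model from the posterior (the posterior-sampling step)
    Sigma = np.linalg.inv(Lam)
    return rng.multivariate_normal(Sigma @ b, Sigma)

# fit on data from a hypothetical target that is linear in phi
w_true = np.array([1.0, -0.5, 0.3, 0.2])
for _ in range(200):
    x = rng.normal(size=2)
    update(x, phi(x) @ w_true + rng.normal(scale=np.sqrt(sigma2)))

w_sample = sample_weights()  # one posterior sample, e.g. per episode in PSRL
```

The rank-one precision updates cost O(d_φ²) per data point, so the model scales with the feature dimension d_φ rather than with the size of the state-action space.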

1.2. OUR APPROACH AND MAIN CONTRIBUTIONS

To further improve the regret bound for PSRL in continuous spaces, especially with explicit dependence on H, we study model-based posterior sampling algorithms in episodic RL. We assume that rewards and transitions can be modeled as Gaussian Processes with linear kernels, and extend the assumption to non-linear settings by utilizing features extracted by neural networks. For the linear case, we develop a Bayesian regret bound of Õ(H^{3/2} d √T). Using the feature embedding technique mentioned in Yang & Wang (2019), we derive a bound of Õ(H^{3/2} d_φ √T). This is the best-known Bayesian regret for posterior sampling algorithms in continuous state-action spaces, and it also matches the best-known frequentist regret (Zanette et al. (2020), discussed in Section 2). Explicitly dependent on d, H, and T, our result achieves a significant improvement in the Bayesian regret of PSRL algorithms compared to previous works:

1. We significantly improve the order of H to polynomial: in our analysis, we use the property of subgaussian noise, which is already assumed in Osband & Van Roy (2014) and Chowdhury & Gopalan (2019), to develop a bound with a clear polynomial dependence on H, without assuming Lipschitz continuity of the underlying value function. More specifically, we prove Lemma 1, and use

[1] V_1 denotes the value function counting from step 1 to H within an episode, s is the initial state, the reward at the i-th step is r_i = s_i^T P s_i + a_i^T R a_i + ε_{R,i}, and the state at the (i+1)-th step is s_{i+1} = A s_i + B a_i + ε_{P,i}, i ∈ [H].
[2] Recall that by the Bellman equation, V_i(s_i) = min_{a_i} s_i^T P s_i + a_i^T R a_i + ε_{R,i} + V_{i+1}(A s_i + B a_i + ε_{P,i}), with V_{H+1}(s) = 0. Thus V_1(s) contains a term (A^{H-1} s)^T P (A^{H-1} s), and the largest eigenvalue of (A^{H-1})^T P A^{H-1} is exponential in H.
[3] For example, if r_i = s_i^T P + a_i^T R + ε_{R,i}, there would still be a term (A^{H-1} s)^T P in V_1(s).

