EFFICIENT EXPLORATION FOR MODEL-BASED REINFORCEMENT LEARNING WITH CONTINUOUS STATES AND ACTIONS

Abstract

Balancing exploration and exploitation is crucial in reinforcement learning (RL). In this paper, we study the model-based posterior sampling algorithm in continuous state-action spaces theoretically and empirically. First, we improve the regret bound: under the assumption that reward and transition functions can be modeled as Gaussian Processes with linear kernels, we develop a Bayesian regret bound of Õ(H^{3/2} d √T), where H is the episode length, d is the dimension of the state-action space, and T is the total number of time steps. Our bound extends to nonlinear cases as well: using linear kernels on a feature representation φ, the Bayesian regret bound becomes Õ(H^{3/2} d_φ √T), where d_φ is the dimension of the representation space. Moreover, we present MPC-PSRL, a model-based posterior sampling algorithm with model predictive control for action selection. To capture the uncertainty in models and realize posterior sampling, we use Bayesian linear regression on the penultimate layer (the feature representation layer φ) of neural networks. Empirical results show that our algorithm achieves the best sample efficiency in benchmark control tasks compared to prior model-based algorithms, and matches the asymptotic performance of model-free algorithms.

1. INTRODUCTION

In reinforcement learning (RL), an agent interacts with an unknown environment, typically modeled as a Markov Decision Process (MDP). Efficient exploration has been one of the main challenges in RL: the agent is expected to balance exploring unseen state-action pairs to gain more knowledge about the environment against exploiting existing knowledge to optimize rewards. To achieve efficient exploration, Bayesian reinforcement learning treats the MDP itself as a random variable with a prior distribution. This prior provides an initial uncertainty estimate of the environment, and generally comprises distributions over transition dynamics and reward functions. The epistemic uncertainty (subjective uncertainty due to limited data) can then be captured by posterior distributions given the data collected by the agent. Posterior sampling reinforcement learning (PSRL), motivated by Thompson sampling in bandit problems (Thompson, 1933), serves as a provably efficient algorithm in the Bayesian setting. In PSRL, the agent maintains a posterior distribution over the MDP and, in each episode, follows an optimal policy with respect to a single MDP sampled from this posterior. Appealing results for PSRL in tabular RL have been presented in terms of the Bayesian regret by both model-based (Osband et al., 2013; Osband & Van Roy, 2017) and model-free approaches (Osband et al., 2019). For H-horizon episodic RL, PSRL was proved to achieve a regret bound of Õ(H√(SAT)), where S and A denote the numbers of states and actions, respectively. However, in continuous state-action spaces, S and A can be infinite, so these results do not apply. Although PSRL in continuous spaces has also been studied in episodic RL, existing results either provide no guarantee or suffer from an exponential dependence on H.
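The sample-then-act idea that PSRL lifts from bandits to MDPs can be illustrated with standard Thompson sampling for a Bernoulli bandit. The sketch below is our own minimal illustration (function name and seeding are assumptions, not code from this paper): sample from each arm's posterior, act greedily with respect to the samples, update the posterior of the played arm.

```python
import random

def thompson_sampling(true_means, n_rounds, seed=0):
    """Thompson sampling for a Bernoulli bandit with Beta(1, 1) priors.

    Each round: draw one sample from every arm's Beta posterior, play the
    arm whose sample is largest, then update that arm's posterior with the
    observed 0/1 reward. PSRL applies the same recipe with a posterior over
    whole MDPs in place of per-arm Beta posteriors.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    alpha = [1.0] * n_arms  # posterior successes + 1
    beta = [1.0] * n_arms   # posterior failures + 1
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(n_arms)]
        arm = max(range(n_arms), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls
```

Over time the posterior of the best arm concentrates and it is pulled almost exclusively, which is exactly the exploration-exploitation balance discussed above.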
In this paper, we achieve the first Bayesian regret bound for posterior sampling algorithms that is near-optimal in T (i.e., √T) and polynomial in the episode length H for continuous state-action spaces. We explain the limitations of previous works in Section 1.1, then summarize our approach and contributions in Section 1.2.

1.1. LIMITATIONS OF PREVIOUS BAYESIAN REGRET BOUNDS IN CONTINUOUS SPACES

The exponential order of H: In model-based settings, Osband & Van Roy (2014) derive a regret bound of Õ(σ_R √(d_K(R) d_E(R) T) + E[L*] σ_P √(d_K(P) d_E(P) T)), where L* is a global Lipschitz constant for the future value function defined in their eq. (3). However, L* depends on H: a difference between input states propagates over H steps, which yields an H-dependent term in the value function. The authors do not mention this dependency, so there is no clear dependence on H in their regret. Moreover, they use the Lipschitz constant of the underlying value function as an upper bound on L* in their corollaries, which yields an exponential order in H. Take their Corollary 2 on linear quadratic systems as an example: the regret bound is Õ(σCλ_1 n² √T), where λ_1 is the largest eigenvalue of the matrix Q in the optimal value function V_1(s) = s^T Q s.¹ However, the largest eigenvalue of Q is actually exponential in H.² Even if we change the reward function from quadratic to linear, the Lipschitz constant of the optimal value function is still exponential in H.³ Chowdhury & Gopalan (2019) keep this Lipschitz assumption, so there remains a term E[L*] with no clear dependence on H in their regret; in their Corollary 2 on LQR, they follow the same steps as Osband & Van Roy (2014) and still retain a term with λ_1, which, as discussed, is exponential in H. Although Osband & Van Roy (2014) mention that system noise helps to smooth future values, they do not exploit this, even though the noise is assumed to be sub-Gaussian; they directly use the Lipschitz continuity of the underlying function in the analysis of LQR and thus cannot avoid the exponential term in H. Chowdhury & Gopalan (2019) do not explore how the system noise can improve the theoretical bound either. In model-free settings, Azizzadenesheli et al. (2018) develop a regret bound of Õ(d_φ √T) using a linear function approximator in the Q-network, where d_φ is the dimension of the feature representation of the state-action space, but their bound is still exponential in H, as mentioned in their paper.

High dimensionality:

The eluder dimension of neural networks in Osband & Van Roy (2014) can be infinite, and the information gain (Srinivas et al., 2012) used in Chowdhury & Gopalan (2019) is exponential in the state-action space dimension d if nonlinear kernels, such as squared-exponential (SE) kernels, are used. However, linear kernels can only model linear functions, so their representation power is highly restricted if polynomial dependence on d is desired.

1.2. OUR APPROACH AND MAIN CONTRIBUTIONS

To further improve the regret bound for PSRL in continuous spaces, especially with explicit dependency on H, we study model-based posterior sampling algorithms in episodic RL. We assume that rewards and transitions can be modeled as Gaussian Processes with linear kernels, and extend this assumption to nonlinear settings using features extracted by neural networks. For the linear case, we develop a Bayesian regret bound of Õ(H^{3/2} d √T). Using the feature embedding technique of Yang & Wang (2019), we derive a bound of Õ(H^{3/2} d_φ √T). Our Bayesian regret is the best-known Bayesian regret for posterior sampling algorithms in continuous state-action spaces, and it also matches the best-known frequentist regret (Zanette et al. (2020), discussed in Section 2). With explicit dependence on d, H, and T, our result significantly improves the Bayesian regret of PSRL algorithms compared to previous works:

1. We significantly improve the order of H to polynomial: in our analysis, we use the property of sub-Gaussian noise, which is already assumed in Osband & Van Roy (2014) and Chowdhury & Gopalan (2019), to develop a bound with clear polynomial dependency on H, without assuming Lipschitz continuity of the underlying value function. More specifically, we prove Lemma 1 and use it to control the difference between the sampled and true transition distributions.

¹ V_1 denotes the value function counting from step 1 to H within an episode, s is the initial state, the reward at the i-th step is r_i = s_i^T P s_i + a_i^T R a_i + ε_{R,i}, and the state at the (i+1)-th step is s_{i+1} = A s_i + B a_i + ε_{P,i}, i ∈ [H].
² Recall that by the Bellman equation, V_i(s_i) = min_{a_i} s_i^T P s_i + a_i^T R a_i + ε_{R,i} + V_{i+1}(A s_i + B a_i + ε_{P,i}), with V_{H+1}(s) = 0. Thus V_1(s) contains a term (A^{H-1} s)^T P (A^{H-1} s), and the largest eigenvalue of the matrix (A^{H-1})^T P A^{H-1} is exponential in H.
³ For example, if r_i = s_i^T P + a_i^T R + ε_{R,i}, V_1(s) would still contain a term involving A^{H-1} s, whose Lipschitz constant is exponential in H.

2. In model-based settings, the best-known frequentist regret bound is also of order Õ(√T); while our method is likewise model-based, we achieve a tighter regret bound when compared in the Bayesian view. In model-free settings, Jin et al. (2020) developed a bound of Õ(H^{3/2} d_φ^{3/2} √T). Zanette et al. (2020) further improved the regret to Õ(H^{3/2} d_φ √T) with their proposed algorithm ELEANOR, which achieves the best-known frequentist bound in model-free settings; they showed it is unimprovable via a lower bound established in the bandit literature.
Although our regret bound is derived in model-based settings, it matches their bound with the same order of H, d_φ, and T in the Bayesian view. Moreover, their algorithm involves optimization over all MDPs in a confidence set, and thus can be computationally prohibitive. Our method is computationally tractable, as it is much easier to optimize a single sampled MDP, while matching their regret bound in the Bayesian view.

3.1. PROBLEM FORMULATION

We model an episodic finite-horizon Markov Decision Process (MDP) M as {S, A, R_M, P_M, H, σ_r, σ_f, R_max, ρ}, where S ⊂ R^{d_s} and A ⊂ R^{d_a} denote the state and action spaces, respectively. Each episode of length H has an initial state distribution ρ. At time step i ∈ [1, H] within an episode, the agent observes s_i ∈ S, selects a_i ∈ A, receives a noisy reward r_i ∼ R_M(s_i, a_i), and transitions to a noisy new state s_{i+1} ∼ P_M(s_i, a_i). More specifically, r_i = r_M(s_i, a_i) + ε_r and s_{i+1} = f_M(s_i, a_i) + ε_f, where ε_r ∼ N(0, σ_r²) and ε_f ∼ N(0, σ_f² I_{d_s}). The variances σ_r² and σ_f² are fixed to control the noise level. Without loss of generality, we assume the expected reward the agent receives at a single step is bounded: |r_M(s, a)| ≤ R_max, ∀s ∈ S, a ∈ A. Let µ : S → A be a deterministic policy. We define the value function for state s at time step i under policy µ as V^M_{µ,i}(s) = E[Σ_{j=i}^H r_M(s_j, a_j) | s_i = s], where s_{j+1} ∼ P_M(s_j, a_j) and a_j = µ(s_j). With the bounded expected reward, we have |V(s)| ≤ H R_max, ∀s. We use M* to denote the real unknown MDP, which includes R* and P*, and M* itself is treated as a random variable. Thus we can treat the real noiseless reward function r* and transition function f* as random processes as well. In the posterior sampling algorithm π_PS, M_k is a random sample from the posterior distribution of the real unknown MDP M* in the k-th episode, which includes the posterior samples R_k and P_k, given the history prior to the k-th episode: H_k := {s_{1,1}, a_{1,1}, r_{1,1}, ..., s_{k-1,H}, a_{k-1,H}, r_{k-1,H}}, where s_{k,i}, a_{k,i}, and r_{k,i} denote the state, action, and reward at time step i in episode k. We define the optimal policy under M as µ^M ∈ argmax_µ V^M_{µ,i}(s) for all s ∈ S and i ∈ [H]. In particular, µ* denotes the optimal policy under M* and µ_k the optimal policy under M_k.
Let ∆_k denote the regret over the k-th episode:

∆_k = ∫ ρ(s_1) (V^{M*}_{µ*,1}(s_1) − V^{M*}_{µ_k,1}(s_1)) ds_1   (1)

Then we can express the regret of π_ps up to time step T as Regret(T, π_ps, M*) := Σ_{k=1}^{⌈T/H⌉} ∆_k. Let BayesRegret(T, π_ps, φ) denote the Bayesian regret of π_ps as defined in Osband & Van Roy (2017), where φ is the prior distribution of M*: BayesRegret(T, π_ps, φ) = E[Regret(T, π_ps, M*)].

3.2. ASSUMPTIONS

Generally, we consider modeling an unknown target function g : R^d → R. We are given a set of noisy samples y = [y_1, ..., y_T]^T at points X = [x_1, ..., x_T]^T, X ⊂ D, where D is compact and convex, and y_i = g(x_i) + ε_i with ε_i ∼ N(0, σ²) i.i.d. Gaussian noise for all i ∈ {1, ..., T}. We model g as a sample from a Gaussian Process GP(µ(x), K(x, x')), specified by the mean function µ(x) = E[g(x)] and the covariance (kernel) function K(x, x') = E[(g(x) − µ(x))(g(x') − µ(x'))]. Let the prior distribution without any data be GP(0, K(x, x')). Then the posterior distribution over g given X and y is also a GP with mean µ_T(x), covariance K_T(x, x'), and variance σ_T²(x):

µ_T(x) = K(x, X)(K(X, X) + σ²I)^{-1} y,
K_T(x, x') = K(x, x') − K(X, x)^T (K(X, X) + σ²I)^{-1} K(X, x'),
σ_T²(x) = K_T(x, x),

where K(X, x) = [K(x_1, x), ..., K(x_T, x)]^T and K(X, X) = [K(x_i, x_j)]_{1≤i≤T, 1≤j≤T}. We model the reward function r_M as a Gaussian Process with noise variance σ_r². For the transition model, we treat each dimension independently: each f^i(s, a), i = 1, ..., d_s, is modeled independently as above, with the same noise level σ_f² in each dimension. This matches our formulation of the RL setting. Since the posterior covariance depends only on the inputs rather than the target values, the distributions of the f^i(s, a) share the same covariance matrix and differ only in the mean function.
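The posterior formulas above can be checked numerically, along with the linear-kernel/Bayesian-linear-regression equivalence used later in the paper. The sketch below is our own NumPy illustration (function names are assumptions): the GP posterior with kernel K(x, x') = x^T x' coincides with Bayesian linear regression under the prior w ∼ N(0, I).

```python
import numpy as np

def gp_posterior(K, X, y, x_star, sigma2):
    """GP posterior mean and variance at x_star, using the formulas above."""
    KXX = K(X, X) + sigma2 * np.eye(len(X))
    kx = K(X, x_star[None, :]).ravel()            # K(X, x*)
    mean = kx @ np.linalg.solve(KXX, y)
    var = K(x_star[None, :], x_star[None, :])[0, 0] - kx @ np.linalg.solve(KXX, kx)
    return mean, var

def blr_posterior(X, y, x_star, sigma2):
    """Bayesian linear regression with prior w ~ N(0, I): posterior
    predictive mean and variance of w^T x_star (noise excluded)."""
    d = X.shape[1]
    A = X.T @ X / sigma2 + np.eye(d)              # posterior precision of w
    w_mean = np.linalg.solve(A, X.T @ y) / sigma2
    return x_star @ w_mean, x_star @ np.linalg.solve(A, x_star)

# Linear kernel: K(x, x') = x^T x'.
linear_kernel = lambda A, B: A @ B.T
```

Via the Woodbury identity, the two routes give identical posteriors; the GP form costs O(T³) in the number of data points, the regression form O(d³) in the input dimension.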

4.1. LINEAR CASE

Theorem 1 In the RL problem formulated in Section 3.1, under the assumptions of Section 3.2 with linear kernels, we have BayesRegret(T, π_ps, φ) = Õ(H^{3/2} d √T), where d is the dimension of the state-action space, H is the episode length, and T is the total number of time steps.

Proof The regret in episode k can be rearranged as:

∆_k = ∫ ρ(s_1) [(V^{M*}_{µ*,1}(s_1) − V^{M_k}_{µ_k,1}(s_1)) + (V^{M_k}_{µ_k,1}(s_1) − V^{M*}_{µ_k,1}(s_1))] ds_1   (4)

Note that, conditioned on the history H_k for any k, M_k and M* are identically distributed. Osband & Van Roy (2014) showed that the first difference V^{M*}_{µ*,1} − V^{M_k}_{µ_k,1} is therefore zero in expectation, and that only the second part of the regret decomposition needs to be bounded when deriving the Bayesian regret of PSRL. Thus we can focus on the policy µ_k, the sampled MDP M_k, and the real environment data generated by M*. For clarity, we abbreviate the value function V^{M_k}_{µ_k,1} as V^k_{k,1} and V^{M*}_{µ_k,1} as V^*_{k,1}. It suffices to derive bounds for an arbitrary initial state s_1, as the regret bound then holds after integrating over the initial distribution ρ(s_1). We can rewrite the regret from concentration via the Bellman operator (see Section 5.1 in Osband et al. (2013)):

E[∆̃_k | H_k] := E[V^k_{k,1}(s_1) − V^*_{k,1}(s_1) | H_k]
= E[r_k(s_1, a_1) − r*(s_1, a_1) + ∫ P_k(s'|s_1, a_1) V^k_{k,2}(s') ds' − ∫ P*(s'|s_1, a_1) V^*_{k,2}(s') ds' | H_k]
= E[Σ_{i=1}^H (r_k(s_i, a_i) − r*(s_i, a_i)) + Σ_{i=1}^H ∫ (P_k(s'|s_i, a_i) − P*(s'|s_i, a_i)) V^k_{k,i+1}(s') ds' | H_k]
= E[∆̃_k(r) + ∆̃_k(f) | H_k],

where a_i = µ_k(s_i), s_{i+1} ∼ P*(s_{i+1}|s_i, a_i), ∆̃_k(r) = Σ_{i=1}^H (r_k(s_i, a_i) − r*(s_i, a_i)), and ∆̃_k(f) = Σ_{i=1}^H ∫ (P_k(s'|s_i, a_i) − P*(s'|s_i, a_i)) V^k_{k,i+1}(s') ds'. Here (s_i, a_i) are the state-action pairs the agent encounters in the k-th episode while using µ_k to interact with the real MDP M*. We define V_{k,H+1} = 0 for consistency.
Note that we cannot treat s_i and a_i as deterministic and simply take the expectation over the random reward and transition functions. Instead, we bound the difference using concentration properties of the reward and transition functions modeled as Gaussian Processes (which apply to any state-action pair), and then derive bounds on this expectation. For all i, we have

∫ (P_k(s'|s_i, a_i) − P*(s'|s_i, a_i)) V^k_{k,i+1}(s') ds' ≤ max_{s'} |V^k_{k,i+1}(s')| ∫ |P_k(s'|s_i, a_i) − P*(s'|s_i, a_i)| ds' ≤ H R_max ∫ |P_k(s'|s_i, a_i) − P*(s'|s_i, a_i)| ds'.   (5)

We now present a lemma which enables us to derive a regret bound with explicit dependency on the episode length H.

Lemma 1 For two multivariate Gaussian distributions N(µ, σ²I) and N(µ', σ²I) with probability density functions p_1(x) and p_2(x) respectively, x ∈ R^d,

∫ |p_1(x) − p_2(x)| dx ≤ √(2/(πσ²)) ||µ − µ'||_2.

The proof is in Appendix A.1. Clearly, this result can also be extended to sub-Gaussian noise. Recall that P_k(s'|s_i, a_i) = N(f_k(s_i, a_i), σ_f² I) and P*(s'|s_i, a_i) = N(f*(s_i, a_i), σ_f² I). By Lemma 1, we have

∫ |P_k(s'|s_i, a_i) − P*(s'|s_i, a_i)| ds' ≤ √(2/(πσ_f²)) ||f_k(s_i, a_i) − f*(s_i, a_i)||_2.   (6)

Lemma 2 (Rigollet & Hütter, 2015) Let X_1, ..., X_N be N sub-Gaussian random variables with variance proxy σ² (not required to be independent). Then for any t > 0, P(max_{1≤i≤N} |X_i| > t) ≤ 2N e^{−t²/(2σ²)}.

Given history H_k, let f̄_k(s, a) denote the posterior mean of f_k(s, a) in episode k, and let σ_k²(s, a) denote the posterior variance of f_k in each dimension. Note that f* and f_k share the same posterior variance in each dimension given H_k, as described in Section 3.2. Considering all d_s dimensions of the state space, Lemma 2 gives that with probability at least 1 − δ, max_{1≤i≤d_s} |f_k^i(s, a) − f̄_k^i(s, a)| ≤ √(2σ_k²(s, a) log(2d_s/δ)). We can also bound the norm of the state difference by ||f_k(s, a) − f̄_k(s, a)||_2 ≤ √d_s max_{1≤i≤d_s} |f_k^i(s, a) − f̄_k^i(s, a)|, and the same holds for ||f*(s, a) − f̄_k(s, a)||_2, since f* and f_k share the same posterior distribution. By the union bound, with probability at least 1 − 2δ,

||f_k(s, a) − f*(s, a)||_2 ≤ 2√(2 d_s σ_k²(s, a) log(2d_s/δ)).
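Lemma 2 can be sanity-checked numerically. The sketch below (our own helper names, stdlib only) compares the sub-Gaussian tail bound with the exact maximum tail in the special case of independent N(0, σ²) variables, where the exact value is computable in closed form; the lemma itself requires no independence.

```python
import math

def subgaussian_max_bound(n, t, sigma):
    """Lemma 2 tail bound: P(max_i |X_i| > t) <= 2 N exp(-t^2 / (2 sigma^2))."""
    return 2 * n * math.exp(-t ** 2 / (2 * sigma ** 2))

def gaussian_max_tail_iid(n, t, sigma):
    """Exact P(max_i |X_i| > t) for N *independent* N(0, sigma^2) variables,
    for comparison: 1 - (1 - 2 Q(t/sigma))^N with Q the Gaussian upper tail."""
    q = 0.5 * math.erfc(t / (sigma * math.sqrt(2.0)))
    return 1.0 - (1.0 - 2.0 * q) ** n
```

In the independent case the exact tail is always below the lemma's bound, which confirms the bound is conservative; the value of Lemma 2 here is precisely that it still holds for the correlated posterior errors arising along a trajectory.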
Then we consider the sum of these differences over the horizon H, without requiring the summands to be independent:

P(Σ_{i=1}^H ||f_k(s_i, a_i) − f*(s_i, a_i)||_2 > Σ_{i=1}^H 2√(2 d_s σ_k²(s_i, a_i) log(2d_s/δ)))
≤ P(∪_{i=1}^H {||f_k(s_i, a_i) − f*(s_i, a_i)||_2 > 2√(2 d_s σ_k²(s_i, a_i) log(2d_s/δ))})
≤ Σ_{i=1}^H P(||f_k(s_i, a_i) − f*(s_i, a_i)||_2 > 2√(2 d_s σ_k²(s_i, a_i) log(2d_s/δ))).

Thus, with probability at least 1 − 2Hδ, Σ_{i=1}^H ||f_k(s_i, a_i) − f*(s_i, a_i)||_2 ≤ Σ_{i=1}^H 2√(2 d_s σ_k²(s_i, a_i) log(2d_s/δ)). Letting δ' = 2Hδ, we have that with probability at least 1 − δ',

Σ_{i=1}^H ||f_k(s_i, a_i) − f*(s_i, a_i)||_2 ≤ Σ_{i=1}^H 2√(2 d_s σ_k²(s_i, a_i) log(4H d_s/δ')) ≤ 2H √(2 d_s σ_k²(s_kmax, a_kmax) log(4H d_s/δ')),

where the index kmax = argmax_{i=1,...,H} σ_k(s_i, a_i) in episode k. Here, since the posterior distribution is only updated every H steps, we bound the result using the data point with the maximum variance in each episode. Similarly, using the union bound over the ⌈T/H⌉ episodes, and letting C = √(2/(πσ_f²)), we have that with probability at least 1 − δ,

Σ_{k=1}^{⌈T/H⌉} ∆̃_k(f) ≤ Σ_{k=1}^{⌈T/H⌉} Σ_{i=1}^H 2C H R_max ||f_k(s_i, a_i) − f*(s_i, a_i)||_2 ≤ Σ_{k=1}^{⌈T/H⌉} 4C H² R_max √(2 d_s σ_k²(s_kmax, a_kmax) log(4T d_s/δ)).

In each episode k, let σ̃_k²(s, a) denote the posterior variance given only the subset of data points {(s_1max, a_1max), ..., (s_(k−1)max, a_(k−1)max)}, where each element has the maximum variance in its corresponding episode. By Eq. (6) in Williams & Vivarelli (2000), the posterior variance decreases as the number of data points grows; hence σ_k²(s, a) ≤ σ̃_k²(s, a) for all (s, a). By Theorem 5 in Srinivas et al.
(2012), which provides a bound on the information gain, and Lemma 2 in Russo & Van Roy (2014), which bounds the sum of variances by the information gain, we have Σ_{k=1}^{⌈T/H⌉} σ̃_k²(s_kmax, a_kmax) = O((d_s + d_a) log⌈T/H⌉) for linear kernels with bounded variance. Note that the bounded-variance property for linear kernels only requires that the range of the state-action pairs actually encountered in M* does not expand to infinity as T grows, which holds in general episodic MDPs. Thus, with probability at least 1 − δ, and letting δ = 1/T,

Σ_{k=1}^{⌈T/H⌉} ∆̃_k(f) ≤ Σ_{k=1}^{⌈T/H⌉} 4C H² R_max √(2 d_s σ̃_k²(s_kmax, a_kmax) log(4T d_s/δ))
≤ Σ_{k=1}^{⌈T/H⌉} 8C H² R_max √(d_s σ̃_k²(s_kmax, a_kmax) log(2T d_s))
≤ 8C H² R_max √(Σ_{k=1}^{⌈T/H⌉} σ̃_k²(s_kmax, a_kmax)) √(⌈T/H⌉ d_s log(2T d_s))
= 8C H^{3/2} R_max √(T d_s log(2T d_s)) · √(O((d_s + d_a) log⌈T/H⌉))
= Õ((d_s + d_a) H^{3/2} √T),   (8)

where Õ hides logarithmic factors. Therefore,

E[Σ_{k=1}^{⌈T/H⌉} ∆̃_k(f) | H_k] ≤ (1 − 1/T) · Õ((d_s + d_a) H^{3/2} √T) + (1/T) · 2H R_max · ⌈T/H⌉ = Õ(H^{3/2} d √T),

5. ALGORITHM DESCRIPTION

In this section, we elaborate on our proposed algorithm, MPC-PSRL, shown in Algorithm 1.

5.1. PREDICTIVE MODEL

When modeling rewards and transitions, we use features extracted from the penultimate layer of fitted neural networks, and perform Bayesian linear regression on the feature vectors to update the posterior distributions. Feature representation: we first fit neural networks for transitions and rewards, using the same network architecture as Chua et al. (2018). Let x_i denote the state-action pair (s_i, a_i) and y_i the target value. Specifically, we use the reward r_i as y_i when fitting rewards, and the difference between consecutive states s_{i+1} − s_i as y_i when fitting transitions. The penultimate layer of each fitted network is extracted as the feature representation, denoted φ_f and φ_r for transitions and rewards, respectively. Note that in the transition feature embedding we only use one neural network to extract features of state-action pairs from the penultimate layer to serve as φ, and leave the target states without further feature representation (the general setting, where feature representations are used for both inputs and outputs, is discussed in Section 4.2), so the dimension of the target in the transition model d_ψ equals d_s. Thus we have a modified regret bound of Õ(H^{3/2} √(d d_φ) √T). We find no need to further extract feature representations in the target space, as this might introduce additional computational overhead. Although higher-dimensional hidden layers might yield better representations, we find that setting the width of the penultimate layer to d_φ = d_s + d_a suffices in our experiments for both reward and transition models. How to optimize the dimension of the penultimate layer for more efficient feature representation deserves further exploration. Bayesian update and posterior sampling: here we describe the Bayesian update of the transition and reward models using the extracted features. Recall that a Gaussian Process with a linear kernel is equivalent to Bayesian linear regression.
By extracting the penultimate layer as the feature representation φ, the target value y and the representation φ(x) can be seen as linearly related: y = w^T φ(x) + ε, where ε is zero-mean Gaussian noise with variance σ² (σ_f² for the transition model and σ_r² for the reward model, as defined in Section 3.1). We choose the prior distribution of the weights w as zero-mean Gaussian with covariance matrix Σ_p; the posterior distribution of w is then also multivariate Gaussian (Rasmussen, 2003): p(w|D) = N(σ^{-2} A^{-1} Φ Y, A^{-1}), where A = σ^{-2} Φ Φ^T + Σ_p^{-1}, Φ ∈ R^{d_φ×N} is the concatenation of the feature representations {φ(x_i)}_{i=1}^N, and Y ∈ R^N is the concatenation of the target values. At the beginning of each episode, we sample w from the posterior distribution to build the model, collect new data during the whole episode, and update the posterior distribution of w at the end of the episode using all the data collected. Besides the posterior distribution of w, the feature representation φ is also updated in each episode with the newly collected data. We adopt a dual-update procedure similar to Riquelme et al. (2018): after the representations for rewards and transitions are updated, the feature vectors of all collected state-action pairs are re-computed, and we then apply the Bayesian update to these feature vectors. See the description of Algorithm 1 for details.
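The Bayesian update and per-episode sampling step above can be sketched as follows. This is a minimal NumPy sketch under our own class and argument names, assuming the feature matrix Φ has already been produced by the fitted network; it is not the paper's implementation.

```python
import numpy as np

class BayesianLinearHead:
    """Bayesian linear regression head on penultimate-layer features.

    Prior w ~ N(0, prior_var * I) (an isotropic choice of Sigma_p); noise
    variance sigma^2. Given Phi (d_phi x N, one feature vector per column)
    and targets Y (length N), the posterior is
    N(sigma^{-2} A^{-1} Phi Y, A^{-1}) with A = sigma^{-2} Phi Phi^T + Sigma_p^{-1}.
    """

    def __init__(self, d_phi, sigma2=0.01, prior_var=1.0):
        self.sigma2 = sigma2
        self.prior_prec = np.eye(d_phi) / prior_var  # Sigma_p^{-1}

    def posterior(self, Phi, Y):
        A = Phi @ Phi.T / self.sigma2 + self.prior_prec
        cov = np.linalg.inv(A)                  # A^{-1}
        mean = cov @ (Phi @ Y) / self.sigma2    # sigma^{-2} A^{-1} Phi Y
        return mean, cov

    def sample_w(self, Phi, Y, rng):
        """One posterior draw of w: the per-episode sampling step of PSRL."""
        mean, cov = self.posterior(Phi, Y)
        return rng.multivariate_normal(mean, cov)
```

In MPC-PSRL one such head would be kept per model (reward and each use of the dynamics), with `sample_w` called once at the start of each episode and `posterior` refreshed at the end, after the features are re-computed.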

5.2. PLANNING

During interaction with the environment, we use an MPC controller (Camacho & Alba, 2013) for planning. At each time step i, the controller takes the state s_i and an action sequence a_{i:i+τ} = {a_i, a_{i+1}, ..., a_{i+τ}} as input, where τ is the planning horizon. Using the transition and reward models, we execute the first action a_i of the optimized sequence argmax_{a_{i:i+τ}} Σ_{t=i}^{i+τ} E[r(s_t, a_t)], where the expected return of an action sequence is approximated by the mean return of several particles propagated with the noise of our sampled reward and transition models. To compute the optimal action sequence, we use CEM (Botev et al., 2013), which iteratively samples actions from a distribution refitted to the previous action samples with high rewards.
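The CEM-based MPC step can be sketched as follows. This is our own minimal illustration (function and parameter names are assumptions): `reward_fn` and `dynamics_fn` stand in for the sampled reward and transition models, and the particle-based noise propagation is omitted for brevity.

```python
import numpy as np

def cem_plan(reward_fn, dynamics_fn, s0, horizon, action_dim,
             n_samples=200, n_elites=20, n_iters=5, seed=0):
    """Cross-entropy method planner: returns the first action of the
    optimized sequence, as in MPC."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences around the current distribution.
        seqs = mean + std * rng.standard_normal((n_samples, horizon, action_dim))
        returns = np.empty(n_samples)
        for k in range(n_samples):
            s, total = s0, 0.0
            for t in range(horizon):
                total += reward_fn(s, seqs[k, t])
                s = dynamics_fn(s, seqs[k, t])
            returns[k] = total
        # Refit the sampling distribution to the elite (highest-return) sequences.
        elites = seqs[np.argsort(returns)[-n_elites:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean[0]  # MPC executes only the first action, then replans
```

At the next time step the controller observes the new state and replans from scratch, which is what makes the scheme receding-horizon control rather than open-loop trajectory optimization.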

6. EXPERIMENTS

We compare our method with the following state-of-the-art model-based and model-free algorithms on benchmark control tasks. Model-free: Soft Actor-Critic (SAC) from Haarnoja et al. (2018) is an off-policy deep actor-critic algorithm that uses entropy maximization to guide exploration. Deep Deterministic Policy Gradient (DDPG) from Barth-Maron et al. (2018) is an off-policy algorithm that concurrently learns a Q-function and a policy, with a discount factor to guide exploration. Model-based: Probabilistic Ensembles with Trajectory Sampling (PETS) from Chua et al. (2018) models the dynamics via an ensemble of probabilistic neural networks to capture epistemic uncertainty for exploration, and uses MPC for action selection; it requires access to oracle rewards for planning. Model-Based Policy Optimization (MBPO) from Janner et al. (2019) uses the same bootstrap-ensemble technique as PETS for modeling, but differs from PETS in policy optimization, which it performs with a large number of short model-generated rollouts; it can cope with environments where no oracle rewards are provided. We do not compare with Gal et al. (2016), which adopts a single Bayesian neural network (BNN) with moment matching, as it is outperformed by PETS, which uses an ensemble of BNNs with trajectory sampling. Nor do we compare with GP-based trajectory optimization methods that require true rewards (Deisenroth & Rasmussen, 2011; Kamthe & Deisenroth, 2018), which are not only outperformed by PETS but also computationally expensive, and thus limited to very small state-action spaces. We use environments of varying complexity and dimensionality for evaluation. Low-dimensional environments: continuous Cartpole (d_s = 4, d_a = 1, H = 200; the continuous action space makes it harder to learn than the classic Cartpole) and Pendulum Swing-Up (d_s = 3, d_a = 1, H = 200; a modified version of Pendulum where we restrict the start state to make exploration harder).
Trajectory optimization with oracle rewards in these two environments is easy, and there is almost no difference in performance among the model-based algorithms we compare, so we omit these learning curves. Higher-dimensional environments: 7-DOF Reacher (d_s = 17, d_a = 7, H = 150) and 7-DOF Pusher (d_s = 20, d_a = 7, H = 150) are two more challenging tasks provided in Chua et al. (2018), where we conduct experiments both with and without true rewards to compare against all the baseline algorithms mentioned. The learning curves are shown in Figure 1. When oracle rewards are provided in Pusher and Reacher, our method outperforms PETS and MBPO: it converges more quickly with similar performance at convergence in Pusher, while in Reacher it not only learns faster but also performs better at convergence. As we use the same planning method (MPC) as PETS, the results indicate that our model better captures uncertainty, which improves sample efficiency. When exploring in environments where both rewards and transitions are unknown, our method learns significantly faster than previous model-based and model-free methods that do not require oracle rewards, while matching the performance of SAC at convergence. Moreover, the performance of our algorithm with and without oracle rewards is similar, sometimes even converging faster without them (see Pusher with and without rewards), indicating that our algorithm excels at exploring both rewards and transitions. These experimental results support the claim that our algorithm better captures model uncertainty and makes better use of it through posterior sampling. By sampling from a Bayesian linear regression on a fitted feature space, and optimizing under the same sampled MDP throughout an episode instead of re-sampling at every step, our algorithm enjoys the Bayesian guarantees analyzed in Section 4.
PETS and MBPO, by contrast, use bootstrap ensembles of models with a limited ensemble size to "simulate" a Bayesian model, where the convergence of the uncertainty estimate is not guaranteed and depends heavily on how the neural network is trained. A limitation of our method is its reliance on MPC, which might fail in even higher-dimensional tasks, as shown in Janner et al. (2019). Incorporating policy-gradient techniques for action selection might further improve performance, and we leave this for future work.

7. CONCLUSION

In this paper, we derive a novel Bayesian regret bound for PSRL in continuous spaces under the assumption that the true rewards and transitions (with or without feature embedding) can be modeled by GPs with linear kernels. While matching the best-known bounds of previous works from a Bayesian view, PSRL also enjoys computational tractability. Moreover, we propose MPC-PSRL for continuous environments, and experiments show that our algorithm outperforms existing model-based and model-free methods through more efficient exploration.

A APPENDIX

A.1 PROOF OF LEMMA 1

Here we provide a proof of Lemma 1. We first prove the result in R^d with d = 1: p_1(x) ∼ N(µ, σ²), p_2(x) ∼ N(µ', σ²); without loss of generality, assume µ' ≥ µ. The two densities are symmetric with respect to (µ + µ')/2, and p_1(x) = p_2(x) at x = (µ + µ')/2. Thus the integral of the absolute difference between the densities can be written as twice the integral over one side:

∫_{−∞}^{∞} |p_2(x) − p_1(x)| dx = (2/√(2πσ²)) ∫_{(µ+µ')/2}^{∞} (e^{−(x−µ')²/(2σ²)} − e^{−(x−µ)²/(2σ²)}) dx.   (9)

Substituting z_1 = x − µ and z_2 = x − µ', we have:

(2/√(2πσ²)) ∫_{(µ+µ')/2}^{∞} (e^{−(x−µ')²/(2σ²)} − e^{−(x−µ)²/(2σ²)}) dx
= (2/√(2πσ²)) (∫_{(µ−µ')/2}^{∞} e^{−z_2²/(2σ²)} dz_2 − ∫_{(µ'−µ)/2}^{∞} e^{−z_1²/(2σ²)} dz_1)
= (2/√(2πσ²)) ∫_{−(µ'−µ)/2}^{(µ'−µ)/2} e^{−z²/(2σ²)} dz
≤ (2/√(2πσ²)) (µ' − µ) = √(2/(πσ²)) |µ − µ'|.

Now we extend the result to R^d (d ≥ 2): p_1(x) ∼ N(µ, σ²I), p_2(x) ∼ N(µ', σ²I). We can rotate the coordinate system to align the last axis with the vector µ' − µ, so that the coordinates of µ and µ' can be written as (0, 0, ..., 0, μ̄) and (0, 0, ..., 0, μ̄'), respectively, with |μ̄' − μ̄| = ||µ − µ'||_2. Without loss of generality, let μ̄' ≥ μ̄. Clearly, all points equidistant from the two means form the hyperplane P: x_d = (μ̄ + μ̄')/2, on which p_1(x) = p_2(x) for all x ∈ P; more specifically, the densities are symmetric with respect to P. Similar to the analysis in R^1, the integral factorizes over coordinates:

∫ |p_1(x) − p_2(x)| dx = (2/√((2π)^d σ^{2d})) (∏_{j=1}^{d−1} ∫_{−∞}^{∞} e^{−x_j²/(2σ²)} dx_j) ∫_{(μ̄+μ̄')/2}^{∞} (e^{−(x_d−μ̄')²/(2σ²)} − e^{−(x_d−μ̄)²/(2σ²)}) dx_d.

Each of the first d − 1 Gaussian integrals equals √(2πσ²), cancelling the corresponding normalization factors and reducing the expression to the one-dimensional case:

= (2/√(2πσ²)) ∫_{(μ̄+μ̄')/2}^{∞} (e^{−(x_d−μ̄')²/(2σ²)} − e^{−(x_d−μ̄)²/(2σ²)}) dx_d ≤ √(2/(πσ²)) |μ̄' − μ̄| = √(2/(πσ²)) ||µ − µ'||_2,

which completes the proof.
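The one-dimensional bound can also be checked numerically. The sketch below (our own helper names; NumPy grid integration) compares ∫|p_1 − p_2| with the Lemma 1 bound:

```python
import numpy as np

def l1_gaussian_distance(mu1, mu2, sigma, half_width=12.0, n=200001):
    """Grid integration of |p1 - p2| for N(mu1, sigma^2) vs N(mu2, sigma^2)."""
    lo = min(mu1, mu2) - half_width * sigma
    hi = max(mu1, mu2) + half_width * sigma
    x = np.linspace(lo, hi, n)
    dx = x[1] - x[0]
    norm = np.sqrt(2 * np.pi * sigma ** 2)
    p1 = np.exp(-(x - mu1) ** 2 / (2 * sigma ** 2)) / norm
    p2 = np.exp(-(x - mu2) ** 2 / (2 * sigma ** 2)) / norm
    return float(np.sum(np.abs(p1 - p2)) * dx)  # Riemann sum of |p1 - p2|

def lemma1_bound(mu1, mu2, sigma):
    """Lemma 1 upper bound for d = 1: sqrt(2 / (pi sigma^2)) * |mu1 - mu2|."""
    return float(np.sqrt(2.0 / (np.pi * sigma ** 2)) * abs(mu1 - mu2))
```

For small |µ − µ'|/σ the bound is nearly tight (it is the first-order expansion of the exact total-variation formula), while for large separations the trivial bound of 2 takes over, matching the role σ_f plays in smoothing the value-function differences in the main proof.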



A GP with a linear kernel corresponds to Bayesian linear regression f(x) = w^T x, where the prior distribution of the weights is w ∼ N(0, Σ_p).




where 2H R_max is the upper bound on the difference of the value functions, and d = d_s + d_a. By a similar derivation, E[Σ_{k=1}^{⌈T/H⌉} ∆̃_k(r) | H_k] = Õ(√(dHT)). Finally, through the tower property, we have BayesRegret(T, π_ps, φ) = Õ(H^{3/2} d √T).

Algorithm 1 MPC-PSRL
  Initialize data D with random actions for one episode
  repeat
    Sample a transition model and a reward model at the beginning of each episode
    for i = 1 to H do
      Obtain action using MPC with planning horizon τ: a_i ∈ argmax_{a_{i:i+τ}} Σ_{t=i}^{i+τ} E[r(s_t, a_t)]
      D = D ∪ {(s_i, a_i, r_i, s_{i+1})}
    end for
    Train reward and dynamics representations φ_r and φ_f using the data in D
    Update φ_r(s, a), φ_f(s, a) for all (s, a) collected
    Perform the posterior update of w_r and w_f in the reward and dynamics models using the updated representations φ_r(s, a), φ_f(s, a) for all (s, a) collected
  until convergence

4.2. NONLINEAR CASE VIA FEATURE REPRESENTATION

We can slightly modify the previous proof to derive the bound in settings that use feature representations. We transform the state-action pair (s, a) to φ_f(s, a) ∈ R^{d_φ} as the input of the transition model, and transform the newly transitioned state s' to ψ_f(s') ∈ R^{d_ψ} as the target; the transition model is then established with respect to this feature embedding. We further assume d_ψ = O(d_φ), as in Assumption 1 of Yang & Wang (2019). Besides, we assume the reward feature representation φ_r(s, a) also has dimension O(d_φ), so the reward model can likewise be established with respect to the feature embedding. Following similar steps, we can derive a Bayesian regret of Õ(H^{3/2} d_φ √T).

Figure 1: Training curves of MPC-PSRL (shown in red) and other baseline algorithms on different tasks. Solid curves are the mean of five trials, shaded areas correspond to the standard deviation among trials, and the dotted line shows the rewards at convergence.



Hyperparameters for MBPO. We also provide the hyperparameters for MPC and the neural networks used in PETS:

Hyperparameters for PETS. Here are the hyperparameters of our algorithm, which are similar to those of PETS except for the ensemble size (since we do not use ensembled models):

Hyperparameters for our method. For SAC and DDPG, we use the open-source code (https://github.com/dongminlee94/deep_rl) without changing the hyperparameters. We thank the authors for sharing the code.

