MODEL-BASED VALUE EXPLORATION IN ACTOR-CRITIC DEEP REINFORCEMENT LEARNING

Abstract

Off-policy methods have demonstrated great potential in model-free deep reinforcement learning owing to their sample efficiency. However, they suffer from additional instability caused by mismatched distributions of observations. Model-free on-policy counterparts usually have poor sample efficiency. Model-based algorithms, in contrast, depend heavily on the quality of expert demonstrations or learned dynamics. In this work, we propose a method that trains a dynamics model to accelerate and gradually stabilize learning without adding sample complexity. The dynamics model prediction provides effective target value exploration, which is essentially different from on-policy exploration methods, by adding valid diversity of transitions. Despite the existence of model bias, the model-based prediction can avoid the overestimation and distribution mismatch errors of off-policy learning, as the learned dynamics model is asymptotically accurate. Besides, to generalize the solution to large-scale reinforcement learning problems, we use global Gaussian and deterministic function approximation to model the transition probability and the reward function, respectively. To minimize the negative impact of potential model bias brought by the estimated dynamics, we adopt one-step global prediction for the model-based part of the target value. Through analyses and proofs, we show how the model-based prediction provides value exploration and asymptotic performance to the overall network. We also conclude that the convergence of the proposed algorithm depends only on the accuracy of the learned dynamics model.

1. INTRODUCTION

Model-free reinforcement learning (RL) algorithms have been applied to a wide range of tasks, from simple games (Mnih et al., 2013; Oh et al., 2016) to robotic locomotion skills (Schulman et al., 2015). To tackle large-scale continuous control problems, deep reinforcement learning (DRL) uses neural networks as function approximators to represent high-dimensional state and action spaces. However, model-free DRL is notoriously expensive in terms of sample efficiency, which makes it very difficult to deploy in real settings where samples are costly to obtain. Among the recent model-free DRL algorithms, on-policy methods (Schulman et al., 2015; 2017; Fujimoto et al., 2018) typically require multiple samples to be collected for each rollout at every gradient step, which is wasteful because a multiplied data requirement does not necessarily bring a corresponding performance gain. In comparison, off-policy methods aim to reuse past experience by storing the collected observations in a memory buffer, typically combining Q-learning with neural networks (Mnih et al., 2015). Unfortunately, the combination of off-policy learning and high-dimensional, nonlinear function approximation is exposed to instability and divergence (Maei et al., 2009). The causes of these problems are complicated. For example, some works (Fujimoto et al., 2018; 2019; Duan et al., 2021) attribute them to the overestimation bias: the continually maximized value during actor-critic optimization accumulates overestimation errors and breaks training stability. Others point to the extrapolation error induced by the mismatch between the distribution of data sampled from experience and the true state-action visitation of the current policy (Fujimoto et al., 2019). There have been several ways to tackle the distribution mismatch.
Wu et al. (2019) address the distribution errors by an extra value penalty or policy regularization. Wang & Ross (2019) change the rule of experience replay to reduce the distribution mismatch by sampling more aggressively from recent experience, while ordering the updates to ensure that updates from old data do not overwrite updates from new data. Martin et al. (2021) relabel successful episodes as expert demonstrations for the agent to match. Despite these efforts, the overestimation bias and the mismatched distribution from past experience can only be mitigated, and the remedies may sometimes induce new problems.

The paper has the following contributions. First, instead of using immediate rewards or assuming a known reward function, we adopt neural networks to approximate the reward function as part of the dynamics. Meanwhile, we train the parameters of the modeled transition probability and reward function on the replay buffer of off-policy observations. Second, the prediction from the learned dynamics is used to forecast the target value with a certain weight. Since the dynamics prediction is essentially different from the observations from the environment, it can provide extra exploration that is not conditioned on the state-action visitation history. Besides, a well-trained dynamics model is free of overestimation and distribution mismatch errors, and can provide a more accurate target value and stabilize the asymptotic performance. Third, the related algorithm is proposed, and the final results demonstrate its efficiency and stability. Fourth, the accuracy of the learned model is tested by setting a maximum online time step, which marks the beginning of off-line planning that is isolated from the environment.

2. RELATED WORK

Due to the various problems arising from the sample complexity of model-free algorithms, task-specific representations (Peters et al., 2010; Deisenroth et al., 2013) as well as model-based algorithms using planning (Deisenroth & Rasmussen, 2011; Levine et al., 2016; 2018; Kaiser et al., 2019), which optimize the policy under a learned or given dynamics model, are preferable in real physical systems such as robots and autonomous vehicles. However, task-specific representations cover a limited range of learnable tasks and require more domain knowledge. Model-based DRL algorithms are considered more efficient (Deisenroth et al., 2013), because they construct a probabilistic dynamics model from data and avoid interaction with the environment by training the policy on the learned dynamics model (Hua et al., 2021), but this limits the policy to be only as good as the learned model (Gu et al., 2016). For the model-free part, the agent needs to interact with the environment to collect enough knowledge for training, which highlights the importance of the tradeoff between exploration and exploitation (Mnih et al., 2016). Soft actor-critic (SAC) (Haarnoja et al., 2018a; b) achieves good performance on a set of continuous control tasks by adopting stochastic function approximation and maximum entropy for policy exploration. Among these techniques, stochastic policies have the advantage over deterministic counterparts of allowing on-policy exploration and off-policy experience replay (Heess et al., 2015), and maximum entropy exploration improves robustness and stability (Ziebart et al., 2008; Ziebart, 2010). Overall, the existing exploration strategies are limited to the policy, which raises the question of whether and how value exploration can play a positive role in model-free learning.
While some works in the literature combine model-free and model-based DRL (Sutton, 1990; Lampe & Riedmiller, 2014), the following works are particularly relevant to this paper. Specifically, Gu et al. (2016) and Nagabandi et al. (2018) add synthetic imagination rollouts to an additional replay buffer for model-guided exploration in off-policy methods, at the price of much higher storage and computation costs. Besides, model ensembles are adopted in (Chua et al., 2018; Kurutach et al., 2018; Janner et al., 2019) to reduce the misguided policy or inaccurate planning caused by model bias. Moreover, value expansion with a fixed multi-step prediction by the dynamics model is adopted in (Feinberg et al., 2018; Buckman et al., 2018) to control the imagination depth. However, multi-step prediction from a global dynamics model may suffer from cumulative model estimation errors and is usually replaced by iteratively refitted time-varying linear models (Levine & Abbeel, 2014). VIME (Houthooft et al., 2016) introduces maximization of the information gain about the dynamics' certainty, which relies heavily on theoretical analysis and offers little intuitive interpretation. In this paper, we adopt one-step prediction from a global dynamics model for value exploration, which avoids the storage and computation costs of multi-step synthetic rollouts, to achieve diversity, accuracy and generality.

3. PRELIMINARIES

We consider the extended Markov Decision Process (MDP) in continuous state and action spaces, denoted by the tuple $(S, A, P, r)$, where $S$ is the state space (which also contains the next state), $A$ is the action space, $P(s'|s, a)$ is the transition denoting the conditional probability of the next state $s' \in S$ given the current state $s \in S$ and action $a \in A$, and $r$ is the reward function on $S \times A$, which is connected with $(s, a, s')$ from the environment. The purpose of reinforcement learning (RL) is to optimize the policy by maximizing the return, i.e., the expected discounted cumulative reward of a rollout. DRL employs function approximation to parameterize the return so that the optimization can work under the setting of continuous control. In DRL, an action $a$ is sampled from the policy $\pi$ and "judged" according to a value estimate determined by the observation $s$; the next state $s'$ and the future reward $r$ are then determined by the transition probability $P(\cdot|s, a)$ and the reward function $r(s, a)$, respectively, which together define the dynamics in this work. In recent DRL research, the action-value (Q-value) function of the state-action pair is usually chosen as the surrogate of the return, in the form of

$$Q^{\pi}(s, a) = \sum_{t} \mathbb{E}_{s_t \sim P^{t},\, a_t \sim \pi}\left[\gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s, a_0 = a\right], \tag{1}$$

where $\gamma \in (0, 1)$ is the discount factor for future rewards, $\pi$ is the policy for action selection at every time step, and $P^{t}$ is the distribution of $s_t$, which is a joint distribution of transitions. If conditioned on the initial state-action pair $(s_0, a_0)$, it is given by

$$P^{t}(s_t | s_0, a_0) = P(s_1 | s_0, a_0) \prod_{i=1}^{t-1} \mathbb{E}_{s_i \sim S,\, a_i \sim \pi} P(s_{i+1} | s_i, a_i).$$

Then we have

$$P^{t}(s_t) = \mathbb{E}_{s_{t-1} \sim S,\, a_{t-1} \sim \pi}\left[P(s_t | s_{t-1}, a_{t-1})\, P^{t-1}(s_{t-1})\right].$$

Proof. See Appendix A (submitted in the supplementary material).

The Q-value function in Eq. (1) is a mapping from the input observation-action pair $(s, a)$ to the Q-value, and it satisfies the Bellman equation. The temporal difference (TD) (Tesauro, 1995) is therefore generally used to minimize the Bellman error on the transition tuple $(s, a, r, s')$ at every critic evaluation step, which is given by $\mathbb{E}_{(s,a,r,s')}\left[\left(r + \gamma Q_t(s', \pi(s')) - Q(s, a)\right)^2\right]$ (Lillicrap et al., 2015), where $Q_t$ stands for a target Q network. In algorithms using experience replay, $(s, a, r, s')$ is stored in a replay buffer at every environment step, $a$ is sampled from the experience pool, and the next action has to be judged by the current policy, represented as $\pi(s')$. In off-policy methods, the distribution of the sampled action $a$ is different from the current policy, which causes distribution mismatch. In this context, we use the term 'iteration' to represent the index of updates. In the actor-critic paradigm, each iteration contains the evaluation step and the policy improvement step, which are used to update the Q-value function and to optimize the policy, respectively. After minimizing the Bellman error, the policy improvement is performed by maximizing the expected return $J(\theta) = \mathbb{E}_{s}\left[Q^{\pi}(s, \pi(s))\right]$. In some algorithms, a policy regularization term is attached to the expected return for training stability (Kumar et al., 2019; Jaques et al., 2019), which aims to keep the policy gradient $\nabla_{\theta} J(\theta)$ away from potential vanishing or exploding gradients and to reduce the estimation variance.

4. MODEL-BASED ACTOR-CRITIC VALUE EXPLORATION WITH ASYMPTOTIC PLANNING

The actor-critic target value exploration with asymptotic planning is a method that blends the one-step global model-based prediction into the target critic value; its importance lies in how and why it can work well. The difference in state distributions between the model-based prediction and the observations produces extra value exploration, and an asymptotically accurate learned model has the potential to overcome the errors from overestimation and mismatched policy distributions that arise when experience replay is applied in off-policy algorithms. The asymptotic planning is realized by gradually increasing the weight of the model-based prediction in the target critic value.

4.1. MODEL-BASED TARGET VALUE EXPLORATION

Intuitively, the model-based target value is able to explore future states from different viewpoints with some certainty. Without the model-based prediction, the target value usually takes as input next states drawn at random from the replay buffer. The distributions of the sampled next states, which satisfy

$$s_{t+1} \sim P^{t+1}(s_{t+1}), \tag{2}$$

are not really stationary in off-policy methods, since the policy goes through multiple updates in a rollout. These non-stationary distributions can distort the real stationary transition probability from the view of the current policy, which means that a false dynamics could be experienced or "felt" through the real observations in the replay buffer, even though they are obtained by interacting with the environment. The problem of distribution mismatch is incurred when an off-policy method is used, since the actions $(a_t, \cdots, a_0)$ are random samples from the replay buffer following different distributions. This analysis shows that observations from the real environment are not necessarily more accurate than the prediction of a dynamics model. Based on the choice of target value, the evaluation step for the critic updates can be separated into off-policy training and on-policy planning. We introduce the backup operator for the model-based Q-value prediction as

$$\mathcal{P}^{\pi} Q(s_t, a_t) = r_{\mu}(s_t, a_t, s^{p}_{t+1}) + \gamma\, \mathbb{E}_{s_{t+1}}\left[Q(s^{p}_{t+1}, a_{t+1})\right],$$

where $s^{p}_{t+1} \sim p(\cdot|s_t, a_t)$, $a_t \sim \pi(\cdot|s_t)$, $a_{t+1} \sim \pi(\cdot|s_{t+1})$, and $r_{\mu}$ and $p$ are the reward function and transition probability learned by the dynamics model, respectively. We use $s^{p}_{t+1}$ to distinguish the model-based prediction from $s_{t+1} \sim P^{t+1}(s_{t+1})$, which follows a distribution determined by the complete rollout. We take $a_{t+1} \sim \pi(\cdot|s_{t+1})$ because of the restriction to one-step global prediction; choosing $a_{t+1} \sim \pi(\cdot|s^{p}_{t+1})$ instead would induce cumulative model bias.
If $r_{\mu}$ and $s^{p}_{t+1}$ are replaced with the immediate reward $r$ and the next state $s'$ from past experience, the backup operator (19) reduces to the target Q-value of deep deterministic policy gradient (DDPG) (Lillicrap et al., 2015). With an additional regularization term for entropy-based policy exploration, it becomes the target Q-value of SAC. The change of target value seems subtle; however, the model-based prediction produces valid state and reward diversity for value exploration. Unlike the action, which is usually bounded, the state space is continuous and unbounded in many tasks such as the Gym environments, so the exploration strategies commonly used for policy exploration, for example Gaussian exploration noise, are invalid for value exploration. In the literature we reviewed, few works have attempted simple and feasible value exploration. More importantly, as the trained reward model and transition model grow more accurate, the on-policy prediction (19) greatly reduces or avoids the overestimation errors and distribution mismatch that exist in a target Q-value without value exploration. Although model bias is induced by model-based planning during learning, it can be controlled by asymptotically increasing the impact of the model-based prediction.

Lemma 2. Consider the sequence $Q_{t+1} = \mathcal{P}^{\pi} Q_t$ constructed by (19). Given the condition that the Q-values are bounded, i.e., $|Q_t(s, a)| < \infty$, $\forall (s, a) \in S \times A$, the sequence $Q_t$ converges to a unique optimal value as $t \to \infty$.

The proof can be found in Appendix B, where the statement is restated as Lemma 5. In this work, (19) is combined with the target Q-value of SAC according to an asymptotically increasing weight. To expand the value exploration, the current Q-value function to be fitted to the target Q-value is also divided into two parts that share the same critic network parameters but take actions following different distributions as inputs.
In this way, the diversity is enlarged and some convergence conditions are satisfied, as shown later in Theorem 2 (restated as Theorem 4 in the appendix). In the policy improvement step, we directly adopt the entropy policy exploration of SAC with the actor network, which does not involve computation on the dynamics. When model-free DRL is applied to large-scale continuous control problems, where the dynamics is unknown and the state-action spaces are continuous, policy improvement over $S \times A$ at every iteration, which we call absolute policy improvement, cannot be guaranteed because of distribution mismatch and estimation biases. Some studies on continuous control also report empirical results that degrade after reaching a good point. As noted in the related work, several methods apply model-based synthetic rollouts to the policy improvement step; in our experience, however, they do not work well with the proposed value exploration strategy, since the one-step global prediction would be violated.
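As a rough illustration of the one-step model-based target described in this subsection, the following sketch evaluates $r_{\mu}(s, a, s^{p}) + \gamma Q(s^{p}, a')$ on scalar toy functions. The function names and the toy models (`transition_model`, `reward_model`, `q_fn`, `policy`) are hypothetical stand-ins, not the networks used in the paper; the only point is that the next action is drawn from the observed next state while the state fed to the critic is the predicted one.

```python
def model_based_target(s, a, s_next_obs, reward_model, transition_model,
                       q_fn, policy, gamma=0.99):
    """One-step model-based target: r_mu(s, a, s_p) + gamma * Q(s_p, a_next)."""
    # Predicted next state from a relative transition model: s_p = s + p(s, a).
    s_p = s + transition_model(s, a)
    # The next action is chosen from the OBSERVED next state, so the model
    # is only ever rolled forward by a single step.
    a_next = policy(s_next_obs)
    return reward_model(s, a, s_p) + gamma * q_fn(s_p, a_next)


# Hypothetical scalar stand-ins for the learned networks.
transition_model = lambda s, a: 0.5 * a
reward_model = lambda s, a, s_p: -(s_p ** 2)
q_fn = lambda s, a: -(s ** 2) - 0.1 * a ** 2
policy = lambda s: -0.5 * s

target = model_based_target(1.0, -0.4, 0.7, reward_model,
                            transition_model, q_fn, policy)
print(target)
```

Swapping `reward_model(s, a, s_p)` for a stored reward and `s_p` for the stored next state would give the usual replay-based target that the text contrasts with.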

4.2. DYNAMICS LEARNING

Compared with descriptive models that are feasible in small state spaces (Deisenroth & Rasmussen, 2011; Khansari-Zadeh & Billard, 2011), neural network approximation scales better to high-dimensional state spaces. We parameterize the dynamics $p_{\lambda}(s, a)$ and $r_{\mu}(s, a, s')$ with deep generative models (Moerland et al., 2020), where $\lambda$ and $\mu$ parameterize the transition and the reward function, respectively. Since the transition function is difficult to train, we choose to learn a relative transition function that forecasts the difference between the current state and the next state, which is given by

$$s^{p} = s + p_{\lambda}(s, a),$$

where the state difference follows the relative transition probability density function (pdf) $p_{\lambda}(\cdot|s, a)$. The use of $s$ in this expression does not induce distribution mismatch, since the relative transition function is unaffected by the policy. The representation of the unknown reward function differs in its inputs among tasks; for example, the information determining the rewards is not included in the observations in the default setting of the MuJoCo suite (Todorov et al., 2012; Brockman et al., 2016). This complicates the training of the reward function, but we will show that taking $(s, a, s')$ as inputs is applicable in our selected benchmarks.

Lemma 3. Assume the absolute value of the expected reward function and the expected total variation (TV) distance between the dynamics model and the real transition probability are respectively bounded by

$$\max_{s \sim S} \left|\mathbb{E}_{a \sim \pi}\, r(s, a)\right| \le r_m, \qquad \max_{t} \mathbb{E}_{a_t \sim \pi,\, s_t \sim P^{t}}\, D_{TV}\left(p(s_{t+1}|s_t, a_t)\,\|\,P(s_{t+1}|s_t, a_t)\right) \le \delta,$$

then we have

$$\left|\mathbb{E}_{s^{p}_t \sim p^{t},\, a_t \sim \pi}\left[Q(s^{p}_t, a_t)\right] - \mathbb{E}_{s_t \sim P^{t},\, a_t \sim \pi}\left[Q(s_t, a_t)\right]\right| \le O(\delta).$$

The proof can be found in Appendix C, where the statement is restated as Lemma 6; it shows that the distance between the predicted Q-value and the true Q-value is bounded linearly in $\delta$. The dynamics model is trained alongside the iterations, using random samples from the experience buffer.
Given the four-tuple sample $(s, a, r, s')$, the surrogate objective of the relative transition function is given by

$$D(\lambda) = \mathbb{E}_{(s,a,s') \sim R}\left[\tfrac{1}{2}\left(s + p_{\lambda}(s, a) - s'\right)^2\right], \tag{8}$$

where $R$ represents the replay buffer that the random samples come from. Then (8) can be optimized with the stochastic gradient

$$\hat{\nabla}_{\lambda} D(\lambda) = \mathbb{E}_{(s,a,s')}\left[\left(s + p_{\lambda}(s, a) - s'\right) \hat{\nabla}_{\lambda}\, p_{\lambda}(s, a)\right]. \tag{9}$$

Similarly, the surrogate objective for updating the reward function can be formulated as

$$D(\mu) = \mathbb{E}_{(s,a,r,s') \sim R}\left[\tfrac{1}{2}\left(r_{\mu}(s, a, s') - r\right)^2\right], \tag{10}$$

and its gradient is computed as

$$\hat{\nabla}_{\mu} D(\mu) = \mathbb{E}_{(s,a,r,s')}\left[\left(r_{\mu}(s, a, s') - r\right) \hat{\nabla}_{\mu}\, r_{\mu}(s, a, s')\right]. \tag{11}$$

A well-trained reward function is necessary for (19). Without a real-time model of the reward function, rewards sampled from the replay buffer would induce distribution errors in the model-based prediction.
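To make the squared-error objective of the relative transition function concrete, here is a minimal sketch under simplifying assumptions: states and actions are scalars, and $p_{\lambda}(s, a)$ is a hypothetical linear model $\lambda_s s + \lambda_a a$ rather than the Gaussian network used in the paper. Each step applies the per-sample gradient $(s + p_{\lambda}(s, a) - s')\,\nabla_{\lambda} p_{\lambda}(s, a)$ to a transition drawn from a toy replay buffer.

```python
import random


def fit_relative_transition(samples, lr=0.1, steps=5000, seed=0):
    """SGD on 0.5 * (s + p_lambda(s, a) - s_next)^2 with a linear p_lambda."""
    rng = random.Random(seed)
    lam_s, lam_a = 0.0, 0.0
    for _ in range(steps):
        s, a, s_next = rng.choice(samples)           # random replay sample
        err = s + (lam_s * s + lam_a * a) - s_next   # prediction residual
        lam_s -= lr * err * s                        # d p_lambda / d lam_s = s
        lam_a -= lr * err * a                        # d p_lambda / d lam_a = a
    return lam_s, lam_a


# Toy replay buffer generated by a known relative dynamics:
# s_next = s + (-0.1 * s + 0.5 * a).
data_rng = random.Random(1)
samples = []
for _ in range(200):
    s, a = data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)
    samples.append((s, a, s + (-0.1 * s + 0.5 * a)))

lam_s, lam_a = fit_relative_transition(samples)
print(lam_s, lam_a)  # close to the generating coefficients -0.1 and 0.5
```

The reward objective has the same form, with $r_{\mu}(s, a, s') - r$ as the residual.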

4.3. MODEL-BASED ACTOR-CRITIC VALUE EXPLORATION ALGORITHM

As mentioned above, we adopt the minimum over a pair of critics as the target Q-value in order to mitigate the effect of overestimation (Watkins, 1989). Besides, current and target networks are separated and soft updates are executed (Lillicrap et al., 2015; Haarnoja et al., 2018b) for all surrogate objectives, for the sake of stability. In this work, the asymptotic model-based prediction based on (19) is merged into the target value of SAC, so the loss function for updating the critic parameters in the evaluation step can be estimated by

$$L(\omega^{i}) = \mathbb{E}_{(s,a,r,s')}\left[\left(k\,Q_t(s^{p}, a') + (1-k)\,Q_t(s', a') - k\,Q_{\omega^{i}}(s, \tilde{a}) - (1-k)\,Q_{\omega^{i}}(s, a)\right)^2\right],$$

where $i \in \{1, 2\}$, $\tilde{a} = \pi_{\theta}(s)$ and $a' = \pi_{\theta'}(s')$ are the on-policy actions chosen from the current policy and the target policy parameterized by $\theta$ and $\theta'$, respectively, $(s, a, r, s')$ is a tuple of history data sampled from the experience pool, and $k$ is the asymptotic variable increasing from 0 to 1 as the time step proceeds. The two target values are

$$Q_t(s', a') = r + \gamma \min\left(Q_{\omega'^{1}}(s', a'),\, Q_{\omega'^{2}}(s', a')\right) - \frac{\alpha}{1-k} \log\left(\pi_{\theta'}(a'|s')\right),$$

$$Q_t(s^{p}, a') = \gamma \min\left(Q_{\omega'^{1}}(s^{p}, a'),\, Q_{\omega'^{2}}(s^{p}, a')\right) + r_{\mu}(s, \tilde{a}, s^{p}), \tag{14}$$

where $\omega^{1}$, $\omega^{2}$, $\omega'^{1}$ and $\omega'^{2}$ parameterize the two critic networks and their target estimates, respectively. Besides, $s^{p} = s + p_{\lambda}(s, a)$ is the on-policy next state predicted by the transition model parameterized by $\lambda$, and $\pi_{\theta'}(\cdot|s')$ is the target policy distribution conditioned on the next state $s'$. The action input of the reward function $r_{\mu}$ follows the current policy instead of being sampled from the replay buffer. By minimizing the critic loss (24), the critic parameters can be updated at each evaluation step.

Theorem 1. Assume $\mathbb{E}_{(s,a,r,s')}\left(Q_t(s', a') - Q_{\omega^{i}}(s, a)\right)^2 \le \epsilon$ for $t > T_1$, then $\exists\, T > 0$ such that $L(\omega^{i}) \le 2\epsilon$ for $t > T$.

The proof can be found in Appendix D, where the statement is restated as Theorem 3. It means that the critic loss function (24) of this work converges under the assumption that the critic loss of SAC converges.
Beyond the fact of convergence, we are interested in how the loss converges and which factors affect it. From Theorem 2 below (restated as Theorem 4 in the appendix), we see that the accuracy of both the transition model and the reward model affects the target value prediction, and that the transition model has the larger influence. Moreover, a well-trained model can guarantee convergence.

Theorem 2. Assume the absolute value of the expected reward function, the expected total variation (TV) distance between the dynamics model and the real transition probability, and the MSE of the expected difference between the modeled reward function and the immediate reward are respectively bounded by

$$\max_{s \sim S} \left|\mathbb{E}_{a \sim \pi}\, r(s, a)\right| \le r_m, \qquad \max_{t} \mathbb{E}_{a_t \sim \pi,\, s_t \sim P^{t}}\, D_{TV}\left(p(s_{t+1}|s_t, a_t)\,\|\,P(s_{t+1}|s_t, a_t)\right) \le \delta,$$

$$\max_{t} \mathbb{E}_{(s,r)}\, \mathbb{E}_{s^{p}_{t+1} \sim p^{t+1},\, a_t \sim \pi}\left(r_{\mu}(s, a_t, s^{p}_{t+1}) - r\right)^2 \le \xi,$$

then the MSE of the target prediction error is bounded by $2\xi + O(\delta^2)$.

Proof. See Appendix E.

This theorem indicates that the target prediction error does not suffer from overestimation bias or mismatched distributions: it is bounded by $2\xi + O(\delta^2)$ and becomes negligible once the learned model is accurate. Generally, the policy improvement step aims to maximize the current Q-value or its variant, which does not need the dynamics model to predict the reward or the next state. The surrogate objective function used to update the actor parameters can then be directly given by

$$J(\theta) = \mathbb{E}_{s}\left[Q_{\omega^{1,2}}(s, a) - \alpha \log\left(\pi_{\theta}(a|s)\right)\right], \tag{16}$$

where $Q_{\omega^{1,2}}(s, a) = \min\left(Q_{\omega^{1}}(s, a),\, Q_{\omega^{2}}(s, a)\right)$, $s$ comes from the tuple of history data, $a = \pi_{\theta}(s)$ is the reparameterized action based on $s$ and the policy network parameterized by $\theta$, and $\pi_{\theta}(\cdot|s)$ is the current policy distribution conditioned on the current state $s$. By maximizing (16), the actor parameters can be updated at each policy improvement step.
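The way the asymptotic variable $k$ blends the model-based and replay-based terms in the critic loss can be sketched for a single sample as follows. This is only an illustration of the weighting: the four inputs are assumed to be precomputed scalars (the two target values and the two current Q-values), and the argument names are hypothetical.

```python
def blended_critic_loss(k, target_model, target_replay, q_onpolicy, q_replay):
    """Squared error of the k-weighted target against the k-weighted current Q.

    target_model  : model-based target, evaluated at the predicted next state
    target_replay : replay-based (SAC-style) target, evaluated at the stored next state
    q_onpolicy    : current critic at (s, action from the current policy)
    q_replay      : current critic at (s, action stored in the replay buffer)
    """
    td = (k * target_model + (1.0 - k) * target_replay
          - k * q_onpolicy - (1.0 - k) * q_replay)
    return td ** 2


# k = 0 recovers the purely replay-based loss, k = 1 the purely model-based one.
print(blended_critic_loss(0.0, 5.0, 2.0, 7.0, 1.5))  # (2.0 - 1.5)^2
print(blended_critic_loss(1.0, 5.0, 2.0, 7.0, 1.5))  # (5.0 - 7.0)^2
```

In a batched implementation the same expression would be averaged over the sampled transitions, once for each of the two critics.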
For updates of the target critic parameters, we adopt "soft" target updates (Lillicrap et al., 2015) using a weighting factor $0 \le \tau < 1$ to control the speed of policy updates, for the sake of a small value error at each iteration. Except for the critic parameters, we adopt immediate updates for all other parameters in this work. In (9) and (11), the gradients in expectation form are approximated by averaging over batches sampled from past experience. We organize the above procedures as the model-based actor-critic value exploration with asymptotic planning (MAVE) algorithm, whose pseudocode is given in Algorithm 1. The algorithm alternates between running environment steps to collect experience and updating the network parameters using stochastic gradients computed on batches sampled from the experience pool. It is composed of online training and off-line planning, separated by the maximum online time step $T_1$. At the stage of online training, Step 10 in Algorithm 1 requires only a small amount of computation with the dynamics model, without the heavy computation and storage of virtual or synthetic rollouts, as explained by (14).

Algorithm 1 MAVE Algorithm
1: Initialize parameters $\omega^{1} \leftarrow \omega^{1}_{0}$, $\omega^{2} \leftarrow \omega^{2}_{0}$, $\theta \leftarrow \theta_{0}$, $\lambda \leftarrow \lambda_{0}$, $\mu \leftarrow \mu_{0}$
2: Initialize target parameters $\omega'^{1} \leftarrow \omega^{1}_{0}$, $\omega'^{2} \leftarrow \omega^{2}_{0}$, $\theta' \leftarrow \theta_{0}$
3: Initialize the learning rates $l_c$, $l_a$, $l_d$ for the critic, the actor and the dynamics model, the time step $t \leftarrow 0$, the asymptotic variable $k \leftarrow 0$, the soft update hyperparameter $\tau$, the maximum online time step $T_1$, the maximum overall time step $T$, the batch size $B$ and the replay buffer $R \leftarrow \emptyset$.
4: while $t < T_1$ do
5:   Select action $a_t \sim \pi_{\theta_t}(a_t|s_t)$
6:   Observe the reward and next state from the interaction feedback
7:   Store transition $R \leftarrow R \cup \{(s_t, a_t, r_t, s_{t+1})\}$
   for each time step do
10:   $\omega^{i}_{t+1} \leftarrow \omega^{i}_{t} - l_c \hat{\nabla}_{\omega^{i}_{t}} L(\omega^{i}_{t})$ for $i \in \{1, 2\}$, following $\hat{\nabla}_{\omega^{i}} L(\omega^{i})$
11:   $\theta_{t+1} \leftarrow \theta_{t} + l_a \hat{\nabla}_{\theta_{t}} J(\theta_{t})$, following $\hat{\nabla}_{\theta} J(\theta)$
12:   $\lambda_{t+1} \leftarrow \lambda_{t} - l_d \hat{\nabla}_{\lambda} D(\lambda)$, following Eq. (9)
13:   $\mu_{t+1} \leftarrow \mu_{t} - l_d \hat{\nabla}_{\mu} D(\mu)$, following Eq. (11)
14:   $\omega'^{i}_{t+1} \leftarrow \tau \omega^{i}_{t+1} + (1 - \tau)\, \omega'^{i}_{t}$ for $i \in \{1, 2\}$
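The soft target update in Step 14 and the growth of the asymptotic variable $k$ can be sketched as below. Parameters are represented as plain lists of floats for simplicity. The text only states that $k$ increases from 0 to 1 as the time step proceeds, so the linear schedule in `asymptotic_weight` (reaching 1 at the maximum online time step) is an assumption made purely for illustration.

```python
def soft_update(target_params, online_params, tau):
    """Target update: w_target <- tau * w_online + (1 - tau) * w_target."""
    return [tau * w + (1.0 - tau) * w_bar
            for w, w_bar in zip(online_params, target_params)]


def asymptotic_weight(t, t_online_max):
    """Hypothetical linear schedule for k, clipped to [0, 1]."""
    return min(1.0, max(0.0, t / t_online_max))


target = soft_update([0.0, 1.0], [1.0, 3.0], tau=0.5)
print(target)                       # [0.5, 2.0]
print(asymptotic_weight(50, 100))   # 0.5
```

A small `tau` makes the target critics trail the online critics slowly, which is the stabilizing effect the text attributes to soft updates.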

5.1. BENCHMARKS

The performance of our proposed method is compared with several prior model-free and model-based reinforcement learning algorithms in terms of sample complexity and stability on a set of gym

6. BASELINES

The baselines adopted for reference include TD3, SAC, BRAC and model-based policy optimization (MBPO) (Janner et al., 2019). Before the appearance of SAC, DDPG was regarded as one of the most efficient off-policy DRL methods (Duan et al., 2016), followed by TD3 as an extension. SAC has achieved state-of-the-art sample efficiency in multiple challenging continuous control domains (Christodoulou, 2019), and BRAC can be regarded as a variant of SAC that adopts an extra policy regularization based on the KL divergence between the updated and the older policy. In this work, we adopt (8) to train the transition probability instead of maximum likelihood, since the logarithm of the transition probability parameterized by the global Gaussian network tends to be unbounded. To keep the comparison fair, our proposed algorithm shares its hyperparameters with the baselines on every benchmark. In the process of collecting off-policy rollouts, Gaussian exploration noise with a fixed variance of 0.2 is added at every time step when choosing the action, and the noisy action is then clipped to the set boundary to avoid out-of-distribution (OOD) actions (Kumar et al., 2019; 2020). The discount factor is set to 0.99, and all algorithms except TD3 adopt stochastic policies and maximum prior action entropy. The stochastic policies follow Gaussian distributions with mean and variance parameterized by fully connected networks with two hidden layers, each of which has 256 units. TD3 uses a deterministic policy, also parameterized by fully connected networks with two hidden layers. The network architectures and hyperparameters are given in Appendix F and G, respectively. The Adam optimizer (Kingma & Ba, 2014) is used to update the network parameters.
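The exploration-noise step described above (Gaussian noise with variance 0.2, followed by clipping to the action bounds) could look like the following sketch. The bounds of $\pm 1$ and the function name are assumptions for illustration; the actual bounds depend on the environment.

```python
import math
import random


def noisy_clipped_action(action, low, high, variance=0.2, rng=random):
    """Add zero-mean Gaussian noise to each action component, then clip."""
    std = math.sqrt(variance)  # random.gauss expects a standard deviation
    return [min(high, max(low, a + rng.gauss(0.0, std))) for a in action]


rng = random.Random(0)
print(noisy_clipped_action([0.9, -0.9, 0.0], low=-1.0, high=1.0, rng=rng))
```

Note that a variance of 0.2 corresponds to a standard deviation of about 0.447, which is what the sampler is given here.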

6.1. RESULTS

We run 10 seeds numbered from 0 to 9 for each algorithm to keep the comparison fair. After every 500 iterations (time steps), we launch an evaluation procedure, which averages 10 rollouts per test. The average reward of a test is recorded at every evaluation procedure, and all tests over the time step scale give the result of each algorithm. The average rewards of the algorithms on the chosen benchmarks are shown in Fig. 2, with the standard deviation as the confidence interval (CI). From Figs. 2(a), 2(b) and 2(c), we observe a higher converged value and a smaller standard deviation for MAVE at late time steps compared with the baselines. At the early stage, before 1 million time steps, MAVE fluctuates because the dynamics model is still being trained, as seen in Figs. 2(a), 2(c) and 2(d). In the Hopper environment, since the converged value is far lower than in other benchmarks, the tolerance for fluctuation around convergence is much lower, which causes the instability of the tested baselines. However, MAVE shows strong robustness and reaches a converged value of up to 3700, which exceeds the state-of-the-art results on the Hopper task, as shown in Fig. 2(c). In Fig. 2(d), MAVE has a relatively stable performance above 5000. For Humanoid, with its high-dimensional action space, Fig. 2(e) shows that MAVE is much better than the baselines and converges around a score of 6200. Across all panels of Fig. 2, SAC and BRAC both have their ups and downs, and TD3 gives the worst performance, which we attribute to its lack of policy exploration. We also show the quality of the trained dynamics model by plotting the results of off-line learning without interacting with the environment after $T_1$ in Fig. 2, labeled 'MAVE-P', which is analyzed in Appendix H. Note that, due to differences in code details and the MuJoCo version (we use version 3), the converged maximum may differ somewhat from those of (Haarnoja et al., 2018b) and (Feinberg et al., 2018).
For example, the best result of HalfCheetah is 8000 in (Feinberg et al., 2018), however, it is up to 15000, which is 12000 in our case. In contrast, the best result of Ant is 6000 in (Haarnoja et al., 2018b), which is lower than the 7000 in our work and similar to Janner et al. (2019). Except for Ant, the best results in Janner et al. (2019) are closer to ours. Due to the page limit, the ablation studies are given in Appendix I.

7. CONCLUSION

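In this paper, we proposed a method that combines notations of dynamics training, model

Proof.

$$\begin{aligned}
\mathbb{E}_{s_t \sim S} \left|p^{t} - P^{t}\right|
&= \mathbb{E}_{s_t \sim S,\, s_{t-1} \sim S,\, a_{t-1} \sim \pi} \left|p(s_t|s_{t-1}, a_{t-1})\, p^{t-1} - P(s_t|s_{t-1}, a_{t-1})\, P^{t-1}\right| \\
&\le \mathbb{E}_{s_t \sim S,\, s_{t-1} \sim S,\, a_{t-1} \sim \pi} \Big( p(s_t|s_{t-1}, a_{t-1}) \left|p^{t-1} - P^{t-1}\right| + P^{t-1} \left|p(s_t|s_{t-1}, a_{t-1}) - P(s_t|s_{t-1}, a_{t-1})\right| \Big) \\
&\le \mathbb{E}_{s_{t-1} \sim S,\, a_{t-1} \sim \pi} \left|p^{t-1} - P^{t-1}\right| + \mathbb{E}_{s_t \sim S,\, s_{t-1} \sim P^{t-1},\, a_{t-1} \sim \pi} \left|p(s_t|s_{t-1}, a_{t-1}) - P(s_t|s_{t-1}, a_{t-1})\right| \\
&= \mathbb{E}_{s_{t-1} \sim S} \left|p^{t-1} - P^{t-1}\right| + 2\, \mathbb{E}_{a_{t-1} \sim \pi,\, s_{t-1} \sim P^{t-1}}\, D_{TV}\left(p(s_t|s_{t-1}, a_{t-1})\,\|\,P(s_t|s_{t-1}, a_{t-1})\right) \\
&\le \mathbb{E}_{s_{t-1} \sim S} \left|p^{t-1} - P^{t-1}\right| + 2\delta \\
&\le \left|p(s_0) - P(s_0)\right| + 2t\delta \\
&= 2t\delta,
\end{aligned}$$

where $p^{t} = p^{t}(s_t)$ and $P^{t} = P^{t}(s_t)$, and the last equality holds because the initial distribution is not affected by transitions. This proof partly follows Janner et al. (2019).

B PROOF OF LEMMA 5

$$\mathcal{P}^{\pi} Q(s_t, a_t) = r_{\mu}(s_t, a_t, s^{p}_{t+1}) + \gamma\, \mathbb{E}_{s_{t+1}}\left[Q(s^{p}_{t+1}, a_{t+1})\right], \tag{19}$$

Lemma 5. Consider the sequence $Q_{t+1} = \mathcal{P}^{\pi} Q_t$ constructed by (19). Given the condition that the Q-values are bounded, i.e., $|Q_t(s, a)| < \infty$, $\forall (s, a) \in S \times A$, the sequence $Q_t$ converges to a unique optimal value as $t \to \infty$.

Proof.

$$\begin{aligned}
\left|\mathcal{P}^{\pi} Q(s_t, a_t) - \mathcal{P}^{\pi} Q'(s_t, a_t)\right|
&\le \gamma \left|\mathbb{E}_{s_{t+1}}\left[Q(s^{p}_{t+1}, a_{t+1}) - Q'(s^{p}_{t+1}, a_{t+1})\right]\right| \\
&\le \gamma \max_{s_{t+1}} \left|Q(s^{p}_{t+1}, a_{t+1}) - Q'(s^{p}_{t+1}, a_{t+1})\right| \\
&= \gamma \left\|Q - Q'\right\|_{\infty},
\end{aligned} \tag{20}$$

where $\|\cdot\|_{\infty}$ means the max norm. Since the Q-value is assumed to be bounded, the second inequality holds. We conclude that (20) holds $\forall (s_t, a_t) \in S \times A$, which can be rewritten as $\left\|\mathcal{T}^{\pi} Q - \mathcal{T}^{\pi} Q'\right\|_{\infty} \le \gamma \left\|Q - Q'\right\|_{\infty}$, which means $\mathcal{P}^{\pi} Q(s,$

$$\le r_m \sum_{t} \mathbb{E}_{s_t \sim S}\, \gamma^{t} \left|p^{t}(s_t) - P^{t}(s_t)\right| \le 2 r_m \delta \sum_{i=0}^{t} \left[i \gamma^{i}\right] \le \frac{2 r_m \delta \gamma}{(1 - \gamma)^2},$$

where the second last inequality is due to Lemma 4.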
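The final inequality above relies on the series bound $\sum_{i \ge 0} i\gamma^{i} = \gamma/(1-\gamma)^2$ for $0 < \gamma < 1$. A quick numerical check, using the discount factor 0.99 from the experiments, is sketched below; the helper name is hypothetical.

```python
def weighted_geometric_sum(gamma, horizon):
    """Partial sum of i * gamma**i for i = 0, ..., horizon."""
    return sum(i * gamma ** i for i in range(horizon + 1))


gamma = 0.99
closed_form = gamma / (1.0 - gamma) ** 2   # limit of the series
partial = weighted_geometric_sum(gamma, 5000)

# Every partial sum stays below the closed form and approaches it.
print(weighted_geometric_sum(gamma, 10) < closed_form)
print(closed_form - partial)
```

With $\gamma = 0.99$ the closed form is roughly 9900, so the constant hidden in the $O(\delta)$ bound grows quickly as the discount factor approaches 1.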

D PROOF OF THEOREM 3

Theorem 3. Assume $\mathbb{E}_{(s,a,r,s')}\big(Q_t(s',a') - Q_{\omega_i}(s,a)\big)^2 \le \epsilon$ for $t > T_1$; then $\exists T$ such that $L(\omega_i) \le 2\epsilon$ for $t > T$.

Proof. According to Lemma 5, $\exists T_2 > 0$ such that $\mathbb{E}_{(s,a,r,s')}\big(Q_t(s',a') - Q_{\omega_i}(s,a)\big)^2 \le \epsilon$ for $t > T_2$. Then, for $T > \max\{T_1, T_2\}$,

$$
L(\omega_i) \le 2\,\mathbb{E}_{(s,a,r,s')}\Big[k^2\big(Q_t(s^p,a') - Q_{\omega_i}(s,a)\big)^2 + (1-k)^2\big(Q_t(s',a') - Q_{\omega_i}(s,a)\big)^2\Big] \le \cdots
$$

$$
\begin{aligned}
\cdots) &+ \gamma Q(s^p_{t+1},a_{t+1}) - Q(s,a_t)\big]^2 \\
&= \mathbb{E}_{(s,r)}\,\mathbb{E}_{s^p_{t+1}\sim p_{t+1},\,s_{t+1}\sim P_{t+1},\,a_t\sim\pi,\,a_{t+1}\sim\pi}\big[r_\mu - r + \gamma Q(s^p_{t+1},a_{t+1}) - \gamma Q(s_{t+1},a_{t+1})\big]^2 \\
&\le 2\xi + 2\gamma^2\,\mathbb{E}_{s^p_{t+1}\sim p_{t+1},\,s_{t+1}\sim P_{t+1},\,a_{t+1}\sim\pi}\big[Q(s^p_{t+1},a_{t+1}) - Q(s_{t+1},a_{t+1})\big]^2 \\
&\le 2\xi + \frac{8 r_m^2 \delta^2 \gamma^4}{(1-\gamma)^4},
\end{aligned}
$$

where $(s, r)$ is a random sample from the replay buffer, and the last inequality follows from Lemma 6. The prediction error is therefore bounded by $2\xi + \frac{8 r_m^2 \delta^2 \gamma^4}{(1-\gamma)^4}$.
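The constants in the last two steps can be traced as follows. This is a sketch that assumes Lemma 6 bounds the gap between the two Q-values by $2 r_m \delta \gamma/(1-\gamma)^2$, as the end of its proof suggests, and that $\xi$ bounds the squared reward error; the shorthand $x$ and $y$ is introduced here only for illustration.

```latex
\begin{gather*}
(x + y)^2 \le 2x^2 + 2y^2,
\qquad x = r_\mu - r,
\qquad y = \gamma\big[Q(s^p_{t+1}, a_{t+1}) - Q(s_{t+1}, a_{t+1})\big], \\
2\gamma^2 \left(\frac{2 r_m \delta \gamma}{(1-\gamma)^2}\right)^2
= 2\gamma^2 \cdot \frac{4 r_m^2 \delta^2 \gamma^2}{(1-\gamma)^4}
= \frac{8 r_m^2 \delta^2 \gamma^4}{(1-\gamma)^4}.
\end{gather*}
```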

F NETWORK ARCHITECTURE

We construct the critic network as a fully connected MLP with two hidden layers. Its input is composed of the state and action, and its output is a single value representing the Q-value. ReLU activations are used in both hidden layers. The policy network parameterizes a normal distribution whose expectation and variance are fully connected networks fed only by the state. Both have two hidden layers with ReLU activations, followed by a Tanh function for the expectation and a Softplus function for the variance. The resulting expectation and variance define the normal distribution that represents the stochastic policy. The transition-probability network is constructed like the policy network but without the Tanh clipping, and its input is composed of the state and action instead. The reward network is similar to the critic, with an input composed of the state, the action and the next state. The network architectures are plotted in Fig. 3; for simplicity, the reward network is omitted from that figure. The architecture described above is adopted for the stochastic policy. For algorithms that use a deterministic policy, the critic is constructed in the same way, but the actor is a deterministic network built from fully connected dense layers.

the learning rate of the dynamics, including the transition probability and the reward. $\tau_a$ and $\tau_c$ denote the soft-update hyperparameters of the actor and the critic, respectively, and $\tau_a = 1$ means that we adopt an immediate update for the actor. The symbol var denotes the variance of the Gaussian exploration noise, and $\alpha$ is the fixed temperature hyperparameter for the maximum posteriori action entropy term, which is applied in all algorithms except DDPG and TD3. $\alpha_d$ denotes the weight factor of the KL divergence for policy regularization applied in BRAC, $\beta$ is the temperature hyperparameter that tunes the impact of the posteriori transition entropy in MAVE, and $\eta$ is the asymptotic rise rate for k t = 1 -β t.
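The role of the soft-update hyperparameters can be illustrated with a minimal sketch. It assumes the common Polyak-averaging convention, in which the target weights move a fraction $\tau$ toward the online weights; the function name `soft_update`, the flat weight lists and the value 0.005 are hypothetical and not taken from the paper.

```python
def soft_update(target, online, tau):
    """Polyak-style update: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_targ for w, w_targ in zip(online, target)]


online = [1.0, -2.0, 0.5]   # stand-in for the online network weights
target = [0.0, 0.0, 0.0]    # stand-in for the target network weights

slow = soft_update(target, online, tau=0.005)  # small tau: slow tracking
hard = soft_update(target, online, tau=1.0)    # tau = 1: immediate update
```

Under this convention, `tau=1.0` simply copies the online weights, which matches the description of $\tau_a = 1$ as an immediate update.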



Figure 1: (a) Ant-v3; (b) Halfcheetah-v3; (c) Hopper-v3; (d) Walker2d-v3; (e) Humanoid-v3


Figure 2: Average reward of off-line training and online-planning versus time step in (a) Ant-v3; (b) Halfcheetah-v3; (c) Hopper-v3; (d) Walker2d-v3; (e) Humanoid-v3

Figure 3: Architecture of networks.
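The architecture described in Appendix F can be sketched as plain forward passes. This is an illustrative sketch only: the layer width, the dimensions, the weight initialization, the linear output layer before each Tanh or Softplus, and all function and variable names are assumptions, and no training is shown.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_mlp(sizes):
    """Random weights for a fully connected network with the given layer sizes."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def mlp(params, x):
    """Two ReLU hidden layers followed by a linear output layer."""
    for w, b in params[:-1]:
        x = np.maximum(x @ w + b, 0.0)
    w, b = params[-1]
    return x @ w + b


def softplus(x):
    return np.logaddexp(0.0, x)  # numerically stable log(1 + exp(x))


state_dim, action_dim, hidden = 4, 2, 64  # hypothetical sizes

critic = init_mlp([state_dim + action_dim, hidden, hidden, 1])
policy_mean = init_mlp([state_dim, hidden, hidden, action_dim])
policy_var = init_mlp([state_dim, hidden, hidden, action_dim])
trans_mean = init_mlp([state_dim + action_dim, hidden, hidden, state_dim])
trans_var = init_mlp([state_dim + action_dim, hidden, hidden, state_dim])
reward = init_mlp([2 * state_dim + action_dim, hidden, hidden, 1])

s = rng.normal(size=state_dim)

# Gaussian policy: Tanh on the expectation, Softplus on the variance.
mu_a, var_a = np.tanh(mlp(policy_mean, s)), softplus(mlp(policy_var, s))
a = mu_a + np.sqrt(var_a) * rng.normal(size=action_dim)

sa = np.concatenate([s, a])
q = mlp(critic, sa)  # Q(s, a)

# Transition model: same form as the policy, without the Tanh clipping.
mu_s, var_s = mlp(trans_mean, sa), softplus(mlp(trans_var, sa))
s_pred = mu_s + np.sqrt(var_s) * rng.normal(size=state_dim)

# Reward model: critic-like network fed with state, action and next state.
r_pred = mlp(reward, np.concatenate([s, a, s_pred]))
```

Keeping the expectation and variance as separate networks follows the textual description; a shared trunk with two heads would be an equally plausible reading.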

Figure 4: Ablation Study on 3 components in (a) Ant-v3; (b) Halfcheetah-v3; (c) Hopper-v3; (d) Walker2d-v3; (e) Humanoid-v3


prediction, off-line training and on-line planning, which jointly yield a simple solution to value exploration. Our work is sensitive to the precision of the dynamics, especially of the transition model; however, it is free from extra storage and computation costs and can greatly reduce estimation errors and distribution mismatches. The minor costs of our work are the networks required for the dynamics and a hyperparameter (given in Appendix G) that controls the speed of the asymptotic dynamics training.

Lemma 4. Assume the expected total variation distance between two transition distributions is bounded by

$$
\max_t \mathbb{E}_{a_t\sim\pi,\,s_t\sim P_t}\,D_{TV}\big(p(s_{t+1}|s_t,a_t)\,\|\,P(s_{t+1}|s_t,a_t)\big) \le \delta, \tag{17}
$$

then we have $\mathbb{E}_{s_t\sim S}\,|p_t(s_t) - P_t(s_t)| \le 2t\delta$.
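The statement of Lemma 4 can be checked numerically on a small finite Markov chain. The sketch below is an illustration under simplifying assumptions, not part of the paper: the state space is discrete, the policy is folded into the transition matrices, and $\delta$ is taken as the worst-case per-state total variation distance, which upper-bounds the expectation in (17). All names are hypothetical.

```python
import random


def normalize(row):
    total = sum(row)
    return [x / total for x in row]


def step(dist, matrix):
    """Push a state distribution one step through a row-stochastic matrix."""
    n = len(dist)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def tv(p_row, q_row):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p_row, q_row))


rng = random.Random(0)
n_states, horizon = 5, 20

# True dynamics P and a slightly perturbed model p.
P_true = [normalize([rng.random() for _ in range(n_states)]) for _ in range(n_states)]
p_model = [normalize([x + 0.05 * rng.random() for x in row]) for row in P_true]

# Worst-case per-state TV distance, an upper bound on the expected TV in (17).
delta = max(tv(p_model[s], P_true[s]) for s in range(n_states))

p_t = P_t = [1.0 / n_states] * n_states  # shared initial distribution
gaps = []
for t in range(horizon + 1):
    gap = sum(abs(a - b) for a, b in zip(p_t, P_t))  # L1 gap at step t
    gaps.append(gap)
    assert gap <= 2 * t * delta + 1e-12  # the bound of Lemma 4
    p_t, P_t = step(p_t, p_model), step(P_t, P_true)
```

The gap starts at zero because both chains share the initial distribution, and it can grow by at most $2\delta$ per step, which is the recursion used in the proof.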

a) is a max-norm contraction mapping. According to the contraction property, the sequence $Q_{k+1} = P^\pi Q_k$ converges to a unique fixed point.

Lemma 6. Assume the absolute value of the expected reward function and the expected total variation distance between the dynamics model and the real transition probability are respectively bounded by max

Theorem 4. Assume the absolute value of the expected reward function, the expected total variation distance between the dynamics model and the real transition probability, and the MSE of the expected difference between the modeled reward function and the immediate reward are respectively bounded by $\max\,\mathbb{E}_{a_t\sim\pi,\,s_t\sim P_t}\,D_{TV}\big(p(s_{t+1}|s_t,a_t)\,\|\,P(s_{t+1}|s_t,a_t)\big) \le \delta,$

lists the common hyperparameters shared by all experiments and their respective settings. In this table, $L_a$ denotes the learning rate of the actor, $L_c$ denotes the learning rate of the critics, and $L_d$ denotes

H OFF-LINE PERFORMANCE

In the Halfcheetah environment, the maximum online time step $T_1$ is set to 1 million. From 1 million to 3 million steps, the agent therefore stops interacting with the environment and performs training and planning based entirely on the fixed experience buffer, more specifically on the initial states $s$ of the stored four-tuples $(s, a, r, s')$. In Fig. 2(b), MAVE-P denotes the off-line performance after $T_1$; it shows that the model preserves the performance reached before online training is stopped, at a point where MAVE still has great potential to keep improving. Similar behavior can be observed in Figs. 2(a), 2(d) and 2(e), with the same $T_1$, for the Ant, Walker2d and Humanoid environments, respectively. The maximum online time step in Hopper is set to 0.5 million, which is smaller than in the other tasks because the converged return in Hopper is much lower. We also note that Hopper has not reached a good point when the on-line training is stopped; nevertheless, MAVE-P still converges to 3500 by relying solely on off-line training.
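The two-phase schedule described above can be sketched as a toy loop. This is a hedged illustration, not the authors' training code: the environment is a one-dimensional stand-in, the update itself is omitted, and `run_schedule`, `t1` and `batch_size` are hypothetical names.

```python
import random


def run_schedule(total_steps, t1, batch_size=4, seed=0):
    """Toy loop: collect transitions until t1, then reuse the frozen buffer."""
    rng = random.Random(seed)
    buffer, sizes = [], []
    state = 0.0
    for step in range(total_steps):
        if step < t1:
            # On-line phase: interact with a stand-in environment, store (s, a, r, s').
            action = rng.uniform(-1.0, 1.0)
            next_state = state + action
            buffer.append((state, action, -abs(next_state), next_state))
            state = next_state
        # Both phases sample from the buffer; after t1 the buffer no longer grows.
        batch = rng.sample(buffer, min(batch_size, len(buffer)))
        # Off-line, only the stored initial states s are needed; a real agent
        # would run model prediction and the network updates from them here.
        start_states = [s for (s, _a, _r, _s_next) in batch]
        sizes.append(len(buffer))
    return sizes


sizes = run_schedule(total_steps=30, t1=10)
```

In this toy run the buffer size stops growing once `t1` is reached, mirroring the point at which the agent stops interacting with the environment.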

I ABLATION STUDY

To investigate the contribution of the individual parts of the proposed value exploration, we replace the modeled reward function with the immediate reward, replace the model-predicted next state $s^p$ with the buffer-sampled $s'$, and replace the on-policy current action with the buffer-sampled action in the Q-value function; these variants are labeled 'MAVE-R', 'MAVE-P' and 'MAVE-Q', respectively. The three variants are compared with MAVE in Fig. 4, which shows that all three components are necessary. Specifically, removing the modeled reward function (part of the dynamics) and removing the next state predicted by the modeled transition probability induce distribution mismatches in MAVE-R and MAVE-P, respectively, while MAVE-Q introduces estimation error, as can be analyzed through the proof of Theorem 4.
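The first two substitutions can be pictured as switches on a one-step target of the form $r_\mu + \gamma Q(s^p, a')$. The sketch below is illustrative only: `target_value`, its flags and the toy scalar components are hypothetical, the one-step form covers only the model-based part of the target, and the MAVE-Q variant (which changes the action fed to the Q-function) is not modeled.

```python
def target_value(transition, q_fn, policy, reward_model, transition_model,
                 gamma=0.99, use_model_reward=True, use_model_state=True):
    """One-step target with optional ablation switches (illustrative only)."""
    s, a, r, s_next = transition
    # MAVE-P-style ablation: use the buffer-sampled s' instead of the predicted s^p.
    s_target = transition_model(s, a) if use_model_state else s_next
    # MAVE-R-style ablation: use the immediate reward r instead of the modeled reward.
    reward = reward_model(s, a, s_target) if use_model_reward else r
    a_next = policy(s_target)  # next action from the current policy
    return reward + gamma * q_fn(s_target, a_next)


# Toy scalar stand-ins for the learned components.
def q_fn(s, a):
    return s + a


def policy(s):
    return 0.5


def reward_model(s, a, s_next):
    return 1.0


def transition_model(s, a):
    return s + a


sample = (1.0, 2.0, 0.0, 10.0)  # (s, a, r, s')
parts = (q_fn, policy, reward_model, transition_model)
full = target_value(sample, *parts)
no_model_reward = target_value(sample, *parts, use_model_reward=False)
no_model_state = target_value(sample, *parts, use_model_state=False)
```

Each flag changes exactly one ingredient of the target, which is the sense in which the ablation isolates the contribution of the modeled reward and of the predicted next state.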

