MAKING BETTER DECISION BY DIRECTLY PLANNING IN CONTINUOUS CONTROL

Abstract

By properly utilizing the learned environment model, model-based reinforcement learning methods can improve the sample efficiency for decision-making problems. Beyond using the learned environment model to train a policy, the success of MCTS-based methods shows that directly incorporating the learned environment model as a planner to make decisions might be more effective. However, when the action space is high-dimensional and continuous, directly planning according to the learned model is costly and non-trivial because of two challenges: (1) the infinite number of candidate actions and (2) the temporal dependency between actions in different timesteps. To address these challenges, inspired by Differential Dynamic Programming (DDP) in optimal control theory, we design a novel Policy Optimization with Model Planning (POMP) algorithm, which incorporates a carefully designed Deep Differential Dynamic Programming (D3P) planner into the model-based RL framework. In the D3P planner, (1) to effectively plan in the continuous action space, we construct a locally quadratic programming problem that uses a gradient-based optimization process to replace search. (2) To take the temporal dependency of actions at different timesteps into account, we leverage the updated and latest actions of previous timesteps (i.e., steps $1, \dots, h-1$) to update the action of the current step (i.e., step $h$), instead of updating all actions simultaneously. We theoretically prove the convergence rate for our D3P planner and analyze the effect of the feedback term. In practice, to effectively apply the neural network based D3P planner in reinforcement learning, we leverage the policy network to initialize the action sequence and keep the action update conservative in the planning process. Experiments demonstrate that POMP consistently improves sample efficiency on widely used continuous control tasks. Our code is released at https://github.com/POMP-D3P/POMP-D3P.

1. INTRODUCTION

Model-based reinforcement learning (RL) (Janner et al., 2019a; Yu et al., 2020; Schrittwieser et al., 2020; Hafner et al., 2021) has shown its promise to be a general-purpose tool for solving sequential decision-making problems. Different from model-free RL algorithms (Mnih et al., 2015; Haarnoja et al., 2018), for which the controller directly learns a complex policy from real off-policy data, model-based RL methods first learn a predictive model of the unknown dynamics and then leverage the learned model to help the policy learning. With several key innovations (Janner et al., 2019a; Clavera et al., 2019), model-based RL algorithms have shown outstanding data efficiency and performance compared to their model-free counterparts, which makes it possible to apply them to real-world physical systems where data collection is arduous and time-consuming (Moerland et al., 2020).

There are mainly two directions to leverage the learned model in model-based RL, though they are not mutually exclusive. In the first class, the model plays an auxiliary role and only affects the decision-making by helping the policy learning (Janner et al., 2019b; Clavera et al., 2019). In the second class, the model is used to sample pathwise trajectories and then score the sampled actions (Schrittwieser et al., 2020). Our work falls into the second class and directly uses the model as a planner (rather than only helping the policy learning). Some recent papers (Dong et al., 2020; Hubert et al., 2021; Hansen et al., 2022b) have started walking in this direction, and they have shown some cases that support the motivation behind it. For example, in some scenarios (Dong et al., 2020), the policy might be very complex while the model is relatively simple to learn. This idea is easy to implement in the discrete action space, where MCTS is a powerful search-based planner (Silver et al., 2016; 2017; Schrittwieser et al., 2020; Hubert et al., 2021).

However, when the action space is continuous, the tree-based search method cannot be applied trivially. There are two key challenges. (1) Continuous and high-dimensional actions imply that the number of candidate actions is infinite. (2) The temporal dependency between actions implies that the action update in previous timesteps can influence the later actions. Thus, trajectory optimization in continuous action space is still a challenge and has not been investigated enough.

To address the above challenges, in this paper, we propose a Policy Optimization with Model Planning (POMP) algorithm in the model-based RL framework, in which a novel Deep Differential Dynamic Programming (D3P) planner is designed. Since model-based RL is closely related to optimal control theory, the high efficiency of the differential dynamic programming (DDP) algorithm (Pantoja, 1988; Tassa et al., 2012) in optimal control theory inspires us to design an algorithm based on dynamic programming. However, since DDP requires a known model and has a high computational cost, applying the DDP algorithm to deep RL is nontrivial.

The D3P planner aims to optimize the action sequence in the trajectory. The key innovation in D3P is that we leverage the first-order Taylor expansion of the optimal Bellman equation to get the action update signal efficiently, which intuitively exploits the differentiability of the learned model. We can theoretically prove the convergence rate of D3P under mild assumptions. Specifically, (1) D3P uses the first-order Taylor expansion of the optimal Bellman equation but still constructs a local quadratic objective function. Thus, by leveraging the analytic formulation of the minimizer of the quadratic function, D3P can efficiently get the locally optimal action. (2) Besides, a feedback term is introduced in D3P with the help of the Bellman equation. In this way, D3P updates the action in the current step by considering the action updates in previous timesteps during planning.

Note that D3P is a plug-and-play algorithm that introduces no extra parameters. When we integrate the D3P planner into our POMP algorithm under the model-based RL framework, the practical challenge is that the neural-network-based learned model is highly nonlinear and has limited generalization ability. Hence the planning process may be misled when the initialization is bad or the action is out-of-distribution. Therefore, we propose to leverage the learned policy to initialize the actions before planning and to add a conservative term during planning, in order to keep the error of the learned model small along the planning process. Overall, our POMP algorithm closely integrates the learned model, the critic, and the policy to make better decisions.

For evaluation, we conduct several experiments on the benchmark MuJoCo continuous control tasks. The results show that our proposed method can significantly improve the sample efficiency and asymptotic performance. Besides, comprehensive ablation studies are also performed to verify the necessity and effectiveness of our proposed D3P planner.

The contributions of our work are summarized as follows: (1) We theoretically derive the D3P planner and prove its convergence rate. (2) We design a POMP algorithm, which refines the actions in the trajectory with the D3P planner in an efficient way. (3) Extensive experimental results demonstrate the superiority of our method in terms of both sample efficiency and asymptotic performance.

2. RELATED WORK

The full version of the related work is in Appendix A; we briefly introduce several highly related works here. In general, model-based RL for solving decision-making problems can be divided into three perspectives: model learning, policy learning, and decision-making. Moreover, optimal control theory also concerns the decision-making problem and is deeply related to model-based RL.

Model learning: How to learn a good model to support decision-making is crucial in model-based RL. There are two main aspects of the work: the model structure designing (Chua et al., 2018; Zhang et al., 2021; 2020; Hafner et al., 2021; Chen et al., 2022) and the loss designing (D'Oro et al., 2020; Farahmand et al., 2017; Li et al., 2021).

Policy learning: Two methods are commonly used to learn the policy with the learned model. One is to use the learned model as a black-box simulator to generate data (Janner et al., 2019b; Yu et al., 2020; Lee et al., 2020).
Another way is to use the learned model to calculate the policy gradient (Heess et al., 2015b; Clavera et al., 2019; Amos et al., 2021).

Decision-making: When making the decision, we need to generate the actions that can achieve our goal. Many of the model-based RL methods make the decision by using the learned policy solely (Hafner et al., 2021). Similar to our paper, some works also try to make decisions by using the learned model, but the majority only focus on the discrete action space. The well-known MCTS method has achieved a lot of success, for example in AlphaZero (Silver et al., 2017) and MuZero (Schrittwieser et al., 2020). There are only a few works that study the continuous action space, such as Continuous UCT (Couëtoux et al., 2011), Sampled MuZero (Hubert et al., 2021), TreePI (Springenberg et al., 2020), and TD-MPC (Hansen et al., 2022a).

Optimal control theory: Beyond deep RL, optimal control also considers the decision-making problem but relies on a known and continuous transition model. In modern optimal control, the Model Predictive Control (MPC) framework (Camacho & Alba, 2013) is commonly adopted when the environment is highly non-linear. In MPC, the action is planned during the execution by using the model, and such a procedure is called trajectory optimization. Plenty of previous works (Byravan et al., 2021; Chua et al., 2018; Pinneri et al., 2021; Nagabandi et al., 2020) use the MPC framework to solve continuous control tasks, but most of them rely on zero-order or sample-based methods to do the planning. The most relevant works are DDP (Murray & Yakowitz, 1984), iLQR (Li & Todorov, 2004), and iLQG (Todorov & Li, 2005; Tassa et al., 2012). We discuss the detailed differences between our method and these methods in Appendix A.

Since our planning algorithm relies on the learned model and learned policy, we build our algorithm on these works on model learning and policy learning.
Our POMP algorithm tries to solve a more challenging task than the related work on decision-making: efficiently optimizing the trajectory in continuous action space when the environment model is unknown. Different from our work, MPC with DDP as the trajectory optimizer from optimal control theory requires a known environment model, and also requires the Hessian matrix for online optimization from scratch.

3. PRELIMINARIES

Reinforcement Learning. We consider a discrete-time Markov Decision Process (MDP) $\mathcal{M}$, defined by the tuple $(\mathcal{X}, \mathcal{A}, f, r, \gamma)$, where $\mathcal{X}$ is the state space, $\mathcal{A}$ is the action space, $f: x_{t+1} = f(x_t, a_t)$ is the transition model, $r: \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is the reward function, and $\gamma$ is the discount factor. We denote the future discounted return at time $t$ as $R_t = \sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'}$, and Reinforcement Learning (RL) aims to find a policy $\pi_\theta: \mathcal{X} \times \mathcal{A} \to \mathbb{R}^{+}$ that maximizes the expected return $J$, where

$$\max_\theta J(\theta) = \max_\theta \mathbb{E}_{\pi_\theta}[R_t] = \max_\theta \mathbb{E}_{\pi_\theta}\Big[\sum_{t'=t}^{\infty} \gamma^{t'-t} r(x_{t'}, a_{t'})\Big].$$

Bellman Equation. We define the optimal value function $V^*(x) = \max \mathbb{E}[R_t \mid x_t = x]$. The optimal value function obeys an important identity known as the Bellman optimality equation

$$V^*(x) = \max_{a_t} \mathbb{E}\big[r(x_t, a_t) + \gamma V^*(x_{t+1}) \mid x_t = x\big].$$

The idea behind this equation is that if we know $r(x_t, a_t)$ for any $a_t$ and the next-step value function $V^*(x_{t+1})$ for any $x_{t+1}$, we can recursively select the action $a_t$ which maximizes $r(x_t, a_t) + \gamma V^*(x_{t+1})$ given $x_t = x$. Similarly, we can denote the optimal action-value function $Q^*(x, a) = \max \mathbb{E}[R_t \mid x_t = x, a_t = a]$, and it obeys a similar Bellman optimality equation

$$Q^*(x, a) = \max_{a_{t+1}} \mathbb{E}\big[r(x_t, a_t) + \gamma Q^*(x_{t+1}, a_{t+1}) \mid x_t = x, a_t = a\big].$$

Model-based RL. Model-based RL methods distinguish themselves from their model-free counterparts by using the data to learn a transition model. Following Janner et al. (2019a) and Clavera et al. (2019), we use parametric neural networks to approximate the transition function, reward function, policy function and Q-value function, with the following objective functions to be optimized:

$$J_f(\psi) = \mathbb{E}\big[\log f(x_{t+1} \mid x_t, a_t)\big], \qquad J_r(\omega) = \mathbb{E}\big[\log r(r_t \mid x_t, a_t)\big],$$

$$J_\pi(\theta) = \mathbb{E}\Big[\sum_{t=0}^{H-1} \gamma^t r(x_t, a_t) + \gamma^H Q(x_H, a_H)\Big], \qquad J_Q = \mathbb{E}\big[\lVert Q(x_t, a_t) - (r + Q(x_{t+1}, a_{t+1})) \rVert^2\big],$$

respectively. In $J_\pi(\theta)$, we truncate the trajectory at horizon $H$ to avoid long model rollouts.

Notations.
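For the one-dimensional state and action case, we denote the partial derivative of a function by its output with subscripts, e.g., $r_x \triangleq \frac{\partial r(x,a)}{\partial x}$, $r_a \triangleq \frac{\partial r(x,a)}{\partial a}$, $f_x \triangleq \frac{\partial f(x,a)}{\partial x}$, $f_a \triangleq \frac{\partial f(x,a)}{\partial a}$, $Q_x \triangleq \frac{\partial Q(x,a)}{\partial x}$ and $Q_a \triangleq \frac{\partial Q(x,a)}{\partial a}$. See Appendix E for the multi-dimension case.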
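To make the truncated objective $J_\pi$ and the subscript notation concrete, the following is a minimal Python sketch for the one-dimensional case. It is an illustration only, not the released implementation: the function names (`truncated_return`, `partial_x`, `partial_a`) and the toy model, reward, critic and policy are hypothetical stand-ins for the learned networks, and the derivatives are approximated by central finite differences rather than automatic differentiation.

```python
def truncated_return(x0, policy, f, r, q, gamma, horizon):
    """Compute sum_{t<H} gamma^t r(x_t, a_t) + gamma^H Q(x_H, a_H) along a model rollout."""
    x, total, discount = x0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(x)
        total += discount * r(x, a)
        x = f(x, a)           # roll the (learned) model forward one step
        discount *= gamma
    # The critic closes the truncated trajectory at step H.
    return total + discount * q(x, policy(x))


def partial_x(fn, x, a, eps=1e-5):
    """Central finite-difference estimate of d fn / d x, i.e. r_x, f_x or Q_x."""
    return (fn(x + eps, a) - fn(x - eps, a)) / (2.0 * eps)


def partial_a(fn, x, a, eps=1e-5):
    """Central finite-difference estimate of d fn / d a, i.e. r_a, f_a or Q_a."""
    return (fn(x, a + eps) - fn(x, a - eps)) / (2.0 * eps)


# Hypothetical toy problem (assumption, not from the paper).
def toy_f(x, a):
    return x + a


def toy_r(x, a):
    return -(x ** 2) - 0.1 * a ** 2


def toy_q(x, a):
    return 0.0


def toy_policy(x):
    return -0.5 * x


value = truncated_return(1.0, toy_policy, toy_f, toy_r, toy_q, gamma=0.9, horizon=2)
r_x = partial_x(toy_r, 1.0, -0.5)
r_a = partial_a(toy_r, 1.0, -0.5)
f_x = partial_x(toy_f, 1.0, -0.5)
f_a = partial_a(toy_f, 1.0, -0.5)
```

In practice these derivatives would come from differentiating the neural networks directly; finite differences are used here only to keep the sketch dependency-free.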

4. PLANNING IN CONTINUOUS ACTION SPACE

In this section, we present our POMP algorithm and the D3P planner in detail. First, we derive the D3P planner which relies on the Bellman equation. Then, we theoretically prove its convergence property. Finally, we show how to effectively apply D3P planner in our POMP algorithm in RL.

4.1. DEEP DIFFERENTIAL DYNAMIC PROGRAMMING

In this subsection, we theoretically derive the D3P planner and prove its convergence property. There are mainly two challenges in continuous action space planning: (1) the infinite number of candidate actions, and (2) the temporal dependency between actions in different timesteps. Here, we briefly introduce the main idea of our D3P planner to solve the above challenges. We first define an objective function and formulate it as an optimization problem based on the Bellman equation. Then, we convert it to a local optimization problem and approximate the objective function via Taylor expansion. To avoid the computation of the Hessian matrix, we use the first-order Taylor expansion to construct a quadratic function. Since the analytical solution of a quadratic function is easy to get, we can efficiently get the locally optimal action sequence and thus overcome challenge (1) to some extent. To get over challenge (2), we introduce a feedback term into the objective function to capture the state change induced by the action updates in prior timesteps. By considering the feedback term, which explicitly involves the information of prior action updates, we can correct the action update in time. The remaining question is whether the D3P planner can indeed optimize the original objective after we make several approximations when deriving the algorithm. Through theoretical analysis, we show that the convergence rate of the proposed algorithm can be guaranteed.

We now introduce how we derive the D3P planner. For clarity, we use the finite-horizon MDP as a proof-of-concept setting, in which the state and action are one-dimensional variables. The infinite-horizon MDP with multi-dimensional state and action can be derived similarly, and we put it in Appendix E. Recalling the goal of RL methods, our planning algorithm aims to find the action sequence $\{a_1, \dots, a_H\}$ that maximizes the value function $V(x_1, 1) \triangleq \max_{a_1, \dots, a_H} \sum_{h=1}^{H} r(x_h, a_h)$, where $x_{h+1} = f(x_h, a_h)$.
Due to challenge (1), such an optimal action sequence is in general hard to find. Hence our D3P planner treats this optimal action sequence searching problem as an optimization problem that leverages the optimal Bellman equation to formulate the following objective function,

$$V(x_h, h) = \max_{a_h} \big[r(x_h, a_h) + V(f(x_h, a_h), h+1)\big]. \quad (1)$$

Since the reward function and the transition function are unknown, we use neural networks to approximate them. However, the optimization problem is highly non-convex. Thus, we consider an auxiliary goal, which is to find the locally optimal $a + \delta a$ in the neighbourhood of the current action $a$, improving the action from $a$ to $a + \delta a$. Denoting $Q(x_h, a_h) = r(x_h, a_h) + V(f(x_h, a_h), h+1)$, our goal can be re-expressed as $\delta a_h = \arg\max_{\delta a} Q(x_h, a_h + \delta a)$.

To accelerate the optimization process, the D3P planner constructs a quadratic objective function to get the locally optimal action analytically. Specifically, we propose to use the first-order Taylor expansion to avoid computing the Hessian matrix. However, the first-order Taylor expansion cannot lead to a quadratic objective function directly, hence we first seek a surrogate objective function $D(x, a) \triangleq (Q(x, a) - V_{\max})^2$, where $V_{\max}$ is a constant set larger than the upper bound of $Q(x, a)$. It is easy to check that $\arg\min_{\delta a} D(x, a + \delta a) = \arg\max_{\delta a} Q(x, a + \delta a)$.

For challenge (2), intuitively, after updating the action $a_t$ in a prior timestep, the state $x_{t+1}$ will change and we should update the action $a_{t+1}$ accordingly. Such a manner is often called "feedback". To achieve the feedback control, we now consider $Q(x + \delta x, a + \delta a)$, in which $\delta x$ represents the state change due to the prior action update. Applying the first-order Taylor expansion to the Q function inside $D$, we get a quadratic function of $\delta a$ (recall the notations in the Preliminaries):

$$D(x + \delta x, a + \delta a) = \big(Q(x, a) + Q_a(x, a)\,\delta a + Q_x(x, a)\,\delta x - V_{\max}\big)^2. \quad (2)$$

We now get the optimal action update $\delta a^*$ as a function of the feedback $\delta x$. Denoting $k_h = \frac{Q(x_h, a_h) - V_{\max}}{Q_a(x_h, a_h)}$ and $K_h = \frac{Q_x(x_h, a_h)}{Q_a(x_h, a_h)}$,

$$\delta a_h^* = -k_h - K_h\,\delta x_h = -\frac{Q(x_h, a_h) - V_{\max}}{Q_a(x_h, a_h)} - \frac{Q_x(x_h, a_h)}{Q_a(x_h, a_h)}\,\delta x_h. \quad (3)$$

The remaining part is how to calculate $Q_x(x, a)$ and $Q_a(x, a)$ in the update rule:

$$Q_a(x_h, a_h) = r_a(x_h, a_h) + V_x(f(x_h, a_h), h+1) \cdot f_a(x_h, a_h), \quad (4)$$

$$Q_x(x_h, a_h) = r_x(x_h, a_h) + V_x(f(x_h, a_h), h+1) \cdot f_x(x_h, a_h). \quad (5)$$

By leveraging the differentiable model, including the reward and transition functions, only the gradient of the value function $V_x(f(x_h, a_h), h+1)$ is hard to calculate. We use the Bellman equation and Taylor expansion once again to calculate it. Putting $\delta a_h^*$ into the Bellman equation (1) and using Taylor expansion,

$$\begin{aligned}
V(x_h + \delta x_h, h) &= Q(x_h + \delta x_h, a_h + \delta a_h^*) && (6)\\
&\approx Q(x_h, a_h) + Q_x(x_h, a_h)\,\delta x_h + Q_a(x_h, a_h)\,\delta a_h^* && (7)\\
&= \underbrace{\big(Q(x_h, a_h) - Q_a(x_h, a_h)\,k_h\big)}_{\text{zero-order term}} + \underbrace{\big(Q_x(x_h, a_h) - Q_a(x_h, a_h)\,K_h\big)\,\delta x_h}_{\text{first-order term}}. && (8)
\end{aligned}$$

We can now use the coefficient of the first-order term in the Taylor expansion of $V(x_h + \delta x_h, h)$ to calculate $V_x$:

$$V_x = Q_x(x_h, a_h) - Q_a(x_h, a_h)\,K_h. \quad (9)$$

The whole D3P planner is shown in Algorithm 1.

Algorithm 1: D3P planner

        ...
        Calculate $r_i = r_\omega(x_i, a_i)$, $x_{i+1} = f_\psi(x_i, a_i)$.
     3: end for
     4: for $i = 1, \dots, N_d$ do    # Optimize the trajectory.
     5:     Calculate $Q_x(x_H, a_H)$, $Q_a(x_H, a_H)$.    # Backward process.
     6:     for $j = H-1, \dots, 1$ do
     7:         Calculate $r_a$, $r_x$, $f_a$, $f_x$.
     8:         Calculate $Q_a$, $Q_x$, $k$, $K$, $V_x$ using Equations 3, 4, 5 and 9.
     9:     end for
    10:     $\delta x_1 = 0$.    # Forward process.
    11:     for $j = 1, \dots, H$ do
    12:         Calculate $\delta a_j$ using Equation 3, and $a_j \leftarrow a_j + \delta a_j$.
    13:         Calculate $x_{j+1} \leftarrow f_\psi(x_j, a_j)$, and $\delta x_{j+1} = x_{j+1} - x_j$.
    14:     end for
    15: end for
    16: return The last best action $a_1$.
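The closed-form update $\delta a^* = -k - K\,\delta x$ can be checked numerically. The sketch below is a hypothetical scalar illustration, not the authors' code: `d3p_gains` and `d3p_action_update` are names introduced here, and the toy function `toy_q` is an assumed linear Q, chosen so that the first-order expansion in Equation (2) is exact and the update should drive $Q$ exactly to $V_{\max}$.

```python
def d3p_gains(q, q_a, q_x, v_max):
    """Return (k, K) with k = (Q - V_max) / Q_a and K = Q_x / Q_a."""
    if q_a == 0.0:
        raise ZeroDivisionError("Q_a is zero; the scalar update is undefined")
    return (q - v_max) / q_a, q_x / q_a


def d3p_action_update(q, q_a, q_x, v_max, dx=0.0):
    """Action update delta_a = -k - K * dx, where dx is the feedback term."""
    k, big_k = d3p_gains(q, q_a, q_x, v_max)
    return -k - big_k * dx


# Hypothetical linear Q (assumption): Q(x, a) = 2a + 3x, so Q_a = 2 and Q_x = 3.
def toy_q(x, a):
    return 2.0 * a + 3.0 * x


x, a, dx, v_max = 1.0, 0.5, 0.2, 10.0
da = d3p_action_update(toy_q(x, a), q_a=2.0, q_x=3.0, v_max=v_max, dx=dx)
q_after = toy_q(x + dx, a + da)  # minimizer of D makes the residual Q - V_max vanish
```

For a nonlinear learned Q the linearization is only local, so one such step would not reach $V_{\max}$; this is one reason a conservative update is useful in practice.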
Note that the current presentation of our method applies to the deterministic environment, but our D3P planner can be easily extended to the stochastic environment with reparameterization tricks (such as the normally distributed noise in Kingma & Welling (2013)). Since we adopt some approximations in the derivation of the algorithm, we need some convergence guarantee.

$$\lVert a'_h - a^*_h \rVert \le C \sum_{k=1}^{H} \lVert a_k - a^*_k \rVert^2 + B \sum_{k=1}^{H} \lVert a_k - a^*_k \rVert, \quad (12)$$

where $C$ is proportional to the Lipschitz constant (denoted $L_1$) and the smoothness constant (denoted $L_2$) of the transition function and reward function, $C = O(L_1, L_2)$, and $B$ is proportional to the scale of the second-order derivatives of the transition and reward functions, $B = O(f_{aa}, f_{ax}, f_{xx}, r_{aa}, r_{ax}, r_{xx})$.

The above theorem shows that if we can choose a good initialization point for the planning process, we can guarantee the asymptotic convergence of the planning process. For the finite sample case, the convergence rate is at least linear. If the second derivative of the transition function is near zero ($B$ is sufficiently small), the convergence rate is near quadratic. The intuition is shown in Lemma 2. In this situation, the second-order derivative of $D$ can be approximated by the product of the first-order derivatives of $Q$, and thus of $f$ and $r$. For example, $D_{aa} \approx Q_a Q_a$. We further analyze the influence of the feedback term on the convergence rate.

Corollary 1. If we do not consider the feedback term ($\delta x = 0$), the convergence rate is

$$\lVert a'_h - a^*_h \rVert \le C \sum_{k=1}^{H} \lVert a_k - a^*_k \rVert^2 + B \sum_{k=1}^{H} \lVert a_k - a^*_k \rVert + \frac{Q_x(x_h, a_h)}{Q_a(x_h, a_h)} \sum_{i=h-1}^{1} \prod_{j=i+1}^{h-1} f_x(x_j, a_j)\, f_a(x_i, a_i)\,\delta a_i + C\,\delta a_i^2.$$

The corollary shows that if we do not consider the temporal dependency between actions in different timesteps, or in other words set $\delta x = 0$, the convergence rate will be slower than Equation (12), with an extra error term.
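The approximation $D_{aa} \approx Q_a Q_a$ can be spelled out in the scalar case. The following short derivation is a sketch under the stated assumption that the second derivative $Q_{aa}$ is negligible (and with $\delta x = 0$); the constant factor 2 cancels in the resulting step, which recovers the $-k$ term of the update rule.

```latex
\begin{aligned}
D(x,a) &= \big(Q(x,a) - V_{\max}\big)^2,\\
D_a &= 2\,\big(Q - V_{\max}\big)\,Q_a,\\
D_{aa} &= 2\,Q_a Q_a + 2\,\big(Q - V_{\max}\big)\,Q_{aa}
        \;\approx\; 2\,Q_a Q_a \qquad (Q_{aa} \approx 0),\\
\delta a &= -\frac{D_a}{D_{aa}}
        \;\approx\; -\frac{2\,(Q - V_{\max})\,Q_a}{2\,Q_a Q_a}
        \;=\; -\frac{Q - V_{\max}}{Q_a} \;=\; -k .
\end{aligned}
```

In other words, under this assumption the first-order construction behaves like a Newton step on $D$ without ever forming second derivatives of $Q$.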
The intuition is, since we are optimizing the action sequence along a trajectory, the action update will change the trajectory. Given our objective is a function of state and action, the different states will lead to the different optimal actions. Therefore, if we do not consider the state change due to the action update in the previous timesteps, the action update direction will not be toward the true gradient direction. Besides, the influence is proportional to the magnitude of the state change which is determined by the system property (f x , f a ) and previous action update δa i .
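The dependence of the state change on $(f_x, f_a)$ and the earlier action updates can be illustrated with a first-order propagation rule, $\delta x_1 = 0$ and $\delta x_{h+1} \approx f_x\,\delta x_h + f_a\,\delta a_h$. The sketch below is hypothetical: `propagate_state_change` and `rollout` are names introduced here, and the linear toy dynamics are an assumption chosen so that the first-order rule is exact and can be compared against an actual rollout.

```python
def propagate_state_change(f_x, f_a, delta_actions):
    """First-order state changes [dx_1, ..., dx_{H+1}] caused by action updates."""
    dxs = [0.0]  # the first state is fixed, so dx_1 = 0
    for da in delta_actions:
        dxs.append(f_x * dxs[-1] + f_a * da)
    return dxs


def rollout(f, x1, actions):
    """States [x_1, ..., x_{H+1}] visited when applying the actions in the model."""
    xs = [x1]
    for a in actions:
        xs.append(f(xs[-1], a))
    return xs


# Hypothetical linear dynamics (assumption): f(x, a) = 0.9 x + 0.5 a.
def toy_f(x, a):
    return 0.9 * x + 0.5 * a


base_actions = [0.0, 0.0, 0.0]
delta_actions = [0.2, 0.0, 0.0]          # only the first action is updated
new_actions = [a + d for a, d in zip(base_actions, delta_actions)]

predicted = propagate_state_change(0.9, 0.5, delta_actions)
actual = [n - o for n, o in zip(rollout(toy_f, 1.0, new_actions),
                                rollout(toy_f, 1.0, base_actions))]
```

Even though only the first action changes, every later state shifts, with the shift scaled by $f_a$ once and by $f_x$ at each subsequent step; a planner that sets $\delta x = 0$ ignores exactly these shifts.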

4.2. POLICY OPTIMIZATION WITH MODEL PLANNING: A PRACTICAL IMPLEMENTATION

In this subsection, we show how we apply our D3P planner to the deep RL framework. Since the D3P planner is a plug-and-play algorithm, only the decision-making part differs from a traditional model-based RL algorithm like MAAC (Clavera et al., 2019). The POMP algorithm is summarized in Appendix B. Note that the D3P planner module does not introduce any additional neural networks. All network structures, including the model, critic, and policy, are the same as in MAAC (Clavera et al., 2019) and MBPO (Janner et al., 2019b).

One key problem that needs to be resolved before applying the D3P planner is how to avoid misleading planning due to the limited generalization ability of the learned model. Such a problem cannot be ignored as long as the ground-truth model is unknown and can only be learned from data with function approximation. We consider two components in the algorithm to alleviate the effect of the model error: the initialization strategy and the conservative planner objective.

For the initialization strategy, we propose to use the policy network and learned model to initialize the state-action trajectory. That is, the initial action used by the D3P planner is the output of the learned policy. The motivations are as follows. (1) Since the policy is trained to maximize the return-to-go, as in general model-based RL, the proposed action would be reasonable and competitive, which is better than random initialization. (2) Since the data used to train the policy is sampled from the replay buffer, the action output by the policy network should lead to a small model prediction error.

For the conservative planner objective, constraining the actions output by the D3P planner to stay near the training data can keep the model prediction error small and provide an additional regularization for the planner. Specifically, since the policy output is a multivariate Gaussian, we can easily calculate the log-likelihood $\log P(x_i, a_i)$ for a given state-action pair.
The log-likelihood is used as an auxiliary reward, and we add it to the output of the reward function when doing planning in the evaluation phase. Specifically, we add an additional reward at the first step, and the optimization objective of D3P becomes

$$J_c(\{a_i, \dots, a_{i+H-1}\}) = \sum_{h=i}^{i+H-2} r(x_h, a_h) + Q(x_{i+H-1}, a_{i+H-1}) + \alpha \log P(x_i, a_i),$$

where $\alpha$ is a hyper-parameter. Please note that we only use this conservative term during evaluation, as we want to encourage exploration during training.
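As a minimal sketch of how the conservative objective $J_c$ could be evaluated, the Python snippet below computes the log-likelihood of an action under a Gaussian with diagonal covariance and adds it, weighted by $\alpha$, to the planning return. The diagonal covariance, the function names (`gaussian_log_prob`, `conservative_objective`) and all numeric values are assumptions made for illustration; they are not taken from the released code.

```python
import math


def gaussian_log_prob(action, mean, std):
    """Log-density of `action` under a Gaussian with diagonal covariance."""
    total = 0.0
    for a, m, s in zip(action, mean, std):
        total += -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
    return total


def conservative_objective(rewards, terminal_q, first_log_prob, alpha):
    """J_c = sum of model rewards + terminal critic value + alpha * log P(x_i, a_i)."""
    return sum(rewards) + terminal_q + alpha * first_log_prob


# Hypothetical numbers (assumption): the policy proposes mean 0 with unit std.
log_p = gaussian_log_prob([0.0], mean=[0.0], std=[1.0])
j_c = conservative_objective([1.0, 2.0], terminal_q=3.0, first_log_prob=log_p, alpha=0.1)

# An action far from the policy mean is penalized through a lower log-likelihood.
log_p_far = gaussian_log_prob([3.0], mean=[0.0], std=[1.0])
```

Because the penalty enters only through the first action $a_i$, the planner stays free to adjust later actions while the executed action remains close to what the policy would propose.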

5. EXPERIMENTS

In this section, we aim to answer the following questions: (1) Compared to state-of-the-art methods, how does our method perform on benchmark continuous control tasks? (2) Is planning necessary to make a better decision in continuous control? (3) Is our D3P planner advantageous in continuous control? (4) How does the learned model quality affect decision-making? (5) Does our D3P planner efficiently optimize the trajectory quality? (6) Is the policy network necessary in our framework?

To answer the above questions, we evaluate our method on continuous control benchmark tasks in the MuJoCo simulator (Todorov et al., 2012). Our method is built on top of MAAC (Clavera et al., 2019), which means the procedures of model learning and policy optimization and the corresponding hyperparameters are the same as in MAAC. More details are left to Appendix C.3. Due to space limitations, we leave the detailed description of the baseline methods to Appendix C.4.

To answer the first question, we compare our method with six SOTA baseline methods, and the results are shown in Fig. 1. Specifically, in both asymptotic performance and sample efficiency, our method shows a significant improvement over MAAC, on top of which it is built, on all six tasks. Moreover, on the two control tasks with high-dimensional action spaces, Ant and Humanoid, the improvement of our method is more pronounced. In general, our method achieves better performance than all other model-based and model-free baseline methods, which demonstrates the effectiveness and generality of our method. Note that on the Humanoid task, MAGE achieves better sample efficiency than ours in the early training phase, but our method achieves a better final result than MAGE, and MAGE is worse than our method on all other tasks.

5.2. ABLATION STUDIES

In this section, we conduct several ablation experiments to answer questions (2) to (6) posed before and show the necessity and effectiveness of the proposed components in our method.

Is planning necessary to make a better decision in continuous control? We design experiments to verify the effectiveness of two possible ways to make a better decision: (1) using the model to do planning, and (2) increasing $N_p$ in Algorithm 2, which is the number of updates of the policy network after we collect one sample from the real environment, and then relying on the policy to make the decision. Here we increase $N_p$ from 10 (in the original MAAC implementation) to {20, 50, 100} to see whether increasing the number of policy updates helps policy optimization, and the results are presented in Figure 2. As shown in the figure, $N_p = 10$ in the original MAAC is a rather good choice, and increasing $N_p$ can even harm the policy optimization. However, our method, which uses the learned model as a planner, consistently improves on the policy.

Is our D3P planner advantageous in continuous control? The D3P planner considers the temporal dependency and constructs a local quadratic objective function to optimize the initial trajectory proposed by the policy network. To validate the advantage of our method, we replace the D3P planner in our method with (i) an SGD-like planner, which directly optimizes the action sequence with gradient ascent; (ii) a random-shooting planner (Press et al., 2007), which randomly samples some actions in the entire action space and then scores these actions according to the reward and critic function; and (iii) a cross-entropy method (CEM) planner (Rubinstein & Kroese, 2004; Hansen et al., 2022a), which adaptively and iteratively adjusts the sampling distribution in a sophisticated manner. Note that we only change the planner in all these variants, and keep the model and policy learning unchanged for a fair comparison.
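For readers unfamiliar with the zero-order baselines, the following is a minimal sketch of a cross-entropy method planner over a one-dimensional action sequence, using only the Python standard library. It is a generic textbook-style CEM, not the baseline implementation used in the experiments: the function name `cem_plan`, the population and elite sizes, and the toy scoring function are all assumptions, and in the paper's setting `score` would instead roll the action sequence through the learned model and critic.

```python
import random
import statistics


def cem_plan(score, horizon, iterations=20, population=50, elites=10, seed=0):
    """Iteratively refit a diagonal Gaussian over action sequences to the top-scoring samples."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(iterations):
        # Sample candidate action sequences from the current distribution.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(population)]
        # Keep the highest-scoring sequences (zero-order: no gradients are used).
        samples.sort(key=score, reverse=True)
        best = samples[:elites]
        mean = [statistics.fmean([seq[t] for seq in best]) for t in range(horizon)]
        # A small floor keeps the sampling distribution from collapsing to a point.
        std = [max(statistics.pstdev([seq[t] for seq in best]), 1e-6)
               for t in range(horizon)]
    return mean


# Hypothetical objective (assumption): the best action at every step is 0.3.
def toy_score(actions):
    return -sum((a - 0.3) ** 2 for a in actions)


plan = cem_plan(toy_score, horizon=3)
```

Each iteration costs `population` full evaluations of the objective, which is the price a sampling-based planner pays for not using the gradient information that a first-order planner exploits.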
The results are shown in Figure 3, and we can see that the SGD-like planner (denoted by POMP with SGD planner) performs similarly to the policy network (denoted by MAAC), and its improvement over the policy (MAAC) is limited. Our method (denoted by POMP with D3P planner) is more effective than the SGD-like planner. Moreover, the gaps between our method and the CEM planner (denoted by CEM) and the random-shooting planner (denoted by Random-shooting) clearly show the efficiency of the first-order method compared to the zero-order methods.

How does the learned model quality affect decision-making? As our method optimizes the trajectory via planning in a learned environment model, a key part is to see how the learned model quality affects the planning results. To answer this question, we pick 4 types of learned models trained with different amounts of training data (the more training data, the better the quality of the learned model). Then we cluster the policy networks according to their performance into 6 groups. Finally, we combine the models of different quality with each policy group to see the average performance improvement after applying the D3P planner on the learned model and policy. First, for each model and each policy, we evaluate the average return using 10 trajectories. Then, we cluster the learned models and policies according to their training data and average return, and calculate the average performance improvement in each cluster. From the results shown in Figure 4(a): (1) the improvement of the model trained on only 10k training samples is similar to those of models trained on more data (except that 5k∼6k is slightly worse), which means it is enough to use an early-stage model in our D3P planner; (2) our D3P planner consistently improves on the decisions made directly by the policy network, especially in the early and middle stages.

Does our D3P efficiently optimize the trajectory quality?
Similarly, we cluster the learned models according to their training data, combine them with a fixed policy (with an average return of about 4k), and examine the impact of different iteration numbers $N_d$ used in our D3P planner. From the results shown in Figure 4(b): (1) the performance improvement increases as we use more iterations, which shows the effectiveness of our method; (2) the improvements are almost the same once $N_d \ge 6$, so we do not need more iterations, which demonstrates the efficiency of our method; (3) the results also show that an early-stage model is enough for our D3P planner.

Is the policy network necessary in our framework? There are two usages of the policy network in our D3P planner: (1) initializing the trajectory to be optimized, and (2) adding a conservative term as an auxiliary reward during evaluation. We conduct an ablation experiment to verify the necessity of the policy network in our method, and the results are shown in Figure 4(c). First, when we use a randomly generated trajectory rather than one proposed by the policy network, D3P fails to find any meaningful action (denoted by "RAND"), which shows the importance of trajectory initialization. Second, as we increase the iteration number in the D3P planner, the performance with our conservative term is consistently better than without it, especially at the later stage when the policy network is near optimal. This means the generalization ability of the learned model is limited when we use a large iteration number $N_d$, and we need to constrain the optimization space of the method.

6. CONCLUSIONS AND FUTURE WORK

In this work, we first derived the D3P planner, which is effective and efficient for continuous control, and proved its convergence rate. Then, we proposed the POMP algorithm, which leverages our D3P planner in a practical model-based RL framework. Extensive experiments and ablation studies on benchmark continuous control tasks demonstrate the effectiveness of our method and show the benefit of utilizing model planning in continuous control. For future work, given that model uncertainty can be used to trade off exploration and exploitation, how to properly estimate and incorporate the uncertainty of the learned model into planning is a meaningful topic.

A RELATED WORK

Model-based RL methods for solving decision-making problems focus on three key questions: how to learn the model, how to use the learned model to learn the policy, and how to make decisions using the learned model and policy. In addition, model-based decision-making is also studied in optimal control theory, which is closely related to model-based RL.

Model learning: How to learn a good model to support decision-making is a crucial problem in model-based RL. There are two main aspects of this work: model structure design and loss design. For model structure design, ensemble-based models (Chua et al., 2018), dropout mechanisms (Zhang et al., 2021), auto-regressive structure et al., 2020), stochastic hidden models (Hafner et al., 2021), and transformer-based models (Chen et al., 2022)

Decision-making: When making a decision, we need to generate actions that achieve our goal. Most model-based RL methods make decisions using the learned policy alone (Janner et al., 2019b; Yu et al., 2020; Clavera et al., 2019; Hafner et al., 2021). Similar to our paper, some works also make decisions using the learned model, but the majority focus only on discrete action spaces. For example, the well-known AlphaZero system (Silver et al., 2017) uses MCTS to derive actions with a known model. In MuZero (Schrittwieser et al., 2020), the authors propose to combine a learned model with an MCTS planner, achieving strong performance on a broad range of tasks with discrete action spaces. Only a few works study continuous action spaces. Couëtoux et al. (2011) extend the MCTS framework to continuous action spaces but also need to know the real model and handle the model. In Hubert et al. (2021), the authors propose the Sampled MuZero algorithm, which handles complex action spaces by planning over sampled actions. In Hansen et al. (2022a), the authors propose to learn a value function that is used as the long-term return in the Cross-Entropy (CE) method for planning.

Optimal control: Beyond deep RL, optimal control also considers the decision-making problem, but relies on a known and continuous transition model. In modern optimal control theory, the Model Predictive Control (MPC) framework (Camacho & Alba, 2013) is commonly adopted when the environment is highly non-linear. In MPC, the action is planned during execution by using the model, and this procedure is called trajectory optimization. Many previous works use the MPC framework to solve continuous control tasks. For example, Byravan et al. (2021) propose to use sampling-based MPC for high-dimensional continuous control tasks with learned models and a learned policy as a proposal distribution. Pinneri et al. (2021) propose an improved version of the Cross-Entropy Method for efficient planning. Nagabandi et al. (2020) propose the PDDM method, which uses a gradient-free planning algorithm combined with online MPC to learn flexible, contact-rich dexterous manipulation skills.

Differential Dynamic Programming: The most relevant works are DDP (Murray & Yakowitz, 1984), iLQR (Li & Todorov, 2004), and iLQG (Tassa et al., 2012). Differential Dynamic Programming (DDP) (Tassa et al., 2012) employs the Bellman equation structure (Murray & Yakowitz, 1984; Pantoja, 1988; Aoyama et al., 2021) and has a fast convergence property. It is becoming increasingly popular in the control field. iLQR (Li & Todorov, 2004) and iLQG (Tassa et al., 2012; Todorov & Li, 2005) are two variants of DDP. In iLQR and iLQG, the second-order derivative of the environment model is ignored (set to zero). Therefore, iLQR and iLQG are more computationally efficient than the original DDP method. Since both iLQG and our D3P planner are motivated by DDP, they naturally look similar. However, our method has several key differences, which are designed to accommodate a neural network model. (1) DDP, iLQR, and iLQG are all pure planning algorithms that require a known environment model. (2) Computing the second-order derivative (the Hessian matrix) of a neural network based model is computationally costly. In our method, we rely only on the first-order derivative of the model. (3) The previous methods use the second-order Taylor expansion of the Q-value function to handle the local optimization problem. However, it is hard to guarantee that the Hessian matrix is negative definite, which is a necessary condition for convergence. Here, we construct an auxiliary target function $D$ and use a first-order Taylor expansion of the Q function inside $D$ to guarantee the non-positive definite matrix.

B POMP ALGORITHM

In this section, we present the details of the POMP algorithm. Overall, POMP learns three components, the model, the critic, and the actor, each with a neural network function approximator, and leverages the D3P planner module to integrate all three components to make better decisions. The POMP algorithm runs as follows. We first learn the model, the policy, and the critic using pre-given or random-policy-generated data. Then, we use the D3P planner to generate actions based on the model, the critic, and the policy network, interact with the environment, and add the resulting data to the true replay buffer. Next, we use the data from the true replay buffer to train the model. We also generate fake data with the learned model and add these data to the fake replay buffer. After that, we sample data from both the true buffer and the fake buffer to train the critic and the policy. We repeat the training process until certain convergence conditions are satisfied.

When planning and rolling out with the learned model to generate fake data, we follow the method used by Janner et al. (2019a) and Clavera et al. (2019) to truncate the trajectory and use the Q-function to approximate the return after the truncation. When updating the policy, we calculate the policy gradient by back-propagating through the model, inspired by Clavera et al. (2019). When updating the critic, we follow SAC (Haarnoja et al., 2018) to construct two Q-functions with two target Q-functions and apply the soft Q-update.

Algorithm 2 POMP
Require: Policy update times $N_p$, total interaction number $N$, model train frequency $k$.
1: Initialize the learnable model $f_\psi$, the reward function $r_\omega$, the policy network $\pi_\theta$, the critic $Q_\phi$, true replay buffer $D_{env} \leftarrow \emptyset$, fake replay buffer $D_{model} \leftarrow \emptyset$.
2: for $i = 1, \dots, N$ do
3:   Initialize the action sequence using the policy net $\pi_\theta$ and the learned model $f_\psi$.
4:   Interact with the real environment $E_{real}$ using the D3P planner (Algorithm 1), and add the transition to $D_{env}$.
5:   if $i \bmod k == 0$ then
6:     repeat
7:       Update $\psi \leftarrow \psi - \alpha_f \nabla_\psi J_f$, $\omega \leftarrow \omega - \alpha_r \nabla_\omega J_r$ using data from $D_{env}$.
8:     until the learnable model and reward function converge.
9:   end if
10:  Generate fake data using the learned model $f_\psi$, and add them to $D_{model}$.
11:  $D \leftarrow D_{env} \cup D_{model}$
12:  for $j = 1, \dots, N_p$ do
13:    Update $\theta \leftarrow \theta - \alpha_\pi \nabla_\theta J_\pi$ using data from $D$.
14:    Update $\phi \leftarrow \phi - \alpha_Q \nabla_\phi J_Q$ using data from $D$.
15:  end for
16: end for
17: return Optimal parameters $\psi^\star$, $\omega^\star$, $\theta^\star$ and $\phi^\star$.
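The control flow of Algorithm 2 can be sketched in Python as follows. This is only a minimal skeleton under stated assumptions, not the released implementation: the callables `plan_and_act`, `train_model`, `generate_fake`, and `update_actor_critic` are hypothetical placeholders for the D3P interaction step, model fitting, model rollouts, and the actor and critic updates, and running the fake-data generation outside the model-training branch is an assumption about the step ordering.

```python
from typing import Any, Callable, List, Tuple


def pomp_loop(
    n_interactions: int,
    model_train_freq: int,
    policy_updates: int,
    plan_and_act: Callable[[], Any],
    train_model: Callable[[List[Any]], None],
    generate_fake: Callable[[List[Any]], List[Any]],
    update_actor_critic: Callable[[List[Any]], None],
) -> Tuple[List[Any], List[Any]]:
    """Skeleton of the POMP outer loop with hypothetical callables."""
    d_env: List[Any] = []    # true replay buffer
    d_model: List[Any] = []  # fake replay buffer
    for i in range(1, n_interactions + 1):
        # Plan with D3P, act in the real environment, store the transition.
        d_env.append(plan_and_act())
        # Refit the model and reward function every `model_train_freq` steps.
        if i % model_train_freq == 0:
            train_model(d_env)
        # Roll out the learned model and store the imagined data (assumed placement).
        d_model.extend(generate_fake(d_env))
        # Train the policy and the critic on the union of both buffers.
        data = d_env + d_model
        for _ in range(policy_updates):
            update_actor_critic(data)
    return d_env, d_model
```

In this sketch the union of the two buffers is rebuilt at every interaction; a practical implementation would more likely sample mini-batches from each buffer instead of concatenating them.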

C EXPERIMENT SETUP

C.1 IMPLEMENTATION DETAILS

How to set $V_{\max} - Q(x, a)$? In our D3P method, we introduce a constant $V_{\max}$ and set it larger than the upper bound of $Q(x, a)$. However, we cannot know the true upper bound of $Q(x, a)$, and a $V_{\max}$ that is too large or too small is not ideal for planning. In our implementation, we first define a maximum expected improvement $\Delta$ and then grid search $V_{\max} - Q(x, a) := \{\exp(\frac{\log \Delta}{K} \times i) \mid i = 1, \dots, K\}$ to get the best $V_{\max}$ according to our planning objective function. Note that the grid search is implemented in parallel.
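As an illustration of the grid above, the following sketch enumerates the candidate values of $V_{\max} - Q(x, a)$ and picks the best one. The function names and the `objective` callable are hypothetical; the real planning objective is the one used by the D3P planner, and the sketch evaluates candidates sequentially rather than in parallel.

```python
import math
from typing import Callable, List


def vmax_gap_candidates(delta: float, k: int) -> List[float]:
    """Return exp(log(delta) / k * i) for i = 1..k, i.e. delta ** (i / k)."""
    if delta <= 0.0 or k < 1:
        raise ValueError("delta must be positive and k must be at least 1")
    return [math.exp(math.log(delta) / k * i) for i in range(1, k + 1)]


def best_gap(candidates: List[float], objective: Callable[[float], float]) -> float:
    """Pick the candidate gap with the highest (hypothetical) planning objective."""
    return max(candidates, key=objective)
```

The candidates are geometrically spaced and the largest one equals $\Delta$, so small improvements are probed more densely than large ones.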

C.2 DESCRIPTIONS OF OUR EXPERIMENT ENVIRONMENTS

Following prior model-based RL work, we conduct our experiments on 6 classical continuous control tasks from MuJoCo (Todorov et al., 2012). The environments are summarized as follows (footnote 1):

1. Inverted Pendulum: This environment involves a cart that can be moved linearly, with a pole fixed on it at one end and the other end free. The cart can be pushed left or right, and the goal is to balance the pole on top of the cart by applying forces to the cart. The action space dimension and state space dimension are 1 and 4, respectively.

2. Hopper: The hopper is a two-dimensional one-legged figure that consists of four main body parts: the torso at the top, the thigh in the middle, the leg at the bottom, and a single foot on which the entire body rests. The goal is to make hops that move in the forward (right) direction by applying torques on the three hinges connecting the four body parts. The action space dimension and state space dimension are 3 and 11, respectively.

3. Walker2D: The walker is a two-dimensional two-legged figure that consists of four main body parts: a single torso at the top (with the two legs splitting after the torso), two thighs in the middle below the torso, two legs at the bottom below the thighs, and two feet attached to the legs on which the entire body rests. The goal is to coordinate both sets of feet, legs, and thighs to move in the forward (right) direction by applying torques on the six hinges connecting the six body parts. The action space dimension and state space dimension are 6 and 17, respectively.

4. Half Cheetah: The HalfCheetah is a two-dimensional robot consisting of 9 links and 8 joints connecting them (including two paws). The goal is to apply torques on the joints to make the cheetah run forward (right) as fast as possible, with a positive reward based on the distance moved forward and a negative reward for moving backward. The torso and head of the cheetah are fixed, and torque can only be applied on the other 6 joints over the front and back thighs (connecting to the torso), shins (connecting to the thighs), and feet (connecting to the shins). The action space dimension and state space dimension are 6 and 17, respectively.

5. Ant: The ant is a 3D robot consisting of one torso (a free rotational body) with four legs attached to it, each leg having two links. The goal is to coordinate the four legs to move in the forward (right) direction by applying torques on the eight hinges connecting the two links of each leg and the torso (nine parts and eight hinges). The action space dimension and state space dimension are 8 and 27, respectively.

6. Humanoid: The 3D bipedal robot is designed to simulate a human. It has a torso (abdomen) with a pair of legs and arms. The legs each consist of two links, and so do the arms (representing the knees and elbows, respectively). The goal of the environment is to walk forward as fast as possible without falling over. The action space dimension and state space dimension are 17 and 376, respectively.

C.3 EXPERIMENTAL DETAILS

For a fair comparison, except for the D3P planning, we keep the model learning, policy learning, and Q-function learning the same as in prior work (Janner et al., 2019b; Clavera et al., 2019). Specifically, the learnable prediction model is parameterized by an ensemble of 7 individual 5-layer MLPs and is trained with the Adam optimizer (Kingma & Ba, 2014) on all historical transition data from the replay buffer after every few hundred timesteps (the exact number varies with the task). After each interaction with the environment, the policy is optimized using the pathwise derivative of the imagined trajectory produced by the learned model and the learned policy. The Q-function is learned by minimizing the TD-error on the historical data saved in the replay buffer and on the imagined data from the learned model and policy. Note that our planner is built upon the framework of MBPO and MAAC, which use the same state augmentation strategy. The sample efficiency of our method is therefore directly comparable with that of MBPO and MAAC, and the improvement in sample efficiency is not due to the state augmentation strategy.

C.4 THE DESCRIPTION OF BASELINE METHODS

To show the effectiveness of our algorithm, we compare our method on six classical continuous control tasks against the following state-of-the-art model-free and model-based RL algorithms: (i) Soft Actor-Critic (SAC) (Haarnoja et al., 2018), a popular off-policy actor-critic RL algorithm based on the maximum entropy RL framework; (ii) SVG(1) (Heess et al., 2015a), which first uses dynamics derivatives in model-based RL; (iii) the STochastic Ensemble Value Expansion (STEVE) method (Buckman et al., 2018), which utilizes the learned models only when their uncertainty is not too high; (iv) the Model-based Action-Gradient-Estimator policy optimization (MAGE) method (D'Oro & Jaśkowski, 2020), which computes gradient targets in temporal difference learning by backpropagating through the learned dynamics; (v) the Model-Based Policy Optimization (MBPO) method (Janner et al., 2019b), which shows that using short model-generated rollouts branched from real data can benefit model-based algorithms; (vi) the Model-Augmented Actor-Critic (MAAC) method (Clavera et al., 2019), which exploits the learned model by computing the analytic gradient of the returns with respect to the policy.

D MORE EXPERIMENTAL RESULTS

D.1 STUDIES ON THE ROBUSTNESS OF OUR METHOD

We test the sensitivity of our method to the hyperparameters used in the training phase. The ablation studies on the iteration number $N_d$ used in the training phase and on the maximum expected improvement $V_{\max} - Q(x, a)$ (which we denote by $\Delta$) are shown in Figure 5 and Figure 6, respectively. Our method consistently outperforms MAAC and is not very sensitive to the choice of these hyperparameters.

D.2 COMPARISON WITH CONTINUOUS MUZERO

MuZero (Schrittwieser et al., 2020) is a successful model-based RL method for discrete action tasks, which carefully trades off exploitation and exploration. To compare such tree-based methods with our gradient-based method, we conduct a comparison with MuZero combined with continuous UCT (Progressive Widening (Couëtoux et al., 2011) in our experiments) (footnote 2). We grid search several important hyper-parameters for the continuous MuZero variant, and the detailed hyper-parameters are summarized in Table 2. The results are shown in Figure 7. As the action dimension increases, the gap between our method and the continuous MuZero variant becomes more and more obvious, which shows the advantage of our method. This also implies that employing MuZero effectively in continuous domains is non-trivial. Since UCT is a principled way to explore in discrete domains, combining it with our D3P planner for continuous domains is an interesting direction for future research.

Hyper-parameter | Values
$\alpha$ in Progressive Widening | {0.3, 0.4, 0.5, 0.6, 0.7, 0.8}
$c_1$ in UCB | {1.0, 1.25, 1.5, 2.0}
Simulation step $l$ in MCTS | {64, 128, 256, 512}

D.3 STUDIES ON THE PLANNING HORIZON

We fix the planning horizon $H$ to be the same as in MAAC (Clavera et al., 2019), since they have systematically studied this hyper-parameter in Section 5.2 of their paper: the gradient error scales poorly with the horizon, and large horizons are detrimental since they magnify the model error. We also add an ablation study in Figure 8 to show how the planning horizon influences the performance of our method. The results are consistent with prior work (Janner et al., 2019b; Clavera et al., 2019).

[Figure 8 panel: Ant. Legend: MAAC, H = 1, H = 2, H = 4, H = 6, H = 8, H = 10.]

D.4 PLOTTING RESULTS OF DIFFERENT RANDOM SEEDS

Since the RL literature commonly compares methods by plotting the mean and standard deviation, we follow this practice in our paper. In addition, we provide the individual run curves in Figure 9. Plotting the individual runs of every method in one figure would make it cluttered and hard to read.

D.5 THE IMPACT OF THE NUMBER OF EXPERIMENT RUNS

We have shown the performance of our method with 10 seeds and plotted the mean curve and the shaded deviation region in Figure 1 (the 10 individual runs are also shown in Figure 9). One may still wonder whether the limited number of runs influences the experimental results. Thus, we run each task with 20 more seeds (30 seeds in total) and show the results in Figure 11. Comparing the results of 30 seeds with the results of 10 seeds (shown in Figure 10), we can see that the number of experiment runs has only a limited impact on our method and does not alter our experimental conclusion. Finally, although the RL community usually reports results with mean and deviation values, we acknowledge that more runs of each task are needed to show robust and consistent experimental results for RL algorithms.

E VECTOR FORM OF OUR D3P PLANNER

For brevity and clarity, we treat the action and state as one-dimensional scalars in the main paper. Here we provide the vector form of the derivation of the D3P algorithm. Consider the case where the state and action are multi-dimensional vectors with dimensions $d_x$ and $d_a$, respectively. The transition function is now a mapping $\mathbb{R}^{d_x + d_a} \to \mathbb{R}^{d_x}$, and the reward function is now a mapping $\mathbb{R}^{d_x + d_a} \to \mathbb{R}^{1}$. In this setting, $f_a$ is the Jacobian matrix of shape $d_x \times d_a$, whose $(i, j)$-th entry is $f_{a,ij} = \frac{d f_i}{d a_j}$. Similarly, $f_x$ is the Jacobian matrix of shape $d_x \times d_x$, whose $(i, j)$-th entry is $f_{x,ij} = \frac{d f_i}{d x_j}$. $r_a$ is the Jacobian matrix of shape $1 \times d_a$, whose $(1, j)$-th entry is $r_{a,1j} = \frac{d r}{d a_j}$. $r_x$ is the Jacobian matrix of shape $1 \times d_x$, whose $(1, j)$-th entry is $r_{x,1j} = \frac{d r}{d x_j}$.

The objective function of our D3P planner is
$$V(x, h) = \max_{a_h} \left[ r(x_h, a_h) + V(f(x_h, a_h), h + 1) \right].$$
Denoting $Q(x_h, a_h) = r(x_h, a_h) + V(f(x_h, a_h), h + 1)$, our goal can be re-expressed as
$$\delta a_h = \arg\max_{\delta a} \left[ Q(x_h, a_h + \delta a) \right].$$
We seek a surrogate objective function $D(x, a) \triangleq (Q(x, a) - V_{\max})^2$, and we then apply a first-order Taylor expansion to the Q function $Q(x, a)$ in $D(x, a)$:
$$D(x, a + \delta a) = (Q(x, a) + Q_a(x, a)\delta a - V_{\max})^2.$$
So, the optimal action update is
$$\delta a^* = -(Q(x, a) - V_{\max})(Q_a^\top(x, a) Q_a(x, a))^{-1} Q_a^\top(x, a).$$
Then we introduce a feedback term $\delta x$ and denote
$$k = (Q(x, a) - V_{\max})(Q_a^\top(x, a) Q_a(x, a))^{-1} Q_a^\top(x, a); \quad K = (Q_a^\top(x, a) Q_a(x, a))^{-1} Q_a^\top(x, a) Q_x(x, a),$$
where the shape of $k$ is $d_a \times 1$ and the shape of $K$ is $d_a \times d_x$. The update rule of the action is given by
$$\delta a^*_h = -k - K\delta x.$$
The update rule of $Q_a(x_h, a_h)$ and $Q_x(x_h, a_h)$ is
$$Q_a(x_h, a_h) = r_a(x_h, a_h) + V_x(f(x_h, a_h), h + 1) \cdot f_a(x_h, a_h); \quad Q_x(x_h, a_h) = r_x(x_h, a_h) + V_x(f(x_h, a_h), h + 1) \cdot f_x(x_h, a_h),$$
and we can calculate $V_x$ by
$$V_x = Q_x(x_h, a_h) - Q_a(x_h, a_h) K_h.$$
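The vector-form action update can be sketched numerically as below. This is a hedged illustration, not the authors' code: because $Q_a$ is a $1 \times d_a$ row vector, the sketch assumes that the inverse in $k$ and $K$ is read as a Moore-Penrose pseudo-inverse, which reduces to $Q_a^\top / \|Q_a\|^2$. That reading, the function name `d3p_vector_step`, and the plain-list representation are all assumptions.

```python
from typing import List, Tuple


def d3p_vector_step(
    q: float,
    v_max: float,
    q_a: List[float],
    q_x: List[float],
    dx: List[float],
) -> Tuple[List[float], List[float], List[List[float]]]:
    """Compute delta_a = -k - K dx with a pseudo-inverse reading of the update."""
    norm_sq = sum(g * g for g in q_a)
    if norm_sq == 0.0:
        # Mirrors the assumption that Q_a is non-zero during the iterations.
        raise ValueError("Q_a must be non-zero")
    # k has shape d_a x 1: (Q - V_max) * Q_a^T / ||Q_a||^2.
    k = [(q - v_max) * g / norm_sq for g in q_a]
    # K has shape d_a x d_x: Q_a^T Q_x / ||Q_a||^2.
    big_k = [[ga * gx / norm_sq for gx in q_x] for ga in q_a]
    delta_a = [
        -k[i] - sum(big_k[i][j] * dx[j] for j in range(len(dx)))
        for i in range(len(q_a))
    ]
    return delta_a, k, big_k
```

Under this reading, the linearized value $Q + Q_x \delta x + Q_a \delta a$ equals $V_{\max}$ after the step, which is the root of the linearized surrogate $D$.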
F PROOF OF THEOREM

In this section, we present the proofs of Theorem 1 and Corollary 1 in Section 4. First of all, we summarize the necessary assumptions here.

Assumption 1. The transition function $f(x, a)$ and the reward function $r(x, a)$ are both continuous, with continuous first- and second-order derivatives. The first- and second-order derivatives are bounded by $L_1$ and $L_2$, respectively:
$$\|f_x\| + \|f_a\| + \|r_x\| + \|r_a\| \leq L_1, \tag{18}$$
$$\|f_{xx}\| + \|f_{xa}\| + \|f_{aa}\| + \|r_{xx}\| + \|r_{xa}\| + \|r_{aa}\| \leq L_2. \tag{19}$$

Assumption 2. The variables $Q_a$ calculated in the iterations of D3P are always non-zero.

F.1 PROOF OF THE THEOREM 1

Overall, we use mathematical induction to prove the theorem. We first prove the convergence rate for trajectory length $H = 2$. Then, we assume the theorem is true for trajectories of length $H = l$ and prove that it still holds for trajectories of length $H = l + 1$. In the proof, we denote the trajectory length as $H$ and the position in the trajectory as $h$, where $h \in \{1, 2, \dots, H\}$. We denote the action at step $h$ after the update as $a'_h$, where $a'_h = a_h + \delta a_h$. We denote the optimal action as $a^*_h$, where $a^*_h = \arg\max_{a_h} r(x^*_h, a_h) + V(f(x^*_h, a_h), h + 1)$, with $x^*_{h+1} = f(x^*_h, a^*_h)$ and $x^*_1 = x_1$. We use $A$ with a subscript, such as $A_1$, to denote an expression for simplicity, and we give its definition before we use it. We use $B$ with a subscript, such as $B_1$, to denote a term related to the error due to using the first-order derivative to approximate the second-order derivative. We use $C$ with a subscript, such as $C_1$, to denote a general constant.

Before the proof, we first recall the update rule of the D3P planner:
$$\delta a_h = -k_h - K_h \delta x_h = -\frac{Q(x_h, a_h) - V_{\max}}{Q_a(x_h, a_h)} - \frac{Q_x(x_h, a_h)}{Q_a(x_h, a_h)} \delta x_h, \tag{20}$$
$$Q_a(x_h, a_h) = r_a(x_h, a_h) + V_x(x_{h+1}, h + 1) f_a(x_h, a_h), \tag{21}$$
$$Q_x(x_h, a_h) = r_x(x_h, a_h) + V_x(x_{h+1}, h + 1) f_x(x_h, a_h), \tag{22}$$
$$V_x(x_h, a_h) = Q_x(x_h, a_h) - Q_a(x_h, a_h) K = Q_x(x_h, a_h) - Q_a(x_h, a_h) \frac{Q_x(x_h, a_h)}{Q_a(x_h, a_h)}. \tag{23}$$

First of all, we consider the case when the trajectory length is $H = 2$. We calculate the error of $a'_1$ and $a'_2$ in terms of its
$$a'_2 - a^*_2 = a_2 - a^*_2 + \delta a_2 \tag{24}$$
$$= a_2 - a^*_2 - \frac{Q(x_2, a_2) - V_{\max}}{Q_a(x_2, a_2)} - \frac{Q_x(x_2, a_2)}{Q_a(x_2, a_2)} \delta x_2 \tag{25}$$
$$= \frac{1}{Q_a^2(x_2, a_2)} \left[ Q_a^2(x_2, a_2)(a_2 - a^*_2) - Q_a(x_2, a_2)(Q(x_2, a_2) - V_{\max}) - Q_a(x_2, a_2) Q_x(x_2, a_2) \delta x_2 \right]. \tag{26}$$

Denote $D(x_2, a_2) = \frac{1}{2}(Q(x_2, a_2) - V_{\max})^2$. Given $H = 2$, we have $Q(x_2, a_2) = r(x_2, a_2)$. Therefore, we have $Q_a(x_2, a_2)(Q(x_2, a_2) - V_{\max}) = D_a(x_2, a_2)$. Also, $Q_a(x^*_2, a^*_2) = 0$ according to the definition of $a^*_h$. By using Lemma 1, we have that
$$D_a(x_2, a_2) = D_a(x_2, a_2) - D_a(x^*_2, a^*_2) \tag{27}$$
$$= \int_0^1 D_{aa}(x_2, a^*_2 - t(a^*_2 - a_2))(a_2 - a^*_2) + D_{ax}(x^*_2 - t(x^*_2 - x_2), x_2)(x_2 - x^*_2)\, dt. \tag{28}$$

Denote $A_1 = \int_0^1 D_{aa}(x_2, a^*_2 - t(a^*_2 - a_2))(a_2 - a^*_2)\, dt$ and $A_2 = \int_0^1 D_{ax}(x^*_2 - t(x^*_2 - x_2), x_2)(x_2 - x^*_2)\, dt$, and consider the first and second terms in equation 26. We have
$$Q_a^2(x_2, a_2)(a_2 - a^*_2) - Q_a(x_2, a_2)(Q(x_2, a_2) - V_{\max}) \tag{29}$$
$$= Q_a^2(x_2, a_2)(a_2 - a^*_2) - D_a(x_2, a_2) \tag{30}$$
$$= Q_a^2(x_2, a_2)(a_2 - a^*_2) - A_1 - A_2. \tag{31}$$

We first consider the $A_1$ term. Denote $h_1(x, a) = Q_a^2(x, a) - D_{aa}(x, a)$. Denote $B_1 = \int_0^1 h_1(x_2, a^*_2 - t(a^*_2 - a_2))\, dt$.

$$Q_a(x_{l-1}, a_{l-1}) = R_a(x_{l-1}, a_{l-1}) + V_x(x_l, l) f_a(x_{l-1}, a_{l-1}) \tag{64}$$
$$= r_a(x_{l-1}, a_{l-1}) - r_a(x_l, a_l) \frac{Q_x(x_l, a_l)}{Q_a(x_l, a_l)} f_a(x_{l-1}, a_{l-1}) + V_x(x_l, l) f_a(x_{l-1}, a_{l-1}) \tag{65}$$
$$= r_a(x_{l-1}, a_{l-1}) - r_a(x_l, a_l) \frac{Q_x(x_l, a_l)}{Q_a(x_l, a_l)} f_a(x_{l-1}, a_{l-1}) + r_x(x_l, a_l) f_a(x_{l-1}, a_{l-1}). \tag{66}$$

$$Q_x(x_{l-1}, a_{l-1}) = R_x(x_{l-1}, a_{l-1}) + V_x(x_l, l) f_x(x_{l-1}, a_{l-1}) \tag{67}$$
$$= r_x(x_{l-1}, a_{l-1}) - r_a(x_l, a_l) \frac{Q_x(x_l, a_l)}{Q_a(x_l, a_l)} f_x(x_{l-1}, a_{l-1}) + V_x(x_l, l) f_x(x_{l-1}, a_{l-1}) \tag{68}$$
$$= r_x(x_{l-1}, a_{l-1}) - r_a(x_l, a_l) \frac{Q_x(x_l, a_l)}{Q_a(x_l, a_l)} f_x(x_{l-1}, a_{l-1}) + r_x(x_l, a_l) f_x(x_{l-1}, a_{l-1}). \tag{69}$$

It can be easily verified that the update $\delta a_{l-1}$ in the new problem is the same as the one in the problem $P(l-1)$.

F.2 PROOF OF THE COROLLARY 1

If we do not consider the feedback term, then $K = 0$. The new update rule is
$$\delta a_h = -k_h = -\frac{Q(x_h, a_h) - V_{\max}}{Q_a(x_h, a_h)}, \tag{70}$$
$$Q_a(x_h, a_h) = r_a(x_h, a_h) + V_x(x_{h+1}, h + 1) f_a(x_h, a_h), \tag{71}$$
$$Q_x(x_h, a_h) = r_x(x_h, a_h) + V_x(x_{h+1}, h + 1) f_x(x_h, a_h), \tag{72}$$
$$V_x(x_h, a_h) = Q_x(x_h, a_h). \tag{73}$$

According to the proof of Theorem 1, for $\forall h$,
$$\|a'_h - a^*_h\| \leq C_1 \sum_{k=1}^{H} \|a_k - a^*_k\|^2 + C_2 \sum_{k=1}^{H} \|a_k - a^*_k\| + \left\| \frac{Q_x(x_h, a_h)}{Q_a(x_h, a_h)} (x'_h - x_h) \right\| \tag{74}$$
$$\leq C_1 \sum_{k=1}^{H} \|a_k - a^*_k\|^2 + C_2 \sum_{k=1}^{H} \|a_k - a^*_k\| + \left\| \frac{Q_x(x_h, a_h)}{Q_a(x_h, a_h)} \right\| \|x'_h - x_h\|, \tag{75}$$
where $x'_h = f(x'_{h-1}, a_{h-1} + \delta a_{h-1})$ and $x_h = f(x_{h-1}, a_{h-1})$. Using Taylor expansion, there exists a constant $C$ such that
$$x'_h - x_h = f_x(x_{h-1}, a_{h-1})(x'_{h-1} - x_{h-1}) + f_a(x_{h-1}, a_{h-1})(\delta a_{h-1}) + C\left((x'_{h-1} - x_{h-1})^2 + (a'_{h-1} - a_{h-1})^2\right). \tag{76}$$
If $\|x'_{h-1} - x_{h-1}\| \leq 1$, we can ignore the error of the first-order Taylor expansion, and
$$x'_h - x_h = \sum_{i=h-1}^{1} \left( \prod_{j=i+1}^{h-1} f_x(x_j, a_j) \right) \left( f_a(x_i, a_i)\delta a_i + C(\delta a_i^2) \right). \tag{77}$$
And the corollary can be proved.

Lemma 1. Let the function $f(x)$ have a continuous derivative, and denote the first-order derivative of $f(x)$ as $f_x(x)$. Then we have
$$f(x_2) - f(x_1) = \int_0^1 f_x(x_1 - t(x_1 - x_2))(x_2 - x_1)\, dt.$$
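The no-feedback recursion in equations (71) to (73) can be illustrated on a toy scalar problem. The sketch below is not the paper's implementation: the linear dynamics, the quadratic reward, and the terminal condition $V_x(\cdot, H + 1) = 0$ are assumptions chosen so that the result is easy to check, and all function names are hypothetical. Under these assumptions, $Q_a(x_h, a_h)$ is the derivative of the total return with respect to $a_h$ with all other actions held fixed.

```python
from typing import List, Tuple


# Toy scalar dynamics and reward (assumed for illustration only).
def f(x: float, a: float) -> float:
    return 0.9 * x + 0.5 * a


def r(x: float, a: float) -> float:
    return -(x * x + a * a)


def f_x(x: float, a: float) -> float:
    return 0.9


def f_a(x: float, a: float) -> float:
    return 0.5


def r_x(x: float, a: float) -> float:
    return -2.0 * x


def r_a(x: float, a: float) -> float:
    return -2.0 * a


def rollout(x1: float, actions: List[float]) -> List[float]:
    """Return the states x_1, ..., x_{H+1} visited under the action sequence."""
    xs = [x1]
    for a in actions:
        xs.append(f(xs[-1], a))
    return xs


def total_return(x1: float, actions: List[float]) -> float:
    """Sum of rewards along the trajectory, with zero terminal value."""
    xs = rollout(x1, actions)
    return sum(r(x, a) for x, a in zip(xs, actions))


def backward_no_feedback(x1: float, actions: List[float]) -> Tuple[List[float], List[float]]:
    """Backward pass with K = 0, so that V_x at step h equals Q_x at step h."""
    xs = rollout(x1, actions)
    horizon = len(actions)
    q_a = [0.0] * horizon
    q_x = [0.0] * horizon
    v_x_next = 0.0  # assumed terminal condition V_x(., H + 1) = 0
    for h in reversed(range(horizon)):
        x, a = xs[h], actions[h]
        q_a[h] = r_a(x, a) + v_x_next * f_a(x, a)
        q_x[h] = r_x(x, a) + v_x_next * f_x(x, a)
        v_x_next = q_x[h]
    return q_a, q_x
```

Comparing `q_a` with central finite differences of `total_return` is a simple way to confirm that the recursion propagates derivatives through the model as intended.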



1. Please refer to https://www.gymlibrary.dev/environments/mujoco/ for more details.
2. We use a commonly used public code base: https://github.com/werner-duvaud/muzero-general/tree/continuous.



Figure 1: Learning curves of our method and other baseline methods on six continuous control tasks. The solid lines represent the mean of 10 (for our method) or 5 (for other baseline methods) trials with different seeds, and the shaded regions correspond to the standard deviation among trials. Our method achieves the best results among these strong model-free and model-based reinforcement learning methods.

Figure 2: Ablation on the number of policy updates $N_p$ in each iteration. Increasing $N_p$ does not help policy optimization.

Figure 3: Ablation studies on the D3P planner. We replace the D3P planner in our method with an SGD-like planner, a CEM planner, and a random-shooting planner. The results show the advantage of our D3P planner.

Figure 4: (a) The improvement from applying learned models with different training steps to policies of different quality. "Improvement" means the evaluation return with our D3P planner minus the return without our D3P planner. "Policy quality" means the average episode return of the policy when applied in the environment, and "ik∼(i + 1)k" denotes the policy cluster whose average return lies in ik∼(i + 1)k. "Model ik" denotes the learned model trained using ik data. (b) The improvement for different iteration numbers $N_d$ in D3P (Line 4 in Algorithm 1). "Model quality" means the amount of training data used to train the model, and "ik∼jk" denotes the learned model with ik∼jk training data. (c) Ablation on the policy usage in our method. "RAND" denotes POMP with a randomly initialized trajectory rather than a policy-generated trajectory in D3P. "$N_d = i$" denotes POMP with iteration number $i$, and "$N_d = i$ w/o cons" denotes POMP with iteration number $i$ and without the conservative term during evaluation.

f ψ , and add them to D model . 11: D ← D env ∪ D model 12:

Figure 5: Ablation studies on hyperparameter iteration number N d in training.

Figure 7: Comparison of a continuous MuZero variant with our method. The dimensions of the action space for Swimmer, Hopper and Walker2d are 2, 3, and 4, respectively. As the dimension increases, the gap between our method and the continuous MuZero variant becomes more obvious, which shows the advantage of our method.

Figure 8: The studies on planning horizon H.

Figure 9: The individual 10 runs of our method.

Figure 10: The experimental results with 10 seeds of our method.

Initial action sequence $\{a_t\}_{t=1,\dots,H}$, initial state $x_1$, iteration number $N_d$, valid horizon $H$, maximum expected improvement $V_{\max} - Q(x, a)$.

are always considered to improve the model robustness and prediction accuracy. For loss design, decision awareness (D'Oro et al., 2020; Farahmand et al., 2017) and gradient awareness (Li et al., 2021) are always considered to reduce the gap between model learning and model utilization. Two methods are commonly used to learn the policy with the learned model. One is to use the learned model as a black-box simulator to generate data. Janner et al. (2019b) is a representative work in this line. Yu et al. (2020) and Lee et al. (2020) follow the same approach and extend it to the offline RL setting. The other is to use the learned model to calculate the policy gradient. Heess et al. (2015b) present an algorithm that calculates the policy gradient by backpropagating through the model. Clavera et al. (2019) and Amos et al. (2021) share similar methods but use promising actor and critic learning strategies to achieve better performance.

The detailed hyper-parameters in our experiments.

The detailed hyper-parameters are summarized in Table 1, and refer to Janner et al. (2019b); Clavera et al. (2019) for more details.

We grid search several important hyper-parameters for the continuous MuZero variant.

ACKNOWLEDGMENTS

This work was supported in part by NSFC under Contract 61836011, and in part by the Fundamental Research Funds for the Central Universities under contract WK3490000007.

annex

Now, we will consider the $A_2$ term and the third term in equation 26. Denote

Summarizing the conclusions, we can now prove

The next task is to prove the $a'$

We have $Q_a(x_1, a^*_1) = 0$ according to the definition of $a^*_1$. Using Lemma 1, denote

Thus, we have

For simplicity, we can write

Up to here, we have proved that the theorem is true for trajectories with horizon $H = 2$. Now, using induction, suppose the theorem is true for $H = l - 1$. The induction hypothesis means that for the following problem (denoted as $P(l - 1)$), there exist two constants $C$ and $B$ such that for $\forall h \in \{1, 2,$

What we need to prove is that for the new problem with $H = l$ (denoted as $p(l)$), the theorem still holds. The main idea is to merge the reward functions of the last two timesteps into one, and then prove that $\delta a_{l-1}$ is the same as the one in the problem $P(l - 1)$, which is denoted as $\delta a_{l-1}$. Then, according to the update rule, for $h < l - 1$, $\delta a_h = \delta a_h$ also holds. For $h = l$, the theorem can be proved using exactly the same process as the one we used for $a'_2 - a^*_2$ with $H = 2$. Combining all these conclusions, we can then prove that the theorem holds for the problem $p(l)$, which finishes the proof.

Here we show how to construct a new reward function by merging two reward functions.

In the new problem, the update for the action

(61)

For a multi-variable function $f(x, y)$, we have

(79)

Proof of Lemma 1. We first prove the single-variable version. Denote $g(t) = f(x_1 - t(x_1 - x_2))$. It is easy to verify that

According to the fundamental theorem of calculus, we have

Then, we prove the multi-variable version. Denote g

Lemma 2. Denote $D(x, a) = \frac{1}{2}(Q(x, a) - V_{\max})^2$, $h_1(x, a) = D_{aa}(x, a) - Q_a^2(x, a)$, and $h_2(x, a) = D_{ax} - Q_a(x, a) Q_x(x, a)$. We have

Proof of Lemma 2.

Similarly, we can prove that

