MAKING BETTER DECISION BY DIRECTLY PLANNING IN CONTINUOUS CONTROL

Abstract

By properly utilizing the learned environment model, model-based reinforcement learning methods can improve the sample efficiency of decision-making problems. Beyond using the learned environment model to train a policy, the success of MCTS-based methods shows that directly incorporating the learned model as a planner to make decisions can be more effective. However, when the action space is high-dimensional and continuous, directly planning according to the learned model is costly and non-trivial because of two challenges: (1) the infinite number of candidate actions and (2) the temporal dependency between actions at different timesteps. To address these challenges, inspired by Differential Dynamic Programming (DDP) in optimal control theory, we design a novel Policy Optimization with Model Planning (POMP) algorithm, which incorporates a carefully designed Deep Differential Dynamic Programming (D3P) planner into the model-based RL framework. In the D3P planner, (1) to plan effectively in the continuous action space, we construct a locally quadratic programming problem that replaces search with a gradient-based optimization process; (2) to account for the temporal dependency between actions at different timesteps, we leverage the updated, latest actions of previous timesteps (i.e., steps 1, . . . , h - 1) to update the action of the current step (i.e., step h), instead of updating all actions simultaneously. We theoretically prove the convergence rate of our D3P planner and analyze the effect of the feedback term. In practice, to effectively apply the neural-network-based D3P planner in reinforcement learning, we leverage the policy network to initialize the action sequence and keep the action updates conservative during planning. Experiments demonstrate that POMP consistently improves sample efficiency on widely used continuous control tasks. Our code is released at https://github.com/POMP-D3P/POMP-D3P.

1. INTRODUCTION

Model-based reinforcement learning (RL) (Janner et al., 2019a; Yu et al., 2020; Schrittwieser et al., 2020; Hafner et al., 2021) has shown its promise as a general-purpose tool for solving sequential decision-making problems. Different from model-free RL algorithms (Mnih et al., 2015; Haarnoja et al., 2018), in which the controller directly learns a complex policy from real off-policy data, model-based RL methods first learn a predictive model of the unknown dynamics and then leverage the learned model to aid policy learning. With several key innovations (Janner et al., 2019a; Clavera et al., 2019), model-based RL algorithms have shown outstanding data efficiency and performance compared to their model-free counterparts, which makes it possible to apply them in real-world physical systems where data collection is arduous and time-consuming (Moerland et al., 2020).

There are mainly two directions for leveraging the learned model in model-based RL, though they are not mutually exclusive. In the first class, the model plays an auxiliary role and affects decision-making only by aiding policy learning (Janner et al., 2019b; Clavera et al., 2019). In the second class, the model is used to sample pathwise trajectories and then score the sampled actions (Schrittwieser et al., 2020). Our work falls into the second class, directly using the model as a planner (rather than only to aid policy learning). Some recent papers (Dong et al., 2020; Hubert et al., 2021; Hansen et al., 2022b) have started in this direction and have shown cases that support the motivation behind it. For example, in some scenarios (Dong et al., 2020), the policy might be very complex while the model is relatively simple to learn. This idea is easy to implement in discrete action spaces, where MCTS can perform powerful planning by searching (Silver et al., 2016; 2017; Schrittwieser et al., 2020; Hubert et al., 2021).
However, when the action space is continuous, tree-based search methods cannot be applied trivially. There are two key challenges: (1) continuous and high-dimensional actions imply that the number of candidate actions is infinite; (2) the temporal dependency between actions implies that action updates in previous timesteps influence later actions. Thus, trajectory optimization in continuous action spaces remains challenging and insufficiently investigated.

To address the above challenges, in this paper we propose a Policy Optimization with Model Planning (POMP) algorithm in the model-based RL framework, in which a novel Deep Differential Dynamic Programming (D3P) planner is designed. Since model-based RL is closely related to optimal control theory, the high efficiency of the differential dynamic programming (DDP) algorithm (Pantoja, 1988; Tassa et al., 2012) in optimal control inspires us to design a dynamic-programming-based algorithm. However, since DDP requires a known model and incurs a high computational cost, applying it to deep RL is non-trivial. The D3P planner aims to optimize the action sequence in the trajectory. The key innovation in D3P is that we leverage a first-order Taylor expansion of the optimal Bellman equation to obtain the action update signal efficiently, which exploits the differentiability of the learned model. We can theoretically prove the convergence rate of D3P under mild assumptions. Specifically, (1) D3P uses a first-order Taylor expansion of the optimal Bellman equation but still constructs a locally quadratic objective function; thus, by leveraging the analytic formulation of the minimizer of the quadratic function, D3P can efficiently obtain the locally optimal action. (2) Besides, a feedback term is introduced in D3P with the help of the Bellman equation; in this way, D3P updates the action at the current step by taking into account the action updates at previous timesteps during planning.
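The core idea of sequentially updating each action with gradients through a differentiable model can be illustrated with a minimal sketch. This is a toy stand-in, not the D3P implementation: the "learned model" here is a hand-written linear system with a quadratic reward, and finite differences replace the backward pass that a neural model would provide. The sequential loop mirrors the temporal-dependency idea, since action h is improved only after actions 1, . . . , h - 1 have already been updated.

```python
import numpy as np

# Toy differentiable "learned model": linear dynamics s' = A s + B a and a
# quadratic reward. A, B, and the cost weights are illustrative stand-ins
# for a learned neural-network model.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])

def step(s, a):
    return A @ s + B @ a

def reward(s, a):
    return -(s @ s) - 0.01 * (a @ a)

def terminal_value(s):
    return -(s @ s)

def rollout_return(s0, actions):
    """Model-based return of an action sequence from state s0."""
    s, total = s0, 0.0
    for a in actions:
        total += reward(s, a)
        s = step(s, a)
    return total + terminal_value(s)

def grad_wrt_action(s0, actions, h, eps=1e-5):
    # Finite-difference gradient of the trajectory return w.r.t. action h;
    # with a neural model this would be one backward pass instead.
    g = np.zeros_like(actions[h])
    for i in range(actions[h].size):
        ap = [a.copy() for a in actions]; ap[h][i] += eps
        am = [a.copy() for a in actions]; am[h][i] -= eps
        g[i] = (rollout_return(s0, ap) - rollout_return(s0, am)) / (2 * eps)
    return g

def plan(s0, actions, iters=50, lr=0.5):
    # Sequential update: action h is improved *after* actions 1..h-1 have
    # been updated in this sweep, so each step sees the latest earlier
    # actions (gradient ascent replaces search over the infinite action set).
    for _ in range(iters):
        for h in range(len(actions)):
            actions[h] = actions[h] + lr * grad_wrt_action(s0, actions, h)
    return actions
```

Starting from a zero action sequence, `plan` strictly increases the model-based return of the trajectory; the same loop structure applies when the rollout and gradient come from a learned neural model.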
Note that D3P is a plug-and-play algorithm that introduces no extra parameters. When we integrate the D3P planner into our POMP algorithm under the model-based RL framework, the practical challenge is that the neural-network-based learned model is highly nonlinear and has limited generalization ability; hence the planning process may be misled when the initialization is bad or the action is out-of-distribution. Therefore, we propose to leverage the learned policy to initialize the actions before planning and to add a conservative term during planning, so that the error of the learned model stays small along the planning process. Overall, our POMP algorithm closely integrates the learned model, the critic, and the policy to make better decisions.

For evaluation, we conduct several experiments on the benchmark MuJoCo continuous control tasks. The results show that our proposed method can significantly improve both sample efficiency and asymptotic performance. Besides, comprehensive ablation studies verify the necessity and effectiveness of our proposed D3P planner. The contributions of our work are summarized as follows: (1) We theoretically derive the D3P planner and prove its convergence rate. (2) We design the POMP algorithm, which efficiently refines the actions in the trajectory with the D3P planner. (3) Extensive experimental results demonstrate the superiority of our method in terms of both sample efficiency and asymptotic performance.
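The policy-initialized, conservative refinement described above can be sketched as follows. This is a schematic illustration, not the paper's interface: `objective_grad` is a hypothetical placeholder for the gradient of the model-based return estimate (e.g., obtained by backpropagating through the learned model and critic), and the quadratic penalty toward the policy's proposal is one simple way to keep the refined action close to the model's training distribution.

```python
import numpy as np

def conservative_refine(a_init, objective_grad, n_steps=10, lr=0.1, beta=1.0):
    """Refine a policy-proposed action by gradient ascent on the estimated
    return, with a conservative penalty beta * ||a - a_init||^2 that keeps
    the action near its initialization, where the learned model's error
    is expected to be small.

    objective_grad(a): gradient of the model-based return estimate at a
    (a placeholder for backprop through the model and critic).
    """
    a, a0 = a_init.copy(), a_init.copy()
    for _ in range(n_steps):
        # Ascend the estimated return, descend the deviation penalty.
        g = objective_grad(a) - 2.0 * beta * (a - a0)
        a = a + lr * g
    return a
```

With a concave toy objective whose maximizer lies away from the initialization, the refined action settles between the policy's proposal and the objective's maximizer; the weight `beta` trades off improvement against conservatism.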

2. RELATED WORK

The full version of the related work is in Appendix A; we briefly introduce several highly related works here. In general, model-based RL for solving decision-making problems can be divided into three perspectives: model learning, policy learning, and decision-making. Moreover, optimal control theory also concerns the decision-making problem and is deeply related to model-based RL. Model learning: How to learn a good model to support decision-making is crucial in model-based RL. There are two main aspects of this work: model structure design (Chua et al., 2018; Zhang 

