MAKING BETTER DECISION BY DIRECTLY PLANNING IN CONTINUOUS CONTROL

Abstract

By properly utilizing the learned environment model, model-based reinforcement learning methods can improve the sample efficiency of decision-making problems. Beyond using the learned environment model to train a policy, the success of MCTS-based methods shows that directly incorporating the learned environment model as a planner to make decisions may be more effective. However, when the action space is continuous and high-dimensional, directly planning according to the learned model is costly and non-trivial because of two challenges: (1) the infinite number of candidate actions and (2) the temporal dependency between actions at different timesteps. To address these challenges, inspired by Differential Dynamic Programming (DDP) in optimal control theory, we design a novel Policy Optimization with Model Planning (POMP) algorithm, which incorporates a carefully designed Deep Differential Dynamic Programming (D3P) planner into the model-based RL framework. In the D3P planner, (1) to plan effectively in the continuous action space, we construct a locally quadratic programming problem and use a gradient-based optimization process to replace search; (2) to account for the temporal dependency between actions at different timesteps, we leverage the updated, latest actions of the previous timesteps (i.e., steps 1, ..., h-1) to update the action of the current step (i.e., step h), instead of updating all actions simultaneously. We theoretically prove the convergence rate of our D3P planner and analyze the effect of the feedback term. In practice, to effectively apply the neural-network-based D3P planner in reinforcement learning, we leverage the policy network to initialize the action sequence and keep the action updates conservative during planning. Experiments demonstrate that POMP consistently improves sample efficiency on widely used continuous control tasks. Our code is released at https://github.com/POMP-D3P/POMP-D3P.
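To make the sequential update scheme concrete, the following is a minimal sketch of gradient-based planning in which each action is refined using the latest versions of the preceding actions. This is an illustration only, not the D3P planner itself: a toy known linear model stands in for a learned neural model, finite-difference gradient ascent stands in for D3P's locally quadratic subproblem, and all names (`model_step`, `plan`, `rollout_return`) are hypothetical.

```python
import numpy as np

# Toy "learned" model: linear dynamics s' = A s + B a, quadratic reward.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])

def model_step(s, a):
    return A @ s + B @ a

def reward(s, a):
    return -float(s @ s) - 0.01 * float(a @ a)

def rollout_return(s, actions):
    """Cumulative reward of executing `actions` from state `s` under the model."""
    ret = 0.0
    for a in actions:
        ret += reward(s, a)
        s = model_step(s, a)
    return ret

def plan(s0, actions, iters=200, lr=0.5, eps=1e-4):
    """Refine an action sequence by sequential (not simultaneous) updates:
    the action at step h is improved while the state is rolled forward with
    the already-updated actions of steps 1..h-1."""
    H = len(actions)
    for _ in range(iters):
        s = s0
        for h in range(H):
            # Finite-difference gradient of the return-from-step-h
            # with respect to actions[h], future actions held fixed.
            g = np.zeros_like(actions[h])
            for j in range(actions[h].size):
                ap, am = actions[h].copy(), actions[h].copy()
                ap[j] += eps
                am[j] -= eps
                g[j] = (rollout_return(s, [ap] + actions[h + 1:]) -
                        rollout_return(s, [am] + actions[h + 1:])) / (2 * eps)
            actions[h] = actions[h] + lr * g   # conservative ascent step
            s = model_step(s, actions[h])      # roll forward with the updated action
    return actions

s0 = np.array([1.0, 0.0])
planned = plan(s0, [np.zeros(1) for _ in range(5)])
```

Initializing the action sequence from a policy network (instead of zeros, as here) and bounding the per-iteration update are the practical ingredients the abstract mentions for keeping the planner stable.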

1. INTRODUCTION

Model-based reinforcement learning (RL) (Janner et al., 2019a; Yu et al., 2020; Schrittwieser et al., 2020; Hafner et al., 2021) has shown promise as a general-purpose tool for solving sequential decision-making problems. Unlike model-free RL algorithms (Mnih et al., 2015; Haarnoja et al., 2018), in which the controller directly learns a complex policy from real off-policy data, model-based RL methods first learn a predictive model of the unknown dynamics and then leverage the learned model to aid policy learning. With several key innovations (Janner et al., 2019a; Clavera et al., 2019), model-based RL algorithms have shown outstanding data efficiency and performance compared to their model-free counterparts, which makes it possible to apply them in real-world physical systems where data collection is arduous and time-consuming (Moerland et al., 2020). There are two main directions, though not mutually exclusive, for leveraging the learned model in model-based RL. In the first class, the model plays an auxiliary role, affecting decision-making only by helping policy learning (Janner et al., 2019b; Clavera et al., 2019). In the second class, the model is used to sample pathwise trajectories and then score the sampled actions (Schrittwieser et al.,

