Extracting Strong Policies for Robotics Tasks from Zero-Order Trajectory Optimizers

Abstract

Solving high-dimensional, continuous robotic tasks is a challenging optimization problem. Model-based methods that rely on zero-order optimizers like the cross-entropy method (CEM) have so far shown strong performance and are considered state-of-the-art in the model-based reinforcement learning community. However, this success comes at the cost of high computational complexity, making these methods unsuitable for real-time control. In this paper, we propose a technique to jointly optimize the trajectory and distill a policy, which is essential for fast execution on real robotic systems. Our method builds upon standard approaches, like guidance cost and dataset aggregation, and introduces a novel adaptive factor which prevents the optimizer from collapsing to the learner's behavior at the beginning of training. The extracted policies reach unprecedented performance on challenging tasks like making a humanoid stand up and opening a door without reward shaping.

Figure 1: Environments and exemplary behaviors of the learned policy using APEX.

1. INTRODUCTION

The general purpose of model-based and model-free reinforcement learning (RL) is to optimize a trajectory or find a policy that is fast and accurate enough to be deployed on real robotic systems. Policies optimized by model-free RL algorithms achieve outstanding results in many challenging domains (Heess et al., 2017; Andrychowicz et al., 2020); however, in order to converge to their final performance, they require a large number of interactions with the environment and can hardly be used on real robots, which have a limited lifespan. Moreover, real robotic systems are high-dimensional and have a highly non-convex optimization landscape, which makes policy gradient methods prone to converge to locally optimal solutions. In addition, model-free RL methods only gather task-specific information, which inherently limits their generalization to new situations.

On the other hand, recent advances in model-based RL show that it is possible to match model-free performance by learning uncertainty-aware system dynamics (Chua et al., 2018; Deisenroth & Rasmussen, 2011; Du et al., 2019). The learned model can then be used within a model-predictive control framework for trajectory optimization. Zero-order optimizers are gaining a lot of traction in the model-based RL community (Chua et al., 2018; Wang & Ba, 2020; Williams et al., 2015) since they can be used with any choice of model and cost function, and can be surprisingly effective at finding high-performance solutions close to a global optimum (Pinneri et al., 2020), in contrast to their gradient-based counterparts, which are often highly dependent on hyperparameter tuning (Henderson et al., 2017). One of the most popular such optimizers is the Cross-Entropy Method (CEM), originally introduced in the 1990s by Rubinstein & Davidson (1999).
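To make the sampling-based nature of CEM concrete, the following is a minimal sketch of CEM as a trajectory optimizer: sample action sequences from a Gaussian, keep the lowest-cost "elites", and refit the distribution to them. All function names, hyperparameter defaults, and the plain (non-colored-noise) sampling are illustrative choices, not the iCEM variant used in the paper.

```python
import numpy as np

def cem_plan(cost_fn, horizon, action_dim, n_samples=64, n_elites=8,
             n_iters=5, init_std=0.5):
    """Plan an action sequence with the Cross-Entropy Method.

    cost_fn maps a (horizon, action_dim) action sequence to a scalar cost,
    e.g. by rolling it out through a (learned or true) dynamics model.
    """
    mean = np.zeros((horizon, action_dim))
    std = np.full((horizon, action_dim), init_std)
    for _ in range(n_iters):
        # Sample candidate action sequences from the current Gaussian.
        samples = mean + std * np.random.randn(n_samples, horizon, action_dim)
        costs = np.array([cost_fn(s) for s in samples])
        # Refit the sampling distribution to the lowest-cost (elite) samples.
        elites = samples[np.argsort(costs)[:n_elites]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean  # in MPC, apply the first action and replan at the next step
```

The per-step cost of evaluating `n_samples * n_iters` rollouts is exactly the computational burden that makes such optimizers hard to run in real time.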
Despite their achievements, using zero-order methods to generate action sequences is time-consuming in complex, high-dimensional environments due to the extensive sampling, which makes them hard to deploy in real-time applications. Extracting a policy from powerful zero-order optimizers like CEM would bridge the gap between model-based RL in simulation and real-time robotics. As of today, this is still an open challenge (Wang & Ba, 2020). We analyze this issue and showcase several approaches for policy extraction from CEM. In particular, we use the sample-efficient modification of CEM (iCEM) presented in Pinneri et al. (2020). Throughout the paper, we call these optimizers "experts" as they provide demonstration trajectories. To isolate the problem of bringing the policy's performance close to the expert's, we consider the true simulation dynamics as our forward model. Our contributions can be summarized as follows:

• pinpointing the issues that arise when trying to distill a policy from a multimodal, stochastic teacher;
• introducing APEX, an Adaptive Policy EXtraction procedure that integrates iCEM with DAgger and a novel adaptive variant of Guided Policy Search;
• our specific integration of methods produces an improving adaptive teacher, with higher performance than the original iCEM optimizer;
• obtaining strong policies for hard robotic tasks in simulation (HUMANOID STANDUP, FETCH PICK&PLACE, DOOR), where model-free policies would usually just converge to local optima.

Videos showing the performance of the extracted policies and other information can be found at https://martius-lab.github.io/APEX.
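The DAgger component mentioned above can be sketched as follows: the learner's own actions drive the rollouts, but every visited state is labeled with the expert's action, and the policy is refit on the aggregated dataset after each round. This is a generic DAgger loop under assumed interfaces (`env`, `expert_action`, `fit_policy` are hypothetical names), not the full APEX procedure.

```python
import numpy as np

def dagger_distill(env, expert_action, policy, fit_policy,
                   n_rounds=5, episode_len=20):
    """DAgger-style policy extraction: roll out the current policy so that
    the dataset covers the states the learner actually visits, but label
    each state with the expert's action (here, an iCEM plan would play
    that role)."""
    states, labels = [], []
    for _ in range(n_rounds):
        s = env.reset()
        for _ in range(episode_len):
            states.append(s)
            labels.append(expert_action(s))  # query the expert, do not execute it
            s, done = env.step(policy(s))    # execute the learner's own action
            if done:
                break
        # Refit on the aggregated dataset of all rounds so far.
        policy = fit_policy(np.array(states), np.array(labels))
    return policy
```

Collecting labels on the learner's own state distribution is what distinguishes DAgger from plain behavioral cloning on expert rollouts.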

2. RELATED WORK

Our objective is to extract high-performing policies from CEM experts that can operate with a few planning samples, to make iterative learning fast. Other kinds of zero-order optimizers have been used to generate control sequences (Williams et al., 2015; Lowrey et al., 2019), but they still have to evaluate thousands of trajectories at each time step. Even simple random shooting has been used as a trajectory optimizer to bootstrap a model-free policy (Nagabandi et al., 2018). To train policies from optimal control solutions, it was shown that the expert optimizers need to be guided towards the learning policy, known as guided policy search (GPS) (Levine & Koltun, 2013; Levine & Abbeel, 2014). In our work, the expert does not come from optimal control but is the stochastic iCEM optimizer, which we will also refer to as the teacher. Policy distillation from CEM was also attempted in Wang & Ba (2020), but only largely sub-optimal policies could be extracted: when the policy is used alone at test time, and not in combination with the MPC-CEM optimizer, its performance drops. Closing this gap requires several ingredients, such as the adaptive cost formulation that we propose here, together with expert warm-starting via distribution initialization and additional samples from the policy. A simple form of warm-starting was already done in Wang & Ba (2020).
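These two ingredients, a guidance term in the expert's cost and warm-starting from the policy, can be sketched as follows. The weight `alpha` stands in for the adaptive factor: keeping it small early in training prevents the expert from collapsing onto the untrained learner. Both functions are generic illustrations under assumed interfaces, not the paper's exact formulation.

```python
import numpy as np

def guided_cost(action_seq, visited_states, task_cost, policy, alpha):
    """Task cost plus a guidance term pulling the expert's plan toward the
    current policy; `alpha` is the guidance weight that APEX adapts over
    training (its schedule is a placeholder here)."""
    guidance = sum(float(np.sum((a - policy(s)) ** 2))
                   for a, s in zip(action_seq, visited_states))
    return task_cost(action_seq) + alpha * guidance

def warm_start_mean(prev_plan, policy, predicted_last_state):
    """Warm-start the expert's sampling distribution for the next MPC step:
    shift the previous solution one step forward and fill the freed last
    slot with the policy's proposed action."""
    return np.concatenate([prev_plan[1:],
                           policy(predicted_last_state)[None]], axis=0)
```

In the same spirit, action sequences proposed by the policy can simply be appended to the expert's candidate set, so the optimizer evaluates (and can improve on) the learner's current behavior.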

Funding

* Equal contribution. We acknowledge the support from the German Federal Ministry of Education and Research (BMBF) through the Tübingen AI Center (FKZ: 01IS18039B) and from the Max Planck ETH Center for Learning Systems.

