Extracting Strong Policies for Robotics Tasks from Zero-Order Trajectory Optimizers

Abstract

Solving high-dimensional, continuous robotic tasks is a challenging optimization problem. Model-based methods that rely on zero-order optimizers like the cross-entropy method (CEM) have so far shown strong performance and are considered state-of-the-art in the model-based reinforcement learning community. However, this success comes at the cost of high computational complexity, making these methods unsuitable for real-time control. In this paper, we propose a technique to jointly optimize the trajectory and distill a policy, which is essential for fast execution on real robotic systems. Our method builds upon standard approaches, such as guidance cost and dataset aggregation, and introduces a novel adaptive factor which prevents the optimizer from collapsing to the learner's behavior at the beginning of training. The extracted policies reach unprecedented performance on challenging tasks such as making a humanoid stand up and opening a door without reward shaping.

Figure 1: Environments and exemplary behaviors of the learned policy using APEX.
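To make the adaptive guidance idea concrete, the sketch below shows one way a planner's cost could combine the task cost with a policy-guidance term whose weight ramps up over training, so that an untrained policy does not bias the optimizer early on. This is an illustrative sketch only: the linear schedule, the squared-error distance, and all names here are our assumptions, not the paper's exact formulation.

```python
import numpy as np

def guided_cost(task_cost, plan_actions, policy_actions, progress):
    """Illustrative guided planning cost (schematic, not the paper's exact
    adaptive factor): task cost plus a guidance term pulling the planner's
    actions toward the policy's, weighted by training progress in [0, 1]."""
    lam = float(progress)  # adaptive factor: 0 at start, grows as the policy improves
    guidance = np.mean((plan_actions - policy_actions) ** 2)
    return task_cost + lam * guidance
```

Early in training (`progress` near 0) the planner optimizes the task cost alone; later, the guidance term keeps planned trajectories close to behaviors the policy can actually reproduce.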

1. INTRODUCTION

The general purpose of model-based and model-free reinforcement learning (RL) is to optimize a trajectory or find a policy that is fast and accurate enough to be deployed on real robotic systems. Policies optimized by model-free RL algorithms achieve outstanding results in many challenging domains (Heess et al., 2017; Andrychowicz et al., 2020); however, in order to converge to their final performance, they require a large number of interactions with the environment and can hardly be used on real robots, which have a limited lifespan. Moreover, real robotic systems are high-dimensional and have a highly non-convex optimization landscape, which makes policy gradient methods prone to converging to locally optimal solutions. In addition, model-free RL methods only gather task-specific information, which inherently limits their generalization to new situations. On the other hand, recent advances in model-based RL show that it is possible to match model-free performance by learning uncertainty-aware system dynamics (Chua et al., 2018; Deisenroth & Rasmussen, 2011; Du et al., 2019). The learned model can then be used within a model-predictive control framework for trajectory optimization. Zero-order optimizers are gaining a lot of traction in
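For readers unfamiliar with zero-order trajectory optimization, the following minimal sketch shows the cross-entropy method (CEM) as typically used inside a model-predictive control loop: sample action sequences, score them with a cost function (here assumed to roll out a learned model), and refit a Gaussian to the lowest-cost samples. Population size, elite count, and iteration budget are illustrative choices, not values from the paper.

```python
import numpy as np

def cem_plan(cost_fn, horizon, act_dim, iters=10, pop=64, n_elites=8):
    """Minimal cross-entropy method (CEM) planner: iteratively refit a
    diagonal Gaussian over action sequences to the lowest-cost samples.
    `cost_fn` maps a (horizon, act_dim) action sequence to a scalar cost,
    e.g. by rolling it out through a learned dynamics model."""
    mean = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    for _ in range(iters):
        # Sample a population of candidate action sequences.
        samples = mean + std * np.random.randn(pop, horizon, act_dim)
        costs = np.array([cost_fn(s) for s in samples])
        # Keep the elite set and refit the sampling distribution to it.
        elites = samples[np.argsort(costs)[:n_elites]]
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6
    return mean  # in MPC, execute only the first action, then replan
```

Because CEM re-solves this optimization at every control step, planning cost scales with population size, horizon, and model rollouts, which is exactly the computational burden that motivates distilling the optimizer's behavior into a fast policy.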

* Equal contribution. We acknowledge the support from the German Federal Ministry of Education and Research (BMBF) through the Tübingen AI Center (FKZ: 01IS18039B) and from the Max Planck ETH Center for Learning Systems.

