C3PO: LEARNING TO ACHIEVE ARBITRARY GOALS VIA MASSIVELY ENTROPIC PRETRAINING

Abstract

Given a particular embodiment, we propose a novel method (C3PO) that learns policies able to achieve any arbitrary position and pose. Such a policy would allow for easier control and would be reusable as a key building block for downstream tasks. The method is two-fold. First, we introduce a novel exploration algorithm that optimizes for uniform coverage and is able to discover a set of achievable states, and we investigate its ability to attain both high coverage and hard-to-discover states. Second, we leverage this set of achievable states as training data for a universal goal-achievement policy, a goal-based SAC variant. We demonstrate the trained policy's performance in achieving a large number of novel states. Finally, we showcase the influence of massive unsupervised training of a goal-achievement policy with state-of-the-art pose-based control of the Hopper, Walker, Halfcheetah, Humanoid and Ant embodiments.

1. INTRODUCTION

Reinforcement learning (RL) has shown great results in optimizing for single reward functions (Mnih et al., 2013; Silver et al., 2017; Vinyals et al., 2019), that is, when a controller has to solve a specific task and/or the task is known beforehand. If the task is not known a priori, or is likely to be reconfigured often, then retraining a new policy from scratch can be very expensive and wasteful of resources. For multipurpose systems deployed in contexts where they will likely be required to perform a large range of tasks, investing significant resources into training a high-performance general goal-based controller beforehand makes sense. We propose an approach for training a universal goal-achievement policy, that is, a policy able to attain any arbitrary state the system can take. Goal-conditioned RL is such a setting, where a single policy function can be prompted to aim for a particular goal state (Kaelbling, 1993; Schaul et al., 2015). One important issue with goal-conditioned RL is that the goals that are useful for training the policy are generally unknown, and even in the case of humans this is a key part of learning general controllers (Schulz, 2012; Smith and Gasser, 2005). Several approaches exist in the literature. Adversarial methods build a goal-based curriculum (Mendonca et al., 2021; Eysenbach et al., 2018; OpenAI et al., 2021; Florensa et al., 2018) through various ad hoc two-player games. Other recent approaches (Kamienny et al., 2021; Campos et al., 2020) explicitly optimize for uniform state coverage with the goal of learning a general goal-conditioned policy, but are still tied to learning a policy function to actually implement the exploration strategy in the environment.
Although not explicitly designed for goal-based learning, many reward-free RL (Laskin et al., 2021) approaches aim to learn policies that provide good state coverage (Bellemare et al., 2016; Ostrovski et al., 2017; Burda et al., 2018; Houthooft et al., 2016; Badia et al., 2020). Their primary intent, however, is to fine-tune the learnt exploration policy rather than to leverage its state coverage. Our proposed approach, Entropy-Based Conditioned Continuous Control Policy Optimization (C3PO), is based on the hypothesis that disentangling the exploration phase from the policy learning phase can lead to simpler and more robust algorithms. It is composed of two steps:

• Goal Discovery: generating a set of achievable states, as diverse as possible to maximize coverage, while being as uniform as possible to facilitate interpolation.

• Goal-Conditioned Training: leveraging these states to learn to reach arbitrary goals.
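
To make the two steps concrete, the following toy sketch illustrates the quantities involved. It is not the C3PO algorithm itself. The histogram-entropy coverage proxy, the negative-distance reward, and the names `coverage_entropy`, `goal_conditioned_input` and `goal_reward` are all assumptions introduced here for illustration, and the "discovered" states are random samples rather than the output of an exploration algorithm.

```python
import numpy as np


def coverage_entropy(states, bins=10, low=-1.0, high=1.0):
    """Shannon entropy (in nats) of a histogram over a set of states.

    Assumed proxy for coverage: the entropy is maximal, log(bins ** dim),
    when every histogram cell holds the same number of states.
    """
    edges = [np.linspace(low, high, bins + 1)] * states.shape[1]
    counts, _ = np.histogramdd(states, bins=edges)
    p = counts.ravel() / counts.sum()
    p = p[p > 0]  # empty cells contribute nothing to the entropy
    return float(-(p * np.log(p)).sum())


def goal_conditioned_input(observation, goal):
    """Assumed policy input: the observation concatenated with the goal."""
    return np.concatenate([observation, goal])


def goal_reward(state, goal):
    """Assumed reward: negative Euclidean distance to the goal state."""
    return -float(np.linalg.norm(state - goal))


rng = np.random.default_rng(0)

# Stand-ins for the output of step 1 (goal discovery): one set spread
# uniformly over the state space, one concentrated around the origin.
uniform_states = rng.uniform(-1.0, 1.0, size=(5000, 2))
clustered_states = rng.normal(0.0, 0.1, size=(5000, 2)).clip(-1.0, 1.0)

h_uniform = coverage_entropy(uniform_states)
h_clustered = coverage_entropy(clustered_states)

# Step 2 (goal-conditioned training) draws its goals from the discovered set.
goal = uniform_states[rng.integers(len(uniform_states))]
observation = np.zeros(2)
policy_input = goal_conditioned_input(observation, goal)
reward = goal_reward(observation, goal)
```

In this sketch the uniformly spread set scores a higher entropy than the clustered one, which is the sense in which a uniform goal set covers the state space better. A goal-conditioned learner such as the goal-based SAC variant would then be trained on inputs like `policy_input`, with goals drawn from the discovered set.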

