C3PO: LEARNING TO ACHIEVE ARBITRARY GOALS VIA MASSIVELY ENTROPIC PRETRAINING

Abstract

Given a particular embodiment, we propose a novel method (C3PO) that learns policies able to achieve any arbitrary position and pose. Such a policy allows for easier control and can be re-used as a key building block for downstream tasks. The method is two-fold: first, we introduce a novel exploration algorithm that optimizes for uniform coverage and discovers a set of achievable states, and we investigate its ability to attain both high coverage and hard-to-discover states; second, we leverage this set of achievable states as training data for a universal goal-achievement policy, a goal-based SAC variant. We demonstrate the trained policy's performance in achieving a large number of novel states. Finally, we showcase the impact of massive unsupervised training of a goal-achievement policy, demonstrating state-of-the-art pose-based control of the Hopper, Walker, HalfCheetah, Humanoid and Ant embodiments.

1. INTRODUCTION

Reinforcement learning (RL) has shown great results in optimizing for single reward functions (Mnih et al., 2013; Silver et al., 2017; Vinyals et al., 2019), that is, when a controller has to solve a specific task and/or the task is known beforehand. If the task is not known a priori, or is likely to be re-configured often, then re-training a new policy from scratch can be very expensive and wasteful. For multi-purpose systems deployed in contexts where they will likely be required to perform a large range of tasks, it makes sense to invest significant resources beforehand into training a high-performance general goal-based controller. We propose an approach for training a universal goal-achievement policy: a policy able to attain any arbitrary state the system can take.

Goal-conditioned RL is such a setting, where a single policy function can be prompted to aim for a particular goal state (Kaelbling, 1993; Schaul et al., 2015). One important issue with goal-conditioned RL is that the goals that are useful for training the policy are generally unknown; even for humans, discovering such goals is a key part of learning general controllers (Schulz, 2012; Smith and Gasser, 2005). Several approaches exist in the literature. Adversarial methods build out a goal-based curriculum (Mendonca et al., 2021; Eysenbach et al., 2018; OpenAI et al., 2021; Florensa et al., 2018) through various ad-hoc two-player games. Other recent approaches (Kamienny et al., 2021; Campos et al., 2020) explicitly optimize for uniform state coverage with the goal of learning a general goal-conditioned policy, but are still tied to learning a policy function to actually implement the exploration strategy in the environment. Although not explicitly geared towards goal-based learning, many reward-free RL approaches (Laskin et al., 2021) aim to learn policies that provide good state coverage (Bellemare et al., 2016; Ostrovski et al., 2017; Burda et al., 2018; Houthooft et al., 2016; Badia et al., 2020), however primarily with the intent of fine-tuning the learnt exploration policy rather than leveraging its state coverage.

Our proposed approach, Entropy-Based Conditioned Continuous Control Policy Optimization (C3PO), is based on the hypothesis that disentangling the exploration phase from the policy learning phase can lead to simpler and more robust algorithms. It is composed of two steps:

• Goal Discovery: generating a set of achievable states that is as diverse as possible to maximize coverage, while being as uniform as possible to facilitate interpolation.

• Goal-Conditioned Training: leveraging these states to learn to reach arbitrary goals.

To address the goal discovery step in C3PO, we propose the Chronological Greedy Entropy Maximization (ChronoGEM) algorithm, designed to exhaustively explore reachable states, even in complex high-dimensional environments. ChronoGEM does not rely on any form of trained policy and thus does not require any interaction with the environment to learn to explore. Instead, it uses a highly-parallelized random-branching policy to cover the environment, whose branching tree is iteratively re-pruned to maintain uniform leaf coverage. This iterative pruning process leverages learnt density models and inverse sampling to maintain a set of leaf states that are as uniform as possible over the state space, as sketched below.
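To make the branch-and-reprune loop concrete, the following is a minimal sketch of one way it could look; `env_step`, `sample_actions`, and `fit_density` are hypothetical helpers standing in for the batched simulator, the random-branching policy, and the learnt density model, and are not functions from the paper.

```python
import numpy as np

def chronogem_sketch(init_states, env_step, sample_actions, fit_density,
                     horizon, branching=4):
    """Illustrative sketch of ChronoGEM's branch-and-reprune loop.

    Assumed helper signatures (not from the paper):
      env_step(states, actions) -> next_states   (parallel simulator step)
      sample_actions(n)         -> n actions drawn at random
      fit_density(states)       -> model with .prob(states) density estimates
    """
    states = init_states  # N parallel leaf states, shape (N, state_dim)
    n = len(states)
    for _ in range(horizon):
        # Branch: expand every leaf with `branching` random actions.
        expanded = np.repeat(states, branching, axis=0)
        candidates = env_step(expanded, sample_actions(len(expanded)))

        # Re-prune via inverse sampling: fit a density model on the
        # candidate leaves, then keep N of them with probability inversely
        # proportional to their estimated density, which pushes the
        # surviving leaf set toward uniform coverage.
        density = fit_density(candidates).prob(candidates)
        weights = 1.0 / np.maximum(density, 1e-12)
        keep = np.random.choice(len(candidates), size=n, replace=False,
                                p=weights / weights.sum())
        states = candidates[keep]
    return states  # approximately uniform over states reachable in `horizon` steps
```

Because the branching tree is re-pruned at every step rather than at the end, low-density (hard-to-discover) regions retain representatives throughout the rollout instead of being crowded out by easily reached states.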
Training the goal-conditioned policy is then performed by leveraging the uniform states generated by ChronoGEM as a dataset of goals that provides well-distributed coverage of achievable states. We perform two types of experiments to illustrate C3PO's benefits over similar methods. First, we evaluate entropy upper bounds on ChronoGEM's generated state distribution compared to reference exploration methods such as RND (Burda et al., 2018) and SMM (Lee et al., 2019), as described in Section 2.1.2. Second, we compare the full C3PO approach to ablated versions that leverage datasets generated by SMM, RND, and a random walk. We do this by cross-validating goal-conditioned policies across methods: by training a policy on one method's dataset and evaluating its goal-achievement capabilities on the datasets generated by the other methods, we can observe which method gives rise to the most general policy. Through these two empirical studies, we illustrate the superiority of ChronoGEM over RND and SMM. Finally, we investigate C3PO's ability to achieve arbitrary poses in five continuous control environments: Hopper, Walker2d, HalfCheetah, Ant and Humanoid. Videos of the resulting behaviours reaching the goal poses are available as gif files in our supplementary material.
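The cross-validation protocol can be summarized by the short sketch below: each method's dataset is used once for training and once for evaluation, yielding a matrix of goal-achievement scores. `train_policy` and `success_rate` are placeholder helpers for illustration, not functions from the paper.

```python
def cross_evaluate(goal_sets, train_policy, success_rate):
    """Train one goal-conditioned policy per exploration method's dataset,
    then score each policy on every method's goals. `goal_sets` maps a
    method name ('ChronoGEM', 'RND', 'SMM', 'random walk') to its set of
    discovered states; both helpers are assumptions for illustration."""
    policies = {method: train_policy(goals)
                for method, goals in goal_sets.items()}
    # scores[(a, b)]: policy trained on a's goals, evaluated on b's goals.
    return {(a, b): success_rate(policies[a], goal_sets[b])
            for a in goal_sets for b in goal_sets}
```

A policy whose row of the resulting matrix is uniformly high generalizes beyond its own training distribution, which is the property the comparison is designed to surface.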

2. CONDITIONED CONTINUOUS CONTROL POLICY OPTIMIZATION (C3PO)

The optimal universal goal-achievement policy for a given embodiment should allow an agent to achieve any reachable position and pose in the environment as quickly as possible. Learning such a policy necessarily requires a significant amount of exploration and training, both to find and to learn to attain a large enough number of states to generalize across goal states. However, in the context of a simulator, which allows for both parallelization and arbitrary environment resets, covering a massive amount of the state space is feasible. In our case, we consider 2^17 parallel trajectories that are re-sampled at every step to maintain high coverage of the reachable space. Once such large coverage of the state space is achieved, a goal-achievement policy can be learned with a relatively straightforward algorithm that aims at attaining goals from this high-coverage set of states. If the states are sufficiently well distributed, and if the learning algorithm is efficient enough, we can expect the final policy to achieve universal goal-achievement.
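As a rough sketch of the second phase, under stated assumptions: goals are drawn uniformly from the discovered set, and a goal-conditioned SAC learner is updated on transitions toward each sampled goal. The sparse within-epsilon reward and all helper signatures below are illustrative assumptions; the paper specifies only a goal-based SAC variant.

```python
import numpy as np

def train_goal_policy(goal_set, env_reset, env_step, policy, sac_update,
                      distance, num_episodes, horizon, eps=0.1):
    """Minimal sketch of goal-conditioned training on a fixed goal set.
    All helpers (env_reset, env_step, policy, sac_update, distance) and
    the sparse eps-ball reward are assumptions for illustration."""
    for _ in range(num_episodes):
        goal = goal_set[np.random.randint(len(goal_set))]  # uniform goal sampling
        state = env_reset()
        for _ in range(horizon):
            action = policy(state, goal)
            next_state = env_step(state, action)
            # Assumed sparse reward: 1 when within eps of the goal pose.
            reward = float(distance(next_state, goal) < eps)
            sac_update(state, action, reward, next_state, goal)
            state = next_state
```

Because the goal set is (approximately) uniform over reachable states, uniform goal sampling spends training effort evenly across the reachable space rather than concentrating on easy-to-reach poses.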

2.1. MASSIVELY ENTROPIC PRE-TRAINING

As described above, the first step is to discover the set of achievable goals. This collection is key to the effectiveness of the resulting policy: we want it to be as uniform as possible so that no reachable region is neglected. Therefore, without any prior, the ideal set of goals would be uniformly sampled from the manifold of states that are reachable in a given number of steps (T). Since the shape



Figure 1: Left: examples of C3PO (in beige) achieving a pose (in red). Right: Entropy-Weighted Goal Achievement of C3PO vs. other methods, averaged over Walker, Hopper, HalfCheetah and Ant (see Section 3.3).

