LEARN GOAL-CONDITIONED POLICY WITH INTRINSIC MOTIVATION FOR DEEP REINFORCEMENT LEARNING

Anonymous

Abstract

It is important for an agent to learn a widely applicable, general-purpose policy that can achieve diverse goals, including images and text descriptions. For such perceptually-specific goals, a frontier of deep reinforcement learning research is to learn a goal-conditioned policy without hand-crafted rewards. To learn this kind of policy, recent works typically take as the reward the non-parametric distance to a given goal in an explicit embedding space. From a different viewpoint, we propose a novel unsupervised learning approach, named goal-conditioned policy with intrinsic motivation (GPIM), which jointly learns both an abstract-level policy and a goal-conditioned policy. The abstract-level policy is conditioned on a latent variable to optimize a discriminator, and discovers diverse states that are further rendered into perceptually-specific goals for the goal-conditioned policy. The learned discriminator serves as an intrinsic reward function for the goal-conditioned policy, encouraging it to imitate the trajectory induced by the abstract-level policy. Experiments on various robotic tasks demonstrate the effectiveness and efficiency of our proposed GPIM method, which substantially outperforms prior techniques.

1. INTRODUCTION

Reinforcement learning (RL) makes it possible to drive agents to achieve sophisticated goals in complex and uncertain environments, from computer games (Badia et al., 2020; Berner et al., 2019) to real robot control (Lee et al., 2018; Lowrey et al., 2018; Vecerik et al., 2019; Popov et al., 2017), which usually involves learning a specific policy for each individual task using a task-specific reward. However, autonomous agents are expected to persist in the world and to solve diverse tasks. To achieve this, one needs to build a universal reward function and design a mechanism that automatically generates diverse goals for training. Raw sensory inputs such as images have been considered as common goals for agents to practice on and achieve (Watter et al., 2015; Florensa et al., 2019; Nair et al., 2018; 2019), which further exacerbates the challenge of designing autonomous RL agents that can deal with such perceptually-specific inputs.

Previous works make use of a goal-achievement reward function, such as Euclidean distance, as available prior knowledge (Pong et al., 2018). Unfortunately, this kind of measurement in the original observation space is not very effective for visual tasks, since the distance between images does not correspond to a meaningful distance between states (Zhang et al., 2018). Alternatively, the measure function is applied in an embedding space, where the representations of raw sensory inputs are learned by a latent variable model such as a VAE (Higgins et al., 2017b; Nair et al., 2018) or by a contrastive loss (Sermanet et al., 2018; Warde-Farley et al., 2018). We argue that these approaches, which take a prior non-parametric reward function in the original or embedding space, may limit the repertoire of behaviors and impose manual engineering burdens (Pong et al., 2019).
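To make the embedding-space reward concrete, the following is a minimal sketch of the non-parametric reward described above: the negative Euclidean distance between the current observation and the goal observation in a learned latent space. The `encode` function is a placeholder for a trained encoder (e.g., the latent mean of a VAE); here an identity mapping stands in for illustration only.

```python
import numpy as np

def latent_distance_reward(encode, obs, goal_obs):
    """Non-parametric goal-achievement reward: negative Euclidean distance
    between the embeddings of the current and goal observations."""
    z, z_g = encode(obs), encode(goal_obs)
    return -np.linalg.norm(z - z_g)

# Toy stand-in encoder (identity); a real system would use a trained VAE
# or contrastive encoder mapping raw images to a latent vector.
identity = lambda x: np.asarray(x, dtype=float)
r = latent_distance_reward(identity, [0.0, 0.0], [3.0, 4.0])  # -5.0
```

The reward is maximal (zero) exactly when the agent's embedded state coincides with the embedded goal, which is why the quality of the learned embedding directly determines the quality of this reward signal.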
In the absence of any prior knowledge about the measure function, standard unsupervised RL methods learn a latent-conditioned policy through the lens of empowerment (Salge et al., 2014; Eysenbach et al., 2018; Sharma et al., 2019) or a self-consistent trajectory autoencoder (Co-Reyes et al., 2018; Hausman et al., 2018). However, the learned policy is conditioned on latent variables rather than on perceptually-specific goals. Applying these procedures to goal-reaching tasks, for example via parameter initialization or hierarchical RL, requires an external reward function for the new tasks; otherwise the learned latent-conditioned policy cannot be applied directly to user-specified goals.
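As a concrete instance of the empowerment-style objective mentioned above, DIAYN (Eysenbach et al., 2018) rewards a skill-conditioned policy with log q(z|s) - log p(z), where q is a learned discriminator that predicts the latent skill z from the visited state and p(z) is a uniform prior over skills. The sketch below assumes the discriminator outputs raw logits over a discrete set of skills; the names `diversity_reward` and `disc_logits` are illustrative, not from the paper.

```python
import numpy as np

def diversity_reward(disc_logits, z_index, num_skills):
    """DIAYN-style intrinsic reward: log q(z|s) - log p(z), with q given by a
    softmax over discriminator logits and p(z) uniform over num_skills."""
    logits = np.asarray(disc_logits, dtype=float)
    log_q = logits - np.log(np.sum(np.exp(logits)))  # log-softmax
    return log_q[z_index] + np.log(num_skills)       # - log p(z) = log K

# If the discriminator confidently recognizes skill 2, the reward is high;
# if it is at chance (uniform logits), the reward is zero.
r_confident = diversity_reward([0.0, 0.0, 5.0, 0.0], z_index=2, num_skills=4)
r_chance = diversity_reward([0.0, 0.0, 0.0, 0.0], z_index=2, num_skills=4)
```

The policy thus earns reward only by visiting states that make its latent code identifiable, which drives skill diversity but, as noted above, yields a policy conditioned on z rather than on a user-specified perceptual goal.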

