LEARN GOAL-CONDITIONED POLICY WITH INTRINSIC MOTIVATION FOR DEEP REINFORCEMENT LEARNING

Anonymous

Abstract

It is of great significance for an agent to learn a widely applicable and general-purpose policy that can achieve diverse goals, including images and text descriptions. For such perceptually-specific goals, a frontier of deep reinforcement learning research is to learn a goal-conditioned policy without hand-crafted rewards. To learn this kind of policy, recent works usually take as the reward the nonparametric distance to a given goal in an explicit embedding space. From a different viewpoint, we propose a novel unsupervised learning approach named goal-conditioned policy with intrinsic motivation (GPIM), which jointly learns both an abstract-level policy and a goal-conditioned policy. The abstract-level policy is conditioned on a latent variable to optimize a discriminator, and it discovers diverse states that are further rendered into perceptually-specific goals for the goal-conditioned policy. The learned discriminator serves as an intrinsic reward function for the goal-conditioned policy to imitate the trajectory induced by the abstract-level policy. Experiments on various robotic tasks demonstrate the effectiveness and efficiency of our proposed GPIM method, which substantially outperforms prior techniques.

1. INTRODUCTION

Reinforcement learning (RL) makes it possible to drive agents to achieve sophisticated goals in complex and uncertain environments, from computer games (Badia et al., 2020; Berner et al., 2019) to real robot control (Lee et al., 2018; Lowrey et al., 2018; Vecerik et al., 2019; Popov et al., 2017), which usually involves learning a specific policy for each individual task relying on a task-specific reward. However, autonomous agents are expected to exist persistently in the world and to have the ability to solve diverse tasks. To achieve this, one needs to build a universal reward function and design a mechanism to automatically generate diverse goals for training. Raw sensory inputs such as images have been considered as common goals for agents to practice on and achieve (Watter et al., 2015; Florensa et al., 2019; Nair et al., 2018; 2019), which further exacerbates the challenge of designing autonomous RL agents that can deal with such perceptually-specific inputs. Previous works make full use of a goal-achievement reward function, such as the Euclidean distance, as available prior knowledge (Pong et al., 2018). Unfortunately, this kind of measurement in the original space is not very effective for visual tasks, since the distance between images does not correspond to a meaningful distance between states (Zhang et al., 2018). Alternatively, the measure function can be applied in an embedding space, where the representations of raw sensory inputs are learned with a latent variable model such as a VAE (Higgins et al., 2017b; Nair et al., 2018) or with a contrastive loss (Sermanet et al., 2018; Warde-Farley et al., 2018). We argue that these approaches, which take a prior nonparametric reward function in the original or embedding space, may limit the repertoire of behaviors and impose manual engineering burdens (Pong et al., 2019).
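For concreteness, the prior-reward baselines discussed above compute a nonparametric goal-achievement reward such as the negative Euclidean distance between the current observation and the goal, either in raw pixel space or in a learned embedding space. A minimal sketch (the `encode` argument is a placeholder for a learned encoder, not any specific prior method's network):

```python
import numpy as np

def distance_reward(obs, goal, encode=lambda x: x):
    """Nonparametric goal-achievement reward: r = -||e(obs) - e(goal)||_2.

    With encode = identity this is the raw-space distance criticized above;
    passing a learned encoder (e.g. the mean network of a VAE) gives the
    embedding-space variant used by prior work.
    """
    return -np.linalg.norm(encode(obs) - encode(goal))

obs, goal = np.array([0.0, 0.0]), np.array([3.0, 4.0])
assert distance_reward(goal, goal) == 0.0   # reward is maximal at the goal
assert distance_reward(obs, goal) == -5.0   # 3-4-5 right triangle
```

The reward is dense and requires no learning when applied in raw space, but, as noted above, pixel-space distances rarely track meaningful state distances.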
In the absence of any prior knowledge about the measure function, standard unsupervised RL methods learn a latent-conditioned policy through the lens of empowerment (Salge et al., 2014; Eysenbach et al., 2018; Sharma et al., 2019) or the self-consistent trajectory autoencoder (Co-Reyes et al., 2018; Hausman et al., 2018). However, the learned policy is conditioned on the latent variables rather than on perceptually-specific goals. Applying these procedures to goal-reaching tasks, similar to parameter initialization or hierarchical RL, needs an external reward function for the new tasks; otherwise the learned latent-conditioned policy cannot be applied directly to user-specified goals. Different from previous works, a novel unsupervised RL scheme is proposed in this paper to learn a goal-conditioned policy by jointly learning an extra abstract-level policy conditioned on latent variables. The abstract-level policy is trained to generate diverse abstract skills, while the goal-conditioned policy is trained to efficiently achieve perceptually-specific goals that are rendered from the states induced by the corresponding abstract skills. Specifically, we optimize a discriminator in an unsupervised manner for the purpose of reliable exploration (Salge et al., 2014), providing the intrinsic reward for the abstract-level policy. The learned discriminator then serves as an intrinsic reward function for the goal-conditioned policy to imitate the trajectory induced by the abstract-level policy. In essence, the abstract-level policy can reproducibly influence the environment, and the goal-conditioned policy perceptibly imitates these influences. To improve the generalization ability of the goal-conditioned policy in dealing with perceptually-specific inputs, a latent variable model is further incorporated into the goal-conditioned policy to disentangle goals into latent generative factors. The main contribution of our work is an unsupervised RL method that learns a perceptually-specific goal-conditioned policy without a prior reward function for autonomous agents.
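The joint procedure described above can be sketched as a toy loop; everything here (the 1-D chain dynamics, the one-hot rendering, the count-based discriminator, the random abstract-level policy) is an illustrative placeholder under stated assumptions, not the authors' implementation. The abstract-level policy collects states rewarded by the discriminator, those states are rendered into perceptual goals, and the goal-conditioned policy would then be trained to imitate the trajectory under the same discriminator-based reward:

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_SKILLS, NUM_STATES, HORIZON = 4, 10, 8

def step(s, a):
    # Toy 1-D chain dynamics: the action moves the state left or right
    return int(np.clip(s + a, 0, NUM_STATES - 1))

def render(s):
    # Render an abstract state into a "perceptual" goal; a one-hot vector
    # stands in for an image or text description here
    g = np.zeros(NUM_STATES)
    g[s] = 1.0
    return g

# Laplace-smoothed visit counts stand in for the learned discriminator q_phi:
# q_phi(omega | s) is estimated as the normalized count of skill omega at state s
counts = np.ones((NUM_STATES, NUM_SKILLS))

def intrinsic_reward(s, omega):
    q = counts[s] / counts[s].sum()               # q_phi(omega | s)
    return np.log(q[omega]) + np.log(NUM_SKILLS)  # log q - log p, uniform prior

def rollout_abstract(omega):
    """Abstract-level policy (random actions here): collect states and rewards."""
    s, states, rewards = 0, [], []
    for _ in range(HORIZON):
        s = step(s, rng.choice([-1, 1]))
        counts[s, omega] += 1                     # "train" the discriminator on (s, omega)
        states.append(s)
        rewards.append(intrinsic_reward(s, omega))
    return states, rewards

omega = 2
states, rewards = rollout_abstract(omega)
goals = [render(s) for s in states]               # perceptually-specific goals for pi_theta
# The goal-conditioned policy would now be trained on these goals, receiving the
# same discriminator-based reward along its own trajectory (omitted here).
assert len(goals) == HORIZON and all(g.sum() == 1.0 for g in goals)
```

The point of the sketch is the data flow: the discriminator is optimized only against the abstract-level rollouts, yet it supplies the reward for both policies, so no hand-crafted measure between observations and goals is ever needed.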
We propose a novel training procedure for this model, which provides a universal and effective reward function for various perceptual goals, e.g., images and text descriptions. Furthermore, we introduce a latent variable model for learning the representations of high-dimensional goals, and demonstrate the potential of our model to generalize behaviors across new tasks. Extensive experiments and detailed analysis demonstrate the effectiveness and efficiency of our proposed method.

2. PRELIMINARIES

RL: An agent in RL interacts with an environment and selects actions so as to maximize the expected amount of reward received in the long run (Sutton & Barto, 2018), which can be modeled as a Markov decision process (MDP) (Puterman, 2014). An MDP is defined as a tuple M = (S, A, p, R, γ), where S and A are the state and action spaces, p(·|s, a) gives the next-state distribution upon taking action a in state s, R(s, a, s′) is a random variable representing the reward received at the transition from s to s′ under action a, and γ ∈ [0, 1) is a discount factor.

Intrinsic Motivation: RL with intrinsic motivation obtains the intrinsic reward by maximizing the mutual information between latent variables ω and the agent's behaviors τ: I(ω; τ), where the specific manifestation of τ can be an entire trajectory (Achiam et al., 2018), an individual state (Eysenbach et al., 2018), or a final state (Gregor et al., 2016); the specific implementation includes reverse and forward forms (Campos et al., 2020). Please refer to Aubret et al. (2019) for more details.

Disentanglement: Given an observation x of dimension N, a VAE (Kingma & Welling, 2013; Higgins et al., 2017a) is a latent variable model that pairs a bottom-up encoder q(z|x) with a top-down decoder network p(x|z) by introducing a latent factor z, where dim(z) < N. To encourage the inferred latent factor z to capture the generative factors of x in a disentangled manner, an isotropic unit Gaussian prior p(z) = N(0, I) is commonly used to control the capacity of the information bottleneck (Burgess et al., 2018) by minimizing the KL divergence between q(z|x) and p(z).
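The KL term in the disentanglement preliminary has a closed form when the posterior is a diagonal Gaussian q(z|x) = N(μ, diag(σ²)) and the prior is the unit Gaussian N(0, I). A minimal numpy sketch, with the β weight of β-VAE-style objectives included; the function names and the squared-error reconstruction term are illustrative choices, not part of the cited methods:

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims.

    Closed form per dimension: 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2).
    """
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def beta_vae_loss(x, x_recon, mu, log_var, beta=4.0):
    # Reconstruction term plus the beta-weighted information-bottleneck term
    recon = np.sum((x - x_recon) ** 2)
    return recon + beta * gaussian_kl(mu, log_var)

# When the posterior matches the prior exactly, the KL term vanishes
assert gaussian_kl(np.zeros(3), np.zeros(3)) == 0.0
```

Raising β tightens the bottleneck, which is the mechanism Burgess et al. (2018) associate with disentangled factors.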

3. THE METHOD

In this section, we first formalize the problem and introduce the framework. Second, we elaborate on how to jointly learn the goal-conditioned policy and the abstract-level policy. Third, disentanglement is applied in our setting to improve the generalization ability.

3.1. OVERVIEW

Given a perceptually-specific goal g, our objective is to learn a goal-conditioned policy π_θ(a|s, g) that takes as input state s and goal g and outputs action a, as shown in Fig. 1. The abstract-level policy π_μ(a|s, ω) takes as input state s and a latent variable ω and outputs action a, where ω corresponds to diverse latent skills. The discriminator q_φ is first trained at the abstract level for reliable exploration; it then provides the reward signal for the goal-conditioned policy to imitate the trajectory induced by the abstract-level policy. The abstract-level policy is able to generate diverse states s that are further rendered into diverse perceptually-specific goals g = Render(s). On this basis, π_θ(a|s, g)




