UNIVERSAL VALUE DENSITY ESTIMATION FOR IMITATION LEARNING AND GOAL-CONDITIONED REINFORCEMENT LEARNING

Abstract

This work considers two distinct settings: imitation learning and goal-conditioned reinforcement learning. In either case, effective solutions require the agent to reliably reach a specified state (a goal) or set of states (a demonstration). Drawing a connection between probabilistic long-term dynamics and the desired value function, this work introduces an approach that utilizes recent advances in density estimation to effectively learn to reach a given state. We develop a unified view of the two settings and show that the approach can be applied to both. In goal-conditioned reinforcement learning, we show that it circumvents the problem of sparse rewards while addressing hindsight bias in stochastic domains. In imitation learning, we show that the approach can learn from extremely small amounts of expert data and achieves state-of-the-art results on a common benchmark.

1. INTRODUCTION

Effective imitation learning relies on information encoded in the demonstration states. In the past, successful and sample-efficient approaches have attempted to match the distribution of demonstrated states (Ziebart et al., 2008; Ho and Ermon, 2016; Schroecker et al., 2019), reach any state that is part of the demonstrations (Wang et al., 2019; Reddy et al., 2019), or track a reference trajectory to reproduce a specific sequence of states (Peng et al., 2018; Aytar et al., 2018; Pathak et al., 2018). Attempting to reproduce demonstrated states directly allows the agent to exploit structure induced by the environment dynamics and to accurately reproduce expert behavior from only a very small number of demonstrations. Commonly, this is framed as a measure to avoid covariate shift (Ross et al., 2011), but the efficacy of such methods even on sub-sampled trajectories (e.g. Ho and Ermon, 2016) and their ability to learn from observation alone (e.g. Torabi et al., 2018; Schroecker et al., 2019) show benefits beyond the avoidance of accumulating errors. Despite significant progress in the field, reproducing a set of demonstrated states efficiently, accurately, and robustly remains an open research problem.

To address this problem, we may first ask how to reach arbitrary states, a question that has thus far been considered separately in the field of goal-conditioned reinforcement learning. In this paper, we introduce a unified view of goal-conditioned reinforcement learning and imitation learning. We first address the question of how to reach single states in the former setting and then extend this notion to an imitation learning approach that reproduces distributions of state-action pairs. Despite significant achievements in the field (Schaul et al., 2015; Andrychowicz et al., 2017; Nair et al., 2018; Sahni et al., 2019), learning to achieve arbitrary goals remains an extremely difficult challenge.
In the absence of a suitably shaped reward function, the signal given to the agent can be as sparse as a constant reward when the goal is achieved and 0 otherwise. Hindsight Experience Replay (HER) (Andrychowicz et al., 2017) constitutes an effective and popular heuristic to address this problem; however, the formulation is ad hoc and does not lend itself readily to the probabilistic reasoning required by a distribution-matching imitation learning approach. Furthermore, as we will discuss in Section 3.1, HER can be shown to suffer from bias in stochastic domains or when applied to some actor-critic algorithms. To address this challenge, we introduce Universal Value Density Estimation (UVD). By considering an important subset of goal-conditioned value functions, similar to Andrychowicz et al. (2017), namely those corresponding to reward functions under which an optimal agent reaches a specific state, we observe that the value of a state conditioned on a goal is equivalent to the likelihood of the agent reaching the goal from that state. We use normalizing flows
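The equivalence between goal-conditioned values and reaching probabilities can be sketched as follows. This is an illustrative formalization only; the notation (an indicator reward $r_g$, discount $\gamma$, and visitation probabilities $p^\pi$) is assumed here for exposition and the paper's precise definitions may differ:

```latex
% Sketch: with the goal-reaching reward r_g(s_t) = 1[s_t = g],
% the goal-conditioned value is a discounted visitation probability.
\begin{align*}
V^\pi(s; g)
  &= \mathbb{E}_\pi\!\left[\,\sum_{t=0}^{\infty} \gamma^t \,
     \mathbb{1}[s_t = g] \;\middle|\; s_0 = s\right] \\
  &= \sum_{t=0}^{\infty} \gamma^t \, p^\pi(s_t = g \mid s_0 = s).
\end{align*}
```

In other words, under this reward the value of $s$ conditioned on $g$ is (up to normalization) the discounted probability of the agent visiting $g$ when starting from $s$; in continuous state spaces the sum of probabilities becomes a discounted density over future states, which is what motivates estimating the value function with a density model.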

