UNIVERSAL VALUE DENSITY ESTIMATION FOR IMITATION LEARNING AND GOAL-CONDITIONED REINFORCEMENT LEARNING

Abstract

This work considers two distinct settings: imitation learning and goal-conditioned reinforcement learning. In either case, effective solutions require the agent to reliably reach a specified state (a goal) or set of states (a demonstration). Drawing a connection between probabilistic long-term dynamics and the desired value function, this work introduces an approach that utilizes recent advances in density estimation to effectively learn to reach a given state. We develop a unified view of the two settings and show that the approach can be applied to both. In goal-conditioned reinforcement learning, we show that it circumvents the problem of sparse rewards while addressing hindsight bias in stochastic domains. In imitation learning, we show that the approach can learn from extremely sparse amounts of expert data and achieves state-of-the-art results on a common benchmark.

1. INTRODUCTION

Effective imitation learning relies on information encoded in the demonstration states. In the past, successful and sample-efficient approaches have attempted to match the distribution of demonstrated states (Ziebart et al., 2008; Ho and Ermon, 2016; Schroecker et al., 2019), reach any state that is part of the demonstrations (Wang et al., 2019; Reddy et al., 2019), or track a reference trajectory to reproduce a specific sequence of states (Peng et al., 2018; Aytar et al., 2018; Pathak et al., 2018). Attempting to reproduce demonstrated states directly allows the agent to exploit structure induced by the environment dynamics and to accurately reproduce expert behavior with only a very small number of demonstrations. Commonly, this is framed as a measure to avoid covariate shift (Ross et al., 2011), but the efficacy of such methods on even sub-sampled trajectories (e.g. Ho and Ermon, 2016) and their ability to learn from observation alone (e.g. Torabi et al., 2018; Schroecker et al., 2019) show benefits beyond the avoidance of accumulating errors. Despite significant progress in the field, reproducing a set of demonstrated states efficiently, accurately, and robustly remains an open research problem.

To address this problem, we may first ask how to reach arbitrary states, a question that has thus far been considered separately in the field of goal-conditioned reinforcement learning. In this paper, we introduce a unified view of goal-conditioned reinforcement learning and imitation learning. We will first address the question of how to reach single states in the former setting and then extend this notion to an imitation learning approach that reproduces distributions of state-action pairs. Despite significant achievements in the field (Schaul et al., 2015; Andrychowicz et al., 2017; Nair et al., 2018; Sahni et al., 2019), learning to achieve arbitrary goals remains an extremely difficult challenge.
In the absence of a suitably shaped reward function, the signal given to the agent can be as sparse as a constant reward if the goal is achieved and 0 otherwise. Hindsight Experience Replay (HER) (Andrychowicz et al., 2017) constitutes an effective and popular heuristic to address this problem; however, the formulation is ad hoc and does not lend itself readily to the probabilistic reasoning required in a distribution-matching imitation learning approach. Furthermore, HER can be shown to suffer from bias in stochastic domains or when applied to some actor-critic algorithms, as we will discuss in Section 3.1. To address this challenge, we introduce Universal Value Density Estimation (UVD). By considering an important subset of goal-conditioned value functions, similar to Andrychowicz et al. (2017), namely those corresponding to reward functions under which an optimal agent reaches a specific state, we observe that the value of a state conditioned on a goal is equivalent to the likelihood of the agent reaching the goal from that state. We use normalizing flows (see Sec. 2.4) to estimate this likelihood from roll-outs and thereby obtain an estimate of the value function. As we will show in Section 5.1, density estimation does not sample goals independently at random and therefore provides a dense learning signal to the agent where temporal-difference learning alone fails due to sparse rewards. This allows us to match the performance of HER in deterministic domains while avoiding bias in stochastic domains and providing the foundation for our imitation learning approach. We will introduce this approach in Section 3.4 and show that it performs as well as the state of the art on the most common benchmark tasks while significantly outperforming this baseline on a simple stochastic variation of the same domain (Sec. 5.1).
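To make the value-as-likelihood connection concrete, the following toy sketch estimates a goal-conditioned value as the discounted probability density of reaching the goal from Monte-Carlo roll-outs. A Gaussian kernel density estimate stands in for the normalizing flow discussed above, and the random-walk dynamics, bandwidth, and all names are illustrative assumptions rather than the paper's implementation.

```python
import math
import random

GAMMA, BANDWIDTH = 0.95, 0.3

def rollout(s0, horizon=50):
    """One-dimensional random-walk dynamics, a placeholder environment."""
    s, traj = s0, []
    for _ in range(horizon):
        s += random.gauss(0.0, 0.5)
        traj.append(s)
    return traj

def gaussian_kernel(x, center):
    """Gaussian kernel, standing in for a learned flow density."""
    return math.exp(-(x - center) ** 2 / (2 * BANDWIDTH ** 2)) / (
        BANDWIDTH * math.sqrt(2 * math.pi))

def value_density(s0, goal, n_rollouts=500):
    """Estimate sum_t (1 - gamma) * gamma^t * p(s_t = goal | s_0, mu),
    i.e. the value of s0 under a reach-the-goal reward, from roll-outs."""
    total = 0.0
    for _ in range(n_rollouts):
        for t, s in enumerate(rollout(s0)):
            total += (1 - GAMMA) * GAMMA ** t * gaussian_kernel(s, goal)
    return total / n_rollouts

random.seed(1)
near, far = value_density(0.0, 0.5), value_density(0.0, 5.0)
```

Goals close to where the policy's trajectories concentrate receive higher estimated density, and hence higher value, without ever observing a sparse goal-achievement reward.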
Returning to the imitation learning setting, we propose to extend UVD to match a distribution of expert states and introduce Value Density Imitation Learning (VDI). Our goal is to design an imitation learning algorithm that is able to learn from minimal amounts of expert data using self-supervised environment interactions only. Like prior methods (see Sec. 2.3), VDI's objective is to match the expert's state-action distribution. We achieve this by sampling expert states that the agent is currently unlikely to visit and using a goal-conditioned value function to guide the agent towards those states. We will show in Sec. 4 that this minimizes the KL divergence between the expert's and the agent's state-action distributions and therefore provides an intuitive and principled imitation learning approach. Note that unlike most prior methods, expert demonstrations are used in VDI to generate goals rather than to train an intermediate network such as a discriminator or reward function. The value function and density estimate are trained using self-supervised roll-outs alone. This makes intermediate networks much less prone to overfitting, and we show in Sec. 5.2 that VDI uses demonstrations significantly more efficiently than the current state of the art on common benchmarks.
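The goal-selection step described above can be sketched in a toy form: weight each demonstrated state inversely to an estimate of the agent's current state density, so that under-visited expert states are preferentially chosen as goals. The kernel density estimate and all names below are illustrative stand-ins, not the paper's actual density model.

```python
import math
import random

BANDWIDTH = 0.5

def kde(x, samples):
    """Gaussian kernel density estimate of the agent's state distribution."""
    total = sum(math.exp(-(x - s) ** 2 / (2 * BANDWIDTH ** 2)) for s in samples)
    return total / (len(samples) * BANDWIDTH * math.sqrt(2 * math.pi))

random.seed(0)
# Agent roll-outs concentrate near 0; the expert also visits states far away.
agent_states = [random.gauss(0.0, 1.0) for _ in range(500)]
expert_states = [0.1, 2.0, 4.0]

# Under-visited expert states get proportionally higher selection weight.
weights = [1.0 / max(kde(s, agent_states), 1e-8) for s in expert_states]
probs = [w / sum(weights) for w in weights]
goal = random.choices(expert_states, weights=probs)[0]
```

Here the expert state at 4.0, which the agent almost never reaches, dominates the sampling distribution; the chosen goal is then handed to the goal-conditioned value function to drive the policy toward it.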

2.1. MARKOV DECISION PROCESSES

Here, we briefly lay out notation while referring the reader to Puterman (2014) for a detailed review. A Markov Decision Process (MDP) defines a set of states $S$, a set of actions $A$, a distribution of initial states $d_0(s)$, Markovian transition dynamics giving the probability (density) $p(s'|s,a)$ of transitioning from state $s$ to $s'$ when taking action $a$, and a reward function $r(s,a)$. In reinforcement learning, we usually try to find a parametric stationary policy $\mu_\theta : S \to A$.¹ An optimal policy maximizes the long-term discounted reward $J^r_\gamma(\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, s_0 \sim d_0, \mu_\theta\right]$ given a discount factor $\gamma$ or, sometimes, the average reward $J^r(\theta) = \int d^{\mu_\theta}(s)\, r(s, \mu_\theta(s))\, \mathrm{d}s$, where the stationary state distribution $d^\mu(s)$ and the stationary state-action distribution $\rho^\mu(s,a)$ are uniquely induced by $\mu$ under mild ergodicity assumptions. Useful concepts to this end are the value function $V^\mu(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, s_0 = s, \mu\right]$ and the Q-function $Q^\mu(s,a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, s_0 = s, a_0 = a, \mu\right]$, which can be used to estimate the policy gradient $\nabla_\theta J_\gamma(\theta)$ (e.g. Sutton et al., 1999; Silver et al., 2014). Finally, we define $p^\mu(s, a \xrightarrow{t} s')$ as the probability of transitioning from state $s$ to $s'$ after $t$ steps when taking action $a$ in state $s$ and following policy $\mu$ afterwards.
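The discounted objective and value function defined above can be illustrated on a toy two-state chain MDP, where $V^\mu(s)$ is estimated as the average of Monte-Carlo samples of $\sum_t \gamma^t r(s_t, a_t)$. The transition tables and helper names below are illustrative, not from the paper.

```python
import random

GAMMA = 0.9

# Transition table: P[state][action] -> list of (next_state, probability).
# State 0 moves to the absorbing state 1 with probability 0.5 per step.
P = {0: {0: [(0, 0.5), (1, 0.5)]}, 1: {0: [(1, 1.0)]}}
R = {0: {0: 0.0}, 1: {0: 1.0}}   # reward r(s, a)
policy = lambda s: 0             # a fixed deterministic policy mu

def step(s, a):
    """Sample s' ~ p(.|s, a)."""
    nxt, probs = zip(*P[s][a])
    return random.choices(nxt, weights=probs)[0]

def discounted_return(s0, horizon=200):
    """One Monte-Carlo sample of sum_t gamma^t r(s_t, a_t) from s_0."""
    s, ret, disc = s0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        ret += disc * R[s][a]
        disc *= GAMMA
        s = step(s, a)
    return ret

# V^mu(0) is the expectation of discounted_return(0) over roll-outs;
# analytically it solves V(0) = 0.9 * (0.5 * V(0) + 0.5 * 10) ~= 8.18.
random.seed(0)
v0 = sum(discounted_return(0) for _ in range(2000)) / 2000
```

The same sampler, started from a fixed $(s_0, a_0)$ pair, would estimate $Q^\mu(s,a)$ instead.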

2.2. GOAL-CONDITIONED REINFORCEMENT LEARNING

Goal-conditioned reinforcement learning aims to teach an agent to solve multiple variations of a task, identified by a goal vector $g$ and the corresponding reward function $r_g(s,a)$. Solving each variation requires the agent to learn a goal-conditioned policy, which we write as $\mu^g_\theta(s)$. Here, we condition the reward and policy explicitly to emphasize that they can be conditioned on different goals. The goal-conditioned value function (GCVF) in this setting is then given by $Q^{\mu^g}_{r_g}(s,a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_g(s_t, a_t) \,\middle|\, s_0 = s, a_0 = a, \mu^g\right]$. To solve such tasks, Schaul et al. (2015) introduce the concept of a Universal Value Function Approximator (UVFA), a learned model $Q_\omega(s,a;g)$ approximating $Q^{\mu^g}_{r_g}(s,a)$, i.e. all value functions where the policy and reward are conditioned on the same goal. We extend this notion to models of value functions which use a goal-conditioned reward with a single fixed policy. To distinguish such models visually, we write $\tilde{Q}_\omega(s,a;g)$ to refer to models which approximate $Q^\mu_{r_g}(s,a)$ for a given $\mu$.
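Structurally, a UVFA is simply a single function approximator that receives the goal as an additional input, so one set of weights generalizes across goals. The tiny numpy forward pass below is an illustrative sketch; all layer sizes and names are assumptions, not the architecture used by Schaul et al. (2015).

```python
import numpy as np

rng = np.random.default_rng(0)
S_DIM, A_DIM, G_DIM, HID = 4, 2, 4, 32

# A two-layer MLP; in practice these weights are trained by TD learning.
W1 = rng.normal(0.0, 0.1, (S_DIM + A_DIM + G_DIM, HID))
b1 = np.zeros(HID)
W2 = rng.normal(0.0, 0.1, (HID, 1))
b2 = np.zeros(1)

def q_omega(s, a, g):
    """Q_omega(s, a; g): the goal g is concatenated to the state and
    action, so the same network evaluates any goal-conditioned value."""
    x = np.concatenate([s, a, g])
    h = np.tanh(x @ W1 + b1)
    return float(h @ W2 + b2)

q = q_omega(np.zeros(S_DIM), np.zeros(A_DIM), np.ones(G_DIM))
```

The fixed-policy variant described above has the same interface; only the target it is regressed against changes, since the roll-out policy no longer depends on the goal in the reward.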



¹ We write the policy as a deterministic function, but all findings hold for stochastic policies as well.

